# Sentiment

## Part I: VADER

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}
Insights: VADER correctly identifies it as positive, as it is a very simple sentence. 

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}
Insights: VADER correctly identifies it as negative, as it detects the negating "don't" before "love".

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}
Insights: VADER correctly identifies it as positive, this time with a higher pos score than for sentence 1 because of the emoticon ":-)", which VADER detects.

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}
Insights: The neutral score is the highest, although the neg score is nearly as high; the compound score indicates that the sentence carries negative sentiment.

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}
Insights: VADER correctly classifies the sentence via the normalized compound score. Again, it correctly identified the negating word ("not") before the part of the sentence that carries the sentiment ("ruins").

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}
Insights: Judged only by the neg, neu and pos scores, VADER correctly identifies the sentence as mostly neutral. However, VADER's lexicon does not capture the different senses of the word "lies": the word is treated as negative, which gives the sentence an incorrectly negative compound score.

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
Insights: VADER correctly gives neu the highest score; however, the compound score reflects a positive sentiment, probably due to the word "like".

We can conclude that VADER handles most of these examples well; however, these sentences carry few nuances compared to everyday language.

### Collecting 50 tweets for evaluation


In [4]:
import json

In [5]:
# open the file in a context manager so the handle is closed after loading
with open('my_tweets.json') as infile:
    my_tweets = json.load(infile)

In [6]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'neg', 'text_of_tweet': "Me, ready to go at supermarket during the #COVID19 outbreak. Not because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...", 'tweet_url': 'https://t.co/usmuaLq72n'}


In [13]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
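
The mapping above treats only a compound score of exactly 0.0 as neutral. The VADER project's README describes typical thresholds of ±0.05 around zero instead. A minimal sketch of that variant follows; the function name `vader_output_to_label_banded` and the `threshold` parameter are illustrative and not part of this notebook.

```python
def vader_output_to_label_banded(vader_output, threshold=0.05):
    """Map a VADER output dict to a label, with a neutral band around zero.

    threshold is an assumed default; tune it on development data.
    """
    compound = vader_output['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


print(vader_output_to_label_banded({'compound': 0.01}))
print(vader_output_to_label_banded({'compound': -0.4215}))
```

With a band, weakly scored sentences are labelled neutral rather than being pushed to one of the polar classes by a single low-weight lexicon word.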

In [9]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report
from sklearn import metrics

In [10]:
vader_model = SentimentIntensityAnalyzer()

In [20]:
import spacy
nlp = spacy.load('en_core_web_sm') # 'en_core_web_sm'

In [11]:
def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a sentence from spacy
    
    :param str textual_unit: a textual unit, e.g., sentence, sentences (one string);
    its sentences are processed by looping over doc.sents
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        print('INPUT SENTENCE', textual_unit)  # whole input, not only the last sentence
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [21]:
from sklearn.metrics import classification_report

tweets = []
all_vader_output = []
gold = []

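# settings (to change for different experiments)
to_lemmatize = False  # this run passes the original tokens to VADER
pos = set()  # empty set: all parts of speech are considered
for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(the_tweet,
                             lemmatize=to_lemmatize,
                             parts_of_speech_to_consider=pos) # run vader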
    vader_label = vader_output_to_label(vader_output) # convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
for i in range(len(all_vader_output)):
    if all_vader_output[i] == 'neutral':
        all_vader_output[i] = 'neg'
    elif all_vader_output[i] == 'positive':
        all_vader_output[i] = 'pos'
    else:
        all_vader_output[i] = 'neu'

print("GOLD LABELS:", gold)
print("VADER LABELS:",all_vader_output)

# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output, digits=3))



GOLD LABELS: ['neg', 'neu', 'neu', 'pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neu', 'neg', 'neu', 'pos', 'neu', 'neg', 'pos', 'neu', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'neg', 'neg', 'neu', 'neg', 'neu', 'neu', 'pos', 'neu', 'neg', 'pos', 'pos', 'neg', 'neg', 'neu', 'neu', 'neg', 'neg', 'pos', 'pos', 'neu', 'neg']
VADER LABELS: ['pos', 'neg', 'neg', 'pos', 'neu', 'neu', 'pos', 'neu', 'neu', 'neu', 'pos', 'neu', 'pos', 'neu', 'neu', 'pos', 'neu', 'neu', 'pos', 'pos', 'neu', 'pos', 'neg', 'pos', 'neu', 'neu', 'pos', 'pos', 'neu', 'neu', 'pos', 'neu', 'pos', 'pos', 'neu', 'pos', 'neu', 'neu', 'pos', 'neg', 'neu', 'neu', 'pos', 'pos', 'pos', 'neu', 'pos', 'pos', 'neg', 'neu']
              precision    recall  f1-score   support

         neg      0.000     0.000     0.000        24
         neu      0.174     0.308     0.222        13
         pos      0.500     0.846     0.629        13

    accuracy                          0.300        50
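
In the conversion loop above, 'neutral' is rewritten to 'neg' and the `else` branch (VADER's 'negative') to 'neu'. This looks like the two branches were swapped, which would be consistent with the zero precision and recall for 'neg' in the report. A minimal sketch of an explicit mapping, assuming the gold labels are 'neg', 'neu' and 'pos'; the names `LONG_TO_SHORT` and `to_gold_label_names` are illustrative.

```python
# Explicit lookup from VADER-style long labels to the gold label names.
LONG_TO_SHORT = {'negative': 'neg', 'neutral': 'neu', 'positive': 'pos'}


def to_gold_label_names(vader_labels):
    """Translate a list of long labels; raises KeyError on an unknown label."""
    return [LONG_TO_SHORT[label] for label in vader_labels]


print(to_gold_label_names(['negative', 'neutral', 'positive']))
```

A dictionary lookup fails loudly on an unexpected label instead of silently falling into an `else` branch.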

In [15]:
wrongly_classified_positive_indexes = [14, 39]
wrongly_classified_negative_indexes = [0, 4, 5, 6, 7, 8, 11, 13, 49, 45]
wrongly_classified_neutrals_indexes = [1, 2, 15, 19, 22, 33, 48, 43, 42]

#negatives
for i in range(len(gold)):
    if i in wrongly_classified_negative_indexes:
        print("wrongly classified NEGATIVE tweet:", gold[i], all_vader_output[i], i, tweets[i])

#neutrals
for i in range(len(gold)):
    if i in wrongly_classified_neutrals_indexes:
        print("wrongly classified NEUTRAL tweet:", gold[i], all_vader_output[i], i, tweets[i])

#positives
for i in range(len(gold)):
    if i in wrongly_classified_positive_indexes:
        print("wrongly classified POSITIVE tweet:", gold[i], all_vader_output[i], i, tweets[i])




wrongly classified NEGATIVE tweet: neg pos 0 Me, ready to go at supermarket during the #COVID19 outbreak. Not because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...
wrongly classified NEGATIVE tweet: neg neu 4 #COVID19 makes us dumber. Large study shows cognitive impairment (memory, reasoning, executive function) in those who’ve been infected. Mult doses of #vaccine associated w/ LESS cognitive impairment & mult infections w/ MORE cognitive impairment. 
wrongly classified NEGATIVE tweet: neg neu 5 If you know you’re sick with *something* but it’s “NOT COVID” you should still wear a mask. Nasty!
wrongly classified NEGATIVE tweet: neg pos 6 1/2 the world volunteered to shorten their lives so they could protect themselves from a disease that was over 97% survivable. (Same rate as the flu)
wrongly classified NEGATIVE tweet: neg neu 7 Does anyone else deal with anger? I know it is normal to feel
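
The index lists in the cell above are hard-coded. They could also be derived from the `gold` and `all_vader_output` lists. A small sketch with toy label lists standing in for the real ones; `misclassified_by_gold_label` is a hypothetical helper.

```python
from collections import defaultdict


def misclassified_by_gold_label(gold, predicted):
    """Group the indexes of wrong predictions by their gold label."""
    errors = defaultdict(list)
    for index, (gold_label, predicted_label) in enumerate(zip(gold, predicted)):
        if gold_label != predicted_label:
            errors[gold_label].append(index)
    return dict(errors)


# Toy data for illustration only, not the tweets from this notebook.
toy_gold = ['neg', 'neu', 'pos', 'neg']
toy_pred = ['pos', 'neu', 'neu', 'neg']
print(misclassified_by_gold_label(toy_gold, toy_pred))
```

Deriving the indexes this way keeps the error analysis in sync when the labels or the VADER settings change.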

Part a, 

Precision: Precision is low across all categories: 0 for 'neg', 0.174 for 'neu' and 0.500 for 'pos'. The model therefore assigned many tweets to the wrong class, and none of its 'neg' predictions was correct. For 'pos', half of the predictions were correct.

Recall: Recall is low for 'neg' (0) and 'neu' (0.308) but high for 'pos' (0.846). The model missed every 'neg' tweet and most 'neu' tweets, yet found most of the 'pos' tweets. Combined with a precision of only 0.500 for 'pos', this means the system predicts 'pos' too readily: it catches most positive tweets but also labels many other tweets as positive.

F1-score: The F1-score is the harmonic mean of precision and recall, so it balances the two. It is 0 for 'neg' and 0.222 for 'neu'; only 'pos' reaches a moderate 0.629, carried by its high recall.

Support: Support indicates the number of actual instances belonging to each class. Here, 'neg' has the highest support (24), while 'neu' and 'pos' have 13 each.

Accuracy: Accuracy measures the overall correctness of the classifier across all classes. Here it is only 30%, which is below the 48% that always predicting 'neg' would achieve on this set, so the model's performance is dreadful :(
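
To make the metric definitions concrete, here is a small pure-Python sketch that computes precision, recall and F1 for one class from two label lists. `per_class_scores` is a hypothetical helper and the label lists are toy data; the last line only re-derives the F1 for 'pos' from the precision and recall printed in the report above.

```python
def per_class_scores(gold, predicted, label):
    """Return (precision, recall, f1) for one class label."""
    true_positives = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_as_label = sum(1 for p in predicted if p == label)
    actually_label = sum(1 for g in gold if g == label)
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    recall = true_positives / actually_label if actually_label else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Toy data for illustration only.
toy_gold = ['pos', 'pos', 'neg', 'neg', 'neu']
toy_pred = ['pos', 'neg', 'pos', 'pos', 'neu']
print(per_class_scores(toy_gold, toy_pred, 'pos'))

# F1 for 'pos' from the reported precision (0.500) and recall (0.846)
print(round(2 * 0.500 * 0.846 / (0.500 + 0.846), 3))
```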

Part b,

**Misclassified negatives:**  

1, wrongly classified NEGATIVE tweet: neg pos 0 Me, ready to go at supermarket during the #COVID19 outbreak. Not because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...

2, wrongly classified NEGATIVE tweet: neg neu 4 #COVID19 makes us dumber. Large study shows cognitive impairment (memory, reasoning, executive function) in those who’ve been infected. Mult doses of #vaccine associated w/ LESS cognitive impairment & mult infections w/ MORE cognitive impairment.


3, wrongly classified NEGATIVE tweet: neg neu 5 If you know you’re sick with *something* but it’s “NOT COVID” you should still wear a mask. Nasty!



**Misclassified neutrals:**

1, wrongly classified NEUTRAL tweet: neu neg 1 COVID-19 restrictions sparking a run on cannabis stores. They're not closed yet! But Customers are stocking up on cannabis this weekend, preparing for what could be more retail store restrictions in coming days.


2, wrongly classified NEUTRAL tweet: neu pos 15 The grocery store was PACKED at 6am this morning, in west Henrico VA. #VA07 Early Saturday, is when I shop in relative peace. Not THIS morning.


3, wrongly classified NEUTRAL tweet: neu neg 22 ‘#Australia had some of the most stringent border checks you could get and this time last year despite all the testing, the fact few people could go in, the #Omicron variant got in’
 

**Misclassified positives:**

1, wrongly classified POSITIVE tweet: pos neu 14 My local corner shop topped up prices on everything. A typical 20 basket cost 33 today. I argued the price increase. He stated he won't survive against the main supermarkets. I willingly paid the 33 and rounded up to 35. I'd be lost without him.

2, wrongly classified POSITIVE tweet: pos neg 39 #Israel has vaccinated 30% of all population over 60. At present pace of over 100,000 inoculated daily,  Israel will vaccinate all remaining 60+ citizens in approx 10 days.

    


### VADER on the set of airline tweets 

In [26]:
import pathlib
import os

In [25]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)    

path: /home/aronf/Desktop/text-mining/code/ba-text-mining/lab_sessions/lab3/airlinetweets


In [24]:
def get_content(sentiment):
    directory = f'{airline_tweets_folder}/{sentiment}/'
    # List to store the text content of all files
    text_content = []

    # Loop through each file in the directory
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            text_content.append(file.read())
    return text_content


In [44]:
def vader_analyse_1(sent, content):
    tweets = []
    all_vader_output = []
    gold = []

    # settings (to change for different experiments)
    to_lemmatize = True 
    pos = set()
    gold = [sent for i in range(len(content))]

    for i in range(len(content)):
        the_tweet = content[i]
        vader_output = run_vader(the_tweet) # run vader
        vader_label = vader_output_to_label(vader_output) # convert vader output to category
        
        tweets.append(the_tweet)
        all_vader_output.append(vader_label)
        
        
        
    for i in range(len(all_vader_output)):
        if all_vader_output[i] == 'neutral':
            all_vader_output[i] = 'neg'
        elif all_vader_output[i] == 'positive':
            all_vader_output[i] = 'pos'
        else:
            all_vader_output[i] = 'neu'

    # print("GOLD LABELS:", gold)
    # print("VADER LABELS:",all_vader_output)

    return gold, all_vader_output

In [45]:
pos_gold, pos_all_vader_output = vader_analyse_1('pos', get_content('positive'))
neu_gold, neu_all_vader_output = vader_analyse_1('neu', get_content('neutral'))
neg_gold, neg_all_vader_output = vader_analyse_1('neg', get_content('negative'))


gold = pos_gold + neu_gold + neg_gold

all_vader_output = pos_all_vader_output + neu_all_vader_output + neg_all_vader_output

In [46]:
print("BASE")
# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output))

BASE
              precision    recall  f1-score   support

         neg       0.29      0.21      0.25      1750
         neu       0.17      0.12      0.14      1515
         pos       0.56      0.88      0.68      1490

    accuracy                           0.39      4755
   macro avg       0.34      0.41      0.36      4755
weighted avg       0.34      0.39      0.35      4755



In [47]:
def vader_analyse_2(sent, content):
    tweets = []
    all_vader_output = []
    gold = []

    # settings (to change for different experiments)
    to_lemmatize = True 
    pos = set()
    gold = [sent for i in range(len(content))]

    for i in range(len(content)):
        the_tweet = content[i]
        vader_output = run_vader(the_tweet, lemmatize=True) # run vader
        vader_label = vader_output_to_label(vader_output) # convert vader output to category
        
        tweets.append(the_tweet)
        all_vader_output.append(vader_label)
        
        
        
    for i in range(len(all_vader_output)):
        if all_vader_output[i] == 'neutral':
            all_vader_output[i] = 'neg'
        elif all_vader_output[i] == 'positive':
            all_vader_output[i] = 'pos'
        else:
            all_vader_output[i] = 'neu'

    # print("GOLD LABELS:", gold)
    # print("VADER LABELS:",all_vader_output)

    return gold, all_vader_output

In [48]:
pos_gold, pos_all_vader_output = vader_analyse_2('pos', get_content('positive'))
neu_gold, neu_all_vader_output = vader_analyse_2('neu', get_content('neutral'))
neg_gold, neg_all_vader_output = vader_analyse_2('neg', get_content('negative'))


gold = pos_gold + neu_gold + neg_gold

all_vader_output = pos_all_vader_output + neu_all_vader_output + neg_all_vader_output
print("LEMMA")
# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output))

LEMMA
              precision    recall  f1-score   support

         neg       0.30      0.21      0.24      1750
         neu       0.17      0.13      0.15      1515
         pos       0.56      0.88      0.68      1490

    accuracy                           0.40      4755
   macro avg       0.34      0.41      0.36      4755
weighted avg       0.34      0.40      0.35      4755



In [49]:
def vader_analyse_3(sent, content):
    tweets = []
    all_vader_output = []
    gold = []

    # settings (to change for different experiments)
    to_lemmatize = True 
    pos = set()
    gold = [sent for i in range(len(content))]

    for i in range(len(content)):
        the_tweet = content[i]
        vader_output = run_vader(the_tweet, parts_of_speech_to_consider={'ADJ'}) # run vader
        vader_label = vader_output_to_label(vader_output) # convert vader output to category
        
        tweets.append(the_tweet)
        all_vader_output.append(vader_label)
        
        
        
    for i in range(len(all_vader_output)):
        if all_vader_output[i] == 'neutral':
            all_vader_output[i] = 'neg'
        elif all_vader_output[i] == 'positive':
            all_vader_output[i] = 'pos'
        else:
            all_vader_output[i] = 'neu'

    # print("GOLD LABELS:", gold)
    # print("VADER LABELS:",all_vader_output)

    return gold, all_vader_output

In [50]:
pos_gold, pos_all_vader_output = vader_analyse_3('pos', get_content('positive'))
neu_gold, neu_all_vader_output = vader_analyse_3('neu', get_content('neutral'))
neg_gold, neg_all_vader_output = vader_analyse_3('neg', get_content('negative'))


gold = pos_gold + neu_gold + neg_gold

all_vader_output = pos_all_vader_output + neu_all_vader_output + neg_all_vader_output
print("ADJ")
# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output))

ADJ
              precision    recall  f1-score   support

         neg       0.35      0.68      0.46      1750
         neu       0.08      0.02      0.04      1515
         pos       0.66      0.44      0.53      1490

    accuracy                           0.39      4755
   macro avg       0.37      0.38      0.34      4755
weighted avg       0.36      0.39      0.35      4755



In [51]:
def vader_analyse_4(sent, content):
    tweets = []
    all_vader_output = []
    gold = []

    # settings (to change for different experiments)
    to_lemmatize = True 
    pos = set()
    gold = [sent for i in range(len(content))]

    for i in range(len(content)):
        the_tweet = content[i]
        vader_output = run_vader(the_tweet, lemmatize=True, parts_of_speech_to_consider={'ADJ'}) # run vader
        vader_label = vader_output_to_label(vader_output) # convert vader output to category
        
        tweets.append(the_tweet)
        all_vader_output.append(vader_label)
        
        
        
    for i in range(len(all_vader_output)):
        if all_vader_output[i] == 'neutral':
            all_vader_output[i] = 'neg'
        elif all_vader_output[i] == 'positive':
            all_vader_output[i] = 'pos'
        else:
            all_vader_output[i] = 'neu'

    # print("GOLD LABELS:", gold)
    # print("VADER LABELS:",all_vader_output)

    return gold, all_vader_output

In [52]:
pos_gold, pos_all_vader_output = vader_analyse_4('pos', get_content('positive'))
neu_gold, neu_all_vader_output = vader_analyse_4('neu', get_content('neutral'))
neg_gold, neg_all_vader_output = vader_analyse_4('neg', get_content('negative'))


gold = pos_gold + neu_gold + neg_gold

all_vader_output = pos_all_vader_output + neu_all_vader_output + neg_all_vader_output
print("LEMMA+ADJ")
# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output))

LEMMA+ADJ
              precision    recall  f1-score   support

         neg       0.35      0.67      0.46      1750
         neu       0.08      0.02      0.04      1515
         pos       0.66      0.44      0.53      1490

    accuracy                           0.39      4755
   macro avg       0.37      0.38      0.34      4755
weighted avg       0.36      0.39      0.35      4755



In [53]:
def vader_analyse_5(sent, content):
    tweets = []
    all_vader_output = []
    gold = []

    # settings (to change for different experiments)
    to_lemmatize = True 
    pos = set()
    gold = [sent for i in range(len(content))]

    for i in range(len(content)):
        the_tweet = content[i]
        vader_output = run_vader(the_tweet, parts_of_speech_to_consider={'NOUN'}) # run vader
        vader_label = vader_output_to_label(vader_output) # convert vader output to category
        
        tweets.append(the_tweet)
        all_vader_output.append(vader_label)
        
        
        
    for i in range(len(all_vader_output)):
        if all_vader_output[i] == 'neutral':
            all_vader_output[i] = 'neg'
        elif all_vader_output[i] == 'positive':
            all_vader_output[i] = 'pos'
        else:
            all_vader_output[i] = 'neu'

    # print("GOLD LABELS:", gold)
    # print("VADER LABELS:",all_vader_output)

    return gold, all_vader_output

In [54]:
pos_gold, pos_all_vader_output = vader_analyse_5('pos', get_content('positive'))
neu_gold, neu_all_vader_output = vader_analyse_5('neu', get_content('neutral'))
neg_gold, neg_all_vader_output = vader_analyse_5('neg', get_content('negative'))


gold = pos_gold + neu_gold + neg_gold

all_vader_output = pos_all_vader_output + neu_all_vader_output + neg_all_vader_output
print("NOUNS")
# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output))

NOUNS
              precision    recall  f1-score   support

         neg       0.37      0.73      0.49      1750
         neu       0.17      0.04      0.06      1515
         pos       0.53      0.34      0.41      1490

    accuracy                           0.39      4755
   macro avg       0.36      0.37      0.32      4755
weighted avg       0.36      0.39      0.33      4755



In [55]:
def vader_analyse_6(sent, content):
    tweets = []
    all_vader_output = []
    gold = []

    # settings (to change for different experiments)
    to_lemmatize = True 
    pos = set()
    gold = [sent for i in range(len(content))]

    for i in range(len(content)):
        the_tweet = content[i]
        vader_output = run_vader(the_tweet, lemmatize=True, parts_of_speech_to_consider={'NOUN'}) # run vader
        vader_label = vader_output_to_label(vader_output) # convert vader output to category
        
        tweets.append(the_tweet)
        all_vader_output.append(vader_label)
        
        
        
    for i in range(len(all_vader_output)):
        if all_vader_output[i] == 'neutral':
            all_vader_output[i] = 'neg'
        elif all_vader_output[i] == 'positive':
            all_vader_output[i] = 'pos'
        else:
            all_vader_output[i] = 'neu'

    # print("GOLD LABELS:", gold)
    # print("VADER LABELS:",all_vader_output)

    return gold, all_vader_output

In [56]:
pos_gold, pos_all_vader_output = vader_analyse_6('pos', get_content('positive'))
neu_gold, neu_all_vader_output = vader_analyse_6('neu', get_content('neutral'))
neg_gold, neg_all_vader_output = vader_analyse_6('neg', get_content('negative'))


gold = pos_gold + neu_gold + neg_gold

all_vader_output = pos_all_vader_output + neu_all_vader_output + neg_all_vader_output
print("LEMMA+NOUN")
# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output))

LEMMA+NOUN
              precision    recall  f1-score   support

         neg       0.36      0.71      0.48      1750
         neu       0.17      0.04      0.07      1515
         pos       0.52      0.33      0.40      1490

    accuracy                           0.38      4755
   macro avg       0.35      0.36      0.32      4755
weighted avg       0.35      0.38      0.33      4755



In [57]:
def vader_analyse_7(sent, content):
    tweets = []
    all_vader_output = []
    gold = []

    # settings (to change for different experiments)
    to_lemmatize = True 
    pos = set()
    gold = [sent for i in range(len(content))]

    for i in range(len(content)):
        the_tweet = content[i]
        vader_output = run_vader(the_tweet, parts_of_speech_to_consider={'VERB'}) # run vader
        vader_label = vader_output_to_label(vader_output) # convert vader output to category
        
        tweets.append(the_tweet)
        all_vader_output.append(vader_label)
        
        
        
    for i in range(len(all_vader_output)):
        if all_vader_output[i] == 'neutral':
            all_vader_output[i] = 'neg'
        elif all_vader_output[i] == 'positive':
            all_vader_output[i] = 'pos'
        else:
            all_vader_output[i] = 'neu'

    # print("GOLD LABELS:", gold)
    # print("VADER LABELS:",all_vader_output)

    return gold, all_vader_output

In [58]:
pos_gold, pos_all_vader_output = vader_analyse_7('pos', get_content('positive'))
neu_gold, neu_all_vader_output = vader_analyse_7('neu', get_content('neutral'))
neg_gold, neg_all_vader_output = vader_analyse_7('neg', get_content('negative'))


gold = pos_gold + neu_gold + neg_gold

all_vader_output = pos_all_vader_output + neu_all_vader_output + neg_all_vader_output
print("VERB")
# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output))

VERB
              precision    recall  f1-score   support

         neg       0.32      0.59      0.42      1750
         neu       0.16      0.07      0.10      1515
         pos       0.57      0.34      0.43      1490

    accuracy                           0.35      4755
   macro avg       0.35      0.34      0.31      4755
weighted avg       0.35      0.35      0.32      4755



In [59]:
def vader_analyse_8(sent, content):
    tweets = []
    all_vader_output = []
    gold = []

    # settings (to change for different experiments)
    to_lemmatize = True 
    pos = set()
    gold = [sent for i in range(len(content))]

    for i in range(len(content)):
        the_tweet = content[i]
        vader_output = run_vader(the_tweet, lemmatize=True, parts_of_speech_to_consider={'VERB'}) # run vader
        vader_label = vader_output_to_label(vader_output) # convert vader output to category
        
        tweets.append(the_tweet)
        all_vader_output.append(vader_label)
        
        
        
    for i in range(len(all_vader_output)):
        if all_vader_output[i] == 'neutral':
            all_vader_output[i] = 'neg'
        elif all_vader_output[i] == 'positive':
            all_vader_output[i] = 'pos'
        else:
            all_vader_output[i] = 'neu'

    # print("GOLD LABELS:", gold)
    # print("VADER LABELS:",all_vader_output)

    return gold, all_vader_output

In [60]:
pos_gold, pos_all_vader_output = vader_analyse_8('pos', get_content('positive'))
neu_gold, neu_all_vader_output = vader_analyse_8('neu', get_content('neutral'))
neg_gold, neg_all_vader_output = vader_analyse_8('neg', get_content('negative'))


gold = pos_gold + neu_gold + neg_gold

all_vader_output = pos_all_vader_output + neu_all_vader_output + neg_all_vader_output
print("LEMMA + VERB")
# use scikit-learn's classification report
print(classification_report(y_true=gold, y_pred=all_vader_output))

LEMMA + VERB
              precision    recall  f1-score   support

         neg       0.33      0.59      0.42      1750
         neu       0.19      0.09      0.12      1515
         pos       0.57      0.35      0.43      1490

    accuracy                           0.36      4755
   macro avg       0.36      0.34      0.33      4755
weighted avg       0.36      0.36      0.33      4755
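
The eight `vader_analyse_*` functions above differ only in the arguments passed to `run_vader`. A sketch of one parameterized helper that could replace them follows. The scoring function is passed in so the example runs without spaCy or NLTK; `analyse`, `stub_scorer`, `EXPERIMENTS` and `LONG_TO_SHORT` are hypothetical names, and the stub is not VADER. The sketch maps 'negative' to 'neg' and 'neutral' to 'neu', which differs from the conversion loop in the functions above, so it would not reproduce the reports shown.

```python
LONG_TO_SHORT = {'negative': 'neg', 'neutral': 'neu', 'positive': 'pos'}

# One entry per experiment; values are keyword arguments for the scorer.
EXPERIMENTS = {
    'BASE': {},
    'LEMMA': {'lemmatize': True},
    'ADJ': {'parts_of_speech_to_consider': {'ADJ'}},
    'LEMMA+ADJ': {'lemmatize': True, 'parts_of_speech_to_consider': {'ADJ'}},
}


def compound_to_label(compound):
    """Same sign-based rule as vader_output_to_label in this notebook."""
    if compound < 0:
        return 'negative'
    if compound > 0:
        return 'positive'
    return 'neutral'


def analyse(gold_label, texts, score_fn, **score_kwargs):
    """Return (gold, predicted) label lists for texts sharing one gold label."""
    gold = [gold_label] * len(texts)
    predicted = []
    for text in texts:
        compound = score_fn(text, **score_kwargs)['compound']
        predicted.append(LONG_TO_SHORT[compound_to_label(compound)])
    return gold, predicted


def stub_scorer(text, lemmatize=False, parts_of_speech_to_consider=None):
    """Stand-in for run_vader, for illustration only; ignores its settings."""
    if 'great' in text:
        return {'compound': 0.6}
    if 'worst' in text:
        return {'compound': -0.6}
    return {'compound': 0.0}


for name, kwargs in EXPERIMENTS.items():
    gold, predicted = analyse('neg', ['worst flight ever', 'great crew'],
                              stub_scorer, **kwargs)
    print(name, gold, predicted)
```

In the notebook, `run_vader` would be passed as `score_fn`, and the NOUN and VERB settings would be two more entries in the dictionary.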



## Part II: scikit-learn assignments

Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

In [101]:
import pathlib
import sklearn
import numpy
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)

path: /home/mihaly/Documents/uni/VU/text_mining/ba-text-mining/lab_sessions/lab3/airlinetweets


In [18]:
airline_tweets_train = load_files(str(airline_tweets_folder))

In [63]:
#warnings are annoying, this way we suppress them
#thanks to https://stackoverflow.com/a/33616192
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

#for question 6
def important_features_per_class(vectorizer, classifier, n=80):
    # rank features by their raw per-class counts (classifier.feature_count_)
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out()
    # load_files orders the class folders alphabetically
    headers = ["negative", "neutral", "positive"]
    for class_index, header in enumerate(headers):
        if class_index:
            print("-----------------------------------------")
        print(f"Important words in {header} documents")
        top_n = sorted(zip(classifier.feature_count_[class_index], feature_names),
                       reverse=True)[:n]
        for count, feat in top_n:
            print(class_labels[class_index], count, feat)

#list of all the df values we want to test: 
min_dfs = [2,5,10]

#we will store the different vectorizings in this list
vectorized = []

#transformer init
tfidf_transformer = TfidfTransformer()

for min_df in min_dfs:
    airline_vec = CountVectorizer(min_df=min_df, # iterating through the values we want to test
                             tokenizer=nltk.word_tokenize,
                             stop_words=stopwords.words('english'))
    
    #we have to redo these due to changing min_df
    airline_counts = airline_vec.fit_transform(airline_tweets_train.data)
    vectorized.append(airline_counts)
    
    airline_tfidf = tfidf_transformer.fit_transform(airline_counts)
    vectorized.append(airline_tfidf)
    
    for index, representation in enumerate(vectorized):
        
        docs_train, docs_test, y_train, y_test = train_test_split(
        representation, # the current representation
        airline_tweets_train.target, # the category values for each tweet 
        test_size = 0.20 # we use 80% for training and 20% for development
        ) 
        
        clf = MultinomialNB().fit(docs_train, y_train)
        
        y_pred = clf.predict(docs_test)
        
        #for question 6
        if min_df == 2 and not index:
            print ("Important features:")
            important_features_per_class(airline_vec, clf)
            print()
        
        print("Vectorization:", "TF-IDF" if index else "Bag of Words")
        print("Min DF:", min_df)
        print("Classification report:")
        print(classification_report(y_true=y_test,
                                    y_pred=y_pred), "\n")
        
        
    #clear after every iteration
    vectorized.clear()

Important features:
Important words in negative documents
0 1537.0 @
0 1412.0 united
0 1259.0 .
0 418.0 ``
0 399.0 ?
0 393.0 flight
0 341.0 !
0 337.0 #
0 220.0 n't
0 141.0 ''
0 120.0 's
0 114.0 service
0 111.0 :
0 109.0 virginamerica
0 102.0 get
0 98.0 cancelled
0 95.0 delayed
0 89.0 plane
0 89.0 customer
0 85.0 time
0 83.0 bag
0 81.0 'm
0 75.0 -
0 74.0 ;
0 69.0 gate
0 68.0 ...
0 67.0 http
0 67.0 &
0 65.0 hours
0 64.0 help
0 62.0 still
0 62.0 late
0 62.0 hour
0 62.0 2
0 60.0 airline
0 59.0 would
0 58.0 amp
0 55.0 one
0 53.0 flights
0 53.0 delay
0 49.0 like
0 49.0 flightled
0 49.0 ca
0 48.0 waiting
0 48.0 never
0 46.0 worst
0 46.0 $
0 45.0 (
0 44.0 3
0 44.0 )
0 43.0 us
0 43.0 've
0 42.0 back
0 41.0 fly
0 40.0 seat
0 40.0 luggage
0 39.0 really
0 39.0 due
0 38.0 check
0 37.0 ticket
0 37.0 lost
0 35.0 people
0 35.0 day
0 35.0 bags
0 35.0 another
0 34.0 ever
0 33.0 wait
0 33.0 u
0 33.0 trying
0 33.0 thanks
0 33.0 going
0 33.0 baggage
0 32.0 staff
0 31.0 got
0 31.0 could
0 31.0 airport
0 30.
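
The loop in the cell above varies `min_df`. When given as an integer, `CountVectorizer` drops terms that occur in fewer than that many documents. A pure-Python sketch of the effect on a toy corpus; `filter_by_min_df` is a hypothetical helper and the documents are invented for illustration.

```python
from collections import Counter


def filter_by_min_df(tokenized_docs, min_df):
    """Return the terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term for term, freq in doc_freq.items() if freq >= min_df}


# Toy corpus for illustration only.
docs = [
    ['flight', 'delayed', 'again'],
    ['flight', 'cancelled'],
    ['great', 'flight', 'crew'],
    ['delayed', 'bag'],
]
for min_df in (1, 2, 3):
    print(min_df, sorted(filter_by_min_df(docs, min_df)))
```

Raising `min_df` shrinks the vocabulary to terms shared across documents, which removes rare tokens but can also discard informative ones.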

## End of this notebook