# **By Timofei Polivanov, Sami Rahali**

# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma? wordform? part-of-speech? polarity value? word meaning?). What do all the scores mean?


### [3.5 points] Question1:

Regard the following sentences and their output as given by VADER. Regard sentences 1 to 7, and explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain what is going wrong or not when correct and incorrect results are produced. 

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}
```

This is a correct result (positive). "love" is a strongly positive word, and this is a simple and positive sentence.

```
INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}
```

Correct result (negative). "don't" is a negation that flips the positive sentiment of "love" into negative.
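
The compound scores of sentences 1 and 2 can be reproduced by hand. The sketch below rests on three assumptions taken from our reading of the VADER source and lexicon, which should be checked against the installed version: the compound score is the valence sum `x` normalized as `x / sqrt(x² + 15)`, "love" has a lexicon valence of 3.2, and a negated word is multiplied by -0.74.

```python
import math

# Assumed constants (our reading of the VADER source; verify locally).
ALPHA = 15               # normalization constant for the compound score
NEGATION_SCALAR = -0.74  # multiplier applied to a negated sentiment word
LOVE_VALENCE = 3.2       # lexicon valence assumed for "love"


def normalize(valence_sum, alpha=ALPHA):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


compound_plain = normalize(LOVE_VALENCE)                      # "I love apples"
compound_negated = normalize(LOVE_VALENCE * NEGATION_SCALAR)  # "I don't love apples"

print(round(compound_plain, 4), round(compound_negated, 4))
```

Under these assumptions the two values land very close to the 0.6369 and -0.5216 reported by VADER, which supports the reading that negation rescales and flips the valence of "love" rather than removing it.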


```
INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}
```

VADER recognizes ":-)" as a positive emoticon, so it intensifies the positiveness. Correct result (positive).

```
INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}
```

Correct result (negative). The word "ruins" has a negative score in the lexicon, as it associates with destruction. So the sentence is flagged as negative.

```
INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}
```

This result is slightly incorrect (positive). "not" negates “ruins”, which is negative, so the outcome could become either positive or neutral. In this case it should be neutral, but VADER classifies it as positive, potentially because of the booster word “certainly”.

```
INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}
```

Incorrect result (negative). VADER likely assigns negative sentiment because the word 'lies' can mean either deception or reclining, depending on context. VADER has no way to tell the two senses apart, so the sentence is misclassified due to lexical ambiguity.

```
INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

Slightly incorrect result (positive). In this case the sentence expresses that the houses aren't anything special, so it should be neutral. However, VADER classifies as positive, probably misinterpreting the meaning of the word 'like' in this context, as it can mean either 'to enjoy' or 'to resemble'.

### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream. If you have trouble accessing Twitter, try to find an existing dataset (on websites like kaggle or huggingface).

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": "",
```
to:
```
"1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url" : "https://twitter.com/BarackObama/status/946775615893655552",
    },
```

You can load your tweets with human annotation in the following way.

In [1]:
import json
import spacy
import nltk
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report



In [2]:
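# Load the manually annotated tweets; the context manager closes the file afterwards
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)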

vader_model = SentimentIntensityAnalyzer()

nlp = spacy.load('en_core_web_sm')

In [3]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'negative', 'text_of_tweet': "@DisavowTrump20 @RepJasmine She's a narcissist and a foul mouth idiot!", 'tweet_url': 'https://twitter.com/i/web/status/1919382851948068930'}


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are less than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [17]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < -0.05:
        return 'negative'
        
    if compound > 0.05:
        return 'positive'
    
    return 'neutral'
    

def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a sentence from spacy
    
    :param str textual unit: a textual unit, e.g., sentence, sentences (one string)
    (by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        print('INPUT SENTENCE', sent)  # NOTE: `sent` is only the last sentence of the unit, not the full input
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores


# assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
# assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.06}) == 'positive'
# assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.06}) == 'negative'

In [19]:
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

count = 1
for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    
    vader_output = run_vader(the_tweet,
                             lemmatize=to_lemmatize,
                             parts_of_speech_to_consider=pos,  # empty set: all parts of speech
                             verbose=1) # run vader
    vader_label = vader_output_to_label(vader_output) # convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])

    print(f'#{count}')
    count += 1
    
# use scikit-learn's classification report
print(classification_report(gold, all_vader_output))



# Output:
#               precision    recall  f1-score   support

#     negative       0.71      0.60      0.65        25
#      neutral       0.60      0.18      0.27        17
#     positive       0.29      0.88      0.44         8

#     accuracy                           0.50        50
#    macro avg       0.54      0.55      0.45        50
# weighted avg       0.61      0.50      0.49        50


INPUT SENTENCE She's a narcissist and a foul mouth idiot!
INPUT TO VADER ['@DisavowTrump20', '@RepJasmine', 'she', 'be', 'a', 'narcissist', 'and', 'a', 'foul', 'mouth', 'idiot', '!']
VADER OUTPUT {'neg': 0.31, 'neu': 0.69, 'pos': 0.0, 'compound': -0.5562}
#1

INPUT SENTENCE I'm sorry if I disturbed you.
INPUT TO VADER ['@drp825', '_', 'I', 'meet', 'a', 'great', 'blogger', '.', '\n', 'he', 'mainly', 'trade', 'Tesla', ',', 'Nvidia', ',', 'Apple', ',', 'Google', ',', 'and', 'have', 'a', 'very', 'good', 'grasp', 'of', 'the', 'buying', 'and', 'sell', 'price', '.', 'if', 'it', 'help', 'you', ',', 'you', 'can', 'contact', '\n', 'https://t.co/p8L9pX0KDa', '\n', 'I', 'be', 'sorry', 'if', 'I', 'disturb', 'you', '.']
VADER OUTPUT {'neg': 0.091, 'neu': 0.682, 'pos': 0.227, 'compound': 0.7902}
#2

INPUT SENTENCE @haleyrumbles Man 2025 is a wack timeline
INPUT TO VADER ['@haleyrumble', 'Man', '2025', 'be', 'a', 'wack', 'timeline']
VADER OUTPUT {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
#

In [None]:
# Print header
print("actual label, vader predicted label")

# Print each pair of elements
for i in range(len(gold)):
    print(f"#{i+1} {gold[i]}, {all_vader_output[i]}")


actual label, vader predicted label
#1 negative, negative
#2 positive, positive
#3 negative, neutral
#4 neutral, positive
#5 neutral, positive
#6 neutral, negative
#7 negative, negative
#8 negative, negative
#9 neutral, positive
#10 negative, negative
#11 negative, negative
#12 neutral, neutral
#13 negative, negative
#14 neutral, negative
#15 neutral, positive
#16 negative, negative
#17 positive, positive
#18 neutral, positive
#19 neutral, negative
#20 negative, negative
#21 neutral, positive
#22 positive, positive
#23 positive, positive
#24 positive, positive
#25 neutral, neutral
#26 positive, positive
#27 negative, positive
#28 negative, neutral
#29 positive, negative
#30 neutral, positive
#31 negative, positive
#32 negative, positive
#33 negative, positive
#34 negative, positive
#35 negative, negative
#36 neutral, neutral
#37 neutral, negative
#38 negative, negative
#39 negative, negative
#40 positive, positive
#41 neutral, positive
#42 negative, negative
#43 negative, negative
#44 

### Question 3.a:

#### Scores:

* Negative: Highest precision (0.71) and moderate recall (0.60), meaning VADER's negative predictions are mostly correct, but it misses 40% of the negative tweets.
* Positive: Very high recall (0.88) but low precision (0.29), meaning VADER finds almost all positive tweets but also labels many non-positive tweets as positive (false positives).
* Neutral: Moderate precision (0.60) but very low recall (0.18), meaning VADER rarely predicts neutral and misses most neutral tweets.


#### Most relevant scores:

* The F1-score is most relevant because it combines precision and recall in a single number, which matters when the class distribution is unbalanced, as in our case (8 positive vs. 25 negative tweets).
* Macro-averaged F1 (0.45) weights the three classes equally, so it is more informative here than accuracy (0.50), which is dominated by the majority class.
* Accuracy can be misleading in this case, since the model may perform well on the majority class but badly on the others.
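
To make the relation between these scores concrete, the sketch below recomputes them from per-class counts. The counts are not taken from the notebook output; they are back-derived from the rounded classification report (an assumption), chosen so that they sum to 50 tweets and reproduce the reported values.

```python
# Per-class counts back-derived from the rounded report (an assumption):
# tp = correctly predicted, predicted = times VADER chose the label,
# support = number of gold tweets with the label.
counts = {
    "negative": {"tp": 15, "predicted": 21, "support": 25},
    "neutral": {"tp": 3, "predicted": 5, "support": 17},
    "positive": {"tp": 7, "predicted": 24, "support": 8},
}


def prf(tp, predicted, support):
    """Return precision, recall and F1 for one class."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


scores = {label: prf(**c) for label, c in counts.items()}
total = sum(c["support"] for c in counts.values())

accuracy = sum(c["tp"] for c in counts.values()) / total
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
weighted_f1 = sum(scores[l][2] * counts[l]["support"] for l in counts) / total

for label, (p, r, f1) in scores.items():
    print(f"{label:>8}  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
print(f"accuracy={accuracy:.2f}  macro F1={macro_f1:.2f}  weighted F1={weighted_f1:.2f}")
```

The macro average treats the small positive class (8 tweets) as equal to the large negative class (25 tweets), whereas the weighted average and accuracy are pulled towards the majority class.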



### Question 3.b:

#### Sentence #3 (negative -> neutral)
Sentence: '@haleyrumbles Man 2025 is a wack timeline'

VADER's lexicon does not include the slang 'wack', so it treats that token as neutral by default. No other words carry sentiment or trigger any of the rules. As a result, the model returns a neutral score (compound 0.0) even though 'wack' signals a negative judgment in everyday usage.

#### Sentence #4 (neutral -> positive)
Sentence: '@yobinoko1981 maybe God want we to meet a few wrong people before meet the right one, so that when we finally meet the person, we will know how to be grateful. 🦵🥱♿'

VADER's lexicon includes 'grateful' as a positive term and 'wrong' as negative, but the +2.0 weight on 'grateful' is higher. The emojis are also mapped to positive values. This pushes the compound to +0.3182, classifying it as positive even though the sentence is neutral.

#### Sentence #5 (neutral -> positive)
Sentence: '@MilaLovesJoe @data_republican Same for my daughters dog. It does help a lot. Her dog also gets a Pepsid pill 30 minutes before eating.'

VADER's lexicon marks 'help' as +1.7 and 'a lot' functions as an intensifier, resulting in a positive compound of +0.4019. The statement is interpreted as a praise for 'help', so VADER labels it positive despite its neutral intent.

#### Sentence #6 (neutral -> negative)
Sentence: "@FadedSolMaxi @ChimpersNFT I've got a thread dropping tomorrow giving the chimpdown on everything"

VADER's lexicon assigns a negative score to 'drop' (-1.1), so there are no positive words to balance it. In this context, 'dropping' means 'to post', however, VADER interpreted it as 'discarded' or 'let go', which has a negative connotation. So this incorrectly turns the neutral statement into a negative.

#### Sentence #9 (neutral -> positive)
Sentence: '@yobinoko1981 A handful of common sense is worth a bushel of learning.🚑🚮♣🟢🔊'

VADER's lexicon treats 'learning' as positive and maps some emojis to positive weights as well. This makes VADER classify this as positive, even though it is a neutral proverb.

#### Sentence #14 (neutral -> negative)
Sentence: '@SuperteamEarn Yoh! Can't believe I fell for this\n\nI was scared for  a minute'

VADER's lexicon flags 'scared' (-1.9) as negative, and the tweet contains no positive tokens to balance it. The negation 'Can't' is not related to 'scared', so the model labels the tweet negative, even though it is a neutral expression of surprise.

#### Sentence #15 (neutral -> positive)
Sentence: '@Satabdimo @anumnasreemkhan @raomeenakshi7 He's thinking of his career and wants to work with well know FLs. Also think he signed this before Iqtadars success. Doubt he'd sign a project where he's more supporting role again. But yes I can't watch them together 😖'

VADER's emoji lexicon mistakenly treats 😖 as enthusiastic, adding positive weight. The only other word is 'yes', and the negation 'can't' carries no meaning by itself. This yields a positive label, despite the user expressing discomfort.

#### Sentence #18 (neutral -> positive)
Sentence: '@_zaibii Chad Gable is a great in ring work, not Angle but he’s a great talent\n\nPriest has come up through the game so is what it is he’s a worker.\n\nFatu is awesome, I might get sick of him but for now amazing \n\nFuck Goldberg - Brons already had better matches then Bill ever would'

VADER's lexicon flags multiple strong positives ('great', 'talent', 'awesome', 'amazing') and potentially applies capitalization and punctuation heuristics. Together these return a very high compound value. The tweet also contains negativity in the form of explicit language, but the positive words outnumber it, so VADER interprets a mixed tweet as mostly positive.

#### Sentence #19 (neutral -> negative)
Sentence: '@07_melle No idea mate there isn't a clear obvious choice for me. Every winger is a risk imo'

VADER's lexicon assigns 'risk' a negative score. 'imo' is not included in the lexicon and no rules apply, so the single negative results in flagging it negative although the statement is a mostly neutral opinion.

#### Sentence #21 (neutral -> positive)
Sentence: '@AnnGutierrpear A true athlete plays with integrity, win or lose.🤕🌻ℹ'

VADER's lexicon tags 'win', 'true', and 'integrity' as positive, and at least one emoji is mapped positively. These signals sum to a high value, so VADER marks it positive even though the tone is neutral.

#### Sentence #27 (negative -> positive)
Sentence: '@KonateFC Hes leaving on a free mate.'

VADER's lexicon treats 'free' as a positive attribute with no negative entry for the context. The single positive token results in a positive classification, misreading sarcasm as positivity.

#### Sentence #28 (negative -> neutral)
Sentence: '@MarioNawfal Nothing matters until you build a couple hundred…'

VADER's lexicon scores 'nothing' as negative, but without any intensifier or contrastive-conjunction rule this yields only a weak negative compound, which is classified as neutral. The strong intended negativity is lost because it is implicit.

#### Sentence #29 (positive -> negative)
Sentence: '@WHLeavitt @LeaderJohnThune Heck yeah! It's a privilege to vote but it has degraded to a fraudulent scam.'

VADER's lexicon includes 'privilege' and 'fraudulent', 'scam'. Its contrastive-conjunction rule degrades the first clause and emphasizes what follows 'but', so the heavier negatives dominate and the compound is mostly negative. A mixed mostly positive statement becomes negative because of this.
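
The effect of the contrastive-conjunction rule can be sketched in isolation. The snippet assumes that VADER multiplies valences before "but" by 0.5 and those after it by 1.5 (our reading of the VADER source). The token valences are hypothetical placeholders, not the real lexicon values.

```python
def apply_but_rule(tokens, valences, before=0.5, after=1.5):
    """Rescale per-token valences around the first 'but' (assumed weights)."""
    lowered = [t.lower() for t in tokens]
    if "but" not in lowered:
        return list(valences)
    pivot = lowered.index("but")
    adjusted = []
    for i, v in enumerate(valences):
        if i < pivot:
            adjusted.append(v * before)   # dampen what precedes "but"
        elif i > pivot:
            adjusted.append(v * after)    # emphasize what follows "but"
        else:
            adjusted.append(v)            # "but" itself is left unchanged
    return adjusted


tokens = ["privilege", "but", "fraudulent", "scam"]
valences = [1.5, 0.0, -2.0, -2.0]  # hypothetical values for illustration only

adjusted = apply_but_rule(tokens, valences)
print(sum(valences), "->", sum(adjusted))
```

Even with made-up numbers the pattern is visible: the positive first clause is halved while the negative second clause grows, so the sum moves further into the negative range.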

#### Sentence #30 (neutral -> positive)
Sentence: '@TokenScape Hello 👋 ! I have a community of 60k best crypto $investors 🤑and $buyers 🔥on my X and Telegram channel! You must listen to my plan once; DM me!💌🤝'

VADER's emoji lexicon directly maps the emojis to a positive. There is also a positive word 'best', and exclamation marks. Because of this, even though the message is just an ad or promotion, it is incorrectly classified as positive.

#### Sentence #31 (negative -> positive)
Sentence: '@BRICSinfo Lmfao ! 🤣😂 he's entirely turned the USA into planet earths punchline .  A toddler with downs syndrome could do a better job running that nation'

The sentence includes a lot of laughing emojis, and exclamation marks. VADER's lexicon also includes 'better' and 'job' but has no entry for 'downs' or 'syndrome', so positive tokens overwhelm. So VADER misses the hateful context entirely, because it's sarcasm.

#### Sentence #32 (negative -> positive)
Sentence: '@AfricanHub_ That's easy. They can control the corrupt bad guys with money. A dumb question, but you know this. Ask better.'

VADER's lexicon marks 'better' and 'easy' as positive. There are also negative words, but positive ones have a greater sum, which results in a wrong positive classification. Somehow VADER misses the explicit negativity.

#### Sentence #33 (negative -> positive)
Sentence: '@ScuderiaFerrari @LewisHamilton @Charles_Leclerc @ScuderiaFerrari continuing to make it really hard to support them. If it weren’t for the fact that A) I’m a die hard fan B) I have faith in Fred and C) I believe in Charles, I would have given up a long long time back.'

VADER gets a high compound value from the tweet because of positive words: 'support', 'faith', 'fan'. However, this misses the first part of the tweet, which is negative. Thus VADER misclassifies this as positive.

#### Sentence #34 (negative -> positive)
Sentence: '@bycoinhunter @Revolving_Games @HatchCoin 3.2m market cap for #HATCHonBNB is just a warm up. People buying memes on 100-200M mcap, to get 2x or more. You think a token backed by RG with strong community and 20 season of different kinds of genres gaming experience, solid IP can’t do it, then you are wrong. 44.908'

VADER analyzed the entire text and found many positive lexicon terms ('solid', 'strong', 'community', 'experience'), resulting in a high compound. Even though the purpose of this tweet is to point out someone's mistake, which is a negative intent, it gets classified as positive due to words that are not really positive in this context.

#### Sentence #45 (negative -> positive)
Sentence: '@Junia___ @tiredofitallUSA @pooL_rM311_7221 @brometheus0x yo this gesara stuff sounds like a pyramid scheme for nerds'

VADER's lexicon has a high value for 'like', which turns the sentence positive, missing the context and the insult.

#### Sentence #46 (negative -> positive)
Sentence: '@AidenHunterX Need a more permanent solution so shlomo cant go running to banks to crybully people.'

VADER's lexicon tags 'solution' as positive, while 'crybully' is not included. The model flags this as positive, missing the insult and negative tone.

#### Sentence #48 (negative -> positive)
Sentence: '@TopMGSzn @FabrizioRomano You have fixated issue! A Real Madrid that have more La liga  and champions league trophies than your bribing referee Barcelona fc'

VADER lexicon marks the words 'champions' and 'trophies' as positive, as well as the exclamation point. There is a balancing word 'bribing', but it doesn't outweigh the positive terms. Thus, the intended insult and negative tone are missed by the model.
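
Most of the explanations above come down to checking which tokens of a tweet have a lexicon entry. A minimal sketch of such a check is given below. The helper `sentiment_bearing_tokens` and the `toy_lexicon` are hypothetical names introduced for illustration, and the valences are placeholders rather than real VADER values.

```python
def sentiment_bearing_tokens(text, lexicon):
    """Return (token, valence) pairs for the tokens found in the lexicon."""
    hits = []
    for raw in text.split():
        # Strip surrounding punctuation and lowercase before the lookup.
        token = raw.strip(".,!?;:'\"()").lower()
        if token in lexicon:
            hits.append((token, lexicon[token]))
    return hits


# Toy lexicon with placeholder valences; NOT the real VADER values.
toy_lexicon = {"like": 1.5, "free": 2.0, "risk": -1.0}

print(sentiment_bearing_tokens("Man 2025 is a wack timeline", toy_lexicon))
print(sentiment_bearing_tokens("Hes leaving on a free mate.", toy_lexicon))
```

With NLTK installed, the `lexicon` attribute of `SentimentIntensityAnalyzer` (as far as we know a plain dict from token to valence) could be passed instead of `toy_lexicon` to inspect the real entries.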

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.

In [22]:
from sklearn.datasets import load_files
from sklearn.metrics import classification_report
import pathlib

In [None]:
# Load the tweets from the directory
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
dataset = load_files(str(airline_tweets_folder), encoding='utf-8', decode_error='replace')

def call_vader(dataset, to_lemmatize=False, parts_of_speech_to_consider=None, verbose=0):
    tweets = []
    all_vader_output = []
    gold = []

    label_names = dataset.target_names  # ['negative', 'neutral', 'positive']

    # Process each tweet
    for count, (tweet_text, gold_label_idx) in enumerate(zip(dataset.data, dataset.target), start=1):
        vader_output = run_vader(tweet_text, to_lemmatize, parts_of_speech_to_consider, verbose)
        vader_label = vader_output_to_label(vader_output)

        tweets.append(tweet_text)
        all_vader_output.append(vader_label)
        gold.append(label_names[gold_label_idx])

    # Print classification report
    print(classification_report(gold, all_vader_output, labels=label_names))


In [None]:
# Default
call_vader(dataset)

              precision    recall  f1-score   support

    negative       0.80      0.51      0.62      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.69      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.64      0.62      4755
weighted avg       0.66      0.63      0.62      4755



In [33]:
# Lemmatized
call_vader(dataset, True)

              precision    recall  f1-score   support

    negative       0.79      0.52      0.62      1750
     neutral       0.59      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.64      0.63      0.61      4755
weighted avg       0.65      0.62      0.61      4755



In [34]:
# Adjectives
call_vader(dataset, False, {'ADJ'})

              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.41      0.89      0.56      1515
    positive       0.67      0.45      0.54      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.52      0.48      4755
weighted avg       0.66      0.50      0.47      4755



In [35]:
# Lemmatized adjectives
call_vader(dataset, True, {'ADJ'})

              precision    recall  f1-score   support

    negative       0.86      0.21      0.34      1750
     neutral       0.41      0.89      0.56      1515
    positive       0.67      0.45      0.54      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.52      0.48      4755
weighted avg       0.66      0.50      0.47      4755



In [36]:
# Nouns
call_vader(dataset, False, {'NOUN'})

              precision    recall  f1-score   support

    negative       0.71      0.14      0.23      1750
     neutral       0.36      0.82      0.50      1515
    positive       0.54      0.34      0.42      1490

    accuracy                           0.42      4755
   macro avg       0.54      0.43      0.38      4755
weighted avg       0.55      0.42      0.37      4755



In [37]:
# Lemmatized nouns
call_vader(dataset, True, {'NOUN'})

              precision    recall  f1-score   support

    negative       0.70      0.15      0.25      1750
     neutral       0.36      0.81      0.50      1515
    positive       0.53      0.33      0.41      1490

    accuracy                           0.42      4755
   macro avg       0.53      0.43      0.38      4755
weighted avg       0.54      0.42      0.38      4755



In [38]:
# Verbs
call_vader(dataset, False, {'VERB'})

              precision    recall  f1-score   support

    negative       0.79      0.28      0.42      1750
     neutral       0.38      0.82      0.52      1515
    positive       0.58      0.35      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.59      0.48      0.46      4755
weighted avg       0.60      0.47      0.46      4755



In [39]:
# Lemmatized verbs
call_vader(dataset, True, {'VERB'})

              precision    recall  f1-score   support

    negative       0.76      0.29      0.42      1750
     neutral       0.38      0.79      0.51      1515
    positive       0.58      0.36      0.44      1490

    accuracy                           0.47      4755
   macro avg       0.57      0.48      0.46      4755
weighted avg       0.58      0.47      0.46      4755



### When we looked across all eight experiments, two things were noticeable:

#### Lemmatization has virtually no impact:

Full text (raw): accuracy 0.63, macro F1 0.62
Full text (lemmatized): accuracy 0.62, macro F1 0.61
Adjectives (raw): accuracy 0.50, macro F1 0.48
Adjectives (lemmatized): accuracy 0.50, macro F1 0.48
Verbs (raw): accuracy 0.47, macro F1 0.46
Verbs (lemmatized): accuracy 0.47, macro F1 0.46
Nouns (raw): accuracy 0.42, macro F1 0.38
Nouns (lemmatized): accuracy 0.42, macro F1 0.38

The drop from 0.63 to 0.62 accuracy (and from 0.62 to 0.61 macro F1) in the full-text run is negligible. In the parts-of-speech-only runs, accuracy and macro F1 are identical for the raw and lemmatized versions, and the per-class scores differ by at most 0.03. A likely reason is that VADER's lexicon already lists many inflected word forms as separate entries, so lemmatizing the input adds little.


#### Parts of speech are not equally important:

Adjectives only: accuracy 0.50, macro F1 0.48
Verbs only: accuracy 0.47, macro F1 0.46
Nouns only: accuracy 0.42, macro F1 0.38

Adjectives carry the strongest sentiment. Verbs help somewhat, but many are neutral. Nouns are weakest, as most are topic markers. No POS subset matches the full-text run, because VADER relies on the combination of all parts of speech plus its additional rules (negation, intensifiers, punctuation), which need the surrounding tokens to fire.
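
The two comparisons can also be made explicit in a few lines. The sketch below only reuses the accuracy and macro F1 values copied from the eight classification reports. The dictionary layout and variable names are our own choice.

```python
# (accuracy, macro F1) copied from the eight classification reports.
results = {
    ("full text", "raw"): (0.63, 0.62),
    ("full text", "lemmatized"): (0.62, 0.61),
    ("adjectives", "raw"): (0.50, 0.48),
    ("adjectives", "lemmatized"): (0.50, 0.48),
    ("verbs", "raw"): (0.47, 0.46),
    ("verbs", "lemmatized"): (0.47, 0.46),
    ("nouns", "raw"): (0.42, 0.38),
    ("nouns", "lemmatized"): (0.42, 0.38),
}

subsets = ["full text", "adjectives", "verbs", "nouns"]

# Effect of lemmatization on macro F1: lemmatized minus raw, per input type.
lemma_delta = {
    s: round(results[(s, "lemmatized")][1] - results[(s, "raw")][1], 2)
    for s in subsets
}

# Ranking of the input types by raw macro F1, best first.
ranking = sorted(subsets, key=lambda s: results[(s, "raw")][1], reverse=True)

print(lemma_delta)
print(ranking)
```

The deltas show that lemmatization changes macro F1 by at most 0.01, while the choice of input type changes it by up to 0.24 (full text vs. nouns only).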

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

In [42]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split


In [48]:
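def train_model(data_type='tf', min_df=2):
    """Train Multinomial Naive Bayes on the airline tweets and print a classification report.

    data_type: 'tf' selects the TF-IDF matrix; any other value (e.g. 'bag') selects raw counts.
    min_df: minimum number of tweets a token must occur in to be kept as a feature.
    """
    # Count-based features; tokens that occur in fewer than min_df tweets are dropped
    airline_vec = CountVectorizer(min_df=min_df,
                                tokenizer=nltk.word_tokenize,
                                stop_words=stopwords.words('english'))

    airline_counts = airline_vec.fit_transform(dataset.data)
    tfidf_transformer = TfidfTransformer()
    airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

    data = airline_tfidf if data_type == 'tf' else airline_counts

    # No random_state is set, so every call draws a different random split
    docs_train, docs_test, y_train, y_test = train_test_split(
        data, # the selected feature matrix (TF-IDF or counts)
        dataset.target, # the category values for each tweet 
        test_size = 0.20 # we use 80% for training and 20% for testing
        ) 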

    clf = MultinomialNB().fit(docs_train, y_train)
    y_pred = clf.predict(docs_test)

    y_test_labels = [dataset.target_names[i] for i in y_test]
    y_pred_labels = [dataset.target_names[i] for i in y_pred]

    # Output the classification report
    print(classification_report(y_test_labels, y_pred_labels, labels=dataset.target_names))
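
`train_model` draws a new random split on every call and fits the vectorizer on all tweets before splitting. The sketch below is one possible variant that fixes the split with `random_state` and fits the vectorizer on the training part only, assuming scikit-learn is installed. The function name `build_pipeline` and the toy tweets are hypothetical and only stand in for `dataset.data` and `dataset.target`; this is not the code that produced the reports in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline(use_tfidf=True, min_df=1):
    """Chain the vectorizer, optional TF-IDF weighting, and Naive Bayes in one estimator."""
    steps = [("counts", CountVectorizer(min_df=min_df))]
    if use_tfidf:
        steps.append(("tfidf", TfidfTransformer()))
    steps.append(("nb", MultinomialNB()))
    return Pipeline(steps)


# Hypothetical toy tweets standing in for dataset.data / dataset.target
texts = [
    "flight delayed again", "worst service ever", "bag lost again",
    "cancelled flight no help", "late and rude service",
    "what time is boarding", "where is gate 12", "can i change my seat",
    "is the flight on time", "do you fly to boston",
    "thanks for the great flight", "awesome crew thanks", "love the new seats",
    "great service today", "best flight ever thanks",
]
labels = ["negative"] * 5 + ["neutral"] * 5 + ["positive"] * 5

# A fixed random_state gives the same split on every run; stratify keeps class proportions
train_texts, test_texts, train_y, test_y = train_test_split(
    texts, labels, test_size=0.20, random_state=0, stratify=labels)

# min_df=1 only because the toy corpus is tiny; the assignment uses 2, 5, and 10
model = build_pipeline(use_tfidf=False, min_df=1)
model.fit(train_texts, train_y)  # the vocabulary is learned from the training tweets only
predictions = model.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

With a fixed split, differences between settings reflect the settings rather than which tweets happened to land in the test set.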

In [49]:
# tf-idf, df=2
train_model('tf', 2)



              precision    recall  f1-score   support

    negative       0.80      0.90      0.85       365
     neutral       0.82      0.67      0.74       286
    positive       0.84      0.85      0.84       300

    accuracy                           0.82       951
   macro avg       0.82      0.81      0.81       951
weighted avg       0.82      0.82      0.81       951



In [50]:
# tf-idf, df=5
train_model('tf', 5)



              precision    recall  f1-score   support

    negative       0.83      0.89      0.86       360
     neutral       0.81      0.71      0.75       299
    positive       0.83      0.85      0.84       292

    accuracy                           0.82       951
   macro avg       0.82      0.82      0.82       951
weighted avg       0.82      0.82      0.82       951



In [51]:
# tf-idf, df=10
train_model('tf', 10)



              precision    recall  f1-score   support

    negative       0.82      0.91      0.86       353
     neutral       0.82      0.72      0.76       309
    positive       0.82      0.83      0.83       289

    accuracy                           0.82       951
   macro avg       0.82      0.82      0.82       951
weighted avg       0.82      0.82      0.82       951



In [52]:
# bag, df=2
train_model('bag', 2)



              precision    recall  f1-score   support

    negative       0.84      0.91      0.87       341
     neutral       0.87      0.73      0.79       297
    positive       0.84      0.89      0.87       313

    accuracy                           0.85       951
   macro avg       0.85      0.84      0.84       951
weighted avg       0.85      0.85      0.85       951



In [53]:
# bag, df=5
train_model('bag', 5)



              precision    recall  f1-score   support

    negative       0.83      0.91      0.87       347
     neutral       0.82      0.73      0.77       297
    positive       0.83      0.84      0.83       307

    accuracy                           0.83       951
   macro avg       0.83      0.83      0.83       951
weighted avg       0.83      0.83      0.83       951



In [54]:
# bag, df=10
train_model('bag', 10)



              precision    recall  f1-score   support

    negative       0.83      0.89      0.86       369
     neutral       0.78      0.75      0.76       303
    positive       0.84      0.79      0.81       279

    accuracy                           0.82       951
   macro avg       0.81      0.81      0.81       951
weighted avg       0.82      0.82      0.81       951



### Conclusion

Across all six experiments, the negative class achieves the highest F1-score (0.85–0.87), regardless of the representation (TF-IDF or bag-of-words) and the min_df value; only in the bag-of-words run with min_df=2 does the positive class tie it at 0.87. The positive class comes second (F1 0.81–0.87), while neutral tweets are the hardest to classify (F1 0.74–0.79), mainly because of low recall (0.67–0.75). This ordering holds in every experiment. The better performance on negative and positive tweets is probably because such language, for example complaints or expressions of happiness, is more explicit, so the classifier can pick up strong signal words, whereas neutral tweets are more subtle and lack such words.
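
The per-class ranges can be checked against the six reports with a few lines of Python. The F1 values below are copied from the classification reports above; the names `f1_by_run`, `f1_range`, and `best_labels` are ours.

```python
# F1-scores copied from the six classification reports above.
f1_by_run = {
    ("tfidf", 2):  {"negative": 0.85, "neutral": 0.74, "positive": 0.84},
    ("tfidf", 5):  {"negative": 0.86, "neutral": 0.75, "positive": 0.84},
    ("tfidf", 10): {"negative": 0.86, "neutral": 0.76, "positive": 0.83},
    ("bag", 2):    {"negative": 0.87, "neutral": 0.79, "positive": 0.87},
    ("bag", 5):    {"negative": 0.87, "neutral": 0.77, "positive": 0.83},
    ("bag", 10):   {"negative": 0.86, "neutral": 0.76, "positive": 0.81},
}


def f1_range(runs, label):
    """Lowest and highest F1 of one class over all runs."""
    values = [scores[label] for scores in runs.values()]
    return min(values), max(values)


def best_labels(scores):
    """Class or classes with the highest F1 in a single run (ties are kept)."""
    top = max(scores.values())
    return sorted(label for label, f1 in scores.items() if f1 == top)


for label in ("negative", "neutral", "positive"):
    low, high = f1_range(f1_by_run, label)
    print(f"{label:<8} F1 {low:.2f}-{high:.2f}")

for run, scores in f1_by_run.items():
    print(run, "best:", best_labels(scores))
```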

Increasing the document frequency threshold (min_df) from 2 to 5 to 10 has only a slight effect with TF-IDF: neutral recall rises from 0.67 to 0.72, and accuracy stays at 0.82. TF-IDF down-weights common words, and min_df=2 already removes tokens that occur in a single tweet, so filtering further changes little. With bag-of-words counts, raising min_df lowers the scores slightly for the neutral class (F1 0.79 to 0.76) and the positive class (F1 0.87 to 0.81), which suggests that some of the removed low-frequency words were distinctive for a class, so removing them can hurt performance. These differences are small, and because each run uses a different random train/test split (the support counts differ between reports), part of the variation may be due to chance.
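
To make the two mechanisms concrete, the following standard-library sketch computes document frequencies for a toy corpus, applies a `min_df` cut-off, and computes the smoothed IDF weight `ln((1 + n) / (1 + df)) + 1`, which is the default formula of scikit-learn's `TfidfTransformer`. The toy tweets and the helper names are hypothetical.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count in how many documents each token occurs (not how often in total)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def apply_min_df(df, min_df):
    """Keep only tokens that occur in at least min_df documents."""
    return {term for term, count in df.items() if count >= min_df}


def smooth_idf(df, n_docs):
    """Smoothed inverse document frequency: rarer tokens get a larger weight."""
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


# Hypothetical toy tweets, already tokenized
docs = [
    "flight delayed again".split(),
    "flight cancelled no refund".split(),
    "thanks for the great flight".split(),
    "great crew thanks".split(),
]

df = document_frequency(docs)
idf = smooth_idf(df, len(docs))

print(sorted(apply_min_df(df, 2)))  # vocabulary left with min_df=2
print(sorted(apply_min_df(df, 3)))  # vocabulary left with min_df=3
print(round(idf["flight"], 3), round(idf["delayed"], 3))
```

The threshold removes rare tokens entirely, whereas IDF only rescales the tokens that survive, giving the rarer ones more weight.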

### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below) [1 point]
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [60]:
def important_features_per_class(vectorizer,classifier,n=80):
    """Print the n most frequent features per class, ranked by raw count (feature_count_)."""
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out()
    # feature_count_[i] holds the summed count of each feature over the training tweets of class i
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for count, feat in topn_class1:
        print(class_labels[0], count, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for count, feat in topn_class2:
        print(class_labels[1], count, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for count, feat in topn_class3:
        print(class_labels[2], count, feat) 

# example of how to call from notebook:
#important_features_per_class(airline_vec, clf)

def train_model(data_type='tf', min_df=2, show_features=False, top_n=80):
    airline_vec = CountVectorizer(min_df=min_df,
                                  tokenizer=nltk.word_tokenize,
                                  stop_words=stopwords.words('english'))

    airline_counts = airline_vec.fit_transform(dataset.data)
    tfidf_transformer = TfidfTransformer()
    airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

    data = airline_tfidf if data_type == 'tf' else airline_counts

    docs_train, docs_test, y_train, y_test = train_test_split(
        data,
        dataset.target,
        test_size=0.20
    )

    clf = MultinomialNB().fit(docs_train, y_train)
    y_pred = clf.predict(docs_test)

    y_test_labels = [dataset.target_names[i] for i in y_test]
    y_pred_labels = [dataset.target_names[i] for i in y_pred]

    print(classification_report(y_test_labels, y_pred_labels, labels=dataset.target_names))

    # Show top features if requested
    if show_features and data_type != 'tf':
        important_features_per_class(airline_vec, clf, n=top_n)


In [61]:
train_model('bag', 2, True)



              precision    recall  f1-score   support

    negative       0.84      0.91      0.87       365
     neutral       0.85      0.72      0.78       294
    positive       0.83      0.87      0.85       292

    accuracy                           0.84       951
   macro avg       0.84      0.83      0.83       951
weighted avg       0.84      0.84      0.84       951

Important words in negative documents
0 1484.0 @
0 1359.0 united
0 1214.0 .
0 425.0 ``
0 382.0 flight
0 358.0 ?
0 328.0 !
0 324.0 #
0 218.0 n't
0 164.0 ''
0 121.0 's
0 111.0 :
0 108.0 virginamerica
0 108.0 service
0 92.0 get
0 87.0 cancelled
0 86.0 customer
0 85.0 plane
0 85.0 delayed
0 80.0 bag
0 77.0 time
0 77.0 -
0 75.0 ;
0 73.0 'm
0 70.0 http
0 68.0 &
0 66.0 ...
0 63.0 hours
0 63.0 gate
0 63.0 amp
0 62.0 hour
0 62.0 help
0 62.0 airline
0 59.0 ca
0 58.0 still
0 56.0 would
0 56.0 late
0 53.0 2
0 51.0 like
0 50.0 worst
0 49.0 flights
0 49.0 $
0 46.0 delay
0 46.0 (
0 46.0 've
0 44.0 really
0 44.0 never
0 43.0 wa
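
The list above is ranked by raw count, so tokens that are frequent in every class, such as '@' and '.', end up at the top. One possible alternative is to rank tokens by how much more likely they are in one class than in the others. The sketch below does this with a Laplace-smoothed log-ratio on hypothetical toy data, using only the standard library; the function names are ours. In scikit-learn, the fitted `MultinomialNB` exposes smoothed per-class log-probabilities as `feature_log_prob_`, which could serve as the starting point for a similar ranking.

```python
import math
from collections import Counter, defaultdict


def class_token_counts(docs, labels):
    """Sum token counts per class label."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    return counts


def rank_by_log_ratio(counts, target, alpha=1.0, n=5):
    """Rank tokens by log P(token | target) - log P(token | all other classes)."""
    vocab = set().union(*counts.values())
    rest = Counter()
    for label, class_counts in counts.items():
        if label != target:
            rest.update(class_counts)
    target_total = sum(counts[target].values())
    rest_total = sum(rest.values())
    scores = {}
    for token in vocab:
        # Laplace smoothing (alpha) avoids division by zero for unseen tokens
        p_target = (counts[target][token] + alpha) / (target_total + alpha * len(vocab))
        p_rest = (rest[token] + alpha) / (rest_total + alpha * len(vocab))
        scores[token] = math.log(p_target / p_rest)
    # Highest ratio first; ties are broken alphabetically
    return sorted(scores, key=lambda token: (-scores[token], token))[:n]


# Hypothetical toy tweets, already tokenized
docs = [
    ["@", "united", "flight", "delayed", "."],
    ["@", "united", "worst", "service", "."],
    ["@", "united", "thanks", "great", "crew", "."],
    ["@", "united", "flight", "time", "?"],
]
labels = ["negative", "negative", "positive", "neutral"]

counts = class_token_counts(docs, labels)
by_count = [token for token, _ in counts["negative"].most_common(3)]
by_ratio = rank_by_log_ratio(counts, "negative", n=3)
print("by raw count:", by_count)
print("by log-ratio:", by_ratio)
```

On the toy data the raw-count ranking is dominated by '@', 'united', and '.', while the log-ratio ranking surfaces the tokens that occur only in negative tweets.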

### Conclusion

Which features did you expect for each separate class and why?
For negative, we expected words such as 'cancelled', 'delayed', and 'late' because those directly express dissatisfaction. We also expected domain nouns like 'flight' and 'bag', since complaints are often about baggage or other services. For neutral, we expected airline handles and names, because neutral tweets tend to be inquiries addressed to the companies. For positive, we expected appraisal terms such as 'thanks' and 'awesome', because they clearly signal positive sentiment.

Which features did you not expect and why?
We were surprised by how many punctuation tokens (such as '@' and '#') and standalone numbers (such as '2') rank near the top, since they reflect tweet structure rather than sentiment. We also did not expect job-related nouns like 'crew' or 'attendant' in the positive list: they are relevant to the domain but not inherently positive. A possible explanation is that positive tweets tend to be about employees, while negative ones tend to be about the company.

Which words would you remove or keep when trying to improve the model and why?
We would remove punctuation tokens, URL tokens ('http'), HTML-entity residue ('amp', left over from `&amp;`), standalone numbers, and similar tokens, since they add noise without carrying any sentiment. We would also drop airline names and handles, because they have no inherent sentiment and can bias the model towards whichever class an airline happens to be mentioned in most. We would keep core sentiment words and negation words (such as "n't" and 'never'), because they directly indicate sentiment.
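
A minimal sketch of such a filter, using only the standard library. The token sets are assumptions for illustration: `AIRLINE_TOKENS` contains only the two airline names visible in the list above, and the names `is_noise` and `filter_tokens` are ours.

```python
import re
import string

# Assumed noise tokens: URL and HTML-entity residue plus tokenizer quote marks
NOISE_TOKENS = {"http", "https", "amp", "``", "''", "..."}
# Only the airline names that appear in the feature list above
AIRLINE_TOKENS = {"united", "virginamerica"}
# Negation tokens are always kept, even though "n't" contains punctuation
KEEP_NEGATIONS = {"n't", "not", "no", "never"}


def is_noise(token):
    """Return True for tokens that carry structure rather than sentiment."""
    tok = token.lower()
    if tok in KEEP_NEGATIONS:
        return False
    if tok in NOISE_TOKENS or tok in AIRLINE_TOKENS:
        return True
    if all(ch in string.punctuation for ch in tok):
        return True  # pure punctuation such as '@', '#', '!'
    if re.fullmatch(r"\d+([.,]\d+)?", tok):
        return True  # standalone numbers such as '2'
    return False


def filter_tokens(tokens):
    """Drop noise tokens and keep the original order of the rest."""
    return [token for token in tokens if not is_noise(token)]


example = ["@", "united", "flight", "delayed", "2", "hours", "!",
           "n't", "amp", "http", "worst"]
filtered = filter_tokens(example)
print(filtered)
```

Such a filter could be combined with the existing tokenizer, for example by passing `tokenizer=lambda text: filter_tokens(nltk.word_tokenize(text))` to `CountVectorizer`. Note that `stopwords.words('english')` may itself contain negation words, so keeping them would also require adjusting the stop word list.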

### [Optional! (will not be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook
