# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the Lab 3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to give insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the [lexicon](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?). What do all scores mean?


### [3.5 points] Question 1:

Regard the following sentences and their output as given by VADER. Regard sentences 1 to 7, and explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain what is going wrong or not when correct and incorrect results are produced. 

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

VADER is based on a lexicon of sentiment-related word forms. Each entry is rated not only as positive or negative but also for intensity, so more strongly positive words get a higher positive rating and more strongly negative words get a lower negative rating.
Sentence 1 has a compound value of 0.64, a fairly strong positive score, which is appropriate for a sentence stating that something is loved. The lexicon also contains emoticons, because VADER targets social media text in which emoticons and punctuation often carry sentiment. The ':-)' in sentence 3 adds positive valence, which explains why its compound value (0.76) is higher than that of sentence 1 although the words are the same.
Sentence 2 shows VADER's negation rule: a negation shortly before a sentiment word flips and dampens its valence. This explains the positive value of 0.0 even though the sentence contains 'love', and it produces a correct negative compound value of -0.52.
In sentence 4, the word 'ruins' is rated negative in the lexicon and also has a negative connotation when applied to houses, so the negative compound value of -0.44 is reasonable. Sentence 5 contains the same negatively loaded word, but 'not' triggers the negation rule, so the sentence is scored as positive and its negative value is 0.0. The compound value (0.59) is larger in magnitude than that of sentence 4, which suggests that 'certainly' likely contributes some positive valence as well.
The output for sentence 6 is incorrect. The sentence is neutral and does not convey any emotion or sentiment, yet the compound value is -0.42. The likely cause is that the lexicon lists 'lies' with a negative rating (as in telling lies), and VADER cannot tell that here it is the verb meaning to recline.
The output for sentence 7 is also questionable. The sentence is neutral, just like sentence 6, yet the compound value is positive (0.36). The likely cause is the word 'like', which the lexicon rates as positive (the verb sense), whereas here it is used as a preposition.
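
The compound values above can be checked by hand, which helps to test these explanations. The sketch below assumes the normalization used in the vaderSentiment source (summed valence divided by the square root of its square plus 15) and a negation factor of -0.74. The valences for "love", "like", and ":-)" are assumed from the lexicon file and should be verified there. VADER's other rules (intensifiers, punctuation emphasis, capitalization) are ignored.

```python
import math

# Assumed constants from the vaderSentiment source; verify against your version.
ALPHA = 15        # normalization constant
NEGATION = -0.74  # factor applied to a valence that follows a negation

# Assumed lexicon valences; check them in vader_lexicon.txt.
VALENCE = {"love": 3.2, "like": 1.5, ":-)": 1.3}


def compound(summed_valence, alpha=ALPHA):
    """Map an unbounded summed valence to the range [-1, 1]."""
    return summed_valence / math.sqrt(summed_valence ** 2 + alpha)


print(round(compound(VALENCE["love"]), 4))                   # sentence 1
print(round(compound(VALENCE["love"] * NEGATION), 4))        # sentence 2
print(round(compound(VALENCE["love"] + VALENCE[":-)"]), 4))  # sentence 3
print(round(compound(VALENCE["like"]), 4))                   # sentence 7
```

If the assumed values are right, the four results should be close to the compound scores reported for sentences 1, 2, 3, and 7. That would support reading the score of sentence 7 as a single lexicon hit on "like".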

### [2.5 points] Question 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [None]:
import json

import nltk
import spacy
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report

In [None]:
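# Use a context manager so that the file is closed after loading
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)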

In [None]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break
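
Since the file is edited by hand, a quick check of the annotations can catch typos in labels before they distort the evaluation. The sketch below is one possible check. The function name `find_annotation_problems` and the example entries are hypothetical; the three keys and three labels are the ones described above.

```python
ALLOWED_LABELS = {"negative", "neutral", "positive"}
REQUIRED_KEYS = ("sentiment_label", "text_of_tweet", "tweet_url")


def find_annotation_problems(tweets):
    """Return a list of human-readable problems found in the annotation dict."""
    problems = []
    for id_, info in tweets.items():
        # Every entry needs a non-empty value for each required key.
        for key in REQUIRED_KEYS:
            if not info.get(key):
                problems.append(f"tweet {id_}: missing or empty '{key}'")
        # A non-empty label must be one of the three categories.
        label = info.get("sentiment_label")
        if label and label not in ALLOWED_LABELS:
            problems.append(f"tweet {id_}: unknown label '{label}'")
    return problems


# Made-up entries for illustration; in the notebook you would pass my_tweets.
example = {
    "1": {"sentiment_label": "positive", "text_of_tweet": "Nice!",
          "tweet_url": "https://example.org/1"},
    "2": {"sentiment_label": "postive", "text_of_tweet": "Hm.",
          "tweet_url": ""},
}
for problem in find_annotation_problems(example):
    print(problem)
```

An empty result means that every tweet has all three fields and a valid label.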

### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [1]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
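
The mapping above labels a tweet as neutral only when the compound score is exactly 0.0. The vaderSentiment documentation suggests a small neutral band around zero instead (thresholds of 0.05 and -0.05). A minimal sketch of that alternative follows. The name `vader_output_to_label_banded` is hypothetical, and the threshold is a parameter so that its effect on the evaluation can be compared.

```python
def vader_output_to_label_banded(vader_output, threshold=0.05):
    """
    Map a VADER output dict to a label, treating compound scores
    strictly between -threshold and +threshold as neutral.
    """
    compound = vader_output["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(vader_output_to_label_banded({"compound": 0.01}))     # inside the band
print(vader_output_to_label_banded({"compound": -0.4215}))  # clearly negative
```

Whether the band improves the scores depends on the data, so it is worth running the evaluation with both mappings.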

In [2]:
nlp = spacy.load("en_core_web_sm")  # 'en_core_web_sm'

vader_model = SentimentIntensityAnalyzer()

def run_vader(
    textual_unit, lemmatize=False, parts_of_speech_to_consider=None, verbose=0
):
    """
    Run VADER on a textual unit that is first processed with spaCy

    :param str textual_unit: a textual unit, e.g., sentence, sentences (one string)
    (the tokens are collected by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output

    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)

    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == "-PRON-":
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add)
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(" ".join(input_to_vader))

    if verbose >= 1:
        print()
        print("INPUT SENTENCE", textual_unit)
        print("INPUT TO VADER", input_to_vader)
        print("VADER OUTPUT", scores)

    return scores

In [19]:
tweets = []
all_vader_output = []
gold = []

with open("my_tweets.json", "r") as f:
    my_tweets = json.load(f)

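# settings (to change for different experiments)
to_lemmatize = False  # plain VADER: no lemmatization
pos = set()  # empty set: all parts of speech are used

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(
        the_tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos
    )  # run vader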
    vader_label = vader_output_to_label(vader_output) # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.73      0.61      0.67        18
     neutral       0.47      0.47      0.47        15
    positive       0.55      0.65      0.59        17

    accuracy                           0.58        50
   macro avg       0.58      0.57      0.58        50
weighted avg       0.59      0.58      0.58        50



#### Question 3.1: Quantitative Analysis

The classification report shows precision, recall, F1-score, and support per category, plus accuracy and two averages.

Precision is the share of tweets assigned to a category that truly belong to it: 73% of the tweets that VADER labeled negative are negative in the gold standard. Recall is the share of the tweets of a category that the system finds: 61% of the gold negative tweets were labeled negative. The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. Support is the number of gold tweets per category. Accuracy is the proportion of all tweets that received the correct label. The macro average is the unweighted mean of the per-category scores, so every category counts equally regardless of its size. The weighted average weights each category's score by its support, so larger categories count more. With 18, 15, and 17 tweets per category the set is nearly balanced, which is why the two averages are almost identical here.

Looking at the categories, precision and F1 are highest for negative (0.73 and 0.67), recall is highest for positive (0.65), and neutral scores lowest on all three (0.47). For negative, precision is higher than recall, so VADER is fairly reliable when it predicts negative but misses many negative tweets; positive shows the reverse pattern. Accuracy is 0.58, i.e., 29 of the 50 tweets were labeled correctly. Because the categories are roughly balanced and no category matters more than another, the per-category F1 and the macro-averaged F1 (0.58) give the most useful summary of the performance, while precision and recall show where the errors come from.
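
To make the definitions concrete, the scores can be computed by hand from two lists of labels. The sketch below does this in plain Python. The function names and the six example labels are made up for illustration; the results should agree with what classification_report prints for the same input.

```python
def per_class_scores(gold, predicted):
    """Compute precision, recall, F1 and support for every label in gold."""
    scores = {}
    for label in sorted(set(gold)):
        # True positives: items that are this label in both lists.
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        predicted_as_label = sum(p == label for p in predicted)
        support = sum(g == label for g in gold)
        precision = tp / predicted_as_label if predicted_as_label else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return scores


def macro_and_weighted_f1(scores):
    """Return the unweighted and the support-weighted mean of the F1 scores."""
    total = sum(s["support"] for s in scores.values())
    macro = sum(s["f1"] for s in scores.values()) / len(scores)
    weighted = sum(s["f1"] * s["support"] for s in scores.values()) / total
    return macro, weighted


# Made-up labels for illustration only.
gold = ["negative", "negative", "neutral", "positive", "positive", "positive"]
predicted = ["negative", "neutral", "neutral", "positive", "positive", "negative"]

scores = per_class_scores(gold, predicted)
for label, values in scores.items():
    print(label, {k: round(v, 2) for k, v in values.items()})
print("macro F1, weighted F1:", macro_and_weighted_f1(scores))
```

In this toy example the largest category (positive) also has the highest F1, so the weighted average comes out above the macro average.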

In [20]:
misclassified_tweets = []

for tweet, predicted_label, actual_label in zip(tweets, all_vader_output, gold):
    if predicted_label != actual_label:
        misclassified_tweets.append((tweet, predicted_label, actual_label))

# Print misclassified tweets
for tweet, predicted_label, actual_label in misclassified_tweets:
    print("Tweet:", tweet)
    print("Predicted Label:", predicted_label)
    print("Actual Label:", actual_label)
    print()

Tweet: Rival dad next door and his family are moving today. So I'm sitting out front judging his u-haul tetris stacking abilities and giving a disapproving groan when he does something wrong.
Predicted Label: positive
Actual Label: negative

Tweet: Avoid eye contact, he may ask you to help!
Predicted Label: positive
Actual Label: negative

Tweet: Great way to kill a little extra time!
Predicted Label: negative
Actual Label: positive

Tweet: “Only stacking two boxes on the dolly at once I see, huh Terry? Hm. Guess that’s one way to go.”
Predicted Label: neutral
Actual Label: negative

Tweet: Whatever will you do for entertainment now? How else will you assert your superior masculinity?
Predicted Label: positive
Actual Label: neutral

Tweet: One time, my uncle's VERY inexperienced neighbour was removing a fair-sized tree by himself. He watched and chuckled he's going to kill himself...' the tree came down - right onto my uncle's fence.
Predicted Label: negative
Actual Label: positive

Tw

#### Question 3.2: Error Analysis

1. The code above lists every misclassified tweet together with its predicted and actual label.
2. The assignment asks for up to 10 misclassified tweets per category. Here there are fewer: 7 negative, 8 neutral, and 6 positive tweets were wrongly classified, so all of them are considered.
3. The tweets are analysed with reference to the VADER lexicon and rules to find likely causes of the errors, such as ambiguous language, sarcasm, negation, or emoticons.

Since the tweets are printed above, I do not discuss each one separately, as that would be repetitive. The section below summarizes the main findings and the likely reasons for the misclassifications.

First, of the 7 misclassified negative tweets, 2 were labeled positive and 5 neutral, so the model tends to label negative tweets as neutral. For example, "A leap of faith is scheduling a kids birthday party in the middle of winter..." has a positive connotation because of the phrase "leap of faith", but it is meant sarcastically, so the tweet is actually negative. Similarly, "Rival dad next door and his family are moving today..." is labeled positive, which misses that the writer is judging the dad negatively.

Second, of the 8 misclassified neutral tweets, 7 were labeled positive and 1 negative, so the model tends to read neutral tweets as positive. "Whatever will you do for entertainment now?..." was labeled positive, but it is a rhetorical question that expresses neither positivity nor genuine curiosity. "I’m here to demand an update in about 12-14 years from now" was labeled negative, although its tone is neutral, perhaps even playfully positive. VADER has no notion of humour, so a single negatively rated word is enough to produce a negative score.

Lastly, of the 6 misclassified positive tweets, 3 were labeled negative and 3 neutral, so half of these errors are a complete reversal of polarity. In "Great way to kill a little extra time!", VADER probably gave too much weight to "kill", which is negative in isolation, and missed that "kill time" is a harmless idiom. In "Lol I saw this too and died. What are you gonna do, make the baby stay alone in her room all day????", the negative prediction is probably due to "died"; VADER cannot recognize that the word is used figuratively or that the tone is joking, despite the "Lol".

These misclassifications can be traced to how VADER works: the lexicon rates words in isolation and the rules cover only a few phenomena, so sarcasm, humour, idioms, and rhetorical questions are not captured and lead to errors in sentiment classification.
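
The per-category error counts used in this analysis can be tallied directly from the gold and predicted labels. The sketch below uses `collections.Counter`. The function name `error_counts` and the example labels are made up; in the notebook the inputs would be `gold` and `all_vader_output`.

```python
from collections import Counter


def error_counts(gold, predicted):
    """Count (gold, predicted) pairs for all misclassified items."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


# Made-up labels for illustration only.
gold_example = ["negative", "negative", "neutral", "positive", "neutral"]
predicted_example = ["neutral", "neutral", "positive", "positive", "positive"]

counts = error_counts(gold_example, predicted_example)
for (gold_label, predicted_label), count in counts.most_common():
    print(f"gold={gold_label:<9} predicted={predicted_label:<9} n={count}")
```

Sorting by frequency puts the dominant confusion first, which makes systematic tendencies (for example negative tweets ending up as neutral) easy to spot.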


### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text


[1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**

[3 points] b. Compare the scores and explain what they tell you.
- Does lemmatisation help? Explain why or why not.
- Are all parts of speech equally important for sentiment analysis? Explain why or why not.

The airline tweets are stored as separate .txt files in one folder per sentiment (positive, negative, neutral). To reuse the code from Question 3, we first compile all files into a single JSON file with the same structure as my_tweets.json, using the folder name as the gold label. We then run VADER with each of the settings listed above.

In [3]:
import os
import json
import random


def compile_tweets_to_json(main_folder):
    data = {}
    tweet_count = 0

    # Loop through each folder (positive, negative, neutral) in the main folder
    for sentiment_label in os.listdir(main_folder):
        sentiment_folder = os.path.join(main_folder, sentiment_label)

        # Check if the item is a directory
        if os.path.isdir(sentiment_folder):
            # Loop through each txt file in the sentiment folder
            for file_name in os.listdir(sentiment_folder):
                if file_name.endswith(".txt"):
                    tweet_count += 1
                    file_path = os.path.join(sentiment_folder, file_name)

                    # Read the content of the txt file
                    with open(file_path, "r", encoding="utf-8") as file:
                        tweet_text = file.read().strip()

                    # Add tweet data to the dictionary
                    data[str(tweet_count)] = {
                        "sentiment_label": sentiment_label,
                        "text_of_tweet": tweet_text,
                        "tweet_url": "",  # You can add tweet URLs if you have them
                    }

    # Shuffle the order of tweets
    shuffled_data = {k: data[k] for k in random.sample(list(data.keys()), len(data))}

    # Write the data to a JSON file
    with open("compiled_tweets.json", "w") as json_file:
        json.dump(shuffled_data, json_file, indent=4)


# Example usage:
main_folder = "airlinetweets"
compile_tweets_to_json(main_folder)

Now that all airline tweets are converted into the appropriate format, we apply VADER with the different settings:

In [6]:
# * Run VADER (as it is) on the set of airline tweets
tweets = []
vader_model = SentimentIntensityAnalyzer()
all_vader_output = []
gold = []

with open("compiled_tweets.json", "r") as f:
    airline_tweets = json.load(f)

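# settings (to change for different experiments)
to_lemmatize = False  # plain VADER: no lemmatization
pos = set()  # empty set: all parts of speech are used

for id_, tweet_info in airline_tweets.items():
    the_tweet = tweet_info["text_of_tweet"]
    vader_output = run_vader(
        the_tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos
    )  # run vader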
    vader_label = vader_output_to_label(
        vader_output
    )  # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info["sentiment_label"])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.64      0.62      4755
weighted avg       0.66      0.63      0.62      4755



In [7]:
# * Run VADER on the set of airline tweets after having lemmatized the text
tweets = []
vader_model = SentimentIntensityAnalyzer()
all_vader_output = []
gold = []

with open("compiled_tweets.json", "r") as f:
    airline_tweets = json.load(f)

# settings (to change for different experiments)
to_lemmatize = True
pos = set()  # empty set: all parts of speech are used

for id_, tweet_info in airline_tweets.items():
    the_tweet = tweet_info["text_of_tweet"]
    vader_output = run_vader(
        the_tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos
    )  # run vader
    vader_label = vader_output_to_label(
        vader_output
    )  # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info["sentiment_label"])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755



In [8]:
# * Run VADER on the set of airline tweets with only adjectives
tweets = []
vader_model = SentimentIntensityAnalyzer()
all_vader_output = []
gold = []

with open("compiled_tweets.json", "r") as f:
    airline_tweets = json.load(f)

# settings (to change for different experiments)
to_lemmatize = False
pos = {"ADJ"}

for id_, tweet_info in airline_tweets.items():
    the_tweet = tweet_info["text_of_tweet"]
    vader_output = run_vader(
        the_tweet,
        lemmatize=to_lemmatize,
        parts_of_speech_to_consider=pos,
    )  # run vader
    vader_label = vader_output_to_label(
        vader_output
    )  # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info["sentiment_label"])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.40      0.89      0.56      1515
    positive       0.66      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.51      0.47      4755
weighted avg       0.66      0.50      0.47      4755



In [9]:
# * Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
tweets = []
vader_model = SentimentIntensityAnalyzer()
all_vader_output = []
gold = []

with open("compiled_tweets.json", "r") as f:
    airline_tweets = json.load(f)

# settings (to change for different experiments)
to_lemmatize = True
pos = {"ADJ"}

for id_, tweet_info in airline_tweets.items():
    the_tweet = tweet_info["text_of_tweet"]
    vader_output = run_vader(
        the_tweet,
        lemmatize=to_lemmatize,
        parts_of_speech_to_consider=pos,
    )  # run vader
    vader_label = vader_output_to_label(
        vader_output
    )  # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info["sentiment_label"])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.40      0.89      0.56      1515
    positive       0.66      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.51      0.47      4755
weighted avg       0.66      0.50      0.47      4755



In [10]:
# * Run VADER on the set of airline tweets with only nouns
tweets = []
vader_model = SentimentIntensityAnalyzer()
all_vader_output = []
gold = []

with open("compiled_tweets.json", "r") as f:
    airline_tweets = json.load(f)

# settings (to change for different experiments)
to_lemmatize = False
pos = {"NOUN"}

for id_, tweet_info in airline_tweets.items():
    the_tweet = tweet_info["text_of_tweet"]
    vader_output = run_vader(
        the_tweet,
        lemmatize=to_lemmatize,
        parts_of_speech_to_consider=pos,
    )  # run vader
    vader_label = vader_output_to_label(
        vader_output
    )  # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info["sentiment_label"])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.73      0.14      0.24      1750
     neutral       0.36      0.82      0.50      1515
    positive       0.53      0.34      0.41      1490

    accuracy                           0.42      4755
   macro avg       0.54      0.43      0.38      4755
weighted avg       0.55      0.42      0.38      4755



In [11]:
# * Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
tweets = []
vader_model = SentimentIntensityAnalyzer()
all_vader_output = []
gold = []

with open("compiled_tweets.json", "r") as f:
    airline_tweets = json.load(f)

# settings (to change for different experiments)
to_lemmatize = True
pos = {"NOUN"}

for id_, tweet_info in airline_tweets.items():
    the_tweet = tweet_info["text_of_tweet"]
    vader_output = run_vader(
        the_tweet,
        lemmatize=to_lemmatize,
        parts_of_speech_to_consider=pos,
    )  # run vader
    vader_label = vader_output_to_label(
        vader_output
    )  # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info["sentiment_label"])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.72      0.16      0.26      1750
     neutral       0.36      0.81      0.50      1515
    positive       0.52      0.33      0.40      1490

    accuracy                           0.42      4755
   macro avg       0.53      0.43      0.39      4755
weighted avg       0.54      0.42      0.38      4755



In [20]:
# * Run VADER on the set of airline tweets with only verbs
tweets = []
vader_model = SentimentIntensityAnalyzer()
all_vader_output = []
gold = []

with open("compiled_tweets.json", "r") as f:
    airline_tweets = json.load(f)

# settings (to change for different experiments)
to_lemmatize = False
pos = {"VERB"}

for id_, tweet_info in airline_tweets.items():
    the_tweet = tweet_info["text_of_tweet"]
    vader_output = run_vader(
        the_tweet,
        lemmatize=to_lemmatize,
        parts_of_speech_to_consider=pos,
    )  # run vader
    vader_label = vader_output_to_label(
        vader_output
    )  # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info["sentiment_label"])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.77      0.29      0.42      1750
     neutral       0.38      0.81      0.52      1515
    positive       0.57      0.34      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.58      0.48      0.46      4755
weighted avg       0.59      0.47      0.45      4755



In [13]:
# * Run VADER on the set of airline tweets with only verbs and after having lemmatized the text
tweets = []
vader_model = SentimentIntensityAnalyzer()
all_vader_output = []
gold = []

with open("compiled_tweets.json", "r") as f:
    airline_tweets = json.load(f)

# settings (to change for different experiments)
to_lemmatize = True
pos = {"VERB"}

for id_, tweet_info in airline_tweets.items():
    the_tweet = tweet_info["text_of_tweet"]
    vader_output = run_vader(
        the_tweet,
        lemmatize=to_lemmatize,
        parts_of_speech_to_consider=pos,
    )  # run vader
    vader_label = vader_output_to_label(
        vader_output
    )  # convert vader output to category

    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info["sentiment_label"])


# Generate classification report
report = classification_report(gold, all_vader_output)
print(report)

              precision    recall  f1-score   support

    negative       0.74      0.30      0.42      1750
     neutral       0.38      0.78      0.51      1515
    positive       0.57      0.35      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.56      0.48      0.46      4755
weighted avg       0.57      0.47      0.45      4755
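
The averaged rows can be checked by hand from the per-class rows. The following is a minimal sketch, with the precision and support values copied from the report directly above. Because those inputs are already rounded to two decimals, recomputing the averages for other columns may differ from the printed report in the last digit.

```python
# Per-class precision and support, copied from the lemmatized verbs-only report.
precision = {"negative": 0.74, "neutral": 0.38, "positive": 0.57}
support = {"negative": 1750, "neutral": 1515, "positive": 1490}

# Macro average: every class counts equally.
macro = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support.
total = sum(support.values())
weighted = sum(precision[label] * support[label] for label in precision) / total

print("total support:", total)
print("macro precision:", round(macro, 2))
print("weighted precision:", round(weighted, 2))
```

The macro average treats the three classes equally, while the weighted average leans towards the negative class because it has the most tweets.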



#### [3 points] b. Compare the scores and explain what they tell you.

- *Does lemmatisation help? Explain why or why not.*

The two reports are nearly identical: accuracy stays at 0.47 and no per-class score changes by more than 0.03 (negative precision drops from 0.77 to 0.74 and neutral recall from 0.81 to 0.78). Lemmatisation therefore does not help here. Normalisation can improve performance in general, but in this case its effect is limited: VADER is designed for social media text and its lexicon already covers many informal and inflected word forms, so replacing a token by its lemma rarely changes whether it is found in the lexicon.
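
To make the comparison concrete, the per-class scores of the two verbs-only reports can be subtracted from each other. The sketch below hard-codes the values printed in the two reports above; `report_deltas` is a hypothetical helper, not part of the lab code.

```python
# Per-class (precision, recall, F1) copied from the two verbs-only reports.
verbs_only = {
    "negative": (0.77, 0.29, 0.42),
    "neutral": (0.38, 0.81, 0.52),
    "positive": (0.57, 0.34, 0.43),
}
verbs_lemmatized = {
    "negative": (0.74, 0.30, 0.42),
    "neutral": (0.38, 0.78, 0.51),
    "positive": (0.57, 0.35, 0.43),
}


def report_deltas(before, after):
    """Return per-class differences (after - before), rounded to two decimals."""
    return {
        label: tuple(round(a - b, 2) for a, b in zip(after[label], before[label]))
        for label in before
    }


deltas = report_deltas(verbs_only, verbs_lemmatized)
for label, (d_p, d_r, d_f1) in deltas.items():
    print(f"{label:>8}  precision {d_p:+.2f}  recall {d_r:+.2f}  f1 {d_f1:+.2f}")

# Largest absolute change over all classes and metrics.
largest_change = max(abs(d) for triple in deltas.values() for d in triple)
print("largest change:", largest_change)
```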

- *Are all parts of speech equally important for sentiment analysis? Explain why or why not.*

Although lemmatisation had limited effect, the choice of part of speech does matter. VADER is rule-based and not trained, so the comparison is between runs restricted to a single part of speech. The run restricted to adjectives performs better than the runs restricted to verbs or nouns. Adjectives are usually the most informative about sentiment because they describe attributes or qualities of nouns, which makes them the easiest cues for the model to pick up.

Second, nouns score lowest. They are mostly neutral and rarely indicate sentiment on their own, so they convey sentiment less effectively than adjectives.

Third, verbs score lower than adjectives and only marginally better than nouns. Verbs describe states and actions, which can convey sentiment indirectly depending on the context, but they are less direct than adjectives. The lower performance was therefore expected.

In conclusion, combining different parts of speech gives a more comprehensive picture of the sentiment in a text than relying on any single one.
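
The effect of restricting the input to one part of speech can be illustrated with a toy example. This is only a sketch under loud assumptions: the three-word lexicon, the hand-assigned POS tags, and the sign-based label mapping are all hypothetical, and the scoring is a plain sum rather than VADER's actual algorithm.

```python
# Hypothetical mini-lexicon and hand-assigned POS tags, for illustration only.
# This is NOT VADER's scoring; it only mimics "look up the kept tokens in a lexicon".
TOY_LEXICON = {"terrible": -2.0, "lost": -1.5, "great": 3.0}


def score_with_pos_filter(tagged_tokens, allowed_pos=None):
    """Sum toy valences over tokens whose POS tag is in allowed_pos (None keeps all)."""
    kept = [
        token.lower()
        for token, pos in tagged_tokens
        if allowed_pos is None or pos in allowed_pos
    ]
    return sum(TOY_LEXICON.get(token, 0.0) for token in kept)


def to_label(score):
    """Assumed mapping: the sign of the score decides the category."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


complaint = [("The", "DET"), ("service", "NOUN"), ("was", "AUX"), ("terrible", "ADJ"),
             ("and", "CCONJ"), ("they", "PRON"), ("lost", "VERB"), ("my", "PRON"),
             ("bag", "NOUN")]
praise = [("Great", "ADJ"), ("crew", "NOUN"), ("on", "ADP"), ("my", "PRON"),
          ("flight", "NOUN")]

for name, tweet in [("complaint", complaint), ("praise", praise)]:
    for pos_filter in (None, {"ADJ"}, {"NOUN"}, {"VERB"}):
        label = to_label(score_with_pos_filter(tweet, pos_filter))
        print(name, pos_filter, label)
```

In this toy setting, a filter that removes every scored word leaves a score of zero and the tweet falls into the neutral category. That is one plausible reading of the high neutral recall and low neutral precision in the verbs-only reports.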

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df = 10)
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?
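
One possible shape for these experiments is sketched below; it is not the reference solution. It assumes a tiny in-line toy corpus instead of the `airlinetweets` folder, scikit-learn's default tokenizer, and min_df values scaled down to 1 and 2, because thresholds of 5 and 10 would empty the vocabulary of a 12-document corpus. `run_experiment` is a hypothetical helper, and fitting the vectorizer on the training split only is one of several reasonable choices.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the airline tweets (assumption: the real data comes from load_files).
texts = [
    "flight delayed again bad service", "bad delay lost bag",
    "delayed flight terrible service", "lost bag bad flight",
    "what time is the flight", "is the flight on time",
    "what gate is the flight", "the gate is open",
    "great flight thanks", "thanks great crew",
    "great service thanks crew", "love the crew great flight",
]
labels = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4


def run_experiment(texts, labels, use_tfidf=True, min_df=2, seed=0):
    """Train Naive Bayes on an 80/20 split; return (report text, vocabulary size)."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    # Terms seen in fewer than min_df training documents are dropped.
    vectorizer = CountVectorizer(min_df=min_df)
    train_matrix = vectorizer.fit_transform(x_train)
    test_matrix = vectorizer.transform(x_test)  # reuse the training vocabulary
    if use_tfidf:
        tfidf = TfidfTransformer()
        train_matrix = tfidf.fit_transform(train_matrix)
        test_matrix = tfidf.transform(test_matrix)
    clf = MultinomialNB().fit(train_matrix, y_train)
    report = classification_report(y_test, clf.predict(test_matrix), zero_division=0)
    return report, len(vectorizer.vocabulary_)


results = {}
for use_tfidf in (True, False):
    for min_df in (1, 2):
        report, vocab_size = run_experiment(texts, labels, use_tfidf=use_tfidf, min_df=min_df)
        results[(use_tfidf, min_df)] = vocab_size
        print(f"tfidf={use_tfidf} min_df={min_df} vocabulary={vocab_size}")
        print(report)
```

Printing the vocabulary size next to each report makes it easier to relate score changes to how many features a higher threshold removes.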

In [None]:
# Your code here


### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [None]:
def important_features_per_class(vectorizer, classifier, n=80):
    """Print the n most frequent features per class, ranked by raw feature count."""
    # classes_ is sorted alphabetically: negative, neutral, positive
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names), reverse=True)[:n]
    print("Important words in negative documents")
    for count, feat in topn_class1:
        print(class_labels[0], count, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for count, feat in topn_class2:
        print(class_labels[1], count, feat)
    print("-----------------------------------------")
    print("Important words in positive documents")
    for count, feat in topn_class3:
        print(class_labels[2], count, feat)

# example of how to call from notebook:
#important_features_per_class(airline_vec, clf)

In [14]:
import pathlib
import sklearn
import numpy
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [15]:
help(load_files)

Help on function load_files in module sklearn.datasets._base:

load_files(container_path, *, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0, allowed_extensions=None)
    Load text files with categories as subfolder names.

    Individual samples are assumed to be files stored a two levels folder
    structure such as the following:

        container_folder/
            category_1_folder/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            category_2_folder/
                file_43.txt
                file_44.txt
                ...

    The folder names are used as supervised signal label names. The individual
    file names are not important.

    This function does not try to extract features into a numpy array or scipy
    sparse matrix. In addition, if load_content is false it does not try to
    load the files in memory.

    To use text files in

In [16]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath("airlinetweets")
print("path:", airline_tweets_folder)
print("this will print True if the folder exists:", airline_tweets_folder.exists())

path: /Users/owhy/Desktop/github/TextMining-VU-2024/airlinetweets
this will print True if the folder exists: True


In [17]:
str(airline_tweets_folder)

'/Users/owhy/Desktop/github/TextMining-VU-2024/airlinetweets'

In [18]:
# loading all files as training data.
airline_tweets_train = load_files(str(airline_tweets_folder))

In [19]:
len(airline_tweets_train.data)

4755
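
To see what `load_files` returns without depending on the `airlinetweets` folder, the sketch below builds a small two-level folder structure in a temporary directory. The category names and file contents are made up for illustration; `encoding="utf-8"` is passed so that the loaded documents are strings rather than bytes.

```python
import pathlib
import tempfile

from sklearn.datasets import load_files

# Hypothetical mini-dataset: folder names act as labels, file names do not matter.
samples = {"negative": ["bad flight", "lost bag"], "positive": ["great crew"]}

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    for category, docs in samples.items():
        folder = root / category
        folder.mkdir()
        for i, text in enumerate(docs):
            (folder / f"tweet_{i}.txt").write_text(text, encoding="utf-8")
    # Content is read into memory here, so it survives the directory cleanup.
    bunch = load_files(str(root), encoding="utf-8", shuffle=False)

print(bunch.target_names)     # label names, taken from the folder names
print(bunch.target.tolist())  # one integer label per document
print(len(bunch.data))        # number of loaded documents
```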

### [Optional! (will not be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.
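
When applying the trained model to a separate set of tweets, the new texts have to be mapped onto the training vocabulary with `transform`, not `fit_transform`. The sketch below is a minimal illustration under assumptions: the training texts, the `own_tweets` list, and the `own_gold` labels are invented placeholders for the airline tweets and a self-annotated gold standard.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Placeholder training data standing in for the airline tweets.
train_texts = [
    "bad flight delayed", "bad service lost bag",
    "what time is the flight", "is the gate open",
    "great crew thanks", "great flight thanks",
]
train_labels = ["negative", "negative", "neutral", "neutral", "positive", "positive"]

# Placeholder for a self-annotated set of tweets with gold labels.
own_tweets = ["thanks for the great trip", "my bag is lost", "is the flight on time"]
own_gold = ["positive", "negative", "neutral"]

vec = CountVectorizer(min_df=2)  # bag of words, terms must occur in at least 2 training docs
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

# transform() reuses the training vocabulary; words unseen in training are ignored.
own_matrix = vec.transform(own_tweets)
predictions = clf.predict(own_matrix)

print(classification_report(own_gold, predictions, zero_division=0))
```

Words that never occurred in the training data contribute nothing to the prediction, which is one reason scores can drop when the model is applied to a different domain.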

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook