# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to give insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma? wordform? part-of-speech? polarity value? word meaning?). What do all the scores mean?
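
As a rough illustration of how the compound score is calculated, the sketch below reproduces only VADER's final normalisation step, which maps a summed valence `x` to `x / sqrt(x² + α)`. The constant `α = 15`, the valence `3.2` for "love" and the negation scalar `-0.74` are assumptions about the tool's defaults and lexicon; under these assumptions the sketch reproduces the compound values listed for sentences 1 and 2 in Question 1. The other rules (intensifiers, punctuation, emoticons) are not modelled here.

```python
import math

ALPHA = 15        # normalisation constant (assumed VADER default)
NEGATION = -0.74  # scalar applied to a negated word (assumed VADER default)


def normalize(valence_sum, alpha=ALPHA):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


love = 3.2  # assumed lexicon valence of "love"

compound_plain = round(normalize(love), 4)               # "I love apples"
compound_negated = round(normalize(love * NEGATION), 4)  # "I don't love apples"
print(compound_plain, compound_negated)
```

Because only one sentiment-bearing word is present in both sentences, the summed valence is just the (possibly negated) valence of "love".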


### [3.5 points] Question 1:

Consider the following sentences and their output as given by VADER. Explain the outcome **for each sentence** (1 to 7). Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. For each result, explain why VADER gets it right or what goes wrong.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

### [2.5 points] Question 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream. If you have trouble accessing Twitter, try to find an existing dataset (on websites like kaggle or huggingface).

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": "",
```
to:
```
"1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [68]:
import json

In [69]:
my_tweets = None
with open('my_tweets.json', encoding="utf8") as f:
    my_tweets = json.load(f)

In [70]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'Positive', 'text_of_tweet': 'Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!', 'tweet_url': 'https://twitter.com/realDonaldTrump/status/1698308935'}
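
Since the labels are typed in by hand, it can help to check them before running any evaluation. The following is a minimal sketch, not part of the original notebooks: the function name `label_distribution` and the toy dictionary are hypothetical, and the toy dictionary only mimics the structure that `json.load` returns for **my_tweets.json**.

```python
from collections import Counter

ALLOWED_LABELS = {"negative", "neutral", "positive"}


def label_distribution(tweets):
    """Count normalised gold labels and fail early on anything unexpected."""
    counts = Counter()
    for id_, info in tweets.items():
        # normalise case and whitespace, e.g. 'Positive' -> 'positive'
        label = info["sentiment_label"].strip().lower()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"tweet {id_}: unexpected label {label!r}")
        counts[label] += 1
    return counts


# Toy stand-in for the dictionary loaded from my_tweets.json
example_tweets = {
    "1": {"sentiment_label": "Positive", "text_of_tweet": "great day", "tweet_url": ""},
    "2": {"sentiment_label": "negative ", "text_of_tweet": "awful day", "tweet_url": ""},
    "3": {"sentiment_label": "neutral", "text_of_tweet": "a day", "tweet_url": ""},
}
distribution = label_distribution(example_tweets)
print(distribution)
```

Calling `label_distribution(my_tweets)` on the real data would show how balanced the gold standard is across the three categories.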


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [71]:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report

def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
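
The mapping above treats every non-zero compound score as polar, so only a score of exactly 0.0 counts as neutral. A commonly used alternative, suggested in the VADER README, is a small neutral band of ±0.05 around zero. The sketch below shows that variant; the function name `compound_to_label` and its `threshold` parameter are illustrative and not part of the notebooks.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a compound score to a label, with a neutral band around zero."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# scores close to zero now fall into the neutral band
weak_positive = compound_to_label(0.01)
clear_positive = compound_to_label(0.6369)
clear_negative = compound_to_label(-0.5216)
print(weak_positive, clear_positive, clear_negative)
```

Whether this helps depends on the data: it can raise neutral recall at the cost of recall for the polar classes.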

In [72]:
analyzer = SentimentIntensityAnalyzer()
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = analyzer.polarity_scores(the_tweet) # run vader
    vader_label = vader_output_to_label(vader_output) # convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'].strip().lower()) # normalise case, since the json file contains capitalized labels
    
# use scikit-learn's classification report
print(classification_report(gold, all_vader_output))

              precision    recall  f1-score   support

    negative       0.50      0.33      0.40        12
     neutral       0.60      0.45      0.51        20
    positive       0.48      0.72      0.58        18

    accuracy                           0.52        50
   macro avg       0.53      0.50      0.50        50
weighted avg       0.53      0.52      0.51        50



### 3A:
 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.

Precision measures, for each label (negative, neutral, positive), how many of the predictions for that label were correct.

For negative, 50% of the tweets predicted as negative were actually negative.
For neutral, 60% of the tweets predicted as neutral were actually neutral.
For positive, 48% of the tweets predicted as positive were actually positive.

Recall measures, for each label (negative, neutral, positive), how many of the tweets that truly carry that label were identified by the classifier.

For negative, 33% of the negative tweets were correctly identified.
For neutral, 45% of the neutral tweets were correctly identified.
For positive, 72% of the positive tweets were correctly identified.

The F1-score is, for each label (negative, neutral, positive), the harmonic mean of precision and recall. It is only high when both precision and recall are high, so it indicates how balanced the two are.

For negative, 40% indicates poor performance.
For neutral, 51% indicates moderate performance.
For positive, 58% indicates relatively better performance.


The support gives the actual number of negative, neutral and positive tweets (12, 20, 18).

Regarding accuracy, the model correctly classified 52% of all tweets.

Macro avg is the unweighted mean of each metric over the classes, while weighted avg is the mean of each metric weighted by support.

**Most relevant scores:**
The F1-score summarises the balance between precision and recall, which is important when classes are imbalanced.
Weighted avg gives a more representative view here, since it takes the class imbalance into account, whereas macro avg treats all classes equally.
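
The averages in the report can be recomputed by hand from the per-class scores, which makes the difference between macro and weighted averaging concrete. The sketch below uses the rounded F1 values and supports printed in the classification report above; it assumes that the negative recall of 0.33 corresponds to 4 correctly found tweets out of 12.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# per-class F1 and support as printed in the classification report
f1_scores = {"negative": 0.40, "neutral": 0.51, "positive": 0.58}
support = {"negative": 12, "neutral": 20, "positive": 18}

# macro: every class counts equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# weighted: every class counts in proportion to its support
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / sum(support.values())

# F1 for the negative class from its precision (0.50) and recall (4 of 12)
negative_f1 = f1(0.50, 4 / 12)

print(round(macro_f1, 2), round(weighted_f1, 2), round(negative_f1, 2))
```

The two averages differ only slightly here because the class sizes (12, 20, 18) are not far apart.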







### 3B:
 


* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.
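
One possible way to collect the tweets for this error analysis is to group the misclassified tweets by their gold label. The helper below is a hypothetical sketch (the name `misclassified_by_gold` does not occur in the notebooks) and is demonstrated on toy data only.

```python
from collections import defaultdict


def misclassified_by_gold(tweets, gold, predicted):
    """Group (predicted label, text) pairs of wrongly classified tweets by gold label."""
    errors = defaultdict(list)
    for text, gold_label, predicted_label in zip(tweets, gold, predicted):
        if gold_label != predicted_label:
            errors[gold_label].append((predicted_label, text))
    return dict(errors)


# Toy data standing in for the lists built in the VADER cell
toy_tweets = ["love it", "so bad", "meh"]
toy_gold = ["positive", "negative", "neutral"]
toy_predicted = ["positive", "positive", "positive"]

toy_errors = misclassified_by_gold(toy_tweets, toy_gold, toy_predicted)
for gold_label, cases in toy_errors.items():
    for predicted_label, text in cases[:10]:  # at most 10 per category
        print(f"gold={gold_label} predicted={predicted_label}: {text}")
```

With the variables from the cell above, the call would be `misclassified_by_gold(tweets, gold, all_vader_output)`.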

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.

In [None]:
# nltk.download()
import nltk
import pathlib
from sklearn.datasets import load_files

cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets/airlinetweets')
# loading all files as training data.
airline_tweets_train = load_files(str(airline_tweets_folder))



In [74]:
import spacy
# !python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def run_vader(textual_unit,
              sentiments,
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              ):
    
    scores = []
    for text in textual_unit:
        doc = nlp(str(text))
            
        input_to_vader = []

        for sent in doc.sents:
            for token in sent:

                to_add = token.text

                if lemmatize:
                    to_add = token.lemma_

                    if to_add == '-PRON-': 
                        to_add = token.text

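                if parts_of_speech_to_consider:
                    # token.pos_ holds spaCy's coarse Universal POS tag (e.g. 'ADJ', 'NOUN', 'VERB');
                    # fine-grained Penn Treebank tags such as 'JJ' are stored in token.tag_
                    if token.pos_ in parts_of_speech_to_consider:
                        input_to_vader.append(to_add) 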
                else:
                    input_to_vader.append(to_add)
        score = vader_output_to_label(analyzer.polarity_scores(' '.join(input_to_vader)))
        scores.append(score)

    print(classification_report(sentiments, scores, zero_division=.0))
    # return scores

In [75]:
text_of_tweets = airline_tweets_train.data
sent_of_tweets = []
for i in airline_tweets_train.target:
    sent_of_tweets.append(airline_tweets_train.target_names[i])
# sent_of_tweets = airline_tweets_train.target

In [76]:
# Run VADER (as it is) on the set of airline tweets 

run_vader(text_of_tweets, sent_of_tweets)

              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.66      0.63      0.62      4755



In [77]:
# Run VADER on the set of airline tweets after having lemmatized the text
run_vader(text_of_tweets, sent_of_tweets, lemmatize=True)

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755



In [78]:
# Run VADER on the set of airline tweets with only adjectives
run_vader(text_of_tweets, sent_of_tweets, parts_of_speech_to_consider=["JJ"])

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      1750
     neutral       0.32      1.00      0.48      1515
    positive       0.00      0.00      0.00      1490

    accuracy                           0.32      4755
   macro avg       0.11      0.33      0.16      4755
weighted avg       0.10      0.32      0.15      4755



In [79]:
# Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
run_vader(text_of_tweets, sent_of_tweets, parts_of_speech_to_consider=["JJ"], lemmatize=True)

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      1750
     neutral       0.32      1.00      0.48      1515
    positive       0.00      0.00      0.00      1490

    accuracy                           0.32      4755
   macro avg       0.11      0.33      0.16      4755
weighted avg       0.10      0.32      0.15      4755



In [80]:
# Run VADER on the set of airline tweets with only nouns
run_vader(text_of_tweets, sent_of_tweets, parts_of_speech_to_consider=["NN"])

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      1750
     neutral       0.32      1.00      0.48      1515
    positive       0.00      0.00      0.00      1490

    accuracy                           0.32      4755
   macro avg       0.11      0.33      0.16      4755
weighted avg       0.10      0.32      0.15      4755



In [81]:
# Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
run_vader(text_of_tweets, sent_of_tweets, parts_of_speech_to_consider=["NN"], lemmatize=True)

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      1750
     neutral       0.32      1.00      0.48      1515
    positive       0.00      0.00      0.00      1490

    accuracy                           0.32      4755
   macro avg       0.11      0.33      0.16      4755
weighted avg       0.10      0.32      0.15      4755



In [82]:
# Run VADER on the set of airline tweets with only verbs
run_vader(text_of_tweets, sent_of_tweets, parts_of_speech_to_consider=["VB"])

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      1750
     neutral       0.32      1.00      0.48      1515
    positive       0.00      0.00      0.00      1490

    accuracy                           0.32      4755
   macro avg       0.11      0.33      0.16      4755
weighted avg       0.10      0.32      0.15      4755



In [83]:
# Run VADER on the set of airline tweets with only verbs and after having lemmatized the text
run_vader(text_of_tweets, sent_of_tweets, parts_of_speech_to_consider=["VB"], lemmatize=True)

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      1750
     neutral       0.32      1.00      0.48      1515
    positive       0.00      0.00      0.00      1490

    accuracy                           0.32      4755
   macro avg       0.11      0.33      0.16      4755
weighted avg       0.10      0.32      0.15      4755
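
The six part-of-speech runs above produce identical reports in which every tweet is labelled neutral. A likely explanation is a tag-set mismatch: `run_vader` compares against `token.pos_`, which in spaCy holds Universal POS labels such as `ADJ`, `NOUN` and `VERB`, while the calls pass Penn Treebank tags (`JJ`, `NN`, `VB`). Nothing matches, VADER receives an empty string, and the compound score is 0.0. The sketch below illustrates the effect without spaCy; the helper names and the hand-assigned tags are hypothetical and only mimic what `token.pos_` would return.

```python
# Hypothetical mapping from the Penn Treebank tags used in the calls above
# to the Universal POS labels that spaCy exposes through token.pos_.
PENN_TO_UNIVERSAL = {"JJ": "ADJ", "NN": "NOUN", "VB": "VERB"}


def to_universal(parts_of_speech):
    """Translate known Penn tags; leave labels that are already universal as they are."""
    return {PENN_TO_UNIVERSAL.get(tag, tag) for tag in parts_of_speech}


def filter_tokens(tagged_tokens, parts_of_speech):
    """Keep the tokens whose (universal) POS label is requested."""
    wanted = to_universal(parts_of_speech)
    return [text for text, pos in tagged_tokens if pos in wanted]


# (text, universal POS) pairs, assigned by hand for illustration
tagged = [("the", "DET"), ("flight", "NOUN"), ("was", "AUX"), ("terrible", "ADJ")]

unmapped = [text for text, pos in tagged if pos in ["JJ"]]  # what the runs above do
mapped = filter_tokens(tagged, ["JJ"])                       # after translating the tag
print(unmapped, mapped)
```

Passing `["ADJ"]`, `["NOUN"]` or `["VERB"]` to `run_vader` directly would be the simplest change; the reports above would then have to be regenerated before drawing conclusions about parts of speech.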



## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

B. TF-IDF and bag of words (BoW) perform similarly, with BoW slightly ahead (accuracy 0.86 vs. 0.83 at `min_df=2`). Changing `min_df` had minimal impact on performance (accuracy 0.83, 0.82 and 0.82 for TF-IDF).
The TF-IDF result can be explained by how the weighting works. Tweets are short, so a tweet may contain too few words for the weights to be informative,
which can lead TF-IDF to over-emphasize or down-weight critical portions of the text. BoW, in contrast, simply keeps track of how often each word occurs, including the sentiment-bearing words,
which allows Naive Bayes to pick up the correlations in this dataset quite well.
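
To see what the frequency threshold does, recall that `min_df` (given as an integer) drops every term that occurs in fewer than `min_df` documents. The sketch below imitates that filter in plain Python on three made-up tweets; it uses whitespace tokenisation as a simplification and is not how `CountVectorizer` is implemented.

```python
from collections import Counter


def vocabulary(docs, min_df=1):
    """Return the terms that occur in at least min_df documents."""
    document_frequency = Counter()
    for doc in docs:
        # count each term once per document, however often it occurs in it
        document_frequency.update(set(doc.lower().split()))
    return sorted(t for t, n in document_frequency.items() if n >= min_df)


toy_docs = ["flight delayed again", "great flight", "delayed and cancelled flight"]

vocab_1 = vocabulary(toy_docs, min_df=1)
vocab_2 = vocabulary(toy_docs, min_df=2)
vocab_3 = vocabulary(toy_docs, min_df=3)
print(len(vocab_1), vocab_2, vocab_3)
```

Raising the threshold mainly removes rare terms. If the frequent terms already carry most of the sentiment signal, the scores change little, which is in line with the small differences reported above.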

In [87]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def tests5(use_tfidf, min_df):
    # Train a Multinomial Naive Bayes classifier
    airplane_vec = CountVectorizer(min_df=min_df, # tokens that occur in fewer than min_df documents are ignored
                                tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                                stop_words=stopwords.words('english')) # stopwords are removed
    to_use = airplane_vec.fit_transform(airline_tweets_train.data)

    if use_tfidf:
        # reweight the raw counts with TF-IDF
        tfidf_transformer = TfidfTransformer()
        to_use = tfidf_transformer.fit_transform(to_use)

    # no random_state is set, so every call draws a different random split
    docs_train, docs_test, y_train, y_test = train_test_split(
        to_use, # bag-of-words counts or TF-IDF weights, depending on use_tfidf
        airline_tweets_train.target,
        test_size = 0.20 # we use 80% for training and 20% for testing
    ) 

    clf = MultinomialNB().fit(docs_train, y_train)
    pred = clf.predict(docs_test)
    print(classification_report(y_test, pred, zero_division=.0))

In [88]:
# Bag of words
tests5(False, 2)



              precision    recall  f1-score   support

           0       0.87      0.91      0.89       349
           1       0.88      0.77      0.82       308
           2       0.84      0.89      0.87       294

    accuracy                           0.86       951
   macro avg       0.86      0.86      0.86       951
weighted avg       0.86      0.86      0.86       951



In [91]:
tests5(True, 2)



              precision    recall  f1-score   support

           0       0.84      0.92      0.88       371
           1       0.87      0.71      0.78       316
           2       0.77      0.84      0.80       264

    accuracy                           0.83       951
   macro avg       0.83      0.82      0.82       951
weighted avg       0.83      0.83      0.82       951



In [89]:
tests5(True, 5)



              precision    recall  f1-score   support

           0       0.80      0.88      0.84       327
           1       0.84      0.73      0.78       332
           2       0.83      0.87      0.85       292

    accuracy                           0.82       951
   macro avg       0.82      0.82      0.82       951
weighted avg       0.82      0.82      0.82       951



In [90]:
tests5(True, 10)



              precision    recall  f1-score   support

           0       0.82      0.89      0.85       345
           1       0.79      0.74      0.76       303
           2       0.84      0.82      0.83       303

    accuracy                           0.82       951
   macro avg       0.82      0.81      0.81       951
weighted avg       0.82      0.82      0.81       951
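
To make the difference between the two representations in Question 5 concrete, the sketch below computes TF-IDF weights for a single short document by hand. It assumes the default settings of scikit-learn's `TfidfTransformer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation); the function name `tfidf_row` and the toy counts are hypothetical.

```python
import math


def tfidf_row(term_counts, doc_freq, n_docs):
    """TF-IDF weights for one document: count * smoothed idf, then L2-normalised."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# "flight" occurs in all 3 toy documents, "great" in only 1 of them
doc_freq = {"flight": 3, "great": 1}
row = tfidf_row({"flight": 1, "great": 1}, doc_freq, n_docs=3)
print(row)
```

In the bag-of-words representation both words would simply get the count 1; TF-IDF instead gives the rarer word "great" more weight than the ubiquitous "flight".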



### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [None]:
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
tweet_texts = []
tweet_labels = []
base_dir = "airlinetweets"  

for sentiment_label in ['positive', 'neutral', 'negative']:
    sentiment_folder = os.path.join(base_dir, sentiment_label)
    if not os.path.isdir(sentiment_folder):
        print(f"Folder not found: {sentiment_folder}")
        continue
    for filename in os.listdir(sentiment_folder):
        if filename.endswith(".txt"):
            file_path = os.path.join(sentiment_folder, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                tweet = f.read().strip().lower()
                tokens = word_tokenize(tweet)
                filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
                if filtered_tokens:
                    tweet_texts.append(" ".join(filtered_tokens))
                    tweet_labels.append(sentiment_label)

#print(f"Loaded {len(tweet_texts)} tweets.")

vectorizer = CountVectorizer(min_df=2)
tweet_counts = vectorizer.fit_transform(tweet_texts)

X_train, X_test, y_train, y_test = train_test_split(tweet_counts, tweet_labels, test_size=0.2)

clf = MultinomialNB().fit(X_train, y_train)

pred = clf.predict(X_test)
print(classification_report(y_test, pred, zero_division=0))

def important_features_per_class(vectorizer, classifier, n=20):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out()
    for i, label in enumerate(class_labels):
        # feature_count_ holds the raw training counts of each feature per class
        topn = sorted(zip(classifier.feature_count_[i], feature_names), reverse=True)[:n]
        print(f"Important words in {label} documents:")
        for count, feat in topn:
            print(f"{label:>10} {count:<5.1f} {feat}")
        print("-" * 40)

important_features_per_class(vectorizer, clf, n=20)


Part b

1. Many of the features in the list of important words match our expectations for each sentiment class. In the negative class, for example, words such as “terrible”, “n’t” and “fuck” make sense because they express strong negative emotions or complaints and are common in tweets that express dissatisfaction. In the positive class, words like “wishes”, “!” and “top” stood out; these are typically used to convey excitement, compliments or celebration. For neutral tweets, we expected words that are more factual or emotionally neutral, like “think”, “watch” and “donald”. With the new dataset, more airline-specific terms appeared, for example "delayed", "cancelled" and "customer" in negative tweets, and "thanks", "great" and "awesome" in positive tweets. This better reflects the sentiment being analyzed within the airline context.

2. Some features in the list surprised us, as they are less clearly linked to sentiment. For example, we did not expect the high frequency of punctuation such as colons, quotation marks and commas, because these symbols do not express sentiment by themselves. We were also surprised that usernames, links and generic names such as “j” were present. These are related to the topic rather than to a positive or negative tone. Words like “china” and “language”, which appeared in the negative class, were also unexpected; their tone probably depends heavily on the context in which they were used. Similarly, in the new dataset, airline names and customer service phrases were frequent across multiple classes, which makes interpretation slightly more difficult.

3. To improve the model, we would remove some of the noisier, uninformative features, for example quotation marks, links, user handles and common names, as they are usually related to the topic rather than the sentiment. On the other hand, we would definitely keep emotionally charged words such as “terrible” and “fuck”, because they carry strong sentiment signals. Words like “n’t” can also be useful for capturing strong emotions. By keeping these emotionally relevant words, the model should be better able to separate positive, negative and neutral tweets. Additionally, filtering out airline names may help the model rely more on emotional and contextual words rather than brand mentions.
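
The removal of user handles and links mentioned in point 3 could be sketched as a small preprocessing step applied before vectorizing. The regular expression and the function name below are illustrative assumptions, and the example tweet is made up.

```python
import re

# matches Twitter-style handles (@name) and http(s) links
HANDLE_OR_URL = re.compile(r"@\w+|https?://\S+")


def strip_handles_and_urls(tweet):
    """Remove user handles and links, then collapse the remaining whitespace."""
    cleaned = HANDLE_OR_URL.sub(" ", tweet)
    return " ".join(cleaned.split())


example = "@united thanks for the great flight! http://t.co/abc"
cleaned_example = strip_handles_and_urls(example)
print(cleaned_example)
```

Airline names that appear without an `@` would still need a separate list of brand names to filter on.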

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook