<b> Name: </b> MANAY, Justin Gabrielle A.



# Programming Exercise # 04: Sentiment Classification

## PART I. Loading the Data

In [292]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Justin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Justin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Justin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Justin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

We first load the corpus into Python, using UTF-8 encoding to account for the emojis. For each tweet, we keep the text together with its tagged sentiment (positive, negative or neutral).

In [293]:
import csv

with open("Virgin America and US Airways Tweets.csv", mode = "r", encoding = "utf-8") as inp:
    reader = csv.DictReader(inp, delimiter = "\t")
    sentiments = []
    tweets = []
    for row in reader:
        sentiments.append(row["airline_sentiment"])
        tweets.append(row["text"])

print(sentiments[:5])
print(tweets[:5])

['neutral', 'positive', 'neutral', 'negative', 'negative']
['@VirginAmerica What @dhepburn said.', "@VirginAmerica plus you've added commercials to the experience... tacky.", "@VirginAmerica I didn't today... Must mean I need to take another trip!", '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse', "@VirginAmerica and it's a really big bad thing about it"]


To characterize the data, we determine how many tweets we have per category.

In [294]:
import numpy as np

print("Document Count: " + str(len(sentiments)))
labels, counts = np.unique(sentiments, return_counts=True)
for label, count in zip(labels, counts):
    print("%s: %s (%.4f)" % (label, count, count/len(sentiments)))

Document Count: 3417
negative: 2444 (0.7152)
neutral: 552 (0.1615)
positive: 421 (0.1232)


From this, we can see that an overwhelming majority of the tweets (about 72%) are negative. Later on, when we split the dataset into training and testing sets, we can check whether the distribution of tweets in each set is roughly in tune with the proportions we computed here.

## PART II. Data Pre-Processing

In summary, we need to perform the following pre-processing procedures:
- Examine and replace links and mentions (@'s)
- Examine and replace HTML entities (e.g., `&lt;` for <)
- Convert to lowercase
- Remove punctuation and stop words

First, we examine the links and mentions in the tweets. Fortunately, all the links are of the form `http(s)://t.co/[ALPHANUMERIC CHARACTERS]` and all the mentions are of the form `@[ALPHANUMERIC CHARACTERS]`, so they can easily be located using regular expressions. We replace them with placeholder tokens rather than delete them so that the number of links and mentions can be used as features: dissatisfied customers may tend to link to images of their complaints or to mention more people/companies.

In [295]:
import re

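# The dot in t.co is escaped so that it matches literally; note A-Z, not A-z
re_link = r'https?://t\.co/[a-zA-Z0-9]+'
# Twitter handles may also contain underscores
re_mention = r'@[a-zA-Z0-9_]+'

# Replace links and mentions with placeholder tokens
tweets = [re.sub(re_link, "LINK", tweet) for tweet in tweets]
tweets = [re.sub(re_mention, "MENTION", tweet) for tweet in tweets]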
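
As a quick sanity check, the replacement can be tried on a single made-up tweet. The sample text and the helper name `replace_links_and_mentions` are hypothetical and only serve as an illustration; the patterns assume handles made of letters, digits and underscores.

```python
import re

# Assumed patterns: t.co short links and handles with letters, digits, underscores
RE_LINK = r"https?://t\.co/[a-zA-Z0-9]+"
RE_MENTION = r"@[a-zA-Z0-9_]+"

def replace_links_and_mentions(tweet):
    """Swap links and mentions for placeholder tokens (hypothetical helper)."""
    tweet = re.sub(RE_LINK, "LINK", tweet)
    return re.sub(RE_MENTION, "MENTION", tweet)

sample = "@VirginAmerica see https://t.co/abc123 cc @some_user"
cleaned = replace_links_and_mentions(sample)
print(cleaned)  # MENTION see LINK cc MENTION

# The placeholders can later be counted as features
link_count = cleaned.count("LINK")
mention_count = cleaned.count("MENTION")
```

Keeping the placeholders (instead of deleting the matches) is what makes the counts recoverable later on.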

We then deal with the HTML entities (e.g., `&amp;`) and escape characters (e.g., `\n`).

From a cursory examination, it seems that `\'`, `\n`, `&gt;`, `&lt;` and `&amp;` are the only ones we need to be concerned with.

In the case of `\'`, we ignore it since both characters will eventually be removed when we get rid of punctuation marks.

As for the rest, since they are all escape characters or punctuation marks, we simply remove them from the tweets. Notably, however, we keep `&lt;3` (`<3`) and `&lt;/3` (`</3`) since they can be used as sentiment-based features.

In [296]:
re_heart = "&lt;3"
re_broken_heart = "&lt;/3"

# Replace heart and broken heart
tweets = [re.sub(re_heart, "<3", tweet) for tweet in tweets]
tweets = [re.sub(re_broken_heart, "</3", tweet) for tweet in tweets]

# Remove the rest of the HTML entities
tweets = [re.sub("&lt;", "", tweet) for tweet in tweets]
tweets = [re.sub("&gt;", "", tweet) for tweet in tweets]
tweets = [re.sub("&amp;", "", tweet) for tweet in tweets]
tweets = [re.sub("\\n", "", tweet) for tweet in tweets]
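
An alternative worth knowing about is the standard library's `html.unescape`, which decodes every HTML entity in one pass. The sketch below is not what the notebook does: it keeps `&`, `<` and `>` as literal characters instead of deleting them, so any unwanted symbols would still have to be stripped afterwards. The sample string is made up.

```python
import html

def decode_entities(tweet):
    """Decode HTML entities such as &amp; and &lt; into literal characters."""
    return html.unescape(tweet)

sample = "faces &amp; they &lt;3 it"
decoded = decode_entities(sample)
print(decoded)  # faces & they <3 it
```

With this approach the heart emoticon `<3` survives without needing a special case.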

We then convert to lowercase...

In [297]:
# Convert to lowercase
tweets = [tweet.lower() for tweet in tweets]

...and then proceed with punctuation removal and the removal of stop words. We remove punctuation marks so that we can analyze based solely on words and emojis. However, we keep hashtags (`#`) and exclamation points (`!`) since they can be used as features.

In [298]:
# Remove punctuation
punct = '"$%\'()*+,-./:;=?[\\]^_`{}~…“”'
transtab = str.maketrans(dict.fromkeys(punct, ' '))

tweets = [tweet.translate(transtab) for tweet in tweets]
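
To see the effect of the translation table, here is a small sketch on one made-up tweet. Each listed punctuation mark becomes a space, while `#` and `!` are untouched because they are not in `punct`. The whitespace collapsing at the end is an extra step added here for readability only.

```python
# Same punctuation set as above; '#' and '!' are deliberately absent
punct = '"$%\'()*+,-./:;=?[\\]^_`{}~…“”'
transtab = str.maketrans(dict.fromkeys(punct, " "))

sample = "worst flight ever... #fail!"
cleaned = sample.translate(transtab)

# Collapse the runs of spaces left behind by the replaced characters
normalized = " ".join(cleaned.split())
print(normalized)  # worst flight ever #fail!
```

Replacing with a space rather than an empty string keeps words such as `on-time` from fusing into `ontime`.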

We also remove stop words since we assume that they are not significant in distinguishing between positive, neutral and negative tweets. We obtain our list of stopwords from the nltk package.

In [299]:
# Remove stop words
stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(tweet):
    return ' '.join([word for word in tweet.split() if word not in stopwords])

tweets = [remove_stopwords(tweet) for tweet in tweets]

## PART III. Training and Testing Sets

After performing pre-processing, we split the data into training and testing sets. The model learns from the training set, and we evaluate what it learned on the testing set. To split the dataset, we use `train_test_split` from the scikit-learn package.
The `test_size` argument specifies how much of the total dataset is allocated to the testing set (in this case, 30%), while the `random_state` argument sets the seed so that the dataset is shuffled the same way on every run.

In [300]:
from sklearn.model_selection import train_test_split

tweet_train, tweet_test, sentiment_train, sentiment_test = train_test_split(tweets, sentiments, test_size = 0.3, random_state = 100)

print("Training Data")
print("Document count: %s" % len(sentiment_train))
labels, counts = np.unique(sentiment_train, return_counts=True)
for label, count in zip(labels, counts):
    print("%s: %s (%.4f)" % (label, count, count/len(sentiment_train)))

print("=====\nTesting Data")
print("Document count: %s" % len(sentiment_test))
labels, counts = np.unique(sentiment_test, return_counts=True)
for label, count in zip(labels, counts):
    print("%s: %s (%.4f)" % (label, count, count/len(sentiment_test)))

Training Data
Document count: 2391
negative: 1728 (0.7227)
neutral: 365 (0.1527)
positive: 298 (0.1246)
=====
Testing Data
Document count: 1026
negative: 716 (0.6979)
neutral: 187 (0.1823)
positive: 123 (0.1199)


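Note that the proportions in the training and testing sets are roughly in tune with the proportions that we computed earlier, although they do not match exactly (e.g., neutral tweets make up 15.27% of the training set but 18.23% of the testing set). This is because `train_test_split` shuffles the data randomly; it does not stratify by default (`stratify` defaults to `None`). To guarantee matching proportions, we could pass `stratify = sentiments`.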
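
To make the idea of a stratified split concrete, here is a minimal standard-library sketch. It is not scikit-learn's implementation (in practice the `stratify` argument of `train_test_split` would be used); the function name `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(items, labels, test_size=0.3, seed=100):
    """Split so that each label keeps roughly the same share in both parts."""
    indices_by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)                       # shuffle within each label
        n_test = round(len(indices) * test_size)   # same fraction per label
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def take(seq, idxs):
        return [seq[i] for i in idxs]

    return (take(items, train_idx), take(items, test_idx),
            take(labels, train_idx), take(labels, test_idx))

# Toy data with a 70/20/10 label distribution
labels = ["negative"] * 70 + ["neutral"] * 20 + ["positive"] * 10
items = ["tweet %d" % i for i in range(len(labels))]

x_train, x_test, y_train, y_test = stratified_split(items, labels)
test_counts = Counter(y_test)
print(test_counts)
```

Because the 30% is taken separately from each label, the testing set keeps the 70/20/10 distribution exactly, which a plain random shuffle only achieves approximately.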

## PART IV. Feature Extraction

After splitting the dataset, we extract features from the tweets that we deem relevant in determining the sentiment of a tweet. By default, we will employ the bag-of-words model and consider words independently. This assumption is not entirely accurate, since we lose some context in considering words one-by-one (For example, `thumbs` can be thought of as neutral, while `thumbs up` is indicative of positive sentiment).

For this assignment, we will be extracting the following features:

- If a word occurs in many positive tweets, the word must be indicative of positive sentiment. Thus, we can use binary counts. 

- Moreover, if a word occurs frequently in these positive tweets, we can use said word to characterize positive tweets. Thus, we can use the term counts.

- However, if it occurs across many neutral and negative tweets as well, this word loses its value as a classifier. Thus, to favor words that occur regularly in only one category, we will employ the TF-IDF weighting scheme.

- Neutral tweets are typically difficult to distinguish based on term frequency alone. However, these tweets may be less descriptive and may thus employ fewer adjectives and adverbs. For this purpose, the number of part-of-speech tags may be a useful feature.
 
The number of links, mentions and exclamation points, the length of the tweets, emojis, emoticons and hashtags can also be used to determine sentiment, but we won't do it here.
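
Although these surface features are not used in this exercise, a rough sketch of how they could be extracted from a pre-processed tweet is given below. The helper name `surface_features`, the feature names and the sample tweet are all hypothetical; the sketch assumes links and mentions have already been replaced by the lowercase tokens `link` and `mention`.

```python
def surface_features(tweet):
    """Count simple non-lexical cues in a pre-processed (lowercased) tweet."""
    tokens = tweet.split()
    return {
        "n_links": tokens.count("link"),
        "n_mentions": tokens.count("mention"),
        "n_exclaims": tweet.count("!"),
        "n_hashtags": sum(1 for token in tokens if token.startswith("#")),
        "n_tokens": len(tokens),
    }

sample = "mention flight delayed again! #fail link"
features = surface_features(sample)
print(features)
```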

### A. Binary Counts

To determine which terms characterize positive, neutral and negative tweets, we can examine which words occur in the most tweets per category. For this example, we use scikit-learn's `CountVectorizer`.

We first use `fit_transform` in `CountVectorizer` to tokenize each tweet and determine whether each token occurs in the tweet or not. In the output below, the first number represents the document/tweet index and the second number represents the feature/token index.

In [301]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# binary = True records only whether a token occurs in a tweet, not how often
vectorizer = CountVectorizer(binary = True)

tf_bin = vectorizer.fit_transform(tweet_train)
print(tf_bin)

  (0, 577)	1
  (0, 936)	1
  (0, 2166)	1
  (0, 2719)	1
  (0, 2140)	1
  (0, 4399)	1
  (0, 3919)	1
  (0, 1208)	1
  (0, 451)	1
  (0, 2645)	1
  (1, 2655)	1
  (1, 715)	1
  (1, 1898)	1
  (1, 1977)	1
  (1, 2645)	1
  (2, 4045)	1
  (2, 2205)	1
  (2, 1010)	1
  (2, 2356)	1
  (2, 2410)	1
  (2, 1210)	1
  (2, 2645)	1
  (3, 1988)	1
  (3, 4390)	1
  (3, 2045)	1
  :	:
  (2388, 3512)	1
  (2388, 3308)	1
  (2388, 2139)	1
  (2388, 895)	1
  (2388, 562)	1
  (2388, 2173)	1
  (2388, 2486)	1
  (2388, 3850)	1
  (2388, 2645)	1
  (2389, 2847)	1
  (2389, 4437)	1
  (2389, 4497)	1
  (2389, 3022)	1
  (2389, 994)	1
  (2389, 2111)	1
  (2389, 1821)	1
  (2389, 577)	1
  (2389, 2645)	1
  (2390, 2782)	1
  (2390, 777)	1
  (2390, 4487)	1
  (2390, 1689)	1
  (2390, 2045)	1
  (2390, 1858)	1
  (2390, 2645)	1


To more easily visualize this data, we convert the sparse matrix to a dense one using `todense` and store it in a pandas DataFrame whose column names are the features/tokens.

In [302]:
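# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
count_vect_bin_df = pd.DataFrame(tf_bin.todense(), columns = vectorizer.get_feature_names_out())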

count_vect_bin_df.head()

Unnamed: 0,00,000,000114,000419,000lbs,00am,00p,00pm,0185,03,...,yr,yrs,ystrdy,yuma,yvonne,yvr,yyz,z1,zero,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We add the column `sentiment` to determine how each tweet is classified, and then get the mean of the counts across all positive, negative and neutral tweets.

In [303]:
count_vect_bin_df["sentiment"] = sentiment_train
count_vect_bin_df_grouped = count_vect_bin_df.groupby("sentiment").mean()

In [304]:
for idx in list(count_vect_bin_df_grouped.index):
    print(idx)
    print(count_vect_bin_df_grouped.loc[idx].sort_values(ascending = False).head(11))
    print("\n")

negative
mention      0.999421
flight       0.273727
hold         0.104745
get          0.095486
cancelled    0.087384
service      0.082176
hours        0.082176
help         0.073495
us           0.068866
hour         0.061921
plane        0.061343
Name: negative, dtype: float64


neutral
mention    1.000000
flight     0.221918
link       0.156164
help       0.093151
get        0.073973
please     0.068493
us         0.063014
need       0.063014
flights    0.060274
thanks     0.043836
dm         0.038356
Name: neutral, dtype: float64


positive
mention    1.000000
thanks     0.211409
thank      0.177852
flight     0.151007
great      0.097315
link       0.093960
you        0.077181
service    0.073826
us         0.063758
love       0.063758
get        0.060403
Name: positive, dtype: float64




Note in this case that the token `flight` occurs frequently across all three categories. Thus, `flight` alone is of little use in classifying the tweets by sentiment.

Note also that `link` is among the most common tokens in neutral and positive tweets (about 16% and 9% of tweets, respectively) but does not make the top 11 for negative tweets. Thus, we can probably use the number of links as a feature for sentiment classification.

### B. Term Counts

To determine which terms characterize positive, neutral and negative tweets, we can examine which words occur most frequently per category. Again, we use scikit-learn's `CountVectorizer`.

In [305]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: each entry is the number of times a token occurs in a tweet
vectorizer = CountVectorizer()

tf = vectorizer.fit_transform(tweet_train)
print(tf)

  (0, 577)	1
  (0, 936)	1
  (0, 2166)	1
  (0, 2719)	1
  (0, 2140)	1
  (0, 4399)	1
  (0, 3919)	1
  (0, 1208)	1
  (0, 451)	1
  (0, 2645)	2
  (1, 2655)	1
  (1, 715)	1
  (1, 1898)	1
  (1, 1977)	1
  (1, 2645)	1
  (2, 4045)	1
  (2, 2205)	1
  (2, 1010)	1
  (2, 2356)	1
  (2, 2410)	1
  (2, 1210)	1
  (2, 2645)	1
  (3, 1988)	1
  (3, 4390)	1
  (3, 2045)	1
  :	:
  (2388, 3512)	1
  (2388, 3308)	1
  (2388, 2139)	1
  (2388, 895)	1
  (2388, 562)	1
  (2388, 2173)	1
  (2388, 2486)	1
  (2388, 3850)	1
  (2388, 2645)	1
  (2389, 2847)	1
  (2389, 4437)	1
  (2389, 4497)	1
  (2389, 3022)	1
  (2389, 994)	1
  (2389, 2111)	1
  (2389, 1821)	1
  (2389, 577)	1
  (2389, 2645)	2
  (2390, 2782)	1
  (2390, 777)	1
  (2390, 4487)	1
  (2390, 1689)	1
  (2390, 2045)	1
  (2390, 1858)	1
  (2390, 2645)	1
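
The binary and term-count outputs differ only where a token repeats within a tweet: entry `(0, 2645)`, for instance, is 1 in the binary output and 2 here. The following standard-library sketch illustrates the distinction on two made-up documents; it does not reproduce `CountVectorizer`'s tokenization.

```python
from collections import Counter

# Toy pre-processed tweets (hypothetical)
docs = ["mention mention flight late", "mention thanks great flight"]

# Term counts: how many times each token occurs in each document
term_counts = [Counter(doc.split()) for doc in docs]

# Binary counts: 1 if the token occurs at all, regardless of repetitions
binary_counts = [{token: 1 for token in counts} for counts in term_counts]

print(term_counts[0]["mention"])    # 2
print(binary_counts[0]["mention"])  # 1
```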


Using a Pandas DataFrame,

In [306]:
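# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
count_vect_df = pd.DataFrame(tf.todense(), columns = vectorizer.get_feature_names_out())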

count_vect_df.head()

Unnamed: 0,00,000,000114,000419,000lbs,00am,00p,00pm,0185,03,...,yr,yrs,ystrdy,yuma,yvonne,yvr,yyz,z1,zero,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We add the column `sentiment` to determine how each tweet is classified, and then average the counts across all positive, negative and neutral tweets.

In [307]:
count_vect_df["sentiment"] = sentiment_train
count_vect_df_grouped = count_vect_df.groupby("sentiment").mean()

In [308]:
for idx in list(count_vect_df_grouped.index):
    print(idx)
    print(count_vect_df_grouped.loc[idx].sort_values(ascending = False).head(11))
    print("\n")

negative
mention      1.133102
flight       0.331597
hold         0.109954
get          0.100694
cancelled    0.092014
service      0.085069
hours        0.085069
help         0.076968
us           0.072917
plane        0.065394
hour         0.062500
Name: negative, dtype: float64


neutral
mention    1.227397
flight     0.243836
link       0.169863
help       0.098630
get        0.073973
please     0.073973
us         0.065753
need       0.065753
flights    0.060274
thanks     0.043836
change     0.043836
Name: neutral, dtype: float64


positive
mention    1.097315
thanks     0.214765
thank      0.181208
flight     0.164430
great      0.104027
link       0.097315
you        0.080537
service    0.077181
us         0.063758
love       0.063758
get        0.060403
Name: positive, dtype: float64




We encounter the same problem here with `flight`. Thus, we employ a TF-IDF weighting scheme to downweight words like `flight` which occur frequently across tweets of all sentiments.

### C. TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a variant of term frequency which takes into account both the term frequency and the (inverse) document frequency. Including the inverse document frequency ensures that words that occur in many tweets regardless of category (like `flight`) are downweighted relative to words that are concentrated in fewer tweets.
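
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand on three made-up documents. It follows the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document; the function name `tfidf` and the toy data are hypothetical, and this is not scikit-learn's actual code.

```python
import math
from collections import Counter

def tfidf(docs):
    """Hand-rolled TF-IDF with smoothed IDF and per-document L2 normalization."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {token: math.log((1 + n_docs) / (1 + d)) + 1 for token, d in df.items()}

    rows = []
    for tokens in tokenized:
        weights = {token: tf * idf[token] for token, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({token: w / norm for token, w in weights.items()})
    return rows

docs = ["flight delayed terrible", "flight thanks great", "flight cancelled"]
rows = tfidf(docs)

# 'flight' appears in every document, so it is weighted below 'terrible'
print(rows[0]["flight"] < rows[0]["terrible"])  # True
```

Since `flight` appears in all three documents, its IDF is the minimum possible (1.0), while `terrible` appears in only one document and receives a larger weight.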

To implement this, we use `TfidfTransformer` from the scikit-learn package along with `CountVectorizer`.

In [309]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()

tf_idf = transformer.fit_transform(tf)
print(tf_idf)

  (0, 2645)	0.10188981026358468
  (0, 451)	0.39115815358457234
  (0, 1208)	0.4118059012450132
  (0, 3919)	0.3249939716134084
  (0, 4399)	0.2499679225185752
  (0, 2140)	0.31271312871261886
  (0, 2719)	0.29454995241956966
  (0, 2166)	0.35586059581807306
  (0, 936)	0.31271312871261886
  (0, 577)	0.29991529039113285
  (1, 2645)	0.08615779703879868
  (1, 1977)	0.4661418529758807
  (1, 1898)	0.4981415698528017
  (1, 715)	0.44062691193075654
  (1, 2655)	0.5770541636454808
  (2, 2645)	0.06410165966335368
  (2, 1210)	0.3186585778336565
  (2, 2410)	0.5181566568959034
  (2, 2356)	0.3616635956547493
  (2, 1010)	0.3215715734020716
  (2, 2205)	0.4089251497693661
  (2, 4045)	0.4737433435519612
  (3, 2645)	0.024729257974958434
  (3, 3603)	0.19989544275631682
  (3, 1339)	0.10246379021791147
  :	:
  (2388, 2645)	0.07393823013617931
  (2388, 3850)	0.2969382898308456
  (2388, 2486)	0.27334482236931185
  (2388, 2173)	0.28159221961515896
  (2388, 562)	0.41401637075702413
  (2388, 895)	0.43128221971339786
  

The output of `fit_transform` has the same structure as in part B, but with TF-IDF values in place of counts. Again, we store the results in a pandas DataFrame.

In [310]:
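# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
tf_idf_df = pd.DataFrame(tf_idf.todense(), columns = vectorizer.get_feature_names_out())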

tf_idf_df.head()

Unnamed: 0,00,000,000114,000419,000lbs,00am,00p,00pm,0185,03,...,yr,yrs,ystrdy,yuma,yvonne,yvr,yyz,z1,zero,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As before, we add the column `sentiment`. However, it would not make sense to sum the TF-IDF values: each value is a term frequency scaled by a (logarithmic) inverse document frequency and, by default, normalized per tweet, so the sum no longer corresponds to a count. Moreover, take the word `flight`. Its TF-IDF value in any one tweet may be small, but since it occurs in many tweets, summing the values would inflate its total, making it seem more relevant than it actually is.

Thus, we instead compare the highest TF-IDF values for some words. Take the words `flight`, `terrible` and `thanks`.

In [311]:
tf_idf_df["sentiment"] = sentiment_train

tf_idf_df[["flight", "terrible", "thanks", "sentiment"]].groupby("sentiment").max()

Unnamed: 0_level_0,flight,terrible,thanks
sentiment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
negative,0.556495,0.844728,0.589098
neutral,0.367513,0.0,0.547264
positive,0.276377,0.0,0.96767


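From the scores, we can see that `terrible` and `thanks` are better discriminators than `flight`: `terrible` has a nonzero TF-IDF value only in negative tweets, and `thanks` reaches its highest value in positive tweets, whereas `flight` scores moderately in all three categories. Note that `thanks` also reaches 0.589 in negative tweets, so the maximum alone is only a rough indicator.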

### D. Part-of-Speech (POS) Counts

We can use the NLTK POS tagger to tag the words and count the number of words in a tweet belonging to a particular part of speech. 

In [312]:
tweet_train_tags = []
for tweet in tweet_train:
    tokens = nltk.word_tokenize(tweet)
    tokens = [token for token in tokens if token not in ["mention", "link"]]
    tags = nltk.pos_tag(tokens)
    tweet_train_tags.append(tags)

In [313]:
print(tweet_train_tags)



Note from the output that the tokenizer and the POS tagger did not work perfectly. 

The tokenizer, for example, did not tokenize the words after the hashtags. Also, the emojis are classified differently: some as nouns, some as verbs and some as adjectives. Nonetheless, most of the words seem to have been classified correctly.

We now proceed with counting the POS tags per tweet.

In [314]:
from collections import Counter
tweet_train_tags_count = []

for tweet in tweet_train_tags:
    pos_count = Counter([tag for word, tag in tweet])
    tweet_train_tags_count.append(pos_count)

This gives us a list of `Counter` objects, which we can more easily visualize in a DataFrame.
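
As a small sketch of what happens in this step, the example below counts tags for two hand-tagged tweets and lays the counts out as rows with one column per tag, filling in 0 where a tag is absent. The tagged tokens are written by hand for illustration and are not actual output of the NLTK tagger.

```python
from collections import Counter

# Hand-written (token, tag) pairs standing in for nltk.pos_tag output
tagged_tweets = [
    [("great", "JJ"), ("flight", "NN"), ("really", "RB")],
    [("flight", "NN"), ("cancelled", "VBN")],
]

# One Counter of tags per tweet
tag_counts = [Counter(tag for _, tag in tweet) for tweet in tagged_tweets]

# One column per tag seen anywhere; missing tags become 0
all_tags = sorted(set().union(*tag_counts))
rows = [[counts.get(tag, 0) for tag in all_tags] for counts in tag_counts]

print(all_tags)  # ['JJ', 'NN', 'RB', 'VBN']
print(rows)      # [[1, 1, 1, 0], [0, 1, 0, 1]]
```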

In [326]:
pos_counts = pd.DataFrame(tweet_train_tags_count).fillna(0)
pos_counts.head()

Unnamed: 0,#,$,.,CC,CD,DT,FW,IN,JJ,JJR,...,VB,VBD,VBG,VBN,VBP,VBZ,WDT,WP,WP$,WRB
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0


Now we check our earlier assumption: do neutral tweets have fewer adjectives and adverbs than either positive or negative tweets? We take the mean per category rather than the sum since there are more negative tweets than neutral and positive ones.

In [328]:
pos_counts["sentiment"] = sentiment_train

pos_counts[["JJ", "RB", "sentiment"]].groupby("sentiment").mean()

Unnamed: 0_level_0,JJ,RB
sentiment,Unnamed: 1_level_1,Unnamed: 2_level_1
negative,1.380208,0.653935
neutral,1.024658,0.356164
positive,1.241611,0.540268


We note that neutral tweets have around 0.22 fewer adjectives and 0.18 fewer adverbs on average compared to positive tweets, a noticeable gap given that the averages are only around 1 and 0.5, respectively. The difference is even greater for negative tweets. Thus, we can probably use POS tag counts to distinguish neutral tweets from positive and negative ones.
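
The gaps can be recomputed directly from the table of means above. The helper name `gap_to_neutral` is hypothetical; the numbers are copied from the output.

```python
# Mean adjective (JJ) and adverb (RB) counts per tweet, from the table above
means = {
    "negative": {"JJ": 1.380208, "RB": 0.653935},
    "neutral": {"JJ": 1.024658, "RB": 0.356164},
    "positive": {"JJ": 1.241611, "RB": 0.540268},
}

def gap_to_neutral(category):
    """Difference in mean tag counts between a category and neutral tweets."""
    return {tag: round(means[category][tag] - means["neutral"][tag], 2)
            for tag in ("JJ", "RB")}

positive_gap = gap_to_neutral("positive")
negative_gap = gap_to_neutral("negative")
print(positive_gap)  # {'JJ': 0.22, 'RB': 0.18}
print(negative_gap)  # {'JJ': 0.36, 'RB': 0.3}
```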

## PART V. Sentiment Classification using Multinomial Naive Bayes

For simplicity, we use a multinomial Naive Bayes classifier to classify the tweets.
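
To give an idea of what the classifier computes, here is a minimal multinomial Naive Bayes sketch using only the standard library. It uses add-one (Laplace) smoothing, which corresponds to the default `alpha = 1.0` of scikit-learn's `MultinomialNB`, but it is an illustration rather than scikit-learn's implementation. The function names `train_nb` and `predict_nb` and the toy tweets are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    class_sizes = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    vocab = {token for counts in token_counts.values() for token in counts}

    model = {}
    for label, n_docs in class_sizes.items():
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {token: math.log((token_counts[label][token] + alpha) / total)
                          for token in vocab}
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(log_likelihood[token]
                                        for token in doc.split()
                                        if token in log_likelihood)
    return max(scores, key=scores.get)

docs = ["flight cancelled terrible", "hours hold terrible",
        "thanks great flight", "love great service"]
labels = ["negative", "negative", "positive", "positive"]

model = train_nb(docs, labels)
pred_negative = predict_nb(model, "terrible hold")
pred_positive = predict_nb(model, "great thanks")
print(pred_negative, pred_positive)  # negative positive
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a token that was never seen in a class from zeroing out that class's score.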

Again, we extract term counts and TF-IDF although this time, we specify the parameters `max_df`, `min_df`, `ngram_range` and `binary`.

- `max_df` and `min_df` specify the bounds for the document frequency. Words with a document frequency outside of these bounds will be excluded from the vocabulary and will not factor in the sentiment classification.
- `ngram_range` specifies the bounds for the n-grams that will be used.
- As with before, `binary` allows for binary counts.
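
The effect of these parameters on the vocabulary can be sketched without scikit-learn. The example below treats `min_df` and `max_df` as proportions of documents only (scikit-learn also accepts absolute counts, which this sketch ignores); the function names `ngrams` and `build_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """All word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

def build_vocabulary(docs, min_df, max_df, ngram_range=(1, 1)):
    """Keep n-grams whose document frequency (as a proportion) is within bounds."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc.split(), *ngram_range)))
    return sorted(term for term, count in df.items()
                  if min_df <= count / n_docs <= max_df)

docs = ["flight late", "flight cancelled", "flight late again", "thanks flight"]
vocab = build_vocabulary(docs, min_df=0.5, max_df=0.9, ngram_range=(1, 2))
print(vocab)  # ['flight late', 'late']
```

Here `flight` is dropped for being too common (it is in every document), while rare terms such as `cancelled` fall below `min_df`; only `late` and the bigram `flight late` remain.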

By default, we will consider the specifications below and modify whenever necessary.

In [318]:
#Specification 1
tweet_train, tweet_test, sentiment_train, sentiment_test = train_test_split(tweets, sentiments, test_size = 0.3, random_state = 100)

vectorizer = CountVectorizer(max_df = 0.9, min_df = 0.01, ngram_range=(1,1), binary=True)
tweet_train = vectorizer.fit_transform(tweet_train)
tweet_test = vectorizer.transform(tweet_test)

Performing sentiment classification,

In [319]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

nb = MultinomialNB()

nb.fit(tweet_train, sentiment_train)
sentiment_pred = nb.predict(tweet_test)
acc = accuracy_score(sentiment_test, sentiment_pred)
f1 = f1_score(sentiment_test, sentiment_pred, average = "weighted")  
  
print("Accuracy: %s\nF1 Score: %s\n" % (acc, f1))

Accuracy: 0.7309941520467836
F1 Score: 0.6971208181261093
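
The gap between accuracy and the weighted F1 score is worth a second look on an imbalanced dataset like this one. The sketch below computes a support-weighted F1 by hand and applies it to a made-up classifier that always predicts `negative`; the function name `weighted_f1` and the toy labels are hypothetical, and this is an illustration rather than scikit-learn's implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true
    return total / len(y_true)

# A classifier that always answers 'negative' on a 70/20/10 test set
y_true = ["negative"] * 7 + ["neutral"] * 2 + ["positive"]
y_pred = ["negative"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
print(accuracy, round(score, 3))  # 0.7 0.576
```

A majority-class guesser already reaches 70% accuracy on such data, so the F1 score is the more informative of the two metrics when comparing specifications.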



We can improve on this model even further. For example, we can consider using term counts instead. From our output of words with the largest term counts, our `max_df` should be fine at 0.9, but there may be low-frequency words that occur in only one category, so we might want to lower our `min_df` accordingly. We might also want to consider bigrams. Considering these,

In [320]:
#Specification 2
tweet_train, tweet_test, sentiment_train, sentiment_test = train_test_split(tweets, sentiments, test_size = 0.3, random_state = 100)

vectorizer = CountVectorizer(max_df = 0.9, min_df = 0.001, ngram_range=(1,2))
tweet_train = vectorizer.fit_transform(tweet_train)
tweet_test = vectorizer.transform(tweet_test)

In [321]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

nb = MultinomialNB()

nb.fit(tweet_train, sentiment_train)
sentiment_pred = nb.predict(tweet_test)
acc = accuracy_score(sentiment_test, sentiment_pred)
f1 = f1_score(sentiment_test, sentiment_pred, average = "weighted")  
  
print("Accuracy: %s\nF1 Score: %s\n" % (acc, f1))

Accuracy: 0.7660818713450293
F1 Score: 0.747817716530204



The model improved! Since we expect a TF-IDF weighting scheme to emphasize more discriminative words, we next use TF-IDF weights in our Naive Bayes classifier.

In [322]:
#Specification 3
from sklearn.feature_extraction.text import TfidfVectorizer
tweet_train, tweet_test, sentiment_train, sentiment_test = train_test_split(tweets, sentiments, test_size = 0.3, random_state = 100)

transformer = TfidfVectorizer(max_df = 1.0, min_df = 0.01, ngram_range=(1,2), use_idf = True, smooth_idf = True)
tweet_train = transformer.fit_transform(tweet_train)
tweet_test = transformer.transform(tweet_test)

In [323]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

nb = MultinomialNB()

nb.fit(tweet_train, sentiment_train)
sentiment_pred = nb.predict(tweet_test)
acc = accuracy_score(sentiment_test, sentiment_pred)
f1 = f1_score(sentiment_test, sentiment_pred, average = "weighted")  
  
print("Accuracy: %s\nF1 Score: %s\n" % (acc, f1))

Accuracy: 0.7426900584795322
F1 Score: 0.6743986570452933



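TF-IDF actually performs slightly worse than the previous model (accuracy 0.743 vs. 0.766; F1 score 0.674 vs. 0.748). Note that Specification 3 also raises `min_df` back to 0.01 and sets `max_df` to 1.0, so part of the difference may come from the smaller vocabulary rather than from the weighting itself. Other possible reasons:

- There are only a few words like `flight` which occur commonly in all categories (probably because we removed stop words). Tweets also contain a lot of slang and unique vocabulary, so only a few words occur across all categories.
- The commonly occurring words reflect the sentiment of the tweet pretty clearly.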