# Lab3.3 Sentiment Analysis with Scikit-Learn

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

The focus of this notebook is on performing sentiment analysis using the scikit-learn package. Material from [this notebook](http://www.pitt.edu/~naraehan/presentation/Movie+Reviews+sentiment+analysis+with+Scikit-Learn.html) was used.

**At the end of this notebook, you will be able to**:
* load the training data, i.e., the airline tweets and the movie reviews
* inspect the training data
* extract features from the training data
* train and evaluate a Naive Bayes classifier (*MultinomialNB*)
* apply the classifier to fake movie reviews

**If you want to learn more, you might find the following link useful:**
* [documentation on dataset loading](http://scikit-learn.org/stable/datasets/)

To train a machine learning system we need a number of packages. The most important ones are *sklearn* and *numpy*, which we use to manipulate our data and call machine learning functions. Since we are dealing with texts, we also need some specific modules from *sklearn* that turn texts into word features.

In [1]:
import pathlib
import sklearn
import numpy
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

## Loading the dataset
We are first going to load and inspect the **airlinetweets** dataset (which is included in the zip file you downloaded from GitHub). We are going to use the function **load_files**, which is part of sklearn.
Let's first inspect what the help message of the function **load_files** states.

In [4]:
help(load_files)

Help on function load_files in module sklearn.datasets.base:

load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
    Load text files with categories as subfolder names.
    
    Individual samples are assumed to be files stored a two levels folder
    structure such as the following:
    
        container_folder/
            category_1_folder/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            category_2_folder/
                file_43.txt
                file_44.txt
                ...
    
    The folder names are used as supervised signal label names. The individual
    file names are not important.
    
    This function does not try to extract features into a numpy array or scipy
    sparse matrix. In addition, if load_content is false it does not try to
    load the files in memory.
    
    To use text files in a scikit

Ok, so the function requires the following structure in order for it to work:
* container_folder/
    * category_1_folder/ (e.g., 'pos')
        * file_1.txt
        * file_2.txt
        * ...
        * file_42.txt
    * category_2_folder/ (e.g., 'neg')
        * file_43.txt
        * file_44.txt
        * ...

Let's check whether our **airlinetweets** corpus has this structure.

In [2]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:', 
      airline_tweets_folder.exists())

path: /Users/piek/Desktop/TextMiningFEW-BA-2020/labs/text-mining-ba/lab_sessions/lab3/airlinetweets
this will print True if the folder exists: True


In [3]:
str(airline_tweets_folder)

'/Users/piek/Desktop/TextMiningFEW-BA-2020/labs/text-mining-ba/lab_sessions/lab3/airlinetweets'

Inspect whether the corpus has the required structure.

....

....

Hopefully, it is! Let's now load it using the function.

In [6]:
# loading all files as training data.
airline_tweets_train = load_files(str(airline_tweets_folder))

## Inspecting dataset

How many files do we have?

In [8]:
len(airline_tweets_train.data)

4755

In [9]:
# target names ("classes") are automatically generated from subfolder names
airline_tweets_train.target_names

['negative', 'neutral', 'positive']

If you do not agree with these labels, you could change the names of the subdirectories. If you do not agree with the distinctions, you could add other folders with other category names and move files to these folders.

How many do we have for each category?

In [10]:
freqs = Counter(airline_tweets_train.target)
for category, frequency in freqs.items():
    print(airline_tweets_train.target_names[category], frequency)

neutral 1515
positive 1490
negative 1750


In [11]:
# Let's inspect the first file
airline_tweets_train.data[0]

b'@AmericanAir Why is your cover photo of TWA? Just wondering.'

In [12]:
# first file is in "neutral" folder
airline_tweets_train.filenames[0]

'/Users/piek/Desktop/CBS2020/text-mining-ba-changed/lab_sessions/lab3/airlinetweets/neutral/AL_570069345818161152.txt'

In [13]:
# first file is a neutral tweet and is mapped to index 1 in target_names
airline_tweets_train.target[0]

1

We can find out what the index means by inserting it into **target_names**

In [14]:
airline_tweets_train.target_names[1]

'neutral'

Remember from the previous labs that some of the NLTK data is structured in precisely the same way as the airlinetweets data, e.g. nltk_data/corpora/movie_reviews.

Likewise, we can load this data in the same way using the load_files function. First adapt the path below to point to the location on your local laptop.

In [15]:
movie_reviews_folder = pathlib.Path('/Users/piek/nltk_data/corpora/movie_reviews')  # absolute path, adapt to your machine
print('path:', movie_reviews_folder)
print('this will print True if the folder exists:', 
      movie_reviews_folder.exists())

path: /Users/piek/nltk_data/corpora/movie_reviews
this will print True if the folder exists: True


In [16]:
movie_reviews_train = load_files(str(movie_reviews_folder))

In [17]:
print(len(movie_reviews_train.data))
print(movie_reviews_train.target_names)

2000
['neg', 'pos']


## Extracting features from training data (see notebook Lab2.3 Feature representation.ipynb for more information)
Note: you might get a warning when you run the following cell. You do NOT have to resolve the warning.

In [4]:
# initialize the vectorizer for the airline tweets; the tweets are vectorized in the next cell

airline_vec = CountVectorizer(min_df=2, # tokens that occur in fewer than 2 documents are ignored
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed

In order to represent each document in terms of this vocabulary, we use the *fit_transform* function to generate a matrix in which each row is a document and each column holds the count of one word in that document.

In [7]:
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

  'stop_words.' % sorted(inconsistent))


We have now fitted the vectorizer *airline_vec* on the complete vocabulary of the full data set. Every position in a document vector represents a unique word token.

In [8]:
#Total number of word features or the length of the total vector
print(len(airline_vec.vocabulary_))

2902
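
The effect of *min_df* is easier to see on a handful of documents. The sketch below uses made-up sentences and, for brevity, the default sklearn tokenizer instead of the NLTK tokenizer used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents for illustration.
toy_docs = ["the flight was late",
            "the flight was great",
            "great crew"]

# min_df=2 keeps only tokens that occur in at least two documents
toy_vec = CountVectorizer(min_df=2)
toy_counts = toy_vec.fit_transform(toy_docs)

print(sorted(toy_vec.vocabulary_))  # tokens that survived the filter
print(toy_counts.shape)             # (number of documents, vocabulary size)
```

The words "late" and "crew" each occur in only one document, so they get no slot in the vectors.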


In [21]:
# First 50 feature names (newer scikit-learn versions replace get_feature_names() with get_feature_names_out())
print(list(airline_vec.get_feature_names())[:50])

['!', '#', '$', '%', '&', "'", "''", "'d", "'ll", "'m", "'re", "'s", "'various", "'ve", '(', ')', '+', '+1', '-', '--', '-/', '.', '..', '...', '.50', '.still', '/', '//t.co/8wbzorrn3c', '//t.co/aqjn4hwnac', '//t.co/f2lfulcbq7', '0', '1', '1.', '1.5', '1.75', '1/2', '10', '10-24', '100', '100+', '1000', '103', '1051', '1071', '10:30', '10:55', '11', '1142', '12', '1230']


In [22]:
# 'plane' is found in the corpus, mapped to index 1948
airline_vec.vocabulary_.get('plane')

1948

We can now inspect the dimensions of our feature array by getting the shape: the rows (documents) and columns (the word vector length).

In [23]:
# large dimensions! 4,755 documents, 2902 unique terms. 
airline_counts.shape

(4755, 2902)

We can convert the matrix to an array and get the first element and look at the vector values for slots 100 till 200:

In [24]:
print(airline_counts.toarray()[0][100:200])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


We can see that most values are zeros and just a few have the value 1. This is what we call a sparse vector.
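
Sparse matrices store only the non-zero entries, which is why *fit_transform* returns a SciPy sparse matrix rather than a full array. As a rough sketch on a small made-up matrix, the density can be measured like this:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up count matrix: 3 documents, 6 vocabulary slots.
dense = np.array([[0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0],
                  [2, 0, 0, 0, 0, 1]])
sparse = csr_matrix(dense)

n_cells = sparse.shape[0] * sparse.shape[1]
density = sparse.nnz / n_cells  # fraction of cells that are non-zero

print('stored values:', sparse.nnz, 'of', n_cells)
print('density:', density)
```

The same two attributes, `nnz` and `shape`, are available on the matrix returned by the vectorizer.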

As we have seen in the previous Lab, we can also transform the counts into information value scores using the *TfidfTransformer* function.

In [9]:
# Convert raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

The shape remains the same, but the values are now scores between zero and one.

In [26]:
# Same dimensions, now with tf-idf values instead of raw frequency counts
print(airline_tfidf.shape)
print(airline_tfidf.toarray()[0][100:200])

(4755, 2902)
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.19103974 0.07301937 0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.       
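
To see why the scores fall between zero and one, the sketch below recomputes tf-idf by hand on a tiny made-up count matrix and compares it with the library result. It assumes the default settings of *TfidfTransformer* (smoothed idf and L2 normalisation of each row).

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts: 2 documents, 3 terms.
toy_counts = np.array([[1, 1, 0],
                       [1, 0, 2]], dtype=float)

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)          # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf (default setting)
weighted = toy_counts * idf
# divide each row by its Euclidean length, so every row has length 1
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

library_tfidf = TfidfTransformer().fit_transform(csr_matrix(toy_counts)).toarray()
print(np.allclose(manual_tfidf, library_tfidf))
```

Because every row is scaled to unit length and counts are never negative, no single value can exceed one.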

## Training and testing a Naive Bayes classifier

We can now use the above data representation as training data to build a classifier. The sklearn package already associated each row (a document) in our data representation with a label by taking the name of the data subfolder.

In [27]:
airline_tweets_train.target_names[2]

'positive'

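We are going to use a simple Naive Bayes classifier to train a model. We use the multinomial variant, *MultinomialNB*, which is designed for count-like features such as word frequencies (in practice it also works with tf-idf values) and handles our three classes (negative, neutral, positive) without further configuration.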

In [10]:
# Now ready to build a classifier. 
# We will use Multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB

It is easy for a machine learning package to read the above vector representations and associate them with any type of label. However, we also want to test the model. For that purpose, we need to hold out part of the data from the training set.

To train the classifier, we will first split the data into train and test.

In [11]:
# Split data into training and test sets
# from sklearn.cross_validation import train_test_split  # deprecated in 0.18
from sklearn.model_selection import train_test_split

We choose 80% training and 20% test. 

In [12]:
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_tfidf, # the tf-idf model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 
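
The split is random, so results differ slightly between runs. As a sketch on made-up data, the optional `random_state` and `stratify` parameters of `train_test_split` make the split reproducible and keep the class proportions equal in both parts; whether you want this in the lab is a design choice, not something the notebook requires.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 20 instances, two equally frequent classes.
toy_features = [[i] for i in range(20)]
toy_labels = [0] * 10 + [1] * 10

X_train, X_test, y_train_toy, y_test_toy = train_test_split(
    toy_features, toy_labels,
    test_size=0.20,
    random_state=42,      # fixed seed: the same split on every run
    stratify=toy_labels   # keep the class proportions in both parts
)

print(Counter(y_train_toy), Counter(y_test_toy))
```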

One instance looks like this:

In [13]:
docs_train[55].toarray()

array([[0.        , 0.12342841, 0.        , ..., 0.        , 0.        ,
        0.        ]])

Its label is:

In [14]:
y_train[55]

0

which we know is then:

In [15]:
airline_tweets_train.target_names[y_train[55]]

'negative'

The *fit* function of sklearn takes as input the training data and the labels.

In [16]:
# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)

After training the classifier, we can apply it to test data that is represented in the same way.

In [17]:
# Predict the labels for the test set
y_pred = clf.predict(docs_test)

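We can now look at a tweet, a gold label and a prediction. Note that `y_pred[0]` is the prediction for the first *test* instance. Because `train_test_split` shuffles the data, this instance is not necessarily the first tweet in `airline_tweets_train.data`, so the gold label that belongs to `y_pred[0]` is `y_test[0]`.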

In [18]:
print('one tweet review:', airline_tweets_train.data[0])
print('gold label:', airline_tweets_train.target[0])
print('classifier predicted:', y_pred[0])

one tweet review: b'@AmericanAir Why is your cover photo of TWA? Just wondering.'
gold label: 1
classifier predicted: 1
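
To print the text that belongs to a prediction, the raw texts have to go through the same split as the features and labels. A minimal sketch with made-up tweets is given below; the single-number feature rows are only a stand-in for real tf-idf vectors, so that the alignment can be checked.

```python
from sklearn.model_selection import train_test_split

# Made-up tweets and labels (0 = negative, 1 = neutral, 2 = positive).
texts = ["great flight", "lost my bag", "when do we board?",
         "thanks crew", "delayed again", "which gate?"]
labels = [2, 0, 1, 2, 0, 1]
features = [[i] for i in range(len(texts))]  # stand-in: row i stores its own index

# Passing all three sequences at once applies the same shuffle to each of them.
texts_train, texts_test, X_train, X_test, y_train_toy, y_test_toy = train_test_split(
    texts, features, labels, test_size=0.5, random_state=0)

aligned = all(texts[row[0]] == text and labels[row[0]] == gold
              for text, row, gold in zip(texts_test, X_test, y_test_toy))
print('texts, features and labels still line up:', aligned)
```

With the real data, `airline_tweets_train.data` could be passed as an extra first argument in the same way.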


Sklearn provides functions to obtain the recall, precision and f-measure for the test set results

In [19]:
sklearn.metrics.recall_score(y_true=y_test,
                             y_pred=y_pred,
                             average='micro')

0.8096740273396424
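
The `average` parameter changes what the score means. For single-label classification, micro-averaged recall equals accuracy, while macro-averaged recall gives every class the same weight. The sketch below uses made-up labels to show the difference.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: class 0 dominates the data.
toy_true = [0, 0, 0, 0, 1, 2]
toy_pred = [0, 0, 0, 0, 0, 0]  # a classifier that always predicts class 0

micro = recall_score(toy_true, toy_pred, average='micro')
macro = recall_score(toy_true, toy_pred, average='macro')
accuracy = accuracy_score(toy_true, toy_pred)

print('micro recall:', micro)
print('macro recall:', macro)
print('accuracy:', accuracy)
```

Macro recall is much lower here because two of the three classes are never predicted correctly, which the micro average hides.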

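We can also inspect the most important features per category. The function below ranks the features by *feature_count_*, i.e., the summed tf-idf weight of each feature in the training documents of a class.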

In [20]:
def important_features_per_class(vectorizer,classifier,n=10): #n is the number of top features
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()  # newer scikit-learn versions: get_feature_names_out()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

# example of how to call from notebook:
important_features_per_class(airline_vec, clf)

Important words in negative documents
0 159.40464318077093 united
0 112.14589872712645 .
0 98.34080034584191 @
0 97.68313441595663 ``
0 57.82518310123395 flight
0 50.83231068351349 ?
0 43.89285669654893 #
0 42.48226162224598 !
0 39.095987614300554 n't
0 29.76060493138173 ''
-----------------------------------------
Important words in neutral documents
1 105.5412314595449 @
1 83.6189215124316 ?
1 64.91299350367485 jetblue
1 57.97638752805202 .
1 56.10804917029334 southwestair
1 56.06856541207693 ``
1 52.4017474772822 :
1 52.23933888402645 americanair
1 41.57744163001819 usairways
1 40.6311553077198 flight
-----------------------------------------
Important words in positive documents
2 177.36389672662492 !
2 105.91919226646378 @
2 87.94957742538955 .
2 85.90033570289233 thanks
2 76.82162941213106 thank
2 67.18373946752153 southwestair
2 63.537671376526994 jetblue
2 61.03136299883954 ``
2 52.33885497983838 #
2 50.70106371666208 americanair


## Applying classifier on our own data
Now we can apply our classifier to new data.
In the example below, these are movie reviews. In the exercise, you will choose tweets that you've selected.

In [21]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 
               'Absolute joy ride', 
               'Steven Seagal was terrible', 
               'Steven Seagal shined through.', 
               'This was certainly a movie', 
               'Two thumbs up', 
               'I fell asleep halfway through', 
               "We can't wait for the sequel!!", 
               'I cannot recommend this highly enough', 
               'instant classic.', 
               'Steven Seagal was amazing.']
len(reviews_new)

11

To be able to apply our model to these texts, we need to represent them using the same vectors as we used for training the model.
The sklearn transformer function does this work for you.

In [22]:
# We re-use airline_vec to transform it in the same way as the training data
new_counts = airline_vec.transform(reviews_new)
new_counts.shape

(11, 2902)

Note that words in our movie reviews that are NOT in the training data will not be represented, as the vectors built from the training data have no slots for them.
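
A small sketch with made-up texts shows what this means in practice: a text that contains only unseen words is mapped to a vector of zeros. The default sklearn tokenizer is used here for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on two made-up training texts.
toy_vec = CountVectorizer()
toy_vec.fit(["good flight", "bad flight"])

unseen = toy_vec.transform(["excellent movie"])  # no word is in the vocabulary
partly_seen = toy_vec.transform(["good movie"])  # only "good" is in the vocabulary

print(unseen.nnz, partly_seen.nnz)  # number of non-zero entries per vector
```

For an all-zero vector the classifier has no evidence from the words, so its prediction is driven by the class priors learned from the training data.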

In [23]:
# we compute tf idf values
reviews_new_tfidf = tfidf_transformer.transform(new_counts)

In [42]:
reviews_new_tfidf.shape

(11, 2902)

In [43]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [44]:
# print out the results
for review, predicted_label in zip(reviews_new, pred):
    
    print('%s => %s' % (review, 
                        airline_tweets_train.target_names[predicted_label]))

This movie was excellent => positive
Absolute joy ride => positive
Steven Seagal was terrible => negative
Steven Seagal shined through. => negative
This was certainly a movie => negative
Two thumbs up => negative
I fell asleep halfway through => neutral
We can't wait for the sequel!! => negative
I cannot recommend this highly enough => negative
instant classic. => negative
Steven Seagal was amazing. => positive


## Training a classifier with movie reviews

Remember that we loaded the NLTK movie review data set using the sklearn function *load_files* in the same way as we have done for the airlinetweets. This means we can build a simple classifier from this data as well, apply it to the same set of texts in *reviews_new*, and compare the two systems.

We proceed in three simple steps:

<ol>
    <li>We create a CountVectorizer to vectorize the training texts based on the total vocabulary using the *fit_transform* function</li>
    <li>We transform the counts into information values using the tfidf_transformer</li>
    <li>We create a *MultinomialNB* classifier from the vectorised documents and their labels</li>
</ol>

Once we have trained the classifier, we can apply it to the above examples by transforming these into representations that are compatible with our model.
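
The three steps can also be chained in a single object with `make_pipeline` from sklearn. This is an optional alternative, sketched here on made-up reviews and with the default tokenizer; the cells that follow keep the three steps separate.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training reviews and labels.
toy_texts = ["great wonderful film", "wonderful acting great fun",
             "terrible boring film", "boring plot terrible acting"]
toy_labels = ["pos", "pos", "neg", "neg"]

# vectorize -> tf-idf -> classifier, fitted together on the training texts
toy_model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# raw text goes in; the pipeline applies the fitted steps in order
toy_pred = toy_model.predict(["great fun", "boring and terrible"])
print(list(toy_pred))
```

A pipeline keeps its own vectorizer and tf-idf transformer, so models trained on different corpora cannot accidentally share or overwrite each other's fitted components.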

In [45]:
movie_vec = CountVectorizer(min_df=2, # tokens that occur in fewer than 2 documents are ignored
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed

In [46]:
movie_counts = movie_vec.fit_transform(movie_reviews_train.data)

  'stop_words.' % sorted(inconsistent))


In [47]:
# note: this refits the tfidf_transformer that was fitted on the airline tweets earlier
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)

The sklearn load_files function not only returns the data for training but also the classification labels that go with it, based on the names of the folders that contain the text files. The *target* attribute contains the labels as integer indexes, one for each document in the data, and *target_names* contains the meaning of the labels.

In [48]:
print(movie_reviews_train.target_names)
# print the labels for the first ten documents
print(movie_reviews_train.target[:10])

['neg', 'pos']
[0 1 1 0 1 1 1 1 1 0]


In [49]:
clf = MultinomialNB().fit(movie_tfidf, movie_reviews_train.target)

In [50]:
# We re-use movie_vec to transform the new reviews in the same way as the training data
new_counts = movie_vec.transform(reviews_new)
new_counts.shape

(11, 25138)

In [51]:
reviews_new_tfidf = tfidf_transformer.transform(new_counts)

In [52]:
pred = clf.predict(reviews_new_tfidf)

In [53]:
# print out the results
for review, predicted_label in zip(reviews_new, pred):
    
    print('%s => %s' % (review, 
                        movie_reviews_train.target_names[predicted_label]))

This movie was excellent => pos
Absolute joy ride => pos
Steven Seagal was terrible => neg
Steven Seagal shined through. => neg
This was certainly a movie => neg
Two thumbs up => neg
I fell asleep halfway through => neg
We can't wait for the sequel!! => neg
I cannot recommend this highly enough => pos
instant classic. => pos
Steven Seagal was amazing. => neg


## End of this notebook