# Applying machine learning to sentiment analysis

For this we'll be analyzing the IMDb review data taken from http://ai.stanford.edu/~amaas/data/sentiment/  
The data has been downloaded to the current directory.

### Preprocessing the movie dataset into a more convenient format

Having successfully extracted the dataset, we'll now assemble the individual text documents from the downloaded archive into a single CSV file. We'll read the movie reviews into a pandas DataFrame object, which can take about 10 minutes on a standard desktop computer. To visualize the progress and the estimated time until completion, we'll use the **PyPrind** package.

In [1]:
import pyprind
import pandas as pd
import os
pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}
rows = []  # collect [text, label] pairs; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = './aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()

# build the DataFrame once, after all reviews have been read
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:09:01


Since the class labels in the assembled dataset are sorted, we'll now shuffle the DataFrame using the permutation function from the np.random submodule. This will be useful for splitting the data into training and test sets later. For convenience, we'll also store the assembled and shuffled movie review dataset as a CSV file.

In [2]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('./movie_data.csv', index = False)


Let's quickly confirm that we saved the data in the right format.

In [3]:
df = pd.read_csv('./movie_data.csv')
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | Just watched this movie over the weekend, and ... | 1 |
| 1 | A real insult to the original "Spoorloos", whi... | 0 |
| 2 | How good is Gwyneth Paltrow! This is the right... | 0 |


## Introducing the bag-of-words model  
The bag-of-words model allows us to represent text as numerical feature vectors. The idea behind bag-of-words is quite simple and can be summarized as follows:

1. We create a **vocabulary** of unique **tokens** - for example, words from the entire set of documents.
2. We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.  

Since the unique words in each document represent only a small subset of all the
words in the bag-of-words vocabulary, the feature vectors will consist of mostly
zeros, which is why we call them **sparse**.

### Transforming words into feature vectors  
To construct a bag-of-words model based on the word counts in the respective documents, we can use the CountVectorizer class implemented in scikit-learn. The CountVectorizer class takes an array of text data, which can be documents or just sentences, and constructs the bag-of-words model for us:

In [4]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
        'The sun is shinning',
        'The weather is sweet',
        'The sun is shinning and the weather is sweet'
    ])
bag = count.fit_transform(docs)

Now let's print the contents of the vocabulary to get a better understanding of the underlying concepts:

In [5]:
print(count.vocabulary_)

{'the': 5, 'sun': 3, 'shinning': 2, 'weather': 6, 'and': 0, 'is': 1, 'sweet': 4}


As we can see from the output of the preceding command, the vocabulary is stored in a Python dictionary, which maps the unique words to integer indices. Now let's print the feature vectors that we just created:

In [6]:
print(bag.toarray())

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]


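Each index position in the feature vectors shown here corresponds to the integer values that are stored as dictionary items in the CountVectorizer vocabulary. For example, the feature at index position 0 is the count of the word 'and', which occurs only in the last document, while the word 'is' at index position 1 occurs in all three sentences.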
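
To make the mapping between the vocabulary and the feature vectors concrete, the following is a minimal pure-Python sketch that rebuilds the same counts by hand. It assumes simple lowercasing and whitespace splitting, which happens to agree with CountVectorizer for these three sentences but is not how scikit-learn tokenizes in general. The names build_vocabulary and count_vector are illustrative only.

```python
docs = [
    'The sun is shinning',
    'The weather is sweet',
    'The sun is shinning and the weather is sweet',
]

def build_vocabulary(documents):
    """Map every unique lowercased token to an index in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(document, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    vector = [0] * len(vocabulary)
    for tok in document.lower().split():
        vector[vocabulary[tok]] += 1
    return vector

vocabulary = build_vocabulary(docs)
bag = [count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in bag:
    print(row)
```

Column 0 belongs to 'and', which is why only the third row has a nonzero entry there.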

### Assessing word relevancy via term frequency-inverse document frequency
When we analyze text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. A technique called term frequency-inverse document frequency (tf-idf) can be used to downweight those frequently occurring words in the feature vectors.

Scikit-learn implements yet another transformer, the TfidfTransformer, that takes the raw term frequencies from CountVectorizer as input and transforms them into tf-idfs:


In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
np.set_printoptions(precision = 2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]
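
The numbers above can be reproduced by hand. The sketch below assumes the formulas that scikit-learn documents for TfidfTransformer with its default settings: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count and followed by L2 normalization of each row. The count matrix is copied from the output of In [6], and the function name tfidf_rows is illustrative.

```python
import math

# count matrix from the CountVectorizer output above
counts = [
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [1, 2, 1, 1, 1, 2, 1],
]

def tfidf_rows(count_matrix):
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # document frequency: number of documents that contain each term
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    # smoothed idf, as assumed in the lead-in
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in count_matrix:
        raw = [tf * idf_j for tf, idf_j in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
        result.append([v / norm for v in raw])
    return result

for row in tfidf_rows(counts):
    print([round(v, 2) for v in row])
```

The word 'is' occurs in every document, so it receives the smallest possible idf of 1, whereas 'and' occurs in only one document and receives the largest idf in this example.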


### Cleaning the text data

Before we build our bag-of-words model for the movie review data, we need to clean the text data by stripping it of all unwanted characters. To illustrate why this is important, let's display the last 200 characters from the second document in the reshuffled movie review dataset:

In [8]:
df.loc[1,'review'][-200:]

'uin his masterpiece in such a fashion is beyond me.<br /><br />Avoid this abomination at all cost, as it might spoil the original for you even if watched *after* that, let alone the other way round...'

As we can see here, the text contains HTML markup as well as punctuation and other non-letter characters. While HTML markup does not contain much useful semantics, punctuation marks can represent useful, additional information in certain NLP contexts. However, for simplicity, we'll now remove all punctuation marks and keep only emoticon characters such as ":)", since those are certainly useful for sentiment analysis. To accomplish this task, we'll use Python's regular expression (regex) library, re:

In [9]:
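import re
def preprocessor(text):
    # strip HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons before the punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace non-word characters with spaces, then append
    # the emoticons with their "nose" character (-) removed
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text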


Although the addition of the emoticon characters to the end of the cleaned document strings may not look like the most elegant approach, the order of the words doesn't matter in our bag-of-words model if our vocabulary consists of only 1-word tokens. Before we move ahead, let's confirm that our preprocessor works correctly:

In [10]:
preprocessor(df.loc[1,'review'])[-200:]

'uizer decided to ruin his masterpiece in such a fashion is beyond me avoid this abomination at all cost as it might spoil the original for you even if watched after that let alone the other way round '

In [11]:
preprocessor('</a>This :) is :( a test :-)!')

'this is a test :) :( :)'
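
The emoticon pattern is dense, so here is a sketch of the same expression written with re.VERBOSE, which lets each part carry a comment. The name EMOTICON_RE and the sample sentence are illustrative and not part of the original notebook.

```python
import re

EMOTICON_RE = re.compile(r"""
    (?::|;|=)        # eyes: a colon, semicolon, or equals sign
    (?:-)?           # optional nose: a hyphen
    (?:\)|\(|D|P)    # mouth: closing or opening parenthesis, D, or P
""", re.VERBOSE)

sample = 'Loved it :-) but the ending was sad :( overall =D ;P'
found = EMOTICON_RE.findall(sample)
normalized = [e.replace('-', '') for e in found]  # drop the nose, as preprocessor does

print(found)
print(normalized)
```

Because the pattern only looks at two or three characters, it can also match inside other text, for example the ":P" in a string such as "8:PM", which is a trade-off of keeping the expression this simple.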

Lastly, since we'll make use of the cleaned text data over and over again during the next sections, let's now apply our preprocessor function to all movie reviews in our DataFrame:

In [12]:
df['review'] = df['review'].apply(preprocessor)

## Processing documents into tokens

Having successfully prepared the movie review dataset, we now need to think about how to split the text corpora into individual elements. One way to tokenize documents is to split them into individual words at their whitespace characters:

In [13]:
def tokenizer(text):
    return text.split()

In [14]:
tokenizer('Runners like running and thus they run')

['Runners', 'like', 'running', 'and', 'thus', 'they', 'run']

Another useful technique for tokenizing is word stemming, which is the process of transforming a word into its root form so that related words map to the same stem. The Natural Language Toolkit for Python (NLTK, http://nltk.org) implements the Porter stemming algorithm, which we'll use in the following code:

In [15]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in tokenizer(text)]


In [16]:
tokenizer_porter('Runners like running and thus they run')

['Runner', 'like', 'run', 'and', 'thu', 'they', 'run']

We can see that the word *running* was reduced to its stem *run*, while *thus* was reduced to *thu*, which shows that stemming can produce tokens that are not real words.

Before moving further, let's remove the **stop-words**. Stop-words are words that are extremely common in all sorts of texts and likely bear no (or only little) useful information that can be used to distinguish between different classes of documents. Examples of stop-words are *is*, *and*, and *has*. Removing stop-words can be useful if we are working with raw term frequencies rather than tf-idfs, which already downweight frequently occurring words.

To remove English stop-words, we can use the set of 127 English stop-words that is available from the NLTK library, which can be obtained by calling the nltk.download function:

In [17]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/piyush/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
from nltk.corpus import stopwords 
stop = stopwords.words('english')
[ w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

As we can see, the stop-words *a* and *and* were removed.

## Training a logistic regression model for document classification 

Here, we'll train a logistic regression model to classify the movie reviews into positive and negative reviews. First, we will divide the DataFrame of cleaned text documents into 25,000 documents for training and 25,000 documents for testing:

In [19]:
# .loc slices by label and includes the end label, so stop at 24999
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

Now we'll use a GridSearchCV object to find the optimal set of parameters for our logistic regression model using 5-fold stratified cross-validation.

In [20]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression 
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,
                       lowercase=False,
                       preprocessor=None)

param_grid = [{'vect__ngram_range': [(1,1)],
              'vect__stop_words': [stop, None],
              'vect__tokenizer': [tokenizer, tokenizer_porter],
              'clf__penalty': ['l1', 'l2'],
              'clf__C': [1.0, 10.0, 100.0]},
             {'vect__ngram_range': [(1,1)],
             'vect__stop_words': [stop, None],
             'vect__tokenizer': [tokenizer, tokenizer_porter],
             'vect__use_idf': [False],
             'vect__norm': [None],
             'clf__penalty': ['l1', 'l2'],
             'clf__C': [1.0, 10.0, 100.0]}
             ]
# liblinear supports both the 'l1' and 'l2' penalties searched above
lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0, solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 70.2min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 86.4min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'clf__penalty': ['l1', 'l2'], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's"...ction tokenizer_porter at 0x7fcd2729ca60>], 'clf__penalty': ['l1', 'l2'], 'vect__use_idf': [False]}],
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=1
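
The candidate count reported in the log above follows from the grid definition: every dictionary in param_grid contributes the product of the lengths of its value lists. The sketch below recomputes it in plain Python. The string placeholders stand in for the actual stop-word list and tokenizer functions, since only the number of options matters here, and count_candidates is an illustrative helper name.

```python
from math import prod

# same shape as param_grid above, with placeholders for the real objects
param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]

def count_candidates(grid):
    """Sum, over all dictionaries, the product of the option counts."""
    return sum(prod(len(values) for values in params.values()) for params in grid)

n_candidates = count_candidates(param_grid)
n_folds = 5
print(n_candidates, n_candidates * n_folds)
```

Each dictionary yields 24 combinations, which gives the 48 candidates and, with 5 folds, the 240 fits shown in the log.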

After the grid search has finished, we can print the best parameter set:

In [21]:
print('Best parameter set: %s' % gs_lr_tfidf.best_params_)

Best parameter set: {'clf__C': 10.0, 'vect__stop_words': None, 'vect__ngram_range': (1, 1), 'vect__tokenizer': <function tokenizer at 0x7fcd2729c7b8>, 'clf__penalty': 'l2'}


As we can see here, we obtained the best grid search results using the regular tokenizer without Porter stemming, no stop-word library, and tf-idfs in combination with a logistic regression classifier that uses L2 regularization with C=10.0.

Using the best model from this grid search, let us print the 5-fold cross-validation accuracy score on the training set and the classification accuracy on the test dataset:

In [22]:
print('CV accuracy: %0.3f' % gs_lr_tfidf.best_score_)

CV accuracy: 0.895


In [23]:
clf = gs_lr_tfidf.best_estimator_
print('Test accuracy: %0.3f' % clf.score(X_test, y_test))

Test accuracy: 0.900


The results reveal that our machine learning model can predict whether a movie review is positive or negative with 90 percent accuracy.

## Working with bigger data - online algorithms and out-of-core learning   
The grid search above took about 86 minutes on a simple desktop computer. In many real-world applications, it is not uncommon to work with even larger datasets that may exceed our computer's memory. Since not everyone has access to supercomputer facilities, we will now apply a technique called out-of-core learning, which allows us to work with such large datasets by fitting the classifier incrementally on smaller batches of the data.

We'll make use of an optimization algorithm called **stochastic gradient descent**, which updates the model's weights using one sample at a time. Specifically, we'll use the partial_fit method of the SGDClassifier in scikit-learn to stream the documents directly from our local drive and train a logistic regression model using small minibatches of documents.
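
To illustrate what updating the weights one sample at a time means, here is a minimal pure-Python sketch of a stochastic gradient descent step for logistic regression. It is a simplification under stated assumptions: a fixed learning rate eta, no regularization, and a made-up two-sample toy dataset. SGDClassifier additionally applies regularization and a learning rate schedule, so this is not its actual implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sgd_step(w, b, x, y, eta=0.1):
    """Update the weights and bias from a single (x, y) sample."""
    error = y - predict_proba(w, b, x)  # gradient of the log-likelihood w.r.t. the net input
    w = [wi + eta * error * xi for wi, xi in zip(w, x)]
    return w, b + eta * error

# toy data (assumed for illustration): feature 0 signals positive, feature 1 negative
samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w, b = [0.0, 0.0], 0.0
for _ in range(100):          # 100 passes over the two samples
    for x, y in samples:      # one weight update per sample
        w, b = sgd_step(w, b, x, y)

print(predict_proba(w, b, [1.0, 0.0]), predict_proba(w, b, [0.0, 1.0]))
```

Since each update needs only the current sample, the full dataset never has to be in memory at once, which is what makes this kind of algorithm suitable for out-of-core learning.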

First, we define a tokenizer function that cleans the unprocessed text data from the movie_data.csv file and separates it into word tokens while removing stop words:

In [24]:
import numpy as np 
import re 
from nltk.corpus import stopwords 
stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]  # drop stop words
    return tokenized


Next we define a generator function, stream_docs, that reads in and returns one document at a time.

In [25]:
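def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # each line ends with ',<label>\n': drop those 3 characters to
            # get the text and read the single-digit label before the newline
            text, label = line[:-3], int(line[-2])
            yield text, label

To verify that our stream_docs function works correctly, let's read in the first document from the movie_data.csv file: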

In [26]:
next(stream_docs(path='./movie_data.csv'))

('"Just watched this movie over the weekend, and I must say I thoroughly enjoyed it. The 2 Italo American actors are excellent as usual (Michael Imperioli and John Ventimiglia). It is obvious that the director was influenced by 2 great films of the past directed by Italians. Primarily he was influenced by Dino Risi and his film IL SORPASSO. It is the story of 2 young men who meet by chance and become friends. One is extroverted and the other is introverted. They enjoy the whole day together and by the end of the day, the shy one learns that there is more to life than his usual routine monotony. The same thing happens to Albert De Santi. Unfortunately, IL SORPASSO has a very similar ending and this apparently influenced the director of ON THE RUN because he uses the same technique but with a twist. I had expected something but was surprised to see that it turned out to be the opposite. If you watch both movies you will understand. The other film that influenced the director is AFTER HOU
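
The slicing in stream_docs relies on every line of movie_data.csv ending with a comma, a single-digit label, and a newline. The sketch below applies the same slices to a made-up line to show what each piece returns. The sample text is invented for illustration.

```python
# a made-up line in the same layout as movie_data.csv
line = '"A short, made-up review.",1\n'

text = line[:-3]        # everything except the trailing ',1\n'
label = int(line[-2])   # the character just before the newline

print(repr(text))
print(label)
```

The quotation marks written around the review stay in the text, as the output of In [26] also shows, and a final line without a trailing newline would not parse with these slices.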

We'll now define a function, get_minibatch, that takes a document stream from the stream_docs function and returns a particular number of documents specified by the size parameter:

In [27]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
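
One detail of get_minibatch is easy to miss: when the stream runs out in the middle of a batch, the function returns (None, None) and the documents already collected for that batch are dropped. The sketch below repeats the function so that it runs on its own and feeds it a made-up five-document stream. The generator toy_stream is a hypothetical stand-in for stream_docs.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

def toy_stream():
    """Yield five (text, label) pairs, standing in for stream_docs."""
    for i in range(5):
        yield 'doc %d' % i, i % 2

stream = toy_stream()
first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)
third = get_minibatch(stream, size=2)   # only one document is left in the stream

print(first)
print(second)
print(third)
```

In this notebook the stream holds exactly 50,000 documents, consumed as 45 batches of 1,000 plus one batch of 5,000, so no partial batch occurs.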


Unfortunately, we can't use the *CountVectorizer* for out-of-core learning since it requires holding the complete vocabulary in memory. Also, the *TfidfVectorizer* needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, another useful vectorizer for text processing implemented in scikit-learn is the *HashingVectorizer*, which is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 algorithm by Austin Appleby.

In [28]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error='ignore',
                        n_features=2**21,
                        preprocessor=None,
                        tokenizer=tokenizer)
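# loss='log_loss' (named 'log' before scikit-learn 1.1) selects logistic regression;
# the old n_iter argument was removed, and partial_fit runs one pass per call anyway
clf = SGDClassifier(loss='log_loss', random_state=1)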
doc_stream = stream_docs(path='./movie_data.csv')
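
The hashing trick mentioned above can be sketched in a few lines of plain Python. A hash function maps each token straight to a column index, so no vocabulary has to be stored or fitted. The sketch below is only an illustration: it uses MD5 from hashlib because MurmurHash3 is not in the standard library, it uses a tiny n_features, and it omits details of HashingVectorizer such as signed hashing and normalization. The helper names are hypothetical.

```python
import hashlib

def token_index(token, n_features):
    """Map a token to a column index without consulting any vocabulary."""
    digest = hashlib.md5(token.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'little') % n_features

def hashed_counts(tokens, n_features=16):
    """Build a fixed-length count vector; distinct tokens may collide."""
    vector = [0] * n_features
    for tok in tokens:
        vector[token_index(tok, n_features)] += 1
    return vector

a = hashed_counts(['great', 'movie', 'great', 'cast'])
b = hashed_counts(['great'])

print(a)
print(b)
```

With only 16 columns, collisions between different tokens are likely. Choosing a large feature space, such as the 2**21 columns used above, makes them much rarer.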

Having set up all the complementary functions, we can now start the out-of-core learning using the following code:

In [29]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:37


We initialized the progress bar object with 45 iterations and, in the following for loop, iterated over 45 minibatches of 1,000 documents each.  
Having completed the incremental learning process, we'll use the last 5,000 documents to evaluate the performance of the model.

In [30]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %0.3f' % clf.score(X_test, y_test))

Accuracy: 0.868


As we can see, the accuracy of the model is 87 percent, slightly below the accuracy that we achieved using the grid search for hyperparameter tuning. However, out-of-core learning is very memory-efficient, and the training loop took less than a minute to complete. Finally, we can use the last 5,000 documents to update the model:

In [31]:
clf = clf.partial_fit(X_test, y_test)

## Summary

In this chapter I learned how to use machine learning algorithms to classify text documents based on their polarity, which is a basic task in sentiment analysis in the field of natural language processing.