In [1]:
# Sentiment analysis on 50k imdb movie reviews

import tarfile
with tarfile.open('/Users/mike/Downloads/aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall()

In [2]:
import pandas as pd
import os

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
rows = []  # collect [text, label] pairs first; DataFrame.append was removed in pandas 2.0

for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])

df = pd.DataFrame(rows)

In [3]:
df.columns = ['review', 'sentiment']
df.head()

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1


In [4]:
import numpy as np

# Shuffle data since they are currently sorted
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [5]:
# Confirm csv is saved by trying to read it back
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [6]:
df.shape

(50000, 2)

<h3> word2vec </h3>

A more modern alternative to the bag-of-words model is <b>word2vec</b>, an algorithm that Google released in 2013. word2vec is an unsupervised learning algorithm based on neural networks that attempts to learn the relationships between words automatically. The idea is to map words with similar meanings to nearby points in a vector space, so that certain relationships between words can be reproduced with simple vector arithmetic, for example, <i>king - man + woman = queen</i>.

https://code.google.com/archive/p/word2vec/
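To make the vector arithmetic concrete, here is a minimal sketch using hand-picked toy vectors. The three-dimensional values and the helper names `cosine` and `analogy` are assumptions made up for illustration; real word2vec embeddings are learned from a large corpus and typically have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only
toy_vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'apple': [0.0, 0.3, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy('king', 'man', 'woman', toy_vectors)
print(result)
```

With these toy values the nearest remaining word is 'queen'; with learned embeddings the same lookup is usually done over the whole vocabulary.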

In [7]:
# Transform the text to numerical feature vectors (bag-of-words model)
# Construct bag-of-words model based on word counts in the documents using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

oneGramCount = CountVectorizer()
twoGramCount = CountVectorizer(ngram_range=(2,2))

docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'
])

# transform the three sentences into sparse feature vectors
bag = oneGramCount.fit_transform(docs)
# The values represent the index at which the word is stored
print(oneGramCount.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [8]:
# The first index position here represents the word 'and' (index 0 from vocabulary_ dict). It is 0 except for 
# the last document, where 'and' appears twice. These values are called raw term frequencies tf(t,d) --
# the number of times a term t occurs in a document d
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


The sequence of items in the bag-of-words model just created is also called the <b>1-gram</b> or <b>unigram</b> model -- each item or token in the vocabulary represents a single word. More generally, contiguous sequences of items in NLP -- words, letters, or symbols -- are called <b>n-grams</b>. n-grams of size 3 and 4 have been reported to perform well for anti-spam filtering of email messages. The 1-gram and 2-gram representations of the first document, "the sun is shining", would be:

* 1-gram: "the", "sun", "is", "shining"
* 2-gram: "the sun", "sun is", "is shining"

We can use different n-gram models with `CountVectorizer` (1-gram is the default).
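The n-gram idea itself needs no library. The sketch below uses a hypothetical helper, `word_ngrams`, that lowercases and splits on whitespace. This is a simplification: `CountVectorizer` applies its own token pattern, which by default drops single-character tokens.

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = word_ngrams('The sun is shining', 1)
bigrams = word_ngrams('The sun is shining', 2)
print(unigrams)
print(bigrams)
```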

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(oneGramCount.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As we saw in the previous section, the word 'is' had the largest term frequency in the third document (it appeared 3 times). After transforming the same feature vector into tf-idfs, we see that the word 'is' is now associated with a relatively small tf-idf (0.45) in the third document. This is because it is also present in the first and second documents and thus is unlikely to contain any useful or discriminatory information.
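The 0.45 can be reproduced by hand. The sketch below assumes scikit-learn's smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The counts and document frequencies are copied from the three example sentences, and the variable names are only illustrative.

```python
import math

# Raw counts for the third document, in vocabulary order:
# and, is, one, shining, sun, sweet, the, two, weather
term_freq = [2, 3, 2, 1, 1, 1, 2, 1, 1]
# Number of documents (out of 3) that contain each term
doc_freq = [1, 3, 1, 2, 2, 2, 3, 1, 2]
n_docs = 3

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]
raw = [t * w for t, w in zip(term_freq, idf)]

# L2-normalize the row so that its squared entries sum to 1
norm = math.sqrt(sum(x * x for x in raw))
tfidf_row = [x / norm for x in raw]
print([round(x, 2) for x in tfidf_row])
```

Because 'is' occurs in all three documents, its idf is exactly 1, the smallest possible value, which is why its weight falls below that of 'and' even though it appears more often.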

In [16]:
# Fit the bigram vectorizer first; vocabulary_ only exists after fitting
bigram_bag = twoGramCount.fit_transform(docs)
print(twoGramCount.vocabulary_, '\n')
print(tfidf.fit_transform(bigram_bag).toarray())

{'the sun': 9, 'sun is': 7, 'is shining': 1, 'the weather': 10, 'weather is': 11, 'is sweet': 2, 'shining the': 6, 'sweet and': 8, 'and one': 0, 'one and': 4, 'one is': 5, 'is two': 3} 

[[0.   0.58 0.   0.   0.   0.   0.   0.58 0.   0.58 0.   0.  ]
 [0.   0.   0.58 0.   0.   0.   0.   0.   0.   0.   0.58 0.58]
 [0.57 0.22 0.22 0.28 0.28 0.28 0.28 0.22 0.28 0.22 0.22 0.22]]


In [11]:
# clean the text data
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [17]:
import re

def preprocessor(text):
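    # Strip HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons like :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to spaces, then append the emoticons without noses
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))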
    return text

In [18]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [19]:
preprocessor('</a>This :) is :( a test :-)!')

'this is a test :) :( :)'

In [20]:
# clean entire df
df['review'] = df['review'].apply(preprocessor)

In [21]:
# Tokenize documents by splitting each text into individual words
def tokenizer(text):
    return text.split()
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [22]:
# word stemming by transforming word to its root form, see below
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.lower().split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

The Porter stemming algorithm is one of the oldest and simplest stemming algorithms. There are other stemming algorithms including the <b>Snowball stemmer</b> (Porter2 or English stemmer) and the <b>Lancaster stemmer</b> (Paice/Husk stemmer), which is faster but also more aggressive than the Porter stemmer. These are also available via NLTK.

Note that in the above example, stemming can create non-real words, such as 'thu' (from 'thus'). A technique called <b>lemmatization</b> aims to obtain the canonical (grammatically correct) forms of individual words -- the so-called <b>lemmas</b>. However, lemmatization is computationally more difficult and expensive than stemming and, in practice, it has been observed that stemming and lemmatization have little impact on the performance of text classification.

Another useful topic is <b>stop-word removal</b>. Stop-words are words that are extremely common in all sorts of texts and probably bear no (or little) useful information for distinguishing between different classes of documents. Examples of stop-words are <i>is, and, has, like</i>. Removing stop-words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs, which already downweight frequently occurring words.

In order to remove the stop-words from the movie reviews, we will use the set of 179 English stop-words that are available in the NLTK library.

In [23]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/mike/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
len(stop)

179

In [25]:
[w for w in tokenizer_porter('a runner likes running and thus they run') if w not in stop]

['runner', 'like', 'run', 'thu', 'run']

In [39]:
# Train logistic regression model to classify the movie reviews as positive or negative
# Divide df into 25k documents for training and 25k for testing
# (.loc slices include the end label, so stop at 24999 to avoid overlapping the test set)
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

In [31]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
param_grid = [
    {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [stop, None],
    'vect__tokenizer': [tokenizer, tokenizer_porter],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0]
    },
    {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [stop, None],
    'vect__tokenizer': [tokenizer, tokenizer_porter],
    'vect__use_idf': [False],
    'vect__norm': [None],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0]
    }
]

In [32]:
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf',
                     LogisticRegression(random_state=0))])
# Setting n_jobs=-1 uses all cores on the machine to speed up the grid search
# (this may not work on Windows machines).
# Note: the default lbfgs solver does not support the 'l1' penalty, so those
# candidates fail to fit and are scored as nan in the output below.
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train);

Fitting 5 folds for each of 48 candidates, totalling 240 fits


 0.9  0.89  nan  nan  nan  nan 0.89 0.88 0.89 0.88  nan  nan  nan  nan
 0.88 0.87 0.88 0.88  nan  nan  nan  nan 0.87 0.86 0.88 0.87  nan  nan
  nan  nan 0.87 0.86 0.88 0.87]
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [33]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f9229372e50>} 


As the output above shows, the best grid search result came from the regular `tokenizer` without Porter stemming, no stop-word removal, and tf-idfs, combined with a logistic regression classifier using L2 regularization with the inverse regularization parameter `C` set to 10.0.
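The grid size reported in the fitting log (48 candidates, 240 fits) can be checked by counting combinations. In this sketch the string placeholders stand in for the real stop-word list and tokenizer functions, since only the number of options per key matters.

```python
from itertools import product

# Placeholders for the real objects; only the number of options per key matters here
grid_tfidf = {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': ['stop', None],
    'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0],
}
# The second grid adds two single-valued settings, so it has the same size
grid_raw_tf = dict(grid_tfidf, vect__use_idf=[False], vect__norm=[None])

n_candidates = sum(len(list(product(*g.values()))) for g in (grid_tfidf, grid_raw_tf))
n_fits = n_candidates * 5  # one fit per candidate per fold, with cv=5
print(n_candidates, n_fits)
```

Each dictionary contributes 1 x 2 x 2 x 2 x 3 = 24 candidates, which is why adding even one more option to a single key noticeably increases the run time.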

Let's use the best model from this grid search and print the average 5-fold cross-validation accuracy scores on the training set and the classification accuracy on the test dataset.

In [42]:
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

CV Accuracy: 0.897


In [43]:
clf = gs_lr_tfidf.best_estimator_

In [44]:
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.899


In [53]:
clf.predict(["This movie is awesome"]), clf.predict(["This movie sucks!"])

(array([1]), array([0]))

This shows that we are able to predict whether a movie review is positive or negative with nearly 90% accuracy. Another popular classifier for text classification is the Naïve Bayes classifier, which gained its popularity in applications of email spam filtering. Naïve Bayes classifiers are easy to implement, computationally efficient, and tend to perform particularly well on relatively small datasets compared to other algorithms. Here is an article discussing them on arXiv: https://arxiv.org/pdf/1410.5329v3.pdf
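To illustrate why Naïve Bayes is considered easy to implement, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing in plain Python. The function names and the four toy documents are assumptions for illustration; this is not the implementation from the linked article, and it is not tuned for the IMDb data.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                # add-one (Laplace) smoothing avoids log(0) for unseen pairs
                score += math.log((count + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

toy_docs = ['great fun movie', 'loved this great film',
            'boring awful movie', 'awful plot and boring film']
toy_labels = [1, 1, 0, 0]
model = train_nb(toy_docs, toy_labels)
pred_pos = predict_nb('great film', model)
pred_neg = predict_nb('boring plot', model)
print(pred_pos, pred_neg)
```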


<h3>Working with bigger data -- online algorithms and out-of-core learning</h3>

The code to train the grid search above took 32 minutes to run. It can be computationally quite expensive to construct the feature vectors for the 50,000 movie review dataset during grid search. In many real-world applications, it is not uncommon to work with even larger datasets that can exceed our computer's memory. Since not everyone has access to supercomputer facilities, there's a technique called <b>out-of-core learning</b>, which allows us to work with such large datasets by fitting the classifier incrementally on smaller batches of the dataset. We've already discussed <b>stochastic gradient descent</b>, which is an optimization algorithm that updates the model's weights using one sample at a time. In this section, we will make use of the `partial_fit` function of the `SGDClassifier` in scikit-learn to stream the documents directly from our local drive, and train a logistic regression model using small mini-batches of documents.
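As a reminder of what updating the weights using one sample at a time means, the following sketch trains a tiny logistic regression model with plain stochastic gradient descent. The toy data, the learning rate, and the helper names are assumptions chosen for illustration; `SGDClassifier` adds regularization and a learning-rate schedule on top of this basic idea.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, bias, x, y, lr=0.1):
    """Return updated parameters after one gradient step on a single (x, y) sample."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    error = p - y  # gradient of the log loss with respect to the net input
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return new_weights, bias - lr * error

# Toy, linearly separable data: the label is 1 when the first feature dominates
samples = [([2.0, 0.5], 1), ([1.5, 0.2], 1), ([0.3, 1.8], 0), ([0.1, 2.2], 0)]

random.seed(0)
weights, bias = [0.0, 0.0], 0.0
for epoch in range(50):
    random.shuffle(samples)
    for x, y in samples:  # one sample at a time
        weights, bias = sgd_step(weights, bias, x, y)

predictions = [int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)
               for x, _ in samples]
print(predictions)
```

Because each step needs only the current sample, the full dataset never has to be in memory at once, which is what makes the streaming approach in this section possible.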

First, we define a `tokenizer` function that cleans the unprocessed text data from the `movie_data.csv` file that we constructed at the beginning of this notebook and separates it into word tokens while removing stop words.

In [55]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')

def tokenizer(text):
    # Raw strings keep the regex escapes from being read as (invalid) string escapes
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

In [56]:
def stream_docs(path):
    """Generator that reads and returns one document at a time"""
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [57]:
# should return tuple consisting of review text and class label
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [58]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

We cannot use `CountVectorizer` for out-of-core learning since it requires holding the complete vocabulary in memory. Also, `TfidfVectorizer` needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, another useful vectorizer for text processing implemented in scikit-learn is `HashingVectorizer` -- it is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function https://en.wikipedia.org/wiki/MurmurHash
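The hashing trick can be sketched in a few lines. Note the assumption: `zlib.crc32` is used here only as a convenient 32-bit hash from the standard library, standing in for MurmurHash3, and the helper name `hash_features` is hypothetical. The point is that the column index is computed from the token itself, so no vocabulary has to be stored or fitted.

```python
import zlib

def hash_features(tokens, n_features=2**10):
    """Map tokens to a fixed-size count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in tokens:
        # Stand-in hash (CRC32); scikit-learn's HashingVectorizer uses MurmurHash3
        index = zlib.crc32(token.encode('utf-8')) % n_features
        vector[index] += 1
    return vector

vec_a = hash_features('the sun is shining'.split())
vec_b = hash_features('the weather is sweet'.split())
print(len(vec_a), sum(vec_a), sum(vec_b))
```

Two different tokens can land on the same index (a hash collision); a larger `n_features` makes that less likely at the cost of a longer vector.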

In [60]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer
                        )
# Note: scikit-learn 1.1 renamed this loss to 'log_loss', and 1.3 removed the 'log' alias
clf = SGDClassifier(loss='log', random_state=1, max_iter=1)
doc_stream = stream_docs(path='movie_data.csv')

The above cell initialized a `HashingVectorizer` with our tokenizer function and set the number of features to `2**21`. We reinitialized a logistic regression classifier by setting the `loss` parameter of the `SGDClassifier` to `log`. Note that by using a large number of features in the `HashingVectorizer`, we reduce the chance of causing hash collisions, but we also increase the number of coefficients in our logistic regression model. Having set up all the complementary functions, we can now start the out-of-core learning using the code in the cell below.

In [61]:
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)

Here we iterated over 45 mini-batches of documents where each mini-batch consists of 1,000 documents. Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model.
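The reason the training batches and the held-out documents never overlap is that the generator keeps its position between calls. The sketch below shows this on a hypothetical 10-document stream. `toy_stream` and `take_batch` are illustrative stand-ins that follow the same contract as `stream_docs` and `get_minibatch`.

```python
def toy_stream(n):
    """Hypothetical stand-in for stream_docs: yields (text, label) pairs."""
    for i in range(n):
        yield 'doc %d' % i, i % 2

def take_batch(stream, size):
    """Return a full batch, or (None, None) once the stream runs dry."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

stream = toy_stream(10)
first, _ = take_batch(stream, 4)    # docs 0-3
second, _ = take_batch(stream, 4)   # docs 4-7
third, _ = take_batch(stream, 4)    # only 2 docs left, so the partial batch is dropped
print(first, second, third)
```

With 50,000 reviews, 45 batches of 1,000 plus one batch of 5,000 divide the file exactly, so no documents are lost to the dropped-partial-batch behavior.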

In [62]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.868


As we can see, the accuracy of the model is around 87%, only slightly below the accuracy achieved with the grid search for hyperparameter tuning. However, out-of-core learning is extremely memory efficient and took less than a minute to complete. Finally, we can use the last 5,000 documents to update our model.

In [65]:
clf = clf.partial_fit(X_test, y_test)

<h3>Topic Modeling with Latent Dirichlet Allocation (LDA)</h3>

Topic modeling describes the broad task of assigning topics to unlabelled text documents. For example, a typical application would be the categorization of documents in a large corpus of newspaper articles where we don't know on which specific page or in which category they appear.

In applications of topic modeling, we aim to assign category labels to those articles -- for example, sports, finance, world news, politics, local news, etc. We can consider topic modeling as a clustering task, a subcategory of unsupervised learning.

In [66]:
# TODO: finish out this section, do Flask Web App section first

<h3>Serializing fitted scikit-learn estimators</h3>

Training a machine learning model can be computationally expensive, as we have seen. We do not want to retrain our model every time we close our Python interpreter and want to make a new prediction or reload our web application. One option for model persistence is Python's built-in `pickle` module, which allows us to serialize and deserialize Python object structures to a compact byte stream, so that we can save our classifier in its current state and reload it when we want to classify new samples, without having to learn from the training data all over again.

Before executing the following code, make sure you have trained the out-of-core logistic regression model from above.

In [67]:
import pickle
import os

dest = os.path.join('movieclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)

In [68]:
# Use a context manager so the file is flushed and closed after writing
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)

In [69]:
# Use a context manager so the file is flushed and closed after writing
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)

Using this code, we created a `movieclassifier` directory where we will later store the files and data for our web application. Within the `movieclassifier` directory, we created a `pkl_objects` subdirectory to save the serialized Python objects to our local drive. Via the `dump` function, we then serialized the trained logistic regression model as well as the stop-word list from the NLTK library, so that we don't have to install the NLTK vocabulary on our server.

The `dump` function takes as its first argument the object that we want to pickle; as the second argument, we provide an open file object that the Python object will be written to. By passing `wb` to the `open` function, we open the file in binary write mode for pickle, and we set `protocol=4`, which was the latest and most efficient protocol when this code was first written. (Protocol 5 has been available since Python 3.8, but I'll still use 4 here for consistency: https://docs.python.org/3/library/pickle.html).
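The full round trip, including loading the object back, can be sketched with a stand-in object. The dictionary `toy_model` and the file name are hypothetical; any picklable object, including a fitted scikit-learn estimator, is handled the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: any picklable Python object works
toy_model = {'weights': [0.25, -1.5, 3.0], 'classes': [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    # Write in binary mode; the with-block closes the file when done
    with open(path, 'wb') as f:
        pickle.dump(toy_model, f, protocol=4)
    # Read it back in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

Since unpickling can execute arbitrary code, `pickle.load` should only be used on files from a trusted source.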

Note that our logistic regression model contains several NumPy arrays, such as the weight vector, and a more efficient way to serialize NumPy arrays is to use the alternative `joblib` library. To ensure compatibility with the server environment used later on, we will use the standard pickle approach.