# Chapter 8 - Applying Machine Learning To Sentiment Analysis

In the modern internet and social media age, people’s opinions, reviews, and recommendations have
become a valuable resource for political science and businesses. Thanks to modern technologies, we are
now able to collect and analyze such data most efficiently. In this chapter, we will delve into a subfield
of natural language processing (NLP) called sentiment analysis and learn how to use machine learning
algorithms to classify documents based on their sentiment: the attitude of the writer. In particular, we
are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb)
and build a predictor that can distinguish between positive and negative reviews.


The topics that we will cover in this chapter include the following:

• Cleaning and preparing text data
• Building feature vectors from text documents
• Training a machine learning model to classify positive and negative movie reviews
• Working with large text datasets using out-of-core learning
• Inferring topics from document collections for categorization

### Overview

- [Preparing the IMDb movie review data for text processing](#Preparing-the-IMDb-movie-review-data-for-text-processing)
  - [Obtaining the IMDb movie review dataset](#Obtaining-the-IMDb-movie-review-dataset)
  - [Preprocessing the movie dataset into more convenient format](#Preprocessing-the-movie-dataset-into-more-convenient-format)
- [Introducing the bag-of-words model](#Introducing-the-bag-of-words-model)
  - [Transforming words into feature vectors](#Transforming-words-into-feature-vectors)
  - [Assessing word relevancy via term frequency-inverse document frequency](#Assessing-word-relevancy-via-term-frequency-inverse-document-frequency)
  - [Cleaning text data](#Cleaning-text-data)
  - [Processing documents into tokens](#Processing-documents-into-tokens)
- [Training a logistic regression model for document classification](#Training-a-logistic-regression-model-for-document-classification)
- [Working with bigger data – online algorithms and out-of-core learning](#Working-with-bigger-data-–-online-algorithms-and-out-of-core-learning)
- [Topic modeling](#Topic-modeling)
  - [Decomposing text documents with Latent Dirichlet Allocation](#Decomposing-text-documents-with-Latent-Dirichlet-Allocation)
  - [Latent Dirichlet Allocation with scikit-learn](#Latent-Dirichlet-Allocation-with-scikit-learn)
- [Summary](#Summary)

In [1]:
from IPython.display import Image
%matplotlib inline

# Preparing the IMDb movie review data for text processing 

## Obtaining the IMDb movie review dataset

The IMDb movie review dataset can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).
After downloading the dataset, decompress the files.

A) If you are working with Linux or macOS, open a new terminal window, `cd` into the download directory, and execute:

`tar -zxf aclImdb_v1.tar.gz`

B) If you are working with Windows, download an archiver such as [7Zip](http://www.7-zip.org) to extract the files from the download archive.

**Optional code to download and unzip the dataset via Python:**

In [2]:
import os
import sys
import tarfile
import time
import urllib.request

source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'

if os.path.exists(target):
    os.remove(target)

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write(f'\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB '
                     f'| {speed:.2f} MB/s | {duration:.2f} sec elapsed')
    sys.stdout.flush()


if not os.path.isdir('aclImdb') and not os.path.isfile('aclImdb_v1.tar.gz'):
    urllib.request.urlretrieve(source, target, reporthook)

In [3]:
if not os.path.isdir('aclImdb'):

    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()

## Preprocessing the movie dataset into more convenient format

Install PyPrind by running the next code cell.

In [4]:
!pip install -q  pyprind

Having successfully extracted the dataset, we will now assemble the individual text documents from
the decompressed download archive into a single CSV file. In the following code section, we will be
reading the movie reviews into a pandas DataFrame object, which can take up to 10 minutes on a
standard desktop computer.

To visualize the progress and estimated time until completion, we will use the Python Progress Indicator
(PyPrind, https://pypi.python.org/pypi/PyPrind/) package, which was developed several years
ago for such purposes. PyPrind can be installed by executing the `pip install pyprind` command.

In [5]:
import pyprind
import pandas as pd
import os
import sys
from packaging import version


# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}

# if the progress bar does not show, change stream=sys.stdout to stream=2
pbar = pyprind.ProgBar(50000, stream=sys.stdout)

df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
                
            if version.parse(pd.__version__) >= version.parse("1.3.2"):
                x = pd.DataFrame([[txt, labels[l]]], columns=['review', 'sentiment'])
                # ignore_index=True keeps a running 0..n-1 index, as in the else branch
                df = pd.concat([df, x], ignore_index=True)

            else:
                df = df.append([[txt, labels[l]]], 
                               ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:46


In the preceding code, we first initialized a new progress bar object, `pbar`, with 50,000 iterations, which
was the number of documents we were going to read in. Using the nested `for` loops, we iterated over
the `train` and `test` subdirectories in the main `aclImdb` directory and read the individual text files
from the `pos` and `neg` subdirectories that we eventually appended to the `df` pandas `DataFrame`, together
with an integer class label (1 = positive and 0 = negative).
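
Growing a `DataFrame` one row at a time copies the accumulated data on every iteration. As a hedged alternative (a sketch, not the book's code), the rows can be collected in a plain Python list and converted once at the end. The helper name `load_reviews` is hypothetical, and the function assumes the same `{test,train}/{pos,neg}` directory layout as above.

```python
import os
import pandas as pd


def load_reviews(basepath, labels=None):
    """Read all review files below `basepath` into a single DataFrame.

    Hypothetical helper; assumes the layout
    basepath/{test,train}/{pos,neg}/<review files>.
    """
    if labels is None:
        labels = {'pos': 1, 'neg': 0}
    rows = []
    for split in ('test', 'train'):
        for label_name, label in labels.items():
            path = os.path.join(basepath, split, label_name)
            for fname in sorted(os.listdir(path)):
                with open(os.path.join(path, fname),
                          'r', encoding='utf-8') as infile:
                    rows.append((infile.read(), label))
    # Build the DataFrame once instead of concatenating inside the loop
    return pd.DataFrame(rows, columns=['review', 'sentiment'])
```

Calling `load_reviews('aclImdb')` would then return a `DataFrame` with the same two columns, `review` and `sentiment`, without the progress bar.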

Since the class labels in the assembled dataset are sorted, we will now shuffle the `DataFrame`, using
its `sample` method on recent pandas versions and the `permutation` function from the `np.random`
submodule on older ones. This will be useful for splitting the dataset into training and test datasets
in later sections, when we will stream the data from our local drive directly. For our own convenience,
we will also store the assembled and shuffled movie review dataset as a CSV file.

Shuffling the DataFrame:

In [6]:
import numpy as np


if version.parse(pd.__version__) >= version.parse("1.3.2"):
    df = df.sample(frac=1, random_state=0).reset_index(drop=True)
    
else:
    np.random.seed(0)
    df = df.reindex(np.random.permutation(df.index))

Optional: Saving the assembled data as CSV file:

In [7]:
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [8]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')

# the following is necessary on some computers:
df = df.rename(columns={"0": "review", "1": "sentiment"})

df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


In [9]:
df.shape

(50000, 2)

---

### Note

If you have problems creating the `movie_data.csv` file, you can download a zip archive of it from
https://github.com/rasbt/machine-learning-book/tree/main/ch08/

---

# Introducing the bag-of-words model

You may remember from Chapter 4, Building Good Training Datasets – Data Preprocessing, that we have
to convert categorical data, such as text or words, into a numerical form before we can pass it on to a
machine learning algorithm. In this section, we will introduce the bag-of-words model, which allows
us to represent text as numerical feature vectors. The idea behind bag-of-words is quite simple and
can be summarized as follows:

1. We create a vocabulary of unique tokens—for example, words—from the entire set of documents.
2. We construct a feature vector from each document that contains the counts of how often each
word occurs in the particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-
words vocabulary, the feature vectors will mostly consist of zeros, which is why we call them sparse.
Do not worry if this sounds too abstract; in the following subsections, we will walk through the process
of creating a simple bag-of-words model step by step.
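
The two steps above can be sketched in plain Python before turning to a library implementation. This is only an illustration on three short example sentences. The tokenization rule (lowercase, keep runs of at least two word characters) is an assumption made for this sketch, and `simple_tokens` is a hypothetical helper.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']


def simple_tokens(text):
    # Lowercase the text and keep runs of two or more word characters
    # (an assumption for this sketch; single-character tokens are dropped)
    return re.findall(r'\b\w\w+\b', text.lower())


# Step 1: vocabulary of unique tokens, mapped to alphabetical indices
unique_tokens = sorted({tok for d in docs for tok in simple_tokens(d)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

# Step 2: one count vector per document, ordered by vocabulary index
vectors = []
for d in docs:
    counts = Counter(simple_tokens(d))
    vectors.append([counts[tok] for tok in unique_tokens])

print(vocabulary)
for v in vectors:
    print(v)
```

Most entries of the first two vectors are zero because each short sentence uses only a few of the nine vocabulary words. This is the sparsity described above.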

## Transforming words into feature vectors

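To construct a bag-of-words model based on the word counts in the respective documents, we can use the `CountVectorizer` class implemented in scikit-learn. By calling its `fit_transform` method in the following code, we construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
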
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [10]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

Now let us print the contents of the vocabulary to get a better understanding of the underlying concepts:

In [11]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary, which maps the unique words to integer indices. Next let us print the feature vectors that we just created:

In [12]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


Each index position in the feature vectors shown here corresponds to the integer values that are stored as dictionary items in the `CountVectorizer` vocabulary. For example, the first feature at index position 0 represents the count of the word "and", which only occurs in the last document, and the word "is" at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences. Those values in the feature vectors are also called the raw term frequencies, *tf(t, d)*: the number of times a term *t* occurs in a document *d*.

It should be noted that, in the bag-of-words model, the word or term
order in a sentence or document does not matter. The order in which the term frequencies appear in
the feature vector is derived from the vocabulary indices, which are usually assigned alphabetically.

## Assessing word relevancy via term frequency-inverse document frequency

In [13]:
np.set_printoptions(precision=2)

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweigh those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

Here the tf(t, d) is the term frequency that we introduced in the previous section,
and the inverse document frequency *idf(t, d)* can be calculated as:

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*. Note that adding the constant 1 to the denominator is optional and serves the purpose of avoiding a division by zero for terms that occur in none of the training examples; the log is used to ensure that low document frequencies are not given too much weight.
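
As a small worked example of this textbook formula, the sketch below evaluates the idf for three terms of the three-sentence corpus: "and" (in one document), "sun" (in two), and "is" (in all three). The natural logarithm is an assumption, since the formula leaves the base unspecified, and `textbook_idf` is a hypothetical helper name.

```python
import math

n_docs = 3
# Document frequencies of three terms in the three example sentences
doc_freq = {'and': 1, 'sun': 2, 'is': 3}


def textbook_idf(df_t, n_d):
    # idf(t, d) = log(n_d / (1 + df(d, t))), using the natural logarithm
    # (the base is an assumption made for this sketch)
    return math.log(n_d / (1 + df_t))


idf = {term: textbook_idf(df_t, n_docs) for term, df_t in doc_freq.items()}
for term, value in idf.items():
    print(f'{term}: {value:.3f}')
```

The rarer a term, the larger its idf. With the constant 1 in the denominator, a term that occurs in every document even receives a slightly negative value, which is one reason why implementations often use a modified definition.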

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As we saw in the previous subsection, the word "is" had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word "is" is now associated with a relatively small tf-idf (0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.


However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the `TfidfTransformer` calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation that was implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the `TfidfTransformer` normalizes the tf-idfs directly.

By default (`norm='l2'`), scikit-learn's TfidfTransformer applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector *v* by its L2-norm:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

To make sure that we understand how `TfidfTransformer` works, let us walk through an example and calculate the tf-idf of the word "is" in the 3rd document.

The word "is" has a term frequency of 3 (tf = 3) in document 3 ($d_3$), and the document frequency of this term is 3 since the term "is" occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:

$$\text{idf}("is", d_3) = \log \frac{1+3}{1+3} = 0$$

Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

$$\text{tf-idf}("is", d_3)= 3 \times (0+1) = 3$$

In [15]:
tf_is = 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print(f'tf-idf of term "is" = {tfidf_is:.2f}')

tf-idf of term "is" = 3.00


If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vector: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]. However, we notice that the values in this feature vector are different from the values that we obtained from the `TfidfTransformer` that we used previously. The final step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:

$$\text{tf-idf}_{\text{norm}} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{\text{norm}}("is", d_3) = 0.45$$

As we can see, the results match the results returned by scikit-learn's `TfidfTransformer` (below). Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.

In [16]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 

array([3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29])

In [17]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])

## Cleaning text data

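Before we build our bag-of-words model, the first important step is to clean the text data by stripping
it of all unwanted characters. To illustrate why this is important, let’s display the last 50 characters
from the first document in the reshuffled movie review dataset: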

In [18]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

As we can see, the text contains HTML markup as well as punctuation and other non-letter characters.
For simplicity, we will remove the markup and all punctuation marks except for emoticon characters,
such as :), since those are certainly useful for sentiment analysis.

Via the first regex, `<[^>]*>`, in the following code section, we try to remove all of the HTML markup
from the movie reviews. Although many programmers generally advise against the use of regex to parse
HTML, this regex should be sufficient to clean this particular dataset. Since we are only interested in
removing HTML markup and do not plan to use the HTML markup further, using regex to do the job
should be acceptable. However, if you prefer to use sophisticated tools for removing HTML markup
from text, you can take a look at Python’s HTML parser module, which is described at
https://docs.python.org/3/library/html.parser.html. After removing the HTML markup, we use a slightly
more complex regex to find emoticons, which we temporarily store as `emoticons`. Next, we remove all
non-word characters from the text via the regex `[\W]+` and convert the text into lowercase characters.

In [19]:
import re
def preprocessor(text):
    # raw strings avoid invalid-escape warnings for \) and \W
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

In [20]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

Although the addition of the emoticon characters to the end of the cleaned document strings may not
look like the most elegant approach, we must note that the order of the words doesn’t matter in our
bag-of-words model if our vocabulary consists of only one-word tokens. But before we talk more about
the splitting of documents into individual terms, words, or tokens, let’s confirm that our preprocessor
function works correctly:

In [21]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

Lastly, since we will make use of the cleaned text data over and over again during the next sections,
let’s now apply our `preprocessor` function to all the movie reviews in our `DataFrame`:

In [22]:
df['review'] = df['review'].apply(preprocessor)
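
For readers who prefer a parser over a regex for the markup-removal step, the following is a minimal sketch based on the standard library's `html.parser` module mentioned above. The class name `TagStripper` and the function `strip_tags` are hypothetical helpers introduced for illustration. The sketch only drops tags and does not handle emoticons, punctuation, or lowercasing.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the markup itself is never passed here
        self.parts.append(data)


def strip_tags(text):
    parser = TagStripper()
    parser.feed(text)
    parser.close()  # flush any text the parser may still be buffering
    return ''.join(parser.parts)


cleaned = strip_tags('is seven.<br /><br />Title (Brazil): Not Available')
print(cleaned)
```

Like the regex, this removes tags without inserting whitespace in their place, so the text on either side of a `<br />` ends up adjacent.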

## Processing documents into tokens

After successfully preparing the movie review dataset, we now need to think about how to split the
text corpora into individual elements. One way to tokenize documents is to split them into individual
words by splitting the cleaned documents at their whitespace characters:

In [23]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In the context of tokenization, another useful technique is word stemming, which is the process of
transforming a word into its root form. It allows us to map related words to the same stem. The original
stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter
stemmer algorithm.

In [24]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [25]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Using the `PorterStemmer` from the `nltk` package, we modified our `tokenizer` function to reduce words
to their root form, which was illustrated by the simple preceding example where the word `'running'`
was stemmed to its root form `'run'`.

Before we jump into the next section, where we will train a machine learning model using the bag-
of-words model, let’s briefly talk about another useful topic called stop word removal. Stop words
are simply those words that are extremely common in all sorts of texts and probably bear no (or only
a little) useful information that can be used to distinguish between different classes of documents.
Examples of stop words are is, and, has, and like. Removing stop words can be useful if we are working
with raw or normalized term frequencies rather than tf-idfs, which already downweight the frequently
occurring words.

To remove stop words from the movie reviews, we will use the set of English stop words that is
available from the NLTK library (127 words at the time of writing; the exact number depends on the
NLTK version), which can be obtained by calling the `nltk.download` function:

In [26]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/openbravo/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [27]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

# Training a logistic regression model for document classification

In this section, we will train a logistic regression model to classify the movie reviews into positive and
negative reviews based on the bag-of-words model. First, we will divide the DataFrame of cleaned text
documents into 25,000 documents for training and 25,000 documents for testing:

In [28]:
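# .loc slices by label and includes the end label, so the training
# slice stops at 24999 to give 25,000 rows that do not overlap the test set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values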
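
When splitting by index, it helps to remember that label-based slicing with `.loc` includes the end label, whereas position-based slicing with `.iloc` excludes the end position. The toy `DataFrame` below is made up purely to illustrate this difference. With a default integer index, the slices `.loc[:n]` and `.loc[n:]` share row `n`.

```python
import pandas as pd

toy = pd.DataFrame({'review': [f'doc {i}' for i in range(10)],
                    'sentiment': [i % 2 for i in range(10)]})

# Label-based: index labels 0 through 5, i.e., six rows
by_label = toy.loc[:5, 'review']
# Position-based: positions 0 through 4, i.e., five rows
by_position = toy.iloc[:5]['review']

# Rows that would appear in both halves of a .loc[:5] / .loc[5:] split
shared = set(by_label.index) & set(toy.loc[5:].index)

print(len(by_label), len(by_position), shared)
```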

Next, we will use a GridSearchCV object to find the optimal set of parameters for our logistic regression
model using 5-fold stratified cross-validation:

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

"""
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]
"""

small_param_grid = [{'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    {'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf':[False],
                     'vect__norm':[None],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

When we initialized the `GridSearchCV` object and its parameter grid using the preceding code, we
restricted ourselves to a limited number of parameter combinations, since the number of feature
vectors, as well as the large vocabulary, can make the grid search computationally quite expensive.
Using a standard desktop computer, our grid search may take 5-10 minutes to complete.

In the previous code example, we replaced `CountVectorizer` and `TfidfTransformer` from the previous
subsection with `TfidfVectorizer`, which combines `CountVectorizer` with the `TfidfTransformer`. Our
`small_param_grid` consisted of two parameter dictionaries. In the first dictionary, we used
`TfidfVectorizer` with its default settings (`use_idf=True`, `smooth_idf=True`, and `norm='l2'`) to
calculate the tf-idfs; in the second dictionary, we set `use_idf=False` and `norm=None` in order to
train a model based on raw term frequencies. Furthermore, for the logistic regression classifier
itself, we trained models using L2 regularization via the `penalty` parameter and compared different
regularization strengths by defining a range of values for the inverse-regularization parameter `C`.
As an optional exercise, you are also encouraged to add L1 regularization to the parameter grid by
changing `'clf__penalty': ['l2']` to `'clf__penalty': ['l2', 'l1']`.
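
To see how quickly such a grid grows, the number of candidate settings can be counted with scikit-learn's `ParameterGrid` before running the search. The sketch below mirrors the structure of the reduced grid above. Placeholder strings stand in for the tokenizer functions and the stop word list, since only the number of options per key matters for the count.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder strings replace the tokenizer functions and the stop word
# list; they are illustrative stand-ins, not values to pass to a pipeline
grid = [{'vect__ngram_range': [(1, 1)],
         'vect__stop_words': [None],
         'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
         'clf__penalty': ['l2'],
         'clf__C': [1.0, 10.0]},
        {'vect__ngram_range': [(1, 1)],
         'vect__stop_words': ['stop', None],
         'vect__tokenizer': ['tokenizer'],
         'vect__use_idf': [False],
         'vect__norm': [None],
         'clf__penalty': ['l2'],
         'clf__C': [1.0, 10.0]}]

n_candidates = len(ParameterGrid(grid))
n_folds = 5
print(f'{n_candidates} candidates x {n_folds} folds = '
      f'{n_candidates * n_folds} fits')
```

Each dictionary contributes the product of its option counts (4 for the first, 4 for the second), so every additional option multiplies the number of models that have to be trained.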

**Important Note about `n_jobs`**

Please note that it is highly recommended to use `n_jobs=-1` (instead of `n_jobs=1`) in the previous code example to utilize all available cores on your machine and speed up the grid search. However, some Windows users reported issues when running the previous code with the `n_jobs=-1` setting, related to pickling the `tokenizer` and `tokenizer_porter` functions for multiprocessing on Windows. One workaround would be to replace those two functions, `[tokenizer, tokenizer_porter]`, with `[str.split]`. However, note that the replacement by the simple `str.split` would not support stemming.

In [30]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits




Using the best model from this grid search, let’s print the best parameter set, the average 5-fold
cross-validation accuracy score on the training dataset, and the classification accuracy on the test
dataset. As the output of the following cells shows, we obtained the best grid search results using
the regular tokenizer without Porter stemming, no stop word library, and tf-idfs in combination with
a logistic regression classifier that uses L2-regularization with the inverse regularization strength
`C` set to 10.0:

In [31]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f7557057560>}
CV Accuracy: 0.897


In [32]:
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

Test Accuracy: 0.899


####  Start comment:
    
Please note that `gs_lr_tfidf.best_score_` is the average k-fold cross-validation score. I.e., if we have a `GridSearchCV` object with 5-fold cross-validation (like the one above), the `best_score_` attribute returns the average score over the 5-folds of the best model. To illustrate this with an example:

In [33]:
from sklearn.linear_model import LogisticRegression
import numpy as np

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

np.random.seed(0)
np.set_printoptions(precision=6)
y = [np.random.randint(3) for i in range(25)]
X = (y + np.random.randn(25)).reshape(-1, 1)

cv5_idx = list(StratifiedKFold(n_splits=5, shuffle=False).split(X, y))

lr = LogisticRegression()
cross_val_score(lr, X, y, cv=cv5_idx)

array([0.6, 0.4, 0.6, 0.2, 0.6])

By executing the code above, we created a simple dataset of random integers that represent our class labels. Next, we fed the indices of 5 cross-validation folds (`cv5_idx`) to the `cross_val_score` scorer, which returned 5 accuracy scores; these are the 5 accuracy values for the 5 test folds.

Next, let us use the `GridSearchCV` object and feed it the same 5 cross-validation sets (via the pre-generated `cv5_idx` indices):

In [34]:
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression()
gs = GridSearchCV(lr, {}, cv=cv5_idx, verbose=3).fit(X, y) 

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END ..................................., score=0.600 total time=   0.0s
[CV 2/5] END ..................................., score=0.400 total time=   0.0s
[CV 3/5] END ..................................., score=0.600 total time=   0.0s
[CV 4/5] END ..................................., score=0.200 total time=   0.0s
[CV 5/5] END ..................................., score=0.600 total time=   0.0s


As we can see, the scores for the 5 folds are exactly the same as the ones from `cross_val_score` earlier.

Now, the `best_score_` attribute of the `GridSearchCV` object, which becomes available after `fit`ting, returns the average accuracy score of the best model:

In [35]:
gs.best_score_

0.48

As we can see, the result above is consistent with the average score computed with `cross_val_score`.

In [36]:
lr = LogisticRegression()
cross_val_score(lr, X, y, cv=cv5_idx).mean()

0.48

#### End comment.

<hr>
<hr>

# Working with bigger data - online algorithms and out-of-core learning

In [37]:
# This cell is not contained in the book but
# added for convenience so that the notebook
# can be executed starting here, without
# executing prior code in this notebook

import os
import gzip


if not os.path.isfile('movie_data.csv'):
    if not os.path.isfile('movie_data.csv.gz'):
        print('Please place a copy of the movie_data.csv.gz '
              'in this directory. You can obtain it by '
              'a) executing the code in the beginning of this '
              'notebook or b) by downloading it from GitHub: '
              'https://github.com/rasbt/machine-learning-book/'
              'blob/main/ch08/movie_data.csv.gz')
    else:
        with gzip.open('movie_data.csv.gz', 'rb') as in_f, \
                open('movie_data.csv', 'wb') as out_f:
            out_f.write(in_f.read())

If you executed the code examples in the previous section, you may have noticed that it could be
computationally quite expensive to construct the feature vectors for the 50,000-movie review dataset
during a grid search. In many real-world applications, it is not uncommon to work with even larger
datasets that can exceed our computer’s memory.

Since not everyone has access to supercomputer facilities, we will now apply a technique called
out-of-core learning, which allows us to work with such large datasets by fitting the classifier
incrementally on smaller batches of a dataset.

Back in Chapter 2, Training Simple Machine Learning Algorithms for Classification, the concept of
stochastic gradient descent was introduced; it is an optimization algorithm that updates the model’s
weights using one example at a time. In this section, we will make use of the `partial_fit` function
of `SGDClassifier` in scikit-learn to stream the documents directly from our local drive and train a
logistic regression model using small mini-batches of documents.

First, we will define a `tokenizer` function that cleans the unprocessed text data from the
`movie_data.csv` file that we constructed at the beginning of this chapter and separates it into word
tokens while removing stop words:

In [38]:
import numpy as np
import re
from nltk.corpus import stopwords


# The `stop` is defined as earlier in this chapter
# Added it here for convenience, so that this section
# can be run as standalone without executing prior code
# in the directory
stop = stopwords.words('english')


def tokenizer(text):
    # raw strings avoid invalid-escape warnings for \) and \W
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

# Next, we define a generator function, stream_docs, that reads in
# and returns one document at a time:

def stream_docs(path):  
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
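
To see what the tokenizer above produces, here is a small self-contained sketch. It assumes a tiny hand-written stop-word set (`demo_stop`, a hypothetical stand-in for the NLTK list) so that it runs without the NLTK corpus; `demo_tokenizer` is likewise just an illustrative copy of `tokenizer`:

```python
import re

# Hypothetical stand-in for stopwords.words('english'), for illustration only
demo_stop = {'this', 'is', 'a', 'the'}


def demo_tokenizer(text, stop_words=demo_stop):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # replace runs of non-word characters by a space, then re-append emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return [w for w in text.split() if w not in stop_words]


tokens = demo_tokenizer('<b>This</b> movie is great :-) really!')
print(tokens)  # ['movie', 'great', 'really', ':)']
```

The HTML tag and punctuation are removed, stop words are dropped, and the emoticon survives in normalized form at the end of the token list.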

To verify that our `stream_docs` function works correctly, let’s read in the first document from the
`movie_data.csv` file, which should return a tuple consisting of the review text as well as the
corresponding class label:

In [39]:
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f
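
The slicing inside `stream_docs` relies on every row ending in a comma, a single-digit label, and a newline. The following sketch illustrates this on a made-up row (the review text is hypothetical, not taken from the dataset):

```python
# Hypothetical row in the same layout as movie_data.csv: "<review>",<label>\n
line = '"A great movie, really.",1\n'

text = line[:-3]        # drop the trailing ',1\n'
label = int(line[-2])   # the single digit just before the newline

print(repr(text))   # '"A great movie, really."'
print(label)        # 1
```

Because only the last three characters are inspected, commas inside the review text do not disturb the parsing. The approach would need adjusting if labels had more than one digit or the file used `\r\n` line endings.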

We will now define a function, `get_minibatch`, that will take a document stream from the `stream_docs`
function and return a particular number of documents specified by the `size` parameter:

In [40]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
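
A quick way to check the behavior of `get_minibatch` is to feed it a toy stream. The sketch below repeats the function so that it is self-contained; `toy_stream` is a hypothetical three-document iterator:

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y


# Hypothetical stream of (text, label) pairs
toy_stream = iter([('doc a', 1), ('doc b', 0), ('doc c', 1)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)

print(first)   # (['doc a', 'doc b'], [1, 0])
print(second)  # (None, None)
```

Note that the second call returns `(None, None)` even though one document was still available: a final batch that cannot be filled completely is discarded. This does not matter here because the batch sizes used in this section divide the 50,000 reviews evenly.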

Unfortunately, we can’t use `CountVectorizer` for out-of-core learning since it requires holding the
complete vocabulary in memory. Also, `TfidfVectorizer` needs to keep all the feature vectors of the
training dataset in memory to calculate the inverse document frequencies. However, another useful
vectorizer for text processing implemented in scikit-learn is `HashingVectorizer`. `HashingVectorizer`
is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function by Austin
Appleby (you can find more information about MurmurHash at https://en.wikipedia.org/wiki/MurmurHash):

In [41]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

In [45]:
# loss='log_loss' makes SGDClassifier fit a logistic regression model
clf = SGDClassifier(loss='log_loss', random_state=1)


doc_stream = stream_docs(path='movie_data.csv')

Using the preceding code, we initialized `HashingVectorizer` with our `tokenizer` function and set the
number of features to `2**21`. Furthermore, we reinitialized a logistic regression classifier by setting
the `loss` parameter of `SGDClassifier` to `'log_loss'`. Note that by choosing a large number of features in
`HashingVectorizer`, we reduce the chance of causing hash collisions, but we also increase the number
of coefficients in our logistic regression model.
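
The trade-off between the number of features and hash collisions can be illustrated with a minimal sketch of the hashing trick. It is not scikit-learn's implementation: as an assumption for illustration, it uses `zlib.crc32` from the standard library in place of MurmurHash3, and the helper names `hashed_index` and `hash_vectorize` are hypothetical:

```python
import zlib


def hashed_index(token, n_features):
    # CRC32 is only a stand-in here; scikit-learn uses MurmurHash3
    return zlib.crc32(token.encode('utf-8')) % n_features


def hash_vectorize(tokens, n_features):
    # Count tokens directly into a fixed-size vector, no vocabulary needed
    vec = [0] * n_features
    for tok in tokens:
        vec[hashed_index(tok, n_features)] += 1
    return vec


tokens = ['movie', 'great', 'plot', 'acting', 'boring', 'movie']
small = hash_vectorize(tokens, n_features=4)
large = hash_vectorize(tokens, n_features=2**21)

# Five distinct tokens must share four slots, so at least two collide
print(small)
# With 2**21 slots, collisions between five tokens are very unlikely
print(sum(1 for v in large if v))
```

No vocabulary is stored at any point, which is what makes the approach suitable for streaming data. The price is that the mapping cannot be inverted to recover the original words.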

Now comes the really interesting part—having set up all the complementary functions, we can start
the out-of-core learning using the following code:

In [46]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()



Again, we made use of the PyPrind package to estimate the progress of our learning algorithm. We
initialized the progress bar object with 45 iterations and, in the following for loop, we iterated over
45 mini-batches of documents where each mini-batch consists of 1,000 documents. Having completed
the incremental learning process, we will use the last 5,000 documents to evaluate the performance
of our model:

In [47]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 0.868


As you can see, the accuracy of the model is approximately 87 percent, slightly below the accuracy
that we achieved in the previous section using the grid search for hyperparameter tuning. However,
out-of-core learning is very memory efficient, and it took less than a minute to complete.

Finally, we can use the last 5,000 documents to update our model:

In [48]:
clf = clf.partial_fit(X_test, y_test)
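
Conceptually, each `partial_fit` call above updates the existing weights with the new mini-batch instead of retraining from scratch. The following is a minimal sketch of that idea for logistic regression with plain stochastic gradient descent. `TinyOnlineLogReg` is a hypothetical toy class written for illustration; it is not how `SGDClassifier` is implemented and omits regularization and learning-rate schedules:

```python
import math


class TinyOnlineLogReg:
    """Toy logistic regression that learns one mini-batch at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One SGD step per example; the weights persist between calls
        for x, target in zip(X, y):
            err = self.predict_proba(x) - target
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err
        return self

    def predict(self, x):
        return int(self.predict_proba(x) >= 0.5)


model = TinyOnlineLogReg(n_features=2)
# Hypothetical stream of 20 identical, linearly separable mini-batches
toy_batches = [([[1.0, 0.0], [0.0, 1.0]], [1, 0])] * 20
for X_batch, y_batch in toy_batches:
    model.partial_fit(X_batch, y_batch)

print(model.predict([1.0, 0.0]), model.predict([0.0, 1.0]))  # 1 0
```

Only one mini-batch has to be in memory at any time, which is the property the out-of-core loop in this section relies on.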

## Topic modeling

Topic modeling describes the broad task of assigning topics to unlabeled text documents. For example,
a typical application is the categorization of documents in a large text corpus of newspaper articles.
In applications of topic modeling, we then aim to assign category labels to those articles, for example,
sports, finance, world news, politics, and local news. Thus, in the context of the broad categories of
machine learning that we discussed in Chapter 1, Giving Computers the Ability to Learn from Data, we
can consider topic modeling as a clustering task, a subcategory of unsupervised learning.

In this section, we will discuss a popular technique for topic modeling called latent Dirichlet allocation
(LDA). However, note that while latent Dirichlet allocation is often abbreviated as LDA, it is not to be
confused with linear discriminant analysis, a supervised dimensionality reduction technique that was
introduced in Chapter 5, Compressing Data via Dimensionality Reduction.

### Decomposing text documents with Latent Dirichlet Allocation

Since the mathematics behind LDA is quite involved and requires knowledge of Bayesian inference,
we will approach this topic from a practitioner’s perspective and interpret LDA using layman’s terms.
However, the interested reader can read more about LDA in the following research paper: Latent
Dirichlet Allocation, by David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Journal of Machine Learning
Research 3, pages: 993-1022, Jan 2003, https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.

LDA is a generative probabilistic model that tries to find groups of words that appear frequently
together across different documents. These frequently appearing words represent our topics, assuming
that each document is a mixture of different words. The input to an LDA is the bag-of-words model
that we discussed earlier in this chapter.

Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:
• A document-to-topic matrix
• A word-to-topic matrix

LDA decomposes the bag-of-words matrix in such a way that if we multiply those two matrices
together, we will be able to reproduce the input, the bag-of-words matrix, with the lowest possible error.
In practice, we are interested in those topics that LDA found in the bag-of-words matrix. The only
downside may be that we must define the number of topics beforehand: the number of topics is a
hyperparameter of LDA that has to be specified manually.
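
The matrix view described above can be made concrete with a toy example. The numbers below are hand-picked for illustration and are not the output of LDA inference; the sketch only shows how a document-to-topic matrix and a topic-to-word matrix multiply into something with the shape of a bag-of-words matrix:

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# 3 documents x 2 topics: how strongly each document expresses each topic
doc_topic = [[1.0, 0.0],
             [0.0, 1.0],
             [0.5, 0.5]]

# 2 topics x 4 words: how important each word is for each topic
topic_word = [[2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 3.0]]

reconstructed = matmul(doc_topic, topic_word)
for row in reconstructed:
    print(row)
# [2.0, 2.0, 0.0, 0.0]
# [0.0, 0.0, 3.0, 3.0]
# [1.0, 1.0, 1.5, 1.5]
```

The third document is an even mixture of both topics, so its reconstructed word counts are the average of the two topic rows.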

### Latent Dirichlet Allocation with scikit-learn

In [49]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')

# the following is necessary on some computers:
df = df.rename(columns={"0": "review", "1": "sentiment"})

df.head(3)

|   | review | sentiment |
|---|--------|-----------|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


In [50]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)

Notice that we set the maximum document frequency of words to be considered to 10 percent
(`max_df=.1`) to exclude words that occur too frequently across documents. The rationale behind the removal
of frequently occurring words is that these might be common words appearing across all documents
that are, therefore, less likely to be associated with a specific topic category of a given document.
Also, we limited the number of words to be considered to the most frequently occurring 5,000 words
(`max_features=5000`), to limit the dimensionality of this dataset to improve the inference performed
by LDA. However, both `max_df=.1` and `max_features=5000` are hyperparameter values chosen
arbitrarily, and readers are encouraged to tune them while comparing the results.
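
The effect of the two settings can be sketched in plain Python. The helper `build_vocabulary` below is hypothetical and only approximates what `CountVectorizer` does: it drops terms whose document frequency exceeds `max_df` and then keeps the `max_features` terms with the highest total count:

```python
from collections import Counter


def build_vocabulary(tokenized_docs, max_df, max_features):
    n_docs = len(tokenized_docs)
    doc_freq = Counter()   # number of documents containing a term
    term_freq = Counter()  # total occurrences of a term in the corpus
    for doc in tokenized_docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    # Drop terms that appear in more than max_df (a fraction) of documents
    kept = [t for t in term_freq if doc_freq[t] / n_docs <= max_df]
    # Keep the most frequent remaining terms, ties broken alphabetically
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])


# Hypothetical, already tokenized toy corpus
docs = [['the', 'movie', 'good'],
        ['the', 'movie', 'bad'],
        ['the', 'plot', 'twist'],
        ['the', 'acting', 'good']]

vocab = build_vocabulary(docs, max_df=0.5, max_features=2)
print(vocab)  # ['good', 'movie']
```

Here `'the'` occurs in every document and is removed by the `max_df` filter, while the rare terms are cut by the `max_features` limit.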



In [51]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

By setting `learning_method='batch'`, we let the `lda` estimator do its estimation based on all available
training data (the bag-of-words matrix) in each update, which is slower than the alternative `'online'`
learning method, but can lead to more accurate results (setting `learning_method='online'` is
analogous to online or mini-batch learning, which we discussed in Chapter 2, Training Simple Machine
Learning Algorithms for Classification, and previously in this chapter).

After fitting the LDA, we now have access to the `components_` attribute of the `lda` instance, which stores
a matrix containing the word importance values (here, 5,000 per topic) for each of the 10 topics:

In [52]:
lda.components_.shape

(10, 5000)

To analyze the results, let’s print the five most important words for each of the 10 topics. Note that
`argsort` returns the word indices sorted by importance in increasing order. Thus, to print the top five
words, we need to read the sorted index array in reverse order:

In [53]:
n_top_words = 5
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)}:')
    print(' '.join([feature_names[i]
                    for i in topic.argsort()\
                        [:-n_top_words - 1:-1]]))

Topic 1:
comedy jokes laugh humor fun
Topic 2:
guy girl sex women woman
Topic 3:
war american wife murder men
Topic 4:
human book feel audience documentary
Topic 5:
series tv episode dvd episodes
Topic 6:
horror gore house scary blood
Topic 7:
performance role wonderful beautiful family
Topic 8:
action john western killer hero
Topic 9:
script worst minutes awful budget
Topic 10:
action fun music animation disney
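
The slice `[:-n_top_words - 1:-1]` used above is compact but easy to misread. The following sketch reproduces the same indexing logic with a plain list; the `weights` values are made up, and `sorted(range(...))` stands in for NumPy's `argsort`:

```python
# Hypothetical importance values for six words of one topic
weights = [0.2, 3.1, 0.7, 5.0, 1.4, 0.1]
n_top = 3

# Indices ordered from least to most important (what argsort returns)
ascending = sorted(range(len(weights)), key=weights.__getitem__)
print(ascending)  # [5, 0, 2, 4, 1, 3]

# Walk backward from the end and stop after n_top entries
top = ascending[:-n_top - 1:-1]
print(top)  # [3, 1, 4]
```

The result lists the indices of the three largest values, from most to least important, which is the order in which the topic words are printed.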


Based on reading the 5 most important words for each topic, we may guess that the LDA identified the following topics:

1. Comedies
2. Movies about relationships
3. War movies
4. Documentaries and movies based on books
5. Movies somehow related to TV shows
6. Horror movies
7. Movies with praised performances (not really a topic category)
8. Action and western movies
9. Generally bad movies (not really a topic category)
10. Animated and family movies

To confirm that the categories make sense based on the reviews, let's print 3 reviews from the horror movie category (category 6 at index position 5):

In [54]:
horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print(f'\nHorror movie #{(iter_idx + 1)}:')
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
Over Christmas break, a group of college friends stay behind to help prepare the dorms to be torn down and replaced by apartment buildings. To make the work a bit more difficult, a murderous, Chucks-wearing psycho is wandering the halls of the dorm, preying on the group in various violent ways.<br / ...

Horror movie #2:
Basically, take the concept of every Asian horror ghost movie and smash it into one and you get this movie. The story goes like this: a bunch of college kids get voice mails from their own phones that are foretelling their deaths. There's some s*** going on with ghosts, which if you've seen any Asia ...

Horror movie #3:
I personally have a soft spot for horror films that are set in hospitals and asylums so I had a good feeling about watching this "Don't Look in the Basement", even though its reputation is doubtful. Well, turned out I was right! This is great, trashy entertainment with a couple of efficient shocks a ...


Using the preceding code example, we printed the first 300 characters from the top three horror
movies. The reviews, even though we don’t know which exact movie they belong to, sound like reviews
of horror movies (however, one might argue that Horror movie #2 could also be a good fit for topic
category 9: Generally bad movies).

