<a href="https://colab.research.google.com/github/linhoangce/ml_with_pytorch_and_scikitlearn/blob/main/chapter8_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Topics covered:

* Cleaning and preparing text data
* Building feature vectors from text document
* Train a machine learning model to classify positive and negative movie reviews
* Working with large text datasets using out-of-core learning
* Inferring topics from document collections for categorization

In [1]:
# add folder to path to load from check_packages.py scrip
import sys

sys.path.insert(0, '..')

In [2]:
# check recommned package version
try:
  from python_environment_check import check_packages
except:
  !wget https://raw.githubusercontent.com/rasbt/machine-learning-book/refs/heads/main/python_environment_check.py
  from python_environment_check import check_packages

--2025-07-31 03:58:11--  https://raw.githubusercontent.com/rasbt/machine-learning-book/refs/heads/main/python_environment_check.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1815 (1.8K) [text/plain]
Saving to: ‘python_environment_check.py’


2025-07-31 03:58:12 (21.6 MB/s) - ‘python_environment_check.py’ saved [1815/1815]

[OK] Your Python version is 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]


In [3]:
d = {
    'numpy': '1.21.2',
    'pandas': '1.3.2',
    'sklearn': '1.0',
    'pyprind': '2.11.3',
    'nltk': '3.6'
}

check_packages(d)

[FAIL]: pyprind is not installed and/or cannot be imported.
[OK] numpy 2.0.2
[OK] pandas 2.2.2
[OK] sklearn 1.6.1
[OK] nltk 3.9.1


# Preparing the IMDb movie review data for text processing

## Obtaining the movie review dataset

In [4]:
# download and unzip the dataset
import os
import sys
import tarfile
import time
import urllib.request

source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'

if os.path.exists(target):
  os.remove(target)

def reporthook(count, block_size, total_size):
  global start_time

  if count == 0:
    start_time = time.time()
    return

  duration = time.time() - start_time
  progress_size = int(count * block_size)
  speed = progress_size / (1024.**2 * duration)
  percent = count * block_size * 100. / total_size

  sys.stdout.write(f'\r{int(percent)}% | {progress_size / (1023.**2):.3f} MB'
                  f'| {speed:.2f} MB/s | {duration:.2f} sec elapsed')

if not os.path.isdir('aclImdb') and not os.path.isfile('aclImdb_v1.tar.gz'):
  urllib.request.urlretrieve(source, target, reporthook)

100% | 80.391 MB| 48.40 MB/s | 1.66 sec elapsed

In [5]:
if not os.path.isdir('aclImdb'):
  with tarfile.open(target, 'r:gz') as tar:
    tar.extractall()

## Preprocessing the movie dataset into a more convenient format

In [6]:
# install PyPrind to visual progress and time estimated
# for reading movie reviews into a pandas DataFrame (10m)
!pip install pyprind -q

In [7]:
import pyprind
import pandas as pd
import os
import sys
from packaging import version

# change 'basepath' to dir of unzipped movie dataset
basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=2)

df = pd.DataFrame()

for s in ('test', 'train'):
  for l in ('pos', 'neg'):
    path = os.path.join(basepath, s, l)
    for file in sorted(os.listdir(path)):
      with open(os.path.join(path, file),
                'r', encoding='utf-8') as infile:
        txt = infile.read()

      if version.parse(pd.__version__) >= version.parse('1.3.2'):
        x = pd.DataFrame([[txt, labels[l]]],
                         columns=['review', 'sentiment'])
        df = pd.concat([df, x],
                       ignore_index=True)

      else:
        df = df.append([[txt, labels[l]]],
                       ignore_index=True)

      pbar.update()

df.columns = ['review', 'sentiment']


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:09


In [8]:
df.index

RangeIndex(start=0, stop=50000, step=1)

In [9]:
# shuffle sorted dataset saved in DataFrame
import numpy as np

if version.parse(pd.__version__) >= version.parse('1.3.2'):
  df = df.sample(frac=1, random_state=0).reset_index(drop=True)

else:
  np.random.seed(0)
  df = df.reindex(np.random.permutation(df.index))

In [10]:
df

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
...,...,...
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0


In [11]:
# save assembled data as CSV file
df.to_csv('movie_data.csv',
          index=False,
          encoding='utf-8')

In [12]:
# confirm saved data in right format - csv
df = pd.read_csv('movie_data.csv',
                 encoding='utf-8')
df.head(5)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [13]:
df.shape

(50000, 2)

# Introducing the bag-of-word model

Concept summarized:

1. We create a vocabulary of unique tokens - for example, words - from the entire set of documents.
2. We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly conist of zeros, which is why we call them **sparse**.

## Transforming words into feature vectors

In [14]:
# construct a bag-of-words model based on word counts
# `CountVectorizer` takes an array of text data
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'
])

bag = count.fit_transform(docs)
bag.toarray()

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])

In [15]:
count.vocabulary_

{'the': 6,
 'sun': 4,
 'is': 1,
 'shining': 3,
 'weather': 8,
 'sweet': 5,
 'and': 0,
 'one': 2,
 'two': 7}

* `count.vocabulary_`: The vocabulary is stored in a Python dictionary, which maps the unique words to integer indices.

* `bag.toarray()`: Each index position in the feature vectors shown corresponds to the integer values that are stored as dictionary items in the CountVectorizer vocabulary. For example, the first feature at index position 0 resembles the count of the word "and", which only occurs in the last document, and the "is" at index position 1 (the 2nd feature in the documnet vectors) occurs in all three sentences. Those values in the feature vectors are called the raw term frequencies: *tf(t,d) - the number of items a term t occurs in a document d.

## Assessing word relevancy via term frequency-inverse document frequency

In [16]:
np.set_printoptions(precision=2)

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Thoes frequently occuring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called **term frequency-inverse document frequency** (tf-idf) that can be used to downweigh those frequently occuring words in the feature vectors. The ef-idf can be defined as the product of the term frequency and the inverse document frequency:

<code>

tf-idf(t, d) = tf(t,d) x idf(t,d)

</code>

The inverse document frequency *idf(t,d)* can be calculated as:

$idf(t,d) = log\frac{n_d}{1 + df(d,t)}$

where $n_d$ is the total number of documents, and *df(d, t) is the number of documents *d* that contrain the term *t*. The constant 1 is added to the denominator as optional for the purpose of assigning a non-zero value to terms that occur in all training examples; the log is used to ensure that low document frequencies are not given too much weight.

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transform them int tf-idfs:

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                        norm='l2',
                        smooth_idf=True)

tfidf.fit_transform(count.fit_transform(docs)).toarray()


array([[0.  , 0.43, 0.  , 0.56, 0.56, 0.  , 0.43, 0.  , 0.  ],
       [0.  , 0.43, 0.  , 0.  , 0.  , 0.56, 0.43, 0.  , 0.56],
       [0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19]])

The word "is" had the largest term frequency in the 3rd document, being the most occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word "is" is now associated with a relatively small tf-idf(0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.

The `TfidfTransformer` calculates the tf-idfs slightly dfferently compared to the stantard textbook equation defined earlier. The equations for the idf and tf-idf that were implemented in Scikit-learn are:

$ idf(t,d) = log\frac{1 + n_d}{1 + df(d,t)}$

The tf-idf equation that was implemented in Scikit-learn is as follows:

$ tf-idf(t,d) = tf(t,d) \times (idf(t,d) + 1) $

While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the `TfidfTransformer` normalizes the tf-idfs directly.

By default (`norm='l2'`), scikit-learn Tfidftransformer applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector *v* by its L2-norm:

$ v_{\text{norm}} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}} = \frac{v}{\left(\sum_{i=1}^{n} {v_i^2}\right)^{1/2}} $

Let's walk through an example and calculate the tf-idf of the word "is" in the 3rd document.

The word "is" has a term frequency of 3 (tf = 3) in the document 3 ($d_3$), and the document frequency of this term is 3 since the term "is" occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:

$ idf("is", d_3) = log\frac{1 + 3}{1 + 3} = 0 $

Now to calculate the tf-idf, we simple need to add 1 to the inverse document frequency and multiply it by the term frequency:

$ tf-idf("is", d_3) = 3 \times (0 + 1) = 3 $

In [18]:
tf_is = 3
n_docs = 3
idf_is = np.log((n_docs + 1) / (3 + 1))
tfidf_is = tf_is * (idf_is + 1)
print(f'tf-idf of term "is" - {tfidf_is:.2f}')

tf-idf of term "is" - 3.00


If we repeated these calculations for all terms in the 3rd document, we'd obtain the following td-idf vectors: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]. However, the values in this feature vector are different from the values that we obtained from the `TfidfTransformer` that we used previously. The final step that we need to do is L2-normalization:

$ \text{tf-idf}_\text{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]}{\sqrt{[3.39^2, 3.0^2, 3.39^2, 1.29^2, 1.29^2, 1.29^2, 2.0^2 , 1.69^2, 1.29^2]}} = [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]
=> \text{tf-idf}_\text{norm}("is", d_3) = 0.45 $

In [19]:
tfidf = TfidfTransformer(use_idf=True,
                         norm=None,
                         smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf

array([3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29])

In [20]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf ** 2))
l2_tfidf

array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])

## Cleaning text data

In [21]:
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [22]:
# display last 50 characters from first doc
# from reshuffled movie review dataset
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [23]:
# remove all punctuation marks
# leave emoticon characters, e.g :)
import re # regex lib from Python

def preprocessor(text):
  text = re.sub('<[^>]*>', '', text) # remove all HTML markups
  emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                         text)
  text = (re.sub('[\W]+', ' ', text.lower()) +
          ' '.join(emoticons).replace('-', ''))
  return text

In [24]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [25]:
preprocessor("</a>This :) is a :( test :-)!")

'this is a test :) :( :)'

In [26]:
# apply preprocessor function to all movie reviews
df['review'] = df['review'].apply(preprocessor)
df.loc[0, 'review'][-50:]

'zation my vote is seven title brazil not available'

## Processing documents into tokens

In [27]:
# tokenize document by spliting them into individual words
def tokenizer(text):
  return text.split()

tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [28]:
# install Natural Language Toolkit (NLTK)
# to use implementation of Porter stemming algorithm
!pip install nltk -q

In [29]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
  return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

**Stop words** are words that are extremely common in all sorts of texts and probably bear no (or only a little) useful information that can be used to distinguish between different classes of documents. Examples of stop words are *is, and, has,* and *like*. Removing stop words can be useful if we are working with raw or normalized term frequencies rather than tf-idf, which already downweight the frequently occurring words.

In [30]:
# use set of 127 English stop words from NLTK
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [31]:
# load and apply stop word set
from nltk.corpus import stopwords

stop = stopwords.words('english')

[w for w in tokenizer_porter('a runner likes'
  ' running and runs a lot')
  if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

# Training a logistic regression model for document classification

In [32]:
# divide DataFrame of cleaned text documents into
# 25k docs for training and 25k for testing
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

In [33]:
# use a GridSearchCV object to find optimal set of params
# for logistic regression model using 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

small_param_grid = [{'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1., 10.]},
                    {'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf': [False],
                     'vect__norm': [None],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]}
                    ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(estimator=lr_tfidf,
                           param_grid=small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [34]:
# train the model
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits




In [35]:
gs_lr_tfidf.best_params_

{'clf__C': 10.0,
 'clf__penalty': 'l2',
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None,
 'vect__tokenizer': <function __main__.tokenizer(text)>}

In [36]:
gs_lr_tfidf.best_score_

np.float64(0.8970842631473704)

In [37]:
gs_lr_tfidf.score(X_test, y_test)

0.89876

`gs_lr_tfidf.best_Score_` is the average k-fold cross-validation score. i.e, if we have a `GridSearchCV` object with 5-fold cross-validation, the `best_score_` attribute returns the average score over the 5-folds of the best model.

In [38]:
from sklearn.linear_model import LogisticRegression
import numpy as np

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

np.random.seed(0)
np.set_printoptions(precision=6)

y = [np.random.randint(3) for i in range(25)]
X = (y + np.random.randn(25)).reshape(-1, 1)

cv5_idx = list(StratifiedKFold(n_splits=5,
                               shuffle=False).split(X, y))

lr = LogisticRegression()
cross_val_score(estimator=lr,
                X=X, y=y,
                cv=cv5_idx)

array([0.6, 0.4, 0.6, 0.2, 0.6])

We created a simple data set of random integers that represent class labels. Next, we'll feed the indices of 5 cross-validation folds (`cv5_idx`) to the cross_val_score, which will return 5 accuracy scores for 5 test folds.

In [39]:
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression()
gs = GridSearchCV(estimator=lr,
                  param_grid={},
                  cv=cv5_idx,
                  verbose=3).fit(X, y)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END ..................................., score=0.600 total time=   0.0s
[CV 2/5] END ..................................., score=0.400 total time=   0.0s
[CV 3/5] END ..................................., score=0.600 total time=   0.0s
[CV 4/5] END ..................................., score=0.200 total time=   0.0s
[CV 5/5] END ..................................., score=0.600 total time=   0.0s


In [40]:
gs.best_score_

np.float64(0.48)

In [41]:
cross_val_score(estimator=lr,
                X=X,
                y=y,
                cv=cv5_idx).mean()

np.float64(0.48)

# Working with bigger data - online algorithms and out-of-score learning

In [42]:
# define a `tokenizer` function that cleans unprocessed text
# from movie_data.csv file constructed previously
# then separate it into word tokens while removing stop words
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')

def tokenizer(text):
  text = re.sub('<[^>]*>', '', text)
  emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                         text.lower())
  text = re.sub('[\W]+', ' ', text.lower()) \
          + ' '.join(emoticons).replace('-', '')
  tokenized = [w for w in text.split() if w not in stop]
  return tokenized

In [43]:
# define a generator function that reads in and returns
# one document at a time
def stream_docs(path):
  with open(path, 'r', encoding='utf-8') as csv:
    next(csv) # skip header
    for line in csv:
      text, label = line[:-3], int(line[-2])
      yield text, label

In [44]:
# verifu if functions work
text, label = next(stream_docs(path='movie_data.csv'))
text, label

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [45]:
text

'"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich fa

In [46]:
tokenizer(text)

['1974',
 'teenager',
 'martha',
 'moxley',
 'maggie',
 'grace',
 'moves',
 'high',
 'class',
 'area',
 'belle',
 'greenwich',
 'connecticut',
 'mischief',
 'night',
 'eve',
 'halloween',
 'murdered',
 'backyard',
 'house',
 'murder',
 'remained',
 'unsolved',
 'twenty',
 'two',
 'years',
 'later',
 'writer',
 'mark',
 'fuhrman',
 'christopher',
 'meloni',
 'former',
 'la',
 'detective',
 'fallen',
 'disgrace',
 'perjury',
 'j',
 'simpson',
 'trial',
 'moved',
 'idaho',
 'decides',
 'investigate',
 'case',
 'partner',
 'stephen',
 'weeks',
 'andrew',
 'mitchell',
 'purpose',
 'writing',
 'book',
 'locals',
 'squirm',
 'welcome',
 'support',
 'retired',
 'detective',
 'steve',
 'carroll',
 'robert',
 'forster',
 'charge',
 'investigation',
 '70',
 'discover',
 'criminal',
 'net',
 'power',
 'money',
 'cover',
 'murder',
 'murder',
 'greenwich',
 'good',
 'tv',
 'movie',
 'true',
 'story',
 'murder',
 'fifteen',
 'years',
 'old',
 'girl',
 'committed',
 'wealthy',
 'teenager',
 'whose',


In [47]:
# define a function that takes a doc stream from `stream_docs`
# and returns a particular number of docs specified by `size` param
def get_minibatch(doc_stream, size):
  docs, y = [], []
  try:
    for _ in range(size):
      text, label = next(doc_stream)
      docs.append(text)
      y.append(label)
  except StopIteration:
    return None, None
  return docs, y

In [48]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

clf = SGDClassifier(loss='log_loss',
                    random_state=1)
doc_stream = stream_docs(path='movie_data.csv')

In [49]:
# start out-of-core learning
import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])

for _ in range(45):
  X_train, y_train = get_minibatch(doc_stream, size=1000)

  if not X_train:
    break

  X_train = vect.transform(X_train)
  clf.partial_fit(X=X_train,
                  y=y_train,
                  classes=classes)
  pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:47


In [50]:
# use last 5000 docs to evaluate model performance
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)

print(f'Test acc: {clf.score(X_test, y_test):.3f}')

Test acc: 0.868


In [51]:
# use last 5000 docs to update our model
clf = clf.partial_fit(X=X_test, y=y_test)

# Topic modeling with latent Dirichlet allocation

**Topic modeling** describes the broad task of assigning topics to unlabeled text documents, which can be considered as a clustering task, a subcategory of unsupervised learning.

## Decomposing text documents with LDA (Latent Dirichlet Allocation)

LDA is a generative probabilistic model that tries to find groups of words that appear frequently together across different documents. These frequently appearing words represent our topics, assuming that each document is a mixture of different words. The inpyt to an LAD is a the bag-of-words model.

Given a bag-of-words matrix as an input, LDA decomposes it into two new matrices:

* A document-to-topic matrix
* A word-to-topic matrix

LDA decomposes the bag-of-words matrix in such a way that if we multiply those two matrices together, we will be able to reproduce the input, the bag-of-words matrix, with the lowest possible error. In practice, we are interested in those topics that LDA found in the bag-of-words matrix. The only downside may be that we must define the number of topics beforehand - the number of topics is a hyperparameter of LDA that has to be specified manually.

## LDA with scikit-learn

In [52]:
# load dataset into a pandas DataFrame with movie_data.csv
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [53]:
# create bag-of-words matrix as input to LDA
# use English stop word lib from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=0.1,
                        max_features=5000)

X = count.fit_transform(df['review'].values)


We set the maximum document frequency of words to be considered to 10 percent (`max_df=0.1`) to exclude wordds that occur too frequently across documents. The rationale behind the removal of frequently occuring words is that these might be common words appearing across all documents that are, therefore, less likely to be associated with a specific topic category of a given document.

Also, we limit the number of words to be considered to the most frequently occurring 5,000 words (`max_features=5000`), to limit the dimensionality of this dataset to improve the inference performed by LDA.

In [54]:
# fit a `LatentDirichletAllocation` estimator to
# bag-of-words matrix and infer 10 different topics from docs
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch') # train on all available training data

X_topics = lda.fit_transform(X)

In [56]:
# inspect matrix containing word importance
lda.components_

array([[1.733112e+01, 1.029445e+02, 2.233174e+02, ..., 1.081049e+03,
        7.277838e+02, 4.760836e+01],
       [1.058859e+02, 7.337142e+01, 2.119002e+02, ..., 1.000138e-01,
        1.000115e-01, 6.096654e-01],
       [1.020831e+00, 1.140663e+02, 8.653192e+01, ..., 1.000076e-01,
        1.000131e-01, 6.832233e+01],
       ...,
       [1.243436e+00, 8.196615e+01, 3.380253e+01, ..., 8.158088e+01,
        9.386868e+01, 6.407596e+01],
       [7.505545e+00, 2.800143e+00, 8.785812e+01, ..., 1.000053e-01,
        1.000058e-01, 1.000227e-01],
       [5.358888e+01, 1.992614e+02, 4.419852e+01, ..., 1.000091e-01,
        1.000111e-01, 2.970839e+00]])

In [57]:
lda.components_.shape

(10, 5000)

In [58]:
# print five most import words for each of 10 topics
# word importance values are ranked in increasing order
# so we need to sort topic array in reverse order
n_top_words = 5
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
  print(f'Topic {(topic_idx + 1)}:')
  print(' '.join([feature_names[i]
                  for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1:
horror worst script effects budget
Topic 2:
dvd watched video music guy
Topic 3:
war american series history documentary
Topic 4:
game killer murder thriller crime
Topic 5:
kids comedy episode series school
Topic 6:
family woman mother beautiful feel
Topic 7:
role performance comedy john plays
Topic 8:
action horror john effects dr
Topic 9:
book version original read music
Topic 10:
action wife father police james


In [59]:
# confirm categories makes sense based on reviews
# let's plot three movies from horor category (category 8, index 7)
horror = X_topics[:, 7].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
  print(f'\nHorror movie #{(iter_idx + 1)}:')
  print(df['review'][movie_idx][:300], '...')



Horror movie #1:
Screamers is an Italian fantasy film (L'Isola degli Uomini Pesce) bought by Roger Corman and released through his New World Pictures. Of course Corman has to carve his initials on it by having one of his lackeys (Dan T. Miller) direct some additional gore footage before he has it released in the sta ...

Horror movie #2:
OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop ...

Horror movie #3:
This film marked the end of the "serious" Universal Monsters era (Abbott and Costello meet up with the monsters later in "Abbott and Costello Meet Frankentstein"). It was a somewhat desparate, yet fun attempt to revive the classic monsters of the Wolf Man, Frankenstein's monster, and Dracula one "la ...
