<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/11a-sentiment-imdb.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 11a Sentiment analysis of IMDb dataset

* Use logistic regression to classify labeled [IMDb movie reviews](http://ai.stanford.edu/~amaas/data/sentiment/)
* Python's [urllib.request](https://docs.python.org/3/library/urllib.request.html) for downloading files
* Text processing with Python regular expressions

### References

* [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/) -- stanford.edu
* Python Machine Learning, 3rd Edition (2019) Raschka & Mirjalili
  * Raschka's [ch08.ipynb](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch08/ch08.ipynb) -- github


# Get the dataset

* The next cell uses [urllib.request.urlretrieve](https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve) to download a gzipped tar file
* The file is a compressed archive -- it cannot be read directly by Pandas
* Once you download the file locally, you can inspect the contents of the directories
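
You can also look inside the archive without extracting it. The following is a minimal sketch using the standard library's `tarfile`; the helper `list_members` and the tiny stand-in archive `demo.tar.gz` are hypothetical and exist only so the example runs without the 80 MB download.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_members(archive_path, limit=5):
    """Return the first `limit` member names of a gzipped tar archive."""
    names = []
    with tarfile.open(str(archive_path), "r:gz") as tar:
        for member in tar:  # iterating a TarFile yields TarInfo objects
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


# Build a small stand-in archive (hypothetical file names and contents).
workdir = Path(tempfile.mkdtemp())
demo_archive = workdir / "demo.tar.gz"
with tarfile.open(str(demo_archive), "w:gz") as tar:
    for name, text in [("demo/train/pos/0_9.txt", "great film"),
                       ("demo/train/neg/1_2.txt", "dull film")]:
        payload = text.encode("utf-8")
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

print(list_members(demo_archive))
```

The same helper should work on `aclImdb_v1.tar.gz` once it has been downloaded, for example `list_members('aclImdb_v1.tar.gz', limit=10)`.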

In [None]:
# Get the data file from the original source (takes ~30 seconds in Colab)
import os
import sys
import tarfile
import time
import urllib.request

source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write("\r%d%% | %d MB | %.2f MB/s | %d sec elapsed" %
                    (percent, progress_size / (1024.**2), speed, duration))
    sys.stdout.flush()


if not os.path.isdir('aclImdb') and not os.path.isfile(target):
    urllib.request.urlretrieve(source, target, reporthook)

100% | 80 MB | 2.78 MB/s | 28 sec elapsed

In [None]:
if not os.path.isdir('aclImdb'):

    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()

In [None]:
# Install Raschka's pyprind in Colab
# It's a progress bar -- no functional contribution.
!pip install pyprind

Collecting pyprind
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3


In [None]:
# This cell takes about 1.5 minutes on Colab
import pyprind
import pandas as pd
import os

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
# Collect rows in a list: DataFrame.append was removed in pandas 2.0
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:36


In [None]:
# Shuffle the dataframe (reproducibly)
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

In [None]:
assert df.shape == (50000, 2)
df.head(3)

Unnamed: 0,review,sentiment
11841,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
19602,OK... so... I really like Kris Kristofferson a...,0
45519,"***SPOILER*** Do not read this, if you think a...",0


In [None]:
# Read movie reviews from CSV in Raschka's github repo
# This cell replaces cells 2, 3 & 4
import os
import sys
import time
import pandas as pd
import urllib.request

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write("\r%d%% | %d MB | %.2f MB/s | %d sec elapsed" %
                    (percent, progress_size / (1024.**2), speed, duration))
    sys.stdout.flush()

target = "movie_data.csv.gz"
source = "https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/" + target
if not os.path.isfile(target):
    urllib.request.urlretrieve(source, target, reporthook)

df = pd.read_csv(target, compression='gzip')

assert df.shape == (50000, 2)
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


# Cleaning the data

* This dataset contains HTML markup
* Python regular expressions (the `re` module) can strip the markup
* [Pandas supports regular expressions](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb) (VanderPlas) -- github.com
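
As a rough illustration of what regular expressions can do here, the sketch below applies three patterns one step at a time to a made-up string (the `sample` text is hypothetical, not taken from the dataset): one pattern drops tags, one collects emoticons, and one collapses runs of non-word characters.

```python
import re

# Hypothetical snippet written in the style of a raw review
sample = "Loved it :-)<br /><br />Rating: 7/10"

TAG = re.compile(r"<[^>]*>")                          # anything between < and >
EMOTICON = re.compile(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)")  # eyes, optional nose, mouth

no_tags = TAG.sub("", sample)                  # remove the markup
emoticons = EMOTICON.findall(no_tags)          # collect emoticons before they are lost
words = re.sub(r"\W+", " ", no_tags.lower())   # collapse non-word runs into one space

print(no_tags)
print(emoticons)
print(words)
```

Note that replacing a tag with an empty string can fuse two words that were separated only by markup; substituting a space instead would avoid that, at the cost of extra whitespace.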

In [None]:
# An example of markup inside a document
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [None]:
# Remove HTML, keep emoticons (but take off their noses)
import re
def preprocessor(text):
    # Raw strings avoid invalid-escape warnings for \) \( and \W
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

In [None]:
# Test it
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [None]:
# Apply it
df['review'] = df['review'].apply(preprocessor)

In [None]:
# Verify it
df.loc[0, 'review'][-50:]

'zation my vote is seven title brazil not available'

# Tokenizer

* You can tokenize documents by simply splitting them into individual words at their whitespace characters
* You can also use "word stemming" to transform words to their root form
  * The Porter stemmer algorithm was published in 1980
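
To make the idea of stemming concrete without installing NLTK, here is a deliberately crude suffix stripper. It is not the Porter algorithm; the function `naive_stem` and its suffix list are hypothetical and only illustrate why a real stemmer needs more careful rules.

```python
def naive_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


sentence = "runners like running and thus they run"
tokens = sentence.split()                 # whitespace tokenization
stems = [naive_stem(t) for t in tokens]   # crude stemming on top of it

print(tokens)
print(stems)
```

This toy version turns "running" into "runn" and "thus" into "thu", which shows how blunt suffix rules can mangle words.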

In [None]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [None]:
# Compare a simple tokenizer
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [None]:
# ...with a Porter stemmer algorithm -- notice what happens to "thus"!
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

# Load and remove some stop words



In [None]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']
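
The filtering step itself does not depend on NLTK. A minimal sketch, assuming a hand-written mini stop list (`STOP` below is hypothetical and far shorter than NLTK's English list), stores the stop words in a set so that each membership test is a hash lookup rather than a scan of a list:

```python
# Hypothetical mini stop list; NLTK's English list is much longer.
STOP = frozenset({"a", "and", "the", "of", "is"})


def remove_stop_words(tokens, stop_words=STOP):
    """Keep only the tokens that are not in the stop-word collection."""
    return [t for t in tokens if t not in stop_words]


tokens = "a runner likes running and runs a lot".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

With the NLTK list, the equivalent would be to pass `set(stop)` as `stop_words`.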

# Train/test split

In [None]:
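# Slice by position: label-based .loc slices include both endpoints,
# which would put row 25000 in both the training and the test set
X_train = df['review'].values[:25000]
y_train = df['sentiment'].values[:25000]
X_test = df['review'].values[25000:]
y_test = df['sentiment'].values[25000:]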
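
When splitting by row, keep in mind that label-based `.loc` slices include both endpoints, whereas positional slices exclude the stop. A small sketch on a hypothetical six-row frame (`toy` is made up for illustration) shows the difference when cutting the data in half:

```python
import pandas as pd

toy = pd.DataFrame({"review": list("abcdef"),
                    "sentiment": [1, 0, 1, 0, 1, 0]})

# Label-based: both endpoints are included, so row 3 lands in both halves.
loc_train = toy.loc[:3, "review"].values
loc_test = toy.loc[3:, "review"].values

# Positional: the stop is excluded, so the halves are disjoint.
pos_train = toy["review"].values[:3]
pos_test = toy["review"].values[3:]

print(set(loc_train) & set(loc_test))   # rows shared by the .loc halves
print(set(pos_train) & set(pos_test))   # rows shared by the positional halves
```

Positional slicing also stays correct if the index has been shuffled, because it does not depend on the index labels.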

# Logistic regression

* Parameter tuning with cross validation
* The next cell will take a long time (up to an hour)
* The cell after that searches a reduced parameter space

In [None]:
# Don't run this cell, unless you want to wait a while
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

# This param_grid results in 240 model runs, which takes 30-60 minutes
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=2,
                           n_jobs=-1)

In [None]:
# This param_grid involves 40 models, and runs in under 4 minutes in Colab
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0]},
              ]

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=2,
                           n_jobs=-1)
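
The fit counts quoted in the comments (240 and 40) follow from simple arithmetic: each dictionary in a parameter grid contributes the product of its list lengths, the dictionaries are summed, and the total is multiplied by the number of CV folds. The sketch below reproduces that count; the helper `n_candidates` is hypothetical, and strings stand in for the tokenizer functions and the stop-word list.

```python
import math


def n_candidates(param_grid):
    """Count parameter combinations across a list of grid dictionaries."""
    return sum(math.prod(len(values) for values in grid.values())
               for grid in param_grid)


# Placeholders ('stop', 'tokenizer', ...) stand in for the real objects.
full_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]
reduced_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0]},
]

cv_folds = 5
print(n_candidates(full_grid) * cv_folds)      # fits for the full grid
print(n_candidates(reduced_grid) * cv_folds)   # fits for the reduced grid
```

The final refit of the best model on the whole training set is an extra fit that is not included in these totals.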

# Compare the train/test performance

* CV accuracy is 0.887
* Test accuracy is 0.893
* Train accuracy (without CV) is 0.997

In [None]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  3.5min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=False,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        n

In [None]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f6e2df17b90>} 
CV Accuracy: 0.887


In [None]:
# Test accuracy here is 0.893, which is larger than the CV accuracy of 0.887
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.893


In [None]:
# But training accuracy (without CV averaging) is 0.997, well above the test accuracy of 0.893 -- this suggests overfitting
print('Training Accuracy: %.3f' % clf.score(X_train, y_train))

Training Accuracy: 0.997
