<span style="font-family:Georgia; font-size:25pt">
Sentiment analysis
</span>
1. Cleaning and parsing text data
2. Building feature vectors from text documents
3. Training a classifier for sentiment analysis
4. Out-of-core learning and large text datasets

# IMDB dataset
Source: http://ai.stanford.edu/~amaas//data/sentiment/

Download the zip file and assemble the individual text documents into a single CSV file.

In [4]:
import pandas as pd

In [None]:
import pyprind  # Python progress indicator
import os

pbar = pyprind.ProgBar(50000)  # initialize the progress bar with the number of documents to read
labels = {'pos': 1, 'neg': 0}
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = './aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])

In [6]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('./movie_data.csv', index=False)

In [6]:
df = pd.read_csv('./movie_data.csv')
df.head(3)

|  | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


# Bag-of-words model
Idea
1. Create a vocabulary of unique tokens (e.g. words) from the entire set of documents.
2. Construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.
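
The two steps can be sketched in plain Python before turning to a library. This is a minimal illustration, not how sklearn implements it: the names `vocabulary`, `to_vector`, and `bag` are made up for this example, and tokenization is assumed to be simple lowercasing and whitespace splitting.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

# Step 1: build a vocabulary of unique lowercase tokens, indexed alphabetically
tokenized = [doc.lower().split() for doc in docs]
unique_tokens = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(unique_tokens)}

# Step 2: build one count vector per document
def to_vector(tokens):
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocabulary]

bag = [to_vector(tokens) for tokens in tokenized]
for row in bag:
    print(row)
```

Each position in a row holds the count of the vocabulary word with that index, so the third document has a 2 for both "is" and "the".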

## Transform words into feature vectors

In [8]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)

In [9]:
print(count.vocabulary_)

{'sweet': 4, 'weather': 6, 'sun': 3, 'the': 5, 'is': 1, 'and': 0, 'shining': 2}


In [10]:
print(bag.toarray())

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]


Each index position in the feature vectors corresponds to the integer values that are stored as dictionary items in the `CountVectorizer` vocabulary.

An **n-gram** is a contiguous sequence of n items (here, words). sklearn uses 1-grams by default, but we can change this in `CountVectorizer` with the parameter <span class="mark">`ngram_range`</span>. For instance, `ngram_range=(2,2)` yields 2-grams only.
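
To make the idea of contiguous sequences concrete, here is a small plain-Python sketch. The helper `ngrams` is hypothetical (it is not part of sklearn) and assumes lowercasing and whitespace tokenization.

```python
def ngrams(text, n):
    """Return the contiguous n-token sequences of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = ngrams('The sun is shining', 1)
bigrams = ngrams('The sun is shining', 2)
print(unigrams)
print(bigrams)
```

A text with fewer than n tokens produces no n-grams, which is why very short documents can end up with empty feature vectors when only higher-order n-grams are used.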

## Word relevancy: Term frequency-inverse document frequency (tf-idf)
Words that **occur frequently across many documents from both classes** don't contain useful information. **tf-idf** can be used to downweight these words.

<span class="mark">Definition: **term frequency** x **inverse document frequency**</span>
- term frequency: same as `CountVectorizer`
- inverse document frequency: log(total number of documents / (1 + number of documents that contain the term `t`))
    - +1 in denominator: preventing zero denominator
    - log scaled: make sure low doc frequency is not given too much weight
    - **If the term `t` exists in many documents, idf becomes smaller.**
    
sklearn's `TfidfTransformer`: takes the **raw term frequencies from `CountVectorizer`** as input and transforms them into tf-idfs.
- It's slightly different from the standard tf-idf calculation: sklearn computes tf x (idf + 1), with idf = log((N + 1) / (df + 1)), where N is the total number of documents and df is the number of documents containing the term
- sklearn normalizes tf-idf by default (L2 normalization)
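
As a sanity check on the formula above, the calculation can be worked by hand in plain Python. This is a sketch under the assumptions stated in the bullets (natural log, smoothed idf plus 1, L2 normalization per document); the names `doc_freq`, `idf`, and `tfidf_row` are illustrative only.

```python
import math

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# document frequency: number of documents that contain each term
doc_freq = {t: sum(t in tokens for tokens in tokenized) for t in vocab}
# smoothed idf, plus 1 so that terms occurring in every document are not zeroed out
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    raw = [tokens.count(t) * idf[t] for t in vocab]   # tf x (idf + 1)
    norm = math.sqrt(sum(v * v for v in raw))         # L2 norm of the document vector
    return [v / norm for v in raw]

tfidf_rows = [tfidf_row(tokens) for tokens in tokenized]
for row in tfidf_rows:
    print([round(v, 2) for v in row])
```

For example, "is" appears in all three documents, so its idf term is log(4/4) + 1 = 1, while "shining" appears in two and gets log(4/3) + 1, roughly 1.29.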

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]


## Cleaning text data

In [12]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

The text contains HTML markup, punctuation, and other non-letter characters. We remove all of these except emoticons, using regular expressions (regex).

Python Regex: https://developers.google.com/edu/python/regular-expressions

In [13]:
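import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # collect emoticons such as :) ;-( =D
    # lowercase, replace runs of non-word characters with a space, then append the emoticons without the '-' nose
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text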

In [14]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [15]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [16]:
df['review']=df['review'].apply(preprocessor)

## Tokenization

In [17]:
def tokenizer(text):
    return text.split()

## Stemming and lemmatization
**Stemming**: Transforming a word into its root form. Here, we will use the Porter stemmer.

In [18]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [19]:
tokenizer_porter("runners like running and thus they run")

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Other stemmers: the **Snowball stemmer** (also called Porter2, an improved version of the Porter stemmer) and the **Lancaster stemmer** (faster but more aggressive).

**Lemmatization**: reduces words to their canonical, grammatically correct forms (lemmas). 
- Lemmatization is computationally more difficult and expensive compared to stemming.
- Some studies show stemming and lemmatization have little impact on the performance of text classification.

## Stop-word removal

In [20]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

# Logistic regression for document classification

In [21]:
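# .loc slices are end-inclusive, so .loc[:25000] would put row 25000 in both sets; .iloc is end-exclusive
X_train = df.iloc[:25000]['review'].values
y_train = df.iloc[:25000]['sentiment'].values
X_test = df.iloc[25000:]['review'].values
y_test = df.iloc[25000:]['sentiment'].values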

Grid search for hyperparameter tuning

In [23]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
param_grid = [{'vect__ngram_range':[(1,1)],
               'vect__stop_words':[stop, None],
               'vect__tokenizer':[tokenizer, tokenizer_porter],
               'clf__penalty':['l1', 'l2'],
               'clf__C':[1, 10, 100]},
              {'vect__ngram_range':[(1,1)],
               'vect__stop_words':[stop, None],
               'vect__tokenizer':[tokenizer, tokenizer_porter],
               'vect__use_idf':[False], # train the model on raw term frequencies (tf)
               'vect__norm':[None], # thus, no normalization
               'clf__penalty':['l1', 'l2'],
               'clf__C':[1, 10, 100]}]
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])  # liblinear supports both 'l1' and 'l2'
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 15.8min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 79.4min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 101.9min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__tokenizer': [<function tokenizer at 0x116a71158>, <function tokenizer_porter at 0x11cb34d08>], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', '...function tokenizer_porter at 0x11cb34d08>], 'clf__penalty': ['l1', 'l2'], 'vect__use_idf': [False]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
    

In [26]:
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

CV Accuracy: 0.897
Test Accuracy: 0.899
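
The search size reported in the log (48 candidates, 240 fits) follows directly from the shape of `param_grid`. The sketch below reproduces the arithmetic with the standard library only; the dictionaries `grid_1` and `grid_2` are illustrative stand-ins that record just the number of options per hyperparameter.

```python
from itertools import product

# number of options per hyperparameter, mirroring the list lengths in param_grid
grid_1 = {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2, 'penalty': 2, 'C': 3}
grid_2 = {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
          'use_idf': 1, 'norm': 1, 'penalty': 2, 'C': 3}

def n_candidates(grid):
    # enumerate every combination of option indices and count them
    return len(list(product(*(range(n) for n in grid.values()))))

candidates = n_candidates(grid_1) + n_candidates(grid_2)
fits = candidates * 5  # cv=5 folds per candidate
print(candidates, fits)
```

Each dictionary contributes 24 combinations, which is why adding even one more value to a single list noticeably lengthens an already long search.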


# Bigger data - online algorithms and out-of-core learning
It is computationally expensive to construct the feature vectors for the 50k movie reviews during grid search. In real-world problems, it's possible to have even larger datasets. **Out-of-core learning** can be useful for this type of situation.

We will use `partial_fit` of the `SGDClassifier` in sklearn to **stream the documents directly from the local drive and train a model using small minibatches of documents.**

## Define a `tokenizer` to pre-process text data

In [27]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    # search the original-case text: the pattern's D and P would never match after lowercasing
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]  # drop stop words
    return tokenized

## Define a generator `stream_docs` to read in and return one doc at a time
A good overview of Python generators is [here](http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do).
- Generators are iterators that produce one value at a time instead of holding every value in memory like a list does (good for memory).
- A generator can be iterated over only once because it does not store the values it has already produced.
- Instead of `return`, it uses `yield` to output a value.

Python `with`: the `with` statement is used when working with resources that must be released (such as file streams); it closes the file automatically when the block ends.
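
The one-pass behaviour of generators can be seen in a small self-contained sketch. It uses an in-memory `io.StringIO` object as a stand-in for a file on disk; the function `stream_lines` and the sample rows are made up for illustration.

```python
import io

def stream_lines(handle):
    """Yield one line at a time from an open file-like object, skipping the header."""
    next(handle)  # skip header
    for line in handle:
        yield line.rstrip('\n')

sample = 'review,sentiment\n"good movie",1\n"bad movie",0\n'
with io.StringIO(sample) as fake_file:
    gen = stream_lines(fake_file)
    first = next(gen)   # pulls exactly one value
    rest = list(gen)    # consumes everything that is left
    again = list(gen)   # the generator is now exhausted

print(first)
print(rest)
print(again)
```

Once the values have been consumed, iterating again yields nothing; a fresh generator has to be created to read the data a second time.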

In [48]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [56]:
stream_docs(path='./movie_data.csv')

<generator object stream_docs at 0x10abc21a8>

In [61]:
next(stream_docs(path='./movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

## Define a function to take a doc stream and return a particular number of documents (<span class="mark">batching</span>)

In [62]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
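
Note that `get_minibatch` returns `None, None` as soon as the stream runs out, even if some documents were already collected for that batch. If the final, smaller batch should be kept, one possible variant is sketched below. The name `get_minibatch_partial` is hypothetical, and the toy stream stands in for `stream_docs`.

```python
from itertools import islice

def get_minibatch_partial(doc_stream, size):
    """Return up to `size` documents; (None, None) only when the stream is empty."""
    batch = list(islice(doc_stream, size))  # takes at most `size` items
    if not batch:
        return None, None
    docs, y = zip(*batch)  # split (text, label) pairs into two sequences
    return list(docs), list(y)

toy_stream = iter([('doc a', 1), ('doc b', 0), ('doc c', 1)])
b1 = get_minibatch_partial(toy_stream, 2)
b2 = get_minibatch_partial(toy_stream, 2)
b3 = get_minibatch_partial(toy_stream, 2)
print(b1, b2, b3)
```

Whether to keep or drop the last partial batch is a design choice; dropping it keeps every batch the same size, while keeping it uses all of the data.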

## Build a vectorizer (feature extraction)

We can't use `CountVectorizer` because it requires holding the complete vocabulary in memory. Likewise, `TfidfVectorizer` needs to keep all feature vectors of the training dataset in memory to calculate the idfs. 

However, another useful vectorizer for text processing is implemented in sklearn, called `HashingVectorizer`
- Data-independent
- Makes use of the Hashing trick via 32-bit MurmurHash3 algorithm.
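
The hashing trick itself is simple enough to sketch with the standard library. The example below is an illustration only: it uses MD5 from `hashlib` instead of MurmurHash3, a tiny `n_features`, and a hypothetical helper `hash_vectorize`. As far as I know, sklearn's `HashingVectorizer` also applies a sign to each hashed value and normalizes the result by default, which this sketch leaves out.

```python
import hashlib

def hash_vectorize(tokens, n_features=16):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        idx = int.from_bytes(digest[:4], 'little') % n_features  # column index from the hash
        vec[idx] += 1
    return vec

v1 = hash_vectorize('the sun is shining'.split())
v2 = hash_vectorize('the sun is shining'.split())
print(v1)
```

Because the column index depends only on the token, no fitting step is needed and any batch of documents can be transformed independently. The price is that different tokens may collide in the same column, which is why a large `n_features` is used in practice.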

In [65]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tokenizer)
clf = SGDClassifier(loss='log_loss', random_state=1)  # 'log' was renamed 'log_loss' in scikit-learn 1.1; the old n_iter argument has been removed
doc_stream = stream_docs(path='./movie_data.csv')

## Start the out-of-core learning
- First, we initialize the progress bar object with 45 iterations.
- In the for loop, we iterate over 45 minibatches of 1,000 documents each.

In [69]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:30


In [70]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


One detail worth noting: `get_minibatch` returns `None, None` when `StopIteration` is raised, that is, when the stream runs out of documents. Because `stream_docs` combines **`with`, `for`, and `yield`**, it passes over the documents exactly once; after that, the generator is exhausted.

In [73]:
get_minibatch(doc_stream, size=5000)

(None, None)

We can even <span class="mark">update our model</span> with `partial_fit`.

In [74]:
clf = clf.partial_fit(X_test, y_test)

The bag-of-words model is still commonly used but has limitations.
- It doesn't consider grammar.
- It doesn't consider sentence structure.

Alternative 1: **Latent Dirichlet allocation (LDA)** (topic model)
- Considers the latent semantics of words

Alternative 2: **Word2vec**
- Unsupervised learning algorithm based on neural networks
- Attempts to learn the relationships between words