In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *   # texts_labels_from_folders()
from sklearn.linear_model import LogisticRegression

## IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB, with an equal number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. The dataset is divided evenly into training and test sets of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing and term document matrix creation

In [2]:
PATH='data/aclImdb/'
names = ['neg','pos']   # 0 is negative

In [3]:
%ls {PATH}

imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/


In [4]:
%ls {PATH}train

[0m[01;34mall[0m/             [01;34mneg[0m/  [01;34munsup[0m/         urls_neg.txt  urls_unsup.txt
labeledBow.feat  [01;34mpos[0m/  unsupBow.feat  urls_pos.txt


In [5]:
%ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt
ls: write error


In [6]:
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)
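
To make the loading step concrete, here is a minimal sketch of what a folder-based loader does. `read_texts_labels` is a hypothetical stand-in written for illustration, not fastai's implementation, and it assumes the label of each review is the index of its folder in the list (`'neg'` is 0, `'pos'` is 1). It runs against a throwaway directory rather than the real dataset.

```python
import os
import tempfile

def read_texts_labels(path, folders):
    """Hypothetical stand-in for texts_labels_from_folders (not fastai's code)."""
    texts, labels = [], []
    for idx, label in enumerate(folders):          # 'neg' -> 0, 'pos' -> 1
        folder = os.path.join(path, label)
        for fname in sorted(os.listdir(folder)):
            with open(os.path.join(folder, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(idx)
    return texts, labels

# Tiny throwaway directory standing in for data/aclImdb/train
demo_root = tempfile.mkdtemp()
for label, text in [('neg', 'dull and slow'), ('pos', 'a charming film')]:
    os.makedirs(os.path.join(demo_root, label))
    with open(os.path.join(demo_root, label, '0_1.txt'), 'w', encoding='utf-8') as f:
        f.write(text)

demo_texts, demo_labels = read_texts_labels(demo_root, ['neg', 'pos'])
print(demo_texts, demo_labels)
```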

Here is the text of the first review:

In [7]:
trn[0]  # trn is array of reviews

'I\'m a great admirer of Lon Chaney, but the screen writing of this movie just did not work for me. The story jumps around oddly (I\'ve since learned that the film is missing a section), and characters appear and disappear with irritating suddenness. Some of the intertitles are overly explanatory (e.g., "why, you\'re not a child anymore!"--cut back to picture for a long, slow beat--"you\'re a woman!" yes, we got it the first time) but there are a few talking sequences that beg for explanations that never appear. (Let\'s hear Luigi and his blond girlfriend\'s argument, please!) The plot, which involves incestuous desires (figuratively if not technically) was disturbing to the point that it was hard to watch. To the writer\'s credit, this issue was treated as a problem, and a May-December match is not portrayed as the right-and-good inevitability of some Mary Pickford films (e.g., "Daddy-Long-Legs"). Chaney gives a good performance, as usual, but I think he has been better-directed in th

In [8]:
trn_y[0]

0

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) (part of `sklearn.feature_extraction.text`) converts a collection of text documents to a matrix of token counts.
Below we pass it the tokenizer from fastai instead of its default one.

In [9]:
veczr = CountVectorizer(tokenizer=tokenize)

`fit_transform(trn)` finds the vocabulary in the training set and transforms the training set into a term-document matrix. Since we have to apply the *same transformation* to the validation set, the second line uses just the method `transform(val)`, which reuses that vocabulary. `trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document `i`: one row holding, for each token in the vocabulary, the number of times it occurs in that document.

In [10]:
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)  #use the same vocabulary for the validation set
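
The fit/transform split can be sketched by hand on a toy corpus. This is a simplified illustration, not scikit-learn's implementation: it assumes whitespace tokenization, lowercasing, and an alphabetically sorted vocabulary, and the helper names `fit_vocabulary` and `to_term_doc` are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map every token seen in docs to a column index (sorted alphabetically)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_term_doc(docs, vocabulary):
    """One row per document, one column per vocabulary token; unknown tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

toy_trn = ['good good movie', 'bad movie']
toy_val = ['good film']                       # 'film' is not in the training vocabulary

toy_vocab = fit_vocabulary(toy_trn)           # analogous to the "fit" step
toy_trn_td = to_term_doc(toy_trn, toy_vocab)  # analogous to fit_transform(trn)
toy_val_td = to_term_doc(toy_val, toy_vocab)  # analogous to transform(val)
print(toy_vocab, toy_trn_td, toy_val_td)
```

Note how the validation row has no column for `'film'`: tokens that were never seen during fitting are simply ignored.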

In [11]:
trn_term_doc     # stored as a sparse matrix; 75132 is the number of unique tokens in the vocabulary

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

In [12]:
trn_term_doc[0]    

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 153 stored elements in Compressed Sparse Row format>

In [13]:
vocab = veczr.get_feature_names(); vocab[5000:5005]

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

Above is a sample of 5 tokens from the vocabulary.
Below are the words of the first review, this time obtained by lowercasing and splitting on spaces instead of using the tokenizer.

In [14]:
w0 = set([o.lower() for o in trn[0].split(' ')]); w0   # set of all unique lowercased, space-separated words in trn[0]

{'"daddy-long-legs").',
 '"why,',
 '(e.g.,',
 '(figuratively',
 "(i've",
 "(let's",
 'a',
 'admirer',
 'and',
 'anymore!"--cut',
 'appear',
 'appear.',
 'are',
 'argument,',
 'around',
 'as',
 'astonishingly',
 'at',
 'back',
 'beat--"you\'re',
 'been',
 'beg',
 'better-directed',
 'blond',
 'but',
 'chaney',
 'chaney,',
 'characters',
 'charming,',
 'child',
 'clown',
 'credit,',
 'desires',
 'did',
 'disappear',
 'disturbing',
 'enjoy',
 'explanations',
 'explanatory',
 'far,',
 'few',
 'film',
 'films',
 'first',
 'for',
 "girlfriend's",
 'gives',
 'good',
 'got',
 'great',
 'hard',
 'has',
 'he',
 'hear',
 'here,',
 'his',
 'i',
 "i'm",
 'if',
 'imho.',
 'impressed',
 'in',
 'incestuous',
 'inevitability',
 'intertitles',
 'involves',
 'irritating',
 'is',
 'issue',
 'it',
 "it's",
 'its',
 'jumps',
 'just',
 'learned',
 'least-favorite',
 'lon',
 'long,',
 'loretta',
 'luigi',
 'mary',
 'match',
 'may-december',
 'me.',
 'missing',
 'moments,',
 'movie',
 'my',
 'never',
 'not',
 

In [15]:
len(w0)    # how many unique space-separated words in the first doc?

143

In [16]:
vy = veczr.vocabulary_['young']; vy

74487

In [17]:
trn_term_doc[0,veczr.vocabulary_['young']]   # How many times the word 'young' appears in the doc

2

In [18]:
trn_term_doc[0,5000] # How many times the word 'aussie' appears in the doc

0

## Naive Bayes

https://en.wikipedia.org/wiki/Naive_Bayes_classifier

https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

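where the ratio of feature $f$ in positive documents is the number of times a positive document has the feature, divided by the number of positive documents. The code below adds 1 to both the numerator and the denominator so that no ratio is ever zero.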

In [19]:
def pr(y_i):
    p = x[y==y_i].sum(0)   # per-token totals over the documents of class y_i
    return (p+1) / ((y==y_i).sum()+1)

In [20]:
x=trn_term_doc # independent variable 
y=trn_y    # dependent variable

r = np.log(pr(1)/pr(0))
b = np.log((y==1).mean() / (y==0).mean())

In [21]:
r,b

(matrix([[ 0.69315, -0.69315, -0.07511, ...,  0.69315,  0.     , -2.07944]]),
 0.0)
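
The log-count ratio can be checked by hand on a toy matrix. The sketch below mirrors `pr` but takes `x` and `y` as arguments instead of reading globals. The data, the three-token vocabulary, and the `toy_` names are made up for illustration, and the matrix is already binarized so that the sums count documents.

```python
import numpy as np

# Toy binarized term-document matrix: rows are documents, columns are tokens
# (hypothetical vocabulary: ['bad', 'good', 'movie'])
toy_x = np.array([[0, 1, 1],    # positive
                  [0, 1, 0],    # positive
                  [1, 0, 1],    # negative
                  [1, 0, 0]])   # negative
toy_y = np.array([1, 1, 0, 0])

def toy_pr(x, y, y_i):
    """Smoothed fraction of class-y_i documents that contain each token."""
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

toy_r = np.log(toy_pr(toy_x, toy_y, 1) / toy_pr(toy_x, toy_y, 0))
toy_b = np.log((toy_y == 1).mean() / (toy_y == 0).mean())

# Score unseen documents: a positive score means "predicted positive"
toy_val = np.array([[0, 1, 1],    # 'good movie'
                    [1, 0, 1]])   # 'bad movie'
toy_preds = (toy_val @ toy_r + toy_b) > 0
print(toy_r, toy_b, toy_preds)
```

Here `'bad'` gets $r = -\log 3$, `'good'` gets $r = \log 3$, and `'movie'`, which appears equally often in both classes, gets $r = 0$ and so has no influence on the score.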

In [22]:
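#This is the original from class
x=trn_term_doc # independent variable 
y=trn_y    # dependent variable
p = x[y==1].sum(0)+1   # per-token counts over the positive reviews, +1 for smoothing
q = x[y==0].sum(0)+1   # per-token counts over the negative reviews, +1 for smoothing
# Caution: division is left-associative, so this computes p/(p.sum()*q*q.sum()),
# not the intended (p/p.sum()) / (q/q.sum()); every entry of r comes out negative.
r = np.log(p/p.sum()/q/q.sum()) # logs let us add rather than multiply, which is more stable
# Caution: p and q are 1 x n matrices, so len(p) == len(q) == 1 and b is always log(1) = 0.
b = np.log(len(p) / len(q))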

In [23]:
r, b

(matrix([[-29.59107, -30.97736, -30.35933, ..., -29.59107, -30.28422, -32.36366]]),
 0.0)
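
The very negative values of `r` above are what a chained division produces. As a small illustration with made-up counts (the names `p_toy` and `q_toy` are hypothetical), compare the chained expression with the explicitly grouped ratio:

```python
import numpy as np

p_toy = np.array([3.0, 1.0])   # hypothetical smoothed counts in positive reviews
q_toy = np.array([1.0, 3.0])   # hypothetical smoothed counts in negative reviews

# Division is left-associative: this is ((p / p.sum()) / q) / q.sum()
chained = p_toy / p_toy.sum() / q_toy / q_toy.sum()

# Ratio of the two normalized count vectors
grouped = (p_toy / p_toy.sum()) / (q_toy / q_toy.sum())

print(np.log(chained))   # both entries are negative
print(np.log(grouped))   # the sign follows which class uses the token more
```

With the grouped form, the first token gets a positive log-ratio and the second a negative one; with the chained form, both are negative regardless of the counts.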

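Here is the Naive Bayes prediction. Notice that `r` and `b` are computed directly from counts; we are not "learning" anything by optimization. The cell below uses the `r` from cell 23, whose entries are all negative, so every review gets a negative score and is predicted negative. On a test set with equal numbers of positive and negative reviews, that yields an accuracy of 0.5.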

In [24]:
pre_preds = val_term_doc @ r.T + b   # matrix multiply and add bias
preds = pre_preds.T>0
(preds==val_y).mean()

0.5

It turns out that the number of appearances of a word in a document is not very important; e.g., "absurd" appearing once or twice means about the same for our purposes.

Thus we can try **binarized Naive Bayes**, where counts are replaced by presence indicators using `sign()`, which maps any count > 0 to 1 and leaves zeros unchanged.

In [25]:
x=trn_term_doc.sign()
r = np.log(pr(1)/pr(0))

pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.83016000000000001

Got a better result with the binarized version.
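
To see what binarization does to a count matrix, here is a tiny dense example using `np.sign`; the sparse matrix method `sign()` used in the notebook applies the same operation element by element. The counts are made up.

```python
import numpy as np

toy_counts = np.array([[2, 0, 1],
                       [0, 5, 0]])
toy_binary = np.sign(toy_counts)   # any count > 0 becomes 1, zeros stay 0
print(toy_binary)
```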

## Logistic regression

Above we were using *naive* Bayes, which ignores that words can be correlated, i.e., they are NOT independent.

Alternatively, instead of using the coefficients $r$ and $b$ from above, we can try to learn them.

Here is how we can fit logistic regression where the features are the unigrams.

In [26]:
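m = LogisticRegression(C=1e8, dual=True)   # C=1e8 means almost no regularization; dual=True is faster when there are more features than documents
m.fit(x, y) # note: x was rebound to trn_term_doc.sign() in cell 25, so this fits on binarized counts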
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.83191999999999999

So we got a slightly better result via learning (0.832 versus 0.830).

Now let's try the binarized version (again using `sign()`).

In [27]:
m = LogisticRegression(C=1e8, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.85484000000000004

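With about 75,000 coefficients and only 25,000 training documents, the model can easily overfit, so we should try regularization: the same model with an added penalty on the size of the coefficients (weight decay).

Logistic regression is fit by minimizing cross-entropy loss, which is designed for classification.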


In `LogisticRegression`, the `C` parameter (a positive float, default 1.0) is the inverse of the regularization strength.
Like in support vector machines, smaller values specify stronger regularization.
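
As a sketch of why a smaller `C` means stronger regularization, the L2-penalized objective that scikit-learn documents for logistic regression can be written as follows, assuming the labels are recoded as $y_i \in \{-1, +1\}$, with $w$ the coefficient vector and $c$ the intercept:

$$\min_{w,\,c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^T w + c\right)\right)\right)$$

`C` multiplies the data-fit (cross-entropy) term, so shrinking `C` makes the penalty $\frac{1}{2} w^T w$ relatively more important, while a huge value such as `C=1e8` makes the penalty negligible.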

In [28]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.84872000000000003

In [29]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.88404000000000005

### Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use `bigrams` and `trigrams` too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.

In [30]:
veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

Above, `max_features` builds a vocabulary that only considers the top `max_features` n-grams ordered by term frequency across the corpus; any n-gram outside that top 800,000 is ignored.
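
The effect of `ngram_range=(1,3)` and `max_features` can be sketched on a toy corpus. This is an illustration under simplifying assumptions (whitespace tokenization, a hypothetical `ngrams` helper), not scikit-learn's implementation; `k` is chosen so that no tie has to be broken at the cutoff.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams of length n_min..n_max, joined with spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

toy_docs = ['not a good movie', 'a good movie']
ngram_counts = Counter(g for doc in toy_docs for g in ngrams(doc.split()))

# Keep only the k most frequent n-grams, in the spirit of max_features
k = 6
toy_ngram_vocab = sorted(g for g, _ in ngram_counts.most_common(k))
print(toy_ngram_vocab)
```

The three n-grams containing `'not'` occur only once and fall outside the top 6, so they are dropped from the vocabulary.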

In [31]:
trn_term_doc.shape   

(25000, 800000)

In [32]:
vocab = veczr.get_feature_names()

In [33]:
vocab[200000:200005]

['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']

In [34]:
y=trn_y   # labels
x=trn_term_doc.sign()    # binarized independent variables
val_x = val_term_doc.sign()

In [35]:
r = np.log(pr(1) / pr(0))
b = np.log((y==1).mean() / (y==0).mean())

Here we fit regularized logistic regression where the features are the binarized unigrams, bigrams and trigrams.

In [36]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

0.90500000000000003

Here is the $\text{log-count ratio}$ `r`.  

In [37]:
r.shape, r

((1, 800000),
 matrix([[-0.05468, -0.161  , -0.24784, ...,  1.09861, -0.69315, -0.69315]]))

In [38]:
np.exp(r)

matrix([[ 0.94678,  0.85129,  0.78049, ...,  3.     ,  0.5    ,  0.5    ]])

Here we fit regularized logistic regression where each binarized n-gram feature is multiplied by its log-count ratio.
JW: Class 11 26'

In [39]:
x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()

0.91768000000000005
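
The scaling step `x.multiply(r)` is an element-wise product in which the single row `r` is broadcast across all documents. A dense toy version with made-up numbers (the `toy_` names and ratio values are hypothetical):

```python
import numpy as np

toy_x = np.array([[1, 0, 1],
                  [0, 1, 1]])            # binarized features, one row per document
toy_r = np.array([[0.5, -2.0, 0.0]])     # hypothetical log-count ratios, shape (1, 3)

toy_x_nb = toy_x * toy_r                 # broadcasting scales column j by toy_r[0, j]
print(toy_x_nb)
```

A feature that is present keeps its log-count ratio as its value, an absent feature stays at zero, and a feature whose ratio is zero (third column) contributes nothing to either document.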

## fastai NBSVM++

In [40]:
sl=2000   # up to 2000 unique words 

In [41]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [42]:
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [None]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

In [None]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)