<a href="https://colab.research.google.com/github/hduongck/AI-ML-Learning/blob/master/Fastai%20NLP%20course/3_Sentiment_Classification_of_movie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sentiment Classification of Movie Reviews (using Naive Bayes, Logistic Regression, and Ngrams)

The purpose of this notebook is to cover Naive Bayes, Logistic regression, and ngrams (some pretty classic techniques!) for sentiment classification. We will be using sklearn and the fastai library.

In a future lesson, we will tackle this same problem of sentiment classification using deep learning, so that you can compare the two approaches

The content here was extended from [Lesson 10 of the fast.ai Machine Learning course](https://course.fast.ai/lessonsml1/lesson10.html). Linear model is pretty close to the state of the art here. Jeremy surpassed state of the art using a RNN in fall 2017.

## The fastai library

We will begin using the fastai library (version 1.0) in this notebook. We will use it more once we move on to neural networks.

The fastai library is built on top of PyTorch and encodes many state-of-the-art best practices. It is used in production at a number of companies. You can read more about it here:

- [Fast.ai's software could radically democratize AI](https://www.zdnet.com/article/fast-ais-new-software-could-radically-democratize-ai/) (ZDNet)

- [fastai v1 for PyTorch: Fast and accurate neural nets using modern best practices](https://www.fast.ai/2018/10/02/fastai-ai/) (fast.ai)

- [fastai docs](https://docs.fast.ai/)



## Installation

With conda:

`conda install -c pytorch -c fastai fastai=1.0`

Or with pip:

`pip install fastai==1.0`

More installation information here.

Beginning in lesson 4, we will be using GPUs, so if you want, you could switch to a cloud option now to setup fastai.

##IMDB dataset

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB, We will use the version hosted as part [fast.ai datasets](https://course.fast.ai/datasets.html) on AWS Open Datasets.

The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets. The training set is the same 25,000 labeled reviews.

The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [0]:
from fastai import *
from fastai.text import *
import sklearn.feature_extraction as sklearn_text

## Tokenizing and term document matrix creation

fast.ai has a number of datasets hosted via [AWS Open Datasets](https://course.fast.ai/datasets.html) for easy download. We can see them by checking the docs for URLs (remember ?? is a helpful command):

In [0]:
#?? URLs


It is always good to start working on a sample of your data before you use the full dataset-- this allows for quicker computations as you debug and get your code working. For IMDB, there is a sample dataset already available:

In [0]:
path = untar_data(URLs.IMDB_SAMPLE)
path

PosixPath('/root/.fastai/data/imdb_sample')


We are not going to use this dataframe, but are just loading it to get a sense of what our data looks like:

In [0]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


We will be using TextList from the fastai library:

In [0]:
movie_reviews = (TextList.from_csv(path,'texts.csv',cols='text')
                         .split_from_df(col=2)
                         .label_from_df(cols=0))

## Exploring what our data looks like

A good first step for any data problem is to explore the data and get a sense of what it looks like. In this case we are looking at movie reviews, which have been labeled as "positive" or "negative":

In [0]:
movie_reviews.valid.x[0],movie_reviews.valid.y[0]

(Text xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . 
  
   xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . 
  
   xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk ,

In NLP, a **token** is the basic unit of processing (what the tokens are depends on the application and your choices). Here, the tokens mostly correspond to words or punctuation, as well as several special tokens, corresponding to unknown words, capitalization, etc.

All those tokens starting with "xx" are fastai special tokens. You can see the list of all of them and their meanings ([in the fastai docs](https://docs.fast.ai/text.transform.html)):

here is the meaning of the special tokens:

- UNK (xxunk) is for an unknown word (one that isn't present in the current vocabulary)
- PAD (xxpad) is the token used for padding, if we need to regroup several texts of different lengths in a batch
- BOS (xxbos) represents the beginning of a text in your dataset
- FLD (xxfld) is used if you set mark_fields=True in your TokenizeProcessor to separate the different fields of texts (if your texts are loaded from several columns in a dataframe)
- TK_MAJ (xxmaj) is used to indicate the next word begins with a capital in the original text
- TK_UP (xxup) is used to indicate the next word is written in all caps in the original text
- TK_REP (xxrep) is used to indicate the next character is repeated n times in the original text (usage xxrep n {char})
- TK_WREP(xxwrep) is used to indicate the next word is repeated n times in the original text (usage xxwrep n {word})

In [0]:
len(movie_reviews.train.x), len(movie_reviews.valid.x)

(800, 200)

Notice that **ints-to-string** and **string-to-ints** have different lengths. Think for a moment about why this is.

In [0]:
len(movie_reviews.vocab.itos),len(movie_reviews.vocab.stoi)

(6008, 19161)

In [0]:
movie_reviews.vocab.itos[220:230]

['bit',
 'might',
 'things',
 'horror',
 'us',
 'almost',
 'may',
 'right',
 'must',
 'away']

You have integer and you want to know the vocabulary word does this correspond with. In others meaning, we are going to end up representing our reviews as an array of numbers, but if you want to go from those numbers back to words, you would need to use this into string

The word order is based on frequency

In [0]:
movie_reviews.vocab.stoi['language']

917

Now we are taking our words to map them to integers.

In [0]:
movie_reviews.vocab.itos[917]

'language'

Back to our question: why itos and stoi have different lengths. Because 'language' can be shown many times in text but the '917' is only one

In [0]:
movie_reviews.vocab.stoi

defaultdict(int,
            {'xxunk': 0,
             'xxpad': 1,
             'xxbos': 2,
             'xxeos': 3,
             'xxfld': 4,
             'xxmaj': 5,
             'xxup': 6,
             'xxrep': 7,
             'xxwrep': 8,
             'the': 9,
             '.': 10,
             ',': 11,
             'and': 12,
             'a': 13,
             'of': 14,
             'to': 15,
             'is': 16,
             'it': 17,
             'in': 18,
             'i': 19,
             'that': 20,
             'this': 21,
             '"': 22,
             "'s": 23,
             '\n \n ': 24,
             '-': 25,
             'was': 26,
             'as': 27,
             'for': 28,
             'movie': 29,
             'with': 30,
             'but': 31,
             'film': 32,
             'you': 33,
             ')': 34,
             'on': 35,
             '(': 36,
             "n't": 37,
             'are': 38,
             'he': 39,
             'his': 40,
       

Let's test that a non-word maps to xxunk:

In [0]:
movie_reviews.vocab.itos[movie_reviews.vocab.stoi['rrachell']]

'xxunk'

In [0]:
movie_reviews.vocab.itos[movie_reviews.vocab.stoi['language']]

'language'

In [0]:
t= movie_reviews.train[0][0]
t.data[:30]

array([   2,    5, 4619,   25,    0,   25,  867,   52,    5, 3776,    5, 1800,   95,   37,   85,  191,   64,  935,
          0, 2738,  517,   18,   21,   11,   84, 2417,  192,   88, 3777,   64])

# Creating our term-document matrix

As we covered in the last lesson, a term-document matrix represents a document as a "bag of words", that is, we don't keep track of the order the words are in, just which words occur (and how often).

In the previous lesson, we used [sklearn's CountVectorizer](https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/feature_extraction/text.py#L940). Today we will create our own (similar) version. This is for two reasons:

- to understand what sklearn is doing underneath the hood
- to create something that will work with a fastai TextList

To create our term-document matrix, we first need to learn about **counters** and **sparse matrices**.

**Counters**

Counters are a useful Python object. If you aren't familar with them, here is how they work:

In [0]:
c = Counter([4,2,8,8,4,8])
c

Counter({2: 1, 4: 2, 8: 3})

In [0]:
c.values()

dict_values([2, 1, 3])

In [0]:
c.keys()

dict_keys([4, 2, 8])

## Sparse Matrices (in Scipy)

Even though we've reduced over 19,000 words down to 6,000, that is still a lot! Most tokens don't appear in most reviews. We want to take advantage of this by storing our data as a **sparse matrix**.

A matrix with lots of zeros is called **sparse** (the opposite of sparse is **dense**). For sparse matrices, you can save a lot of memory by only storing the non-zero values.

![alt text](https://github.com/fastai/course-nlp/raw/85e505295efeed88ce61dc0ff5e424bde9741a15/images/sparse.png)

Another example of a large, sparse matrix:

![alt text](https://github.com/fastai/course-nlp/raw/85e505295efeed88ce61dc0ff5e424bde9741a15/images/Finite_element_sparse_matrix.png)

[Source](https://commons.wikimedia.org/w/index.php?curid=2245335)

There are the most common sparse storage formats:

- coordinate-wise (scipy calls COO)
- compressed sparse row (CSR)
- compressed sparse column (CSC)

Let's walk through [these examples](http://www.mathcs.emory.edu/~cheung/Courses/561/Syllabus/3-C/sparse.html) --> [32:00](https://youtu.be/hp2ipC5pW4I?t=1978)

There are actually [many more formats](http://www.cs.colostate.edu/~mcrob/toolbox/c++/sparseMatrix/sparse_matrix_compression.html) as well.

A class of matrices (e.g, diagonal) is generally called sparse if the number of non-zero elements is proportional to the number of rows (or columns) instead of being proportional to the product rows x columns.

**Scipy Implementation**

From the [Scipy Sparse Matrix Documentation](https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html)

To construct a matrix efficiently, use either dok_matrix or lil_matrix. The lil_matrix class supports basic slicing and fancy indexing with a similar syntax to NumPy arrays. As illustrated below, the COO format may also be used to efficiently construct matrices
To perform manipulations such as multiplication or inversion, first convert the matrix to either CSC or CSR format.
All conversions among the CSR, CSC, and COO formats are efficient, linear-time operations.



## Our version of CountVectorizer

In [0]:
def get_term_doc_matrix(label_list,vocab_len):
    j_indices = []
    indptr = []
    values = []
    indptr.append(0)
    
    for i, doc in enumerate(label_list):
        feature_counter = Counter(doc.data)
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))
        
    # return(values,j_indices,indptr)
    return scipy.sparse.csr_matrix((values,j_indices,indptr),
                                   shape = (len(indptr)-1,vocab_len),
                                   dtype = int)
        

### feature_counter = Counter(doc.data)


In [0]:
movie_reviews.train.x[0] 


Text xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !

In [0]:
movie_reviews.train.x[0].data 

array([   2,    5, 4619,   25, ...,   10,    5,    0,   52])

taking the frequency of different words and using the Counter to get dictionary of frequencies, getting the key and values .

In [0]:
Counter(movie_reviews.train.x[0].data )

Counter({0: 8,
         2: 1,
         5: 15,
         9: 2,
         10: 3,
         11: 1,
         13: 1,
         14: 2,
         15: 1,
         18: 1,
         21: 3,
         25: 3,
         26: 2,
         35: 2,
         37: 1,
         40: 1,
         43: 1,
         48: 1,
         52: 5,
         54: 1,
         63: 1,
         64: 2,
         65: 1,
         72: 2,
         73: 1,
         84: 1,
         85: 1,
         88: 1,
         90: 1,
         91: 1,
         95: 1,
         99: 1,
         103: 1,
         127: 1,
         191: 1,
         192: 1,
         219: 1,
         228: 1,
         245: 1,
         269: 1,
         273: 1,
         282: 1,
         355: 1,
         492: 1,
         517: 1,
         534: 1,
         616: 1,
         796: 1,
         867: 1,
         935: 1,
         1144: 1,
         1145: 1,
         1213: 1,
         1360: 1,
         1361: 1,
         1447: 1,
         1800: 1,
         1801: 1,
         2417: 1,
         2418: 1,
     

j_indices.extend(feature_counter.keys())

values.extend(feature_counter.values())

The adding them on to kind of coordinates that we want to store.


In [0]:
%%time
val_term_doc = get_term_doc_matrix(movie_reviews.valid.x,len(movie_reviews.vocab.itos))

CPU times: user 51 ms, sys: 8.76 ms, total: 59.8 ms
Wall time: 55.5 ms


In [0]:
%%time
trn_term_doc = get_term_doc_matrix(movie_reviews.train.x,len(movie_reviews.vocab.itos))

CPU times: user 212 ms, sys: 20.9 ms, total: 233 ms
Wall time: 214 ms


In [0]:
trn_term_doc

<800x6008 sparse matrix of type '<class 'numpy.int64'>'
	with 112405 stored elements in Compressed Sparse Row format>

In [0]:
trn_term_doc[:10,:10]

<10x10 sparse matrix of type '<class 'numpy.int64'>'
	with 45 stored elements in Compressed Sparse Row format>

In 10x10 section of the matrix, there are 55 spot are 0 while 45 spot have values . If we want to see it in full, we can convert sparse matrix to dense matrix

In [0]:
trn_term_doc.todense()[:10,:10]

matrix([[ 8,  0,  1,  0, ...,  0,  0,  0,  2],
        [22,  0,  1,  0, ...,  2,  0,  0, 27],
        [ 4,  0,  1,  0, ...,  2,  0,  0,  5],
        [13,  0,  1,  0, ...,  0,  0,  0, 16],
        ...,
        [ 4,  0,  1,  0, ...,  0,  0,  0, 19],
        [42,  0,  1,  0, ..., 14,  0,  0, 30],
        [18,  0,  1,  0, ...,  0,  0,  0, 15],
        [20,  0,  1,  0, ...,  1,  0,  0, 10]])

In [0]:
trn_term_doc[:10,-10:]

<10x10 sparse matrix of type '<class 'numpy.int64'>'
	with 0 stored elements in Compressed Sparse Row format>

In this case, there are zero elements stored in the first 10 entries but the last ten vocabulary words

In [0]:
val_term_doc.shape

(200, 6008)

### More data exploration

We could convert our sparse matrix to a dense matrix:

In [0]:
movie_reviews.vocab.itos[:4]

['xxunk', 'xxpad', 'xxbos', 'xxeos']

In [0]:
val_term_doc.todense()[:10,:10]

matrix([[32,  0,  1,  0, ...,  1,  0,  0, 10],
        [ 9,  0,  1,  0, ...,  1,  0,  0,  7],
        [ 6,  0,  1,  0, ...,  0,  0,  0, 12],
        [78,  0,  1,  0, ...,  0,  0,  0, 44],
        ...,
        [ 8,  0,  1,  0, ...,  0,  0,  0,  8],
        [43,  0,  1,  0, ...,  8,  1,  0, 25],
        [ 7,  0,  1,  0, ...,  1,  0,  0,  9],
        [19,  0,  1,  0, ...,  2,  0,  0,  5]])

### Excercise:

[Video 5](https://youtu.be/dt7sArnLo1g?t=487)



In [0]:
review = movie_reviews.valid.x[1]; review

Text xxbos i saw this movie once as a kid on the late - late show and fell in love with it . 
 
  xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . 
 
  xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk

**Exercise:** Since the word "late" shows up twice in this review ("...as a kid on the late - late show..."), confirm that a value of 2 is stored in the term-document matrix, for the row corresponding to this review and the column corresponding to the word "late".

In [0]:
#first, look up the integer of 'late'
print(movie_reviews.vocab.stoi['late'])

# then use term doc to find the values

val_term_doc[1,451]


451


array(2)

This review has total 144 tokens and it has 81 distinct tokens

In [0]:
val_term_doc[1].sum()

144

In [0]:
val_term_doc[1]

<1x6008 sparse matrix of type '<class 'numpy.int64'>'
	with 81 stored elements in Compressed Sparse Row format>

**Exercise:** How could you convert review.data back to text (without just using review.text)?

In [0]:
review.data

array([  2,  19, 248,  21, ...,   9,   0,  10,   0])

In [0]:
# use the itos

[movie_reviews.vocab.itos[a] for a in review.data]

['xxbos',
 'i',
 'saw',
 'this',
 'movie',
 'once',
 'as',
 'a',
 'kid',
 'on',
 'the',
 'late',
 '-',
 'late',
 'show',
 'and',
 'fell',
 'in',
 'love',
 'with',
 'it',
 '.',
 '\n \n ',
 'xxmaj',
 'it',
 'took',
 '30',
 '+',
 'years',
 ',',
 'but',
 'i',
 'recently',
 'did',
 'find',
 'it',
 'on',
 'xxup',
 'dvd',
 '-',
 'it',
 'was',
 "n't",
 'cheap',
 ',',
 'either',
 '-',
 'in',
 'a',
 'xxunk',
 'that',
 'xxunk',
 'in',
 'war',
 'movies',
 '.',
 'xxmaj',
 'we',
 'watched',
 'it',
 'last',
 'night',
 'for',
 'the',
 'first',
 'time',
 '.',
 'xxmaj',
 'the',
 'audio',
 'was',
 'good',
 ',',
 'however',
 'it',
 'was',
 'grainy',
 'and',
 'had',
 'the',
 'trailers',
 'between',
 'xxunk',
 '.',
 'xxmaj',
 'even',
 'so',
 ',',
 'it',
 'was',
 'better',
 'than',
 'i',
 'remembered',
 'it',
 '.',
 'i',
 'was',
 'also',
 'impressed',
 'at',
 'how',
 'true',
 'it',
 'was',
 'to',
 'the',
 'play',
 '.',
 '\n \n ',
 'xxmaj',
 'the',
 'xxunk',
 'is',
 'around',
 'here',
 'xxunk',
 '.',
 'xxmaj'

**Exercise:** Confirm that review has 81 distinct tokens

In [0]:
len(set(review.data))

81

## stoi (string-to-int) is larger than itos (int-to-string) 

since many words are mapping to unknown. We will figure out that what words were getting mapped to unknown. 

In [0]:
len(movie_reviews.vocab.stoi) - len(movie_reviews.vocab.itos)

13154

We write a little loop to go through the item and see if they map to unknown and then put them in a list 

In [0]:
unk = []
for word, num in movie_reviews.vocab.stoi.items():
    if num == 0:
        unk.append(word)

In [0]:
len(unk)

13155

In [0]:
unk[:100]

['xxunk',
 'bleeping',
 'pert',
 'ticky',
 'schtick',
 'whoosh',
 'banzai',
 'chill',
 'wooofff',
 'cheery',
 'superstars',
 'fashionable',
 'cruelly',
 'separating',
 'mistreat',
 'tensions',
 'religions',
 'baseness',
 'nobility',
 'puro',
 'disowned',
 'option',
 'faults',
 'dignified',
 'realisation',
 'reconciliation',
 'mrs',
 'iyer',
 'heartbreaking',
 'histories',
 'frankness',
 'starters',
 'montage',
 'swearing',
 'halestorm',
 'korea',
 'concentrate',
 'pic',
 'elude',
 'characteristics',
 'blathered',
 'brassed',
 'declaration',
 'peck',
 'garnered',
 'fearless',
 'tempered',
 'humane',
 'tails',
 'slighted',
 'slater',
 'barrage',
 'underway',
 'operating',
 'tag',
 'dorff',
 'reid',
 'continually',
 'revel',
 'nra',
 'benton',
 'slate',
 'penal',
 'vengeful',
 'seed',
 'backbone',
 'dismal',
 'fortunate',
 'ds',
 'tmob',
 'autographed',
 'intercepted',
 'lectured',
 'reprints',
 'comicon',
 'attendees',
 'blackhawk',
 'insisted',
 'jumped',
 'apologized',
 'wishing',
 'se

## Naive Bayes

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where ratio of feature $f$ in positive documents is the number of times a positive document has a feature divided by the number of positive documents.

In [0]:
movie_reviews.y.classes

['negative', 'positive']

In [0]:
x = trn_term_doc
y = movie_reviews.train.y
val_y = movie_reviews.valid.y

In [0]:
positive = y.c2i['positive']
positive

1

In [0]:
negative = y.c2i['negative']
negative

0

In [0]:
np.squeeze(np.asarray(x[y.items==negative].sum(0)))

array([7154,    0,  417,    0, ...,    0,    3,    3,    3], dtype=int64)

In [0]:
np.squeeze(x[y.items==positive].sum(0))

matrix([[6471,    0,  383,    0, ...,    3,    0,    0,    0]], dtype=int64)

In [0]:
np.squeeze(np.asarray(x[y.items==positive].sum(0)))

array([6471,    0,  383,    0, ...,    3,    0,    0,    0], dtype=int64)

For each word in our vocabulary, we are summing up how many positive reviews it is in, and how many negative reviews.

In [0]:
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

In [0]:
p1[:10]

array([ 6471,     0,   383,     0,     0, 10267,   674,    57,     0,  5260], dtype=int64)

In [0]:
v = movie_reviews.vocab

In [0]:
len(p1),len(p0)

(6008, 6008)

In [0]:
v.itos[0]

'xxunk'

### Using our ratios for even more data exploration

We can use p0 and p1 to do some more data exploration!

**Exercise:** compare how often "loved" appears in positive reviews vs. negative reviews. How about "hate"?

In [0]:
# Exercise: How often does the word "loved" appear in neg vs. pos reviews?
p0[v.stoi['loved']],p1[v.stoi['loved']]

(12, 29)

"loved" is shown more often in postive review

In [0]:
# Exercise: How often does the word "hated" appear in neg vs. pos reviews?
p0[v.stoi['hated']],p1[v.stoi['hated']]

(6, 3)

"hated" is shown more often in negative review

#### Let's figure out why positive reviews with the word "hated"

I was curious to look at an example of a postive review with the word "hated" in it:

In [0]:
v.stoi['hated']

1977

We use **np.argwhere** to find which reviews have a nonzero value for that

In [0]:
a = np.argwhere((x[:,1977]>0))[:,0];a

array([ 15,  49, 304, 351, 393, 612, 695, 773], dtype=int32)

Then I get all the postive reviews

In [0]:
b = np.argwhere(y.items==positive)[:,0];b

array([  1,   3,  10,  11, ..., 787, 789, 790, 797])

Now we take a set intersection of those to fit things. So we get all reviews that have the word "hated". So "hated" appears in 3 reviews

In [0]:
set(a).intersection(set(b))

{393, 612, 695}

In [0]:
review = movie_reviews.train.x[695]
review

Text xxbos xxmaj xxunk , yeah this episode is extremely underrated . 
 
  xxmaj even though there is a xxup lot of bad writing and acting at parts . i think the good over wins the bad . 
 
  i love the xxunk parts and the big ' twist ' at the end . i absolutely love that scene when xxmaj michelle xxunk xxmaj tony . xxmaj it 's actually one of my favorite scenes of xxmaj season 1 . 
 
  xxmaj for some reason , people have always hated the xxmaj xxunk episodes , yet i have always liked them . xxmaj they 're not the best , in terms of writing . but the theme really does interest me , 
 
  i 'm gon na give it a xxup three star , but if the writing were a little more consistent i 'd give it xxup four .

---->`people have always hated the xxmaj xxunk episodes`

#### negative reviews with the word "loved"

Now, let's look at an example of a negative review that contains the word "loved"

In [0]:
v.stoi['loved']

535

In [0]:
a = np.argwhere((x[:,534]>0))[:,0];a

array([  0,  19,  24,  51,  61,  70,  81, 110, 123, 155, 175, 193, 221, 265, 274, 279, 284, 290, 295, 304, 360, 384,
       421, 465, 516, 520, 548, 569, 588, 604, 620, 631, 661, 672, 679, 702, 709, 759, 764, 792], dtype=int32)

In [0]:
b = np.argwhere(y.items==negative)[:,0];b

array([  0,   2,   4,   5, ..., 795, 796, 798, 799])

In [0]:
set(a).intersection(set(b))

{0,
 24,
 51,
 70,
 81,
 123,
 155,
 193,
 221,
 274,
 279,
 284,
 290,
 295,
 304,
 421,
 516,
 548,
 604,
 620,
 631,
 672,
 679,
 709,
 759,
 764,
 792}

In [0]:
review = movie_reviews.train.x[792]
review

Text xxbos xxmaj this is not really a zombie film , if we 're xxunk zombies as the dead walking around . xxmaj here the protagonist , xxmaj xxunk xxmaj louque ( played by an xxunk young xxmaj dean xxmaj xxunk ) , xxunk control of a method to create zombies , though in fact , his ' method ' is to mentally project his thoughts and control other living people 's minds turning them into xxunk slaves . xxmaj this is an interesting concept for a movie , and was done much more effectively by xxmaj xxunk xxmaj lang in his series of ' xxmaj dr. xxmaj mabuse ' films , including ' xxmaj dr. xxmaj mabuse the xxmaj xxunk ' ( xxunk ) and ' xxmaj the xxmaj testament of xxmaj dr. xxmaj mabuse ' ( 1933 ) . xxmaj here it is unfortunately xxunk to his quest to regain the love of his former fiancée , xxmaj claire xxmaj duvall ( played by the xxmaj anne xxmaj xxunk look alike with a bad xxunk , xxmaj dorothy xxmaj stone ) which is really the major theme . 
 
  xxmaj the movie has an intriguing beginning , 

### Applying Naive Bayes

In [0]:
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

We do averaging over the number of positive reviews (pr1), averaging over the number of negative reviews(pr0) . The reason why we add 1 to numerator and denominator is to help with numerical stability for these formulars

In [0]:
pr1 = (p1+1)/((y.items==positive).sum()+1)
pr0 = (p0+1)/((y.items==negative).sum()+1)

We take log of the ratio

In [0]:
r = np.log(pr1/pr0) ; r

array([-0.015487,  0.084839,  0.      ,  0.084839, ...,  1.471133, -1.301455, -1.301455, -1.301455])

#### Vocab most likely associated with positive/negative reviews

the np.argpartition is not perfectly sort things but it still can give you the number you request . Create a partition so we want to get the top 10, the 10 biggest things and the 10 smallest things.  

In [0]:
biggest = np.argpartition(r,-10)[-10:]
smallest = np.argpartition(r,10)[:10]

These are the words that most kind that a review is positive or that a review is negative 

**Most positive words:**

In [0]:
[v.itos[k] for k in biggest]

['sport',
 'davies',
 'gilliam',
 'fanfan',
 'biko',
 'felix',
 'noir',
 'jabba',
 'astaire',
 'jimmy']

Then I look up "biko" which is a positive word

In [0]:
np.argmax(trn_term_doc[:,v.stoi['biko']])

515

In [0]:
movie_reviews.train.x[515]

Text xxbos " xxmaj the xxmaj true xxmaj story xxmaj of xxmaj the xxmaj friendship xxmaj that xxmaj shook xxmaj south xxmaj africa xxmaj and xxmaj xxunk xxmaj the xxmaj world . " 
 
  xxmaj richard xxmaj attenborough , who directed " a xxmaj bridge xxmaj too xxmaj far " and " xxmaj gandhi " , wanted to bring the story of xxmaj steve xxmaj biko to life , and the journey and trouble that xxunk xxmaj donald xxmaj woods went through in order to get his story told . xxmaj the films uses xxmaj wood 's two books for it 's information and basis - " xxmaj biko " and " xxmaj asking for xxmaj trouble " . 
 
  xxmaj the film takes place in the late 1970 's , in xxmaj south xxmaj africa . xxmaj south xxmaj africa is in the grip of the terrible apartheid , which keeps the blacks separated from the whites and xxunk the whites as the superior race . xxmaj the blacks are forced to live in xxunk on the xxunk of the cities and xxunk , and they come under frequent xxunk by the police and the army . xxmaj w

**Most negative words:**

In [0]:
[v.itos[k] for k in smallest]

['worst',
 'crap',
 'crater',
 'porn',
 'disappointment',
 'dog',
 'vargas',
 'naschy',
 'fuqua',
 'soderbergh']

In [0]:
np.argmax(trn_term_doc[:,v.stoi['soderbergh']])

434

In [0]:
movie_reviews.train.x[434]

Text xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " xxmaj at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . 
 
  xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . 
 
  xxmaj none of this excuses him this present , almost diabolical failure . xxmaj as xxmaj david xxmaj xxunk xxunk , " two parts of xxmaj che do n't ( even ) make a whole " . 
 
  xxmaj epic xxunk in name only , xxmaj che(2008 ) barely qualifies as a feature film 

In [0]:
trn_term_doc[:,v.stoi["soderbergh"]]

<800x1 sparse matrix of type '<class 'numpy.int64'>'
	with 1 stored elements in Compressed Sparse Row format>

as find out, "Soderbergh" is only showing up in one of the reviews so it is probably not more informative to do on our full dataset

### Continuing with Naive Bayes¶

In [0]:
(y.items == positive).mean(),(y.items==negative).mean()

(0.47875, 0.52125)

b is the percentage of the reviews are positive versus negative

In [0]:
b = np.log((y.items ==positive).mean()/(y.items==negative).mean())

Now we get our predictions by taking the validation term document x r +b . The reason it is greater than 0 because it's predicting positive less than 0 predicting negative . 

In [0]:
preds = (val_term_doc @ r +b)>0

In [0]:
(preds == val_y.items).mean()

0.645

we got 64% accuracy which is reasonably good given that this was on a sample of our data using a pretty simple technique

### Switching to full data set

Now that we have our approach working on a smaller sample of the data, we can try using it on the full dataset.

#### Download data and process

In [0]:
path = untar_data(URLs.IMDB)
path.ls()

[PosixPath('/root/.fastai/data/imdb/test'),
 PosixPath('/root/.fastai/data/imdb/tmp_lm'),
 PosixPath('/root/.fastai/data/imdb/README'),
 PosixPath('/root/.fastai/data/imdb/unsup'),
 PosixPath('/root/.fastai/data/imdb/imdb.vocab'),
 PosixPath('/root/.fastai/data/imdb/tmp_clas'),
 PosixPath('/root/.fastai/data/imdb/train')]

In [0]:
reviews_full = (TextList.from_folder(path)
                .split_by_folder(valid='test')
                .label_from_folder(classes=['neg','pos']))

In [0]:
len(reviews_full.train),len(reviews_full.valid)

(25000, 25000)

In [0]:
v = reviews_full.vocab

In [0]:
v.itos[100:110]

['bad',
 'people',
 'will',
 'other',
 'also',
 'into',
 'first',
 'because',
 'great',
 'how']

In [0]:
%%time 
val_term_doc = get_term_doc_matrix(reviews_full.valid.x,len(reviews_full.vocab.itos))

CPU times: user 5.8 s, sys: 250 ms, total: 6.05 s
Wall time: 5.78 s


In [0]:
%%time
trn_term_doc = get_term_doc_matrix(reviews_full.train.x,len(reviews_full.vocab.itos))

CPU times: user 5.75 s, sys: 222 ms, total: 5.97 s
Wall time: 5.75 s


**Save data**

That was slow. Let's save our matrices for faster loading next time:

In [0]:
scipy.sparse.save_npz('trn_term_doc.npz',trn_term_doc)

In [0]:
scipy.sparse.save_npz('val_term_doc.npz',val_term_doc)

When storing data like this, always make sure it's included in your .gitignore file

In the future, we'll just be able to load our data:

In [0]:
trn_term_doc = scipy.sparse.load_npz('trn_term_doc.npz')
val_term_doc = scipy.sparse.load_npz('val_term_doc.npz')

#### Naive Bayes on full dataset

In [0]:
x = trn_term_doc
y = reviews_full.train.y

val_y = reviews_full.valid.y.items

In [0]:
x


<25000x38456 sparse matrix of type '<class 'numpy.int64'>'
	with 3716267 stored elements in Compressed Sparse Row format>

In [0]:
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))

In [0]:
p1[:20]

array([ 28449,      0,  12500,      0,      0, 342619,  20464,   1338,      7, 173122, 138001, 143763,  89570,  83404,
        76828,  66715,  58510,  47896,  50177,  40451], dtype=int64)

#### Data exploration: negative to positive ratios

I was curious about the ratio of times a given word appears in negative reviews to times it occurs in positive reviews. Bigger ratios (> 1) mean the word is indicative of a negative review, and smaller ratios (< 1) mean it is indicative of a positive review.

In [0]:
def neg_pos_given_word(word):
    print(p0[v.stoi[word]]/p1[v.stoi[word]])

In [0]:
neg_pos_given_word('hated')

2.051546391752577


In [0]:
neg_pos_given_word('liked')

0.6424702058504875


In [0]:
neg_pos_given_word('worst')

9.837301587301587


In [0]:
pr1 = (p1+1)/((y.items==positive).sum()+1)
pr0 = (p0+1)/((y.items==negative).sum()+1)

In [0]:
r = np.log(pr1/pr0)

In [0]:
r[v.stoi['hated']]

-0.7133498878774648

In [0]:
r[v.stoi['loved']]

1.1563661500586044

In [0]:
r[v.stoi['worst']]

-2.2826243504315076

In [0]:
r[v.stoi['best']]

0.7225576052173609

#### Back to Naive Bayes

In [0]:
negative = y.c2i['neg']
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

Since we have equal numbers of positive and negative reviews in this data set, b is 0.

In [0]:
pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)

In [0]:
b = np.log((y.items==positive).mean()/(y.items==negative).mean());b

0.0

In [0]:
preds = (val_term_doc @ r + b) > 0

Our accuracy is 80% for the full dataset

In [0]:
(preds == val_y).mean()

0.8084

#### Binarized Naive Bayes

Another variation of Naive Bayes is Binarized Naive Bayes which maybe it only matters whether a word is in a review or not ( not the frequency of the word ). Previous we are looking at the counts of how often a word appear in a review and maybe that's not important. Instead of keeping track of "loved" shown ten times in a review, now it's just going to be yes or no is loved  in the review.

To do that we just convert our term document matrix using **.sign()**

In [0]:
x = trn_term_doc.sign()
y = reviews_full.train.y


Using **c2i** to convert a class into integer. In this case, the classes are "neg" and "pos" and this let us know if that's 1 or 0

In [0]:
negative = y.c2i['neg']
positive = y.c2i['pos']

In [0]:
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

In [0]:
pr1 = (p1+1)/((y.items==positive).sum()+1)
pr0 = (p0+1)/((y.items==negative).sum()+1)

In [0]:
r = np.log(pr1/pr0)
b = np.log((y.items==positive).mean()/(y.items==negative).mean())

In [0]:
preds = (val_term_doc.sign() @ r +b) > 0

In [0]:
(preds==val_y).mean()

0.82924

We got 82% accuracy.

Futhermore, in x , we only see 1 and 0 telling you whether or not that word in a review or not 

In [0]:
x.todense()

matrix([[1, 0, 1, 0, ..., 0, 0, 0, 0],
        [1, 0, 1, 0, ..., 0, 0, 0, 0],
        [0, 0, 1, 0, ..., 0, 0, 0, 0],
        [1, 0, 1, 0, ..., 0, 0, 0, 0],
        ...,
        [1, 0, 1, 0, ..., 0, 0, 0, 0],
        [1, 0, 1, 0, ..., 0, 0, 0, 0],
        [1, 0, 1, 0, ..., 0, 0, 0, 0],
        [1, 0, 1, 0, ..., 0, 0, 0, 0]])

**Question** : When to use binarized vs regular naive bayes?

Let's try both 

## Logistic regression

Here is how we can fit logistic regression where the features are the **unigrams**.

**ngrams** is refered to sequence of n words. Here n is 1 , it's just a single word

In [0]:
from sklearn.linear_model import LogisticRegression

to use Logistic regression, we convert y values to be integer - **y.items.astype(int)**

In [0]:
x = trn_term_doc
y = reviews_full.train.y

In [0]:
m = LogisticRegression(C=0.1,dual=True)
m.fit(x,y.items.astype(int))
preds = m.predict(val_term_doc)
(preds==val_y).mean()




0.88296

Now we have better accuracy with 88.2%

**and with binarized version:**

In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(),y.items.astype(int))
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()



0.88536

the accuracy improve to 88.5% by switching to binarized version

## Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.

### ngrams

An n-gram is a contiguous sequence of n items (where the items can be characters, syllables, or words). A 1-gram is a unigram, a 2-gram is a bigram, and a 3-gram is a trigram.

Here, we are referring to sequences of words. So examples of bigrams include "the dog", "said that", and "can't you".

In [0]:
path = untar_data(URLs.IMDB_SAMPLE)

In [0]:
movie_reviews = (TextList.from_csv(path,'texts.csv',cols='text')
                    .split_from_df(col=2)
                    .label_from_df(cols=0))

In [0]:
v = movie_reviews.vocab.itos

In [0]:
vocab_len = len(v)

###Our data

Create train matrix

In [0]:
min_n = 1
max_n = 3

j_indices=[]
indptr=[]
values=[]
indptr.append(0)
num_tokens = vocab_len

itongram = dict()
ngramtoi = dict()

We will iterate through the sequences of words to create our n-grams. That we have to go each review to get what are all the sequences of two words that show up. And we'll do this with nested loop

In [0]:
for i, doc in enumerate(movie_reviews.train.x):
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()

    m = 0
    for n in range(min_n, max_n + 1):
        for k in range(vocab_len - n + 1):
            ngram = doc.data[k: k + n]
            if str(ngram) not in ngramtoi:
                if len(ngram)==1:
                    num = ngram[0]
                    ngramtoi[str(ngram)] = num
                    itongram[num] = ngram
                else:
                    ngramtoi[str(ngram)] = num_tokens
                    itongram[num_tokens] = ngram
                    num_tokens += 1
            this_doc_ngrams.append(ngramtoi[str(ngram)])
            m += 1

    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))

Using dictionaries to convert between indices and strings (in this case, our n-grams) is a common & useful approach! Here, we have **itongram** (index to n-gram) and **ngramtoi** (n-gram to index).

In [0]:
train_ngram_doc_matrix = scipy.sparse.csr_matrix((values,j_indices,indptr),
                                                shape=(len(indptr)-1,len(ngramtoi)),
                                                dtype=int)

In [0]:
train_ngram_doc_matrix

<800x260374 sparse matrix of type '<class 'numpy.int64'>'
	with 678885 stored elements in Compressed Sparse Row format>

before we were saying term document matrix when our term are just unigram and now we got ngram document matrix for the training set . Here we note that this got 260374 tokens. There are lot of tokens because this can grow conceivably in like **n_factorial** , you know the fact that you can take all these different combinations.

### Looking at our data 

In [0]:
len(ngramtoi),len(itongram)

(260374, 260374)

In [0]:
itongram[20005]

array([ 15,   9, 710])

In [0]:
ngramtoi[str(np.array([15,9,710]))]

20005

In [0]:
itongram[100000]

array([1140,   33])

We randomly curious about 100000th element. It says the ngrams looks like this array has two things. So this must be bigram with two words in 1140 and 33. And the words are corresponding to "hate" and "you". These two words are next to each others. So this is one of our bigram

In [0]:
v[1140],v[33]

('hate', 'you')

In [0]:
itongram[100010]

array([5430,   10])

In [0]:
v[5430],v[10]    

('photographer', '.')

Here we find a trigram.

In [0]:
itongram[6116]

array([ 85, 191,  64])

In [0]:
v[85],v[191],v[64]

('even', 'look', 'her')

All of these give us a sense of kind of what is this matrix that we've found. This matrix has 800 reviews and it has 260374 ngrams which are combination of unigram, some bigram and some trigram. 

They are recorded as these arrays of one, two or three things but they correspond to set of tupples of words with one, two, or three words in them.

### Create valid matrix

In [0]:
j_indices = []
indptr = []
values = []
indptr.append(0)

for i, doc in enumerate(movie_reviews.valid.x):
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()

    m = 0
    for n in range(min_n, max_n + 1):
        for k in range(vocab_len - n + 1):
            ngram = doc.data[k: k + n]
            if str(ngram) in ngramtoi:
                this_doc_ngrams.append(ngramtoi[str(ngram)])
            m += 1

    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))

In [0]:
valid_ngram_doc_matrix = scipy.sparse.csr_matrix((values,j_indices,indptr),
                                                shape=(len(indptr)-1,len(ngramtoi)),
                                                dtype=int)

In [0]:
valid_ngram_doc_matrix

<200x260374 sparse matrix of type '<class 'numpy.int64'>'
	with 121600 stored elements in Compressed Sparse Row format>

In [0]:
train_ngram_doc_matrix

<800x260374 sparse matrix of type '<class 'numpy.int64'>'
	with 678885 stored elements in Compressed Sparse Row format>

###Save data

In [0]:
scipy.sparse.save_npz('train_ngram_matrix.npz',train_ngram_doc_matrix)
scipy.sparse.save_npz('valid_ngram_matrix.npz',valid_ngram_doc_matrix)

In [0]:
with open('itongram.pickle','wb') as handle:
       pickle.dump(itongram,handle,protocol=pickle.HIGHEST_PROTOCOL)
        
with open('ngramtoi.pickle','wb') as handle:
    pickle.dump(ngramtoi,handle,protocol=pickle.HIGHEST_PROTOCOL)

###Load data

In [0]:
train_ngram_doc_matrix = scipy.sparse.load_npz("train_ngram_matrix.npz")
valid_ngram_doc_matrix = scipy.sparse.load_npz("valid_ngram_matrix.npz")

In [0]:
with open('itongram.pickle', 'rb') as handle:
    b = pickle.load(handle)
    
with open('ngramtoi.pickle', 'rb') as handle:
    b = pickle.load(handle)

###Naive Bayes

In [0]:
x = train_ngram_doc_matrix
y = movie_reviews.train.y

In [0]:
positive = y.c2i['positive']
negative = y.c2i['negative']

In [0]:
x

<800x260374 sparse matrix of type '<class 'numpy.int64'>'
	with 566480 stored elements in Compressed Sparse Row format>

In [0]:
k = 260374

In [0]:
pos = (y.items==positive)[:k]
neg = (y.items==negative)[:k]


In [0]:
xx = x[:k]

In [0]:
valid_labels = [o==positive for o in movie_reviews.valid.y.items]

In [0]:
p0 = np.squeeze(np.array(xx[neg].sum(0)))
p1 = np.squeeze(np.array(xx[pos].sum(0)))

In [0]:
pr1 = (p1+1)/((y.items==positive).sum()+1)
pr0 = (p0+1)/((y.items==negative).sum()+1)

In [0]:
r = np.log(pr1/pr0)

In [0]:
b = np.log((y.items==positive).mean()/(y.items==negative).mean())

In [0]:
b

-0.08505123261815539

In [0]:
(y.items==positive).mean(),(y.items==negative).mean()

(0.47875, 0.52125)

In [0]:
pre_preds = valid_ngram_doc_matrix @ r.T + b

In [0]:
pre_preds

array([ 111.020137,   39.70486 ,    1.3799  ,   14.603319, ...,   81.603164,   -5.777097, -152.193113,  120.194783])

In [0]:
preds = pre_preds.T >0

In [0]:
preds[:10]

array([ True,  True,  True,  True, False,  True,  True, False,  True, False])

In [0]:
(preds==valid_labels).mean()

0.76

### Binarized Naive Bayes

In [0]:
trn_x_ngram_sgn = train_ngram_doc_matrix.sign()
val_x_ngram_sgn = valid_ngram_doc_matrix.sign()

In [0]:
xx = trn_x_ngram_sgn[:k]

In [0]:
p0 = np.squeeze(np.array(xx[neg].sum(0)))
p1 = np.squeeze(np.array(xx[pos].sum(0)))

In [0]:
pr1 = (p1+1)/((y.items==positive).sum()+1)
pr0 = (p0+1)/((y.items==negative).sum()+1)

In [0]:
r = np.log(pr1/pr0)
b = np.log((y.items==positive).mean()/(y.items==negative).mean())

pre_preds = val_x_ngram_sgn @ r.T +b
preds = pre_preds.T >0

In [0]:
(preds==valid_labels).mean()

0.735

###Logistic Regression

Here we fit regularized logistic regression where the features are the trigrams.

**use CountVectorizer to compare**

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

In [0]:
veczr = CountVectorizer(ngram_range=(1,3),preprocessor=noop,tokenizer=noop,max_features=800000)

In [0]:
docs = movie_reviews.train.x

In [0]:
train_words = [[docs.vocab.itos[o] for o in doc.data] for doc in movie_reviews.train.x]

In [0]:
valid_words = [[docs.vocab.itos[o] for o in doc.data] for doc in movie_reviews.valid.x]

In [0]:
%%time
train_ngram_doc = veczr.fit_transform(train_words)

CPU times: user 1.56 s, sys: 7.82 ms, total: 1.57 s
Wall time: 1.57 s


In [0]:
train_ngram_doc

<800x260373 sparse matrix of type '<class 'numpy.int64'>'
	with 565680 stored elements in Compressed Sparse Row format>

In [0]:
veczr.vocabulary_

{'xxbos': 235215,
 'xxmaj': 235589,
 'un': 217511,
 '-': 14660,
 'xxunk': 247952,
 'believable': 50420,
 '!': 593,
 'meg': 134438,
 'ryan': 171955,
 'does': 72621,
 "n't": 141191,
 'even': 78281,
 'look': 129009,
 'her': 101674,
 'usual': 219401,
 'lovable': 129862,
 'self': 175864,
 'in': 110056,
 'this': 206619,
 ',': 8800,
 'which': 228198,
 'normally': 145182,
 'makes': 131546,
 'me': 133646,
 'forgive': 88662,
 'shallow': 177158,
 'acting': 27682,
 '.': 16836,
 'hard': 97855,
 'to': 210364,
 'believe': 50457,
 'she': 177306,
 'was': 222309,
 'the': 193798,
 'producer': 164483,
 'on': 152319,
 'dog': 72926,
 'plus': 162127,
 'kevin': 122604,
 'kline': 123461,
 ':': 20359,
 'what': 226855,
 'kind': 123231,
 'of': 147512,
 'suicide': 188353,
 'trip': 215808,
 'has': 98123,
 'his': 103526,
 'career': 58691,
 'been': 49234,
 '?': 20987,
 '...': 18342,
 'finally': 85368,
 'directed': 71212,
 'by': 56556,
 'guy': 96410,
 'who': 229023,
 'did': 70468,
 'big': 51595,
 'must': 140407,
 'be'

In [0]:
%%time
val_ngram_doc = veczr.transform(valid_words)

CPU times: user 176 ms, sys: 2.97 ms, total: 179 ms
Wall time: 178 ms


In [0]:
val_ngram_doc

<200x260373 sparse matrix of type '<class 'numpy.int64'>'
	with 93552 stored elements in Compressed Sparse Row format>

In [0]:
vocab = veczr.get_feature_names()

In [0]:
vocab[200000:200005]

['the same cult',
 'the same dance',
 'the same date',
 'the same day',
 'the same diner']

#### Binarized Naive Bayes, using ngrams from CountVectorizer

In [0]:
y = movie_reviews.train.y

C is the inverse of regularization strength; smaller values specify stronger regularization. Regularized:

In [0]:
m = LogisticRegression(C = 0.1, dual=True);
m.fit(train_ngram_doc.sign(),y.items);

preds = m.predict(val_ngram_doc.sign());
(preds.T == valid_labels).mean()



0.83

Not binarized

In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(train_ngram_doc, y.items);

preds = m.predict(val_ngram_doc)
(preds.T==valid_labels).mean()



0.78

**Using my ngrams, binarized:**

In [0]:
m2 = LogisticRegression(C=0.1, dual=True)
m2.fit(trn_x_ngram_sgn, y.items)



LogisticRegression(C=0.1, class_weight=None, dual=True, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
preds = m2.predict(val_x_ngram_sgn)
(preds.T==valid_labels).mean()

0.83

Worse performance when not binarized. I manually tried several different C values, and this was the best I found:

In [0]:
m2 = LogisticRegression(C=0.0001, dual=True, max_iter=50000)
m2.fit(train_ngram_doc_matrix, y.items)



LogisticRegression(C=0.0001, class_weight=None, dual=True, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=50000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
preds = m2.predict(valid_ngram_doc_matrix)
(preds.T==valid_labels).mean()

0.68

#### Log-count ratio

Here is the log-count ratio r.

In [0]:
x = train_ngram_doc_matrix.sign()
val_x = valid_ngram_doc_matrix.sign()
y = movie_reviews.train.y

In [0]:
positive=y.c2i['positive']
negative = y.c2i['negative']


In [0]:
x.shape

(800, 260374)

In [0]:
k = 260374

In [0]:
pos = (y.items==positive)[:k]
neg = (y.items==negative)[:k]


In [0]:
xx = x[:k]


In [0]:
xx.shape

(800, 260374)

In [0]:
valid_labels = [ o ==positive for o in movie_reviews.valid.y.items]

In [0]:
p0 = np.squeeze(np.array(xx[neg].sum(0)))
p1 = np.squeeze(np.array(xx[pos].sum(0)))

In [0]:
pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)

In [0]:
r = np.log(pr1/pr0)

In [0]:
r.shape

(260374,)

In [0]:
b = np.log((y.items==positive).mean() / (y.items==negative).mean())

In [0]:
np.exp(r)

array([0.994341, 1.088542, 1.      , 1.088542, ..., 0.544271, 0.544271, 0.544271, 0.544271])

Here we fit regularized logistic regression where the features are the trigrams' log-count ratios.

In [0]:
x_nb = xx.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y.items);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==valid_labels).mean()



ValueError: ignored

##References

Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)

# Additionaly [Video 6](https://youtu.be/z8-Tbrg1-rE?t=56)



## Trouble installing fastai library?

Here is a [guide to troubleshooting](https://docs.fast.ai/troubleshoot.html) problems with fastai installation. By far, the most common problem is having fastai installed for a different environment/different Python installation than the one your Jupyter notebook is using (you can have Python installed in multiple places on your computer and not even realize it!). Or, you might have different versions of fastai installed in your different environments/different Python installations (and the one you are running in Jupyter notebook could be out of date, even if you installed version 1.0 somewhere else). For both of these problems, please [see this entry](https://docs.fast.ai/troubleshoot.html#modulenotfounderror-no-module-named-fastaivision).

## More detail about randomized SVD

I didn't cover how randomized SVD worked, because we aren't going to learn about it in detail in this course. The main things I want you to know about randomized SVD are:

- it is fast

- it gives us a truncated SVD (whereas with traditional SVD, we are usually throwing away small singular values and their corresponding columns)

If you were curious to know more, two keys are:

- It is often useful to be able to reduce dimensionality of data in a way that preserves distances. The Johnson–Lindenstrauss lemma is a classic result of this type. [Johnson-Lindenstrauss Lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma): a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved (proof uses random projections).

- We haven't found a better general SVD method, we'll just use the method we have on a smaller matrix.

Below is an over-simplified version of randomized_svd (you wouldn't want to use this in practice, but it covers the core ideas). The main part to notice is that we multiply our original matrix by a smaller random matrix (M @ rand_matrix) to produce smaller_matrix, and then use our same np.linalg.svd as before:



```
def randomized_svd(M, k=10):
    m, n = M.shape
    transpose = False
    if m < n:
        transpose = True
        M = M.T

    rand_matrix = np.random.normal(size=(M.shape[1], k))  # short side by k
    Q, _ = np.linalg.qr(M @ rand_matrix, mode='reduced')  # long side by k
    smaller_matrix = Q.T @ M                              # k by short side
    U_hat, s, V = np.linalg.svd(smaller_matrix, full_matrices=False)
    U = Q @ U_hat

    if transpose:
        return V.T, s.T, U.T
    else:
        return U, s, V
```

This code snippet is from this [randomized-SVD jupyter notebook](https://github.com/fastai/randomized-SVD/blob/master/Randomized%20SVD.ipynb) which was the demo I used for my PyBay talk on [Using randomness to make code](https://www.youtube.com/watch?v=7i6kBz1kZ-A&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=7) much faster.

## Bayes Theorem

Ex: Physicist Leonard Mlodinow tested positive for HIV in 1989.
His doctor said there was a 99.9% chance he had HIV.

A = positive test results

B = having HIV

True positives:  $P(A|B) = 99.9\%$

Prevalence:  $P(B)= 0.01\%$

False positives:  $P(A|B^C) = 0.1\%$

Was his doctor correct?

This example is from the book:

![alt text](https://github.com/fastai/course-nlp/raw/85e505295efeed88ce61dc0ff5e424bde9741a15/images/drunkards-walk.jpg)

Bayes Theorem (for conditional probabilities): 

P(A | B) P(B) = P(B | A) P(A)

![link text](https://github.com/hduongck/AI-ML-Learning/blob/master/Pic/Capture.PNG?raw=True)



Derivation of Naive Bayes
We want to calculate the probability that the review "I loved it" is positive. Using Bayes Theorem, we can rewrite this:

$$ P(\text{pos} | \text{"I"}, \text{"loved"}, \text{"it"}) = \frac{P(\text{"I"}, \text{"loved"}, \text{"it"}, | \text{pos}) \cdot P(\text{"loved"} | \text{pos}) \cdot P(\text{"it"} | \text{pos}) \cdot P(\text{pos})}{P(\text{"I"}, \text{"loved"}, \text{"it})}$$
The "naive" part of Naive Bayes is that we will assume that the probabilities of the different words are all independent.

$$ P(\text{pos} | \text{"I"}, \text{"loved"}, \text{"it"}) = \frac{P(\text{"I"} | \text{pos}) \cdot P(\text{"loved"} | \text{pos}) \cdot P(\text{"it"} | \text{pos}) \cdot P(\text{pos})}{P(\text{"I"}, \text{"loved"}, \text{"it})}$$
We do the same calculation to see how likely it is the review is negative, and then choose whichever is larger.

$$ P(\text{neg} | \text{"I"}, \text{"loved"}, \text{"it"}) = \frac{P(\text{"I"} | \text{neg}) \cdot P(\text{"loved"} | \text{neg}) \cdot P(\text{"it"} | \text{neg}) \cdot P(\text{neg})}{P(\text{"I"}, \text{"loved"}, \text{"it})}$$
We will add one to avoid dividing by zero (or something close to it). Similarly, we take logarithms to avoid multiplying by a lot of tiny values. For the reasons we want to avoid this, please see the next section on numerical stability:

More reading: [Using log-probabilities for Naive Bayes](http://www.cs.rhodes.edu/~kirlinp/courses/ai/f18/projects/proj3/naive-bayes-log-probs.pdf)

## Numerical Stability
Exercise
Take a moment to look at the function $f$ below. Before you try running it, write on paper what the output would be of $x_1 = f(\frac{1}{10})$. Now, (still on paper) plug that back into $f$ and calculate $x_2 = f(x_1)$. Keep going for 10 iterations.

This example is taken from page 107 of Numerical Methods, by Greenbaum and Chartier.

In [0]:
def f(x):
    if x <= 1/2:
        return 2 * x
    if x > 1/2:
        return 2 * x - 1
    

Only after you've written down what you think the answer should be, run the code below:

In [0]:
x = 1/10
for i in range(80):
    print(x)
    x = f(x)

0.1
0.2
0.4
0.8
0.6000000000000001
0.20000000000000018
0.40000000000000036
0.8000000000000007
0.6000000000000014
0.20000000000000284
0.4000000000000057
0.8000000000000114
0.6000000000000227
0.20000000000004547
0.40000000000009095
0.8000000000001819
0.6000000000003638
0.2000000000007276
0.4000000000014552
0.8000000000029104
0.6000000000058208
0.20000000001164153
0.40000000002328306
0.8000000000465661
0.6000000000931323
0.20000000018626451
0.40000000037252903
0.8000000007450581
0.6000000014901161
0.20000000298023224
0.4000000059604645
0.800000011920929
0.6000000238418579
0.20000004768371582
0.40000009536743164
0.8000001907348633
0.6000003814697266
0.20000076293945312
0.40000152587890625
0.8000030517578125
0.600006103515625
0.20001220703125
0.4000244140625
0.800048828125
0.60009765625
0.2001953125
0.400390625
0.80078125
0.6015625
0.203125
0.40625
0.8125
0.625
0.25
0.5
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0


**What went wrong?**

It's like a pattern here 0.1 0.2 0.4 0.8 0.6 repeatedly. But the problem here , it starts getting a litte bit of small decimal (0.60000000000000**01**) and it grows bigger and bigger. Eventually, it causes problem to us when we run long enough. It will converge to 1. This is one of problems about using computers 

## Problem: math is continuous & infinite, but computers are discrete & finite

Two Limitations of computer representations of numbers:

1. they can't be arbitrarily large or small
2. there must be gaps between them

The reason we need to care about accuracy, is because computers can't store infinitely accurate numbers. It's possible to create calculations that give very wrong answers (particularly when repeating an operation many times, since each operation could multiply the error).

How computers store numbers:

![alt text](https://github.com/fastai/course-nlp/raw/85e505295efeed88ce61dc0ff5e424bde9741a15/images/fpa.png)

The mantissa can also be referred to as the significand.

IEEE Double precision arithmetic:

- Numbers can be as large as $1.79 \times 10^{308}$ and as small as $2.23 \times 10^{-308}$.
- The interval $[1,2]$ is represented by discrete subset: $$1, \: 1+2^{-52}, \: 1+2 \times 2^{-52},\: 1+3 \times 2^{-52},\: \ldots, 2$$

- The interval $[2,4]$ is represented: $$2, \: 2+2^{-51}, \: 2+2 \times 2^{-51},\: 2+3 \times 2^{-51},\: \ldots, 4$$

Floats and doubles are not equidistant:

![alt text](https://github.com/fastai/course-nlp/raw/85e505295efeed88ce61dc0ff5e424bde9741a15/images/fltscale-wh.png)

Source: [What you never wanted to know about floating point but will be forced to find out](http://www.volkerschatz.com/science/float.html)

**Machine Epsilon**

Half the distance between 1 and the next larger number. This can vary by computer. IEEE standards for double precision specify 

$$ \varepsilon_{machine} = 2^{-53} \approx 1.11 \times 10^{-16}$$

**Two important properties of Floating Point Arithmetic:**

The difference between a real number $x$ and its closest floating point approximation $fl(x)$ is always smaller than $\varepsilon_{machine}$ in relative terms. For some $\varepsilon$, where $\lvert \varepsilon \rvert \leq \varepsilon_{machine}$, $$fl(x)=x \cdot (1 + \varepsilon)$$

Where is any operation ($+, -, \times, \div$), and $\circledast$ is its floating point analogue, $$ x \circledast y = (x &lt;/em&gt; y)(1 + \varepsilon)$$ for some $\varepsilon$, where $\lvert \varepsilon \rvert \leq \varepsilon&lt;em&gt;{machine}$ That is, every operation of floating point arithmetic is exact up to a relative error of size at most $\varepsilon&lt;/em&gt;{machine}$

##Speed of different types of memory

This course is 90% NLP and 10% things I want to make sure you see before the end of your MSDS.

Here are some numbers everyone should know (from the legendary Jeff Dean):

- L1 cache reference 0.5 ns
- L2 cache reference 7 ns
- Main memory reference/RAM 100 ns
- Send 2K bytes over 1 Gbps network 20,000 ns
- Read 1 MB sequentially from memory 250,000 ns
- Round trip within same datacenter 500,000 ns
- Disk seek 10,000,000 ns
- Read 1 MB sequentially from network 10,000,000 ns
- Read 1 MB sequentially from disk 30,000,000 ns
- Send packet CA->Netherlands->CA 150,000,000 ns
- And here is an updated, interactive version, which includes a timeline of how these numbers have changed.

Key take-away: Each successive memory type is (at least) an order of magnitude worse than the one before it. Disk seeks are very slow.

##Revisiting Naive Bayes in an Excel Spreadsheet

Let's calculate naive bayes in a spreadsheet to get a more visual picture of what is going on. Here's how I processed the data for this:

### Loading our data

In [0]:
from fastai import *
from fastai.text import *


In [0]:
path = untar_data(URLs.IMDB_SAMPLE)


In [0]:
movie_reviews = (TextList.from_csv(path,'texts.csv',cols='text')
                        .split_from_df(col=2)
                        .label_from_df(cols=0))

In [0]:
def get_term_doc_matrix(label_list, vocab_len):
    j_indices = []
    indptr = []
    values = []
    indptr.append(0)

    for i, doc in enumerate(label_list):
        feature_counter = Counter(doc.data)
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))
        
#     return (values, j_indices, indptr)

    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)

In [0]:
trn_term_doc = get_term_doc_matrix(movie_reviews.train.x,len(movie_reviews.vocab.itos))

### Getting data for our spreadsheet

To keep our spreadsheet manageable, we will just get the 40 shortest reviews:

In [0]:
inds = np.argpartition(np.count_nonzero(trn_term_doc.todense(),1),40,axis=0)[:40]

In [8]:
inds

matrix([[597],
        [170],
        [320],
        [119],
        [730],
        [386],
        [428],
        [245],
        [467],
        [102],
        [477],
        [210],
        [ 53],
        [212],
        [ 42],
        [336],
        [395],
        [275],
        [533],
        [276],
        [499],
        [630],
        [264],
        [697],
        [461],
        [441],
        [122],
        [704],
        [790],
        [712],
        [756],
        [254],
        [369],
        [483],
        [617],
        [607],
        [699],
        [646],
        [ 46],
        [328]])

In [0]:
inds = np.squeeze(np.asarray(inds))

In [10]:
inds

array([597, 170, 320, 119, 730, 386, 428, 245, 467, 102, 477, 210,  53, 212,  42, 336, 395, 275, 533, 276, 499, 630,
       264, 697, 461, 441, 122, 704, 790, 712, 756, 254, 369, 483, 617, 607, 699, 646,  46, 328])


Let's get the text from these 40 shortest reviews:

In [0]:
list_text = [movie_reviews.train.x[i] for i in inds]

In [12]:
list_text

[Text xxbos xxmaj this movie is so bad , i knew how it ends right after this little girl killed the first person . xxmaj very bad acting very bad plot very bad movie 
  
   do yourself a xxunk and xxup don't watch it 1 / 10,
 Text xxbos i found this film to funny from the start . xxmaj john xxmaj waters use of characters reminded of some of the down to earth characters from xxmaj xxunk films . xxmaj christina xxmaj xxunk has once again xxunk her abilities in this film . xxmaj if you are looking for a fun movie without xxunk , i recommend this film .,
 Text xxbos xxmaj if you ever see a stand up comedy movie this is the one . xxmaj you will laugh xxunk if you have any sense of humor at all . xxmaj this is a once in a lifetime performance from a once in a lifetime performer . xxmaj this is a stand up standard .,
 Text xxbos xxmaj another movie to suffer without an adventure to run , no xxunk to solve . xxmaj just an xxunk man , acting like an animal . xxmaj no a good reason to take this 

In [0]:
vocab_used = defaultdict(int)

for i in inds:
    for val in movie_reviews.train.x[i].data:
        vocab_used[val] += 1


Let's choose the words that are used at least 6 times (so not too rare), but less than 30 (so not too common). You could try experimenting with different cut-off points on your own:

In [0]:
interesting_inds = [key for key,val in vocab_used.items() if val<30 and val>6]

In [25]:
len(interesting_inds)

44

I copied the vocab and text of the movie reviews directly from here to paste into the spreadsheet:

In [26]:
[movie_reviews.vocab.itos[i] for i in interesting_inds]

['so',
 'bad',
 'very',
 'acting',
 '\n \n ',
 'do',
 'xxup',
 'film',
 'from',
 'some',
 'if',
 'are',
 'for',
 'up',
 'one',
 'have',
 'all',
 'an',
 'no',
 'just',
 'like',
 'good',
 'great',
 'but',
 '...',
 'about',
 'movies',
 'seen',
 'with',
 '!',
 'me',
 'as',
 "'s",
 'was',
 'that',
 'out',
 '"',
 'on',
 "n't",
 'story',
 '-',
 '(',
 ')',
 'not']

In [0]:
x = trn_term_doc[inds,:]
y = movie_reviews.train.y[inds]

### Export to CSV


Let's export the term-document matrix and the labels to CSVs.

In [0]:
from IPython.display import FileLink, FileLinks

In [30]:
np.savetxt("x.csv", x.todense()[:,interesting_inds], delimiter=",", fmt='%.14f')
FileLink('x.csv')

In [31]:
np.savetxt("y.csv", y, delimiter=",", fmt="%i")
FileLink('y.csv')