# **Sentiment classification of movie reviews**

For more information, see:
https://github.com/fastai/course-nlp/blob/master/3b-more-details.ipynb

## **IMDB dataset**

This dataset contains an equal number of positive and negative reviews. Neutral reviews are not included.

Our task here is to predict the polarity (positive or negative) of a given text.

In [0]:
from fastai import *
from fastai.text import *

import sklearn.feature_extraction.text as sklearn_text

Some of the datasets we can use are the following:

In [0]:
?? URLs

In [3]:
path = untar_data(URLs.IMDB_SAMPLE)
path

PosixPath('/root/.fastai/data/imdb_sample')

In [4]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


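From the fast.ai library, we'll use `TextList` to load the reviews, split them into training and validation sets using the `is_valid` column, and label them using the `label` column.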

In [0]:
movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
                         .split_from_df(col=2).label_from_df(cols=0))

We can see some of these reviews before going on.

In [6]:
movie_reviews.valid.x[100]

Text xxbos xxmaj once again the same familiar story about a man ( writer here ) who sell his soul to the devil in order to have his most xxunk xxunk in life : success . xxmaj unfunny script ( we should " go home and write better " ) , ridiculous lines in order to understand the " strong " " xxmaj xxunk " message ( our only xxunk in life is to find love , respect and a good friendship ) and a very long trial scene at the end where the agent xxmaj hopkins beat the devil ( xxmaj xxunk xxmaj love xxmaj xxunk is no sexy or evil at all ) for all the bad things she made to this xxunk character . xxmaj not bad efforts from the actors ( xxmaj baldwin also as a director , xxmaj xxunk in a " xxmaj sex and the xxmaj city " role again , xxmaj xxunk with some funny lines in his limited role ) . xxup p.s. xxmaj try also a not so popular film from xxmaj xxunk called " xxmaj xxunk xxunk xxunk " , a brilliant adaptation of this xxunk ( an old man give his soul to the devil to get back his youth )

In [7]:
movie_reviews.valid.y[100]

Category negative

In [8]:
len(movie_reviews.train.x), len(movie_reviews.valid.x)

(800, 200)

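Notice that `itos` (int-to-string) and `stoi` (string-to-int) have different lengths: many different strings can map to the same integer, because every word outside the vocabulary maps to the `xxunk` token.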

In [9]:
len(movie_reviews.vocab.itos), len(movie_reviews.vocab.stoi)

(6008, 19161)
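One plausible reason for the mismatch is that `stoi` behaves like a `defaultdict`: looking up a token that is not in `itos` returns index 0 and, as a side effect, stores that token as a new key. The sketch below reproduces the effect with a toy vocabulary. The sentence, the frequency threshold, and the variable names are assumptions made for illustration, not fastai's actual implementation.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical); tokens seen fewer than 2 times are left out of the vocabulary.
tokens = "the film was funny and the cast was funny too".split()
counts = Counter(tokens)

# Index 0 is reserved for the unknown token.
itos = ["xxunk"] + sorted(t for t, c in counts.items() if c >= 2)

# defaultdict(int) returns 0 for missing keys and stores the key on lookup.
stoi = defaultdict(int, {t: i for i, t in enumerate(itos)})

# Numericalizing the text looks up every token, including the rare ones.
ids = [stoi[t] for t in tokens]

print(len(itos), len(stoi))  # stoi has grown beyond itos
print(ids)
```

Every rare token now has its own key in `stoi`, but all of those keys point to index 0, so `itos` stays short while `stoi` grows.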

In [10]:
movie_reviews.vocab.stoi['funny']

189

In [11]:
movie_reviews.vocab.itos[189]

'funny'

If we look up a word that doesn't appear in the vocabulary, we get the unknown token:

In [12]:
movie_reviews.vocab.itos[movie_reviews.vocab.stoi['hola']]

'xxunk'

In [13]:
t = movie_reviews.train[0][0]
t.data[:30]

array([   2,    5, 4619,   25,    0,   25,  867,   52,    5, 3776,    5, 1800,   95,   37,   85,  191,   64,  935,
          0, 2738,  517,   18,   21,   11,   84, 2417,  192,   88, 3777,   64])

## **Our version of CountVectorizer**

In [14]:
Counter((movie_reviews.valid.x)[0].data)

Counter({0: 32,
         2: 1,
         5: 32,
         6: 1,
         9: 10,
         10: 7,
         11: 10,
         12: 1,
         13: 4,
         14: 6,
         15: 6,
         16: 4,
         18: 2,
         20: 1,
         21: 3,
         23: 1,
         24: 3,
         25: 2,
         26: 1,
         27: 3,
         30: 1,
         44: 1,
         45: 1,
         49: 1,
         50: 3,
         52: 1,
         54: 2,
         58: 1,
         59: 1,
         63: 2,
         71: 1,
         74: 1,
         77: 1,
         84: 1,
         109: 1,
         115: 1,
         149: 1,
         189: 1,
         194: 1,
         197: 2,
         204: 1,
         207: 1,
         221: 1,
         239: 1,
         251: 1,
         258: 1,
         285: 1,
         288: 1,
         319: 1,
         324: 1,
         337: 1,
         358: 1,
         378: 1,
         404: 1,
         409: 1,
         430: 1,
         456: 1,
         478: 1,
         541: 1,
         571: 1,
         579: 1

In [15]:
movie_reviews.vocab.itos[55]

'there'

In [16]:
(movie_reviews.valid.x)[0]

Text xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . 
 
  xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . 
 
  xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxma

In [0]:
def get_term_doc_matrix(label_list, vocab_len):
    j_indices = []
    indptr = []
    values = []
    indptr.append(0)

    for i, doc in enumerate(label_list):
        feature_counter = Counter(doc.data)
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))

    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)
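To see what the three lists passed to `csr_matrix` contain, here is a minimal pure-Python sketch of the same loop on toy data. The documents, `vocab_len`, and the helper `csr_to_dense` are hypothetical and exist only to make the layout visible; the sketch does not use SciPy.

```python
from collections import Counter

# Three toy "documents", already numericalized (hypothetical data).
docs = [[0, 2, 2, 3], [1, 1, 1], [3, 0, 3, 3]]
vocab_len = 4

values, j_indices, indptr = [], [], [0]
for doc in docs:
    feature_counter = Counter(doc)
    j_indices.extend(feature_counter.keys())   # column (token id) of each entry
    values.extend(feature_counter.values())    # count stored at that column
    indptr.append(len(j_indices))              # where the next row starts

def csr_to_dense(values, j_indices, indptr, n_cols):
    """Expand the three CSR lists into a list of dense rows."""
    rows = []
    for start, end in zip(indptr, indptr[1:]):
        row = [0] * n_cols
        for col, val in zip(j_indices[start:end], values[start:end]):
            row[col] += val
        rows.append(row)
    return rows

dense = csr_to_dense(values, j_indices, indptr, vocab_len)
print(indptr)
print(dense)
```

Row `i` of the matrix is described by the slice `indptr[i]:indptr[i+1]` of `j_indices` and `values`, which is why only nonzero counts need to be stored.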

In [0]:
val_term_doc = get_term_doc_matrix(movie_reviews.valid.x, len(movie_reviews.vocab.itos))

In [0]:
trn_term_doc = get_term_doc_matrix(movie_reviews.train.x, len(movie_reviews.vocab.itos))

### **Counters**

Here's a quick explanation of how counters work.

In [0]:
c = Counter([0,1,1,0,5,8,8,8])

In [21]:
c.values()

dict_values([2, 2, 1, 3])

In [22]:
c.keys()

dict_keys([0, 1, 5, 8])
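A detail worth checking is that `keys()` and `values()` come back in the same order, so the i-th key belongs to the i-th value. The term-document construction relies on this when it extends two separate lists. A short sketch using the same counter:

```python
from collections import Counter

c = Counter([0, 1, 1, 0, 5, 8, 8, 8])

# Pair each key with its count; this matches c.items().
pairs = list(zip(c.keys(), c.values()))
print(pairs)
```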

## **More data exploration**

In [23]:
movie_reviews.vocab.itos[4]

'xxfld'

In [24]:
val_term_doc.todense()[:10,:10]

matrix([[32,  0,  1,  0, ...,  1,  0,  0, 10],
        [ 9,  0,  1,  0, ...,  1,  0,  0,  7],
        [ 6,  0,  1,  0, ...,  0,  0,  0, 12],
        [78,  0,  1,  0, ...,  0,  0,  0, 44],
        ...,
        [ 8,  0,  1,  0, ...,  0,  0,  0,  8],
        [43,  0,  1,  0, ...,  8,  1,  0, 25],
        [ 7,  0,  1,  0, ...,  1,  0,  0,  9],
        [19,  0,  1,  0, ...,  2,  0,  0,  5]])

In [25]:
review = movie_reviews.valid.x[1]; review

Text xxbos i saw this movie once as a kid on the late - late show and fell in love with it . 
 
  xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . 
 
  xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk

## **Naive Bayes**

In [26]:
movie_reviews.y.classes

['negative', 'positive']

In [0]:
x = trn_term_doc
y = movie_reviews.train.y
val_y = movie_reviews.valid.y

In [0]:
positive = y.c2i['positive']
negative = y.c2i['negative']

For each word, we sum how many times it appears in positive reviews and how many times it appears in negative reviews.

In [0]:
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

In [30]:
p1[:5]

array([6471,    0,  383,    0,    0], dtype=int64)
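The same per-class sum can be written out by hand on a tiny dense matrix. This is only a sketch: the matrix, the labels, and the helper `class_term_counts` are made up for illustration, and plain lists stand in for the sparse matrix.

```python
# Rows are documents, columns are terms (hypothetical counts).
train = [
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 2],
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

def class_term_counts(rows, labels, cls):
    """Sum the term counts over all documents whose label equals cls."""
    totals = [0] * len(rows[0])
    for row, label in zip(rows, labels):
        if label == cls:
            for j, count in enumerate(row):
                totals[j] += count
    return totals

p1 = class_term_counts(train, labels, 1)
p0 = class_term_counts(train, labels, 0)
print(p1, p0)
```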

In [0]:
v = movie_reviews.vocab

**Exercise:** Look up the word 'hated', find which of the training reviews containing it are positive, and print one of them.

In [32]:
v.stoi['hated']

1977

In [33]:
a = np.argwhere((x[:,1977] > 0))[:,0]
b = np.argwhere(y.items==positive)[:,0]
set(a).intersection(set(b))

{393, 612, 695}
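The logic of this cell is a set intersection: documents that contain the word, intersected with documents labeled positive. A minimal sketch on invented data (the rows, labels, and column index are assumptions):

```python
# Rows are documents, columns are terms; column 0 plays the role of 'hated'.
rows = [
    [0, 1],
    [2, 0],
    [1, 1],
    [0, 0],
]
labels = [1, 1, 0, 1]  # 1 = positive, 0 = negative
word_col = 0

docs_with_word = {i for i, row in enumerate(rows) if row[word_col] > 0}
positive_docs = {i for i, label in enumerate(labels) if label == 1}

# Positive documents that contain the word.
both = docs_with_word & positive_docs
print(both)
```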

In [34]:
review = movie_reviews.train.x[695]
review.text

"xxbos xxmaj xxunk , yeah this episode is extremely underrated . \n \n  xxmaj even though there is a xxup lot of bad writing and acting at parts . i think the good over wins the bad . \n \n  i love the xxunk parts and the big ' twist ' at the end . i absolutely love that scene when xxmaj michelle xxunk xxmaj tony . xxmaj it 's actually one of my favorite scenes of xxmaj season 1 . \n \n  xxmaj for some reason , people have always hated the xxmaj xxunk episodes , yet i have always liked them . xxmaj they 're not the best , in terms of writing . but the theme really does interest me , \n \n  i 'm gon na give it a xxup three star , but if the writing were a little more consistent i 'd give it xxup four ."

## **Applying Naive Bayes**


In [0]:
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

In [0]:
pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)

In [37]:
r = np.log(pr1/pr0)
r

array([-0.015487,  0.084839,  0.      ,  0.084839, ...,  1.471133, -1.301455, -1.301455, -1.301455])
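As a worked example of the log-count ratio, the sketch below applies the same add-one smoothing to toy counts. The counts and document totals are hypothetical; only the formula mirrors the cells above.

```python
import math

# Per-term counts in positive and negative documents (hypothetical).
p1 = [3, 0, 1]
p0 = [0, 4, 3]
n_pos, n_neg = 2, 2  # number of positive / negative documents

# Add 1 to numerator and denominator so no ratio is zero or undefined.
pr1 = [(c + 1) / (n_pos + 1) for c in p1]
pr0 = [(c + 1) / (n_neg + 1) for c in p0]

# Positive r: the term leans positive; negative r: it leans negative.
r = [math.log(a / b) for a, b in zip(pr1, pr0)]
print([round(x, 3) for x in r])
```

With these numbers the ratios are 4, 0.2, and 0.5, so the first term gets a positive score and the other two get negative scores.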

### **Words associated with pos/neg reviews**

In [0]:
biggest = np.argpartition(r, -10)[-10:]
smallest = np.argpartition(r, 10)[:10]

In [39]:
# Most positive words
[v.itos[k] for k in biggest]

['sport',
 'davies',
 'gilliam',
 'fanfan',
 'biko',
 'felix',
 'noir',
 'jabba',
 'astaire',
 'jimmy']

In [40]:
# Most negative words
[v.itos[k] for k in smallest]

['worst',
 'crap',
 'crater',
 'porn',
 'disappointment',
 'dog',
 'vargas',
 'naschy',
 'fuqua',
 'soderbergh']
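`np.argpartition` is used above to pick the indices of the largest and smallest ratios without fully sorting `r`. For readers without NumPy at hand, a rough standard-library analogue of "take the top k and bottom k" uses `heapq`; the words and scores below are invented for illustration.

```python
import heapq

# Hypothetical log-count ratios keyed by word.
r = {
    "great": 1.2,
    "worst": -2.1,
    "fine": 0.1,
    "crap": -1.7,
    "superb": 1.9,
    "dull": -0.9,
}

k = 2
most_positive = heapq.nlargest(k, r, key=r.get)
most_negative = heapq.nsmallest(k, r, key=r.get)
print(most_positive, most_negative)
```

Unlike `np.argpartition`, which does not guarantee any order within the selected group, `heapq.nlargest` and `heapq.nsmallest` return their results sorted.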

In [42]:
np.argmax(trn_term_doc[:,v.stoi['davies']])

292

In [43]:
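# Note: the argmax above returned 292 for 'davies'. Index 515, shown here,
# is a review about Steve Biko ('biko' is also among the most positive words).
movie_reviews.train.x[515]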

Text xxbos " xxmaj the xxmaj true xxmaj story xxmaj of xxmaj the xxmaj friendship xxmaj that xxmaj shook xxmaj south xxmaj africa xxmaj and xxmaj xxunk xxmaj the xxmaj world . " 
 
  xxmaj richard xxmaj attenborough , who directed " a xxmaj bridge xxmaj too xxmaj far " and " xxmaj gandhi " , wanted to bring the story of xxmaj steve xxmaj biko to life , and the journey and trouble that xxunk xxmaj donald xxmaj woods went through in order to get his story told . xxmaj the films uses xxmaj wood 's two books for it 's information and basis - " xxmaj biko " and " xxmaj asking for xxmaj trouble " . 
 
  xxmaj the film takes place in the late 1970 's , in xxmaj south xxmaj africa . xxmaj south xxmaj africa is in the grip of the terrible apartheid , which keeps the blacks separated from the whites and xxunk the whites as the superior race . xxmaj the blacks are forced to live in xxunk on the xxunk of the cities and xxunk , and they come under frequent xxunk by the police and the army . xxmaj w

### **More Naive Bayes**

In [44]:
(y.items==positive).mean(), (y.items==negative).mean()

(0.47875, 0.52125)

In [0]:
b = np.log((y.items==positive).mean() / (y.items==negative).mean())

In [46]:
preds = (val_term_doc @ r + b) > 0
(preds == val_y.items).mean()

0.645
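The prediction rule adds up the ratio `r` of every token in a review, weighted by its count, adds the bias `b`, and predicts positive when the total is above zero. The sketch below uses the class fractions printed above for `b`; the ratios, validation rows, and labels are hypothetical toy values.

```python
import math

# Class prior term, using the fractions shown in the notebook output.
b = math.log(0.47875 / 0.52125)

# Hypothetical log-count ratios for a three-word vocabulary.
r = [math.log(4), math.log(0.2), math.log(0.5)]

# Toy validation set: rows are documents, columns are term counts.
val_docs = [
    [1, 0, 0],
    [0, 2, 1],
    [1, 1, 0],
]
val_labels = [1, 0, 1]  # 1 = positive, 0 = negative

def score(doc):
    """Dot product of term counts with r, plus the bias."""
    return sum(count * weight for count, weight in zip(doc, r)) + b

preds = [score(doc) > 0 for doc in val_docs]
accuracy = sum(p == bool(y) for p, y in zip(preds, val_labels)) / len(val_labels)
print(preds, accuracy)
```

Because the training set has slightly more negative than positive reviews, `b` is slightly negative and nudges borderline reviews toward the negative class.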