# Introduction To Natural Language Processing

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from fastai2 import *
from fastai2.text import *
from fastai2.text.all import *

In [None]:
import sklearn.feature_extraction.text as sklearn_text

# Tokenizing and term document matrix creation

fast.ai has a number of *datasets hosted via AWS Open Datasets* for easy download. We can see them by checking the docs for `URLs` (remember that `??` is a helpful command for viewing an object's source and documentation).

In [None]:
URLs??

In [None]:
URLs.IMDB_SAMPLE

'http://files.fast.ai/data/examples/imdb_sample.tgz'

In [None]:
path = untar_data(URLs.IMDB)
path

Path('C:/Users/t158185/.fastai/data/imdb')

In [None]:
(path/'train').ls()

(#4) [Path('C:/Users/t158185/.fastai/data/imdb/train/labeledBow.feat'),Path('C:/Users/t158185/.fastai/data/imdb/train/neg'),Path('C:/Users/t158185/.fastai/data/imdb/train/pos'),Path('C:/Users/t158185/.fastai/data/imdb/train/unsupBow.feat')]

In [None]:
df = pd.read_csv("C:\\Users\\t158185\\Downloads\\imdb_sample\\imdb_sample\\imdb_sample\\texts.csv")

In [None]:
df.head()

Unnamed: 0,label,text,is_valid
0,negative,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",False
1,positive,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...",False
2,negative,"Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li...",False
3,positive,"Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie ""Duty, Honor, Country"" are not just mere words blathered from the lips of a high-brassed offic...",False
4,negative,"This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr...",False


# Data processing

For data processing we work with transforms. A transform is a function that is applied to the dataset to convert it into the required format.
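
To make the idea concrete, here is a minimal plain-Python sketch of a reversible transform, loosely modeled on what a label transform such as `Categorize` does. The class `ToyCategorize` and its method names are hypothetical and only illustrate the encode/decode idea; this is not the fastai2 implementation.

```python
class ToyCategorize:
    """Hypothetical stand-in for a label transform: maps labels to ints and back."""

    def __init__(self, labels):
        # Sorted unique labels form the vocabulary; position = integer code.
        self.vocab = sorted(set(labels))
        self.o2i = {label: i for i, label in enumerate(self.vocab)}

    def encode(self, label):
        return self.o2i[label]

    def decode(self, idx):
        return self.vocab[idx]


labels = ["negative", "positive", "negative"]
cat = ToyCategorize(labels)
encoded = [cat.encode(label) for label in labels]
decoded = [cat.decode(i) for i in encoded]
print(encoded)  # [0, 1, 0]
print(decoded)  # ['negative', 'positive', 'negative']
```

Keeping a `decode` step next to `encode` is what lets us turn model-ready integers back into readable labels for inspection.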

In [None]:
tfms=[attrgetter('label'), Categorize()]
tlist = TfmdList(df, tfms)

In [None]:
tlist.decode(tlist[0:10])

"(#10) ['negative','positive','negative','positive','negative','negative','negative','negative','negative','negative']"

Let's do the same for the text. For text we need to follow these steps:
1. Tokenize the text
2. Add extra tokens to take care of special cases
3. Numericalize the tokens
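
The three steps can be sketched in plain Python. This is a toy illustration, not the fastai tokenizer: the regex, the helper names (`toy_tokenize`, `toy_numericalize`), and the simplified handling of `xxbos`, `xxmaj`, and `xxunk` are all assumptions made for the example.

```python
import re
from collections import Counter


def toy_tokenize(text):
    """Split into word and punctuation tokens, marking capitalized words with xxmaj."""
    tokens = ["xxbos"]  # beginning-of-stream marker
    for tok in re.findall(r"\w+|[^\w\s]", text):
        if tok[0].isupper():
            tokens.append("xxmaj")  # the next token was capitalized
        tokens.append(tok.lower())
    return tokens


def toy_numericalize(docs, min_freq=2):
    """Build a vocab of tokens seen at least min_freq times; map the rest to xxunk (id 0)."""
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = ["xxunk"] + sorted(t for t, c in counts.items() if c >= min_freq)
    o2i = {t: i for i, t in enumerate(vocab)}
    ids = [[o2i.get(t, 0) for t in doc] for doc in docs]
    return vocab, ids


docs = [toy_tokenize(t) for t in ["Great film!", "Great acting, bad film."]]
vocab, ids = toy_numericalize(docs)
print(docs[0])  # ['xxbos', 'xxmaj', 'great', 'film', '!']
print(vocab)    # ['xxunk', 'film', 'great', 'xxbos', 'xxmaj']
print(ids[0])   # [3, 4, 2, 1, 0]
```

Note how the rare token `!` falls below `min_freq` and is mapped to the `xxunk` id, which is the same effect `min_freq=3` has in the real pipeline.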

In [None]:
tfms = [attrgetter('text'), Tokenizer.from_df('text'), Numericalize(min_freq=3)]
tlits = TfmdList(df, tfms)

In [None]:
tlits.decode(tlits[0])

"xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk … xxmaj xxunk xxrep 3 ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . w xxrep 3 o xxrep 3 f !"

In [None]:
df.iloc[0].text

"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!"

In NLP, a token is the basic unit of processing (what the tokens are depends on the application and on your choices). Here the tokens mostly correspond to words or punctuation marks, plus several special tokens that mark unknown words, capitalization, and so on.

All tokens starting with "xx" are fastai special tokens.

In [None]:
len(tlits.vocab)

19240

# Create a datasource

`DataSource` is just a fancy name for combining two transformed lists (here the text and the labels), adding splits, and converting them into train / valid datasets.
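
As a rough illustration of the splitting step, the sketch below shuffles item indices and holds out a fraction for validation. `toy_random_split`, `valid_pct`, and the seed are hypothetical names chosen for this example; the 20% hold-out simply mirrors the 800/200 split seen in this notebook, and this is not fastai's own `RandomSplitter` code.

```python
import random


def toy_random_split(n_items, valid_pct=0.2, seed=0):
    """Return (train_idxs, valid_idxs) after shuffling range(n_items)."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)  # local RNG so the global state is untouched
    cut = int(n_items * valid_pct)
    return idxs[cut:], idxs[:cut]


train_idxs, valid_idxs = toy_random_split(1000)
print(len(train_idxs), len(valid_idxs))  # 800 200
```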

In [None]:
splits = RandomSplitter()(range_of(df))
dsrc = DataSource(df, tfms=[[attrgetter('text'), Tokenizer.from_df('text'), Numericalize(min_freq=3)],[attrgetter('label'), Categorize()]], splits=splits)

In [None]:
len(dsrc.train), len(dsrc.valid)

(800, 200)

In [None]:
dsrc.o2i['language']

938

In [None]:
dsrc.vocab[0][938]

'language'

# Sparse matrices (in SciPy)

Even though we've reduced over 19,000 words down to roughly 7,000, that is still a lot. Most tokens don't appear in most reviews. We can take advantage of this by storing our data in a sparse matrix.

These are the most common sparse storage formats:
* Coordinate-wise (SciPy calls it COO)
* Compressed sparse row (CSR)
* Compressed sparse column (CSC)
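
To make the coordinate format concrete, here is a small pure-Python sketch that stores only the non-zero entries of a dense matrix as (row, column, value) triplets and rebuilds the dense form from them. The helper names `to_coo` and `from_coo` are hypothetical; SciPy's `coo_matrix` keeps the same three pieces of information in its `row`, `col`, and `data` arrays.

```python
def to_coo(dense):
    """Return parallel lists (rows, cols, vals) of the non-zero entries."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


def from_coo(rows, cols, vals, shape):
    """Rebuild a dense list-of-lists; repeated coordinates are summed."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for i, j, v in zip(rows, cols, vals):
        dense[i][j] += v
    return dense


dense = [[0, 3, 0], [2, 0, 0], [0, 0, 1]]
rows, cols, vals = to_coo(dense)
print(rows, cols, vals)  # [0, 1, 2] [1, 0, 2] [3, 2, 1]
restored = from_coo(rows, cols, vals, (3, 3))
```

COO is convenient for building a matrix incrementally, which is why it is a natural first step before converting to CSR or CSC for arithmetic and slicing.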

In [None]:
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix, coo_matrix

np.random.seed(0)
mat = csr_matrix(np.random.random((10, 12))> 0.7, dtype=int)
mat[1,0]=2
mat[0,1]=3


In [None]:
print(mat.toarray())

[[0 3 1 0 1 1 0 1 0 1 1 0]
 [2 0 1 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 1 0 0 0 1]
 [0 0 0 0 0 0 1 1 0 0 0 1]
 [0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 1 0 0 1 1 0]
 [0 0 0 0 1 0 1 0 1 0 1 1]
 [0 0 0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0 1 1 1 1]
 [0 0 0 0 1 0 1 0 0 0 0 0]]


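A CSR matrix is stored as three dense arrays: *data*, *indices* and *indptr*.
- *data* contains the non-zero values of the matrix, in the order in which they would be encountered walking along the rows, left to right and top to bottom. If this were a CSC matrix, the walk would be along the columns instead.
- *indices* holds the column index of each value in *data*.
- *indptr* marks where each row starts in *data* and *indices*: row `i` occupies positions `indptr[i]` up to (but not including) `indptr[i+1]`.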
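
This layout can be reproduced by hand. The sketch below builds the three arrays of the compressed sparse row format for a small dense matrix in pure Python and uses `indptr` to slice out one row; `to_csr` and `csr_row` are hypothetical helper names, not SciPy functions.

```python
def to_csr(dense):
    """Walk the rows in order and record values, column indices, and row boundaries."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                indices.append(j)
        indptr.append(len(data))  # where the next row starts
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the three arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


dense = [[0, 3, 1, 0], [2, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 1]]
data, indices, indptr = to_csr(dense)
print(data)     # [3, 1, 2, 1, 1]
print(indices)  # [1, 2, 0, 1, 3]
print(indptr)   # [0, 2, 3, 3, 5]
print(csr_row(data, indices, indptr, 3, 4))  # [0, 1, 0, 1]
```

An empty row shows up as two equal consecutive entries in `indptr` (here `3, 3`), and the last entry of `indptr` equals the number of stored values.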

In [None]:
mat.data[:10]

array([3, 1, 1, 1, 1, 1, 1, 2, 1, 1], dtype=int32)

In [None]:
mat.indptr  # row pointers

array([ 0,  7, 10, 12, 15, 16, 20, 25, 27, 32, 34], dtype=int32)

In [None]:
mat.indices # column indices

array([ 1,  2,  4,  5,  7,  9, 10,  0,  2, 11,  7, 11,  6,  7, 11,  7,  2,
        6,  9, 10,  4,  6,  8, 10, 11,  5,  7,  4,  8,  9, 10, 11,  4,  6],
      dtype=int32)

In [None]:
len(mat.indices) # number of stored (non-zero) entries

34

In [None]:
def get_coo_matrix(docs, vocab_len):
    
    rows=[]
    datas = []
    cols = []
    
    for i,d in enumerate(docs):
        uniq_indices, counts = np.unique(d[0].numpy(), return_counts=True)
        for key,value in zip(uniq_indices, counts):
            rows.append(i)     # appending the doc_idx
            cols.append(key)      # appending where token is situated
            datas.append(value)
    
    return coo_matrix((datas, (rows,cols)), shape=(len(docs), vocab_len))

In [None]:
dtm = get_coo_matrix(dsrc.train, len(dsrc.vocab[0]))

In [None]:
np.unique(dsrc.train[0][0].numpy(),return_counts=True)

(array([   0,    2,    5,    7,    8,    9,   10,   11,   12,   13,   14,
          15,   16,   17,   18,   19,   20,   21,   22,   23,   24,   25,
          28,   29,   30,   31,   33,   34,   35,   36,   38,   39,   40,
          46,   49,   54,   55,   56,   58,   60,   61,   64,   72,   74,
          77,   82,   88,   92,   93,  119,  120,  125,  126,  132,  133,
         143,  151,  161,  170,  182,  203,  212,  223,  234,  248,  268,
         269,  283,  290,  292,  295,  325,  328,  329,  349,  351,  361,
         434,  450,  452,  464,  467,  487,  544,  627,  650,  659,  660,
         984, 1170, 1399, 1402, 1420, 1561, 1834, 1852, 1944, 2001, 2146,
        3135, 4466, 4589, 4853, 5374, 6022, 6505, 6571, 6587], dtype=int64),
 array([ 6,  1,  2,  3, 18,  5,  3, 12,  3,  5,  4,  3,  4,  2,  3,  2,  2,
         3,  4,  1,  1,  1,  2,  4,  2,  1,  3,  2,  1,  2,  1,  1,  1,  2,
         1,  1,  1,  1,  1,  2,  1,  2,  1,  1,  1,  2,  1,  1,  1,  1,  1,
         1,  1,  1,  2,  1,  

In [None]:
dtm.todense()[0,0]

6

In [None]:
# Convert to sparse matrix csr format

In [None]:
trn_term_doc = dtm.tocsr()

In [None]:
trn_term_doc.shape

(800, 7080)

### Small verification

In [None]:
# checking the 2nd document

trn_term_doc[1].sum() , len(dsrc.train[1][0]) 

(824, 824)

In [None]:
# checking the unique tokens
len(trn_term_doc[1].data), len(torch.unique(dsrc.train[1][0]))

(313, 313)

# Naive Bayes classifier

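We define the log-count ratio r for each word f as the log of how often f occurs in positive reviews relative to how often it occurs in negative reviews.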
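
Written out, and matching the add-one smoothing used in the code later in this notebook, the ratio can be expressed as follows. The symbols are notation introduced here for illustration: $c_f^{(1)}$ and $c_f^{(0)}$ are the total counts of token $f$ in positive and negative training reviews, and $N_1$ and $N_0$ are the numbers of positive and negative training reviews.

$$
r_f = \log \frac{\left(c_f^{(1)} + 1\right) / \left(N_1 + 1\right)}{\left(c_f^{(0)} + 1\right) / \left(N_0 + 1\right)}
$$

A positive $r_f$ means the token is relatively more frequent in positive reviews, and a negative $r_f$ means it is relatively more frequent in negative ones.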

In [None]:
x = trn_term_doc
y = np.array([y[1].numpy().item() for y in dsrc.train])

In [None]:
p1 = np.squeeze(np.asarray(x[y==1].sum(0)))
p0 = np.squeeze(np.asarray(x[y==0].sum(0)))

In [None]:
p1[:10]

array([5644,    0,  370,    0,    0,   71,    0,  579, 9329, 5166],
      dtype=int64)

# Questions

How many times does the word "loved" appear in positive vs. negative reviews?

In [None]:
dsrc.o2i['loved']

532

In [None]:
x[y==1][:,532].sum(), x[y==0][:,532].sum()

(30, 12)

How many times does the word 'hated' appear in positive vs. negative reviews?

In [None]:
ind = dsrc.o2i['hated']
x[y==1][:,ind].sum(), x[y==0][:,ind].sum()

(1, 7)

### Positive reviews with the word hated

In [None]:
a = np.argwhere(x[:,ind]>0)[:,0]

In [None]:
b = np.argwhere(y==1)[:,0]

In [None]:
set(a).intersection(set(b))

{221}

In [None]:
dsrc.decode(dsrc.train[221])

('xxbos xxmaj there are numerous films relating to xxup xxunk , but xxmaj mother xxmaj night is quite distinctive among them : xxmaj in this film , we are introduced to xxmaj howard xxmaj campbell ( nolte ) , an xxmaj american living in xxmaj berlin and married to a xxmaj german , xxmaj xxunk xxmaj xxunk ( lee ) , who decides to accept the role of a spy : xxmaj more specifically , a xxup cia agent xxmaj major xxmaj xxunk ( goodman ) recruits xxmaj campbell who becomes a xxmaj nazi xxunk in order to enter the highest xxunk of the xxmaj hitler xxunk . xxmaj however , the deal is that the xxup us xxmaj government will never xxunk xxmaj campbell \'s role in the war for national security reasons , and so xxmaj campbell becomes a hated figure across the xxup us . xxmaj after the war , he tries to xxunk his identity , but the past comes back and haunts him . xxmaj his only " friend " is xxmaj xxunk , but even he can not do much for the xxunk of events that fall upon poor xxmaj campbell … \n\n

# Applying naive bayes

In [None]:
p1 = np.squeeze(np.array(x[y==1].sum(axis=0)))  # token counts over positive reviews
p0 = np.squeeze(np.array(x[y==0].sum(axis=0)))  # token counts over negative reviews

In [None]:
pr1 = (p1+1)/((y==1).sum()+1)
pr0 = (p0+1)/((y==0).sum()+1)

In [None]:
r = np.log(pr1/pr0); r
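
For reference, one common way to turn a log-count ratio into a classifier is to score each document as the dot product of its token counts with r, plus a bias b equal to the log ratio of class frequencies, and to predict positive when the score is above zero. The sketch below does this on a made-up 4-document, 3-token count matrix in plain Python; the data and the helper names `nb_fit` and `nb_predict` are invented for illustration and are not taken from the IMDB sample.

```python
import math


def nb_fit(x, y):
    """Return (r, b) for binary labels y, using add-one smoothing."""
    n_feats = len(x[0])
    n1 = sum(1 for label in y if label == 1)
    n0 = len(y) - n1
    # Per-token counts summed over the documents of each class.
    p1 = [sum(row[j] for row, label in zip(x, y) if label == 1) for j in range(n_feats)]
    p0 = [sum(row[j] for row, label in zip(x, y) if label == 0) for j in range(n_feats)]
    r = [math.log(((p1[j] + 1) / (n1 + 1)) / ((p0[j] + 1) / (n0 + 1)))
         for j in range(n_feats)]
    b = math.log(n1 / n0)
    return r, b


def nb_predict(x, r, b):
    """Predict 1 (positive) when the score x . r + b is above zero."""
    return [int(sum(c * w for c, w in zip(row, r)) + b > 0) for row in x]


# Columns: made-up counts of "loved", "hated", "movie"
x = [[2, 0, 1], [1, 0, 2], [0, 3, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
r, b = nb_fit(x, y)
preds = nb_predict(x, r, b)
print(preds)  # [1, 1, 0, 0]
```

In this toy data the ratio for "loved" comes out positive and the ratio for "hated" negative, which is the pattern the counts above suggest for the real reviews.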

# Vocab most likely associated with positive / negative reviews

In [None]:
biggest = np.argpartition(r,-10)[-10:]
smallest = np.argpartition(r, 10)[:10]

In [None]:
[dsrc.vocab[0][k] for k in biggest]

['jeffrey',
 'crafted',
 'president',
 'screenwriter',
 'portray',
 'claiming',
 'china',
 'eve',
 'respective',
 'xxfake']

In [None]:
[dsrc.vocab[0][k] for k in smallest]

['80', 'carol', 'lots', 'up', 'dr', 'dahl', 'davis', 'stick', 'some', 'elvira']

# Sklearn

In [None]:
x_train = get_coo_matrix(dsrc.train, len(dsrc.vocab[0]))
y_train = np.array([y[1].item() for y in dsrc.train])

x_test = get_coo_matrix(dsrc.valid, len(dsrc.vocab[0]))
y_test = np.array([y[1].item() for y in dsrc.valid])

In [None]:
from sklearn.linear_model import LogisticRegression
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')  # the dual formulation requires the liblinear solver

m.fit(x_train, y_train)
y_pred = m.predict(x_test)

In [None]:
(y_pred==y_test).mean()

0.83

In [None]:
arr = np.squeeze(m.coef_)

biggest = np.argpartition(arr, -10)[-10:]
smallest = np.argpartition(arr, 10)[:10]

In [None]:
[dsrc.vocab[0][k] for k in biggest]

['also',
 'both',
 'best',
 'more',
 'very',
 'may',
 'well',
 'see',
 'great',
 'times']

In [None]:
[dsrc.vocab[0][k] for k in smallest]

['bad', '*', '?', 'worst', 'were', 'no', 'plot', 'simply', 'poor', 'acting']

# Sklearn naive bayes

In [None]:
from sklearn.naive_bayes import GaussianNB,MultinomialNB

In [None]:
m = MultinomialNB()
m.fit(x_train.toarray(), y_train)
y_pred = m.predict(x_test.toarray())

In [None]:
(y_pred==y_test).mean()

0.79

In [None]:
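arr = m.feature_log_prob_[1]  # log P(token | positive class); `coef_` is no longer available on MultinomialNB in recent scikit-learn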

biggest = np.argpartition(arr, -10)[-10:]
smallest = np.argpartition(arr, 10)[:10]

In [None]:
[dsrc.vocab[0][k] for k in biggest]

['is', 'to', 'of', 'a', 'and', 'xxmaj', 'the', ',', '.', 'xxunk']

In [None]:
[dsrc.vocab[0][k] for k in smallest]

['hallan',
 'dave',
 'bollywood',
 'sheets',
 'balcony',
 'flies',
 'unwisely',
 'pokes',
 'laying',
 'object']