# NLP with RNN

This notebook uses an RNN for text classification. We train a model on the IMDB dataset to classify movie reviews as either positive or negative.

## Tokenization  

### Word Tokenization with fast.ai

Grabbing the IMDB dataset:

In [None]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

We use get_text_files to collect the text files under the path for tokenization.  
We can also pass folders to restrict the search to a particular list of subfolders:

In [None]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

Grabbing the first file, then opening and reading it:

In [None]:
txt = files[0].open().read(); txt[:75]

fastai uses a library called spaCy for word tokenization, so we create a WordTokenizer to get it.  
We use fastai's coll_repr(collection, n) function to display the results; it shows the first n items of collection.  
The tokenizer (spacy) only takes a collection of documents, so we wrap txt in a list before passing it in.

In [None]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

spaCy splits off "." when it terminates a sentence, but not when it appears in an acronym or a number:

In [None]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

fastai's Tokenizer class adds some additional functionality to the tokenization process:

In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

The Tokenizer lowercases everything, so the embedding matrix only needs entries for lowercase tokens.  
A word that began with a capital letter is preceded by the special token 'xxmaj', and the beginning of a stream is marked with 'xxbos'.

Other special tokens:

In [None]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

xxrep: replaces any character repeated 3 or more times with a special token for repetition (xxrep), the number of times it's repeated, then the character.  
xxup: lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it.
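
To make these rules concrete, here is a minimal plain-Python sketch of the xxbos, xxmaj, xxup and xxrep behaviour described above. It is not fastai's implementation: the function names (replace_rep, mark_case, toy_rules) are hypothetical, the text is split on whitespace rather than with spaCy, and edge cases are handled only roughly.

```python
import re

def replace_rep(text):
    """Replace a character repeated 3+ times with: xxrep <count> <char>."""
    def _repl(match):
        run, char = match.group(0), match.group(1)
        return f" xxrep {len(run)} {char} "
    return re.sub(r"(\S)\1{2,}", _repl, text)

def mark_case(tokens):
    """Lowercase tokens and record the original casing with special tokens."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ["xxup", tok.lower()]        # word written in all caps
        elif tok[0].isupper() and tok[1:].islower():
            out += ["xxmaj", tok.lower()]       # word starting with a capital
        else:
            out.append(tok)
    return out

def toy_rules(text):
    """Toy stand-in for the rules above (whitespace split, not spaCy)."""
    tokens = replace_rep(text).split()
    return ["xxbos"] + mark_case(tokens)        # xxbos marks the stream start

print(toy_rules("This movie was SO good!!!!"))
# ['xxbos', 'xxmaj', 'this', 'movie', 'was', 'xxup', 'so', 'good', 'xxrep', '4', '!']
```

The casing information is kept as separate tokens, so the model can still learn from it even though the vocab contains only lowercase words.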

### Subword Tokenization

This method has two steps:  
1) Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
2) Tokenize the corpus using this vocab of subword units.
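
The two steps above can be sketched in plain Python with a byte-pair-encoding-style procedure: repeatedly merge the most frequent adjacent pair of symbols, then replay those merges on new text. This is only an illustration under that assumption. fastai's subword tokenizer relies on the SentencePiece library, whose algorithm differs in detail, and the names learn_merges, apply_merge and tokenize are hypothetical.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(corpus, num_merges):
    """Step 1: find the most common groups of letters in the corpus."""
    # Each word starts as single characters; "▁" marks the start of a word.
    words = Counter(tuple("▁" + w) for doc in corpus for w in doc.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:                      # every word is already one symbol
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic ties
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[apply_merge(symbols, best)] += freq
        words = merged
    return merges

def tokenize(text, merges):
    """Step 2: tokenize text by replaying the learned merges in order."""
    tokens = []
    for w in text.split():
        symbols = tuple("▁" + w)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens

corpus = ["the movie was good", "the movie was bad", "the film was good"]
few, many = learn_merges(corpus, 3), learn_merges(corpus, 30)
print(tokenize("the movie was good", few))
print(tokenize("the movie was good", many))
```

More merges correspond to a larger vocab of longer subword units, so the second call never produces more tokens than the first.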

Let's look at an example.  
For our corpus, we'll use the first 2,000 movie reviews:

In [None]:
txts = L(o.open().read() for o in files[:2000])

Now we can use setup(), a special fastai method that trains our tokenizer: it finds common sequences of characters in the corpus and uses them to create the vocab.  
We'll create a function that takes the size of the vocabulary:

In [None]:
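def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    # Train the tokenizer on the corpus to build a vocab of size sz
    sp.setup(txts)
    # Tokenize the single review txt and show its first 40 tokens
    return ' '.join(first(sp([txt]))[:40])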

Trying it:

In [None]:
subword(1000)

The special character ▁ represents a space character in the original text when using fastai's subword tokenizer.

For small vocabs, each token represents fewer characters, so it takes more tokens to represent a sentence.  
For larger vocabs, most common English words end up in the vocab themselves, so fewer tokens are needed to represent a sentence.
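
The trade-off can be illustrated without training anything. The sketch below uses a greedy longest-match splitter with two hand-picked, hypothetical vocabularies (small_vocab and large_vocab); it is not how fastai or SentencePiece segment text, only a way to see how vocab size changes the token count.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest pieces found in vocab.

    Falls back to single characters when no longer piece matches.
    """
    text = text.replace(" ", "▁")          # mark spaces as in the output above
    max_len = max(map(len, vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabularies, chosen by hand for illustration only.
small_vocab = {"th", "is", "ov"}
large_vocab = small_vocab | {"this", "▁movie", "▁is", "▁great"}

sentence = "this movie is great"
small_tokens = greedy_tokenize(sentence, small_vocab)
large_tokens = greedy_tokenize(sentence, large_vocab)
print(len(small_tokens), small_tokens)
print(len(large_tokens), large_tokens)
```

With the small vocab most of the sentence falls back to single characters, while the large vocab covers each word with a single token.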

In [None]:
subword(200)

In [None]:
subword(10000)

### Numericalization

Numericalization is the process of mapping tokens to integers, treating the tokens as the levels of a categorical variable. It involves:  
1) Make a list of all possible levels of that categorical variable (the vocab).  
2) Replace each level with its index in the vocab. 
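
As a rough illustration of these two steps, here is a minimal plain-Python sketch. It is not fastai's Numericalize: the helper names (build_vocab, numericalize), the min_freq default, and the ordering of the vocab are assumptions made for this example. The special tokens xxunk and xxpad are placed first, and any token missing from the vocab maps to xxunk.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=("xxunk", "xxpad")):
    """Step 1: list every distinct token, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(
        (t for t, c in counts.items() if c >= min_freq and t not in specials),
        key=lambda t: (-counts[t], t),      # by frequency, ties alphabetical
    )
    return list(specials) + ordered

def numericalize(tokens, vocab):
    """Step 2: replace each token with its index in the vocab."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unk = index["xxunk"]                    # fallback for unseen tokens
    return [index.get(tok, unk) for tok in tokens]

docs = [["xxbos", "the", "movie", "was", "good"],
        ["xxbos", "the", "movie", "was", "bad"]]
vocab = build_vocab(docs)
ids = numericalize(["xxbos", "the", "plot", "was", "good"], vocab)
print(vocab)                    # ['xxunk', 'xxpad', 'movie', 'the', 'was', 'xxbos', 'bad', 'good']
print(ids)                      # [5, 3, 0, 4, 7]
print([vocab[i] for i in ids])  # ['xxbos', 'the', 'xxunk', 'was', 'good']
```

Mapping the integers back through the vocab recovers the tokens, except that the unseen word "plot" comes back as xxunk.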

We'll use the word-tokenized text:

In [None]:
toks = tkn(txt)
print(coll_repr(toks, 31))

To numericalize, we first have to call setup(), which creates the vocab.  
Since tokenization takes a while, fastai does it in parallel; for this manual walkthrough, we'll use a small subset:

In [None]:
toks200 = txts[:200].map(tkn)
toks200[0]

In [None]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab, 20)