<a href="https://colab.research.google.com/github/jacKlinc/movie_review_sentiment/blob/main/notebooks/1_mdl_nlp_train_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [19]:
!pip install -Uqq fastbook
import fastbook

In [20]:
from fastbook import *

# IMDb Movie Review Sentiment Analysis
I want to predict whether a user's review is negative or positive based on the text they type. The approach uses transfer learning in two stages. First, start from a language model pretrained on Wikipedia articles, which has already learned how sentences are structured and which words tend to follow which. Second, fine-tune that model on the IMDb reviews, which exposes it to film-specific vocabulary and gives it the domain-specific language it needs to read reviews.

---

#### TODO
- [ ] Tokenise words
- [ ] Numericalise
- [ ] Preprocess data
- [ ] Train on IMDb

## Tokenisation
This step converts a text corpus into a list of tokens that a computer can process more easily. Most of the work is in deciding how to handle punctuation and capital letters, which are not straightforward to split or normalise.

fastai does not implement word tokenisation itself; by default it wraps spaCy, a widely used NLP library.

Download the IMDb dataset.

In [21]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [22]:
files = get_text_files(path, folders=['train', 'test', 'unsup'])

In [23]:
txt = files[0].open().read()
txt[:60]

"I'm getting frustrated that so many people are complaining t"

### Word Tokeniser

In [24]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#231) ['I',"'m",'getting','frustrated','that','so','many','people','are','complaining','that','this','show','is','propaganda','for','the','Christian','religion','.','I','watched','the','first','few','episodes','of','this','miniseries','and'...]


The above step splits the text into words and punctuation marks. Contractions are split too, so "I'm" becomes "I" and "'m".
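To get a feel for the splitting shown in the output above, here is a minimal sketch that uses a regular expression. It is a toy approximation and not spaCy's algorithm: `toy_word_tokenise` and its pattern are made up for illustration, and it will not agree with spaCy in every case (negations like "don't" are one likely difference).

```python
import re

# Toy pattern, NOT spaCy's rules. Alternatives are tried in order:
#   '\w+     a contraction suffix such as 'm or 's
#   \w+      a run of word characters
#   [^\w\s]  any single punctuation mark
_TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")

def toy_word_tokenise(text):
    """Split text into words, contraction suffixes and punctuation marks."""
    return _TOKEN_RE.findall(text)

print(toy_word_tokenise("I'm getting frustrated."))
```

A real tokeniser carries many more rules and exceptions than this, which is one reason to lean on an existing library.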

In [25]:
tokeniser = Tokenizer(spacy)
print(coll_repr(tokeniser(txt), 30))

(#254) ['xxbos','xxmaj','i',"'m",'getting','frustrated','that','so','many','people','are','complaining','that','this','show','is','propaganda','for','the','xxmaj','christian','religion','.','i','watched','the','first','few','episodes','of'...]


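This step lowercases the text and adds special tokens (all prefixed with `xx`) that record information which would otherwise be lost. Below are some examples:
- `xxbos` marks the beginning of a text stream (here, the start of a review)
- `xxmaj` indicates that the next word started with a capital letter before it was lowercased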
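The effect of these two tokens can be sketched in a few lines of plain Python. This is only an illustration that works on an already-tokenised list; `toy_mark_special` is a hypothetical helper, and fastai's own rules are implemented differently and cover more cases.

```python
BOS, MAJ = "xxbos", "xxmaj"  # token names copied from the tokeniser output above

def toy_mark_special(tokens):
    """Prepend a stream marker, then flag capitalised words and lowercase them."""
    out = [BOS]
    for tok in tokens:
        if tok[:1].isupper():   # the word starts with a capital letter
            out.append(MAJ)
        out.append(tok.lower())
    return out

print(toy_mark_special(["I", "'m", "getting", "frustrated"]))
```

Keeping the capitalisation as a separate token means "Christian" and "christian" share one vocabulary entry, while the model can still see that a capital was there.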


Data is rarely clean, and the review data is no exception: it contains things like HTML fragments and stray whitespace. Luckily, fastai applies a set of default preprocessing rules to help.

In [26]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]

In [27]:
coll_repr(tokeniser('&amp'))

"(#2) ['xxbos','&']"

`&amp` is the HTML entity for the `&` symbol (strictly `&amp;`, with a trailing semicolon). The `fix_html` rule recognises it and replaces it with the actual character, and the tokeniser prepends the usual `xxbos` token.
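Python's standard library can do the same kind of entity decoding with `html.unescape`. The sketch below only demonstrates that function on a few sample strings chosen for illustration; it makes no claim about how `fix_html` is written internally.

```python
import html

# Sample inputs chosen for illustration: with and without the trailing semicolon.
samples = ["&amp", "&amp;", "Tom &amp; Jerry"]
decoded = [html.unescape(s) for s in samples]

for raw, clean in zip(samples, decoded):
    print(repr(raw), "->", repr(clean))
```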

### Sub-Word Tokeniser

## Numericalisation
Maps each token to an integer: its index in a vocabulary (vocab) built from the corpus. Token frequencies are counted while building the vocab, so the most common tokens come first and tokens that appear too rarely (fewer than `min_freq` times, 3 by default) are replaced by the unknown token `xxunk`.

In the sentence "I like my dog and my cat", "my" appears at positions 2 and 5. The vocab stores "my" only once, and both positions are replaced by the same integer.
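The "I like my dog and my cat" example above can be worked through with a small sketch. `build_vocab` and `numericalise` are hypothetical helpers written for this illustration, not fastai functions, and the special tokens and `min_freq=1` default are assumptions chosen to keep the example short.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=("xxunk", "xxpad")):
    """Return a vocab list: special tokens first, then tokens by falling frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common()
                if n >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalise(tokens, vocab):
    """Replace each token with its vocab index; unseen tokens map to 0 (xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

sentence = "i like my dog and my cat".split()
vocab = build_vocab([sentence])
ids = numericalise(sentence, vocab)

print(vocab)
print(ids)                              # both occurrences of "my" share one id
print(numericalise(["bird"], vocab))    # a word outside the vocab maps to xxunk
```

Raising `min_freq` in this sketch pushes rare words out of the vocab, so they come back as index 0.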

In [28]:
toks = tokeniser(txt)  # tokenise once and reuse the result
print(coll_repr(toks, 31))

(#254) ['xxbos','xxmaj','i',"'m",'getting','frustrated','that','so','many','people','are','complaining','that','this','show','is','propaganda','for','the','xxmaj','christian','religion','.','i','watched','the','first','few','episodes','of','this'...]


In [30]:
txts = L(o.open().read() for o in files[:2000])

In [32]:
toks200 = txts[:200].map(tokeniser)
toks200[0]

(#254) ['xxbos','xxmaj','i',"'m",'getting','frustrated','that','so','many','people'...]

In [38]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab, 20)

"(#1992) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','a','and','of','to','it','is','in','i'...]"

In [40]:
nums = num(toks)[:20]
nums

TensorText([   2,    8,   19,  142,  339,    0,   21,   47,  135,   83,   42,    0,   21,   20,  157,   17, 1465,   29,    9,    8])

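The numbers above are the vocab indices of the first 20 tokens, which is the integer form the model can work with. For example, `2` is `xxbos` and `8` is `xxmaj`, matching their positions in the vocab printed earlier. The `0`s are `xxunk`: words such as "frustrated" and "complaining" were too rare in the 200-review sample to make it into the vocab. The next step is to feed these numbers into a model.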