## NLP

### The problem with classifying movie reviews

For movie reviews, a traditional classification approach, where each word occurrence is an input feature and the output is positive/negative, does not give good results.

This is because language is very complex (sarcasm, hidden meaning, etc.). We need a model that can "link" different words together to form a new meaning.

![a](lesson4/lesson4-1.png "")

## Language Model

"Language model" has a very specific meaning in NLP. It is a model that learns to predict the next word of a sentence. To predict the next word of a sentence, you actually need to learn quite a lot about the language, and also quite a lot of world knowledge.

E.g. the language model should be able to predict:
```
I'd like to eat a hot ___. (dog)
It was a hot ___. (day)
```
It must not mix up the two words between these sentences.

## N-gram

An n-gram basically measures how often a pair or triplet of words tends to appear next to each other.

N-grams make a poor basis for a text classifier. Because they only account for pairs or triplets of words, they cannot solve the example above.
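
To make the limitation concrete, here is a minimal sketch that counts word pairs (bigrams) on a tiny made-up corpus. The corpus and the helper names `bigrams` and `predict_next` are assumptions for illustration only. Because the prediction depends only on the previous word, both "hot ___" prompts get the same answer.

```python
from collections import Counter

# Toy corpus, made up for illustration.
corpus = [
    "i would like to eat a hot dog",
    "it was a hot day",
    "it was a hot day today",
]

def bigrams(tokens):
    """Return the list of adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))

# Count how often each pair of words appears next to each other.
counts = Counter()
for sentence in corpus:
    counts.update(bigrams(sentence.split()))

def predict_next(word):
    """Pick the most frequent follower of `word`, ignoring all earlier context."""
    followers = {pair[1]: n for pair, n in counts.items() if pair[0] == word}
    return max(followers, key=followers.get)

# "eat a hot ___" and "it was a hot ___" both end in "hot",
# so the bigram model cannot tell them apart.
print(predict_next("hot"))  # day
```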

## How to create language model

So how do we create a language model that can answer the example above? We need a pre-trained model that has been trained on a large dataset, such as Wikipedia.

It is like getting a model to learn English by asking it to read every Wikipedia page.

Note: WikiText-103 has over 100 million tokens (about 103 million, hence the name), while a movie review dataset is far smaller.


## Create language model for movie reviews

We can now take the pre-trained WikiText-103 language model and use transfer learning to fine-tune it into a new language model on the movie reviews.

## Tokenization

The first step of processing we make the texts go through is to split the raw sentences into words, or more exactly tokens. The easiest way to do this would be to split the string on spaces, but we can be smarter:

- we need to take care of punctuation
- some words are contractions of two different words, like isn't or don't
- we may need to clean some parts of our texts, if there's HTML code for instance

To see what the tokenizer has done behind the scenes, let's have a look at a few texts in a batch.

![a](lesson4/lesson4-2.png "")

The texts are truncated at 100 tokens for more readability. We can see that it did more than just split on space and punctuation symbols: 
- the "'s" are grouped together in one token
- the contractions are separated like this: "did", "n't"
- content has been cleaned for any HTML symbol and lower cased
- there are several special tokens (all those that begin with xx), to replace unknown tokens (see below) or to introduce different text fields (here we only have one).
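
The rules above can be approximated with a few regular expressions. This is only a rough sketch, not the tokenizer fastai actually uses. The function name `toy_tokenize` is made up, and special tokens (the xx ones) are left out.

```python
import re

def toy_tokenize(text):
    """Rough tokenizer: strips HTML tags, lowercases, splits contractions and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)     # drop HTML tags such as <br />
    text = text.lower()
    text = re.sub(r"n't\b", " n't", text)    # "didn't" -> "did n't"
    text = re.sub(r"'s\b", " 's", text)      # "film's" -> "film 's"
    # A token is: n't, 's, a run of word characters, or one punctuation mark.
    return re.findall(r"n't|'s|\w+|[^\w\s]", text)

tokens = toy_tokenize("The film's plot didn't work.<br />Bad!")
print(tokens)
```

Splitting on spaces alone would have produced `film's`, `didn't` and `work.<br` as single tokens.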

## Numericalization

Once we have extracted tokens from our texts, we convert them to integers by creating a list of all the words used. We only keep the ones that appear at least twice, with a maximum vocabulary size of 60,000 (by default), and replace the ones that don't make the cut with the unknown token `UNK`.

The correspondence from ids to tokens is stored in the `vocab` attribute of our datasets, in a list called `itos` (for int to string).

Note: every word in the vocabulary requires a separate row in the weight matrix, which is why we restrict it to 60,000 at most.
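
A minimal sketch of the idea, assuming the unknown token is spelled `xxunk` and sits at index 0. The helper names `build_vocab` and `numericalize` are made up and this is not fastai's code.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=2, max_vocab=60000, unk="xxunk"):
    """Build itos (int -> string) and stoi (string -> int) from tokenized texts."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent tokens first; drop anything seen fewer than `min_freq` times.
    kept = [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]
    itos = [unk] + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids, sending anything outside the vocab to the unknown id 0."""
    return [stoi.get(tok, 0) for tok in tokens]

texts = [["the", "movie", "was", "great"], ["the", "movie", "was", "bad"]]
itos, stoi = build_vocab(texts)
ids = numericalize(["the", "movie", "was", "awful"], stoi)
print(itos, ids)
```

Here "great" and "bad" appear only once, so they are dropped from the vocab, and "awful" (never seen) maps to the unknown id.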

### Check lesson3-imdb.ipynb language model

Note: for language model training we can use all the data we have: the training set, and even the validation and test sets.

Because we are not using the labels, we are not training a classifier. This is just a language model, so there is no "cheating" involved.

### Encoder

TODO

### Check lesson3-imdb.ipynb Classifier

Note: when we create a new classifier from a pre-trained language model, make sure the vocabulary (the mapping between tokens and ids) is exactly the same.

That's why we pass the vocab from the language model to the classifier (only the first call of the chain is shown):
```
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
```

### Note: learning rate with `slice`, using 2.6 as the magic number
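
As a rough illustration of what the 2.6 could mean: assuming the learning rate is given as a range (lowest layers to top layers) and spread geometrically over the model's layer groups, each lower group would train 2.6 times more slowly than the one above it. The function `spread_learning_rates` and the choice of five layer groups are assumptions, not something stated in these notes.

```python
def spread_learning_rates(top_lr, n_groups, factor=2.6):
    """Learning rate per layer group, lowest group first.

    Each group gets a rate `factor` times smaller than the group above it.
    """
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

# Assumed setup: top learning rate 1e-2 and five layer groups.
lrs = spread_learning_rates(1e-2, n_groups=5)
print(lrs)
```

With five groups, the lowest rate is `1e-2 / 2.6**4` and the highest is `1e-2`.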

## Tabular

### Categorical variables

Variables that take one of a finite set of values, like marital status or occupation.

We need to use a technique called embedding to feed those values into a neural net.

Sometimes you need to think carefully about what to treat as a categorical variable. E.g. day of the month: the 1st, 15th and 30th probably have different purchasing behavior. Furthermore, there are at most 31 days in a month, so the cardinality is not high. Isn't it better to treat this as a categorical variable than as a continuous one?

### Continuous Variables

These are numbers that can take any value within a range.

This data can be fed directly into the neural net, like the pixels in an image.

### fastai Processor

Processors do data pre-processing. They run once, before any training. They run on the training set, and any state or metadata they compute is then shared with the validation and test sets.

E.g. for image recognition, the set of classes for different cat breeds is turned into numbers, and that mapping is the same for the training set and the validation set.

```
procs = [FillMissing, Categorify, Normalize]
```
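
The "run once on the training set, then share the state" idea can be sketched in plain Python. This is a toy stand-in for what processors like `FillMissing` and `Normalize` do conceptually, not their actual implementation. The names `fit_processor` and `apply_processor` and the age column are made up.

```python
from statistics import mean, median, pstdev

def fit_processor(train_column):
    """Compute the state once, from the training set only."""
    present = [v for v in train_column if v is not None]
    return {"fill": median(present), "mean": mean(present), "std": pstdev(present)}

def apply_processor(column, state):
    """Fill missing values, then normalize, reusing the stored training statistics."""
    filled = [state["fill"] if v is None else v for v in column]
    return [(v - state["mean"]) / state["std"] for v in filled]

train_age = [20, 30, None, 40]
valid_age = [None, 50]

state = fit_processor(train_age)               # state comes from training data only
train_out = apply_processor(train_age, state)
valid_out = apply_processor(valid_age, state)  # same fill value, mean and std reused
print(state, valid_out)
```

The validation set never contributes its own median, mean or standard deviation, so both sets are transformed in exactly the same way.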

### fastai Transformation

Transforms do data augmentation, i.e. they randomly change the data. The result is different every time we train, as with image transformations.

### Tabular data validation

Rule of thumb: make the validation set one contiguous block. For time-related data, the validation set should be a contiguous range of dates; for video, it should be a run of consecutive frames.
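
A minimal sketch of a time-based split, using made-up daily sales rows. The function `split_by_time` and the `date`/`sales` fields are hypothetical. The most recent rows are held out as one contiguous block instead of being sampled at random.

```python
def split_by_time(rows, valid_fraction=0.2):
    """Sort by date and hold out the most recent contiguous block for validation."""
    ordered = sorted(rows, key=lambda r: r["date"])
    n_valid = max(1, round(len(ordered) * valid_fraction))
    return ordered[:-n_valid], ordered[-n_valid:]

# Made-up data: one row per day, ISO dates so string order equals time order.
rows = [{"date": f"2019-01-{day:02d}", "sales": day * 10} for day in range(1, 11)]
train, valid = split_by_time(rows)
print(len(train), len(valid))
```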

### Fastai metrics

Metrics are only for print-outs; they do not change the result of training.
```
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
learn.fit(1, 1e-2)
Total time: 00:03
epoch	train_loss	valid_loss	accuracy
1	0.354604	0.378520	0.820000
```

### Check lesson4-tabular.ipynb

## Collaborative filtering

![a](lesson4/lesson4-3.png "")

The collaborative filtering example here is movie ratings. We have users and movies, and each user has rated some number of movies.

The goal is, for a given user with some ratings already in the data, to predict the rating for a movie that the user has not rated yet (has not watched yet). It is like filling in the blanks.


### Sparse Matrix

The example above has very few blanks, but in real life there will probably be a lot of blanks.

A matrix with a lot of blanks is a sparse matrix. If we store users and movies as a full matrix when most cells are empty, we waste space, because the storage needed is O(users × movies) no matter how few ratings exist.

### Cold start problem

When there is little or no data for a user or a movie (for example a brand-new user), it is hard to predict anything. This is called the cold start problem.

How to handle it is really case by case. For example, Netflix asks new users which movies they have watched. By doing so, Netflix gets some information about the new user, which addresses the cold start problem.

![a](lesson4/lesson4-4.png "")

The basic idea for solving collaborative filtering is: for each user we generate 5 random numbers, and we do the same for each movie.

Then, for each rating, we compute the dot product (a matrix multiplication of a row by a column) of that user's 5 numbers and that movie's 5 numbers.
```
H25 =IF(H2="",0,MMULT($B25:$F25,H$19:H$23))
```

Then we use gradient descent to run through all the batches and tweak those numbers so that the predictions get closer to the actual ratings.
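
Here is a small NumPy sketch of that procedure, under some assumptions: the ratings matrix is made up, blanks are encoded as 0 (as in the spreadsheet formula), and plain full-batch gradient descent is used instead of mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up ratings: rows are users, columns are movies, 0 means "not rated".
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
rated = ratings > 0

n_factors = 5
users = rng.normal(0, 0.1, size=(ratings.shape[0], n_factors))   # 5 numbers per user
movies = rng.normal(0, 0.1, size=(n_factors, ratings.shape[1]))  # 5 numbers per movie

def rmse(users, movies):
    """Root mean squared error over the rated cells only."""
    err = (users @ movies - ratings)[rated]
    return float(np.sqrt(np.mean(err ** 2)))

start = rmse(users, movies)
lr = 0.01
for _ in range(5000):
    # Blanks contribute no error, like the IF(H2="",0,...) in the spreadsheet.
    err = np.where(rated, users @ movies - ratings, 0.0)
    grad_users = err @ movies.T
    grad_movies = users.T @ err
    users -= lr * grad_users
    movies -= lr * grad_movies
end = rmse(users, movies)
print(start, end)
```

After training, `users @ movies` also contains values for the blank cells; those are the predictions.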

### Embedding Matrix

This is just the set of 5 numbers per user (one row each) in the example above.
So that example has an embedding matrix of shape [15, 5] for users.
It also has another embedding matrix of shape [15, 5] for movies.

The 5 is your choice; you can have as many dimensions as you like.

### Bias

Just doing a dot product of the user's 5 numbers with the movie's 5 numbers is not enough. What happens if a movie is simply more popular? What happens if a user tends to rate movies higher than average (gives everything 5 stars)?

So we need one extra number per user and per movie, called the bias.

### Limit calculation output

![a](lesson4/lesson4-5.png "")

Although the collaborative filtering model above can already predict a user's rating, we can do better by limiting the model output to the valid range of 0 to 5 stars.

We can do this by applying a sigmoid function at the end of the model.

This spares the model from having to learn the valid range by itself, and gives a much better result.

We actually want `y_range` to go a little higher than 5, because a sigmoid only approaches its upper limit asymptotically, so with a limit of 5 it would never predict 5.

```
learn = collab_learner(data, n_factors=40, y_range=[0,5.5], wd=1e-1)
```
Check out the implementation in the fastai class `EmbeddingDotBias`.
![a](lesson4/lesson4-6.png "")
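
The prediction described above (dot product, plus biases, squashed into `y_range` by a sigmoid) can be sketched for a single user/movie pair as follows. This is a plain-Python illustration of the idea, not the `EmbeddingDotBias` code itself, and `predict_rating` is a made-up name.

```python
import math

def predict_rating(user_vec, movie_vec, user_bias, movie_bias, y_range=(0.0, 5.5)):
    """Dot product plus both biases, squashed into y_range by a sigmoid."""
    raw = sum(u * m for u, m in zip(user_vec, movie_vec)) + user_bias + movie_bias
    lo, hi = y_range
    return lo + (hi - lo) / (1.0 + math.exp(-raw))

# All zeros gives the midpoint of the range.
middle = predict_rating([0.0] * 5, [0.0] * 5, 0.0, 0.0)

# A strong match can now exceed 5, which would be impossible with y_range=(0, 5).
strong = predict_rating([1.0, 1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0, 0.0], 0.5, 0.5)
print(middle, strong)
```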

### How data is stored

The example above uses a matrix to store ratings, where each row is a user and each column is a movie.
But we cannot store data that way when there can be millions of users.

We store it as an array of [userid, movieid, rating].

![a](lesson4/lesson4-7.png "")

We added a new user_index and movie_index to make it easier to think about.

We will still have the embedding matrices, stored as follows:
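
A small sketch of this storage format, with made-up ids. Raw ids are mapped to contiguous indexes (0, 1, 2, ...) so they can later be used as row numbers of an embedding matrix.

```python
# (userid, movieid, rating) triplets; the ids are arbitrary and made up.
ratings = [(14, 417, 4.0), (29, 27, 3.5), (14, 27, 5.0), (72, 417, 1.0)]

# Map each raw id to a contiguous index.
user_index = {uid: i for i, uid in enumerate(sorted({u for u, _, _ in ratings}))}
movie_index = {mid: i for i, mid in enumerate(sorted({m for _, m, _ in ratings}))}

indexed = [(user_index[u], movie_index[m], r) for u, m, r in ratings]
print(indexed)
```

Only the ratings that exist take up space, so the size grows with the number of ratings rather than with users times movies.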
![a](lesson4/lesson4-8.png "")

## One hot encoding

To quickly get the row of embedding numbers (5 numbers in this case) for each user, we can one-hot encode the user index (each one-hot row is a row of the identity matrix) and do a matrix multiplication with the embedding matrix.

![a](lesson4/lesson4-9.png "")

Do the same for movies:

![a](lesson4/lesson4-10.png "")

Now we can do a dot product of the user activations and the movie activations to get a prediction.

![a](lesson4/lesson4-11.png "")

## Embedding

You can quickly see that the numbers in both the user activations and the movie activations are the same as in the embedding matrix. This makes sense, because we are just multiplying it by one-hot rows (rows of an identity matrix).

So we can actually just do an "array lookup": use the user/movie index to pick the row of the embedding matrix and put it in the activations.
```
M3 =OFFSET($B$3,$L3,M$2)
```
![a](lesson4/lesson4-12.png "")

"Embedding" here means an array lookup, which is mathematically identical to multiplying a one-hot encoded matrix by the embedding matrix.
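
A quick NumPy check of that equivalence, using a random stand-in for the 15 x 5 user embedding matrix. The indexes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_factors = 15, 5
user_emb = rng.normal(size=(n_users, n_factors))  # embedding matrix, one row per user

user_idx = np.array([3, 0, 14])                   # user index of three rating rows
one_hot = np.eye(n_users)[user_idx]               # shape (3, 15), a single 1 per row

via_matmul = one_hot @ user_emb                   # one-hot rows times embedding matrix
via_lookup = user_emb[user_idx]                   # plain array lookup

print(np.allclose(via_matmul, via_lookup))
```

The lookup gives the same activations without building the one-hot matrix or multiplying by all the zeros.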

## Latent features/ Latent Factors

![a](lesson4/lesson4-13.png "")

Take a particular user who likes actor A and likes a particular movie B, which has actor A in it. After training, the embeddings (especially the ones highlighted in purple) will have been forced to pick up some features from the data.

For example, the number in the purple box will eventually come to represent something. The model does not know that the user "likes actor A" specifically, but it knows the user likes this feature.

The model will find hidden features, even though we did not specifically program it to find "actor A movies".

## Bias

![a](lesson4/lesson4-13.png "")

So what happens if a user likes actor A, and there is a movie that has A in it, but that movie is really bad?

The previous model would just recommend it anyway.

So we need to add a bias. Now the prediction no longer depends solely on one feature.

### Check lesson4-collab.ipynb

## Principal Component Analysis

It basically takes a higher-dimensional tensor and tries to represent the data in fewer dimensions.
```
movie_pca = movie_w.pca(3)
```
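
For intuition, here is a minimal NumPy sketch of PCA via SVD. It is not the implementation behind `movie_w.pca`. The `pca` function is written for illustration, and `movie_w` is a random stand-in for learned movie weights (1000 movies by 40 factors are assumed sizes).

```python
import numpy as np

def pca(x, k):
    """Project the rows of x onto the top-k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
movie_w = rng.normal(size=(1000, 40))  # stand-in for 1000 movies x 40 factors
movie_pca = pca(movie_w, 3)
print(movie_pca.shape)  # (1000, 3)
```

Each movie is now described by 3 numbers instead of 40, chosen to keep as much of the variation between movies as possible.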