# Andrej Karpathy's /makemore

#### Reference
https://www.youtube.com/watch?v=TCH_1BHY58I

https://github.com/karpathy/makemore

With the bigram character-level model, we only looked at two characters at a time, which gave a 27x27 table. To lower the loss and improve the model within this approach, the only avenue is to add more context, i.e. a 27x27x27 table. However, on this path the amount of data and the number of parameters grow exponentially with every extra character of context.
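To make the blow-up concrete, here is a small sketch that counts the table entries for growing context lengths. The helper name `table_entries` is made up for illustration, and the vocabulary size of 27 is taken from the 27x27 table above.

```python
def table_entries(context_len, vocab_size=27):
    """Entries in a count table: one per (context, next character) combination."""
    return vocab_size ** (context_len + 1)


for context_len in (1, 2, 3, 4):
    print(f"{context_len} char(s) of context: {table_entries(context_len):,} entries")

# 1 char(s) of context: 729 entries
# 2 char(s) of context: 19,683 entries
# 3 char(s) of context: 531,441 entries
# 4 char(s) of context: 14,348,907 entries
```

Each extra character of context multiplies the table by 27, so most entries would have few or no training examples behind them.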

Therefore, we need to explore a better model.

## Multi Layer Perceptron (MLP)
#### Bengio et al. 2003

Following this paper, we build another character-level model that predicts the next character. The paper itself works with words, so the description below is in terms of words. The proposed approach is to take a vocabulary of 'w' words and associate each word with an 'm'-dimensional feature vector. Meaning that each word is embedded in an 'm'-dimensional feature space. Initially these embeddings are initialized randomly, but later we'll tune them using backpropagation.
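A minimal sketch of such an embedding table, assuming NumPy and small made-up sizes for 'w' and 'm'. The name `C` for the table is an assumption chosen for illustration. The sketch also shows that picking a row of the table gives the same result as multiplying a one-hot vector by the table, which is why the embedding can be seen as the first layer of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

w, m = 10, 4  # hypothetical sizes: 10 words, 4-dimensional embeddings
C = rng.normal(size=(w, m))  # embedding table, randomly initialized

word_index = 7
embedding = C[word_index]  # look up the embedding by row index

# Equivalent view: a one-hot vector times the table selects the same row.
one_hot = np.zeros(w)
one_hot[word_index] = 1.0
same_embedding = one_hot @ C

print(embedding.shape)  # (4,)
print(np.allclose(embedding, same_embedding))  # True
```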

To picture this approach, think about words that are similar or synonymous. After training, they will end up in the same part of the space, while words that are different will end up far apart.

The modeling approach is similar to the NN approach for the bigram model. They use a multi-layer NN to predict the next word, given the previous words. To train the NN, they ```maximize the log-likelihood of the training data```.
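Written out for a context of 3 previous words, the objective looks roughly like the following. The symbols are assumptions introduced here for illustration: $x_t$ is the word at position $t$ of the training text, $T$ is the number of training positions, and $\theta$ stands for all parameters (embeddings, weights and biases). Any regularization term is left out.

$$
L(\theta) = \frac{1}{T} \sum_{t=1}^{T} \log P\left(x_t \mid x_{t-3}, x_{t-2}, x_{t-1}; \theta\right)
$$

Maximizing $L(\theta)$ is the same as minimizing the average negative log-likelihood $-L(\theta)$, which is the loss used during training.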

Let's look at an ```example``` for this approach. Assume the sentence "A dog was running in a room" never appeared in the training data. Now, to test the model, we give it "A dog was running in a ..." and expect it to fill in the blank. Since it hasn't seen this exact sentence, we call it ```out of distribution```. However, the MLP doesn't need to have seen these exact words to predict 'room' for the blank. It might have seen "The dog was running in a room" and, based on what it learned, placed the embeddings of 'The' and 'A' near each other in the space. So when we ask it to fill the blank in "A dog was running in a ...", it can match it up with "The dog was running in a room". This is called ```knowledge transfer```.

Let's look at the ```architecture``` of this approach. 

Assume the NN takes the 3 previous words as input, and the output is the fourth word. Each of the incoming words goes through a look-up table, which returns the corresponding embedding (an 'm'-dimensional feature vector) for that word. So there will be $3 \times m$ neurons holding the 3 words.
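The input layer can be sketched as follows, assuming NumPy and made-up sizes. The names `C`, `X` and `x` are assumptions for illustration, and the word indices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

w, m, context = 10, 4, 3  # hypothetical sizes
C = rng.normal(size=(w, m))  # look-up table of embeddings

# Two made-up examples, each holding the indices of 3 previous words.
X = np.array([[1, 5, 2],
              [5, 2, 9]])

emb = C[X]  # shape (2, 3, m): one embedding per word in each context
x = emb.reshape(len(X), context * m)  # shape (2, 3 * m): embeddings side by side

print(emb.shape)  # (2, 3, 4)
print(x.shape)    # (2, 12)
```

Note that all 3 positions share the same table `C`, so a word gets the same embedding no matter where it appears in the context.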

Then we need to build a hidden layer. Its size is a ```hyper-parameter```, meaning that we need to find the right size by trial and error. All the input neurons feed into the hidden layer, and a ```tanh``` function is applied for non-linearity.
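A sketch of the hidden layer on its own, assuming NumPy. The hidden size of 8, the names `W1` and `b1`, and the 0.1 scaling of the initial weights are all assumptions for illustration. The input `x` is a random stand-in for the concatenated embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

m, context, hidden = 4, 3, 8  # hidden is the hyper-parameter to tune
x = rng.normal(size=(2, context * m))  # stand-in for 2 concatenated inputs

W1 = rng.normal(size=(context * m, hidden)) * 0.1  # fully connected weights
b1 = np.zeros(hidden)                              # one bias per hidden neuron

h = np.tanh(x @ W1 + b1)  # tanh squashes every activation into [-1, 1]

print(h.shape)  # (2, 8)
```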

The output layer is a huge one, because the number of neurons is equal to $w$, the number of words in our vocabulary. All the neurons in the hidden layer are connected to the output neurons. That's why there will be lots of params in between these two layers, and therefore it's going to be computationally expensive. On top of the output layer we have a ```softmax```, which exponentiates the logits and then normalizes them so that they sum up to 1. This way, we'll get a nice probability distribution for the next word in the sequence.
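A sketch of the output layer and softmax, assuming NumPy and made-up sizes. The names `W2`, `b2` and `softmax` are assumptions for illustration. Subtracting the row maximum before exponentiating is an extra step added here for numerical stability; it does not change the resulting probabilities.

```python
import numpy as np


def softmax(logits):
    """Exponentiate the logits and normalize each row so it sums to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # stability only
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)

hidden, w = 8, 10  # hypothetical sizes
h = np.tanh(rng.normal(size=(2, hidden)))  # stand-in for hidden activations

W2 = rng.normal(size=(hidden, w)) * 0.1  # every hidden neuron to every output
b2 = np.zeros(w)

probs = softmax(h @ W2 + b2)  # one distribution over w words per example
output_params = W2.size + b2.size

print(probs.shape)        # (2, 10)
print(output_params)      # 90
```

The parameter count of this layer is `hidden * w + w`, so it grows linearly with the vocabulary size, which is why this layer dominates the cost for large vocabularies.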

During training, because we have xs and ys, we get the probability the model assigns to the correct y for each x, and minimize the NN's loss by adjusting the parameters. The gradients for this optimization are again computed with ```backpropagation```.
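Putting the pieces together, here is a minimal training sketch under several assumptions: it uses NumPy with hand-written gradients rather than an autograd library, plain full-batch gradient descent, randomly generated xs and ys, and made-up sizes and learning rate. All variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w, m, context, hidden, n = 10, 4, 3, 8, 32  # hypothetical small sizes

# Made-up training data: n contexts of 3 word indices (xs) and next words (ys).
X = rng.integers(0, w, size=(n, context))
Y = rng.integers(0, w, size=n)

# Parameters: embedding table, hidden layer, output layer.
C = rng.normal(size=(w, m))
W1 = rng.normal(size=(context * m, hidden)) * 0.1
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, w)) * 0.1
b2 = np.zeros(w)

learning_rate = 0.1
losses = []
for step in range(100):
    # Forward pass: look-up, hidden tanh layer, softmax over the vocabulary.
    x = C[X].reshape(n, context * m)
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Loss: average negative log-probability of the correct next word.
    loss = -np.log(probs[np.arange(n), Y]).mean()
    losses.append(loss)

    # Backward pass: the chain rule applied layer by layer.
    dlogits = probs.copy()
    dlogits[np.arange(n), Y] -= 1.0
    dlogits /= n
    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ W2.T
    dpre = dh * (1.0 - h ** 2)  # derivative of tanh
    dW1 = x.T @ dpre
    db1 = dpre.sum(axis=0)
    dx = dpre @ W1.T
    dC = np.zeros_like(C)
    np.add.at(dC, X, dx.reshape(n, context, m))  # sum grads of repeated words

    # Gradient descent: nudge every parameter against its gradient.
    C -= learning_rate * dC
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the data here is random, the model can only memorize it, so the falling loss only shows that the gradients and updates are wired correctly. The embedding table `C` is updated together with the weights, which is how the embeddings get tuned during training.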