<a href="https://colab.research.google.com/github/Mrinal18/Deep_Learning_Catalyst/blob/main/Deep_Learning_Course_Chapter_6_Sequence_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Course Overview:

1. Introduction to NLP
2. Word processing and bag of words
3. Introduction to Recurrent Neural Networks
4. Introduction to Transformer models and BERT
5. Introduction to the Hugging Face API


HW:
1. Read a research paper and produce a summary plus pseudocode of the paper.
2. Build an LSTM for character-level prediction.
3. Build a machine translation model using transformers and Hugging Face.

Note: The homework needs to be done using Catalyst.

Natural Language Processing
===========================================

Text processing is one of the most common tasks in ML applications. Below are some examples of such applications.

* Language Translation: Translating a sentence from one language to another.
* Sentiment Analysis: Determining, from a text corpus, whether the sentiment towards a topic, product, etc. is positive, negative, or neutral.
* Spam Filtering: Detecting unsolicited and unwanted emails or messages.



These applications process huge amounts of text to perform classification or translation, which involves a lot of work on the back end. Transforming text into something an algorithm can digest is a complicated process.


Introduction to NLP
===========================================
Data comes in many different forms: time stamps, sensor readings, images, categorical labels, and so much more. But text is still some of the most valuable data out there for those who know how to use it.

In this course about Natural Language Processing (NLP), you will use spaCy, a leading NLP library, to take on some of the most important tasks in working with text.

By the end, you will be able to use NLP for:

* Basic text processing and pattern matching
* Building machine learning models with text
* Representing text with word embeddings that numerically capture the meaning of words and documents

To get the most out of this course, you'll need some experience with machine learning. If you don't have experience with scikit-learn, check out Intro to Machine Learning and Intermediate Machine Learning to learn the fundamentals.



Word Processing and Bag of Words
===========================================

Word embeddings are dense vectors of real numbers, one per word in your vocabulary. In NLP, it is almost always the case that your features are words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what the word *is*; it doesn't say much about what it *means* (you might be able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you combine these representations? We often want dense outputs from our neural networks: the inputs are $|V|$-dimensional, where
$V$ is our vocabulary, but the outputs often have only a few dimensions (if we are only predicting a handful of labels, for instance). How do we get from a massive dimensional space to a smaller
dimensional space?

How about instead of ASCII representations, we use a one-hot encoding? That is, we represent the word $w$ by
$$\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}$$

where the 1 is in a location unique to $w$. Any other word will have a 1 in some other location, and a 0 everywhere else.
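
As a rough illustration, a one-hot encoding can be sketched in a few lines of plain Python. The toy vocabulary and the helper names `one_hot` and `dot` are assumptions made up for this example; a real vocabulary would be far larger.

```python
# Hypothetical toy vocabulary, for illustration only.
vocab = ["the", "mathematician", "physicist", "ran", "store"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-element list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


math_vec = one_hot("mathematician")
phys_vec = one_hot("physicist")

print(math_vec)                 # [0, 1, 0, 0, 0]
print(dot(math_vec, phys_vec))  # 0
```

Two different words never share the position of their 1, so their dot product is always 0.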

There is an enormous drawback to this representation, besides just how huge it is. It basically treats all words as independent entities with no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
* We have seen mathematician in the same role that physicist plays in this new unseen sentence.

and then infer that physicist is actually a good fit in the new unseen sentence? This is what we mean by a notion of similarity: we mean *semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each other semantically. This is called the [distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).


## Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to run" semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector, like this:

$$\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run}, \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}$$

$$\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},\overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}$$

Then we can get a measure of similarity between these words by doing:

$$\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}$$

Although it is more common to normalize by the lengths:

$$\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}{\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}$$

Where $\phi$ is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) will have similarity 1. Extremely dissimilar words should have similarity -1.
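
To make the formula concrete, here is a small sketch that computes the cosine similarity of the two hand-crafted vectors. It uses only the three attribute values written out above and ignores the elided entries, which is an assumption made so the example can run; the function name `cosine_similarity` is also chosen just for this example.

```python
import math

# Only the three attribute values shown above; the elided "..." entries are left out.
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim = cosine_similarity(q_physicist, q_mathematician)
print(sim)
```

With these three attributes the similarity comes out at roughly 0.44: the words agree on "can run" and "likes coffee" but disagree on "majored in Physics".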

You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where each word basically has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some *latent semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to us.


In summary, **word embeddings are a representation of the *semantics* of a word, efficiently encoding semantic information that might be relevant to the task at hand**. You can embed other things too: part of speech tags, parse trees, anything! The idea of feature embeddings is central to the field.
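
In PyTorch, learnable embeddings of this kind are typically stored in an `nn.Embedding` layer, which is a lookup table from word indices to vectors. The sketch below assumes a hypothetical two-word vocabulary and an arbitrary embedding size of 5.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-word vocabulary; each word is identified by an integer index.
word_to_ix = {"mathematician": 0, "physicist": 1}

# One 5-dimensional vector per word, initialised randomly.
embeds = nn.Embedding(num_embeddings=len(word_to_ix), embedding_dim=5)

lookup = torch.tensor([word_to_ix["physicist"]], dtype=torch.long)
physicist_vec = embeds(lookup)

print(physicist_vec.shape)          # torch.Size([1, 5])
print(embeds.weight.requires_grad)  # True
```

Because `embeds.weight` is a model parameter, the optimizer updates the vectors during training like any other weight.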


## An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words $w$, we want to compute

$$P(w_i \mid w_{i-1}, w_{i-2}, \dots, w_{i-n+1})$$

where $w_i$ is the $i$th word of the sequence.

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.
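
A minimal sketch of such a model is given here, assuming a context of two previous words and a toy training sentence built from the examples above. The class name `NGramLanguageModeler`, the layer sizes, and the learning rate are illustrative choices, not prescribed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
CONTEXT_SIZE = 2    # number of previous words used as context (assumed)
EMBEDDING_DIM = 10  # size of each word vector (assumed)

# Hypothetical toy corpus.
sentence = ("the mathematician ran to the store and "
            "the physicist ran to the store").split()
vocab = sorted(set(sentence))
word_to_ix = {word: i for i, word in enumerate(vocab)}

# (context words, target word) training pairs.
ngrams = [(sentence[i - CONTEXT_SIZE:i], sentence[i])
          for i in range(CONTEXT_SIZE, len(sentence))]


class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # Concatenate the context word embeddings into one row vector.
        embeds = self.embeddings(inputs).view(1, -1)
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)


model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(10):
    total_loss = 0.0
    for context, target in ngrams:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long)

        model.zero_grad()                            # clear old gradients
        log_probs = model(context_idxs)              # forward pass
        loss = loss_function(log_probs, target_idx)  # negative log-likelihood
        loss.backward()                              # backpropagation
        optimizer.step()                             # parameter update
        total_loss += loss.item()
    losses.append(total_loss)

print(losses[0], losses[-1])  # the total loss should decrease over the epochs
```

The embedding table is trained together with the two linear layers, so the word vectors are learned as a side effect of predicting the next word.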

# Introduction to spaCy

Before we use deep learning for sequence modeling, we will first look at how to work with tokens and word embeddings.

## NLTK and spaCy
source: https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/

NLTK and spaCy are both tools for building natural language pipelines, and each has its own advantages and disadvantages.

To install spaCy on your machine, please refer to https://spacy.io/usage




In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm



In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
    

## Natural Language Models

A neural language model takes the preceding text as input and maps this context onto a vector that represents the next word. The model also holds a large word embedding matrix, which contains a vector for every word the model can output. We compute the similarity between the context vector and each word vector with a dot product, turn the resulting scores into a probability distribution over the next word with a softmax, and train the model by maximum likelihood. A key detail is that in practice we usually don't deal with words directly, but with sub-words or characters.


$$p(x_n \mid x_0, \dots, x_{n-1}) = \text{softmax}\left(E \, f(x_0, \dots, x_{n-1})\right)$$

Here $f$ is the network that maps the context to a vector and $E$ is the word embedding matrix.

![title](figures/fig1.jpeg)
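
The output step of such a model can be sketched as follows. The sizes, the random stand-in for the context vector, and the target index are all assumptions made so the example can run; in a real model the context vector would come from the network.

```python
import torch

torch.manual_seed(0)
vocab_size, hidden_dim = 8, 16  # hypothetical sizes

# Word embedding matrix E: one row (vector) per word the model can output.
E = torch.randn(vocab_size, hidden_dim, requires_grad=True)

# Random stand-in for the context vector produced by the network.
context_vector = torch.randn(hidden_dim)

scores = E @ context_vector           # dot product with every word vector
probs = torch.softmax(scores, dim=0)  # distribution over the next word

# Maximum likelihood training minimises the negative log-probability
# of the word that actually came next (index chosen arbitrarily here).
target = 3
loss = -torch.log(probs[target])
loss.backward()

print(probs.sum().item())  # approximately 1.0
print(E.grad.shape)        # torch.Size([8, 16])
```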



## Convolutional Language Models
The first neural language model.

* Embed each word as a vector via a lookup into the embedding matrix, so a word gets the same vector no matter what context it appears in.
* Apply the same feed-forward network at each time step.
* Unfortunately, the fixed-length history means the model can only condition on a bounded context.
* These models do have the upside of being very fast.

![title](figures/cnn_language_model.jpeg)
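
One possible way to sketch this idea in PyTorch is a 1D convolution over the embedded words, padded on the left so that each position only sees the current word and the few words before it. The class name `ConvLanguageModel`, the single convolution layer, and all sizes are assumptions for illustration; the model in the figure may be structured differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, kernel_size):
        super().__init__()
        self.kernel_size = kernel_size
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The same filter weights are applied at every time step.
        self.conv = nn.Conv1d(embedding_dim, embedding_dim, kernel_size)
        self.out = nn.Linear(embedding_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) -> embeddings: (batch, embedding_dim, seq_len)
        x = self.embeddings(tokens).transpose(1, 2)
        # Left-pad so position t only sees words t - kernel_size + 1 .. t.
        x = F.pad(x, (self.kernel_size - 1, 0))
        h = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, seq_len, embedding_dim)
        # Log-probabilities for the next word at every position.
        return F.log_softmax(self.out(h), dim=-1)


torch.manual_seed(0)
model = ConvLanguageModel(vocab_size=10, embedding_dim=8, kernel_size=3)
tokens = torch.randint(0, 10, (1, 6))  # one hypothetical sequence of 6 word indices
log_probs = model(tokens)
print(log_probs.shape)  # torch.Size([1, 6, 10])
```

With `kernel_size=3`, the prediction at each position depends on at most three words, which is the bounded context described above. All positions are computed in one pass, which is why these models are fast.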


## Recurrent Language Models
### The most popular approach until a couple of years ago.

Conceptually straightforward: at every time step we maintain some state (received from the previous time step), which represents what we've read so far. This state is combined with the current word being read to produce the state passed on to the next step. We repeat this process for as many time steps as we need.

Uses unbounded context: in principle, the title of a book could affect the hidden state at the last word of the book.

Disadvantages:
* The whole history of the document is compressed into a fixed-size vector at each time step, which is the bottleneck of this model.
* Gradients tend to vanish with long contexts.
* Not possible to parallelize over time steps, so training is slow.
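
The recurrence can be sketched with PyTorch's built-in `nn.RNNCell`. The sizes and the random inputs standing in for embedded words are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_dim, hidden_size, seq_len = 4, 6, 5  # hypothetical sizes

cell = nn.RNNCell(embedding_dim, hidden_size)

# Random stand-in for a sequence of embedded words: (seq_len, batch, embedding_dim).
inputs = torch.randn(seq_len, 1, embedding_dim)

# The state starts empty and is updated once per word.
hidden = torch.zeros(1, hidden_size)
for x_t in inputs:
    # Each step needs the previous hidden state, so the loop cannot run in parallel.
    hidden = cell(x_t, hidden)

print(hidden.shape)  # torch.Size([1, 6])
```

However long the sequence is, everything read so far has to fit into the single `hidden` vector, which is the bottleneck listed above.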

In [None]:
# Install Catalyst
!pip install -U catalyst

Text preprocessing is a crucial part of NLP.

In comparison, an image is usually just reshaped and normalized in a preprocessing pipeline. Text is different. A text consists of words (or tokens), each of which occurs with a different probability. Words are arrays of characters, and different arrays can correspond to the same word: for example, "it" and "It", or the Russian "Имя" and "Имени", are the same word in different forms.

That's why texts should be normalized and tokenized.
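
A very small sketch of normalization and tokenization using only the Python standard library follows. The helper names `normalize` and `tokenize`, the choice of lowercasing plus Unicode NFKC normalization, and the regex-based splitting are assumptions for illustration; real pipelines usually rely on a library tokenizer.

```python
import re
import unicodedata


def normalize(text):
    """Unify Unicode representations and lowercase the text."""
    return unicodedata.normalize("NFKC", text).lower()


def tokenize(text):
    """Split normalized text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", normalize(text))


tokens = tokenize("It is raining. Is it?")
print(tokens)  # ['it', 'is', 'raining', 'is', 'it']
```

Lowercasing maps "It" and "it" to the same token. Merging inflected forms such as "Имя" and "Имени" needs an extra step, such as lemmatization, which this sketch does not cover.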

### Preparing the Data


In [None]:
#preparing the data

In [None]:
#Implementing RNN from scratch. 

"""
We will be building and training a basic character-level RNN to classify words. This tutorial, along with the following two, shows how to preprocess data for NLP modeling “from scratch”, in particular without using many of the convenience functions of torchtext, so you can see how preprocessing for NLP modeling works at a low level.

A character-level RNN reads words as a series of characters - outputting a prediction and “hidden state” at each step, feeding its previous hidden state into each next step. We take the final prediction to be the output, i.e. which class the word belongs to.

Specifically, we’ll train on a few thousand surnames from 18 languages of origin, and predict which language a name is from based on the spelling:

python predict.py Hinton
(-0.47) Scottish
(-1.52) English
(-3.57) Irish

python predict.py Schmidhuber
(-0.19) German
(-2.48) Czech
(-2.68) Dutch
"""


#Implementing RNN from scratch using catalyst


import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


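class RNN(nn.Module):
    """Minimal recurrent cell: one step maps (input, hidden) to (output, new hidden)."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Both layers read the current input concatenated with the previous hidden state.
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)  # next hidden state
        self.i2o = nn.Linear(input_size + hidden_size, output_size)  # class scores
        self.softmax = nn.LogSoftmax(dim=1)  # log-probabilities over the classes

    def forward(self, input, hidden):
        # input: (1, input_size), hidden: (1, hidden_size)
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        # Zero hidden state used before the first character is read.
        return torch.zeros(1, self.hidden_size)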
