One of the most fascinating advancements in machine learning is the ability to teach a machine to understand human communication. This branch of machine learning is called Natural Language Processing.

This notebook attempts to explain the basics of Natural Language Processing and how rapid progress has been made in it with the advancement of deep learning and neural networks.

Before we dive in, it is necessary to understand the basics.

# What is Language?

A language is, at its simplest, a fixed vocabulary of words shared by a community of humans to express and communicate their thoughts.

This vocabulary is taught to humans as part of growing up, and mostly remains fixed, with a few additions each year.

Elaborate resources such as dictionaries are maintained so that a person who comes across a new word can look up its meaning. Once the person has been exposed to the word, it is added to his or her vocabulary and can be used in further communication.

# How do computers understand language?

A computer is a machine working under mathematical rules. It lacks the complex interpretations and understandings which humans can do with ease, but can perform a complex calculation in seconds.

For a computer to work with any concept, there must be a way to express that concept as a mathematical model.

This constraint severely limits the scope and the areas of natural language a computer can work with. So far, machines have been most successful at classification and translation tasks.

Classification assigns a piece of text to a category, and translation converts that piece of text into another language.

# What is Natural Language Processing?

Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.

The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.

# Basic Transformations

As mentioned earlier, for a machine to make sense of natural language (the language used by humans), the text needs to be converted into some sort of mathematical framework which can be modeled. Below are some of the most commonly used techniques that help us achieve that.

# Tokenization, Stemming and Lemmatization

### Tokenization

Tokenization is the process of breaking a text down into words. Tokenization can split on any character; however, the most common approach is to split on the space character.
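As a minimal sketch of the idea, the snippet below tokenizes a sentence in two ways: by splitting on whitespace, and by keeping only runs of letters, digits and apostrophes. The function names and the regular expression are illustrative choices, not part of any particular NLP library.

```python
import re

def tokenize_on_space(text):
    """Split text on runs of whitespace (the simplest, most common tokenizer)."""
    return text.split()

def tokenize_words(text):
    """Keep only runs of letters, digits and apostrophes, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)

sentence = "Natural Language Processing is essential to Computer Science."
space_tokens = tokenize_on_space(sentence)
word_tokens = tokenize_words(sentence)

print(space_tokens)  # the last token keeps the trailing full stop: 'Science.'
print(word_tokens)   # the last token is 'Science'
```

Splitting on spaces leaves punctuation attached to words, which is why practical tokenizers usually apply some extra rule such as the second function.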

### Stemming

Stemming is a crude way of chopping off the end of a word to get its base form, and often includes the removal of derivational affixes. A derivational affix is an affix by means of which one word is formed (derived) from another. The derived word is often of a different word class from the original.

### Lemmatization 

Lemmatization performs a vocabulary and morphological analysis of the word and normally aims to remove inflectional endings only. An inflectional ending is a group of letters added to the end of a word to change its grammatical form, for example to mark the plural. One such ending is -s, e.g. bat, bats.
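The difference between the two can be illustrated with a toy sketch. The suffix list and the lookup table below are hypothetical and deliberately tiny; real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Hypothetical, deliberately tiny rule set for the toy stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real vocabulary analysis.
LEMMA_TABLE = {"bats": "bat", "studies": "study", "running": "run"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMA_TABLE.get(word, word)

words = ["bats", "studies", "running"]
stems = [crude_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['bat', 'studi', 'runn']
print(lemmas)  # ['bat', 'study', 'run']
```

The stemmer produces fragments such as "studi" that are not real words, whereas the lemmatizer returns valid dictionary forms at the cost of needing a vocabulary.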

### N-Grams

N-grams refer to the process of combining the nearby words together for representation purposes where N represents the number of words to be combined together.

Original sentence: "Natural Language Processing is essential to Computer Science"

A 1-gram or unigram model will tokenize the sentence into single words, and thus the output will be "Natural, Language, Processing, is, essential, to, Computer, Science".

A bigram model, on the other hand, will tokenize it into combinations of 2 words each, and the output will be "Natural Language, Language Processing, Processing is, is essential, essential to, to Computer, Computer Science".

Similarly, a trigram model will break it into "Natural Language Processing, Language Processing is, Processing is essential, is essential to, essential to Computer, to Computer Science", and an n-gram model will thus tokenize a sentence into combinations of n words.

Note: Breaking natural language down into n-grams is essential for maintaining counts of the words occurring in sentences, which forms the backbone of the traditional mathematical processes used in Natural Language Processing.
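The unigram, bigram and trigram examples above can be reproduced with a short sketch. The helper name `ngrams` is an illustrative choice, and the sentence is tokenized on spaces for simplicity.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Natural Language Processing is essential to Computer Science".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[0])    # Natural Language
print(trigrams[-1])  # to Computer Science
print(len(unigrams), len(bigrams), len(trigrams))  # 8 7 6
```

A sentence of 8 words yields 8 - n + 1 n-grams, which matches the 7 bigrams and 6 trigrams listed above.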

### Transformation Methods

One of the most common methods of weighting words in a bag of words representation is TF-IDF.

TF-IDF is a way of scoring the vocabulary so that each word receives a weight in proportion to the impact it has on the meaning of a sentence. The score is the product of 2 independent scores, term frequency (TF) and inverse document frequency (IDF).

![image.png](attachment:image.png)

### Term Frequency (TF)

Term frequency is defined as the frequency of the word in the current document.

### Inverse Document Frequency (IDF)

Inverse document frequency is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is calculated as log(N/d), where N is the total number of documents and d is the number of documents in which the word appears.
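A minimal sketch of this scoring, assuming TF is the raw count of the word in the document and IDF is log(N/d) exactly as defined above. The three-sentence corpus and the function name are made up for illustration; library implementations usually add smoothing and normalization on top of this.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document, with score = tf * log(N / d)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # d: number of documents in which each word appears at least once
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # tf: raw count in the current document
        scores.append({
            word: tf * math.log(n_docs / doc_freq[word])
            for word, tf in term_freq.items()
        })
    return scores

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
scores = tf_idf(docs)

print(scores[0]["the"])  # 0.0, because "the" appears in every document
print(scores[0]["mat"] > scores[0]["cat"])  # True: "mat" is rarer than "cat"
```

A word that occurs in every document gets an IDF of log(1) = 0, so very common words are pushed out of the representation regardless of how often they occur.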

# One-Hot Encodings

One-hot encodings are another way of representing words in numeric form. Each word is represented by a vector whose length equals the size of the vocabulary, and each observation is represented by a matrix with one row per vocabulary word and one column per word in the observation. A cell holds 1 where the vocabulary word of that row occurs at that position in the observation and 0 everywhere else.

![image.png](attachment:image.png)
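The encoding can be sketched in a few lines of plain Python. The vocabulary and the observation below are made up, and the sketch assumes every word of the observation is present in the vocabulary.

```python
def one_hot_matrix(tokens, vocabulary):
    """Rows follow the vocabulary, columns follow the positions in the observation."""
    row_of = {word: row for row, word in enumerate(vocabulary)}
    matrix = [[0] * len(tokens) for _ in vocabulary]
    for col, token in enumerate(tokens):
        matrix[row_of[token]][col] = 1  # assumes the token is in the vocabulary
    return matrix

vocabulary = ["computer", "is", "language", "natural", "processing"]
observation = "natural language processing".split()
matrix = one_hot_matrix(observation, vocabulary)

for word, row in zip(vocabulary, matrix):
    print(f"{word:<11}{row}")
```

Every column contains exactly one 1, and the rows for vocabulary words that do not occur in the observation stay all zero.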

# Bag of Words

For an algorithm to derive relationships amongst the text data it needs to be represented in a clear structured format.

Bag of words is a way to represent the data in a tabular format with columns representing the total vocabulary of the corpus and each row representing a single observation. The cell (intersection of the row and column) represents the count of the word represented by the column in that particular observation.

It lets a machine represent a sentence in the easy-to-interpret paradigm of matrices, and thus enables various linear algebraic operations and other algorithms to be applied to the data to build predictive models.
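The table described above can be sketched as follows: one column per vocabulary word, one row per observation, and word counts in the cells. The two-sentence corpus and the function name are illustrative only.

```python
from collections import Counter

def bag_of_words(documents):
    """Return the sorted vocabulary and one row of word counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

corpus = ["the cat sat on the mat", "the dog sat"]
vocabulary, rows = bag_of_words(corpus)

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(rows[0])     # [1, 0, 1, 1, 1, 2]
print(rows[1])     # [0, 1, 0, 0, 1, 1]
```

Note that the rows record only how often each word occurs, not where it occurs in the sentence.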

However, there are two major drawbacks of this representation:

1. It disregards the order/grammar of the text and thus loses the context in which a word is being used.

2. The matrix generated by this representation is highly sparse and biased towards the most common words. Think about it: the algorithms primarily work on word counts, whereas in language the importance of a word tends to be inversely related to its frequency of occurrence. Words with higher frequency are more general words like the, is, an, which do not alter the meaning of a sentence significantly. Thus it becomes important to weigh the words appropriately so as to reflect their actual impact on the meaning of a sentence.

# Recurrent Neural Networks (RNN)

Recurrent Neural Networks, or RNNs for short, are a very important variant of neural networks heavily used in Natural Language Processing.

Conceptually they differ from a standard neural network in that the input to an RNN at each step is a single word rather than the entire sample. This gives the network the flexibility to work with sentences of varying lengths, something which cannot be achieved in a standard neural network due to its fixed structure. It also provides the additional advantage of sharing features learned across different positions of the text, which cannot be obtained in a standard neural network.

An RNN treats each word of a sentence as a separate input occurring at time ‘t’ and, in addition to the input at time ‘t’, also uses the activation value from time ‘t-1’ as an input.
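The recurrence can be sketched in plain Python. This is a minimal illustration, not a trainable model: the weights are hypothetical fixed numbers, tanh is assumed as the activation function, and each word is assumed to be already encoded as a vector of 3 features.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One time step: h_t = tanh(W_x @ x_t + W_h @ h_prev + b)."""
    h_t = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W_x[j][k] * x_t[k] for k in range(len(x_t)))
        total += sum(W_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def rnn_forward(sequence, W_x, W_h, b):
    """Feed the inputs one at a time, carrying the activation forward."""
    h = [0.0] * len(b)  # activation before the first word
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states

# Hypothetical weights: 2 hidden units, 3 input features per word.
W_x = [[0.1, 0.2, -0.1], [0.0, -0.3, 0.2]]
W_h = [[0.5, -0.2], [0.1, 0.4]]
b = [0.0, 0.1]

short = rnn_forward([[1, 0, 0], [0, 1, 0]], W_x, W_h, b)
longer = rnn_forward([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], W_x, W_h, b)

print(len(short), len(longer))  # 2 4
```

The same three weight arrays handle a sentence of 2 words and a sentence of 4 words, which is the flexibility and feature sharing described above.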

# Limitations of RNN

For all its usefulness, the RNN does have certain limitations, the major ones being:

1. The RNN architecture described above captures dependencies in only one direction of the language. In the case of Natural Language Processing, it assumes that a word coming later has no effect on the meaning of a word coming before it. From our experience of languages we know that this is certainly not true.

2. RNNs are also not very good at capturing long-term dependencies, and the problem of vanishing gradients resurfaces in RNNs.

Both these limitations gave rise to new types of RNN architectures, such as the LSTM, GRU and bidirectional variants covered later in this notebook.

A typical RNN looks like this:
    
![image.png](attachment:image.png)

![image.png](attachment:image.png)

Here h is the hidden layer, x is the input, t-1, t and t+1 index the positions in the sequence, w is the weight, y is the output, and f is the activation function.

With this picture it is easier to visualize how such a network processes sequence data, for example the trend of stock prices when predicting the price for the day. Here, every prediction at time t (h_t) depends on all previous predictions and the information learned from them. Fairly straightforward, right?

RNNs can solve our purpose of sequence handling to a great extent but not entirely.

Text is another good example of sequence data. Being able to predict what word or phrase comes after a given text could be a very useful asset. 

Now, RNNs are great when it comes to context that is short or small in nature. But in order to be able to build a story and remember it, our models should be able to understand the context behind the sequences, just like a human brain.

# Understanding of LSTM Networks

Long Short-Term Memory (LSTM) is an advanced version of the recurrent neural network (RNN) architecture that was designed to model chronological sequences and their long-range dependencies more precisely than conventional RNNs.

LSTM networks are an extension of RNNs, introduced mainly to handle situations where RNNs fail:

1. An RNN fails to store information for a longer period of time. At times, a reference to information seen quite a long time ago is required to predict the current output, but plain RNNs handle such “long-term dependencies” poorly.

2. There is no finer control over which part of the context needs to be carried forward and how much of the past needs to be ‘forgotten’.

3. RNNs also suffer from exploding and vanishing gradients, which occur while the network is trained through backpropagation.

In an LSTM, information is retained by the cells and the memory manipulations are done by the gates. There are three gates, which are explained below:

### Forget Gate

The information that is no longer useful in the cell state is removed with the forget gate.

### Input gate

The addition of useful information to the cell state is done by the input gate.

### Output gate

The task of extracting useful information from the current cell state to be presented as output is done by the output gate.
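How the three gates act on the cell state can be sketched with the standard LSTM update equations. This is a minimal illustration in plain Python, not the Keras implementation: the parameter names are hypothetical, and all weights are set to zero so that the result is easy to check by hand (every gate then outputs 0.5 and the candidate is 0).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, bias, vector, activation):
    """A small dense layer: activation(W @ vector + bias)."""
    return [activation(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, p):
    z = h_prev + x_t                                  # concatenate hidden state and input
    f = dense(p["W_f"], p["b_f"], z, sigmoid)         # forget gate: what to drop
    i = dense(p["W_i"], p["b_i"], z, sigmoid)         # input gate: what to add
    g = dense(p["W_g"], p["b_g"], z, math.tanh)       # candidate values to add
    o = dense(p["W_o"], p["b_o"], z, sigmoid)         # output gate: what to expose
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t

units, features = 2, 3
zero_weights = [[0.0] * (units + features) for _ in range(units)]
params = {name: zero_weights for name in ("W_f", "W_i", "W_g", "W_o")}
params.update({name: [0.0] * units for name in ("b_f", "b_i", "b_g", "b_o")})

h_t, c_t = lstm_step([1.0, 0.0, 0.0], [0.0, 0.0], [1.0, -1.0], params)
print(c_t)  # [0.5, -0.5]: the forget gate kept half of the old cell state
```

With trained weights the gates would output different values per unit and per time step, which is what gives the LSTM its fine control over what is remembered and what is forgotten.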

### GRU v/s LSTM

GRUs are quite similar to LSTMs, yet they have never been as popular. But what are GRUs? GRU stands for Gated Recurrent Unit. As the name suggests, these recurrent units also have a gating mechanism to effectively and adaptively capture dependencies at different time scales. They have an update gate and a reset gate. The former is responsible for selecting which pieces of knowledge are carried forward, whereas the latter sits between two successive recurrent units and decides how much information needs to be forgotten.

# FINDING NUMBER OF PARAMETERS IN A NEURAL NETWORK WITH A SINGLE DENSE LAYER

In [1]:
import numpy as np
# Public Keras API; the private tensorflow.python.* paths are not a stable interface
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense

In [2]:
# define model
timesteps = 40           # length of the input sequence (number of timesteps)
features = 3             # dimensionality of each input representation in the sequence
LSTMoutputDimension = 2  # dimensionality of the LSTM outputs (Hidden & Cell states)

inputs = Input(shape=(timesteps, features))
outputs = LSTM(LSTMoutputDimension)(inputs)
model_LSTM = Model(inputs=inputs, outputs=outputs)

model_LSTM.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 40, 3)]           0         
_________________________________________________________________
lstm (LSTM)                  (None, 2)                 48        
Total params: 48
Trainable params: 48
Non-trainable params: 0
_________________________________________________________________


Assume that

i= input size

h= size of hidden layer (number of neurons in the hidden layer)

o= output size (number of neurons in the output layer)

For a single hidden layer,

number of parameters in the model = connections between layers + biases in every layer (hidden + output layers!)

= (i×h + h×o) + (h+o)

![image.png](attachment:image.png)

We can find the number of parameters by counting the connections between the layers and adding the biases. Taking i = 3, h = 5 and o = 2:

Connections (weights) between layers:

1. between the input and hidden layers: i × h = 3 × 5 = 15

2. between the hidden and output layers: h × o = 5 × 2 = 10

Biases in every layer:

1. biases in the hidden layer: h = 5

2. biases in the output layer: o = 2

Total:

15 + 10 + 5 + 2 = 32 parameters (weights + biases)
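The same count can be checked with a few lines of Python. The function name is an illustrative choice; the formula is the one given above.

```python
def dense_network_params(i, h, o):
    """Parameters of a network with one hidden layer: weights plus biases."""
    weights = i * h + h * o  # input->hidden and hidden->output connections
    biases = h + o           # one bias per hidden and per output neuron
    return weights + biases

total = dense_network_params(i=3, h=5, o=2)
print(total)  # 32
```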


# KERAS LSTM CELL STRUCTURE

![image.png](attachment:image.png)

Review the figure above to see the internal structure of the LSTM cell.

There are 3 inputs to the LSTM cell:

- $h_{t-1}$: the Hidden State value of the previous timestep (t-1)
- $c_{t-1}$: the Cell State value of the previous timestep (t-1)
- $x_t$: the Input value of the current timestep (t)

There are 4 dense layers:

- Forget Gate
- Input Gates (two layers: Input + Candidate)
- Output Gate

The input and output tensor sizes (dimensions) are symbolized with circles. In the figure:

- Cell and Hidden states are vectors with dimension = 2. This number is defined by the programmer by setting the LSTM parameter units (LSTMoutputDimension) to 2.
- Input is a vector with dimension = 3. This number is also defined by the programmer, by deciding how many dimensions are used to represent an input (e.g. the dimension of a one-hot encoding, a word embedding, etc.).

Note that, by definition:

- Hidden and Cell state vector dimensions must be the same.
- Hidden and Cell state vector dimensions at time t-1 and t must be the same.
- Each input in the sequence must have the same vector dimension.

### Let's focus on Forget Gate

![image.png](attachment:image.png)

As seen in the figure above:

- there are 2 Hidden state values ($h_{t-1}$, red circles) and
- 3 Input values ($x_t$, green circles), so
- a total of 5 numbers (2 from $h_{t-1}$ + 3 from $x_t$) are fed into a dense layer.

The output layer has 2 values (which must equal the dimension of the Hidden state vector $h_{t-1}$ in the LSTM cell).

We can calculate the number of parameters in this dense layer as we did before:

= (dim($h_{t-1}$) + dim($x_t$)) × dim($h_{t-1}$) + dim($h_{t-1}$)

= (2 + 3) × 2 + 2

= 12

Thus the Forget Gate has 12 parameters (weights + biases).

![image.png](attachment:image.png)
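Since each of the 4 dense layers in the cell has the same input and output sizes as the Forget Gate, the count for the whole LSTM layer can be sketched as below. The function names are illustrative; the result agrees with the 48 parameters reported by model_LSTM.summary() earlier.

```python
def dense_layer_params(n_inputs, n_outputs):
    """Weights plus biases of one dense layer."""
    return n_inputs * n_outputs + n_outputs

def lstm_params(features, units):
    """Four dense layers (forget, input, candidate, output) of identical size."""
    per_layer = dense_layer_params(units + features, units)
    return 4 * per_layer

forget_gate = dense_layer_params(2 + 3, 2)
total = lstm_params(features=3, units=2)

print(forget_gate)  # 12
print(total)        # 48
```

The number of timesteps (40 in the model above) does not appear in the formula, because the same weights are reused at every step of the sequence.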

# Case Study

### Emotion Detection using Bidirectional LSTM

Emotion detection is one of the hottest topics in research nowadays. Emotion-sensing technology can facilitate communication between machines and humans, and can also help to improve decision-making. Many machine learning models have been proposed to recognize emotions from text.

Bidirectional LSTMs, or BiLSTMs for short, are an extension of regular LSTMs used to enhance the performance of the model on sequence classification problems. BiLSTMs use two LSTMs to train on the sequential input. The first LSTM is run on the input sequence as it is. The second LSTM is run on a reversed copy of the input sequence. This supplies additional context from both directions, which can help the model learn the problem more quickly.
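The two-pass idea can be sketched independently of any deep learning library. In the toy example below a running sum stands in for the LSTM cell, purely so that the effect of each direction is visible; the function names are hypothetical and this is not how Keras implements its bidirectional layers.

```python
def run_recurrent(sequence, step, initial_state):
    """Apply a recurrent step to each item, collecting the state after each one."""
    state, outputs = initial_state, []
    for item in sequence:
        state = step(item, state)
        outputs.append(state)
    return outputs

def bidirectional(sequence, step_fwd, step_bwd, initial_state):
    """Run one pass left to right, one right to left, and pair up the outputs."""
    forward = run_recurrent(sequence, step_fwd, initial_state)
    backward = run_recurrent(sequence[::-1], step_bwd, initial_state)[::-1]
    return list(zip(forward, backward))

def running_sum(x, state):
    """Toy stand-in for an LSTM cell: the state is the sum of everything seen so far."""
    return state + x

combined = bidirectional([1, 2, 3], running_sum, running_sum, 0)
print(combined)  # [(1, 6), (3, 5), (6, 3)]
```

At every position the first number summarizes the items to the left and the second summarizes the items to the right, so the combined output carries context from both directions.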

The dataset we have used is ISEAR (The International Survey on Emotion Antecedents and Reactions). Here is a glimpse of the dataset.

The ISEAR dataset contains 7652 sentences. It covers a total of seven emotions: Joy, Fear, Anger, Sadness, Guilt, Shame, and Disgust.

Let’s go step by step through building the model which will predict the emotion.



### Let me open code file of case study