
<h4> Recurrent Neural Network Overview </h4>
Put simply, feedforward neural networks have no memory, so they cannot handle sequential data well. For example, the position of a word in a sentence can have a significant impact on its meaning. RNNs learn representations of sequences in which order is important. To do this they need a memory of the past: whenever we feed a word as input, we also feed a state variable that carries knowledge of the earlier words. An RNN has an additional weight matrix that helps it learn from the past state. The state vector obtained after feeding in a full sentence is a representation of that sentence that takes the order of the words into account.
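
The recurrence described above can be sketched in plain Python. This is a minimal illustration and not the Keras implementation: the names `rnn_step` and `encode` and the toy weights are assumptions made up for this example. It shows that the final state depends on the order in which the inputs arrive.

```python
import math

def rnn_step(x, h, W_x, W_h, b):
    """One simple-RNN step: h_new = tanh(W_x @ x + W_h @ h + b)."""
    h_new = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))  # current input
        total += sum(W_h[i][j] * h[j] for j in range(len(h)))  # memory of the past
        h_new.append(math.tanh(total))
    return h_new

def encode(sequence, W_x, W_h, b):
    """Feed a sequence one item at a time and return the final state."""
    h = [0.0] * len(b)  # the state starts empty
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

# Toy weights (hypothetical values, hidden size 2, input size 2).
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.9, 0.0], [0.2, 0.4]]
b = [0.0, 0.0]

word_a = [1.0, 0.0]
word_b = [0.0, 1.0]

state_ab = encode([word_a, word_b], W_x, W_h, b)
state_ba = encode([word_b, word_a], W_x, W_h, b)
# The same two words in a different order give a different final state.
```

`W_h` is the additional matrix mentioned above: it is the only path through which earlier words can influence the current state.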

<h6> LSTM </h6>
A special kind of RNN that can learn long-term dependencies. The cell state carries information through the network, so it can carry information from one end of a sequence to the other. It can be thought of as a conveyor belt.
LSTMs can remove information from and add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. A sigmoid layer outputs a number between 0 and 1 that indicates how much of each component should be let through.<br>

1. The first step is to decide which information to throw away from the cell state. This is done by a sigmoid gate known as the forget gate.

2. The second step is to decide what new information to store in the cell state. A sigmoid gate (the input gate) decides which values to update, then a tanh layer prepares candidate values that could be added to the state.

3. The last step is to decide the output. First we run a sigmoid gate that decides which parts of the cell state to output. Then we pass the cell state through a tanh function and multiply it by the output of the sigmoid gate, so that the network only outputs the parts it has decided to.
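
The three steps above can be sketched as a single LSTM step in plain Python. This is a simplified illustration under stated assumptions, not how Keras implements its `LSTM` layer: the helper names (`sigmoid_vec`, `tanh_vec`, `affine`, `lstm_step`), the `params` dictionary layout, and the toy weights are all hypothetical.

```python
import math

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh_vec(v):
    return [math.tanh(a) for a in v]

def affine(W, z, b):
    """Return W @ z + b for a list-of-rows matrix W."""
    return [sum(w * a for w, a in zip(row, z)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    z = h_prev + x  # concatenate previous hidden state and current input
    f = sigmoid_vec(affine(params["W_f"], z, params["b_f"]))  # step 1: forget gate
    i = sigmoid_vec(affine(params["W_i"], z, params["b_i"]))  # step 2: input gate
    g = tanh_vec(affine(params["W_g"], z, params["b_g"]))     # step 2: candidates
    o = sigmoid_vec(affine(params["W_o"], z, params["b_o"]))  # step 3: output gate
    # New cell state: keep part of the old state, add part of the candidates.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # New hidden state: squashed cell state, filtered by the output gate.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Toy parameters (hypothetical): hidden size 2, input size 1.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {
    "W_f": zeros, "b_f": [20.0, 20.0],    # forget gate almost fully open
    "W_i": zeros, "b_i": [-20.0, -20.0],  # input gate almost fully closed
    "W_g": [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]], "b_g": [0.0, 0.0],
    "W_o": zeros, "b_o": [0.0, 0.0],
}
c_prev = [0.5, -0.5]
h, c = lstm_step([1.0], [0.0, 0.0], c_prev, params)
```

With the forget gate open and the input gate closed, the cell state passes through the step almost unchanged, which is the conveyor-belt behaviour described earlier.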

<h6> GRU (Gated Recurrent Unit) </h6> 
The GRU merges the input and forget gates into a single update gate. It also merges the cell state and the hidden state. The resulting model has fewer parameters, so it is faster and computationally cheaper. LSTMs can perform better on some tasks but are more expensive, so choosing a GRU is a trade-off between computation and performance.
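
A rough parameter count makes the cost difference concrete. The sketch below assumes the textbook formulation in which every gate or candidate block has input weights, recurrent weights, and one bias vector; some implementations add an extra bias, so real layer summaries may differ slightly. The function name `recurrent_params` and the sizes (256 units, 512 input features) are chosen only for illustration.

```python
def recurrent_params(units, input_dim, num_blocks):
    """Parameters of a gated recurrent layer with `num_blocks` weight blocks."""
    per_block = units * input_dim + units * units + units  # input + recurrent + bias
    return num_blocks * per_block

# LSTM: forget gate, input gate, candidate, output gate -> 4 blocks.
lstm_params = recurrent_params(units=256, input_dim=512, num_blocks=4)
# GRU: update gate, reset gate, candidate -> 3 blocks.
gru_params = recurrent_params(units=256, input_dim=512, num_blocks=3)

print(lstm_params)  # 787456
print(gru_params)   # 590592
```

Under these assumptions a GRU has three quarters of the parameters of an LSTM with the same number of units.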

<h6> Applications </h6> 
1. Natural Language Processing <br>
2. Machine Translation <br>
3. Image Captioning <br>
4. Speech Recognition <br>
5. Sentiment Analysis 


<h2> Data Pre-Processing </h2>
How Keras helps with pre-processing.

In [None]:
import os 
import numpy as np 
from keras.preprocessing.text import Tokenizer 
from keras.preprocessing.sequence import pad_sequences 


Using TensorFlow backend.


<h4> Load Data </h4>

In [None]:
def load_file(path):
    # Read the whole file as UTF-8 and return one sentence per line.
    with open(path, "r", encoding="utf8") as f:
        data = f.read()

    return data.split('\n')


In [None]:
english_sentences = load_file('train.en.txt')
vietnam_sentences = load_file('train.vi.txt')


In [None]:
print(english_sentences[1])
print(vietnam_sentences[1])


In 4 minutes , atmospheric chemist Rachel Pike provides a glimpse of the massive scientific effort behind the bold headlines on climate change , with her team -- one of thousands who contributed -- taking a risky flight over the rainforest in pursuit of data on a key molecule .
Trong 4 phút , chuyên gia hoá học khí quyển Rachel Pike giới thiệu sơ lược về những nỗ lực khoa học miệt mài đằng sau những tiêu đề táo bạo về biến đổi khí hậu , cùng với đoàn nghiên cứu của mình -- hàng ngàn người đã cống hiến cho dự án này -- một chuyến bay mạo hiểm qua rừng già để tìm kiếm thông tin về một phân tử then chốt .


<h4> Tokenizer </h4>
As you have probably guessed, a neural network cannot work with raw text. It needs numbers that it can perform computations on.
We will therefore turn each word in a sentence into a number. The Tokenizer class of the Keras framework will help us.
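
To make the idea concrete, here is a minimal pure-Python sketch of word-level tokenization. It is not the Keras `Tokenizer` itself, which also strips punctuation and has more options; the names `build_word_index` and `to_sequences` and the two sample sentences are assumptions for this example. Like Keras, it gives the most frequent word the smallest index and leaves index 0 unused.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer id, most frequent word first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(sentences, word_index):
    """Replace every known word with its id."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

sentences = ["the cat sat", "the dog sat down"]
word_index = build_word_index(sentences)
sequences = to_sequences(sentences, word_index)

print(word_index)  # {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'down': 5}
print(sequences)   # [[1, 3, 2], [1, 4, 2, 5]]
```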



In [None]:
def tokenize(sentences):
    # Fit a word-level vocabulary, then map every sentence to a list of word ids.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    sentences_tokenized = tokenizer.texts_to_sequences(sentences)
    
    return sentences_tokenized, tokenizer

english_data_tokenized , english_tokenizer = tokenize(english_sentences)
vietnam_data_tokenized , vietnam_tokenizer = tokenize(vietnam_sentences)

print(english_data_tokenized[1])
print(vietnam_data_tokenized[1])



[9, 1297, 448, 8298, 6152, 6343, 15813, 3291, 6, 4182, 5, 1, 1226, 901, 1464, 555, 1, 2705, 3608, 25, 670, 159, 26, 131, 559, 38, 5, 508, 64, 5226, 439, 6, 5227, 1692, 119, 1, 3749, 9, 3680, 5, 206, 25, 6, 707, 1438]
[14, 588, 574, 619, 145, 179, 60, 391, 917, 3380, 7329, 89, 945, 951, 1187, 32, 7, 1133, 256, 293, 60, 3381, 2864, 1062, 102, 7, 354, 110, 1300, 835, 32, 294, 142, 391, 689, 117, 20, 1165, 266, 209, 8, 82, 162, 854, 13, 18, 1969, 1335, 22, 314, 431, 17, 4, 831, 428, 1320, 610, 133, 872, 907, 28, 146, 491, 144, 125, 32, 4, 283, 358, 2427, 1295]


<h4> Add Padding </h4>
For our model to train efficiently, every sequence needs to have the same length. Since sentences differ in length, we add zeros to the shorter ones until they all match the longest. The pad_sequences function automates this for us, and we can specify whether the zeros are added at the beginning or at the end. Because the Tokenizer never assigns index 0 to a word, the padding cannot be mistaken for real input.
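
The effect of padding is easy to see on a toy example. The sketch below is a simplified stand-in for `pad_sequences`, written in plain Python; the name `pad_to_length` is hypothetical, and it assumes `length` is at least as long as the longest sequence, so it never truncates.

```python
def pad_to_length(sequences, length=None, padding="post", value=0):
    """Pad every sequence with `value` until it has `length` items."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (length - len(seq))
        # 'post' appends the zeros, anything else prepends them.
        padded.append(list(seq) + fill if padding == "post" else fill + list(seq))
    return padded

toy = [[1, 2, 3], [4]]
post_padded = pad_to_length(toy, padding="post")
pre_padded = pad_to_length(toy, padding="pre")

print(post_padded)  # [[1, 2, 3], [4, 0, 0]]
print(pre_padded)   # [[1, 2, 3], [0, 0, 4]]
```

Note that Keras's `pad_sequences` pads at the beginning by default, which is why the notebook passes `padding='post'` explicitly.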

In [None]:
def pad(input, length=None):
    if length is None:
        length = max([len(seq) for seq in input])
        
    return pad_sequences(input, maxlen = length, padding='post')

english_data_padded = pad(english_data_tokenized)
# Reshape to add a trailing axis so we can use the sparse categorical cross-entropy loss.
vietnam_data_padded = pad(vietnam_data_tokenized)
vietnam_data_padded = vietnam_data_padded.reshape(*vietnam_data_padded.shape,1)

print(english_data_padded.shape)
print (vietnam_data_padded.shape)



(133318, 612)
(133318, 720, 1)


<h4> Build Model </h4>


In [None]:
from keras.layers import GRU, Input, Dense , TimeDistributed 
from keras.models import Model , Sequential 
from keras.layers import Activation 
from keras.optimizers import Adam 
from keras.losses import sparse_categorical_crossentropy 


<h4>About Time Distributed</h4>
We wrap the Dense layer in a TimeDistributed wrapper. This applies the same dense layer to every time step: instead of applying one dense layer to the whole sentence, the network applies it to each word position separately. We do this to keep the output of each time step separate, so the model predicts one word per time step, based on the previous ones, instead of predicting the whole sentence at once. This gives us a probability distribution over the vocabulary at every position.
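
What the wrapper does can be sketched without Keras. The following is a plain-Python illustration under stated assumptions: the function names `dense` and `time_distributed_dense`, the toy weights, and the vocabulary size of 4 are invented for this example. The point is that one set of weights is reused at every time step and the time axis is preserved.

```python
def dense(vector, W, b):
    """Apply one dense layer: returns W @ vector + b."""
    return [sum(w * v for w, v in zip(row, vector)) + bias
            for row, bias in zip(W, b)]

def time_distributed_dense(sequence, W, b):
    """Apply the same dense weights independently at every time step."""
    return [dense(step, W, b) for step in sequence]

# 3 time steps, each a 2-dimensional hidden state (toy values).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One row of weights per output word: a toy vocabulary of 4 words.
W = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [3.0, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]

scores = time_distributed_dense(sequence, W, b)
# `scores` has one row per time step and one column per vocabulary word.
```

Passing each row of `scores` through a softmax would then give the per-position word probabilities that the model is trained on.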

In [None]:
def simple_model(input_shape, output_len, num_uniq_en_words, num_uniq_vi_words):
    model = Sequential()
    model.add(GRU(units=256, input_shape = input_shape[1:], return_sequences=True)) # You can use LSTM too
    model.add(TimeDistributed(Dense(num_uniq_vi_words))) # One output per Vietnamese word, so the model can predict how likely each word is.
    model.add(Activation('softmax'))
    
    learning_rate = 0.002
    
    model.compile(loss = sparse_categorical_crossentropy, optimizer = Adam(learning_rate), metrics=['accuracy'])
    
    return model



<h4> Build Advanced Model</h4>


In [None]:
from keras.layers import Embedding, Bidirectional, RepeatVector, Flatten
from keras.layers import LSTM


<h4> Approach </h4>
First we add an <b>embedding layer</b>. It compresses our data from sparse vectors into dense vectors. It also places similar words in similar regions of the vector space, so it learns the relationships between words. The other layers use the same size as the embedding layer, since its output serves as their input; they are all set to 512.
The <b>Bidirectional layer</b> makes the network learn from the words that come later in the sentence and not only the earlier ones, which helps increase the performance of the network. Because the first bidirectional layer returns a single vector for the whole sentence, we need to generate every output word from that sentence representation. This is the job of the <b>RepeatVector</b> layer: it repeats the sentence vector as the input for each word we predict. Each time step therefore gets the same input but a different hidden state. Without it, we would only get one Vietnamese word per English sentence.
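
The repetition step is simple enough to sketch in plain Python. This illustrates the idea only and is not the Keras layer: the function name `repeat_vector`, the 3-dimensional sentence vector, and the output length of 4 are assumptions for this example.

```python
def repeat_vector(vector, n):
    """Turn one vector into a sequence of n identical time steps."""
    return [list(vector) for _ in range(n)]

# One vector summarising the whole input sentence (toy values).
sentence_vector = [0.2, -0.7, 0.5]
# One copy per word we want to predict.
decoder_inputs = repeat_vector(sentence_vector, 4)

print(len(decoder_inputs))     # 4
print(len(decoder_inputs[0]))  # 3
```

In the model, the number of repetitions is `output_len`, so the second recurrent layer runs for exactly as many time steps as the padded Vietnamese sentences have positions.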


In [None]:
def advanced_model(input_shape, output_len, num_uniq_en_words, num_uniq_vi_words):
    model = Sequential()
    model.add(Embedding(num_uniq_en_words, 512, input_length = input_shape[1]))
    model.add(Bidirectional(LSTM(512, return_sequences= False)))
    model.add(RepeatVector(output_len))
    model.add(Bidirectional(LSTM(512, return_sequences=True)))
    model.add(TimeDistributed(Dense(num_uniq_vi_words)))
    model.add(Activation('softmax'))
    
    learning_rate = 0.002
    
    model.compile(loss =sparse_categorical_crossentropy, optimizer = Adam(learning_rate), metrics=['accuracy'])
   
    
    return model 


    

In [None]:
# Index 0 is reserved for padding, so each vocabulary size is len(word_index) + 1.
model = advanced_model(english_data_padded.shape, vietnam_data_padded.shape[1],
                      len(english_tokenizer.word_index) + 1, len(vietnam_tokenizer.word_index) + 1)


Instructions for updating:
Colocations handled automatically by placer.


<h4> Train the Model </h4>


In [None]:
history = model.fit(english_data_padded[:100], vietnam_data_padded[:100], batch_size=64, epochs =3 , validation_split=0.1)


Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train on 90 samples, validate on 10 samples
Epoch 1/3


In [None]:
vietnam_data_padded.shape
