# Neural Networks Homework 8 - Encoder Decoder Language Translation

## Mustafa Nazlıer - 15050111035

**In this implementation, I comment on and try to improve the structure of the code written by Krish Naik.**

**Importing the layers needed for the implementation: the Input layer, the Long Short-Term Memory (LSTM) layer, and the Dense layer. We could also have used Gated Recurrent Unit (GRU) layers.**

In [1]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np              
from google.colab import files   # Not in the original code: I run this notebook on Google Colab,
                                 # so I use this module to upload 'fra.txt'


uploaded = files.upload()    # Opens a widget for uploading the file to the Colab environment
                             # In JupyterLab, delete this part and put your own path in the call below: open('yourPath/fra.txt', 'r', encoding='utf-8')

### Data preprocessing and vectorizing
**In this part, we read the file line by line and keep the parts we need, which are separated by a tab '\t'. Each line of fra.txt has the form (English text, tab, French translation, tab, extra data that we do not need). That is why we parse the file with the for loops below.**

In [2]:
input_texts=[]        # English source texts
target_texts=[]       # French translation of each English text

num_samples=10000

input_characters= set()
target_characters= set()

with open('fra.txt', 'r' , encoding='utf-8') as f:          
    lines = f.read().split('\n')
for line in lines[:min(num_samples,len(lines)-1)]:
    input_text, target_text, _ =line.split('\t')
    target_text= '\t'+ target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    input_characters.update(input_text)      # a set ignores characters it already holds
    target_characters.update(target_text)


input_characters = sorted(list(input_characters))        # Every character that appears in the English texts
target_characters = sorted(list(target_characters))      # Every character that appears in the French texts, including the '\t' and '\n' markers
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length= max([len(txt) for txt in input_texts])     
max_decoder_seq_length= max([len(txt) for txt in target_texts])
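
To make the preprocessing concrete, here is a minimal sketch that runs the same steps on two in-memory lines. The sample lines and the name `sample_lines` are made up for illustration; only the tab-separated three-field layout is taken from the code above.

```python
# Two hypothetical lines in the same tab-separated layout as fra.txt:
# English text, French text, then a third field that is ignored.
sample_lines = [
    "Hi.\tSalut !\tignored",
    "Go.\tVa !\tignored",
]

input_texts, target_texts = [], []
input_characters, target_characters = set(), set()

for line in sample_lines:
    input_text, target_text, _ = line.split("\t")
    # '\t' marks the start of a target sequence and '\n' marks its end.
    target_text = "\t" + target_text + "\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    input_characters.update(input_text)
    target_characters.update(target_text)

input_characters = sorted(input_characters)
target_characters = sorted(target_characters)

print(input_characters)
print(target_characters)
print(max(len(t) for t in input_texts), max(len(t) for t in target_texts))
```

The start and end markers are counted in the target vocabulary and in the maximum target length, which is why the decoder side is always at least two characters longer than the raw French text.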

In [3]:
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:',max_decoder_seq_length)


Number of samples: 10000
Number of unique input tokens: 71
Number of unique output tokens: 92
Max sequence length for inputs: 15
Max sequence length for outputs: 59


### Vectorizing our input and target characters

In [4]:
input_token_index = dict(
    [(char,i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char,i) for i, char in enumerate(target_characters)])

In [5]:
print('Input tokens indexed',input_token_index.keys())
print('Target tokens indexed', target_token_index.keys())

Input tokens indexed dict_keys([' ', '!', '"', '$', '%', '&', "'", ',', '-', '.', '0', '1', '2', '3', '5', '7', '8', '9', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'é'])
Target tokens indexed dict_keys(['\t', '\n', ' ', '!', '%', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '5', '8', '9', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xa0', '«', '»', 'À', 'Ç', 'É', 'Ê', 'à', 'â', 'ç', 'è', 'é', 'ê', 'î', 'ï', 'ô', 'ù', 'û', 'œ', '\u2009', '’', '\u202f'])
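
The index dictionaries can be checked with a quick round trip. This sketch uses a small hypothetical alphabet rather than the real character sets, and the helper names `encode` and `decode` are mine, not part of the original code.

```python
# A small hypothetical alphabet standing in for target_characters.
characters = sorted(set("\tSalut !\n"))

# Same construction as above: character -> integer index.
token_index = {char: i for i, char in enumerate(characters)}
# Inverse mapping, needed later to turn predictions back into text.
reverse_index = {i: char for char, i in token_index.items()}

def encode(text):
    """Map each character to its index."""
    return [token_index[char] for char in text]

def decode(indices):
    """Map each index back to its character."""
    return "".join(reverse_index[i] for i in indices)

ids = encode("Salut !")
print(ids)
print(decode(ids))
```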


#### Creating the input and target arrays for the encoder-decoder architecture. Note that there is no encoder output array: what the encoder passes on to the decoder is its final states

In [6]:
encoder_input_data = np.zeros(
    ( len(input_texts), max_encoder_seq_length,num_encoder_tokens),       #dimensions for the english
    dtype='float32')
decoder_input_data = np.zeros(
    ( len(input_texts), max_decoder_seq_length,num_decoder_tokens),       #dimensions for the french
    dtype='float32')
decoder_target_data = np.zeros(
    ( len(input_texts), max_decoder_seq_length,num_decoder_tokens),
    dtype='float32')

In [7]:
for i, (input_text, target_text) in enumerate(zip(input_texts,target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.                                    # pad the remaining timesteps with spaces
    for t, char in enumerate(target_text):                                                        # decoder_target_data is one timestep (here, one character)
        decoder_input_data[i, t, target_token_index[char]] = 1.                                   # ahead of decoder_input_data and omits the start character
        if t > 0:
            decoder_target_data[i, t-1, target_token_index[char]] = 1.
    decoder_input_data[i, t+1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.
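
The one-timestep offset between decoder input and decoder target is easier to see on a single short sequence. The following is a minimal sketch with a hypothetical three-character alphabet and no padding; it is not the notebook's data.

```python
import numpy as np

# Hypothetical target sequence: start marker, one character, end marker.
target_text = "\ta\n"
chars = sorted(set(target_text))            # ['\t', '\n', 'a']
index = {c: i for i, c in enumerate(chars)}

steps, vocab = len(target_text), len(chars)
decoder_input = np.zeros((steps, vocab), dtype="float32")
decoder_target = np.zeros((steps, vocab), dtype="float32")

for t, char in enumerate(target_text):
    decoder_input[t, index[char]] = 1.0
    if t > 0:
        # The target at step t-1 is the character the decoder reads at step t.
        decoder_target[t - 1, index[char]] = 1.0

print(decoder_input.argmax(axis=1))
print(decoder_target[:-1].argmax(axis=1))
```

During training the decoder is fed the correct previous character (teacher forcing) and has to predict the next one, so the target is the input shifted left by one step.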

### Creating the LSTM layer architecture

In [8]:

latent_dim=256    # dimensionality of the LSTM hidden and cell states

encoder_inputs=Input(shape=(None,num_encoder_tokens))
encoder =LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)      # We discard encoder_outputs: in the encoder-decoder architecture, the information we need
                                                                 # is in the final hidden and cell states, which summarize the whole input sequence
encoder_states = [state_h, state_c]                              # and are passed to the decoder as its initial state

decoder_inputs= Input(shape=(None,num_decoder_tokens))
decoder_lstm=LSTM(latent_dim,return_sequences=True,return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense (decoder_outputs)
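
As a sanity check on the layer sizes, the number of trainable weights can be worked out by hand. The sketch below uses the standard LSTM parameter count (four gates, each with input weights, recurrent weights, and a bias) together with the token counts printed earlier (71 and 92); the function names are mine.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates x (input + recurrent + bias)."""
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    """Weights of a fully connected layer with bias."""
    return input_dim * units + units

latent_dim = 256
num_encoder_tokens = 71   # from the printed statistics
num_decoder_tokens = 92   # from the printed statistics

encoder = lstm_params(num_encoder_tokens, latent_dim)
decoder = lstm_params(num_decoder_tokens, latent_dim)
dense = dense_params(latent_dim, num_decoder_tokens)

print(encoder, decoder, dense, encoder + decoder + dense)
```

Calling `model.summary()` on the compiled model should report the same totals, which is a quick way to confirm that the layers are wired as intended.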

### Creating our model that includes the LSTM layers that we specified

In [9]:
batch_size=64   ## batch size for the network
epochs=100      ## epoch number                                       


model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit([encoder_input_data,decoder_input_data], decoder_target_data, batch_size= batch_size, epochs=epochs, validation_split=0.2)

Epoch 1/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7fe8b799c5d0>
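
For orientation, the training configuration above implies the following split and number of weight updates per epoch. This is plain arithmetic on the values in the cell (10000 samples, `validation_split=0.2`, `batch_size=64`); Keras holds out the last fraction of the samples for validation.

```python
import math

num_samples = 10000
validation_split = 0.2
batch_size = 64

# Keras trains on the first (1 - validation_split) share of the samples.
num_train = int(num_samples * (1 - validation_split))
num_val = num_samples - num_train
steps_per_epoch = math.ceil(num_train / batch_size)

print(num_train, num_val, steps_per_epoch)
```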

### Generating the sentences 

Now that we have trained our network, we can translate the English texts from fra.txt into French with around 85% accuracy. It is far from perfect, but given the simplicity of this implementation, it shows the promise of recurrent neural networks and the encoder-decoder architecture.

In [10]:
encoder_model= Model(encoder_inputs, encoder_states)            # Using the trained layers, we can feed in a text and retrieve its translation

decoder_state_input_h= Input(shape=(latent_dim,))
decoder_state_input_c= Input(shape=(latent_dim,))
decoder_states_inputs= [decoder_state_input_h,decoder_state_input_c ]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states= [state_h, state_c]
decoder_outputs= decoder_dense(decoder_outputs)
decoder_model= Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

reverse_input_char_index= dict(
    (i, char) for char, i in input_token_index.items())                  # Reverse lookup: index -> character
reverse_target_char_index= dict(
    (i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    states_value= encoder_model.predict(input_seq)

    target_seq = np.zeros((1, 1, num_decoder_tokens))                 # One-hot start character as the first decoder input

    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition= False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h , c = decoder_model.predict([target_seq]+ states_value)


        sampled_token_index= np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence+= sampled_char



        if(sampled_char == '\n' or len(decoded_sentence)> max_decoder_seq_length): 
            stop_condition= True

        
        target_seq= np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index]= 1.

        states_value= [h, c]

    return decoded_sentence

                                                                                     
for seq_index in range(100):                                                  # Translating the first 100 samples in fra.txt with the model
                               
    input_seq= encoder_input_data[seq_index: seq_index+1]
    decoded_sentence= decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
     

-
Input sentence: Go.
Decoded sentence: Marche.

-
Input sentence: Go.
Decoded sentence: Marche.

-
Input sentence: Go.
Decoded sentence: Marche.

-
Input sentence: Hi.
Decoded sentence: Salut !

-
Input sentence: Hi.
Decoded sentence: Salut !

-
Input sentence: Run!
Decoded sentence: File !

-
Input sentence: Run!
Decoded sentence: File !

-
Input sentence: Run!
Decoded sentence: File !

-
Input sentence: Run!
Decoded sentence: File !

-
Input sentence: Run!
Decoded sentence: File !

-
Input sentence: Run!
Decoded sentence: File !

-
Input sentence: Run!
Decoded sentence: File !

-
Input sentence: Run!
Decoded sentence: File !

-
Input sentence: Run.
Decoded sentence: File !

-
Input sentence: Run.
Decoded sentence: File !

-
Input sentence: Run.
Decoded sentence: File !

-
Input sentence: Run.
Decoded sentence: File !

-
Input sentence: Run.
Decoded sentence: File !

-
Input sentence: Run.
Decoded sentence: File !

-
Input sentence: Run.
Decoded sentence: File !

-
Input sentence: Ru
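
The decoding loop above does not depend on Keras itself: it repeatedly asks for the most likely next character and stops at the end marker or at a length limit. The sketch below isolates that control flow with a hypothetical stand-in, `fake_decoder_step`, that replays a fixed string instead of running a trained model.

```python
def fake_decoder_step(previous_char, state):
    """Hypothetical stand-in for decoder_model.predict.

    `state` is simply the position in a fixed answer; a real decoder
    would carry the LSTM hidden and cell states instead.
    """
    answer = "Salut !\n"
    return answer[state], state + 1

def greedy_decode(step_fn, max_length):
    decoded = ""
    char, state = "\t", 0            # start marker and initial state
    while True:
        char, state = step_fn(char, state)
        decoded += char
        # Stop on the end marker or when the output grows too long.
        if char == "\n" or len(decoded) > max_length:
            return decoded

print(repr(greedy_decode(fake_decoder_step, max_length=59)))
print(repr(greedy_decode(fake_decoder_step, max_length=3)))
```

The length limit matters because an imperfectly trained decoder may never emit the end marker, and the loop would otherwise not terminate.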

### Improvements over the implementation
#### I tried changing the activation function, but the accuracy results were similar
#### I believe the implementation is hard-coded. We could change the parsing to make it generic enough to handle different input file formats
#### Instead of LSTM, we could have used GRU, which would affect the accuracy
#### I tried to improve the overall structure by adding sections and clearer comments
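
One way to make the parsing less hard-coded, as suggested above, is to accept any line that has at least two separated fields and to skip the rest. The following is only a sketch of that idea; `parse_pairs` and its parameters are hypothetical names, not part of the original code.

```python
def parse_pairs(lines, separator="\t", max_samples=None):
    """Return (source, target) pairs from delimited lines.

    Fields after the second one are ignored, and lines with fewer
    than two non-empty fields (for example blank lines) are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split(separator)
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue
        pairs.append((fields[0], fields[1]))
        if max_samples is not None and len(pairs) >= max_samples:
            break
    return pairs

sample = ["Hi.\tSalut !\textra", "Go.\tVa !", "", "broken line"]
print(parse_pairs(sample))
print(parse_pairs(["a;b;c"], separator=";"))
```

Unlike the three-way unpacking `input_text, target_text, _ = line.split('\t')`, this version does not raise an error when a line has two fields or more than three.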


P.S.: I used Google Colab for this homework, which is why I added an upload section. To run it in JupyterLab, delete or comment out that part and add your own file path.