# Translation

**Translation** refers to the process of converting text or speech from one language to another. It involves understanding the meaning and structure of the source language and producing an equivalent expression in the target language.

Machine translation systems in NLP aim to bridge language barriers using machine learning and computational linguistics techniques. Two classical approaches are:

- Rule-based Translation: This approach relies on linguistic rules and dictionaries to translate text. Linguists and language experts manually create these rules and mappings between words, phrases, and grammatical structures of the source and target languages. However, this method is limited by the complexity of language rules and does not handle ambiguities or context-dependent translations well.

- Statistical Machine Translation (SMT): SMT is an approach that uses statistical models to learn patterns from bilingual corpora, which are large collections of translated texts. The models learn to associate phrases or sentences in the source language with their corresponding translations in the target language. They make translation decisions based on the probabilities learned from the training data. SMT systems often employ techniques like phrase-based translation and n-gram language modeling.
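
To make the n-gram language modeling mentioned above more concrete, here is a minimal sketch of a bigram model estimated by relative frequency. The three-sentence corpus and the function names are made up for illustration, and real SMT systems add smoothing and far larger corpora.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency (no smoothing)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        # Every token except the last one acts as a history for the next token
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return {
        (prev, word): count / history_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

def sentence_probability(bigram_model, sentence):
    """Multiply the bigram probabilities; unseen bigrams get probability 0."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_model.get((prev, word), 0.0)
    return prob

# Hypothetical toy target-language corpus
corpus = ["le chat dort", "le chien dort", "le chat mange"]
bigram_model = train_bigram_model(corpus)

print(bigram_model[("le", "chat")])                        # "le" is followed by "chat" in 2 of 3 cases
print(sentence_probability(bigram_model, "le chat dort"))  # fluent word order
print(sentence_probability(bigram_model, "chat le dort"))  # 0.0, contains unseen bigrams
```

A language model like this lets an SMT system prefer fluent target sentences among several candidate translations.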

More recently, Neural Machine Translation (NMT) has become the dominant approach for translation in NLP. NMT models are based on artificial neural networks, specifically recurrent neural networks (RNNs) or transformer models. These models learn to directly map sequences of words from the source language to the target language, considering the context and dependencies between words. NMT models have shown significant improvements in translation quality compared to traditional SMT systems.

Note: this notebook is adapted from this [sample](https://www.kaggle.com/code/kkhandekar/machine-translation-beginner-s-guide).

In [None]:
pip install keras tensorflow helper plotly tabulate

In [None]:
import collections
from collections import Counter

import helper
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
# Embedding is needed by the model defined later in this notebook
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, LSTM, Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
from keras.callbacks import ModelCheckpoint

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split

from tabulate import tabulate

import gc

## The Data
For this example, we are using a dataset of English and French translation pairs. It contains single words as well as more complex phrases.

In [None]:
df = pd.read_csv("../data/supplementary_content/eng_-french.csv")
df = df.rename(columns={"English words/sentences":"Eng", "French words/sentences":"Frn" })
df.head()

Let's briefly explore the data by counting the words in each language:

In [None]:
# Count the whitespace-separated words in a string
def word_count(txt):
    return len(txt.split())

In [None]:
# Apply function to both English and French words and phrases
df['Eng_Count'] = df['Eng'].apply(word_count)
df['Frn_Count'] = df['Frn'].apply(word_count)

In [None]:
print('{} English Words'.format(df['Eng_Count'].sum()))
print('{} French Words'.format(df['Frn_Count'].sum()))

## Preprocessing

Like any other ML problem, we'll need to preprocess our data so it's properly configured for the deep learning model. We'll leverage two concepts we've discussed before: tokenization and padding.
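
As a toy illustration of what these two steps do, here is a plain-Python sketch that needs no Keras. The helper names and the three sentences are made up for this example. It ignores details such as punctuation filtering, so it is a simplification rather than a reimplementation of the Keras utilities.

```python
def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Sort by descending frequency; ties keep first-seen order because sorted() is stable
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

def texts_to_ids(texts, word_index):
    """Replace every known word by its integer id."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad_post(sequences, length=None):
    """Append zeros so that all sequences share one length."""
    if length is None:
        length = max(len(s) for s in sequences)
    # Cutting overly long sequences at the end is a simplification of this sketch
    return [s[:length] + [0] * (length - len(s)) for s in sequences]

texts = ["I am here", "I am", "here we go again"]
word_index = build_word_index(texts)
ids = texts_to_ids(texts, word_index)
padded = pad_post(ids)
vocab_size = len(word_index) + 1  # +1 for the padding id 0

print(word_index)
print(padded)      # [[1, 2, 3, 0], [1, 2, 0, 0], [3, 4, 5, 6]]
print(vocab_size)  # 7
```

The `+ 1` on the vocabulary size is the same adjustment used in the notebook: id 0 never appears in the word index because it is reserved for padding.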

In [None]:
# Tokenize Function
def tokenize(x):
    x_tk = Tokenizer(char_level = False)
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk

# Padding Function
def pad(x, length=None):
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen = length, padding = 'post')

In [None]:
# Tokenize English text & determine English Vocab Size 
eng_seq, eng_tok = tokenize(df['Eng'])
eng_vocab_size = len(eng_tok.word_index) + 1
print("Complete English Vocab Size: ",eng_vocab_size)

# Tokenize French text & determine French Vocab Size 
frn_seq, frn_tok = tokenize(df['Frn'])
frn_vocab_size = len(frn_tok.word_index) + 1
print("Complete French Vocab Size: ",frn_vocab_size)

With the tokenizing and padding functions defined and both columns tokenized, we can find the longest sequence in each language. These lengths set the number of input and output timesteps of the model.

In [None]:
# Get sequence lengths 
eng_len = max([len(sentence) for sentence in eng_seq])
frn_len = max([len(sentence) for sentence in frn_seq])

print("English Sequence Length: ", eng_len)
print("French Sequence Length: ", frn_len)

After applying the preprocessing, we'll split the data into a *training* set and a *test* set.

In [None]:
# Split data 
train_data, test_data = train_test_split(df, test_size=0.2, random_state = 0)

In [None]:
# Drop the helper columns and re-index
# Drop Columns (the word counts were only needed for exploration)
train_data = train_data.drop(columns=['Eng_Count', 'Frn_Count'])
test_data = test_data.drop(columns=['Eng_Count', 'Frn_Count'])

#Re-Index
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

Keep in mind that up to this point, we are still working with plain text (whether it be in English or French). In order to use a machine learning model, we'll need to encode this text into numeric values. In this example we are using the `Tokenizer` from Keras.

In [None]:
# -- Tokenization --
# Reuse the tokenizers fitted on the full corpus (eng_tok, frn_tok) so that word
# indices agree with the vocab sizes passed to the model and with the tokenizer
# used later to decode predictions. Fitting a separate tokenizer per split would
# assign different indices to the same word.

# Training Data
train_X_seq = eng_tok.texts_to_sequences(train_data['Eng'])
train_Y_seq = frn_tok.texts_to_sequences(train_data['Frn'])

# Testing Data
test_X_seq = eng_tok.texts_to_sequences(test_data['Eng'])
test_Y_seq = frn_tok.texts_to_sequences(test_data['Frn'])


# -- Padding --
# Pad every split to the lengths the model is built with (eng_len, frn_len)

# Training Data
train_X_seq = pad(train_X_seq, eng_len)
train_Y_seq = pad(train_Y_seq, frn_len)

# Testing Data
test_X_seq = pad(test_X_seq, eng_len)
test_Y_seq = pad(test_Y_seq, frn_len)

## Model Building 

In this case, we'll build our own neural machine translation (NMT) model from scratch. Several pre-trained models exist and could be a good alternative to building one from scratch. In the architecture, we'll be leveraging a couple of `LSTM` layers to better capture context: one encodes the source sentence and one decodes the translation. You can read more about LSTMs for machine translation [here](https://www.kaggle.com/code/harshjain123/machine-translation-seq2seq-lstms).

In [None]:
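# Define Model

def define_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    # `units` sets both the embedding dimension and the LSTM hidden size
    model = Sequential()
    # Encoder: embed the source tokens (index 0 is padding and is masked out)
    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    # Summarize the whole source sentence in one fixed-size vector
    model.add(LSTM(units))
    # Decoder: feed that vector in at every output timestep
    model.add(RepeatVector(out_timesteps))
    model.add(LSTM(units, return_sequences=True))
    # Probability distribution over the target vocabulary at each timestep
    model.add(Dense(out_vocab, activation='softmax'))

    return model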
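
To follow how data flows through this stack, it can help to trace the output shape of each layer. The sketch below does that with plain tuples and no Keras. The concrete sizes in the call are hypothetical, and the function only mirrors the layer order of `define_model` above.

```python
def trace_shapes(batch, in_timesteps, out_timesteps, units, out_vocab):
    """Return (layer, output shape) pairs for the encoder-decoder stack."""
    return [
        ("input token ids", (batch, in_timesteps)),
        # One vector of size `units` per input token
        ("Embedding", (batch, in_timesteps, units)),
        # Only the final hidden state is kept: one vector per sentence
        ("LSTM encoder", (batch, units)),
        # The sentence vector is copied once per output timestep
        ("RepeatVector", (batch, out_timesteps, units)),
        # return_sequences=True keeps one hidden state per output timestep
        ("LSTM decoder", (batch, out_timesteps, units)),
        # One probability distribution over the target vocabulary per timestep
        ("Dense softmax", (batch, out_timesteps, out_vocab)),
    ]

# Hypothetical sizes, chosen only for illustration
for layer, shape in trace_shapes(batch=32, in_timesteps=5, out_timesteps=7,
                                 units=64, out_vocab=1000):
    print(f"{layer:<16} {shape}")
```

The key step is `RepeatVector`: the encoder compresses the source sentence into a single vector, and the decoder sees that same vector at every output position.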

In [None]:
# Compile Parameters
batch_size = 64  # training batch size; also reused below as the embedding/LSTM width
lr = 1e-3        # Adam learning rate

# Instantiate model
model = define_model(eng_vocab_size, frn_vocab_size, eng_len, frn_len, batch_size)

# Compile model
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=lr))
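
The "sparse" in the loss name means the targets are integer token ids rather than one-hot vectors. As a rough sketch of the arithmetic, the loss is the average negative log of the probability the model assigns to each correct token. The function name and the numbers below are made up, and this plain-Python version leaves out numerical safeguards such as probability clipping.

```python
import math

def sparse_cce_sketch(target_ids, predicted_probs):
    """Mean of -log(probability assigned to the correct token) over timesteps."""
    losses = [-math.log(probs[target]) for target, probs in zip(target_ids, predicted_probs)]
    return sum(losses) / len(losses)

# Three timesteps over a hypothetical vocabulary of size 3
target_ids = [2, 0, 1]
predicted_probs = [
    [0.1, 0.2, 0.7],  # correct token 2 gets probability 0.7
    [0.8, 0.1, 0.1],  # correct token 0 gets probability 0.8
    [0.3, 0.4, 0.3],  # correct token 1 gets probability 0.4
]

loss = sparse_cce_sketch(target_ids, predicted_probs)
print(round(loss, 4))
```

The least confident timestep (probability 0.4) contributes the most to the loss, which is what pushes training toward assigning higher probability to the correct tokens.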

In [None]:
fn = 'model.h1.MT'
epoch = 2
val_split = 0.1

# Save the model whenever validation loss improves, so the best version survives an interrupted run
checkpoint = ModelCheckpoint(fn, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# Train model 
history = model.fit(train_X_seq, train_Y_seq,
                    epochs=epoch, batch_size=batch_size, validation_split = val_split, callbacks=[checkpoint], 
                    verbose=1)
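
A note on `validation_split`: Keras holds out the last fraction of the samples passed to `fit`, taken before any shuffling. The plain-Python sketch below mimics that slicing on a toy list. The helper name is hypothetical and the code is not taken from Keras itself.

```python
import math

def split_for_validation(samples, validation_split):
    """Hold out the trailing fraction of `samples` for validation."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

samples = list(range(20))  # stand-in for 20 training examples
train_part, val_part = split_for_validation(samples, 0.1)

print(len(train_part), len(val_part))  # 18 2
print(val_part)                        # [18, 19]
```

Because the held-out rows come from the end of the array, it matters that the training data was already shuffled, which `train_test_split` does by default.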

In [None]:
# Let's plot the loss epoch over epoch

plt.rcParams["figure.figsize"] = (10,8)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.title("Train vs Validation - Loss", fontsize=15)
plt.show()

## Make Predictions 

Keep in mind that the model works on tokenized, padded sequences rather than plain text. This means that once we generate predictions, we'll have to decode them back to plain text using the tokenizer.
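
As a toy illustration of that decoding step, the sketch below picks the most probable token at each timestep and looks it up in an index-to-word map. The vocabulary and probabilities are made up, and unlike the notebook's own helper this version drops padding tokens instead of keeping them as empty strings.

```python
def greedy_decode(prob_rows, index_to_word):
    """Pick the most probable token id per timestep and join the words."""
    words = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over the vocabulary
        word = index_to_word.get(best, "")
        if word:  # id 0 maps to "" (padding) and is skipped
            words.append(word)
    return " ".join(words)

# Hypothetical 4-entry vocabulary; id 0 is padding
index_to_word = {0: "", 1: "je", 2: "suis", 3: "ici"}
prob_rows = [
    [0.10, 0.70, 0.10, 0.10],  # most probable: "je"
    [0.10, 0.10, 0.60, 0.20],  # most probable: "suis"
    [0.00, 0.10, 0.20, 0.70],  # most probable: "ici"
    [0.90, 0.05, 0.03, 0.02],  # most probable: padding
]

decoded = greedy_decode(prob_rows, index_to_word)
print(decoded)  # je suis ici
```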

In [None]:
test_data

In [None]:
# Predict for test samples 1 through 5 and keep the output for the first of them (sample 1)
predictions = model.predict(test_X_seq[1:6])[0]
predictions

In [None]:
# Function to map predicted token probabilities back to words
def to_text(logits, tokenizer):
    # `logits` holds one row of softmax probabilities per output timestep
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = ''  # index 0 is the padding token
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

In [None]:
print(f"The French translation is: {to_text(predictions, frn_tok)}")