# Translation Models


Machine translation is a core field of natural language processing (NLP) that automates the conversion of text or speech from one language to another. One of its cornerstone methods is the sequence-to-sequence (seq2seq) model, which uses a deep neural network to encode the input text into an internal representation and then decode that representation into the target language. This approach improved translation quality by learning linguistic nuances and contextual information directly from parallel data. Transformer-based models, the architecture family that also underlies well-known models such as BERT and GPT-3, have made further strides by using attention mechanisms, and they perform well across many language pairs and domains. The choice of model depends on the translation requirements, the language pair, and the quality of the available training data.

In this Colab file, we give a basic demo of how to use the dataset and build a simple seq2seq model using an RNN. Your task is to improve the model as much as you can, make predictions on the given test dataset, and write code that computes the BLEU score of your predictions against the original translations.
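As a starting point for the BLEU part of the task, here is a minimal sketch of corpus-level BLEU in plain Python. It assumes one reference per prediction, whitespace tokenization, lowercasing, and no smoothing; the function names `ngrams` and `corpus_bleu` and the example sentences are illustrative and not part of the dataset. Established implementations such as sacreBLEU or NLTK handle tokenization and smoothing more carefully, so their scores may differ from this one.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n    # hypothesis n-gram counts, summed over the corpus
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.lower().split(), hyp.lower().split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


references = ["le chat est sur le tapis"]
hypotheses = ["le chat est sur le sol"]
print(corpus_bleu(references, hypotheses))
```

To score the test set, pass the list of original French sentences as `references` and the model's decoded predictions as `hypotheses`, in the same order.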






In [4]:
import pandas as pd
import numpy as np
#from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional,LSTM, Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.callbacks import ModelCheckpoint
from tokenizers import Tokenizer
from tokenizers.models import WordPiece,BPE
from tokenizers.trainers import WordPieceTrainer,BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

In [6]:
# Load the training and test data
eng_fr = pd.read_csv("nlp_intel_train.csv")
eng_fr_test = pd.read_csv("nlp_intel_test.csv")

In [7]:
# Drop rows that have a missing value in any column
eng_fr = eng_fr.dropna()
eng_fr_test = eng_fr_test.dropna()

In [8]:
X=eng_fr["en"].tolist()
Y=eng_fr["fr"].tolist()

In [9]:
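def to_corpus(sent_list):
  # Lowercase each sentence and put it on its own line so that the last word
  # of one sentence is not fused with the first word of the next.
  return "\n".join(sentence.lower() for sentence in sent_list)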
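The tokenizer is trained on one long corpus string, so how the sentences are joined matters. The sketch below compares joining with and without a separator. It uses two made-up sentences and a regular expression that mirrors what the `Whitespace` pre-tokenizer is documented to split on (`\w+|[^\w\s]+`); the variable names are illustrative.

```python
import re

# Pattern assumed to match the documented behaviour of the Whitespace pre-tokenizer.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

sentences = ["Thank you", "Good morning"]

# Joining with no separator fuses the words at each sentence boundary.
fused = "".join(s.lower() for s in sentences)
# Joining with a newline keeps every sentence on its own line.
separated = "\n".join(s.lower() for s in sentences)

fused_words = WORD_OR_PUNCT.findall(fused)
separated_words = WORD_OR_PUNCT.findall(separated)
print(fused_words)
print(separated_words)
```

Sentences that end in punctuation are less affected, because the punctuation mark itself acts as a boundary.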

In [11]:
def train_tokenizer(file_path):
  # WordPiece tokenizer that maps out-of-vocabulary pieces to [UNK]
  tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
  trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
  tokenizer.pre_tokenizer = Whitespace()
  tokenizer.train([file_path], trainer)
  # Wrap every encoded sentence as [CLS] ... [SEP]
  tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
      ("[CLS]", tokenizer.token_to_id("[CLS]")),
      ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
  )
  return tokenizer

In [31]:
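def sequences(tokenizer,sent_list,max_len=55):
  prepoc_sentences=[]
  for sent in sent_list:
    # Lowercase to match the lowercased corpus the tokenizer was trained on
    encoding=tokenizer.encode(sent.lower())
    prepoc_sentences.append(encoding.ids)
  # pad_sequences takes the ragged list of id lists directly (no np.array needed).
  # It pads with 0, which is the id of the first special token ([UNK]), not [PAD].
  prepoc_sentences = pad_sequences(prepoc_sentences,maxlen=max_len, padding='post')
  return prepoc_sentences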
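To make the padding step concrete, here is a plain-Python sketch of what `pad_sequences` does with `padding='post'` and its default `truncating='pre'`: short sequences get zeros appended, and long sequences lose ids from the front. The helper `pad_post` and the token ids are hypothetical, chosen only for illustration, and the helper assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, pad_id=0):
    """Mimic post-padding with front truncation on a list of id lists."""
    padded = []
    for seq in sequences:
        # Front truncation: keep only the last `maxlen` ids.
        seq = list(seq)[-maxlen:]
        # Post-padding: append pad ids until the row has length `maxlen`.
        padded.append(seq + [pad_id] * (maxlen - len(seq)))
    return padded


# Illustrative ids: 1 stands for [CLS] and 2 for [SEP].
short_sentence = [1, 5, 2]
long_sentence = [1, 5, 6, 7, 8, 2]
padded = pad_post([short_sentence, long_sentence], maxlen=4)
print(padded)
```

With front truncation, a sentence longer than `maxlen` loses its leading `[CLS]` id. Passing `truncating='post'` to `pad_sequences` keeps the start of the sentence instead and drops the end.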

In [13]:
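# Write each corpus to disk; the with-blocks close (and flush) the files
# before the tokenizers read them.
with open("x.txt","w",encoding="utf-8") as f1:
  f1.write(to_corpus(X))

with open("y.txt","w",encoding="utf-8") as f2:
  f2.write(to_corpus(Y))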

In [21]:
# Train one tokenizer per language, reading the same relative paths the corpora were written to
tokenizer_eng = train_tokenizer("x.txt")
tokenizer_fr = train_tokenizer("y.txt")

In [None]:
prepoc_english_sentences,prepoc_french_sentences=sequences(tokenizer_eng,X),sequences(tokenizer_fr,Y)