<a href="https://colab.research.google.com/github/Vilmo18/Machine_Translation/blob/main/Machine_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!nvidia-smi

Mon Jul  8 11:58:56 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Install libraries

In [None]:
!pip install -q \
transformers==4.38.2 \
datasets==2.18.0 \
evaluate==0.4.1 \
sacrebleu==2.4.2 \
tensorflow==2.15.0 \
tf-keras==2.15.1 \
matplotlib==3.7.1

In [None]:
#!pip install datasets transformers[sentencepiece] sacrebleu -q

## Import libraries

In [None]:
import os
import tensorflow as tf
import sys
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, AdamWeightDecay

In [None]:
## Define the pretrained checkpoint (English -> Hindi MarianMT model)
model_checkpoint = "Helsinki-NLP/opus-mt-en-hi"

In [None]:
# Load the dataset (IITB English-Hindi parallel corpus)
ds = load_dataset("cfilt/iitb-english-hindi")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
## Display dataset
ds

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})

In [None]:
ds['train'][0]

{'translation': {'en': 'Give your application an accessibility workout',
  'hi': 'अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें'}}

## Preprocessing data

In [None]:
tokenizer=AutoTokenizer.from_pretrained(model_checkpoint)



In [None]:
## Example: tokenizing a source (input) sentence
tokenizer('hello my name is')

{'input_ids': [39915, 155, 300, 23, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [None]:
## Example: tokenizing a target sentence
# `text_target` replaces the deprecated `as_target_tokenizer()` context manager
print(tokenizer(text_target='4'))

{'input_ids': [898, 0], 'attention_mask': [1, 1]}




In [None]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "hi"

## Preprocessing function (legacy API; the next cell redefines it with `text_target`)
def preprocess_function(examples):
  inputs = [ex[source_lang] for ex in examples["translation"]]
  targets = [ex[target_lang] for ex in examples["translation"]]

  # Tokenize the source sentences
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
  # Tokenize the target sentences (`as_target_tokenizer` is deprecated)
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length=max_target_length, truncation=True)

  # The target token ids become the training labels
  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

In [None]:
# Preferred version: `text_target` tokenizes sources and targets in one call
def preprocess_function(examples):
    inputs = [example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

In [None]:
## Test the preprocess function on the first two training examples
preprocess_function(ds['train'][:2])

{'input_ids': [[3872, 85, 2501, 132, 15441, 36398, 0], [32643, 28541, 36253, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]], 'labels': [[63, 2025, 18, 16155, 346, 20311, 24, 2279, 679, 0], [26618, 16155, 346, 33383, 0]]}
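The slice `ds['train'][:2]` passed above is column-oriented: a single `translation` key holds a list of per-row dicts, which is why `preprocess_function` iterates over `examples["translation"]`. The following is a minimal sketch of that reshaping in plain Python, without the `datasets` library. `rows_to_batch` and `split_pairs` are hypothetical helper names, and the second row is an invented placeholder, not real dataset content.

```python
def rows_to_batch(rows):
    """Turn a list of row dicts into one dict of lists (column-oriented)."""
    batch = {}
    for row in rows:
        for key, value in row.items():
            batch.setdefault(key, []).append(value)
    return batch


def split_pairs(examples, source_lang="en", target_lang="hi"):
    """Extract parallel source/target sentence lists from a batch."""
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    return inputs, targets


rows = [
    {"translation": {"en": "Give your application an accessibility workout",
                     "hi": "अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें"}},
    # Placeholder row, invented for illustration only.
    {"translation": {"en": "hello", "hi": "नमस्ते"}},
]

batch = rows_to_batch(rows)
inputs, targets = split_pairs(batch)
print(inputs)
print(len(targets))
```

The tokenizer then receives two plain lists of strings of equal length, one per language.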

In [None]:
## Tokenize the whole dataset
tokenized_datasets = ds.map(preprocess_function, remove_columns=['translation'], batched=True)


In [None]:
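# Drop any model left over from a previous run before reloading.
# Unlike a bare `del model`, this does not raise NameError on a fresh session.
globals().pop("model", None)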

In [None]:
# Load model
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-hi.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [None]:
## Hyperparameters
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 3  # matches the epochs=3 used in the model.fit call

In [None]:
# Pass the loaded model object (not the checkpoint name) so the collator can use it
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [None]:
# Same collator, but padding lengths to a multiple of 8 for generation
generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=8)
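Sequences in a batch have different lengths, so the collator pads them to a common length before they are stacked into tensors. The following is a minimal sketch of that idea in plain Python, not the library's implementation. It assumes right padding, a label padding value of -100 so that padded positions are ignored by the loss, and a pad token id of 61949, which is inferred from the `bad_words_ids` entry of the saved generation config rather than confirmed. `pad_batch` is a hypothetical helper.

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100, pad_to_multiple_of=None):
    """Right-pad input_ids, attention_mask and labels to a common length."""

    def target_len(seqs):
        longest = max(len(s) for s in seqs)
        if pad_to_multiple_of:
            # Round up to the next multiple of pad_to_multiple_of.
            longest = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
        return longest

    in_len = target_len([f["input_ids"] for f in features])
    lab_len = target_len([f["labels"] for f in features])

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = in_len - len(f["input_ids"])
        n_lab = lab_len - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch


# Token ids taken from the preprocess_function test output shown earlier.
features = [
    {"input_ids": [3872, 85, 2501, 132, 15441, 36398, 0],
     "attention_mask": [1] * 7,
     "labels": [63, 2025, 18, 16155, 346, 20311, 24, 2279, 679, 0]},
    {"input_ids": [32643, 28541, 36253, 0],
     "attention_mask": [1] * 4,
     "labels": [26618, 16155, 346, 33383, 0]},
]

PAD_ID = 61949  # assumed pad token id, see the note above
padded = pad_batch(features, pad_token_id=PAD_ID)
padded8 = pad_batch(features, pad_token_id=PAD_ID, pad_to_multiple_of=8)
print(padded["attention_mask"][1])
print(len(padded8["input_ids"][0]), len(padded8["labels"][0]))
```

With `pad_to_multiple_of=8`, the 7-token input grows to 8 and the 10-token label sequence to 16, which is the trade-off: slightly more padding in exchange for tensor shapes that vary less between batches.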

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1659083
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 520
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2507
    })
})

In [None]:
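## Prepare the training dataset
# NOTE: this trains on the 2,507-row "test" split, presumably to keep the run short.
# Use tokenized_datasets["train"] for a full training run.
train_dataset = model.prepare_tf_dataset(
    dataset=tokenized_datasets["test"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)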

In [None]:
## Prepare the validation dataset
validation_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
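The number of batches each dataset yields per epoch can be estimated with a little arithmetic. This sketch assumes that `prepare_tf_dataset` drops the last incomplete batch when `shuffle=True` and keeps it when `shuffle=False`, which is believed to be its default `drop_remainder` behaviour; check the library documentation before relying on it. `batches_per_epoch` is a hypothetical helper, and the row counts are those of the splits passed to the two calls above.

```python
import math


def batches_per_epoch(num_rows, batch_size, drop_remainder):
    """Number of batches produced from num_rows examples."""
    if drop_remainder:
        # The final partial batch is discarded.
        return num_rows // batch_size
    # The final partial batch is kept.
    return math.ceil(num_rows / batch_size)


train_steps = batches_per_epoch(2507, 16, drop_remainder=True)
val_steps = batches_per_epoch(520, 16, drop_remainder=False)
print(train_steps, val_steps)  # 156 33
```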

## Train model

In [None]:
# Setup the optimizer
optimizer=AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)
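To show what the `weight_decay_rate` argument contributes, here is a minimal sketch of one Adam step with decoupled weight decay on a single scalar parameter. It is a generic textbook-style update, not necessarily what `AdamWeightDecay` does internally (the library class can, for example, exclude some parameters from decay). `adamw_step` and the `beta1`, `beta2` and `eps` defaults are assumptions made for illustration.

```python
import math


def adamw_step(param, grad, m, v, step, lr=2e-5, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update with decoupled weight decay for a scalar parameter."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    adam_update = m_hat / (math.sqrt(v_hat) + eps)
    # The decay term is added to the update directly, not to the gradient.
    param = param - lr * (adam_update + weight_decay * param)
    return param, m, v


# With a zero gradient only the decay term acts on the parameter.
decayed, _, _ = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)
# With a positive gradient the parameter moves down by roughly lr.
stepped, _, _ = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(decayed, stepped)
```

With the values used in this notebook (`learning_rate=2e-5`, `weight_decay=0.01`), the decay shrinks each weight by a factor of `1 - 2e-7` per step, so it acts as a mild regulariser.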

In [None]:
# Fit the model
model.fit(train_dataset,validation_data=validation_dataset,epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7be428e13af0>

In [None]:
## Save the model
model.save_pretrained("model")

Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[61949]], 'forced_eos_token_id': 0}


## Test model

In [None]:
tokenizer=AutoTokenizer.from_pretrained(model_checkpoint)
model=TFAutoModelForSeq2SeqLM.from_pretrained('model')

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [None]:
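# Note: "ca fonctionne" is French ("it works"), but this checkpoint translates
# English to Hindi, so the output is not expected to be a meaningful translation.
input_text = "ca fonctionne "
tokenized = tokenizer(input_text, return_tensors="tf")
out = model.generate(**tokenized, max_length=128)
print(out)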

tf.Tensor([[61949     6   314 12645 11273     0]], shape=(1, 6), dtype=int32)


In [None]:
# Decoding does not need the deprecated `as_target_tokenizer()` context manager
print(tokenizer.decode(out[0], skip_special_tokens=True))

केन्शनेन
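The generated tensor printed earlier starts with 61949 and ends with 0, and neither appears in the decoded text because `skip_special_tokens=True` removes special tokens. A minimal sketch of that filtering at the id level, assuming 61949 is the pad token (also used as the decoder start token) and 0 is the end-of-sequence token, as the saved generation config suggests. `strip_special_ids` is a hypothetical helper, not a tokenizer method.

```python
def strip_special_ids(token_ids, special_ids):
    """Remove special token ids, keeping the remaining ids in order."""
    return [t for t in token_ids if t not in special_ids]


generated = [61949, 6, 314, 12645, 11273, 0]  # ids from the generate() output
SPECIAL_IDS = {61949, 0}  # assumed pad and end-of-sequence ids

content_ids = strip_special_ids(generated, SPECIAL_IDS)
print(content_ids)  # [6, 314, 12645, 11273]
```

Only the four remaining ids are mapped back to text by the tokenizer.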
