!pip install transformers datasets sentencepiece -q:

This line installs the necessary libraries from the Python Package Index (PyPI).

transformers:

This library provides pre-trained models and tools for working with transformer models, including the MarianMT model we're using.

datasets:

This library provides tools for loading and processing datasets; here it is used to handle the translation data.

sentencepiece:

This is a library for text tokenization, which is used by the Marian tokenizer.

-q: This flag makes the installation quiet, suppressing detailed output.

In [None]:
# Fine-Tune French to English Translation Model Using Hugging Face Transformers

## 📦 Install Required Libraries
!pip install transformers datasets sentencepiece -q

## 📁 Load Dataset
import pandas as pd
from datasets import Dataset

from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/eng_-french.csv')

Mounted at /content/drive


Load and Prepare Dataset

df = df.rename(columns={"English words/sentences": "en", "French words/sentences": "fr"}):
- Renames the columns of the DataFrame to simpler names "en" and "fr".

df = df[["fr", "en"]]:
- Selects only the "fr" and "en" columns, discarding any other columns in the original CSV.

df["translation"] = df.apply(lambda row: {"fr": row["fr"], "en": row["en"]}, axis=1):

- Creates a new column named "translation" where each row contains a dictionary with the French and English sentences. This format is suitable for the datasets library.

hf_dataset = Dataset.from_pandas(df[["translation"]]):

- Converts the "translation" column of the pandas DataFrame into a datasets.Dataset object, the format that the transformers trainer accepts as training data.
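
To make the target shape concrete, here is a minimal sketch that builds the same nested records from a tiny in-memory CSV using only the standard library. The two sample rows and the helper name `to_translation_records` are made up for illustration; the notebook itself does this step with pandas and `Dataset.from_pandas`.

```python
import csv
import io

# Two made-up rows using the same column headers as the original CSV.
SAMPLE_CSV = (
    "English words/sentences,French words/sentences\n"
    "Hi.,Salut!\n"
    "Run!,Cours !\n"
)

def to_translation_records(csv_text):
    """Return one {"translation": {"fr": ..., "en": ...}} record per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "translation": {
                "fr": row["French words/sentences"],
                "en": row["English words/sentences"],
            }
        }
        for row in reader
    ]

records = to_translation_records(SAMPLE_CSV)
print(records[0])  # {'translation': {'fr': 'Salut!', 'en': 'Hi.'}}
```

Each record holds a single "translation" field whose value is a dictionary keyed by language code, which is the layout the preprocessing step later indexes into.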


In [None]:
df = df.rename(columns={
    "English words/sentences": "en",
    "French words/sentences": "fr"
})
df = df[["fr", "en"]]
df["translation"] = df.apply(lambda row: {"fr": row["fr"], "en": row["en"]}, axis=1)
hf_dataset = Dataset.from_pandas(df[["translation"]])

Load Tokenizer and Model

* from transformers import MarianTokenizer, MarianMTModel:

  - Imports the MarianTokenizer and MarianMTModel classes from the transformers library.


* model_name = "Helsinki-NLP/opus-mt-fr-en":

  - Defines the name of the pre-trained checkpoint we will use: a MarianMT model from Helsinki-NLP that is already trained for French-to-English translation and that this notebook fine-tunes further.

* tokenizer = MarianTokenizer.from_pretrained(model_name):

  - Loads the tokenizer associated with the specified pre-trained model. The tokenizer is responsible for converting text into numerical tokens that the model can understand.

* model = MarianMTModel.from_pretrained(model_name):

  - Loads the pre-trained MarianMT model itself.

In [None]:
from transformers import MarianTokenizer, MarianMTModel

model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

Tokenize and subset Dataset

* max_length = 128:
  
  - Sets the maximum length, in tokens, for the tokenized sequences. Sentences longer than this will be truncated, and shorter ones will be padded.

* def preprocess(example): ...:

  - Defines a function to preprocess each example in the dataset.
  It tokenizes both the French input (example["translation"]["fr"]) and the English target (example["translation"]["en"]).

* truncation=True:

  - Truncates sequences longer than max_length.

* padding="max_length":

  - Pads sequences shorter than max_length to the maximum length.

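* inputs["labels"] = targets["input_ids"]:
  - Adds the tokenized target English sentences as "labels", the field the transformers trainer uses to compute the loss for sequence-to-sequence models. Two caveats apply to this cell as written. First, because the targets are padded to max_length, the padding positions stay in the labels and count toward the loss unless they are replaced with -100. Second, calling the tokenizer this way encodes the English text with the source-side SentencePiece model, whereas passing it as text_target=... selects the target-side one.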

* subset_size = 10000:

  - Defines the size of the subset of the dataset to use.

* hf_dataset_subset = hf_dataset.select(range(subset_size)):

  - Creates a new dataset object containing only the first subset_size examples from the original hf_dataset.

* tokenized_dataset = hf_dataset_subset.map(preprocess):

  - Applies the preprocess function to each example in the subsetted dataset, creating the tokenized_dataset ready for training.
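
The effect of truncation and max-length padding on a single sequence, together with the common practice of replacing padding in the labels with -100 so the loss ignores it, can be sketched without the tokenizer. The token IDs, `PAD_ID = 0`, and both helper names are hypothetical; the real pad ID comes from `tokenizer.pad_token_id`, and the real tokenizer also handles special tokens such as end-of-sequence, which this sketch leaves out. The `preprocess` function in this notebook does not perform the masking step.

```python
PAD_ID = 0           # hypothetical; the real value is tokenizer.pad_token_id
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut token_ids to max_length, then right-pad with pad_id up to max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding positions in the labels with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]

short = pad_or_truncate([17, 42, 5], max_length=6)
long = pad_or_truncate([17, 42, 5, 9, 11, 23, 8, 3], max_length=6)
labels = mask_padding(short)

print(short)   # [17, 42, 5, 0, 0, 0]
print(long)    # [17, 42, 5, 9, 11, 23]
print(labels)  # [17, 42, 5, -100, -100, -100]
```

With max_length = 128 and short sentences, most label positions are padding, so whether they are masked has a large effect on the reported loss.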

In [None]:
max_length = 128

def preprocess(example):
    inputs = tokenizer(example["translation"]["fr"], truncation=True, padding="max_length", max_length=max_length)
    targets = tokenizer(example["translation"]["en"], truncation=True, padding="max_length", max_length=max_length)
    inputs["labels"] = targets["input_ids"]
    return inputs

# Reduce dataset size for faster training (e.g., use the first 10000 examples)
subset_size = 10000
hf_dataset_subset = hf_dataset.select(range(subset_size))

tokenized_dataset = hf_dataset_subset.map(preprocess)


Trainer Setup

* from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq:

  - Imports the necessary classes for setting up the training process.

* training_args = Seq2SeqTrainingArguments(...):

  - Configures the training arguments. Key arguments here include the output directory, batch size, learning rate, number of epochs (set to 3), logging and saving steps, and mixed precision training (fp16=True), which requires a CUDA GPU runtime.

* data_collator = DataCollatorForSeq2Seq(tokenizer, model=model):
  - Initializes a data collator that the trainer uses to assemble batches during training. It pads each batch, fills any label padding it adds with -100 so that the loss ignores it, and, because model=model is passed, builds the decoder inputs from the labels.
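
As a quick sanity check on these settings, the total number of optimizer steps can be estimated from the subset size, batch size, and epoch count. This sketch assumes a single device and no gradient accumulation (neither is configured in the arguments); the helper name `total_training_steps` is made up for illustration.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

steps = total_training_steps(num_examples=10_000, batch_size=4, num_epochs=3)
checkpoints_written = steps // 50  # save_steps=50
log_lines = steps // 10            # logging_steps=10

print(steps)                # 7500
print(checkpoints_written)  # 150 (only the latest 2 are kept: save_total_limit=2)
print(log_lines)            # 750
```

Saving every 50 steps is frequent for a run of this length; a larger save_steps value would reduce the time spent writing checkpoints.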

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./finetuned-marian-fr-en",
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=3, # Reduced number of epochs
    logging_steps=10,
    save_steps=50,
    save_total_limit=2,
    push_to_hub=False,        # ✅ Don't push to Hugging Face Hub
    report_to=[],              # ✅ Disable any logging (like WandB or HF)
    fp16=True # Enable mixed precision training
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Train the Model

* trainer = Seq2SeqTrainer(...):
  - Creates an instance of the Seq2SeqTrainer, passing the model, training arguments, tokenized training dataset, tokenizer, and data collator.

* trainer.train():
  - Starts the fine-tuning process. The trainer will iterate through the tokenized_dataset for the specified number of epochs, performing the training steps.

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()

Step,Training Loss
10,5.0042
20,0.23
30,0.1696
40,0.1468
50,0.124
60,0.1227
70,0.1111
80,0.1083
90,0.1053
100,0.0949




TrainOutput(global_step=7500, training_loss=0.03197706652730704, metrics={'train_runtime': 2685.459, 'train_samples_per_second': 11.171, 'train_steps_per_second': 2.793, 'total_flos': 1016950947840000.0, 'train_loss': 0.03197706652730704, 'epoch': 3.0})
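
The reported metrics can be cross-checked with a little arithmetic: throughput multiplied by runtime should roughly recover the number of steps and samples processed. The values below are copied from the `TrainOutput` above; the variable names are only for illustration, and the check assumes a single device.

```python
# Values copied from the TrainOutput above.
global_step = 7500
train_runtime = 2685.459   # seconds
samples_per_second = 11.171
steps_per_second = 2.793
batch_size = 4             # per_device_train_batch_size

# 7500 steps of 4 samples each: 10000 examples seen 3 times.
samples_from_steps = global_step * batch_size
samples_from_throughput = samples_per_second * train_runtime
steps_from_throughput = steps_per_second * train_runtime

print(samples_from_steps)
print(f"{samples_from_throughput:.0f} samples, {steps_from_throughput:.0f} steps")
print(f"{train_runtime / 60:.1f} minutes, {train_runtime / global_step:.3f} s per step")
```

The two estimates agree to within rounding of the reported throughput figures.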

Save the Model

* model.save_pretrained("/mnt/data/french_to_english_model"):
  - Saves the fine-tuned model to the specified directory.

* tokenizer.save_pretrained("/mnt/data/french_to_english_model"):
  - Saves the tokenizer to the same directory. This is important so you can load and use the fine-tuned model later for inference.

In [None]:
model.save_pretrained("/mnt/data/french_to_english_model")
tokenizer.save_pretrained("/mnt/data/french_to_english_model")

('/mnt/data/french_to_english_model/tokenizer_config.json',
 '/mnt/data/french_to_english_model/special_tokens_map.json',
 '/mnt/data/french_to_english_model/vocab.json',
 '/mnt/data/french_to_english_model/source.spm',
 '/mnt/data/french_to_english_model/target.spm',
 '/mnt/data/french_to_english_model/added_tokens.json')

Test Translation

* def translate(text): ...:

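  - Defines a function that translates a French text string into English. Note that the cell below reloads the original Helsinki-NLP/opus-mt-fr-en checkpoint; to test the fine-tuned weights instead, point model_name at the directory used in the save step.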

* tokenizer([text], return_tensors="pt", ...):
  - Tokenizes the input text.
  
* return_tensors="pt":
  - Specifies that the output should be PyTorch tensors.

* translated = model.generate(**inputs):

  - Uses the model's generate method to produce the translated output sequence based on the tokenized input.

* tokenizer.decode(translated[0], skip_special_tokens=True):
  - Decodes the generated token IDs back into a human-readable string, skipping any special tokens used by the model.

* print(...):
  - Prints example translations using the defined translate function.
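
What skip_special_tokens=True does during decoding can be illustrated with a toy vocabulary. Everything here (the IDs, the `TOY_VOCAB` table, and the `toy_decode` helper) is made up for illustration and is much simpler than the real SentencePiece-based `MarianTokenizer.decode`.

```python
# Hypothetical vocabulary; real Marian vocabularies have tens of thousands of entries.
TOY_VOCAB = {0: "</s>", 1: "<pad>", 2: "I", 3: "am", 4: "tired", 5: "."}
SPECIAL_IDS = {0, 1}  # end-of-sequence and padding

def toy_decode(token_ids, skip_special_tokens=False):
    """Map IDs back to tokens, optionally dropping special tokens."""
    tokens = [
        TOY_VOCAB[i]
        for i in token_ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(tokens)

# Hypothetical generated sequence: a start token, the words, then end-of-sequence.
generated = [1, 2, 3, 4, 5, 0]
print(toy_decode(generated))                            # <pad> I am tired . </s>
print(toy_decode(generated, skip_special_tokens=True))  # I am tired .
```

The real tokenizer additionally merges subword pieces and restores spacing, which is why its output reads as a normal sentence rather than space-separated tokens.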

In [3]:
from transformers import MarianTokenizer, MarianMTModel

print("Transformers imported successfully.") # Added for debugging

model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text):
    inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    # Move input tensors to the same device as the model
    inputs = {key: value.to(model.device) for key, value in inputs.items()}
    translated = model.generate(**inputs)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

print("FR: Salut! --> EN:", translate("Salut!"))
print("FR: Comment ça va ? --> EN:", translate("Comment ça va ?"))
print("FR: Je suis fatigué. --> EN:", translate("Je suis fatigué."))
print("FR: C'est une belle journée. --> EN:", translate("C'est une belle journée."))
print("FR: J'adore apprendre le français. --> EN:", translate("J'adore apprendre le français."))

Transformers imported successfully.




FR: Salut! --> EN: Hello!
FR: Comment ça va ? --> EN: How are you?
FR: Je suis fatigué. --> EN: I'm tired.
FR: C'est une belle journée. --> EN: It's a nice day.
FR: J'adore apprendre le français. --> EN: I love learning French.
