# **1. Importing Libraries and Initial Setup**

In [8]:
!pip install datasets transformers
!pip install evaluate
!pip install rouge_score

from transformers import (
    Trainer,
    TrainingArguments,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq
)

from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
    TrainingArguments, Trainer
)

import evaluate
import torch
import numpy as np
import pandas as pd
import nltk
nltk.download("punkt")
import time
from scipy.stats import uniform
from transformers import T5ForConditionalGeneration, T5Tokenizer, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import os
from random import choice
from evaluate import load as load_metric
from datasets import load_dataset



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In this part of our project, we set up the environment and imported all the tools we needed to build and train our T5-based grammar correction model. First, we installed the essential Python packages like transformers and datasets for loading pre-trained models and datasets, evaluate for measuring model performance, and rouge_score for evaluating text outputs in tasks like grammar correction.

Next, we imported key classes from Hugging Face, such as Trainer and TrainingArguments, which help us structure and manage the training process. AutoModelForSeq2SeqLM and T5ForConditionalGeneration allow us to load and fine-tune a sequence-to-sequence model, while AutoTokenizer ensures our text data is properly processed into tokens that the model can understand. We also included DataCollatorForSeq2Seq to batch and format our data correctly during training.

Additionally, we brought in supporting libraries like torch for model operations, numpy and pandas for data handling, nltk for natural language processing tasks, and scipy for statistical operations. We downloaded the NLTK “punkt” tokenizer to help with sentence tokenization. Finally, we imported utility modules like os, time, and random for file handling, timing experiments, and random selections in our workflow.

Overall, this setup ensures that our project has all the necessary tools and resources, allowing us to focus on training, evaluating, and fine-tuning our grammar correction model efficiently.

# **2. Data Loading, Subsetting, Tokenizer/Model Initialization, and Tokenization Function**

In [None]:
dataset = load_dataset("juancavallotti/bea-19-corruption")

# ✅ Use only small subset to avoid high memory use
train_data = dataset["train"].select(range(5000))
eval_data = dataset["train"].select(range(3000, 3500))

print(f"Train size: {len(train_data)} | Eval size: {len(eval_data)}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/255 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


dataset_infos.json: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/7.41M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/84106 [00:00<?, ? examples/s]

Train size: 1000 | Eval size: 50


In this part of our project, we loaded the dataset that we will use to train and evaluate our grammar correction model. We chose the "juancavallotti/bea-19-corruption" dataset because it contains sentences with grammatical errors paired with their corrected versions, which is perfect for our task.

To manage memory efficiently and ensure our training runs smoothly, we selected only a smaller subset of the data. Specifically, we took the first 5,000 samples for training and a smaller slice of 500 samples (from index 3,000 to 3,500) for evaluation. This allows us to experiment and test our model without running into performance issues. Finally, we printed the sizes of our training and evaluation datasets to confirm that the subsets were correctly created.

In [None]:
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

In this cell, we set up the core of our grammar correction model by choosing the "t5-small" pre-trained model. We used AutoTokenizer to convert our text data into tokens that the model can understand, and AutoModelForSeq2SeqLM to load the T5 model itself. This setup allows us to leverage a pre-trained sequence-to-sequence model, which we can then fine-tune on our dataset to correct grammatical errors efficiently.

In [None]:
def tokenize_function(examples):
    tokenized_inputs = tokenizer(examples['sentence'], truncation=True, padding='max_length', max_length=256)
    tokenized_labels = tokenizer(examples['broken'], truncation=True, padding='max_length', max_length=256)
    tokenized_inputs['labels'] = tokenized_labels['input_ids']
    return tokenized_inputs

In [None]:
tokenized_train_data = train_data.map(tokenize_function, batched=True)
tokenized_eval_data = eval_data.map(tokenize_function, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In this part, we defined a function to prepare our dataset for the model. The tokenize_function takes each example from the dataset and converts both the input sentence and its corrected version into token IDs using the tokenizer we set up earlier. We applied truncation and padding to ensure all sequences have the same length (maximum 256 tokens), which is important for efficient batch processing. Finally, we added the tokenized corrected sentences as the labels for the model, so it knows what output to learn during training. This step essentially formats our data so it’s ready for the T5 model to process.

# **3. Evaluation Metrics and Device Configuration**

In [None]:
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    bleu_score = bleu.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])["score"]
    rouge_score = rouge.compute(predictions=decoded_preds, references=decoded_labels)

    return {
        "bleu": round(bleu_score, 2),
        "rouge1": round(rouge_score["rouge1"], 2),
        "rougeL": round(rouge_score["rougeL"], 2)
    }

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

In this cell, we set up the evaluation metrics to measure how well our model performs. We loaded the BLEU and ROUGE metrics using the evaluate library, which are commonly used to assess text generation tasks.

We then defined the compute_metrics function, which takes the model’s predictions and the true labels as input. First, we replaced any placeholder values (-100) with the tokenizer’s padding token to make sure decoding works properly. Next, we decoded both the predicted and true token IDs back into readable text.

Finally, we calculated the BLEU score to measure the overlap of predicted sentences with the reference text, and the ROUGE scores (ROUGE-1 and ROUGE-L) to evaluate the quality of generated text in terms of precision, recall, and sequence matching. We rounded the results for easier interpretation. This function allows us to quantitatively track our model’s performance during training and evaluation.

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

Using GPU: Tesla T4


In this part, we checked whether a GPU is available for training. If a GPU is detected, we set the device to "cuda" and print the GPU’s name (Tesla T4), which allows our model to train much faster. If a GPU isn’t available, we default to using the CPU. This ensures our code can run on any machine while taking advantage of hardware acceleration when possible.

# **4. Hyperparameter Distribution and Random Search Loop Initialization**

In [None]:
log_columns = [
    'learning_rate', 'batch_size', 'num_train_epochs',
    'gradient_accumulation_steps',
    'max_grad_norm',
    'adam_epsilon',
    'warmup_steps',
    'weight_decay',
    'label_smoothing',

    # METRICS
    'eval_loss', 'bleu', 'rouge', 'time_taken'
]

experiment_log = pd.DataFrame(columns=log_columns)


In this cell, we set up a logging system to keep track of our training experiments. We defined a list of important columns, including model hyperparameters like learning_rate, batch_size, num_train_epochs, and optimization settings such as gradient_accumulation_steps, max_grad_norm, and weight_decay. We also included evaluation metrics (eval_loss, bleu, rouge) and time_taken to monitor performance and efficiency.

By creating an empty DataFrame with these columns, we can systematically record the results of each experiment. This helps our group compare different training configurations and track which settings produce the best grammar correction results.

In [None]:
param_dist = {
    'learning_rate': uniform(loc=1e-5, scale=1e-4 - 1e-5),      # Learning rate between 1e-5 and 1e-4
    'batch_size': [1, 2, 4, 8],                                 # Batch sizes to try
    'num_train_epochs': [2, 3, 4],                              # Number of epochs
    'gradient_accumulation_steps': [1, 2, 4, 8],                # Accumulate gradients
    'max_grad_norm': uniform(loc=0.5, scale=1.5),               # Gradient clipping between 0.5 and 2.0
    'adam_epsilon': uniform(loc=1e-8, scale=1e-7 - 1e-8),       # Epsilon between 1e-8 and 1e-7
    'warmup_steps': [0, 300, 600],                              # Warmup steps
    'weight_decay': uniform(loc=0.0, scale=0.1),                # Weight decay between 0.0 and 0.1
    'label_smoothing': uniform(loc=0.0, scale=0.1),             # Label smoothing between 0.0 and 0.1
}


In this cell, we defined a range of hyperparameters to explore during training. We used uniform distributions for continuous values like learning_rate, max_grad_norm, adam_epsilon, weight_decay, and label_smoothing, allowing us to sample values within specified ranges. For discrete parameters like batch_size, num_train_epochs, and gradient_accumulation_steps, we listed the specific options we wanted to try.

This setup allows our group to experiment with different configurations systematically and find the combination of hyperparameters that gives the best performance for our grammar correction model. Essentially, it’s preparing the model for a controlled hyperparameter search.

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    logging_dir='./logs',
    logging_steps=500,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    report_to="none",
)

# Trainer Setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_eval_data,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        padding=True,
        return_tensors="pt"
    )
)

  trainer = Trainer(


In this cell, we set up the training process for our model using Hugging Face’s Trainer API. First, we defined TrainingArguments to specify how training should be conducted. We set the output directory for model checkpoints and logs, determined that evaluation should occur at the end of each epoch, configured logging frequency, and set basic hyperparameters like the number of epochs and batch sizes.

Next, we initialized the Trainer by passing in our model, training arguments, tokenized datasets, tokenizer, and a DataCollatorForSeq2Seq. The data collator ensures that input and output sequences are correctly padded and converted into PyTorch tensors for efficient batch processing. This setup allows our group to train the T5 model in a structured and reproducible way, while also handling evaluation and tokenization automatically.

In [None]:
best_score = np.inf
best_params = {}

In this cell, we initialized variables to keep track of the best experiment results during hyperparameter tuning. best_score is set to infinity initially, so that any evaluation loss from our experiments will be lower and can replace it. best_params is an empty dictionary that will store the hyperparameter configuration that achieves the best performance. This setup allows our group to systematically track and save the optimal settings for our grammar correction model.

In [None]:
model_save_path = './best_model'
os.makedirs(model_save_path, exist_ok=True)

In this cell, we created a folder to save our best-performing model. By setting model_save_path to './best_model' and using os.makedirs with exist_ok=True, we ensure that the directory is created if it doesn’t already exist. This allows our group to safely store the trained model for later use, evaluation, or deployment without overwriting previous results.

# **5. Main Random Search Training and Logging Loop**

In [None]:
for _ in range(10):  # 10 trials for random search
    # Randomly sample hyperparameters
    current_params = {
        'learning_rate': choice(param_dist['learning_rate'].rvs(size=1)),
        'batch_size': choice(param_dist['batch_size']),
        'num_train_epochs': choice(param_dist['num_train_epochs']),
        'gradient_accumulation_steps': choice(param_dist['gradient_accumulation_steps']),
        'max_grad_norm': choice(param_dist['max_grad_norm'].rvs(size=1)),
        'adam_epsilon': choice(param_dist['adam_epsilon'].rvs(size=1)),
        'warmup_steps': choice(param_dist['warmup_steps']),
        'weight_decay': choice(param_dist['weight_decay'].rvs(size=1)),
        'label_smoothing': choice(param_dist['label_smoothing'].rvs(size=1)),
    }

    # Training Arguments
    training_args = TrainingArguments(
        output_dir='./results',
        eval_strategy="epoch",
        logging_dir='./logs',
        logging_steps=500,

        num_train_epochs=current_params['num_train_epochs'],
        per_device_train_batch_size=current_params['batch_size'],
        per_device_eval_batch_size=current_params['batch_size'],
        gradient_accumulation_steps=current_params['gradient_accumulation_steps'],

        max_grad_norm=current_params['max_grad_norm'],
        adam_epsilon=current_params['adam_epsilon'],
        warmup_steps=current_params['warmup_steps'],
        weight_decay=current_params['weight_decay'],
        label_smoothing_factor=current_params['label_smoothing'],

        report_to="none",
    )

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_data,
        eval_dataset=tokenized_eval_data,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)
    )

    # Measure time for training and evaluation
    start_time = time.time()
    trainer.train()
    eval_results = trainer.evaluate()

    # Get predictions for metrics
    predictions_output = trainer.predict(tokenized_eval_data)
    predictions = predictions_output.predictions[0] if isinstance(predictions_output.predictions, tuple) else predictions_output.predictions
    predicted_token_ids = np.argmax(predictions, axis=-1)
    decoded_preds = tokenizer.batch_decode(predicted_token_ids, skip_special_tokens=True)

    labels = eval_data['broken']
    references_for_metrics = [[label] for label in labels]

    bleu_result = bleu.compute(predictions=decoded_preds, references=references_for_metrics)
    bleu_score = bleu_result['bleu'] if bleu_result and 'bleu' in bleu_result else 0.0

    rouge_result = rouge.compute(predictions=decoded_preds, references=references_for_metrics)
    rouge_score = rouge_result

    eval_loss = eval_results['eval_loss']
    time_taken = time.time() - start_time

    new_row_df = pd.DataFrame([
        {
            'learning_rate': current_params['learning_rate'],
            'batch_size': current_params['batch_size'],
            'num_train_epochs': current_params['num_train_epochs'],
            'gradient_accumulation_steps': current_params['gradient_accumulation_steps'],
            'max_grad_norm': current_params['max_grad_norm'],
            'adam_epsilon': current_params['adam_epsilon'],
            'warmup_steps': current_params['warmup_steps'],
            'weight_decay': current_params['weight_decay'],
            'label_smoothing': current_params['label_smoothing'],

            'eval_loss': eval_loss,
            'bleu': bleu_score,
            'rouge': rouge_score,
            'time_taken': time_taken
        }
    ])

    experiment_log = pd.concat([experiment_log, new_row_df], ignore_index=True)

    if eval_loss < best_score:
        best_score = eval_loss
        trainer.save_model(model_save_path)


Epoch,Training Loss,Validation Loss
1,No log,0.312978
2,0.858300,0.295219
3,0.858300,0.287362
4,0.314400,0.286188


  experiment_log = pd.concat([experiment_log, new_row_df], ignore_index=True)


Epoch,Training Loss,Validation Loss
1,No log,0.275579
2,0.296100,0.27215
3,0.296100,0.268592
4,0.285900,0.26847


Epoch,Training Loss,Validation Loss
1,No log,0.26465
2,0.275000,0.263263
3,0.275000,0.260381
4,0.272700,0.260292


Epoch,Training Loss,Validation Loss
1,No log,0.261308
2,0.264400,0.258689
3,0.264400,0.256638
4,0.265100,0.256578


Epoch,Training Loss,Validation Loss
1,No log,0.257478
2,0.259600,0.256111
3,0.259600,0.254673
4,0.260900,0.254373


Epoch,Training Loss,Validation Loss
1,No log,0.255704
2,0.257300,0.255213
3,0.257300,0.254111
4,0.258600,0.253266


Epoch,Training Loss,Validation Loss
1,No log,0.254534
2,0.255900,0.253575
3,0.255900,0.253013
4,0.257200,0.252276


Epoch,Training Loss,Validation Loss
1,No log,0.253347
2,0.255000,0.252524
3,0.255000,0.25188
4,0.256200,0.251672


Epoch,Training Loss,Validation Loss
1,No log,0.253672
2,0.254400,0.252879
3,0.254400,0.251495
4,0.255500,0.251234


Epoch,Training Loss,Validation Loss
1,No log,0.252719
2,0.254100,0.253383
3,0.254100,0.251541
4,0.255100,0.251007


In this part of the project, we implemented a random search over hyperparameters to find the best configuration for our grammar correction model. We ran 10 trials, and in each trial, we randomly sampled values for key hyperparameters like learning_rate, batch_size, num_train_epochs, gradient_accumulation_steps, max_grad_norm, adam_epsilon, warmup_steps, weight_decay, and label_smoothing.

For each sampled configuration, we updated the TrainingArguments and initialized a new Trainer with the tokenized training and evaluation datasets. We then measured the time it took to train and evaluate the model, and computed evaluation metrics including eval_loss, BLEU, and ROUGE scores by decoding the model’s predictions and comparing them to the reference labels.

After each trial, we logged the hyperparameters, evaluation metrics, and training time into our experiment_log DataFrame. This allowed us to systematically track the results of all experiments. Finally, if a trial produced a lower evaluation loss than any previous trial, we updated best_score and saved the model to our best_model directory.

This approach allowed our group to efficiently explore different hyperparameter combinations, identify the best-performing model, and keep a detailed record of all experiments for analysis and reproducibility.

# **6. Saving Log File and Reporting Best Run**

In [None]:
# Save the log to Excel
excel_file_path = 'aether_hyperparameter_tuning_log.xlsx'
experiment_log.to_excel(excel_file_path, index=False)

In this final step, we saved the experiment log containing all hyperparameter trials and their corresponding evaluation metrics to an Excel file. By exporting the experiment_log DataFrame to 'aether_hyperparameter_tuning_log.xlsx', we ensured that our group has a permanent, organized record of all experiments. This makes it easy to review results, compare different configurations, and share findings with the team or include them in our project report.

In [None]:
if not experiment_log.empty:
    best_run = experiment_log.loc[experiment_log['eval_loss'].idxmin()]
    print(f"Best Hyperparameters: {best_run}")
else:
    print("Experiment log is empty. Please run the hyperparameter tuning loop first to populate the log.")

Best Hyperparameters: learning_rate                                                           0.000044
batch_size                                                                     1
num_train_epochs                                                               4
gradient_accumulation_steps                                                    4
max_grad_norm                                                           1.926071
adam_epsilon                                                                 0.0
warmup_steps                                                                   0
weight_decay                                                            0.059866
label_smoothing                                                         0.015602
eval_loss                                                               0.251007
bleu                                                                    0.981132
rouge                          {'rouge1': 0.9886173474091828, 'rouge2': 0.976...
time_t

In this cell, we retrieved and displayed the best hyperparameter configuration from our experiment log. We first checked if the log is not empty, and then identified the row with the lowest evaluation loss (eval_loss). This row represents the hyperparameters that produced the best-performing model. By printing best_run, our group can quickly see which settings led to the most effective grammar correction results. If the log is empty, it reminds us to run the hyperparameter tuning loop first before analyzing the results.

In [None]:
from google.colab import files
files.download(excel_file_path)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In this final step, we downloaded the Excel file containing our hyperparameter tuning log from Google Colab to our local machine. By using files.download(excel_file_path), our group ensured that we have a local copy of all experiment results, making it easy to review, share, or include in our project documentation and analysis.

# **7. Custom Dataset Definition and Inference Setup**

In [None]:
data = {
    'sentence': [
        "I like play soccer.",
        "She are going to the store.",
        "He dont know the answer.",
        "It raining outside.",
        "They was happy about the news.",
        "She can sings very well.",
        "The dog chased it tail.",
        "We was waiting for the bus.",
        "My mother is a doctor she works hard.",
        "I did not done my homework."
    ],
    'broken': [
        "I like to play soccer.",
        "She is going to the store.",
        "He doesn't know the answer.",
        "It is raining outside.",
        "They were happy about the news.",
        "She can sing very well.",
        "The dog chased its tail.",
        "We were waiting for the bus.",
        "My mother is a doctor; she works hard.",
        "I did not do my homework."
    ]
}

In this cell, we created a small example dataset to test our grammar correction model. The sentence list contains sentences with grammatical errors, while the broken list contains their corrected versions. This sample data allows our group to quickly validate the model’s performance and demonstrate its ability to correct common grammar mistakes before using it on larger datasets.

In [None]:
from datasets import Dataset

custom_dataset = Dataset.from_dict(data)

model_save_path = './best_model'
model = T5ForConditionalGeneration.from_pretrained(model_save_path)
tokenizer = T5Tokenizer.from_pretrained(model_save_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def tokenize_function(examples):
    tokenized_inputs = tokenizer(examples['sentence'], truncation=True, padding='max_length', max_length=256)
    tokenized_labels = tokenizer(examples['broken'], truncation=True, padding='max_length', max_length=256)
    tokenized_inputs['labels'] = tokenized_labels['input_ids']
    return tokenized_inputs

In this part of the project, we set up the model and dataset for testing and inference. First, we converted our small example data into a Hugging Face Dataset so it can be processed like our training data. Then, we loaded the best-performing model and tokenizer from the folder where we saved it ('./best_model') and moved the model to the available device (GPU if available, otherwise CPU).

We also reused the tokenize_function to prepare the input sentences and corresponding corrected sentences for the model. This ensures that the data is tokenized consistently with how the model was trained, including proper truncation, padding, and labeling. This setup allows our group to test the model’s performance on custom inputs efficiently.

# **8. Running Inference and Calculating Test Metrics**

In [None]:
custom_tokenized = custom_dataset.map(tokenize_function, batched=True)

inputs = custom_tokenized['sentence']
predictions = []

for sentence in inputs:
    input_ids = tokenizer.encode(sentence, return_tensors="pt").to(device)

    output = model.generate(input_ids, max_length=256, num_beams=5, early_stopping=True)

    prediction = tokenizer.decode(output[0], skip_special_tokens=True)
    predictions.append(prediction)

gold_standard = custom_tokenized['broken']

# Use the 'bleu' and 'rouge' objects loaded from 'evaluate' in cell RbviGiLWEtd4
# The variables 'bleu_score' and 'rouge_score' from the tuning loop are floats/dicts, not the metric objects.
test_bleu_result = bleu.compute(predictions=predictions, references=[[g] for g in gold_standard])
test_bleu = test_bleu_result['bleu']

test_rouge_result = rouge.compute(predictions=predictions, references=[[g] for g in gold_standard])
test_rouge = test_rouge_result

print(f"Test BLEU: {test_bleu}")
print(f"Test ROUGE: {test_rouge}")
print("-" * 50)
print("-" * 50)

for i, sentence in enumerate(inputs):
    print(f"Input Sentence (Broken): {sentence}")
    print(f"Model Output: {predictions[i]}")
    print(f"Gold Standard (Corrected): {gold_standard[i]}")
    print("-" * 50)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Test BLEU: 0.5685090550718696
Test ROUGE: {'rouge1': np.float64(0.8657142857142857), 'rouge2': np.float64(0.6342857142857142), 'rougeL': np.float64(0.8657142857142857), 'rougeLsum': np.float64(0.8655555555555555)}
--------------------------------------------------
--------------------------------------------------
Input Sentence (Broken): I like play soccer.
Model Output: I like play soccer.
Gold Standard (Corrected): I like to play soccer.
--------------------------------------------------
Input Sentence (Broken): She are going to the store.
Model Output: She are going to the store.
Gold Standard (Corrected): She is going to the store.
--------------------------------------------------
Input Sentence (Broken): He dont know the answer.
Model Output: He don't know the answer.
Gold Standard (Corrected): He doesn't know the answer.
--------------------------------------------------
Input Sentence (Broken): It raining outside.
Model Output: It raining outside.
Gold Standard (Corrected): It

In this part, we tested our trained grammar correction model on the custom dataset we created. First, we tokenized the input sentences using the same tokenize_function to ensure consistency with the training process.

We then looped through each input sentence, converted it to token IDs, and used the model’s generate method with beam search to produce corrected outputs. Each generated prediction was decoded back into human-readable text and stored in a list.

Next, we compared the model’s predictions with the gold-standard corrected sentences using BLEU and ROUGE metrics to evaluate performance. The results give us quantitative insight into how well the model corrects grammatical errors on these examples.

Finally, we printed each broken sentence, the model’s output, and the gold-standard corrected sentence. This allows our group to visually inspect the model’s corrections and see qualitative results alongside the evaluation metrics.

# **9. Final Retraining with Best Hyperparameters**

In [None]:
import pandas as pd
import os # Make sure os is imported for os.path.exists

print("Reloading original model and tokenizer...")
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load experiment_log from the Excel file
excel_file_path = 'aether_hyperparameter_tuning_log.xlsx'
if os.path.exists(excel_file_path):
    experiment_log = pd.read_excel(excel_file_path)
else:
    raise FileNotFoundError(f"Experiment log file not found at {excel_file_path}. Please ensure tuning was completed and the file exists.")

# Ensure best_run is defined
if not experiment_log.empty:
    best_run = experiment_log.loc[experiment_log['eval_loss'].idxmin()]
    print(f"Best Hyperparameters for retraining: {best_run}")
else:
    raise ValueError("Experiment log is empty. Cannot determine best hyperparameters for retraining.")

print("Configuring TrainingArguments with best hyperparameters...")
# Create new TrainingArguments with best_run hyperparameters
training_args_best = TrainingArguments(
    output_dir='./results_best_run',
    eval_strategy="no", # No evaluation during this final training run
    logging_dir='./logs_best_run',
    logging_steps=500,
    report_to="none",
    save_strategy="no", # Only save at the end explicitly

    num_train_epochs=int(best_run['num_train_epochs']),
    per_device_train_batch_size=int(best_run['batch_size']),
    gradient_accumulation_steps=int(best_run['gradient_accumulation_steps']),
    learning_rate=float(best_run['learning_rate']),
    max_grad_norm=float(best_run['max_grad_norm']),
    adam_epsilon=float(best_run['adam_epsilon']),
    warmup_steps=int(best_run['warmup_steps']),
    weight_decay=float(best_run['weight_decay']),
    label_smoothing_factor=float(best_run['label_smoothing']),
)

print("Initializing Trainer for final training...")
# Initialize a new Trainer with the reloaded model and best training arguments
trainer_best = Trainer(
    model=model,
    args=training_args_best,
    train_dataset=tokenized_train_data, # Use the full tokenized_train_data
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

print("Starting final model training with best hyperparameters...")
trainer_best.train()

print(f"Saving the retrained model to {model_save_path}...")
# Save the trained model
trainer_best.save_model(model_save_path)

print("Model retraining and saving complete.")

Reloading original model and tokenizer...
Best Hyperparameters for retraining: learning_rate                                                           0.000044
batch_size                                                                     1
num_train_epochs                                                               4
gradient_accumulation_steps                                                    4
max_grad_norm                                                           1.926071
adam_epsilon                                                                 0.0
warmup_steps                                                                   0
weight_decay                                                            0.059866
label_smoothing                                                         0.015602
eval_loss                                                               0.251007
bleu                                                                    0.981132
rouge                         

  trainer_best = Trainer(


Starting final model training with best hyperparameters...


Step,Training Loss
500,0.9355
1000,0.3194


Saving the retrained model to ./best_model...
Model retraining and saving complete.


In this part, we performed the final retraining of our model using the best hyperparameters found during our random search. First, we reloaded the original t5-small model and tokenizer, and then we loaded the experiment log from the Excel file containing all previous hyperparameter trials. From this log, we identified the best-performing configuration based on the lowest evaluation loss.

Next, we configured TrainingArguments using these optimal hyperparameters, ensuring the final training run reflects the best settings. We then initialized a new Trainer with the reloaded model, full training dataset, tokenizer, and data collator.

Finally, we trained the model one last time using the best hyperparameters and saved the retrained model to our designated folder. This ensures that our group has a finalized, fully trained model that incorporates all the insights from our hyperparameter tuning, ready for evaluation or deployment.

# **10. Final Inference and Evaluation after Retraining (Colab/Gdrive Specific)**

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In this cell, we mounted Google Drive to our Colab environment using drive.mount('/content/drive'). This allows our group to access files stored in Google Drive, such as datasets, experiment logs, or previously saved models, directly from Colab. Mounting Drive ensures we can read from and write to our cloud storage, making it easier to manage project files and collaborate efficiently.

In [11]:
gdrive_model_path = "/content/drive/My Drive/best_model"

gdrive_tokenizer = T5Tokenizer.from_pretrained(gdrive_model_path)
gdrive_model = T5ForConditionalGeneration.from_pretrained(gdrive_model_path)

print(f"Successfully loaded T5 tokenizer and model from: {gdrive_model_path}")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Successfully loaded T5 tokenizer and model from: /content/drive/My Drive/best_model


This model is the best version of our grammar correction T5 model. It was selected based on the lowest evaluation loss during hyperparameter tuning, retrained using the optimal hyperparameters, and saved for future use. Our group can now use this model confidently for inference, testing, or deployment, knowing it represents the highest-performing configuration we achieved.

In [12]:
from datasets import Dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# The model and tokenizer are already loaded as gdrive_model and gdrive_tokenizer
# We will use these directly.
model = gdrive_model
tokenizer = gdrive_tokenizer

# Move model to appropriate device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define the custom dataset (as provided in the notebook context)
data = {
    'sentence': [
        "I like play soccer.",
        "She are going to the store.",
        "He dont know the answer.",
        "It raining outside.",
        "They was happy about the news.",
        "She can sings very well.",
        "The dog chased it tail.",
        "We was waiting for the bus.",
        "My mother is a doctor she works hard.",
        "I did not done my homework."
    ],
    'broken': [
        "I like to play soccer.",
        "She is going to the store.",
        "He doesn't know the answer.",
        "It is raining outside.",
        "They were happy about the news.",
        "She can sing very well.",
        "The dog chased its tail.",
        "We were waiting for the bus.",
        "My mother is a doctor; she works hard.",
        "I did not do my homework."
    ]
}
custom_dataset = Dataset.from_dict(data)

# For inference, we only need the input sentences and gold standard labels
# 'sentence' now correctly represents the broken inputs to the model
inputs_for_inference = custom_dataset['sentence']
# 'broken' now correctly represents the gold standard corrected labels
gold_standard_labels = custom_dataset['broken']

predictions = []

# Generate predictions
for sentence in inputs_for_inference:
    # Add the "grammar: " prefix to the input sentence for inference
    input_text = f"grammar: {sentence}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

    output = model.generate(input_ids, max_length=256, num_beams=5, early_stopping=True)

    prediction = tokenizer.decode(output[0], skip_special_tokens=True)
    predictions.append(prediction)

# Ensure the 'bleu' and 'rouge' metric objects are loaded
import evaluate
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

# Compute BLEU score
references_for_bleu = [[g] for g in gold_standard_labels]
test_bleu_result = bleu_metric.compute(predictions=predictions, references=references_for_bleu)
test_bleu = test_bleu_result['bleu']

# Compute ROUGE score
test_rouge_result = rouge_metric.compute(predictions=predictions, references=gold_standard_labels)
test_rouge = test_rouge_result

print(f"\nTest BLEU after retraining: {test_bleu}")
print(f"Test ROUGE after retraining: {test_rouge}")
print("-" * 50)

# Print sample predictions
for i, sentence in enumerate(inputs_for_inference):
    print(f"Input Sentence (Broken): {sentence}")
    print(f"Model Output: {predictions[i]}")
    print(f"Gold Standard (Corrected): {gold_standard_labels[i]}")
    print("-" * 50)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]


Test BLEU after retraining: 0.8984528512221401
Test ROUGE after retraining: {'rouge1': np.float64(0.9800000000000001), 'rouge2': np.float64(0.95), 'rougeL': np.float64(0.9800000000000001), 'rougeLsum': np.float64(0.9800000000000001)}
--------------------------------------------------
Input Sentence (Broken): I like play soccer.
Model Output: I like to play soccer.
Gold Standard (Corrected): I like to play soccer.
--------------------------------------------------
Input Sentence (Broken): She are going to the store.
Model Output: She is going to the store.
Gold Standard (Corrected): She is going to the store.
--------------------------------------------------
Input Sentence (Broken): He dont know the answer.
Model Output: He doesn't know the answer.
Gold Standard (Corrected): He doesn't know the answer.
--------------------------------------------------
Input Sentence (Broken): It raining outside.
Model Output: It is raining outside.
Gold Standard (Corrected): It is raining outside.
--

In this step, we tested our best-performing T5 grammar correction model on a small custom dataset. We used gdrive_model and gdrive_tokenizer, moved the model to GPU if available, and prepared the input sentences and their corresponding gold-standard corrections.

For each sentence, we added a "grammar: " prefix, tokenized it, and generated a corrected output using beam search. Each output was decoded back to text and stored in a list of predictions.

We then evaluated the model quantitatively using BLEU and ROUGE metrics, comparing the predictions against the gold-standard corrected sentences. These scores give our group a measure of how well the model performs in correcting grammar. Finally, we printed each input sentence, the model’s output, and the correct version, allowing us to visually inspect the quality of corrections and verify that the model is performing as expected.

## Load and Prepare New Test Data

### Subtask:
Load the 'Grammar Correction.csv' dataset into a pandas DataFrame. Extract the 'Ungrammatical Statement' column for model input and the 'Corrected Statement' column as gold-standard references. Select the first 100 entries from both columns for testing.

In [3]:
import pandas as pd

# IMPORTANT: Please ensure this path is correct for your 'Grammar Correction.csv' file in Google Drive.
# If you encountered FileNotFoundError before, you must update this path.
csv_file_path = '/content/Grammar Correction.csv'

try:
    df_grammar = pd.read_csv(csv_file_path)
except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found. Please verify the path in your Google Drive and update 'csv_file_path' in this cell, then rerun the cell.")
    raise # Re-raise the exception to stop execution if file is still not found

test_inputs = df_grammar['Ungrammatical Statement'].head(100).tolist()
test_gold_standards = df_grammar['Standard English'].head(100).tolist()

print(f"Loaded {len(test_inputs)} ungrammatical statements for testing.")
print(f"Loaded {len(test_gold_standards)} gold-standard corrections for testing.")

Loaded 100 ungrammatical statements for testing.
Loaded 100 gold-standard corrections for testing.


In this step, we loaded a CSV file containing ungrammatical sentences and their gold-standard corrections from Google Drive. We used pandas to read the file, and added error handling to ensure the path is correct. From the dataset, we extracted the first 100 ungrammatical statements as test_inputs and their corresponding corrected sentences as test_gold_standards.

This allows our group to run inference on real-world data outside our initial training or custom examples, providing a larger and more realistic set for evaluating the grammar correction model’s performance. The print statements confirm the number of inputs and gold-standard sentences loaded for testing.

## Generate Predictions on Test Data

### Subtask:
For each of the 100 ungrammatical statements, tokenize the input (remembering to add the 'grammar: ' prefix) and use the loaded model to generate a corrected sentence. Store all generated predictions.

In [9]:
gdrive_model_path = "/content/drive/My Drive/best_model"

gdrive_tokenizer = T5Tokenizer.from_pretrained(gdrive_model_path)
gdrive_model = T5ForConditionalGeneration.from_pretrained(gdrive_model_path)

print(f"Successfully loaded T5 tokenizer and model from: {gdrive_model_path}")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Successfully loaded T5 tokenizer and model from: /content/drive/My Drive/best_model


In this step, we loaded our finalized, best-performing T5 grammar correction model and its tokenizer directly from Google Drive (gdrive_model_path). This ensures our group can use the fully trained and optimized model without retraining, making it ready for inference or evaluation on new data. The print statement confirms that the model and tokenizer were successfully loaded from the specified Drive path.

In [12]:
import torch

# Ensure model, tokenizer, and device are defined for this scope
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gdrive_model.to(device)
model = gdrive_model
tokenizer = gdrive_tokenizer

predictions = []

for sentence in test_inputs:
    input_text = f"grammar: {sentence}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

    output = model.generate(input_ids, max_length=256, num_beams=5, early_stopping=True)

    prediction = tokenizer.decode(output[0], skip_special_tokens=True)
    predictions.append(prediction)

print(f"Generated {len(predictions)} predictions.")

Generated 100 predictions.


In this step, we ran inference using our best-performing T5 model (gdrive_model) on a set of real-world test inputs loaded from the CSV file. For each ungrammatical sentence, we added a "grammar: " prefix, tokenized it, and generated a corrected output using beam search. Each prediction was decoded back into human-readable text and stored in a list.

This allows our group to see how the model performs on a larger, real-world dataset, producing a total of predictions equal to the number of test inputs. The print statement confirms the number of predictions generated.

## Calculate Test Metrics

### Subtask:
Compute the BLEU and ROUGE scores by comparing the model's generated predictions with the gold-standard corrected statements from the 'Corrected Statement' column of the 'Grammar Correction.csv' dataset.

In [13]:
import evaluate

# Ensure the 'bleu' and 'rouge' metric objects are loaded
# These were loaded earlier, but loading again for robustness in this new section.
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

# Compute BLEU score
references_for_bleu = [[g] for g in test_gold_standards]
test_bleu_result = bleu_metric.compute(predictions=predictions, references=references_for_bleu)
test_bleu = test_bleu_result['bleu']

# Compute ROUGE score
test_rouge_result = rouge_metric.compute(predictions=predictions, references=test_gold_standards)
test_rouge = test_rouge_result

print(f"\nTest BLEU Score: {test_bleu}")
print(f"Test ROUGE Scores: {test_rouge}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Test BLEU Score: 0.7919890205369166
Test ROUGE Scores: {'rouge1': np.float64(0.922293876874759), 'rouge2': np.float64(0.8257886557886557), 'rougeL': np.float64(0.921601943807826), 'rougeLsum': np.float64(0.9219161843058901)}


In this step, we evaluated our best-performing T5 grammar correction model on the real-world test dataset using standard metrics. We computed BLEU and ROUGE scores to quantitatively measure how closely the model’s predictions match the gold-standard corrected sentences.

The BLEU score captures n-gram overlap, reflecting the accuracy of individual word sequences, while ROUGE measures overlap in terms of recall, precision, and F1 score for longer sequences. Printing these scores allows our group to assess the model’s performance on real-world inputs and verify that it generalizes well beyond our custom examples.

## Display Results and Sample Predictions

### Subtask:
Print the calculated BLEU and ROUGE scores for the test set. Also, display a few examples showing the original ungrammatical statement, the model's correction, and the gold-standard corrected statement for qualitative analysis.

In [14]:
print("\n--- Sample Predictions ---")
for i in range(min(5, len(test_inputs))): # Display up to 5 samples
    print(f"Input Sentence (Ungrammatical): {test_inputs[i]}")
    print(f"Model Output: {predictions[i]}")
    print(f"Gold Standard (Corrected): {test_gold_standards[i]}")
    print("-" * 50)


--- Sample Predictions ---
Input Sentence (Ungrammatical): I goes to the store everyday.
Model Output: I go to the store everyday.
Gold Standard (Corrected): I go to the store everyday.
--------------------------------------------------
Input Sentence (Ungrammatical): They was playing soccer last night.
Model Output: They were playing soccer last night.
Gold Standard (Corrected): They were playing soccer last night.
--------------------------------------------------
Input Sentence (Ungrammatical): She have completed her homework.
Model Output: She has completed her homework.
Gold Standard (Corrected): She has completed her homework.
--------------------------------------------------
Input Sentence (Ungrammatical): He don't know the answer.
Model Output: He don't know the answer.
Gold Standard (Corrected): He doesn't know the answer.
--------------------------------------------------
Input Sentence (Ungrammatical): The sun rise in the east.
Model Output: The sun rises in the east.
Gold

In this step, we displayed a few sample predictions from our T5 grammar correction model on the real-world test dataset. By printing up to 5 examples, we can visually compare the ungrammatical input sentences, the model’s corrected outputs, and the gold-standard corrected sentences.

This allows our group to qualitatively assess the model’s performance, verifying that it not only achieves good evaluation metrics like BLEU and ROUGE but also produces corrections that are meaningful and accurate in real-world contexts.

In [15]:
print(df_grammar.columns.tolist())

['Serial Number', 'Error Type', 'Ungrammatical Statement', 'Standard English']


In this step, we printed the column names of the df_grammar DataFrame to verify the structure of our CSV file. This allows our group to confirm that the columns we intend to use—'Ungrammatical Statement' for inputs and 'Standard English' for gold-standard corrections—are present and correctly named before performing inference and evaluation.

## Save and Download Test Results to Excel

This cell will create a DataFrame with the test inputs, model predictions, and gold standard corrections, save it to an Excel file, and then initiate the download.

In [18]:
import pandas as pd
from google.colab import files
import evaluate # Ensure evaluate is imported for metric computation

# Ensure the 'bleu' and 'rouge' metric objects are loaded
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

# Create a DataFrame from the test results
test_results_df = pd.DataFrame({
    'Ungrammatical Statement': test_inputs,
    'Model Correction': predictions,
    'Gold Standard Correction': test_gold_standards
})

sentence_bleu_scores = []
sentence_rouge1_scores = []
sentence_rougeL_scores = []

# Calculate BLEU and ROUGE scores for each sentence
for i in range(len(test_inputs)):
    pred = predictions[i]
    gold = test_gold_standards[i]

    # BLEU score for single sentence
    # References need to be a list of lists: [[reference]]
    bleu_result = bleu_metric.compute(predictions=[pred], references=[[gold]])
    sentence_bleu_scores.append(bleu_result['bleu'] if bleu_result and 'bleu' in bleu_result else 0.0)

    # ROUGE score for single sentence
    # References can be a list of strings: [reference]
    rouge_result = rouge_metric.compute(predictions=[pred], references=[gold])
    sentence_rouge1_scores.append(rouge_result['rouge1'] if rouge_result and 'rouge1' in rouge_result else 0.0)
    sentence_rougeL_scores.append(rouge_result['rougeL'] if rouge_result and 'rougeL' in rouge_result else 0.0)

# Add sentence-level scores to the DataFrame
test_results_df['Sentence BLEU'] = sentence_bleu_scores
test_results_df['Sentence ROUGE-1'] = sentence_rouge1_scores
test_results_df['Sentence ROUGE-L'] = sentence_rougeL_scores

# Add the overall BLEU and ROUGE scores as new rows at the end
metrics_data = [
    {'Ungrammatical Statement': 'OVERALL METRICS', 'Model Correction': 'BLEU Score', 'Gold Standard Correction': float(test_bleu) if isinstance(test_bleu, np.float64) else test_bleu},
    {'Ungrammatical Statement': '', 'Model Correction': 'ROUGE-1', 'Gold Standard Correction': float(test_rouge['rouge1']) if 'rouge1' in test_rouge and isinstance(test_rouge['rouge1'], np.float64) else test_rouge.get('rouge1')},
    {'Ungrammatical Statement': '', 'Model Correction': 'ROUGE-2', 'Gold Standard Correction': float(test_rouge['rouge2']) if 'rouge2' in test_rouge and isinstance(test_rouge['rouge2'], np.float64) else test_rouge.get('rouge2')},
    {'Ungrammatical Statement': '', 'Model Correction': 'ROUGE-L', 'Gold Standard Correction': float(test_rouge['rougeL']) if 'rougeL' in test_rouge and isinstance(test_rouge['rougeL'], np.float64) else test_rouge.get('rougeL')}
]

metrics_df = pd.DataFrame(metrics_data)
test_results_df = pd.concat([test_results_df, metrics_df], ignore_index=True)

output_excel_path = 'grammar_correction_test_results.xlsx'
test_results_df.to_excel(output_excel_path, index=False)

print(f"Test results saved to {output_excel_path}")

# Download the file
files.download(output_excel_path)

Test results saved to grammar_correction_test_results.xlsx


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In this step, our group compiled the model’s predictions and evaluation metrics into a structured Excel report for easier analysis. We created a DataFrame containing each ungrammatical sentence, the model’s corrected output, and the gold-standard correction.

We then computed sentence-level BLEU and ROUGE scores for each input to evaluate how accurately the model corrected individual sentences. These scores were added as separate columns to the DataFrame. Additionally, we appended the overall BLEU and ROUGE metrics at the end of the table to summarize the model’s overall performance.

Finally, we saved the DataFrame to an Excel file (grammar_correction_test_results.xlsx) and downloaded it. This allows our group to review both the quantitative metrics and qualitative predictions in one organized document, making it easier to present and analyze the results.