# [Fine-Tune a Transformer Model for Grammar Correction](https://www.vennify.ai/fine-tune-grammar-correction/)
|-> The main objective is to correct the grammar of text (English only).

|-> Fine-tune a pretrained T5 model for the task of grammar correction.

|-> Save the model and do Inference.

|-> Further Improvement.

# Example:

![](https://production-media.paperswithcode.com/tasks/gec_foTfIZW.png)

# Table of contents:
- Introduction
- Installation
- Data Collection
- Data Examination
- Data Preprocessing
- Before Training Evaluating
- Training
- After Training Evaluating
- Inference

# Introduction:
- In linguistics, the grammar of a natural language is its set of structural constraints on speakers' or writers' composition of clauses, phrases, and words.
- A grammar checker, in computing terms, is a program, or part of a program, that attempts to verify written text for grammatical correctness.
- For grammar correction we will use the [T5 model](https://huggingface.co/docs/transformers/model_doc/t5) (English only).
- T5 was created by Google AI and released to the world for anyone to download and use.
- T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing. This means that for training, we always need an input sequence and a corresponding target sequence.
- We'll use a Python package called [Happy Transformer](https://happytransformer.com/).
- Happy Transformer is built on top of Hugging Face's Transformers library and makes it easy to implement and train transformer models with just a few lines of code. 
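To make the teacher-forcing point above concrete, here is a minimal sketch of how a target sequence becomes the decoder's input during training: the target is shifted right by one position and a start token is prepended, so at every step the decoder is fed the correct previous token instead of its own prediction. The token IDs are made up for illustration. The only real details assumed are that T5 reuses its padding token (ID 0) as the decoder start token and uses ID 1 for end-of-sequence.

```python
PAD_ID = 0  # T5 reuses the padding token as the decoder start token
EOS_ID = 1  # T5 end-of-sequence token


def shift_right(target_ids, start_id=PAD_ID):
    """Build decoder inputs for teacher forcing: [start] + target[:-1]."""
    return [start_id] + list(target_ids[:-1])


# Hypothetical token IDs for a corrected sentence, ending with EOS.
target_ids = [100, 200, 300, EOS_ID]
decoder_input_ids = shift_right(target_ids)

# At step i the decoder is fed decoder_input_ids[i] and is trained
# to predict target_ids[i].
for step, (fed, expected) in enumerate(zip(decoder_input_ids, target_ids)):
    print(f"step {step}: fed {fed:>3} -> should predict {expected}")
```

Happy Transformer and the underlying Transformers library handle this shifting internally, so the sketch is only meant to show why training needs both an input and a target sequence.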

# Installation: 
- Install Happy Transformer with the following command:
- `pip install happytransformer`
- Read more on [PyPI](https://pypi.org/project/happytransformer/)
- [Documentation](https://happytransformer.com/)

In [1]:
""" Library installation """
!pip install happytransformer 
from IPython.display import clear_output
clear_output()

In [2]:
""" Imports are mentioned here """

import csv
from datasets import load_dataset
from happytransformer import TTSettings
from happytransformer import TTTrainArgs
from happytransformer import HappyTextToText

# Model
- T5 comes in several different sizes, and we'll use the base model, which has 220 million parameters.
- T5 is a text-to-text model, meaning that, given text, it generates a standalone piece of text based on the input.
- Thus, we'll import a class called HappyTextToText from Happy Transformer, which we'll use to load the model.
- We'll provide the model type (T5) as the first positional parameter and the model name (t5-base) as the second.
- If you want to read more about T5, see the Additional Resources section at the end.


In [None]:
""" Model """

happy_tt = HappyTextToText("T5", "t5-base")

# Data Collection
- The [dataset](https://huggingface.co/datasets/jfleg) is available on Hugging Face's datasets distribution network and can be accessed using their Datasets library. 
- Since this library is a dependency for Happy Transformer, we do not need to install it and can go straight to importing a function called load_dataset from the library.  

In [None]:
train_dataset = load_dataset("jfleg", split='validation[:]')

eval_dataset = load_dataset("jfleg", split='test[:]')

# Data Examination  
- We just successfully downloaded the dataset.
- Let's now explore it by iterating over some cases. Both the train and eval datasets are structured the same way and have two features: sentence and corrections.
- The sentence feature contains a single string for each case, while the corrections feature contains a list of 4 human-generated corrections.

In [None]:
for case in train_dataset["corrections"][:2]:
    print(case)
    print(case[0])
    print("--------------------------------------------------------")
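As a rough sketch of what this examination looks for, the snippet below summarizes a handful of records that have the same two-feature layout. The records are hypothetical stand-ins written for this example, not actual JFLEG entries, and `summarize` is an illustrative helper rather than part of any library.

```python
# Hypothetical records mimicking the layout described above: one
# "sentence" string and a list of "corrections" strings per case.
sample_cases = [
    {
        "sentence": "She go to school .",
        "corrections": [
            "She goes to school .",
            "She went to school .",
            "",  # some cases contain blank corrections
            "She goes to school .",
        ],
    },
    {
        "sentence": "He have two cat .",
        "corrections": ["He has two cats ."] * 4,
    },
]


def summarize(cases):
    """Count cases, corrections, and blank corrections."""
    total = sum(len(case["corrections"]) for case in cases)
    blank = sum(1 for case in cases for c in case["corrections"] if not c)
    return {"cases": len(cases), "corrections": total, "blank": blank}


print(summarize(sample_cases))
```

Running the same helper over `train_dataset` should work as well, since iterating a Hugging Face dataset yields one dictionary per case.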

# Data Preprocessing  
- Now, we must process the data into the proper format for Happy Transformer.
- We need to structure both the training and evaluating data in the same format, which is a CSV file with two columns: input and target.
- The input column contains grammatically incorrect text, and the target column contains the corrected version of the text from the input column.

In [None]:
def generate_csv(csv_path, dataset):
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Add the task's prefix to the input.
            input_text = "grammar: " + case["sentence"]
            # Write one row per human correction.
            for correction in case["corrections"]:
                # A few of the cases contain blank strings, so skip those.
                if input_text and correction:
                    writer.writerow([input_text, correction])

In [None]:
generate_csv("train.csv", train_dataset)
generate_csv("eval.csv", eval_dataset)
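To check what the resulting files look like, the sketch below writes a tiny CSV with the same two-column layout and reads it back. It is self-contained so it can run without downloading anything: `write_pairs` is a stand-in that mirrors the logic of `generate_csv`, and the single record is a hypothetical example, not real dataset content.

```python
import csv
import os
import tempfile


def write_pairs(csv_path, cases, prefix="grammar: "):
    """Write one (input, target) row per non-blank correction."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "target"])
        for case in cases:
            input_text = prefix + case["sentence"]
            for correction in case["corrections"]:
                if correction:
                    writer.writerow([input_text, correction])


# Hypothetical record with one blank correction.
toy_cases = [
    {
        "sentence": "He have a car.",
        "corrections": ["He has a car.", "", "He has a car."],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    write_pairs(path, toy_cases)
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

for row in rows:
    print(row)
```

Each sentence is repeated once per non-blank correction, so a single case can contribute up to four training rows.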

# Before Training Evaluating
- We'll evaluate the model before and after fine-tuning using a common metric called loss. 
- Loss can be described as how "wrong" the model's predictions are compared to the correct answers. 
- So, if the loss decreases after fine-tuning, then that suggests the model learned.
- It's important that we use separate data for training and evaluating to show that the model can generalize its obtained knowledge to solve unseen cases.
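As a worked illustration of what "loss" measures, the sketch below computes an average cross-entropy by hand: the mean negative log of the probability the model assigned to each correct token. The probabilities are invented for the example, and the assumption that the reported loss is a token-level cross-entropy of this kind comes from how T5 is normally trained, not from anything this notebook prints.

```python
import math


def cross_entropy(probs_of_correct_tokens):
    """Mean negative log-probability assigned to the correct tokens."""
    total = sum(math.log(p) for p in probs_of_correct_tokens)
    return -total / len(probs_of_correct_tokens)


# Hypothetical probabilities a model assigns to the right token at each step.
confident = [0.9, 0.8, 0.95]  # mostly right: low loss
unsure = [0.3, 0.2, 0.4]      # mostly wrong: high loss

loss_confident = cross_entropy(confident)
loss_unsure = cross_entropy(unsure)

print(f"confident model loss: {loss_confident:.3f}")
print(f"unsure model loss:    {loss_unsure:.3f}")
```

A model that puts more probability on the correct tokens gets a lower number, which is why a drop in loss after fine-tuning suggests the model improved.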

In [None]:
before_result = happy_tt.eval("eval.csv")

- The result is a dataclass object with a single variable called loss, which we can isolate as shown below.

In [None]:
print("Before loss:", before_result.loss)

# Training
- Let's now train the model. 
- We can do so by calling happy_tt's train() method. 
- For simplicity, we'll use the default parameters other than the batch size which we'll increase to 8.
- If you experience an out-of-memory error, I suggest reducing the batch size.
- You can visit this [webpage](https://happytransformer.com/text-to-text/finetuning/) to learn how to modify various parameters like the learning rate and the number of epochs.

In [None]:
args = TTTrainArgs(batch_size=8)
happy_tt.train("train.csv", args=args)

# After Training Evaluating
- Like before, let's determine the model's loss.

In [None]:
after_result = happy_tt.eval("eval.csv")

print("After loss:", after_result.loss)

# Inference
- Let's now use the model to correct the grammar of examples we'll provide it.
- To accomplish this, we'll use happy_tt's generate_text() method. 
- We'll also use an algorithm called beam search for the generation. 
- You can view the different text generation parameters you can modify on this [webpage](https://happytransformer.com/text-to-text/settings/), along with different configurations you could use for common algorithms.
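To give a feel for what beam search does, here is a toy sketch over a hand-written table of next-token probabilities. Everything in it is hypothetical: the table, the tokens, and the `beam_search` helper are illustrative only, and real implementations add details such as length penalties. The point is that keeping several candidate sequences (beams) can find a more probable output than always taking the single best next token.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.5, "a": 0.4, "</s>": 0.1},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_length=5):
    """Keep the num_beams best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Extend every live beam by every possible next token.
        candidates = [
            (score + math.log(prob), tokens + [token])
            for score, tokens in beams
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for candidate in candidates[:num_beams]:
            if candidate[1][-1] == "</s>":
                finished.append(candidate)
            else:
                beams.append(candidate)
        if not beams:
            break
    # Prefer completed sequences; fall back to live beams if none finished.
    _, best_tokens = max(finished or beams, key=lambda c: c[0])
    return " ".join(t for t in best_tokens if t not in ("<s>", "</s>"))


greedy = beam_search(num_beams=1)
beam = beam_search(num_beams=2)
print("greedy (1 beam):", greedy)
print("beam search (2 beams):", beam)
```

With one beam the search commits to "the" because it is the most likely first token, while two beams keep "a" alive long enough to discover that "a cat" is the more probable sequence overall.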

In [None]:
beam_settings = TTSettings(num_beams=5, min_length=1, max_length=20)

In [None]:
""" Example1: """
example_1 = "grammar: This sentences, has bads grammar and spelling!"
result_1 = happy_tt.generate_text(example_1, args=beam_settings)
print(result_1.text)

In [None]:
""" Example2: """

example_2 = "grammar: I am enjoys, writtings articles ons AI and I also enjoyed write articling on AI."

result_2 = happy_tt.generate_text(example_2, args=beam_settings)
print(result_2.text)

# Further Improvement:
- I suggest transferring some of the evaluating cases to the training data and then optimizing the hyperparameters with a technique like grid search.
- You can then include the evaluating cases in the training set to fine-tune a final model using your best set of hyperparameters.
- We could also try other languages to add multilingual support.
- Add custom layers to refine output.
- Try other models as well.
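The grid search mentioned above can be sketched in a few lines. This is a minimal outline under stated assumptions: `fake_evaluate` is a made-up scoring function standing in for "fine-tune a fresh model with these settings and return its loss on held-out data", and the parameter names `learning_rate` and `num_train_epochs` are assumed to match the training arguments you would actually tune.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_loss)."""
    names = list(param_grid)
    best_params, best_loss = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def fake_evaluate(params):
    # Hypothetical stand-in for a real train-then-evaluate run.
    # A bowl-shaped function so the sketch runs without any training.
    lr_term = (params["learning_rate"] - 5e-5) ** 2 * 1e8
    epoch_term = abs(params["num_train_epochs"] - 3) * 0.1
    return lr_term + epoch_term


grid = {
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "num_train_epochs": [1, 3, 5],
}

best_params, best_loss = grid_search(grid, fake_evaluate)
print("best params:", best_params)
print("best loss:", best_loss)
```

In a real run, each call to the evaluation function would be expensive, so it is worth keeping the grid small and scoring on data that is not used for the final evaluation.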

# Additional Resources:
- [Transformers](https://towardsdatascience.com/transformers-89034557de14)
- [T5](https://paperswithcode.com/method/t5)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)
- [Hugging Face](https://huggingface.co/)

# The End