
**Training the T5 base model with JFLEG data.**
<p>This notebook fine-tunes the model on JFLEG (https://huggingface.co/datasets/jfleg) and saves the trained model.

<p>The resulting fine-tuned model will be used for the English track of the MultiGED-2023 shared task (https://github.com/spraakbanken/multiged-2023).

In [None]:
!pip install happytransformer
!pip install transformers

In [2]:
from happytransformer import HappyTextToText

happy_tt = HappyTextToText("T5", "t5-base")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [3]:
from datasets import load_dataset

In [4]:
train_dataset = load_dataset("jfleg", split='validation[:]')

eval_dataset = load_dataset("jfleg", split='test[:]')



**Pre-processing the data**

In [5]:
import csv

def generate_csv(csv_path, dataset):
    """Write one (input, target) row per non-empty correction."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Add the task prefix to the input.
            input_text = "grammar: " + case["sentence"]
            for correction in case["corrections"]:
                # A few of the cases contain blank strings; skip them.
                # Check the raw sentence, since input_text always holds the prefix.
                if case["sentence"] and correction:
                    writer.writerow([input_text, correction])
                    


generate_csv("train.csv", train_dataset)
generate_csv("eval.csv", eval_dataset)
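
To see what the pre-processing step produces, the sketch below applies the same row-expansion idea to two made-up records and writes to an in-memory buffer instead of a file. The helper name `expand_rows` and the sample sentences are illustrative assumptions; they are not taken from JFLEG.

```python
import csv
import io

def expand_rows(dataset, prefix="grammar: "):
    """Yield one (input, target) pair per non-empty correction."""
    for case in dataset:
        input_text = prefix + case["sentence"]
        for correction in case["corrections"]:
            # Skip blank sentences and blank corrections.
            if case["sentence"] and correction:
                yield input_text, correction

# Made-up records shaped like JFLEG rows (illustrative only).
sample = [
    {"sentence": "She go to school .",
     "corrections": ["She goes to school .", "She went to school .", ""]},
    {"sentence": "He like apples .",
     "corrections": ["He likes apples ."]},
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["input", "target"])
writer.writerows(expand_rows(sample))

# Read the buffer back to inspect the rows that were written.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(rows) - 1, "training pairs")
```

Each source sentence appears once per usable correction, so a sentence with several reference corrections contributes several training pairs.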

In [None]:
before_result = happy_tt.eval("eval.csv")


**If the program raises errors, run the commands below, then comment them out and restart the runtime:**
<p>!pip uninstall -y transformers accelerate
<p>!pip install transformers accelerate

In [None]:
# If the program raises errors, run this part, then comment it out and restart the runtime
# !pip uninstall -y transformers accelerate
# !pip install transformers accelerate

In [None]:
print("Before loss:", before_result.loss)

**Training**

In [None]:
from happytransformer import TTTrainArgs

args = TTTrainArgs(batch_size=8)
happy_tt.train("train.csv", args=args)

In [None]:
after_result = happy_tt.eval("eval.csv")

print("After loss:", after_result.loss)
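
The two loss values are easier to compare as a relative change. The helper below is a small sketch of that arithmetic; `relative_improvement` is a hypothetical name, and the numbers in the example call are placeholders, not measured results from this notebook.

```python
def relative_improvement(before_loss, after_loss):
    """Return the fractional drop in loss (0.25 means 25% lower)."""
    if before_loss <= 0:
        raise ValueError("before_loss must be positive")
    return (before_loss - after_loss) / before_loss

# Placeholder values for illustration only.
print(f"{relative_improvement(2.0, 1.5):.1%}")
```

A negative return value would mean the evaluation loss went up after training.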

In [None]:
from happytransformer import TTSettings

beam_settings = TTSettings(num_beams=5, min_length=1, max_length=50)
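
The settings only take effect once they are passed to a generation call. The helper below is a minimal sketch of how that might look. `correct_sentence` is a hypothetical name, and it assumes the model object exposes `generate_text(text, args=...)` and returns an object with a `.text` attribute, as Happy Transformer's `HappyTextToText` does. The `"grammar: "` prefix mirrors the one used when building the CSV files.

```python
def correct_sentence(model, sentence, settings=None, prefix="grammar: "):
    """Run one sentence through a text-to-text model and return the output string.

    `model` is assumed to behave like happytransformer.HappyTextToText.
    """
    # The prefix must match the one used in the training data.
    text = prefix + sentence
    if settings is None:
        # Fall back to the model's default generation settings.
        result = model.generate_text(text)
    else:
        result = model.generate_text(text, args=settings)
    return result.text
```

In this notebook the call would be along the lines of `correct_sentence(happy_tt, "This sentences has has bads grammar.", beam_settings)`.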

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Set the path where you want to save the trained model**

In [None]:
# Save the fine-tuned model (happy_tt) rather than a freshly loaded t5-base.
path_to_save = "model/"
happy_tt.save(path_to_save)