# Final Project

## Description

### Task

In this final project, we create a spelling-correction service for the English language.

### Setup and Models

To perform this task, we chose the [T5-large-spell model](https://huggingface.co/ai-forever/T5-large-spell) (described here: https://habr.com/ru/companies/sberdevices/articles/763932/) as the teacher and a smaller model (T5-small) to do the main work.

**UPDATE**: due to unimpressive results obtained after training and attempts to distill the model, we chose to proceed by simply quantising the weights of T5-large-spell model.

### Procedure

We begin by loading the models and creating the datasets. We use two datasets: `bea60k`, which contains about 60k sentences, for training the model, and `jfleg` for distillation. Afterwards, we assess the quality of the trained models.

**UPDATE**: model quality was not very good, so we just chose to perform quantisation of the large model.

### Metric

To evaluate the quality of the model, we suggest the following metric, which we call the "error correction rate". It compares the original sentence ($x$), its reference correction ($y$) and the correction predicted by the model ($\hat y$). To calculate it, we use the following formula:

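$$\text{Error correction rate} = \cfrac{\sum_{i \in E} \mathbb{I}(\hat y_i = y_i)}{\lvert E \rvert}$$

Here $E$ is the set of word positions at which $x$ and $y$ differ (the errors), and $y_i$, $\hat y_i$ are the words of $y$ and $\hat y$ at position $i$.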

We first compare $x$ and $y$ word by word to determine the indices of the corrected words. The model does not deal with missing or extra whitespace, so the word count is assumed to stay the same. We then check, at those indices, whether the reference correction and the predicted correction match. Each match increases the count of corrected errors by one; a mismatch leaves it unchanged. After comparing all words, we divide the number of matches by the total number of errors. We also track the number of false positives (needless corrections) under a separate key.
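
As a worked illustration, take a sentence pair from `bea60k` together with a hypothetical model output (the prediction $\hat y$ below is invented for this example and is not an actual model result):

- $x$: `I was trully dissapointed by it.`
- $y$: `I was truly disappointed by it.`
- $\hat y$: `I was truly dissapointed by it.`

Comparing $x$ and $y$ marks the third and fourth words as errors. The prediction agrees with $y$ on the third word but not on the fourth, so

$$\text{Error correction rate} = \cfrac{1}{2} = 0.5$$

No false positive is recorded, because the prediction does not change any word that was already correct.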

### Results

**UPDATE**: it turned out that the quality of the T5-small model is too low for production use (due to insufficient training, or perhaps a mistake in our procedure). We therefore decided to quantize the large model and use it in production.

The large model shows quite good correction quality. Moreover, quantization reduced the model size from 2.8 GB to just 0.8 GB without a noticeable loss of quality.


## Code

### Modules

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import random
import time
from tqdm import tqdm

import datetime
import os

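from transformers import T5ForConditionalGeneration, AutoTokenizer, get_linear_schedule_with_warmup, PreTrainedTokenizerFast
# `AdamW` is deprecated in `transformers`; the PyTorch optimizer accepts the same `lr` and `eps`
# arguments (note that its default `weight_decay` is 0.01 rather than 0)
from torch.optim import AdamW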

from datasets import load_dataset

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split, RandomSampler, SequentialSampler

In [15]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Helper Functions

In [39]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))

    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

def flat_accuracy(preds, labels):
    '''
    Flattens labels and calculates accuracy
    '''
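    # Logits have shape (batch, seq_len, vocab): take the argmax over the vocabulary axis.
    # With axis=1 the shapes do not match the labels and the accuracy collapses to 0.
    pred_flat = np.argmax(preds, axis=-1).flatten()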
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def flat_accuracy_tocpu(preds, labels):
    '''
    Flattens labels and calculates accuracy
    '''
    # Logits have shape (batch, seq_len, vocab): take the argmax over the vocabulary axis.
    pred_flat = np.argmax(preds.cpu().detach().numpy(), axis=-1).flatten()
    labels_flat = labels.cpu().detach().numpy().flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def calc_weights(model):
    '''
    Calculates number of weights in the model
    '''
    # Count every parameter tensor; T5 models expose no `features` or
    # `classifier` submodules, so iterate over `parameters()` instead.
    return sum(p.numel() for p in model.parameters())

def calc_size(model):
    '''
    Calculates size of the saved model weights in KB
    '''
    os.makedirs("./tmp", exist_ok=True)  # make sure the scratch directory exists
    torch.save(model.state_dict(), "./tmp/model.p")
    size = os.path.getsize("./tmp/model.p")
    os.remove("./tmp/model.p")
    return "{:.3f} KB".format(size / 1024)

### Loading models

In [17]:
# path_to_teacher_model = "ai-forever/T5-large-spell" # we will load it before distillation to save space on GPU
path_to_student_model = "t5-small"

In [18]:
# # T5
# model_T5_spell = T5ForConditionalGeneration.from_pretrained(path_to_teacher_model)
# tokenizer_T5_spell = AutoTokenizer.from_pretrained(path_to_teacher_model)

In [19]:
# T5 small
tokenizer_student = AutoTokenizer.from_pretrained(path_to_student_model)
t5_small = T5ForConditionalGeneration.from_pretrained(path_to_student_model)

# load weights of a model (from saved results)
# t5_small.load_state_dict(torch.load("t5-small2023-11-12-09-12-20.391733.bin"))

<All keys matched successfully>

### Dataset

As already mentioned, we use two datasets: jfleg for distillation and bea60k for training the model.

In [20]:
jfleg = load_dataset("jfleg")

In [21]:
jfleg_validation = jfleg["validation"][:]
jfleg_test = jfleg["test"][:]

jfleg_validation = pd.DataFrame(jfleg_validation)
jfleg_validation["corrections"] = jfleg_validation["corrections"].apply(lambda x: x[0])

jfleg_validation.tail()

| | sentence | corrections |
|---|---|---|
| 750 | The government also should try to reduce the s... | The government should also try to reduce the s... |
| 751 | Alot of memories with enogh time to remember w... | A lot of memories , with enough time to rememb... |
| 752 | Sceene of violence can affect on them . | A scene of violence can have an effect on them . |
| 753 | While the communities in general have reckoned... | The communities in general have reckoned that ... |
| 754 | | |


In [22]:
with open('traintest/test.bea60k', 'rb') as f:
    # Read the raw bytes and decode them one byte per character (Latin-1)
    corrections = f.read().decode("latin-1").split("\n")

with open('traintest/test.bea60k.noise', 'rb') as f:
    # Read the raw bytes and decode them one byte per character (Latin-1)
    sentences = f.read().decode("latin-1").split("\n")

In [23]:
bea60k = pd.DataFrame({"sentences" : sentences, 
                       "corrections" : corrections})

bea60k['sentences'] = bea60k['sentences'].apply(lambda x: x.replace(" .", ".").replace(" ,", ","))
bea60k['sentences'] = bea60k['sentences'].apply(lambda x: "grammar: " + x)
bea60k['corrections'] = bea60k['corrections'].apply(lambda x: x.replace(" .", ".").replace(" ,", ","))

bea60k.head()

| | sentences | corrections |
|---|---|---|
| 0 | grammar: I WANT TO THAK YOU FOR PREPARING SUCH... | I WANT TO THANK YOU FOR PREPARING SUCH A GOOD ... |
| 1 | grammar: IN MY OPINION FAMOUS PEOPLE ARE BEING... | IN MY OPINION FAMOUS PEOPLE ARE BEING OBLIGED ... |
| 2 | grammar: Also, I want to say that the plays an... | Also, I want to say that the plays and films w... |
| 3 | grammar: In our Acadamy we are not allowed to ... | In our Academy we are not allowed to smoke. |
| 4 | grammar: I was trully dissapointed by it. | I was truly disappointed by it. |


In [24]:
bea60k.shape

(63045, 2)

#### Create loaders

Transform the data tables into data loaders. Before that, we need to tokenise them.
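
To make the padding step concrete, here is a small standard-library sketch of what padding to a fixed length does to one sequence of token IDs. It assumes the T5 conventions visible in the printed tensors further down (ID `1` closes the sentence, ID `0` pads). `pad_or_truncate` is a hypothetical helper written for this illustration and is not part of `transformers`.

```python
def pad_or_truncate(token_ids, max_length=128, eos_id=1, pad_id=0):
    """Return (input_ids, attention_mask), both of length `max_length`.

    Hypothetical helper mimicking fixed-length padding; eos_id=1 and pad_id=0
    are assumed to match the T5 vocabulary.
    """
    # Keep room for the end-of-sequence token, then append it.
    ids = list(token_ids)[: max_length - 1] + [eos_id]
    n_real = len(ids)
    # Fill the rest with padding; the mask is 1 for real tokens, 0 for padding.
    n_pad = max_length - n_real
    return ids + [pad_id] * n_pad, [1] * n_real + [0] * n_pad


ids, mask = pad_or_truncate([264, 27, 317], max_length=8)
print(ids)   # [264, 27, 317, 1, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask is what lets a model ignore the padded positions, provided it is actually passed to the model.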

In [25]:
def tokenize_dataset(sentence_list, corrections_list, tokenizer):

    # Tokenize all of the sentences and map the tokens to their vocabulary IDs.
    input_ids_sentence = []
    attention_masks_sentence = []

    input_ids_correction = []
    attention_masks_correction = []

    # For every sentence...
    for sentence in sentence_list:
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Append the end-of-sequence token `</s>` (T5 has no `[CLS]`/`[SEP]`).
        #   (3) Map tokens to their IDs.
        #   (4) Pad or truncate the sentence to `max_length`.
        #   (5) Create attention masks that mark the padding tokens.
        encoded_dict_sentence = tokenizer.encode_plus(
                            sentence,                      # Sentence to encode.
                            add_special_tokens = True,     # Append '</s>'.
                            max_length = 128,              # Pad & truncate all sentences.
                            padding = 'max_length',
                            truncation = True,
                            return_attention_mask = True,  # Construct attn. masks.
                            return_tensors = 'pt',         # Return pytorch tensors.
                    )

        # Add the encoded sentence to the list.
        input_ids_sentence.append(encoded_dict_sentence['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks_sentence.append(encoded_dict_sentence['attention_mask'])

    for correction in corrections_list:

        encoded_dict_correction = tokenizer.encode_plus(
                            correction,                    # Correction to encode.
                            add_special_tokens = True,     # Append '</s>'.
                            max_length = 128,              # Pad & truncate all sentences.
                            padding = 'max_length',
                            truncation = True,
                            return_attention_mask = True,  # Construct attn. masks.
                            return_tensors = 'pt',         # Return pytorch tensors.
                    )

        # Add the encoded sentence to the list.
        input_ids_correction.append(encoded_dict_correction['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks_correction.append(encoded_dict_correction['attention_mask'])

    # Convert the lists into tensors.
    input_ids_sentence = torch.cat(input_ids_sentence, dim=0)
    attention_masks_sentence = torch.cat(attention_masks_sentence, dim=0)

    input_ids_correction = torch.cat(input_ids_correction, dim=0)
    attention_masks_correction = torch.cat(attention_masks_correction, dim=0)

    # labels = torch.tensor(labels_list.to_numpy(), dtype=torch.long)

    # Print sentence 0, now as a list of IDs.
    print('Original: ', sentence_list[0])
    print('Token IDs:', input_ids_sentence[0])

    # Combine the training inputs into a TensorDataset.
    dataset = TensorDataset(input_ids_sentence, attention_masks_sentence, input_ids_correction, attention_masks_correction)

    return dataset

In [26]:
dataset = tokenize_dataset(bea60k["sentences"].values,
                           bea60k["corrections"].values, 
                           tokenizer_student)


Original:  grammar: I WANT TO THAK YOU FOR PREPARING SUCH A GOOD PROGRAMME FOR US AND ESPECIALLY FOR TAKING US ON THE RIVER TRIP TO GREENWICH.
Token IDs: tensor([19519,    10,    27,   549,  9156,  3001,     3,  4611, 12396,  6223,
         5652,  6045, 13986, 21034,   180, 20314,    71, 28299,  6828,   517,
        23203,  4369,  5652,   837,  3430,   262, 20452, 15397,  5121,  5652,
            3,  3221, 20961,   837,  9191,  1853,     3,  5593, 16174,   332,
        26017,  3001,     3,  8727, 23394,   518, 22293,     5,     1,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,   

In [27]:
dataset_train, dataset_valid = random_split(dataset, [0.9, 0.1])

In [28]:
batch_size = 16

train_dataloader = DataLoader(
            dataset_train,  
            sampler = RandomSampler(dataset_train),
            batch_size = batch_size
        )

validation_dataloader = DataLoader(
            dataset_valid, 
            sampler = SequentialSampler(dataset_valid), 
            batch_size = batch_size 
        )
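
The batch counts that appear in the training log further down (3547 training and 394 validation batches) can be reproduced with a few lines of arithmetic. This is a sketch under the assumption that `random_split` floors each fractional share and hands out the leftover items one at a time, as the PyTorch documentation describes; `split_lengths` is a hypothetical helper written for this illustration.

```python
import math


def split_lengths(n_items, fractions):
    """Lengths of a fractional split: floor each share, then distribute the
    leftover items one by one, starting with the first subset."""
    lengths = [math.floor(n_items * f) for f in fractions]
    for i in range(n_items - sum(lengths)):
        lengths[i % len(lengths)] += 1
    return lengths


n_train, n_valid = split_lengths(63045, [0.9, 0.1])
batch_size = 16

print(n_train, n_valid)                  # 56741 6304
print(math.ceil(n_train / batch_size))   # 3547
print(math.ceil(n_valid / batch_size))   # 394
```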

In [29]:
dataset_distill = tokenize_dataset(jfleg_validation["sentence"].values,
                           jfleg_validation["corrections"].values, 
                           tokenizer_student)

Original:  So I think we can not live if old people could not find siences and tecnologies and they did not developped . 
Token IDs: tensor([ 264,   27,  317,   62,   54,   59,  619,    3,   99,  625,  151,  228,
          59,  253,  108, 1433,    7,   11,    3, 5822,   29, 4137,    7,   11,
          79,  410,   59, 1344, 3138,    3,    5,    1,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0])


In [30]:
distll_dataloader = DataLoader(
            dataset_distill,  
            sampler = RandomSampler(dataset_distill),
            batch_size = batch_size
        )

### Metric

We have described the metric above in the [Description](#description) section.

In [31]:
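def error_correction_rate(sentence, correction, prediction):
    '''
    Compares the source sentence, its reference correction and the model
    prediction word by word. Returns a dict with "corr" (share of reference
    corrections reproduced by the model) and "FP" (number of needless corrections)
    '''

    sentence = np.array(sentence.split())
    correction = np.array(correction.split())
    prediction = np.array(prediction.split())

    # 1 where the reference correction changes the word at this position
    corr_indices = np.array([x != y for x, y in zip(sentence, correction)], dtype=np.uint8)

    # 1 where the prediction changes the word at this position
    pred_indices = np.array([x != y for x, y in zip(sentence, prediction)], dtype=np.uint8)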

    # print(sentence)
    # print(prediction)

    # print(corr_indices)
    # print(pred_indices)

    acc = {"corr" : 0, "FP": 0}

    for i, data in enumerate(zip(corr_indices, pred_indices)):
        corr, pred = data
        if corr == pred:
            if corr == 1:
                if correction[i] == prediction[i]:
                    acc['corr'] += 1
        elif pred == 1:
            acc['FP'] += 1

    if corr_indices.sum() != 0:
        acc['corr'] /= corr_indices.sum()
    else:
        acc['corr'] = 1

    return acc

### Train T5-small

In [32]:
t5_small.cuda()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [33]:
params = list(t5_small.named_parameters())
optimizer = AdamW(t5_small.parameters(),
                  lr = 2e-5,
                  eps = 1e-8
                )

epochs = 3 # more epochs appear to give worse results on average

total_steps = len(train_dataloader) * epochs

scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, 
                                            num_training_steps = total_steps)
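
For intuition, the learning-rate schedule configured above can be written out by hand. The sketch below restates the linear warm-up and decay rule in plain Python, assuming it mirrors what `get_linear_schedule_with_warmup` computes; `linear_schedule_factor` is a hypothetical name, and the step count is taken from the training log (3547 batches per epoch).

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warm-up from 0 to 1 (unused here because num_warmup_steps = 0).
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 3547 * 3  # batches per epoch times epochs

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With no warm-up, training starts at the full learning rate of `2e-5` and reaches zero exactly at the last step.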



In [21]:
def train_NLP(model, trainloader, validloader, optimizer, scheduler, seed_val):
    '''
    Train + Validation
    '''

    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

    # store training data here
    training_stats = []

    # start timer
    time_start = time.time() 

    for e in range(0, epochs):

        print("")
        print('======== Epoch {:} / {:} ========'.format(e + 1, epochs))

        # training segment
        print('Training...')

        t0 = time.time()
        total_train_loss = 0
        model.train()

        # process data
        for step, batch in tqdm(enumerate(trainloader), total=len(trainloader)):
            
            if step % 1000 == 0 and not step == 0:
                elapsed = format_time(time.time() - t0)
                print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(trainloader), elapsed))

            b_input_ids = batch[0].to(device)
            b_attention_mask = batch[1].to(device)

            b_labels = batch[2].to(device)
            d_attention_mask = batch[3].to(device)

            model.zero_grad()

            res = model(input_ids=b_input_ids, 
                        #attention_mask = b_attention_mask,
                        #decoder_input_ids=b_labels, 
                        #decoder_attention_mask = d_attention_mask,
                        labels=b_labels)
            
            loss = res['loss']
            logits = res['logits']

            total_train_loss += loss.item()

            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            optimizer.step()
            scheduler.step()

        avg_train_loss = total_train_loss / len(trainloader)

        training_time = format_time(time.time() - t0)

        print("")
        print("  Average training loss: {0:.2f}".format(avg_train_loss))
        print("  Training epoch took: {:}".format(training_time))

        # validation segment
        print("Running Validation...")

        t0 = time.time()
        model.eval()
        total_eval_accuracy = 0
        total_eval_loss = 0
        nb_eval_steps = 0

        for step, batch in tqdm(enumerate(validloader), total=len(validloader)):

            b_input_ids = batch[0].to(device)
            b_attention_mask = batch[1].to(device)
            b_labels = batch[2].to(device)
            d_attention_mask = batch[3].to(device)

            with torch.no_grad():
                res = model(input_ids=b_input_ids, 
                        #decoder_input_ids=b_labels, 
                        #attention_mask = b_attention_mask,
                        #decoder_attention_mask = d_attention_mask,
                        labels=b_labels)

            loss = res['loss']
            logits = res['logits']

            total_eval_loss += loss.item()

            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()

            total_eval_accuracy += flat_accuracy(logits, label_ids)

        avg_val_accuracy = total_eval_accuracy / len(validloader)
        print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

        avg_val_loss = total_eval_loss / len(validloader)

        validation_time = format_time(time.time() - t0)

        print("  Validation Loss: {0:.2f}".format(avg_val_loss))
        print("  Validation took: {:}".format(validation_time))


        training_stats.append(
            {
                'epoch': e + 1,
                'Training Loss': avg_train_loss,
                'Valid. Loss': avg_val_loss,
                'Valid. Accur.': avg_val_accuracy,
                'Training Time': training_time,
                'Validation Time': validation_time
            }
        )

        torch.save(model.state_dict(), path_to_student_model + str(datetime.datetime.now()).replace(" ", "-").replace(":", "-") + ".bin")

    print("")
    print("Training complete!")

    print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-time_start)))

    return training_stats

In [22]:
seed_val = 42
result = train_NLP(t5_small, 
                   train_dataloader, 
                   validation_dataloader, 
                   optimizer, 
                   scheduler, 
                   seed_val)


Training...


 28%|██▊       | 1000/3547 [05:16<13:09,  3.23it/s]

  Batch 1,000  of  3,547.    Elapsed: 0:05:17.


 56%|█████▋    | 2000/3547 [10:32<08:00,  3.22it/s]

  Batch 2,000  of  3,547.    Elapsed: 0:10:33.


 85%|████████▍ | 3000/3547 [15:44<03:06,  2.93it/s]

  Batch 3,000  of  3,547.    Elapsed: 0:15:44.


100%|██████████| 3547/3547 [18:37<00:00,  3.18it/s]



  Average training loss: 0.04
  Training epoch took: 0:18:37
Running Validation...


100%|██████████| 394/394 [02:01<00:00,  3.25it/s]


  Accuracy: 0.00
  Validation Loss: 0.03
  Validation took: 0:02:01

Training...


 28%|██▊       | 1000/3547 [05:11<13:12,  3.21it/s]

  Batch 1,000  of  3,547.    Elapsed: 0:05:12.


 56%|█████▋    | 2000/3547 [10:23<08:00,  3.22it/s]

  Batch 2,000  of  3,547.    Elapsed: 0:10:23.


 85%|████████▍ | 3000/3547 [15:36<02:50,  3.22it/s]

  Batch 3,000  of  3,547.    Elapsed: 0:15:37.


100%|██████████| 3547/3547 [18:26<00:00,  3.21it/s]



  Average training loss: 0.04
  Training epoch took: 0:18:27
Running Validation...


100%|██████████| 394/394 [02:01<00:00,  3.25it/s]


  Accuracy: 0.00
  Validation Loss: 0.03
  Validation took: 0:02:01

Training...


 28%|██▊       | 1000/3547 [05:09<13:11,  3.22it/s]

  Batch 1,000  of  3,547.    Elapsed: 0:05:09.


 56%|█████▋    | 2000/3547 [10:19<08:00,  3.22it/s]

  Batch 2,000  of  3,547.    Elapsed: 0:10:20.


 85%|████████▍ | 3000/3547 [15:30<02:49,  3.22it/s]

  Batch 3,000  of  3,547.    Elapsed: 0:15:30.


100%|██████████| 3547/3547 [18:20<00:00,  3.22it/s]



  Average training loss: 0.03
  Training epoch took: 0:18:20
Running Validation...


100%|██████████| 394/394 [02:00<00:00,  3.26it/s]


  Accuracy: 0.00
  Validation Loss: 0.03
  Validation took: 0:02:01

Training complete!
Total training took 1:01:29 (h:mm:ss)


### Predict

In [24]:
# t5_small.load_state_dict(torch.load('t5_small.bin'))

In [23]:
sentence = bea60k['sentences'][1]

print("Input:", sentence)

encodings = tokenizer_student(sentence, add_special_tokens=False, return_tensors="pt").to(device).input_ids
generated_tokens = t5_small.generate(encodings, bos_token_id=101, eos_token_id=102, max_length = 50)
answer = tokenizer_student.decode(generated_tokens[0], skip_special_tokens=True)

print("Output:", answer)

Input: grammar: IN MY OPINION FAMOUS PEOPLE ARE BEING OBLIGED TO PAY A PRICE FOR BEING FAMOUS THAT, IN SOME CASS, COSTS MORE THAN THEY DESERVE TO PAY.
Output: IN MY OPINION FAMOUS PEOPLE ARE BEING OBLIGED TO PAY A PRICE FOR BEING FAMOUS THAT, IN SOME CASS, COSTS MORE THAN THEY


In [24]:
prefix = "grammar: "

sentence = "I yu bкoght something goregous, you well be vry happy"
sentence = prefix + sentence

print("Input:", sentence)

encodings = tokenizer_student(sentence, add_special_tokens=False, return_tensors="pt").to(device).input_ids
generated_tokens = t5_small.generate(encodings, max_length = 50)
answer = tokenizer_student.batch_decode(generated_tokens, skip_special_tokens=True)

print("Output:", answer[0])

Input: grammar: I yu bкoght something goregous, you well be vry happy
Output: I yu bкoght something goregous, you well be vry happy happy be vry happy happy be vry happy happy be vry happy happy be vry happy happy


The model performs too poorly. Either we made a mistake in the training procedure, or the training was insufficient (which is strange, as we used quite a big dataset over multiple epochs, more than are actually listed in this notebook, since we also loaded saved states).

We suggest not using this model for error correction and instead making the big model lighter through quantisation. See the [Quantization](#quantization) section for this procedure.

#### Predict - full model

In [34]:
path_to_teacher_model = "ai-forever/T5-large-spell"

# T5
model_T5_spell = T5ForConditionalGeneration.from_pretrained(path_to_teacher_model)
tokenizer_T5_spell = AutoTokenizer.from_pretrained(path_to_teacher_model)

In [26]:
sentence = bea60k['sentences'][1]

print("Input:", sentence)

# create encodings for the sentence and then generate a correction for it
encodings = tokenizer_T5_spell(sentence, return_tensors="pt")
generated_tokens = model_T5_spell.generate(**encodings, max_new_tokens = 100)
answer = tokenizer_T5_spell.batch_decode(generated_tokens, skip_special_tokens=True, )

print("Output:", answer[0])

Input: grammar: IN MY OPINION FAMOUS PEOPLE ARE BEING OBLIGED TO PAY A PRICE FOR BEING FAMOUS THAT, IN SOME CASS, COSTS MORE THAN THEY DESERVE TO PAY.
Output: IN MY OPINION FAMOUS PEOPLE ARE BEING OBLIGED TO PAY A PRICE FOR BEING FAMOUS THAT, IN SOME CASS, COSTS MORE THAN THEY DESERVE TO PAY.


In [27]:
prefix = "grammar: "

sentence = "I yu bкoght something goregous, you well be vry happy. Sorry for unconenvinence"
sentence = prefix + sentence

print("Input:", sentence)

encodings = tokenizer_T5_spell(sentence, return_tensors="pt")
generated_tokens = model_T5_spell.generate(**encodings)
answer = tokenizer_T5_spell.batch_decode(generated_tokens, skip_special_tokens=True)

print("Output:", answer[0])

Input: grammar: I yu bкoght something goregous, you well be vry happy. Sorry for unconenvinence




Output: If you brought something gorgeous, you will be very happy. Sorry for inconvenience


In [28]:
ECR = []

for i, val in tqdm(jfleg_validation.iterrows(), total=len(jfleg_validation)):
    
    encodings = tokenizer_T5_spell(val['sentence'], return_tensors="pt")
    generated_tokens = model_T5_spell.generate(**encodings)
    answer = tokenizer_T5_spell.batch_decode(generated_tokens, skip_special_tokens=True)[0]

    ECR.append(error_correction_rate(val['sentence'].replace("grammar: ", ""),
                                     val['corrections'],
                                     answer)['corr'])
    
np.mean(np.array(ECR))


100%|██████████| 755/755 [17:16<00:00,  1.37s/it]


0.19124576775596863

For the full model, the corrections work quite well. However, our metric gives a rather low result: only about 19%. Note, though, that this means a 19% match with a "benchmark" correction, which may not only fix the spelling but also restructure the phrase a bit. Overall, I believe the spelling correction is OK.
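
One concrete way in which a restructured reference lowers the score: the metric aligns words by position, so a reference that changes the word count shifts every later word. The sketch below illustrates this with a hypothetical helper, `changed_positions`, that mirrors the positional comparison in `error_correction_rate`; the second pair is a shortened fragment in the style of the jfleg rows shown earlier.

```python
def changed_positions(source, target):
    """Indices where the i-th word of `source` differs from the i-th word of `target`."""
    pairs = zip(source.split(), target.split())
    return [i for i, (s, t) in enumerate(pairs) if s != t]


# Same word count: only the misspelled words are flagged.
print(changed_positions("I was trully dissapointed by it.",
                        "I was truly disappointed by it."))       # [2, 3]

# The reference splits "Alot" into "A lot": every later word shifts by one,
# so all three positions are counted as errors.
print(changed_positions("Alot of memories", "A lot of memories"))  # [0, 1, 2]
```

In the second case even a perfect spelling fix of the first word is scored against three "errors", which pulls the average down.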

### Distillation

In [35]:
import torch.nn as nn
import torch.nn.functional as F

In [36]:
t5_small.to(device)
model_T5_spell.to(device);

In [40]:
def distill(teacher_model, student_model, train_loader, epoch_number=2, alpha=0.5, temperature=2):
    
    def error_and_output(var_X_batch, var_y_batch): # create loss function
        # Kullback-Leibler divergence between the softened teacher and student distributions
        kldloss = nn.KLDivLoss(reduction="batchmean")

        # teacher model outputs (no gradients are needed, which also saves GPU memory)
        with torch.no_grad():
            teacher_logits = teacher_model(var_X_batch, labels=var_y_batch)['logits']
        # student model outputs
        student_preds = student_model(var_X_batch, labels=var_y_batch)
        student_logits = student_preds['logits']

        # log-softmax with temperature T over the vocabulary axis for the student network
        soft_predictions = F.log_softmax( student_logits / temperature, dim=-1 )
        # and softmax for the teacher network
        soft_labels = F.softmax( teacher_logits / temperature, dim=-1 )
        # distillation loss
        distillation_loss = kldloss(soft_predictions, soft_labels)

        # regular loss (cross-entropy computed by the model from `labels`)
        student_loss = student_preds['loss']

        # sum-up
        return distillation_loss * alpha + student_loss * (1 - alpha), student_logits
    
    optimizer = torch.optim.Adam(student_model.parameters())
    student_model.train()
    
    # train goes as usual
    for epoch in range(epoch_number):
        correct = 0
        
        for batch_idx, batch in tqdm(enumerate(train_loader), total=len(train_loader)):

            b_input_ids = batch[0].to(device)
            b_attention_mask = batch[1].to(device)
            b_labels = batch[2].to(device)
            d_attention_mask = batch[3].to(device)
            
            var_X_batch = b_input_ids.long()
            var_y_batch = b_labels.long()
            optimizer.zero_grad()
            loss, output = error_and_output(var_X_batch, var_y_batch)
            loss.backward()
            optimizer.step()

            # predicted = torch.max(output.data, 1)[1] 
            # correct += (predicted == var_y_batch).sum()
            correct += flat_accuracy_tocpu(output, b_labels)
            if batch_idx % 200 == 0:
                print('Epoch : {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
                    epoch, \
                    batch_idx*len(b_input_ids), \
                    len(train_loader.dataset), \
                    100.*batch_idx / len(train_loader), \
                    loss.data, \
                    float(correct*100) / float(batch_size*(batch_idx+1))))
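
The loss used in `distill` combines the student's regular loss with a Kullback-Leibler term between temperature-softened teacher and student distributions, weighted by `alpha`. The following standard-library sketch shows the effect of the temperature on a single token position; the logits are invented toy numbers, and `softmax` and `kl_divergence` are illustrative helpers rather than the PyTorch functions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of `logits / temperature`, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits over a 4-token vocabulary (invented numbers).
teacher_logits = [4.0, 1.0, 0.5, 0.1]
student_logits = [2.0, 1.5, 0.5, 0.2]

for T in (1.0, 10.0):
    p = softmax(teacher_logits, T)  # soft labels from the teacher
    q = softmax(student_logits, T)  # soft predictions from the student
    print(T, [round(x, 3) for x in p], round(kl_divergence(p, q), 4))
```

A higher temperature flattens both distributions, so the student also receives a signal from the tokens the teacher considers unlikely rather than only from the top choice.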

In [41]:
torch.manual_seed(2023)
distill(model_T5_spell, t5_small, distll_dataloader, temperature=10.0)

  2%|▏         | 1/48 [02:35<2:01:43, 155.38s/it]



  4%|▍         | 2/48 [06:41<2:34:04, 200.97s/it]


OutOfMemoryError: CUDA out of memory. Tried to allocate 252.00 MiB (GPU 0; 6.00 GiB total capacity; 19.28 GiB already allocated; 0 bytes free; 19.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It appears that running two models simultaneously is a bit too much for my PC.

## Quantization

As the distillation results proved disappointing, we decided to quantize the full model to make it smaller.  
We took some references from [here](https://huggingface.co/docs/transformers/main/main_classes/quantization).

In [36]:
torch.save(model_T5_spell.state_dict(), "model_T5_spell.bin")

In [12]:
calc_size(model_T5_spell)

'2881706.759 KB'

The full model currently takes about 2.8 GB. Let's try to reduce it by quantising the linear layers.

In [11]:
quantized_model = torch.quantization.quantize_dynamic(model_T5_spell, {torch.nn.Linear}, dtype=torch.qint8)

In [None]:
# at first, errors in quantization indicated that we needed to set a certain quantisation configuration for the Embedding layers,
#  but after a few tries this turned out to be unnecessary

# for _, mod in model_T5_spell.named_modules():
#     if isinstance(mod, torch.nn.Embedding):
#         mod.qconfig = torch.ao.quantization.float_qparams_weight_only_qconfig

# quantized_model = torch.quantization.quantize_dynamic(model_T5_spell, {torch.nn.Linear}, dtype=torch.qint8)

In [21]:
# torch.save(quantized_model, "quantized_model.pt") # commented out saving
tokenizer_T5_spell.save_pretrained('./tokenizer')

('./tokenizer\\tokenizer_config.json',
 './tokenizer\\special_tokens_map.json',
 './tokenizer\\spiece.model',
 './tokenizer\\added_tokens.json',
 './tokenizer\\tokenizer.json')

In [14]:
calc_size(quantized_model)

'849763.708 KB'

Wow, the model size has dropped a lot: from about 2.8 GB to about 0.8 GB!
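
To see where the saving comes from, the sketch below walks through the arithmetic of per-tensor affine int8 quantisation on a handful of invented weights: each 32-bit float is replaced by an 8-bit code plus a shared scale and zero point. It is an illustration of the idea only, not PyTorch's implementation; `quantize_affine` and `dequantize` are hypothetical helpers, and PyTorch's exact range selection and rounding may differ.

```python
def quantize_affine(weights, qmin=-128, qmax=127):
    """Map floats to int8 codes with one scale and zero point for the whole tensor."""
    lo = min(min(weights), 0.0)  # the representable range must contain 0
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(qmin - lo / scale)
    codes = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.42, -0.07, 0.0, 0.13, 0.38]  # invented example values
codes, scale, zero_point = quantize_affine(weights)
restored = dequantize(codes, scale, zero_point)

print(codes)  # [-128, -16, 6, 47, 127]
# The rounding error per weight stays within half a quantisation step.
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2 + 1e-12)  # True
```

Storing one byte instead of four per weight would give at most a 4x reduction. The measured ratio here is 2881706.759 KB / 849763.708 KB, roughly 3.4x, which is consistent with only the `torch.nn.Linear` layers being quantised while the embeddings stay in floating point.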

## Load check

Let's check that loading the model and the tokenizer works OK (as we are going to use them on a server).

In [2]:
new_tokenizer = PreTrainedTokenizerFast(tokenizer_file="./server/src/tokenizer/tokenizer.json")

In [3]:
new_model = torch.load("./server/src/models/quantized_T5-large.pt")
new_model.eval()



T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (k): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (v): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (o): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5

In [None]:
sentence = bea60k['sentences'][0]

encodings = new_tokenizer(sentence, return_tensors="pt").input_ids
generated_tokens = new_model.generate(encodings, max_new_tokens = 100)
answer = new_tokenizer.batch_decode(generated_tokens, skip_special_tokens=True, )
print(sentence)
print(answer[0])

In [None]:
prefix = "grammar: "

sentence = "I yu bкoght something goregous, you well be vry happy. Sorry for unconenvinence"
sentence = prefix + sentence

encodings = new_tokenizer(sentence, return_tensors="pt").input_ids
generated_tokens = new_model.generate(encodings)
answer = new_tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(answer[0])



If you bought something gorgeous, you will be very happy. Sorry for inconvenience
