# Model Training

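In this notebook, we will fine-tune models for two scores, Grammar and Vocabulary, using the best hyperparameters found for this dataset. For each score, we train five models on different random splits of the training data.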

We will continue to use the wandb library to track our training runs.

In [1]:
# Import Classes for tokenization and model training
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Import DatasetDict, which will help us prepare our own dataset for use in training and evaluating machine learning models
from datasets import DatasetDict

# Import the metric used to evaluate predictions (mean squared error) and the splitter used to create training/validation splits
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

import pandas as pd
import numpy as np

# Import library to track our training runs and change settings
import wandb

# Replace the variables below with your own: name, project name, and project directory
%env WANDB_ENTITY = langdon
%env WANDB_PROJECT = ellipse
%env WANDB_DIR = /home/jovyan/active-projects/ellipse-methods-showcase/bin

env: WANDB_ENTITY=langdon
env: WANDB_PROJECT=ellipse
env: WANDB_DIR=/home/jovyan/active-projects/ellipse-methods-showcase/bin
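
The `%env` magics above only work inside IPython/Jupyter. If this code were moved into a plain Python script, the same variables could be set through `os.environ` before `wandb` starts a run. This is a minimal sketch; the values below are placeholders (assumptions), not the settings used in this notebook.

```python
import os

# Placeholder values (assumptions): substitute your own entity, project, and directory.
wandb_settings = {
    "WANDB_ENTITY": "my-entity",
    "WANDB_PROJECT": "my-project",
    "WANDB_DIR": "./wandb-logs",
}

# wandb reads these environment variables when a run is initialized,
# so they need to be set before training starts.
os.environ.update(wandb_settings)

for name in wandb_settings:
    print(f"{name}={os.environ[name]}")
```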


## Load DatasetDict and Tokenize

We could have tokenized our datadict when we created the dataset partitions, but waiting until the last minute gives us the flexibility to try out different models that may require different tokenization schemes.

In [2]:
# Initialize tokenizer and create helper function for tokenization as we did in the previous notebooks.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_inputs(example):
    return tokenizer(example['text'], max_length=512, truncation=True)

In [3]:
def get_datadict(score_to_predict):
    ''' Selects a target score that the model should predict and renames that score to 'label'.
    Removes other columns from the dataset. The other columns are not needed for training.
    '''
    
    # All score columns; every one except the target score will be removed from the dataset
    scores = {
        'Overall',
        'Cohesion',
        'Syntax',
        'Vocabulary',
        'Phraseology',
        'Grammar',
        'Conventions'
    }
    
    columns_to_remove = scores.symmetric_difference([score_to_predict])
    
    # Load the DatasetDict object we created in the previous notebook. 
    # We will be removing the columns that we defined above, and renaming the target column (=score_to_predict) into 'label'
    dd = (DatasetDict
          .load_from_disk('../data/ellipse.hf')
          .remove_columns(columns_to_remove)
          .map(tokenize_inputs, remove_columns=['text_id', 'text']) # the transformer does not need these columns to train.
          .rename_column(score_to_predict, 'label') # The Trainer expects the target in a column named 'label'.
         )
    
    return dd
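
The `symmetric_difference` call in `get_datadict` behaves like a plain set difference only because `score_to_predict` is one of the seven score names. The small sketch below illustrates the difference between the two operations; the misspelled target is a hypothetical input used for illustration.

```python
scores = {'Overall', 'Cohesion', 'Syntax', 'Vocabulary',
          'Phraseology', 'Grammar', 'Conventions'}

# When the target is one of the scores, both operations simply drop it from the set.
target = 'Grammar'
assert scores.symmetric_difference([target]) == scores - {target}

# A misspelled target (hypothetical) is *added* by symmetric_difference,
# whereas a plain difference leaves the set unchanged.
typo = 'grammar'
sym = scores.symmetric_difference([typo])
diff = scores - {typo}

print(sorted(sym - scores))  # the extra, nonexistent column name
print(diff == scores)        # whether the plain difference left the set untouched
```

In both cases a misspelled name should make a later step (`remove_columns` or `rename_column`) fail, so the mistake surfaces before any training starts.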

## Compute Metric

When Hugging Face has to rank models (for example, during hyperparameter search), its default objective is the sum of the metrics returned by this function.

We only report one metric (MSE). If other metrics were included (such as R-squared), Hugging Face would need to know which one to use: MSE should be minimized and R-squared should be maximized, so summing them produces a meaningless value. The metric to use is specified when configuring the training arguments.

In [4]:
def compute_metrics(eval_pred):
    preds, labels = eval_pred
    mse = mean_squared_error(labels, preds)

    return {'mse': mse}
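
To make the point about summing MSE and R-squared concrete, here is a small pure-Python sketch with made-up labels. It compares a perfect model with a model that always predicts the mean, on two label sets that differ only in their spread.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def summed(y_true, y_pred):
    # The quantity you would get by adding both metrics together
    return mse(y_true, y_pred) + r_squared(y_true, y_pred)

# Made-up label sets (assumptions for illustration only)
label_sets = {
    'wide':   [2.0, 3.0, 4.0, 5.0],
    'narrow': [3.0, 3.5, 4.0, 4.5],
}

results = {}
for name, labels in label_sets.items():
    perfect = list(labels)                                # predicts every label exactly
    constant = [sum(labels) / len(labels)] * len(labels)  # always predicts the mean
    results[name] = (summed(labels, perfect), summed(labels, constant))
    print(name, 'perfect:', results[name][0], 'constant:', results[name][1])
```

On the wide labels the perfect model gets the smaller sum, and on the narrow labels it gets the larger one, so neither minimizing nor maximizing the sum reliably picks the better model.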

## Train Function

We can make some improvements here. With this configuration the development data goes unused: each model is evaluated on rows held out from the training split instead. We could either decide not to keep a development split, or put it to use by training for 4-5 epochs and keeping the best model.
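
The second option mentioned above (keeping the best model from several epochs) could be configured roughly as follows. This is a sketch of keyword arguments for `TrainingArguments`, not the configuration used for the runs in this notebook. The epoch count and `save_total_limit` are assumptions, and `evaluation_strategy` is called `eval_strategy` in recent `transformers` releases.

```python
# Sketch only: keyword arguments that could be merged into TrainingArguments(...)
best_model_kwargs = dict(
    num_train_epochs=5,            # assumption: upper end of "4-5 epochs"
    evaluation_strategy='epoch',   # evaluate after every epoch
    save_strategy='epoch',         # must match the evaluation strategy
    save_total_limit=1,            # assumption: limit the checkpoints kept on disk
    load_best_model_at_end=True,   # reload the best checkpoint when training ends
    metric_for_best_model='mse',   # key returned by compute_metrics
    greater_is_better=False,       # lower MSE is better
)

# Hypothetical usage:
# TrainingArguments(output_dir='../bin/checkpoints', **best_model_kwargs)
print(best_model_kwargs)
```

For this to make use of the development data, the `Trainer` would also need its `eval_dataset` to point at the development split.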

In [5]:
def train(score_to_predict):
    # load in the dataset we created before with the target column's name changed to 'label'
    datadict = get_datadict(score_to_predict)

    # Draw 5 independent random train/validation splits (ShuffleSplit holds out 10% of the rows by default)
    folds = ShuffleSplit(n_splits=5, random_state=42)        
    splits = folds.split(np.zeros(datadict["train"].num_rows), datadict["train"]["label"])
    
    # Iterate over in-fold and out-of-fold indexes
    for i, (inf_idxs, oof_idxs) in enumerate(splits):
        # since we create the model from_pretrained() within the train() function, we do not need a model_init()
        model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)
        
        training_args = TrainingArguments(
            output_dir = '../bin/checkpoints',
            optim = 'adamw_torch',
            logging_dir = f'../logs/{score_to_predict}',
            evaluation_strategy='epoch',
            save_strategy='no',
            log_level='error',
            disable_tqdm = False,
            report_to='wandb',
            num_train_epochs=2, # tuned
            learning_rate=5e-5, # tuned
            per_device_train_batch_size=16, # tuned
            per_device_eval_batch_size=16,
        )
    
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=datadict['train'].select(inf_idxs),
            eval_dataset=datadict['train'].select(oof_idxs),
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
        )
    
        trainer.train()
        trainer.save_model(f'../bin/kfold-{score_to_predict.lower()}-models/{score_to_predict.lower()}-model-{i:02}')   
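
`ShuffleSplit` draws each split independently, so the five validation sets may overlap and do not jointly cover the training data. That differs from K-fold cross-validation, despite the `kfold-` prefix in the output paths. The toy comparison below assumes scikit-learn's default `test_size` and uses a made-up dataset size.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

n_rows = 100  # toy size (assumption), not the real training set
X = np.zeros(n_rows)

def held_out_summary(splitter):
    """Return (sizes of the held-out sets, number of distinct rows ever held out)."""
    held_out = [test_idx for _, test_idx in splitter.split(X)]
    sizes = [len(idx) for idx in held_out]
    covered = len(set(np.concatenate(held_out).tolist()))
    return sizes, covered

shuffle_sizes, shuffle_covered = held_out_summary(
    ShuffleSplit(n_splits=5, random_state=42))
kfold_sizes, kfold_covered = held_out_summary(
    KFold(n_splits=5, shuffle=True, random_state=42))

print('ShuffleSplit:', shuffle_sizes, 'distinct rows held out:', shuffle_covered)
print('KFold:       ', kfold_sizes, 'distinct rows held out:', kfold_covered)
```

With `KFold`, every row is held out exactly once. With `ShuffleSplit`, each model trains on 90% of the rows and at most half of the rows are ever used for validation.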

## Train Grammar

Finetune a model that predicts the 'Grammar' scores in the ELLIPSE corpus using the function we created above.

In [6]:
train('Grammar')

Map:   0%|          | 0/4537 [00:00<?, ? examples/s]

Map:   0%|          | 0/972 [00:00<?, ? examples/s]

Map:   0%|          | 0/973 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wandb: Currently logged in as: langdon. Use `wandb login --relogin` to force relogin


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.276915,0.276915
2,0.322500,0.272244,0.272244


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.24737,0.24737
2,0.372600,0.25073,0.25073


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.30336,0.30336
2,0.358800,0.242446,0.242446


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.265568,0.265568
2,0.360500,0.256617,0.256617


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.351895,0.351895
2,0.355300,0.256412,0.256412


## Train Vocabulary

We can use the same approach to finetune a model that predicts the 'Vocabulary' scores. We will assume that the optimal hyperparameters are similar for different scores on this dataset.

In [7]:
train('Vocabulary')

Map:   0%|          | 0/4537 [00:00<?, ? examples/s]

Map:   0%|          | 0/972 [00:00<?, ? examples/s]

Map:   0%|          | 0/973 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss,Mse
1,No log,0.199172,0.199172
2,0.302200,0.189354,0.189354


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.21381,0.21381
2,0.305200,0.191467,0.191467


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.204694,0.204694
2,0.306300,0.19494,0.19494


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.365828,0.365828
2,0.292400,0.209755,0.209755


Epoch,Training Loss,Validation Loss,Mse
1,No log,0.195263,0.195263
2,0.316300,0.191075,0.191075
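
The per-split logs are easier to compare once they are averaged. The sketch below copies the final-epoch validation MSE values from the Grammar and Vocabulary logs above and computes the mean MSE per score, along with its square root as a rough error in score units.

```python
from math import sqrt
from statistics import mean

# Final-epoch (epoch 2) validation MSE for each of the five splits, copied from the logs above.
final_mse = {
    'Grammar':    [0.272244, 0.25073, 0.242446, 0.256617, 0.256412],
    'Vocabulary': [0.189354, 0.191467, 0.19494, 0.209755, 0.191075],
}

summary = {}
for score, values in final_mse.items():
    avg = mean(values)
    summary[score] = {'mean_mse': avg, 'rmse': sqrt(avg)}
    print(f"{score}: mean MSE = {avg:.4f}, RMSE = {sqrt(avg):.3f}")
```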


# Prepare Dataset for Confirmatory Factor Analysis

In [35]:
# Load the DatasetDict object we created in the previous notebook. 
datadict = DatasetDict.load_from_disk('../data/ellipse.hf/')

# We are specifically interested in using the test set since we are in our model evaluation phase
ds = datadict['test']

In [36]:
df = pd.read_csv('../data/both_raters.csv')

In [37]:
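# Record the order of the test-set texts so the ratings table can be aligned to it
idf = pd.DataFrame({
    "text_id_original": ds["text_id"],
    "order": range(ds.num_rows)
})
# Keep only the test-set texts, restore the dataset order, and drop the helper column
df = pd.merge(idf, df, on="text_id_original").sort_values("order").drop("order", axis=1)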
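
The alignment step above can be illustrated on toy data. In this sketch the ids and ratings are made up, and the final `reset_index` and length check are optional additions rather than part of the original cell.

```python
import pandas as pd

# Toy stand-ins (assumptions): ids in dataset order, and a ratings table in a different order.
dataset_ids = ['b', 'c', 'a']
ratings = pd.DataFrame({
    'text_id_original': ['a', 'b', 'c', 'd'],
    'Grammar_1': [4, 3, 2, 5],
})

order = pd.DataFrame({
    'text_id_original': dataset_ids,
    'order': range(len(dataset_ids)),
})

aligned = (pd.merge(order, ratings, on='text_id_original')  # inner join drops 'd'
             .sort_values('order')
             .drop('order', axis=1)
             .reset_index(drop=True))

# An inner join silently drops unmatched ids, so checking the length is a cheap safeguard.
assert len(aligned) == len(dataset_ids)
print(aligned)
```

Because the rows now follow the dataset order, predictions computed from the dataset can be assigned to new columns directly.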

In [38]:
from transformers import pipeline
from tqdm.auto import tqdm

# Model inference pipeline that uses our finetuned model
def predict(eval_data, model_path):
    pipe = pipeline('text-classification',
                    model=model_path,
                    truncation=True,
                    batch_size=16,
                    function_to_apply='none',
                   )
    
    predictions = [pipe(text)[0]['score'] for text in tqdm(eval_data['text'])]
    
    return predictions
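
`function_to_apply='none'` asks the pipeline to return the raw model output rather than passing it through a function such as a sigmoid or softmax, which is what a regression head needs. Depending on the model configuration, the pipeline's default may apply such a function. The sketch below uses made-up outputs to show why squashing would be unhelpful for essay scores.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

raw_outputs = [2.2, 3.0, 3.9]  # made-up regression outputs on the score scale

for raw in raw_outputs:
    print(f"raw={raw:.1f}  after sigmoid={sigmoid(raw):.3f}")
```

After a sigmoid, every value falls between 0 and 1 and the differences between scores are compressed, so the predictions would no longer be comparable with the human ratings.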

In [39]:
for score_to_predict in ["grammar", "vocabulary"]:
    for i in range(5):
        model_path = f"../bin/kfold-{score_to_predict}-models/{score_to_predict}-model-{i:02}"
        df[f"{score_to_predict}-model-{i}"] = predict(ds, model_path)

  0%|          | 0/973 [00:00<?, ?it/s]


In [40]:
df

Unnamed: 0,text_id_original,Filename,Text,Overall_1,Cohesion_1,Syntax_1,Vocabulary_1,Phraseology_1,Grammar_1,Conventions_1,...,grammar-model-0,grammar-model-1,grammar-model-2,grammar-model-3,grammar-model-4,vocabulary-model-0,vocabulary-model-1,vocabulary-model-2,vocabulary-model-3,vocabulary-model-4
0,AAAUUP138190003116482836_OR,AAAUUP138190003116482836_OR.txt,"The famous Albert Einstein always said ""Imagin...",4,4,3,4,4,4,4,...,3.945891,3.906997,3.874162,3.937964,3.877669,3.740423,3.747329,3.594682,3.608412,3.749964
1,AAAUUP138190002068932807_OR,AAAUUP138190002068932807_OR.txt,People these days dont go outdoors as much as ...,3,3,3,3,3,3,3,...,4.177153,4.026624,4.047930,4.220422,4.240192,3.710368,3.769626,3.544382,3.738004,3.874523
2,AAAXMP138200002018982823_OR,AAAXMP138200002018982823_OR.txt,One of the things I want to acumplish in the f...,2,2,2,3,3,3,3,...,4.003042,3.524597,3.769304,3.703856,3.969286,3.273657,3.245494,3.113843,3.362435,3.353452
3,AAAUUP138180000043802140_OR,AAAUUP138180000043802140_OR.txt,"I think that success, are composed by failures...",3,4,3,3,3,4,3,...,2.944404,3.125247,3.005486,2.804622,2.910588,3.406393,3.464580,3.456613,3.629762,3.518577
4,AAAXMP138190000364752108_OR,AAAXMP138190000364752108_OR.txt,Which be the characteristic that show a person...,2,2,2,3,2,2,3,...,2.211505,2.137684,2.233499,2.097087,2.239886,2.730236,2.829128,2.726695,2.693180,2.789509
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
968,AAAUUP138180000261162117_OR,AAAUUP138180000261162117_OR.txt,"I believe music, drama, or art class should be...",4,4,4,4,4,4,4,...,4.047146,4.007442,4.117114,4.191662,3.901466,4.106008,3.950391,4.156456,4.176138,4.182713
969,AAAXMP138190000094922128_OR,AAAXMP138190000094922128_OR.txt,Some schools in different states in America th...,4,3,3,4,4,4,4,...,3.380976,3.192650,3.298422,3.492209,3.353168,3.253963,3.284392,3.280299,3.130947,3.180582
970,AAAXMP138190000120882107_OR,AAAXMP138190000120882107_OR.txt,First impressions are always important and can...,4,4,4,4,4,4,5,...,3.586772,3.680668,3.855306,4.028326,3.781683,3.617156,3.583234,3.596291,3.813251,3.465148
971,AAAXMP138200001632912845_OR,AAAXMP138200001632912845_OR.txt,Why is honesty important for you?\r\n\r\nI thi...,2,2,2,2,2,2,2,...,2.413828,2.485096,2.318097,2.371035,2.304187,2.422342,2.475149,2.417537,2.380605,2.596770


In [41]:
df.to_csv("../results/cfa.csv")
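
A note for whoever reads this file back: `to_csv` writes the DataFrame index by default, and reading such a file without `index_col` turns that index into a column named `Unnamed: 0`, like the one visible in the table above. A small sketch with toy data, using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Toy data (assumption), standing in for the real results table
toy = pd.DataFrame({'text_id_original': ['a', 'b'], 'grammar-model-0': [3.9, 4.2]})

buffer = io.StringIO()
toy.to_csv(buffer)  # default index=True writes the index as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a column called 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index instead

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is another way to avoid the extra column, if downstream code does not rely on it.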