# Model Training

In this notebook, we will train two models using the best hyperparameters for this dataset.

We will continue to use the wandb library to track our training runs.

In [5]:
# Import Classes for tokenization and model training
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Import DatasetDict, which will help us prepare our own dataset for training and evaluating machine learning models
from datasets import DatasetDict

# Import the function used to compute the evaluation metric (mean squared error)
from sklearn.metrics import mean_squared_error

# Import library to track our training runs and change settings
import wandb

# Replace the variables below with your own: name, project name, and project directory
%env WANDB_ENTITY = langdon
%env WANDB_PROJECT = ellipse
%env WANDB_DIR = /home/jovyan/active-projects/ellipse-methods-showcase/bin

env: WANDB_ENTITY=langdon
env: WANDB_PROJECT=ellipse
env: WANDB_DIR=/home/jovyan/active-projects/ellipse-methods-showcase/bin
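
The %env magic only works inside IPython or Jupyter. If this code is moved into a plain Python script, the same variables can be set through os.environ. The following is a minimal sketch that reuses the values shown above; replace them with your own.

```python
import os

# Same settings as the %env lines above; replace them with your own values.
wandb_settings = {
    "WANDB_ENTITY": "langdon",
    "WANDB_PROJECT": "ellipse",
    "WANDB_DIR": "/home/jovyan/active-projects/ellipse-methods-showcase/bin",
}

# Set the variables before training starts so that wandb can pick them up.
os.environ.update(wandb_settings)

for name in wandb_settings:
    print(f"{name}={os.environ[name]}")
```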


## Load DatasetDict and Tokenize

We could have tokenized our DatasetDict when we created the dataset partitions, but deferring tokenization until training gives us the flexibility to try out different models that may require different tokenization schemes.

In [6]:
# Initialize tokenizer and create helper function for tokenization as we did in the previous notebooks.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_inputs(example):
    return tokenizer(example['text'], max_length=512, truncation=True)

In [13]:
def get_datadict(score_to_predict):
    ''' Selects a target score that the model should predict and renames that score to 'label'.
    Removes other columns from the dataset. The other columns are not needed for training.
    '''
    
    # All score columns; every one except the target will be removed from the dataset
    scores = {
        'Overall',
        'Cohesion',
        'Syntax',
        'Vocabulary',
        'Phraseology',
        'Grammar',
        'Conventions'
    }
    
    columns_to_remove = scores.symmetric_difference([score_to_predict])
    
    # Load the DatasetDict object we created in the previous notebook. 
    # We will be removing the columns that we defined above, and renaming the target column (=score_to_predict) into 'label'
    dd = (DatasetDict
          .load_from_disk('../data/ellipse.hf')
          .remove_columns(columns_to_remove)
          .map(tokenize_inputs, remove_columns=['text_id', 'text']) # the transformer does not need these columns to train.
          .rename_column(score_to_predict, 'label') # The Trainer reads the target from the 'label' column to compute the loss and metrics.
         )
    
    return dd
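
The column selection can be checked without loading the dataset. The sketch below uses only built-in sets to show which columns end up in the removal list for a valid target and for a misspelled one. The helper name columns_to_drop and the misspelled target 'Gramar' are hypothetical and exist only for this illustration.

```python
# All score columns, as listed in get_datadict() above.
scores = {
    'Overall', 'Cohesion', 'Syntax', 'Vocabulary',
    'Phraseology', 'Grammar', 'Conventions',
}

def columns_to_drop(score_to_predict):
    """Hypothetical helper: the columns that get_datadict() would remove."""
    return scores.symmetric_difference([score_to_predict])

# A valid target is taken out of the set, leaving the six other scores.
print(sorted(columns_to_drop('Grammar')))

# A misspelled target is not in the set, so symmetric_difference ADDS it.
# The removal list then names a column that does not exist in the dataset.
print(sorted(columns_to_drop('Gramar')))
```

Checking that score_to_predict is in scores at the top of get_datadict would be one way to catch a misspelled target early.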

## Compute Metric

By default, Huggingface's hyperparameter search evaluates models based on the sum of the metrics produced by this function. When choosing the best checkpoint during training, it uses the evaluation loss unless we name a different metric.

We only have one metric (MSE), but if other metrics are included (like r-squared), Huggingface needs to know which metric to use: MSE should be minimized and r-squared should be maximized, so summing these values would create a nonsense metric. We will specify the metric when we configure the training arguments.

In [10]:
def compute_metrics(eval_pred):
    preds, labels = eval_pred
    mse = mean_squared_error(labels, preds)

    return {'mse': mse}
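
To see why a summed metric is hard to interpret, the sketch below computes MSE and r-squared by hand for two sets of predictions. The scores are made up for illustration and are not ELLIPSE data; the function names are likewise only for this example.

```python
def mse(labels, preds):
    """Mean squared error: lower is better, 0 is a perfect fit."""
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)

def r_squared(labels, preds):
    """Coefficient of determination: higher is better, 1 is a perfect fit."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Made-up scores for illustration only.
labels = [2.0, 3.0, 4.0, 5.0]
good_preds = [2.0, 3.0, 4.0, 5.0]   # perfect predictions
poor_preds = [3.0, 3.0, 4.0, 4.0]   # off by one point on two essays

for name, preds in [('good', good_preds), ('poor', poor_preds)]:
    m, r = mse(labels, preds), r_squared(labels, preds)
    print(f'{name}: mse={m:.2f}, r2={r:.2f}, sum={m + r:.2f}')
```

The two metrics move in opposite directions as predictions get worse, so their sum separates the two models far less clearly than either metric does on its own.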

## Train Function


In [14]:
def train(score_to_predict):
    # since we create the model from_pretrained() within the train() function, we do not need a model_init()
    model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)
    
    # load in the dataset we created before with the target column's name changed to 'label'
    datadict = get_datadict(score_to_predict)
    
    training_args = TrainingArguments(
        output_dir = f'../bin/{score_to_predict.lower()}-checkpoints',
        optim = 'adamw_torch',
        logging_dir = f'../logs/{score_to_predict}',
        load_best_model_at_end = True,
        metric_for_best_model = 'mse', # be sure to set this value if compute_metrics returns multiple metrics.
        evaluation_strategy='epoch',
        save_strategy='epoch',
        save_total_limit=1, # keep the best checkpoint; the most recent one may also be retained
        greater_is_better = False,
        log_level = 'error',
        disable_tqdm = False,
        report_to='wandb',
        num_train_epochs=2, # tuned
        learning_rate=5e-5, # tuned
        per_device_train_batch_size=16, # tuned
        per_device_eval_batch_size=16,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=datadict['train'],
        eval_dataset=datadict['dev'],
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
    )

    trainer.train()
    
    return trainer    

## Train Grammar

Finetune a model that predicts the 'Grammar' scores in the ELLIPSE corpus using the function we created above.

In [15]:
grammar_trainer = train('Grammar')

Map:   0%|          | 0/4537 [00:00<?, ? examples/s]

Map:   0%|          | 0/972 [00:00<?, ? examples/s]

Map:   0%|          | 0/973 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss | Mse |
|---|---|---|---|
| 1 | No log | 0.256704 | 0.256704 |
| 2 | 0.353500 | 0.239764 | 0.239764 |
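
The 'No log' entry for epoch 1 is most likely a logging artifact rather than a problem. The sketch below works through the step arithmetic under three assumptions that are not stated in the notebook: the first split reported by the progress bars (4537 examples) is the training set, training runs on a single device without gradient accumulation, and logging_steps keeps its TrainingArguments default of 500.

```python
import math

# Values taken from this notebook.
train_examples = 4537   # assumed to be the training split
batch_size = 16         # per_device_train_batch_size
epochs = 2              # num_train_epochs

# Assumption: logging_steps is left at the TrainingArguments default of 500,
# with a single device and no gradient accumulation.
logging_steps = 500

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(f'steps per epoch: {steps_per_epoch}')
print(f'total steps: {total_steps}')
print(f'training loss first logged during epoch {first_logged_epoch}')
```

Under these assumptions the first training-loss log falls inside the second epoch, which would explain why the first row has no training loss. Passing a smaller logging_steps value to TrainingArguments should make the loss appear earlier.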


## Train Vocabulary

We can use the same approach to finetune a model that predicts the 'Vocabulary' scores. We will assume that the optimal hyperparameters are similar for different scores on this dataset.

In [16]:
vocab_trainer = train('Vocabulary')

Map:   0%|          | 0/4537 [00:00<?, ? examples/s]

Map:   0%|          | 0/972 [00:00<?, ? examples/s]

Map:   0%|          | 0/973 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss | Mse |
|---|---|---|---|
| 1 | No log | 0.210288 | 0.210288 |
| 2 | 0.288800 | 0.211494 | 0.211494 |
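
The validation MSE for Vocabulary rose slightly in the second epoch, so with load_best_model_at_end = True and greater_is_better = False the trainer should return the epoch 1 checkpoint. The sketch below illustrates that selection rule on the logged values. The best_epoch function is a simplified stand-in written for this example, not the Trainer's own code.

```python
# Validation MSE per epoch, copied from the two training logs above.
eval_mse = {
    'Grammar': {1: 0.256704, 2: 0.239764},
    'Vocabulary': {1: 0.210288, 2: 0.211494},
}

def best_epoch(mse_by_epoch, greater_is_better=False):
    """Simplified stand-in for checkpoint selection, not the Trainer's own code."""
    pick = max if greater_is_better else min
    return pick(mse_by_epoch, key=mse_by_epoch.get)

for score, mse_by_epoch in eval_mse.items():
    epoch = best_epoch(mse_by_epoch)
    rmse = mse_by_epoch[epoch] ** 0.5  # RMSE is in the same units as the scores
    print(f'{score}: best epoch {epoch}, RMSE {rmse:.3f}')
```

The square root of the MSE is often easier to read than the MSE itself because it is expressed on the same scale as the scores.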


## Save

Checkpoints were automatically saved to '../bin/{score_to_predict.lower()}-checkpoints/', but we can save the final models under clearer names.

In [17]:
grammar_trainer.save_model('../bin/grammar-model')

In [18]:
vocab_trainer.save_model('../bin/vocabulary-model')