# ASAP 2.0 analyses with hyper-parameters

Purpose

    - Use hyperparameter search findings on the ASAP 2.0 dataset
    - Develop LLM scoring models for the ASAP 2.0 data using a fine-tuned transformer.

Using DeBERTa-v3-large

    - With RoBERTa-large as a backup

We have two approaches here

1. Fine-tuned model without hyperparameter tuning
2. Fine-tuned model with hyperparameter tuning


## Dataset is ASAP 2.0.

https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/data

But the dataset used here includes all the data (the training set plus the public and private test partitions).

~25,000 source-based essays

Scored from 1-6

Training and test sets defined

A variety of demographic information available including

  - economic disadvantage
  - disability (both gifted and challenged)
  - ELL
  - Race
  - Gender
  - Grade

## Install Packages

If you use the hugging_face environment, you will not need to install packages

![image.png](attachment:a741233c-05af-4e06-92ca-a46899719a29.png)

This was set up using this [link](https://github.com/learlab/development-server/wiki/Using-JupyterHub#virtual-environments-in-python)

In [None]:

#!pip install transformers[torch]
#!pip install --quiet transformers[torch] #datasets evaluate pingouin


## Import packages

In [None]:
#install transformers and other packages that might not exist
#!pip install transformers
#!pip install datasets
#!pip install evaluate
#!pip install transformers[sentencepiece]

In [5]:
#had some problems and needed to upgrade some stuff

#!pip install --upgrade pyarrow
#!pip uninstall -y datasets
#!pip install datasets


In [1]:
#!pip install scikit-learn


#import libraries that are not problematic

import pandas as pd # You know what this is
import sklearn
import numpy as np # Numpy is for whenever you have numbers in Python
import seaborn as sns # Plotting library based on MatPlotLib
from scipy import stats # Statistical distributions, functions, and a few tests
from IPython.display import display # print(), but for HTML output (like Pandas dataframes)

In [2]:
# The following packages and modules are all from HuggingFace

# A class for managing data with lots of useful features for model training
#!pip install --upgrade datasets
import datasets
from datasets import load_dataset, Dataset, DatasetDict



In [4]:

# The following packages and modules are all from HuggingFace

# We import six classes from transformers.
from transformers import (
    # All language models start with tokens (by definition)
    AutoTokenizer,
    # Convenience class for creating and using transformer-based sequence classifiers
    AutoModelForSequenceClassification,
    # Configuration class for managing, you guessed it, training arguments
    TrainingArguments,
    # A class that abstracts away the PyTorch training loop.
    Trainer,
    # A data collator organizes (collates) the data into batches for training
    # It will also add special [PAD] tokens to make all the sequences in a batch the same length
    # Batches with equal-length sequences make our GPU go brr
    DataCollatorWithPadding,
    DefaultDataCollator
)

# A library for performance evaluation metrics
# Especially useful for complex and task-specific metrics
# For example, the GLUE metric can score language models on a suite of popular benchmarks
# But we will just use some simple metrics today
import evaluate

# Pipelines are used to streamline tokenization and inference
# We will create a pipeline after finetuning!
from transformers import pipeline





## Load and Prepare ASAP Corpus

In [5]:
asap_df = (
    pd.read_csv("ASAP2_competitiondf_with-metadata.csv"))

asap_df

Unnamed: 0,essay_id,score,full_text,set,pubpriv,assignment,prompt_name,economically_disadvantaged,student_disability_status,ell_status,race_ethnicity,gender,grade_level,essay_word_count,source,task
0,AAAVUP14319000159574,4,The author suggests that studying Venus is wor...,train,0,"In ""The Challenge of Exploring Venus,"" the aut...",Exploring Venus,Economically disadvantaged,Identified as having disability,No,Black/African American,F,10.0,409.0,PERSUADE,Text dependent
1,AAAVUP14319000159542,2,NASA is fighting to be alble to to go to Venus...,train,0,"In ""The Challenge of Exploring Venus,"" the aut...",Exploring Venus,Not economically disadvantaged,Not identified as having disability,No,Hispanic/Latino,F,10.0,197.0,PERSUADE,Text dependent
2,AAAVUP14319000159461,3,"""The Evening Star"", is one of the brightest po...",test,private,"In ""The Challenge of Exploring Venus,"" the aut...",Exploring Venus,Economically disadvantaged,Identified as having disability,No,White,M,10.0,361.0,MI,Text dependent
3,AAAVUP14319000159420,2,The author supports this idea because from rea...,train,0,"In ""The Challenge of Exploring Venus,"" the aut...",Exploring Venus,Economically disadvantaged,Not identified as having disability,Yes,Hispanic/Latino,F,10.0,209.0,MI,Text dependent
4,AAAVUP14319000159419,2,How the author supports this idea is that he s...,train,0,"In ""The Challenge of Exploring Venus,"" the aut...",Exploring Venus,Economically disadvantaged,Not identified as having disability,Yes,Hispanic/Latino,M,10.0,214.0,MI,Text dependent
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24723,5022045,3,Everyone should have an opinion on what they t...,train,0,Write a letter to your state senator in which ...,Does the electoral college work?,,,No,Asian/Pacific Islander,M,9.0,355.0,PERSUADE,Text dependent
24724,5022027,4,The debate on the pros and cons of car usage h...,train,0,Write an explanatory essay to inform fellow ci...,Car-free cities,,,No,Hispanic/Latino,F,10.0,303.0,PERSUADE,Text dependent
24725,5022023,1,dear senator:\n\nthe electoral college most be...,train,0,Write a letter to your state senator in which ...,Does the electoral college work?,,,Yes,Hispanic/Latino,F,9.0,256.0,PERSUADE,Text dependent
24726,5022018,2,"New cars are invented almost everyday, some ar...",train,0,Write an explanatory essay to inform fellow ci...,Car-free cities,,,No,Hispanic/Latino,F,10.0,253.0,PERSUADE,Text dependent


### Basic descriptives

In [None]:
#number of words

mean_nw = asap_df['essay_word_count'].mean()
sd_nw = asap_df['essay_word_count'].std()
max_nw = asap_df['essay_word_count'].max()

print(f'mean number of words = {mean_nw}')
print(f'sd for number of words = {sd_nw}')
print(f'max for number of words = {max_nw}')


#how many essays are greater than 512 words? (almost 4k)

llm_max = 512

llm_over = (asap_df['essay_word_count'] > llm_max).sum()
print(f'essays that are over 512 words = {llm_over}')


In [None]:
#statistics on the data itself

prompt_count = asap_df['prompt_name'].value_counts()
ed_count = asap_df['economically_disadvantaged'].value_counts()
sd_count = asap_df['student_disability_status'].value_counts()
ell_count = asap_df['ell_status'].value_counts()
re_count = asap_df['race_ethnicity'].value_counts()
gen_count = asap_df['gender'].value_counts()
grade_count = asap_df['grade_level'].value_counts()

print(prompt_count)
print("\n")
print(ed_count)
print("\n")
print(sd_count)
print("\n")
print(ell_count)
print("\n")
print(re_count)
print("\n")
print(gen_count)
print("\n")
print(grade_count)

In [6]:
#get columns you want and rename them using correct conventions
asap_df2 = asap_df[['full_text', 'score', 'pubpriv']].rename(
    columns={
        'full_text': 'text',
        'score': 'label'})

asap_df2


Unnamed: 0,text,label,pubpriv
0,The author suggests that studying Venus is wor...,4,0
1,NASA is fighting to be alble to to go to Venus...,2,0
2,"""The Evening Star"", is one of the brightest po...",3,private
3,The author supports this idea because from rea...,2,0
4,How the author supports this idea is that he s...,2,0
...,...,...,...
24723,Everyone should have an opinion on what they t...,3,0
24724,The debate on the pros and cons of car usage h...,4,0
24725,dear senator:\n\nthe electoral college most be...,1,0
24726,"New cars are invented almost everyday, some ar...",2,0


In [7]:
print(asap_df2.dtypes)

#label is an int64, but it needs to be float64 for the transformer to work

asap_df2['label'] = asap_df2['label'].astype('float64')

print(asap_df2.dtypes)

text       object
label       int64
pubpriv    object
dtype: object
text        object
label      float64
pubpriv     object
dtype: object


### Get training and test sets

These are already defined in the dataset. We just need to divide and name them.

This split includes a validation (dev) set used for hyperparameter tuning.

The plain training and test sets are different and are found under the `set` column.



In [8]:

asap_dd = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(asap_df2[asap_df2["pubpriv"] == "0"]),
    "dev": datasets.Dataset.from_pandas(asap_df2[asap_df2["pubpriv"] == "public"]),
    "test": datasets.Dataset.from_pandas(asap_df2[asap_df2["pubpriv"] == "private"])
})

In [8]:
#! View the structure of the train set
#! display() the first (0th) example

display(asap_dd["train"]) #features
display(asap_dd["train"][0]) #text

Dataset({
    features: ['text', 'label', 'pubpriv', '__index_level_0__'],
    num_rows: 17307
})

{'text': 'The author suggests that studying Venus is worthy enough even though it is very dangerous. The author mentioned that on the planet\'s surface, temperatures average over 800 degrees Fahrenheit, and the atmospheric pressure is 90 times greater than what we experience on our own planet . His solution to survive this weather that is dangerous to us humans is to allow them to float above the fray. A "blimp-like" vehicle hovering 30 or so miles would help avoid the unfriendly ground conditions . At thirty-plus miles above the surface, temperatures would still be toasty at around 170 degrees Fahrenheit, but the air pressure would be close to that of sea level on Earth. So not easy conditions, but survivable enough for humans. So this would help make the mission capeable of completing.\n\nHe also mentions how peering at venus from a ship orbiting or hovering safely far above the planet can provide only limited insight on ground conditions because most forms of light cannot penertrate

In [None]:
#! Find an essay with the lowest score in the training set.

# Loop
min_writing = 1 #lowest possible score; change this number to see essays with other scores
min_sample = None #stays None if no essay has this score
for sample in asap_dd["train"]:
  if sample["label"] == min_writing:
    min_sample = sample
    break #stop at the first match
display(min_sample)


In [9]:
# Training regression-type models requires a floating point response variable
# The Type is "Value", which comes from the datasets library
asap_dd["train"].features['label']

Value(dtype='float64', id=None)

### Pre-process inputs (tokenization)

This is for train, dev, and test sets as set up.

`transformers` will handle tokenization, but it's worth looking at what it does. See the Hugging Face tokenization course [here](https://huggingface.co/course/chapter2/4?fw=pt).

Different models require different types of tokenization. You need to choose a model before you tokenize. 

This will use [deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) from Hugging Face.

deberta-v3-large (microsoft/deberta-v3-large) caused some problems at first; `roberta-large` is the backup.

In [10]:
# Instantiate tokenizer by downloading tokenizer config files from HF Hub
#tokenizer = AutoTokenizer.from_pretrained('roberta-large') #roberta works
#deberta requires transformers[sentencepiece]
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
#alternative: load the slow tokenizer explicitly
#tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large", use_fast=False)
#transformers now uses fast tokenizers by default, but the deberta tokenizer is a slow tokenizer
tokenizer.name_or_path #what tokenizer are we using?



'microsoft/deberta-v3-large'

In [11]:
# Define tokenizing function
# Models usually have a max length (the number of tokens they can process per sample)
# We cap inputs at 512 tokens (the roberta-large maximum), which is about 300-400 words.
# Many essays will be longer (almost 4K essays are over 512 words)

def tokenize_inputs(example):
    # This will automatically truncate documents over 512 tokens
    # Truncation is a limitation.
    # If you truncate, clearly describe how many texts are truncated, and by how much.
    # max_length=512 cuts essays to 512 tokens (not words)
    return tokenizer(example['text'], truncation=True, max_length=512)
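The advice above about reporting truncation can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: `truncation_report` and its `tokenize` argument are hypothetical names, and the toy run uses whitespace splitting with a tiny limit so it works without downloading a tokenizer.

```python
from statistics import mean

def truncation_report(texts, tokenize, max_length=512):
    """Summarize how many texts exceed max_length tokens and by how much.

    `tokenize` is any callable that maps a string to a list of tokens or token IDs.
    """
    lengths = [len(tokenize(text)) for text in texts]
    # Tokens lost per truncated text (only texts longer than max_length)
    lost = [n - max_length for n in lengths if n > max_length]
    return {
        "n_texts": len(lengths),
        "n_truncated": len(lost),
        "pct_truncated": 100 * len(lost) / len(lengths) if lengths else 0.0,
        "mean_tokens_lost": mean(lost) if lost else 0.0,
        "max_tokens_lost": max(lost) if lost else 0,
    }

# Toy demonstration: whitespace "tokenizer" and a limit of 4 tokens
toy_texts = ["a b c", "a b c d e f", "a"]
toy_report = truncation_report(toy_texts, str.split, max_length=4)
print(toy_report)
```

With the notebook's tokenizer, the same helper could presumably be called with `lambda t: tokenizer(t, truncation=False)["input_ids"]` as the `tokenize` argument, so that lengths are counted in model tokens rather than words.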

In [12]:
# Do the tokenizing using DatasetDict.map()
# We remove "text" because it is not used by the transformer.
# Transformers operate on token_ids
asap_dd_tokenized = asap_dd.map(tokenize_inputs, batched=True, remove_columns=['text'])

Map: 100%|██████████| 17307/17307 [00:07<00:00, 2450.93 examples/s]
Map: 100%|██████████| 2257/2257 [00:00<00:00, 2835.08 examples/s]
Map: 100%|██████████| 5164/5164 [00:01<00:00, 2893.18 examples/s]


In [13]:
#what's in the data

asap_dd_tokenized

DatasetDict({
    train: Dataset({
        features: ['label', 'pubpriv', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 17307
    })
    dev: Dataset({
        features: ['label', 'pubpriv', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2257
    })
    test: Dataset({
        features: ['label', 'pubpriv', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5164
    })
})

After tokenization, text samples are transformed as follows:

The **text** field is replaced with three new fields:

*   **input_ids**

The `input_ids` field contains the tokenized input sequences represented as token IDs.

*   **token_type_ids**

The `token_type_ids` field marks which segment each token belongs to. Each essay is a single segment, so it is all 0s here.

*   **attention_mask**

The `attention_mask` field contains attention masks, which indicate which tokens should be attended to and which should be ignored during model training. In this case, the attention mask is all 1s, indicating that every token is attended to, because no padding has been added yet.

In [None]:
#! Grab the 0th item in the train portion of our tokenized dataset and iterate over its .items()
#! print() key and value from inside this loop to see what the tokenized inputs look like.

for key, value in asap_dd_tokenized["train"][0].items():
  print(key, value)

In [13]:
#what is the max length of the input ids (i.e., did they truncate)

max_length = max([len(n['input_ids']) for n in asap_dd_tokenized['train']] +
                 [len(n['input_ids']) for n in asap_dd_tokenized['dev']] +
                 [len(n['input_ids']) for n in asap_dd_tokenized['test']])

print(max_length)

512


### Data Collator
A data collator feeds the data to the language model. There are some interesting optimizations that can be made [here](https://huggingface.co/course/chapter3/2?fw=pt).
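To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch of what a padding collator does to one batch. It is an illustration only: `pad_batch` is a hypothetical helper, `pad_token_id=0` and the token IDs are made-up values, and the real `DataCollatorWithPadding` uses the tokenizer's own pad token and returns tensors.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        padded.append(list(ids) + [pad_token_id] * n_pad)
        # 1 = real token (attend), 0 = padding (ignore)
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}

# Two toy sequences of different lengths (IDs are illustrative)
toy_batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Because padding is applied per batch, a batch of short essays is padded only to its own longest member rather than to the global 512-token limit.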




In [14]:
# Instantiate data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

#This data collator is used to dynamically pad input sequences to the maximum length within each batch during training
#ensures that all sequences within a batch have the same length by adding padding tokens to the shorter sequences
#DataCollatorWithPadding handles variable-length input sequences and avoids the need for manual padding (truncation was already done by the tokenizer)


## THE BELOW DOES NOT INVOLVE HYPERPARAMETER TUNING

**It is based only on a training and test set.**

The hyperparameter tuning code is farther down.

If you want to run just training and test sets, you would need to do the following

1. Grab the `set` column (and not `pubpriv`)

#get columns you want and rename them using correct conventions
asap_df2 = asap_df[['full_text', 'score', 'set']].rename(
    columns={
        'full_text': 'text',
        'score': 'label'})

asap_df2

2. Set up training and test datasets only

asap_dd = datasets.DatasetDict({
   "train": datasets.Dataset.from_pandas(asap_df2[asap_df2["set"] == "train"]),
   "test": datasets.Dataset.from_pandas(asap_df2[asap_df2["set"] == "test"])
})

3. Rerun tokenization

## Set up Training



### Define model and task


### `AutoModelForSequenceClassification.from_pretrained()`

1.   Downloads the model you specify from the HuggingFace Hub. Downloading a model means downloading the pretrained model weights, as well as some configuration files. The model architecture is already described inside the `transformers` library.
2.   Discards the language modeling head of the model.
3.   Creates a brand new sequence classification head with randomly initialized weights -- you will see some warnings about this to remind you that you need to train the model!

### `model_init`
We use a `model_init()` function here instead of loading the model directly. This way, we get a fresh copy of the pretrained model every time we rerun later code cells. In other words, we guarantee that every training run starts from the Hugging Face checkpoint weights. This prevents us from accidentally resuming training when we don't mean to!


In [None]:
# Define a model init function, which will help us start every training run from the pretrained checkpoint
def model_init():
  return AutoModelForSequenceClassification.from_pretrained(
      'microsoft/deberta-v3-large',
      #'roberta-large', # the name of the model on HuggingFace Hub
      num_labels=1 # Regression is just classification with a single, continuous label. So intuitive!
      )
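The reason for passing a factory function rather than a model object can be sketched with a toy stand-in. Everything here is hypothetical (`ToyModel`, `toy_model_init`, `toy_train`); it only mimics the idea and is not how `Trainer` is implemented.

```python
class ToyModel:
    """Stand-in for a pretrained model; counts how many training runs touched it."""
    def __init__(self):
        self.runs_seen = 0

def toy_model_init():
    # Called at the start of every run, so each run gets a fresh object
    return ToyModel()

def toy_train(model=None, model_init=None):
    # Build a fresh model if a factory was given, otherwise reuse the passed object
    current = model_init() if model_init is not None else model
    current.runs_seen += 1
    return current

shared = ToyModel()
toy_train(model=shared)
second = toy_train(model=shared)              # continues from the already-trained object
fresh = toy_train(model_init=toy_model_init)  # starts over from the "checkpoint"
print(second.runs_seen, fresh.runs_seen)
```

The shared object accumulates training across runs, while the factory version always starts from a clean state.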

### Define Metrics

Let's consider what metrics might be useful for training this model. Lots of metrics are available to us through [HF Datasets Metrics repo](https://huggingface.co/metrics).

In [None]:
# Load some useful performance metrics using evaluate library
# These metrics are complementary. They tell us different things about model performance.
metrics = evaluate.combine(
    #combine multiple evaluation metrics into a single evaluation object
    {
        # Mean Squared Error (MSE); lower is better; more sensitive to few big errors
        # squared=False is a compute-time option, so passing it to evaluate.load() did not
        # give RMSE (the results report eval_mse). Take the square root of MSE to get RMSE.
        "MSE" : evaluate.load("mse"),
        # Mean Absolute Error; lower is better; less sensitive to few big errors
        "MAE": evaluate.load('mae')
    }
)

# Define compute_metrics()
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Alternative option is to use sklearn metrics here
    return metrics.compute(predictions=logits, references=labels) #logits refer to the raw, unnormalized outputs of the final layer of model
                                                                  #labels are the true values being predicted.

In [None]:
for metric in metrics.evaluation_modules:
  print(f"{metric.name:_^80}") #spacer
  print(metric.description) #description of metric if curious

### Settings for Model Training
Now, let's set some settings that will govern the training loop. This includes practical considerations such as:
* Where the model should be saved
* How to handle logging
* How often to assess the performance of the model

This is also where we could change hyperparameters for neural network training, including:
* Number of epochs to train. One epoch means the model will see the whole training dataset one time.
  * We will set this ourselves. The best value depends on the model and the dataset, but it is usually between 2 and 10.
* Learning rate. This controls how quickly the model learns.
  * We will set this ourselves. The default value is 5e-5 or 0.00005.
* Other optimizer parameters that control how the optimizer will adjust the model weights during training.
  * We use the default optimizer, called AdamW. But you might have already heard of another optimizer: stochastic gradient descent.
  * As an example, the AdamW optimizer has a `weight_decay` hyperparameter. This shrinks the weights slightly toward zero at every update, which acts as a penalty on large weights. This can prevent the model from trying so hard to (over)fit the data.
  * Since we do not specify a value for `weight_decay`, our `Trainer` will use its default value of 0.0 (no penalty).
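As a rough illustration of the `weight_decay` idea described in the list above, the sketch below shows only the decay part of an AdamW-style step (the gradient-based part is omitted). The function name and the value `weight_decay=0.01` are assumptions chosen for illustration; the notebook itself keeps the default of 0.0.

```python
def decoupled_weight_decay(weight, learning_rate, weight_decay):
    """Shrink a weight toward zero by the factor (1 - learning_rate * weight_decay)."""
    return weight * (1 - learning_rate * weight_decay)

w = 2.0
# Default used in this notebook: no decay, the weight is unchanged
no_decay = decoupled_weight_decay(w, learning_rate=5e-5, weight_decay=0.0)
# Illustrative non-zero value: the weight shrinks very slightly each step
with_decay = decoupled_weight_decay(w, learning_rate=5e-5, weight_decay=0.01)
print(no_decay, with_decay)
```

The per-step shrinkage is tiny, but it is applied at every one of the thousands of optimizer steps in a run.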

Remember that this section does not use the development set to tune hyperparameters, so we **must** set **all** hyperparameters *a priori*.

We do this through [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) and the [Trainer class.](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) Let's take a look!

In [None]:
#will need to know where to store output
import os

current_directory = os.getcwd()
print(current_directory)


In [None]:

training_args = TrainingArguments(
    # directory to save model checkpoints
    output_dir="/home/jovyan/active-projects/asap_scoring/fine_tuned_model",
    # how often to log. 'epoch' means that logging will happen at the end of every epoch
    logging_strategy='epoch',
    # how often to evaluate performance
    # (newer transformers releases rename this argument to eval_strategy)
    evaluation_strategy='epoch',
    # how often to save a model checkpoint
    save_strategy='no',
    # If an earlier checkpoint was better than the last one, load that checkpoint from disk
    # We cannot do this, because we don't have a development set
    load_best_model_at_end=False,
    # Not used; how to choose which model was best
    metric_for_best_model=None,
    # Not used; but lower RMSE is better
    greater_is_better=False,
    per_device_train_batch_size= 16, # 8 or 16 are typical. Switch to 4 to avoid OutOfMemoryError
    per_device_eval_batch_size= 16, # 8 or 16 are typical
    #! Try 1e-5, 5e-5, or 1e-4.
    learning_rate=5e-5,
    #! Try 2, 3, or 4 epochs
    num_train_epochs=3,
)


In [None]:

# We defined all these components in the code above and saved them as variables
# Add those variables as arguments to the Trainer call to make our trainer :)
trainer = Trainer(
    # notice that this is "model_init" and NOT "model_init()"
    # if we add the (), it will evaluate the function and provide the func output to the trainer
    # but we want to give the function itself to the trainer
    # functions are just objects, and we can pass them around by name
    model_init=model_init,
    args=training_args,
    data_collator=data_collator,
    train_dataset=asap_dd_tokenized["train"], #! the training dataset
    eval_dataset=asap_dd_tokenized["dev"], #! the evaluation set used during training (this should be a dev set)
    compute_metrics=compute_metrics,
)

## Finetune the model
finetune the model!


In [None]:
trainer.train()

Results

TrainOutput(global_step=3246, training_loss=0.34734542856081285, metrics={'train_runtime': 6000.1827, 'train_samples_per_second': 8.653, 'train_steps_per_second': 0.541, 'total_flos': 4.836970157388125e+16, 'train_loss': 0.34734542856081285, 'epoch': 3.0})
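The `global_step` value in the output above can be sanity-checked from the training set size, batch size, and number of epochs. This is a small sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `expected_global_steps` is a hypothetical helper name.

```python
import math

def expected_global_steps(n_train, batch_size, epochs):
    """Optimizer steps for a run: batches per epoch (rounded up) times epochs."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    return batches_per_epoch * epochs

# 17307 training essays, batch size 16, 3 epochs (values from this notebook)
steps = expected_global_steps(n_train=17307, batch_size=16, epochs=3)
print(steps)  # 3246, matching global_step in the TrainOutput
```

If the number reported by the trainer differs from this estimate, that usually points to a different effective batch size than intended.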

![image.png](attachment:cc642585-8c5b-4a9f-84b7-69fb5faac3a2.png)

In [None]:
#! Add the location where you want to save the finetuned writing model

trainer.save_model(
    output_dir="/home/jovyan/active-projects/asap_scoring/fine_tuned_model"
)


## Results
We can assess the performance of the model over a large number of inputs (e.g., the test set). Here, we look at performance on the test set to make sure the model has _learned_ from the data we've provided.

We do the following:

**First**, we call in the saved model and apply it to the test set.

**Second**, we run the trained model from the code above (i.e., the model in memory) on the test set (i.e., we do not call in the saved model).

### Calling in saved model

This is the easiest solution and will allow you to retrieve saved model as needed without retraining the model using all the steps above

In [None]:
#call in the saved model

#The chunk above saved three items
# 1. config.json: Configuration settings for transformer model like the model architecture
# 2. model.safetensors: weights and parameters of the trained transformer model
# 3. training_args.bin: the arguments and settings that were used during the training process of the transformer model
# learning rate, batch size, etc...

# the next cell wraps the loaded model in a new Trainer (it reuses the name trainer) for trainer.evaluate and trainer.predict

#set path
PATH = "/home/jovyan/active-projects/asap_scoring/fine_tuned_model"

num_labels = 1 #for regression analysis. This would change with classification
model = AutoModelForSequenceClassification.from_pretrained(PATH, num_labels=num_labels)
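Before loading from a saved directory, it can help to confirm that the three files listed in the comments above are actually there. The following is a minimal sketch using only the standard library; `missing_model_files` is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real model path.

```python
import os
import tempfile

EXPECTED_FILES = ("config.json", "model.safetensors", "training_args.bin")

def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are not present in model_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]

# Toy demonstration: a temporary directory that contains only config.json
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "config.json"), "w").close()
    toy_missing = missing_model_files(tmp_dir)
print(toy_missing)
```

In the notebook, the same check could be run as `missing_model_files(PATH)`; an empty list would suggest the save step completed.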

In [None]:
#set up a new trainer

trainer = Trainer(
    model=model, #calls in model
    args=training_args, #calls in training_args from code above 
    data_collator=data_collator, #calls in pre-defined data_collator (see code above)
    train_dataset=asap_dd_tokenized["train"], #! the training dataset
    eval_dataset=asap_dd_tokenized["test"], #! the testing set
    compute_metrics=compute_metrics,
)

#evaluates on test set
trainer.evaluate(
    asap_dd_tokenized["test"]
    )


![image.png](attachment:72d1e4d7-c1ab-4eac-a25f-c63e570cf6a2.png)


Results from above

### Continue with model that is in memory

This solution works, but if you come back to the code after it has been run and the server has shut down, you will have to retrain the model to get the variables.

In [None]:
# Evaluate the performance of the model on a given dataset (test)

trainer.evaluate(
    asap_dd_tokenized["test"]
    )

**Results**

![image.png](attachment:ef66198c-7484-410d-b4f8-d26829b5e038.png)

Here's what each of these components is telling us:

* `epoch`: the epoch number at which the evaluation is performed.

* `eval_loss`: the evaluation loss. The evaluation loss is a measure of how well the model performs on the evaluation dataset, with lower values indicating better performance.

* `eval_mse`: Why is this the same as `eval_loss`? Because mean-squared error was our loss function!

* `eval_mae`: Mean Absolute Error

* `eval_runtime`: the total runtime of the evaluation process in seconds.

* `eval_samples_per_second`: the number of samples processed per second.

* `eval_steps_per_second`: the number of batches processed per second.
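The relationship between the error metrics listed above can be sketched in plain Python. The toy predictions are made up for illustration; the only value taken from this notebook is the reported test MSE, whose square root gives the RMSE on the 1-6 score scale.

```python
import math

def mse(predictions, references):
    """Mean squared error: average of squared differences."""
    return sum((p - r) ** 2 for p, r in zip(predictions, references)) / len(references)

def mae(predictions, references):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(p - r) for p, r in zip(predictions, references)) / len(references)

# Toy data with one large error (5.0 vs 3.0) to show how MSE reacts to outliers
toy_pred = [1.5, 3.5, 2.0, 5.0]
toy_true = [1.0, 4.0, 2.0, 3.0]
toy_mse = mse(toy_pred, toy_true)
toy_rmse = math.sqrt(toy_mse)
toy_mae = mae(toy_pred, toy_true)
print(toy_mse, toy_rmse, toy_mae)

# RMSE for this notebook's test set, from the reported test_mse
reported_rmse = math.sqrt(0.3445844767259118)
print(round(reported_rmse, 3))
```

In the toy data the single two-point error pushes RMSE well above MAE, which is why the two metrics are described as complementary.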


But the above just gives us summary statistics. Let's get all the individual predictions.

In [None]:
#Generate predictions using the trained model on a given dataset

pred = trainer.predict(
    asap_dd_tokenized["test"]
)

print(pred.predictions)
print(pred.metrics)

[1.3956398 3.624475  2.5796957 ... 2.43714   3.2705112 2.3355937]
{'test_loss': 0.34458446502685547, 'test_mse': 0.3445844767259118, 'test_mae': 0.4533937437233308, 'test_runtime': 198.7241, 'test_samples_per_second': 25.986, 'test_steps_per_second': 1.625}

predictions are an array of floating point values. 

Each of these lists contains a single value for our predicted writing quality score.

In [None]:
# .reshape(-1) removes an empty dimension
# We get a one-dimensional array (similar to a list)
pred.predictions.reshape(-1)

In [None]:
# Create a dataframe for inspection
preds_df = pd.DataFrame(
    {
        'predicted':pred.predictions.reshape(-1), # flatten array
        'true':pred.label_ids,
        'text':asap_dd['test']['text'] #! Add the text column of the test partition here
    }
)
display(preds_df.sample(2)) # .sample(2) randomly selects 2 rows

In [None]:
preds_df.shape

In [None]:
#save dataframe for later use

preds_df.to_csv('non_hyper_parameterized_results_asap.csv', index=False)

### Statistical Analysis

Linear regression

In [None]:
# Run the simple linear regression
results = stats.linregress(preds_df["true"], preds_df["predicted"])

# Using an f-string to clearly display and round the relevant output
print(f"R-squared: {results.rvalue**2:.3f}, P-value= {results.pvalue:.3f}")

# Plot the data with a regression line
sns.lmplot(x='true', y='predicted', data=preds_df);

In [None]:
# r is the square root of the R-squared value printed above (.676)
correlation = np.sqrt(.676)

print(correlation)

**Results**

r = .822
r2 of .676
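The link between r and R-squared can be sketched without any libraries. The toy scores below are invented for illustration and `pearson_r` is a hypothetical helper; note that taking the square root of R-squared only recovers r when the correlation is positive, as it is here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy true scores and predictions (illustrative values only)
toy_true = [1, 2, 3, 4, 5, 6]
toy_pred = [1.4, 2.1, 2.6, 4.3, 4.4, 5.9]
r = pearson_r(toy_true, toy_pred)
r_squared = r ** 2
print(round(r, 3), round(r_squared, 3))

# The notebook's reported values: sqrt(.676) is about .822
print(round(math.sqrt(0.676), 3))
```

For simple linear regression with one predictor, `stats.linregress` reports this same r as `rvalue`, so the squared value matches the R-squared printed earlier.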

![image.png](attachment:05ee19cd-663b-4b27-9c67-472ac5e1de5e.png)
