# Metrics for regression

In this notebook, we will learn several metrics for evaluating regression models and how to implement them using the scikit-learn library.

As you know, **regression** is a type of machine learning that models the relationship between independent variables and a dependent variable. A regression problem is one where we have to **predict continuous values** such as a price, a rating, or a fee.

Sources:

- https://machinelearningmastery.com/regression-metrics-for-machine-learning/
- https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b
- https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-metrics-for-your-regression-model/

To illustrate the examples, we will use the MINT dataset, which contains tweets in several languages. Each tweet is annotated with an intimacy score. The task is to predict the intimacy level of a tweet.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Nov 14 16:09:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install datasets transformers

Successfully installed datasets-2.6.1 dill-0.3.5.1 huggingface-hub-0.10.1 multiprocess-0.70.13 responses-0.18.0 tokenizers-0.13.2 transformers-4.24.0 urllib3-1.25.11 xxhash-3.1.0


Let's fine-tune a multilingual BERT model with a single regression output on this dataset:

In [None]:
#1. load the dataset
from datasets import load_dataset
access_token = "hf_..."  # replace with your own Hugging Face access token; never publish a real token
# if the dataset was defined as public, use this:
# dataset = load_dataset("ISEGURA/edos", use_auth_token=True)
# if the dataset is private:
dataset = load_dataset("ISEGURA/mint", use_auth_token=access_token)

#2. get the labels
y_train = dataset['train']['label']
y_val = dataset['validation']['label']
y_test = dataset['test']['label']

#3. Encoding
from transformers import AutoTokenizer
model_name='bert-base-multilingual-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

MAX_LEN = 50

def tokenize(examples):
    # apply the tokenizer to the "text" field of each batch of examples
    # tweets are short, so we truncate and pad to MAX_LEN tokens, well below the 512-token maximum allowed by the model
    return tokenizer(examples["text"], truncation=True, max_length=MAX_LEN, padding='max_length')


data_encodings=dataset.map(tokenize, batched=True)

#4. model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 1).to("cuda")

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

#5 metrics for the model
def compute_metrics_for_regression(eval_pred):
    logits, labels = eval_pred
    labels = labels.reshape(-1, 1)

    mse = mean_squared_error(labels, logits)
    rmse = np.sqrt(mse)  # same value as squared=False, an argument that recent scikit-learn releases have deprecated
    mae = mean_absolute_error(labels, logits)
    r2 = r2_score(labels, logits)
    smape = 1/len(labels) * np.sum(2 * np.abs(logits-labels) / (np.abs(labels) + np.abs(logits))*100)

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2, "smape": smape}

from transformers import TrainingArguments

NUM_EPOCHS = 1 # we recommend at least 3

#6 Specify the arguments for the trainer
training_args = TrainingArguments(
    output_dir ='./results',          
    num_train_epochs = NUM_EPOCHS,     
    per_device_train_batch_size = 64,   
    per_device_eval_batch_size = 20,   
    weight_decay = 0.01,               
    learning_rate = 2e-5,
    logging_dir = './logs',            
    save_total_limit = 10,
    load_best_model_at_end = True,     
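    metric_for_best_model = 'rmse',
    greater_is_better = False,         # RMSE is an error: the best checkpoint is the one with the lowest value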
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    report_to = 'all',
) 

from transformers import Trainer

#7 Call the Trainer
trainer = Trainer(
    model = model,                         
    args = training_args,                  
    train_dataset = data_encodings['train'],         
    eval_dataset = data_encodings['validation'],          
    compute_metrics = compute_metrics_for_regression,     
)

#8. Train the model
trainer.train()


trainer.evaluate()

#9. evaluation on test dataset
trainer.eval_dataset = data_encodings['test']
trainer.evaluate()


Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model 

| Epoch | Training Loss | Validation Loss | Mse | Rmse | Mae | R2 | Smape |
|---|---|---|---|---|---|---|---|
| 1 | No log | 0.588005 | 0.588005 | 0.766815 | 0.612518 | 0.279253 | 29.192075 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: language, text. If language, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 940
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-104
Configuration saved in ./results/checkpoint-104/config.json
Model weights saved in ./results/checkpoint-104/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-104 (score: 0.7668147087097168).
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: language, text. If language, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
 

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: language, text. If language, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1908
  Batch size = 20


{'eval_loss': 0.5395293831825256,
 'eval_mse': 0.5395295023918152,
 'eval_rmse': 0.7345266938209534,
 'eval_mae': 0.5825014114379883,
 'eval_r2': 0.3140804132410826,
 'eval_smape': 28.398343324161427,
 'eval_runtime': 5.8261,
 'eval_samples_per_second': 327.491,
 'eval_steps_per_second': 16.478,
 'epoch': 1.0}
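
The `eval_smape` value above is the symmetric mean absolute percentage error (SMAPE) computed inside `compute_metrics_for_regression`. As a minimal sketch, the same formula can be written as a standalone function. The function name `smape` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, as a percentage."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Same expression as in compute_metrics_for_regression:
    # mean of 2*|pred - true| / (|true| + |pred|), scaled to a percentage.
    # Note: a pair where both values are exactly 0 would divide by zero.
    ratio = 2.0 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))
    return 100.0 * np.mean(ratio)

# Toy values (hypothetical, not taken from the MINT dataset)
toy_true = [1.0, 2.0, 4.0]
toy_pred = [1.0, 3.0, 2.0]
print(smape(toy_true, toy_pred))
```

Because each error is divided by the size of the values involved, SMAPE is a relative measure: the same absolute error counts for more on a small target than on a large one.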

In [None]:
import torch

def get_prediction(text):
    # tokenize the text with the same padding/truncation length used for training
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt").to("cuda")
    with torch.no_grad():          # inference only: no gradients needed
        outputs = model(**inputs)
    # outputs[0] holds the logits, a 1x1 tensor here; item() returns it as a Python float
    return outputs[0].item()

y_pred=[get_prediction(text) for text in dataset['test']['text']]
y_pred[:5]

[2.667494773864746,
 1.295894980430603,
 1.2254654169082642,
 2.8308157920837402,
 1.5095515251159668]

## Mean Absolute Error (MAE)
MAE is a simple metric: it is the average of the absolute differences between the actual and the predicted values.
To calculate it, sum the absolute errors and divide by the total number of observations.

<img src='https://miro.medium.com/max/723/1*9BhnZiaHkApC-gQt3rYpMQ.png' width=50%>



Let's calculate it:

In [None]:
total = 0
for (actual,predicted) in zip(y_test, y_pred):
    print(actual,predicted)
    total += abs(actual-predicted)

MAE = total / len(y_pred)

print('Mean Absolute Error (MAE) is: ', MAE)

1.3333333333333333 2.667494773864746
1.0 1.295894980430603
1.2 1.2254654169082642
2.0 2.8308157920837402
2.5 1.5095515251159668
4.8 2.2946910858154297
2.25 1.8477258682250977
4.2 2.94132137298584
2.2 2.945384979248047
2.2 1.6035940647125244
2.4 2.7736220359802246
2.8 2.534252405166626
1.0 2.4518301486968994
1.75 1.7596471309661865
2.2 2.926135540008545
1.0 1.312157392501831
1.6 2.934441566467285
1.4 2.461473226547241
3.6 3.0759530067443848
3.75 2.8150436878204346
3.25 2.4099173545837402
2.333333333333333 2.746135950088501
2.6 3.0151126384735107
1.8 2.8953709602355957
1.6 1.5490587949752808
1.6 2.0593104362487793
2.2 1.7647414207458496
1.6 1.5371410846710205
1.4 2.3097989559173584
2.6 2.3933181762695312
2.0 2.082341194152832
1.8 1.5151211023330688
3.2 1.7025281190872192
1.8 1.7128987312316895
1.8 2.431781530380249
1.4 2.556138277053833
1.5 1.6299192905426025
1.8 2.1930670738220215
1.0 1.5114408731460571
4.2 2.6186389923095703
1.0 1.3409490585327148
3.4 2.48052716255188
1.2 2.39038276672

As you can see, it is very easy to calculate. Moreover, sklearn already implements it for you:

In [None]:
from sklearn.metrics import mean_absolute_error
errors = mean_absolute_error(y_test, y_pred)
# report error
print(errors)

0.5823445220392621


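The goal is to make the MAE as small as possible, because it is a loss: lower values mean better predictions. MAE is expressed in the same units as the target, so here the predictions are off by about 0.58 intimacy points on average.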
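
The explicit loop above can also be written in vectorized form with NumPy. The sketch below uses made-up values rather than the MINT predictions, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Toy values (hypothetical, not taken from the MINT dataset)
toy_true = np.array([1.0, 2.0, 2.5, 4.0])
toy_pred = np.array([1.5, 2.0, 1.5, 3.0])

# Loop version, mirroring the cell above
total = 0.0
for actual, predicted in zip(toy_true, toy_pred):
    total += abs(actual - predicted)
mae_loop = total / len(toy_pred)

# Vectorized version: element-wise absolute error, then the mean
mae_vectorized = np.mean(np.abs(toy_true - toy_pred))

print(mae_loop, mae_vectorized)  # both 0.625
```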



## Mean Squared Error (MSE)
What does the MSE actually represent? It is the average of the squared differences between the actual and the predicted values. Squaring prevents positive and negative errors from cancelling each other out, and it penalizes large errors more heavily than small ones.

It is one of the most widely used metrics because it is very simple.

<img src='https://lh3.googleusercontent.com/-JBio3Q_1FiI/YB2oQKEmRBI/AAAAAAAAAkM/c8KJ3wPwtMEd3Ik0nYMMdmr_pRqMF6MlQCLcBGAsYHQ/w550-h177/image.png'>


In [None]:
sum_total = 0
for (actual,predicted) in zip(y_test, y_pred):
    sum_total += (actual-predicted)**2

MSE = sum_total / len(y_pred)

print('Mean Squared Error (MSE) is: ', MSE)

Mean Squared Error (MSE) is:  0.5387361828257454


Although it is very easy to calculate, please let sklearn do the work for you:

In [None]:
from sklearn.metrics import mean_squared_error
errors = mean_squared_error(y_test, y_pred)
# report error
print('mse:', errors)

mse: 0.5387361828257469
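
Because errors are squared before averaging, a single large error has a much bigger effect on MSE than on MAE. A minimal sketch with made-up numbers follows; the arrays and the helper names `mae` and `mse` are hypothetical.

```python
import numpy as np

y_true = np.array([2.0, 2.0, 2.0, 2.0])
pred_small_errors = np.array([2.5, 1.5, 2.5, 1.5])   # four errors of 0.5
pred_one_outlier = np.array([2.0, 2.0, 2.0, 4.0])    # a single error of 2.0

def mae(y, p):
    return np.mean(np.abs(y - p))

def mse(y, p):
    return np.mean((y - p) ** 2)

# Both sets of predictions have the same MAE (0.5) ...
print(mae(y_true, pred_small_errors), mae(y_true, pred_one_outlier))
# ... but the set with one large error has four times the MSE (0.25 vs 1.0)
print(mse(y_true, pred_small_errors), mse(y_true, pred_one_outlier))
```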


## Root Mean Squared Error (RMSE)

<img src='https://miro.medium.com/max/966/1*lqDsPkfXPGen32Uem1PTNg.png'>


https://abdatum.com/ciencia/rmse


In [None]:
import numpy as np

RMSE = np.sqrt(MSE) # MSE ** 0.5
print('Root Mean Squared Error (RMSE): ', RMSE)
print("RMSE", np.sqrt(mean_squared_error(y_test,y_pred)))


Root Mean Squared Error (RMSE):  0.7339865004383564
RMSE 0.7339865004383574
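
RMSE is the square root of MSE, which brings the error back to the units of the target. It is also never smaller than MAE on the same predictions. A small sketch with made-up values (all names below are hypothetical):

```python
import numpy as np

toy_true = np.array([1.0, 2.0, 3.0, 4.0])
toy_pred = np.array([1.0, 2.0, 3.0, 8.0])   # three exact predictions, one error of 4.0

errors = toy_true - toy_pred
mae = np.mean(np.abs(errors))    # 1.0
mse = np.mean(errors ** 2)       # 4.0, in squared units of the target
rmse = np.sqrt(mse)              # 2.0, back in the units of the target

print(mae, mse, rmse)
```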


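## Root Mean Squared Log Error (RMSLE)

RMSLE is the RMSE computed on log-transformed values: we take `log(1 + y)` of both the actual and the predicted values, and then compute the RMSE of those logarithms. Note that this is not the same as taking the log of the RMSE. Working on the log scale compresses large errors, so the metric is helpful when the targets have not been scaled and vary over a large range.

RMSLE measures relative rather than absolute error, and it is only defined for non-negative values.

To compute RMSLE, we take the square root of sklearn's `mean_squared_log_error`.


In [None]:
from sklearn.metrics import mean_squared_log_error

# mean_squared_log_error raises a ValueError if any actual or predicted value is negative
print("RMSLE", np.sqrt(mean_squared_log_error(y_test, y_pred)))
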
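
RMSLE is commonly defined as the root mean squared difference between `log(1 + y)` of the actual and of the predicted values. As a sketch of that definition, the block below computes it with NumPy. The function name `rmsle` and the toy values are assumptions for illustration. The example shows that the metric depends on the ratio between prediction and target rather than on the absolute gap.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which stays finite when x == 0
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# Hypothetical values: both predictions are off by the same ratio,
# (1 + pred) / (1 + true) = 2, but by very different absolute amounts.
small = rmsle([9.0], [19.0])        # absolute error of 10
large = rmsle([999.0], [1999.0])    # absolute error of 1000
print(small, large)                 # both are log(2), about 0.693
```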


## R Squared (R2)
The R2 score tells us how well the model performed.
That is, it measures the performance of the model, not its loss, so higher is better.

<img src='https://miro.medium.com/max/1200/1*_mVvAFVEGinHlijmmeWwzg.png'>

where:
- $y_{i}$ is the actual value.
- $\hat{y}_{i}$ is the predicted value.
- $\bar{y}$ is the mean of the actual $y$ values.

<img src='https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRYlppEmuer0Gf0XNuca0ENDiZwrPtdWK16ZDDiZ_6o2vxR4viREzIP6gGcrMlzxARZLQY&usqp=CAU'>

R2 compares our model against a simple baseline: a horizontal line that always predicts the mean of the target. This baseline plays a role similar to the threshold fixed at 0.5 in classification problems. So, basically, R2 measures how much better the regression line is than the mean line. Hence, R2 is also known as the **coefficient of determination**, or sometimes as goodness of fit.

You can use sklearn directly to calculate it:

In [None]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test,y_pred)
print(r2)

0.31508893576490404


**How do you interpret the R2 score?**

If the R2 score is zero, the squared error of the regression line equals the squared error of the mean line, so their ratio is 1 and 1 - 1 = 0. In this case both lines are equally good, which means that the model performs poorly: it does no better than always predicting the mean of the output column.

The second case is an R2 score of 1. This happens when the ratio is zero, that is, when the regression line does not make any mistake. In the real world this is rarely achievable.

So we can conclude that as **our regression line moves towards perfection, the R2 score moves towards one**, and the model performance improves.

The normal case is an R2 score between zero and one (for example, 0.8 means that the model is able to explain 80 per cent of the variance of the data). The score can also be negative when the model fits worse than the mean line.
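
To make this interpretation concrete, here is a minimal sketch that computes R2 by hand as 1 minus the ratio between the squared error of the model and the squared error of the mean line. The helper name `r2_manual` and the toy values are assumptions. It covers three cases: perfect predictions, always predicting the mean, and predictions that are worse than the mean.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean line
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])                   # mean is 2.5

r2_perfect = r2_manual(y, y)                         # no mistakes
r2_mean = r2_manual(y, np.full_like(y, y.mean()))    # always predicts the mean
r2_reversed = r2_manual(y, y[::-1])                  # predicts the values in reverse order

print(r2_perfect, r2_mean, r2_reversed)              # 1.0 0.0 -3.0
```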

## Pearson's R

We will use the Pearson's r implementation in SciPy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html), but it should also be fine if you use the Pearson's r in sklearn.

The Pearson correlation coefficient measures the linear relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.




In [None]:
from scipy import stats
res = stats.pearsonr(y_test, y_pred)
res

(0.5837135971971379, 1.1690059767931627e-174)
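
The first value returned by `pearsonr` is the correlation coefficient and the second is the p-value. As a sketch of how the coefficient is obtained, Pearson's r can be computed with NumPy as the sum of products of the centered series divided by the square root of the product of their sums of squares. The helper name `pearson_r` and the toy values are assumptions. The example also illustrates that r only measures linear association: predictions that are shifted and rescaled still reach r = 1 even though every individual prediction is wrong.

```python
import numpy as np

def pearson_r(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # center both series
    yc = y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

toy_true = np.array([1.0, 2.0, 3.0, 4.0])
toy_shifted = 2.0 * toy_true + 1.0     # wrong values, but perfectly linear in toy_true
toy_reversed = toy_true[::-1]          # decreases as toy_true increases

r_shifted = pearson_r(toy_true, toy_shifted)
r_reversed = pearson_r(toy_true, toy_reversed)

print(r_shifted)                                  # close to 1.0
print(r_reversed)                                 # close to -1.0
print(np.corrcoef(toy_true, toy_shifted)[0, 1])   # NumPy's built-in gives the same value
```

This is why Pearson's r is usually reported together with an error metric such as MAE or RMSE: a high correlation alone does not guarantee that the predicted values are close to the actual ones.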