# Statistical baseline
This notebook computes statistical baselines. It uses the mean of the target column (computed on the test data itself) as the prediction for every sample and scores that constant prediction. The measures in question are: RMSE, MAE, MaxAE, Huber loss (delta=1) and quantile loss (q=0.5). All are calculated on the full test dataset. TL;DR: What is the score if we just give the mean as a prediction?

Results for travel and abortion posts (the scores are the same for both):
|measure|adm|riv|
|-----|----|---|
|MAE|0.88|0.62|
|MaxAE|2.49|2.54|
|Huber Loss|0.49|0.27|
|Quantile Loss|0.44|0.31|
|RMSE|1.07|0.79|

Results for AI posts:
|measure|adm|riv|
|-----|----|---|
|MAE|0.92|0.62|
|MaxAE|2.38|2.55|
|Huber Loss|0.53|0.28|
|Quantile Loss|0.46|0.31|
|RMSE|1.11|0.81|
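
For reference, the five measures can be written out as below. This is a sketch in notation introduced here, not taken from the notebook: $y_i$ are the $n$ test targets, $\bar{y}$ is their mean (the constant prediction), $e_i = y_i - \bar{y}$, $\delta = 1$ and $q = 0.5$.

$$
\begin{aligned}
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} |e_i| \\
\mathrm{MaxAE} &= \max_{i} |e_i| \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}} \\
\mathrm{Huber}_{\delta} &= \frac{1}{n}\sum_{i=1}^{n}
\begin{cases}
\tfrac{1}{2} e_i^{2} & \text{if } |e_i| \le \delta \\
\delta\left(|e_i| - \tfrac{1}{2}\delta\right) & \text{otherwise}
\end{cases} \\
\mathrm{Quantile}_{q} &= \frac{1}{n}\sum_{i=1}^{n} \max\bigl(q\,e_i,\ (q-1)\,e_i\bigr)
\end{aligned}
$$

With $q = 0.5$ each term of the quantile loss reduces to $\tfrac{1}{2}|e_i|$, which is why the quantile loss values in the tables are half the MAE values.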


In [30]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import root_mean_squared_error

import torch
from torch.nn import HuberLoss

In [31]:
path_to_test = '../data/split/full_test.csv'

In [32]:
# load the test data
test_data = pd.read_csv(path_to_test)
test_data[['post_travel','adm','riv']].head()

| |post_travel|adm|riv|
|-----|----|---|---|
|0|Next stop-HAWAII!!|4.667|3.0|
|1|I recently traveled to fall creek falls and it...|3.778|1.778|
|2|Just got back from FIJI it was amazing!!|2.333|2.0|
|3|Recently got back from an amazing trip in Japa...|5.222|3.0|
|4|I went to Italy and it was amazing! Sorrento s...|2.0|3.111|


In [33]:
# initialize the Huber loss (delta=1.0, mean reduction by default)
huber_loss = HuberLoss(delta=1.0)
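
To make the Huber loss concrete, here is a minimal NumPy sketch of the piecewise definition with mean reduction. The function name `huber_loss_np` is hypothetical and is not used elsewhere in this notebook; it is meant as an illustration of the formula, not a replacement for `torch.nn.HuberLoss`.

```python
import numpy as np

def huber_loss_np(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    abs_error = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * abs_error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return float(np.mean(np.where(abs_error <= delta, quadratic, linear)))

# one error in the quadratic branch (0.5) and one in the linear branch (2.0)
example = huber_loss_np([0.5, 2.0], [0.0, 0.0])
print(example)  # (0.125 + 1.5) / 2 = 0.8125
```

Large errors contribute linearly rather than quadratically, so the Huber loss is less sensitive to outliers than the squared error.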

In [34]:
# define the quantile loss
def quantile_loss(y_true: torch.Tensor, y_pred: torch.Tensor, quantile=0.5):
    error = y_true - y_pred
    return torch.mean(torch.max(quantile * error, (quantile - 1) * error))
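
At `quantile=0.5` this loss is exactly half the MAE, which is visible in the results (for example MAE 0.883 and quantile loss 0.4415 for adm). The sketch below checks that relation on made-up numbers, using a hypothetical NumPy re-implementation `quantile_loss_np` of the function above.

```python
import numpy as np

def quantile_loss_np(y_true, y_pred, quantile=0.5):
    """Mean pinball (quantile) loss, mirroring the torch version above."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(quantile * error, (quantile - 1) * error)))

y_true = np.array([2.0, 3.5, 5.0, 1.5])          # made-up scores, not notebook data
y_pred = np.full(y_true.shape, y_true.mean())    # constant mean prediction (3.0)

mae = float(np.mean(np.abs(y_true - y_pred)))
pinball = quantile_loss_np(y_true, y_pred, quantile=0.5)
print(mae, pinball)  # 1.25 0.625
```

Because of this, the quantile loss at q=0.5 carries no information beyond the MAE; it only becomes a distinct measure for other quantiles.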

## Travel and abortion posts
### Narcissistic Admiration

In [37]:
# constant prediction: the mean of the adm column, repeated for every test row
mean_adm_tensor = np.full(test_data.shape[0], np.mean(test_data['adm']))

In [38]:
# calculate the mean absolute error
adm_mae = mean_absolute_error(test_data['adm'], mean_adm_tensor)
print(f'MAE: {adm_mae}')

MAE: 0.8830683902524239


In [39]:
adm_maxae = np.max(np.abs(test_data['adm'] - mean_adm_tensor))
print(f'MAXAE: {adm_maxae}')

MAXAE: 2.4863177570093455


In [40]:
adm_huber_loss_score = huber_loss(torch.tensor(test_data['adm'].values), 
                                  torch.tensor(mean_adm_tensor))

print(f'Huber Loss: {adm_huber_loss_score}')

Huber Loss: 0.49352194411053324


In [41]:
adm_quantile_loss_score = quantile_loss(torch.tensor(test_data['adm'].values), 
                                        torch.tensor(mean_adm_tensor), quantile=0.5)

print(f'Quantile Loss: {adm_quantile_loss_score}')

Quantile Loss: 0.4415341951262118


In [42]:
# RMSE
adm_rmse = root_mean_squared_error(test_data['adm'], mean_adm_tensor)
print(f'RMSE: {adm_rmse}')

RMSE: 1.0666058280772832
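
A side note on interpreting this number: when the prediction is the mean of the same data, the RMSE equals the population standard deviation of the target. A small check on made-up values (not the notebook data):

```python
import numpy as np

y = np.array([2.0, 3.5, 5.0, 1.5])        # made-up scores
pred = np.full(y.shape, y.mean())         # constant mean prediction

rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
population_std = float(np.std(y))         # np.std uses ddof=0 by default

print(np.isclose(rmse, population_std))
```

So the RMSE baseline here can be read as the spread of the target around its mean.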


### Narcissistic Rivalry

In [43]:
# constant prediction: the mean of the riv column, repeated for every test row
mean_riv_tensor = np.full(test_data.shape[0], np.mean(test_data['riv']))

In [44]:
riv_mae = mean_absolute_error(test_data['riv'], mean_riv_tensor)
print(f'MAE: {riv_mae}')

MAE: 0.6216487029434886


In [45]:
riv_maxae = np.max(np.abs(test_data['riv'] - mean_riv_tensor))
print(f'MAXAE: {riv_maxae}')

MAXAE: 2.5382429906542057


In [46]:
riv_huber_loss_score = huber_loss(torch.tensor(test_data['riv'].values), 
                                  torch.tensor(mean_riv_tensor))
print(f'Huber Loss: {riv_huber_loss_score}')

Huber Loss: 0.2746123445956591


In [47]:
riv_quantile_loss_score = quantile_loss(torch.tensor(test_data['riv'].values),
                                        torch.tensor(mean_riv_tensor), quantile=0.5)
print(f'Quantile Loss: {riv_quantile_loss_score}')


Quantile Loss: 0.3108243514717443


In [48]:
riv_rmse = root_mean_squared_error(test_data['riv'], mean_riv_tensor)
print(f'RMSE: {riv_rmse}')

RMSE: 0.7911494044792541


## AI posts
### Narcissistic Admiration

In [49]:
ai_test_data = test_data[['post_ai','adm','riv']].dropna()

In [52]:
mean_adm_tensor = np.full(ai_test_data.shape[0], np.mean(ai_test_data['adm']))

In [54]:
adm_mae = mean_absolute_error(ai_test_data['adm'], mean_adm_tensor)
print(f'MAE: {adm_mae}')

MAE: 0.9234790311418686


In [56]:
adm_maxae = np.max(np.abs(ai_test_data['adm'] - mean_adm_tensor))
print(f'MAXAE: {adm_maxae}')

MAXAE: 2.380717647058824


In [57]:
adm_huber_loss_score = huber_loss(torch.tensor(ai_test_data['adm'].values),
                                  torch.tensor(mean_adm_tensor))
print(f'Huber Loss: {adm_huber_loss_score}')

Huber Loss: 0.5327778402035416


In [58]:
adm_quantile_loss_score = quantile_loss(torch.tensor(ai_test_data['adm'].values),
                                        torch.tensor(mean_adm_tensor), quantile=0.5)
print(f'Quantile Loss: {adm_quantile_loss_score}')

Quantile Loss: 0.4617395155709343


In [59]:
adm_rmse = root_mean_squared_error(ai_test_data['adm'], mean_adm_tensor)
print(f'RMSE: {adm_rmse}')

RMSE: 1.10872033933948


### Narcissistic Rivalry

In [60]:
mean_riv_tensor = np.full(ai_test_data.shape[0], np.mean(ai_test_data['riv']))

In [62]:
riv_mae = mean_absolute_error(ai_test_data['riv'], mean_riv_tensor)
print(f'MAE: {riv_mae}')

MAE: 0.6202103806228372


In [63]:
riv_maxae = np.max(np.abs(ai_test_data['riv'] - mean_riv_tensor))
print(f'MAXAE: {riv_maxae}')

MAXAE: 2.549364705882353


In [64]:
riv_huber_loss_score = huber_loss(torch.tensor(ai_test_data['riv'].values),
                                  torch.tensor(mean_riv_tensor))
print(f'Huber Loss: {riv_huber_loss_score}')

Huber Loss: 0.27872819826582534


In [65]:
riv_quantile_loss_score = quantile_loss(torch.tensor(ai_test_data['riv'].values),
                                        torch.tensor(mean_riv_tensor), quantile=0.5)
print(f'Quantile Loss: {riv_quantile_loss_score}')

Quantile Loss: 0.3101051903114186


In [66]:
riv_rmse = root_mean_squared_error(ai_test_data['riv'], mean_riv_tensor)
print(f'RMSE: {riv_rmse}')

RMSE: 0.8090958475477619
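
The four sections above repeat the same five computations. As a possible simplification, they could be folded into one helper. The sketch below is NumPy-only and uses a hypothetical name, `mean_baseline_scores`; it re-implements the measures rather than calling the scikit-learn and torch functions used in this notebook, and it is shown on made-up values.

```python
import numpy as np

def mean_baseline_scores(y_true, delta=1.0, quantile=0.5):
    """Score a constant prediction equal to the mean of y_true."""
    y = np.asarray(y_true, dtype=float)
    error = y - y.mean()
    abs_error = np.abs(error)
    # Huber: quadratic inside delta, linear outside
    huber = np.where(abs_error <= delta,
                     0.5 * error ** 2,
                     delta * (abs_error - 0.5 * delta))
    # pinball / quantile loss
    pinball = np.maximum(quantile * error, (quantile - 1) * error)
    return {
        "MAE": float(abs_error.mean()),
        "MaxAE": float(abs_error.max()),
        "Huber Loss": float(huber.mean()),
        "Quantile Loss": float(pinball.mean()),
        "RMSE": float(np.sqrt(np.mean(error ** 2))),
    }

scores = mean_baseline_scores([2.0, 3.5, 5.0, 1.5])  # made-up scores
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this notebook it would presumably be called once per target and subset, for example with `test_data['adm']` or `ai_test_data['riv']`.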
