<a href="https://colab.research.google.com/github/abelowska/dataPy/blob/main/Classes_04_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear regression: transforming data

Today we are going to use our own dataset.

The dataset consists of data on **personality** (Big Five assessed with [NEO FFI](https://sjdm.org/dmidi/NEO-FFI.html)) and **cognitive religious belief styles** ([The Post-Critical Belief Scale](https://theo.kuleuven.be/apps/press/ecsi/files/2019/03/4.-Pollefeyt-Bouwens-PCB-Melb-Vict-for-dummies-EN.pdf)) from 342 individuals. We want to find out whether it is possible to predict cognitive religious belief style from personality traits. Make sure you have downloaded the dataset from the GitHub repository and uploaded it to Colaboratory *Files*.

Imports

In [None]:
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import datasets, linear_model
from sklearn.metrics import (
    PredictionErrorDisplay,
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)
from sklearn.model_selection import train_test_split

## Load dataset

In [None]:
df = pd.read_csv('data_neo-ffi_religion.csv')
df.head()

Inspect the dataset

In [None]:
df.describe(include='all')

## Exercise 1
Let's see which personality traits are most associated with orthodox cognitive style. Create the model:

*Orthodoxy ~ Extraversion + Agreeableness + Openness + Neuroticism + Conscientiousness*

Fit the model using the training part of the data, then calculate predictions on the testing dataset. Calculate the $R^2$ and median absolute error (MedianAE) scores; you can use the `compute_score()` function defined below. Then plot `y_true ~ y_predicted` to see how good your predictions are.

There is a nice scikit-learn function for plotting true vs predicted values: [PredictionErrorDisplay.from_predictions()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.PredictionErrorDisplay.html#sklearn.metrics.PredictionErrorDisplay.from_predictions)
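The fit, predict, score workflow described above can be sketched as follows. This is a minimal illustration on synthetic data, not the course dataset: the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `y_hat` are made up for the example, and the true coefficients are chosen arbitrarily.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, NOT the course dataset):
# 200 rows, 3 features, and a target that is linear in the features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn the coefficients from the training part only.
model = LinearRegression().fit(X_tr, y_tr)

# Predict on rows the model has not seen.
y_hat = model.predict(X_te)

print("R2:", round(r2_score(y_te, y_hat), 3))
print("MedianAE:", round(median_absolute_error(y_te, y_hat), 3))
```

In a notebook, the same `y_te` and `y_hat` arrays could then be passed as `y_true` and `y_pred` to `PredictionErrorDisplay.from_predictions()` to draw the true vs predicted plot.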

In [None]:
def compute_score(y_true, y_pred):
    '''
    Helper function for formatting scores.

    Parameters:
    y_true: ndarray of y values from the original dataset.
    y_pred: ndarray of y values predicted with a given model.

    Return:
    dictionary with the R2 and median absolute error scores,
    each formatted as a string with three decimal places.
    '''
    return {
        "R2": f"{r2_score(y_true, y_pred):.3f}",
        "MedianAE": f"{median_absolute_error(y_true, y_pred):.3f}",
    }

In [None]:
X = # your code

y = # your code

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# your code

scores = compute_score(y_test, y_pred)

In [None]:
_ = PredictionErrorDisplay.from_predictions(
    # your code
    kind='actual_vs_predicted',
)

As you can see, our model is clearly not good. Do you have any idea what the reason could be? Take a close look at the `true vs predicted` plot and recall the linear regression assumptions. Why are they violated?

HINT: Plot and then analyse the distribution of the y data. To plot the distribution, you can use [`histplot()`](https://seaborn.pydata.org/generated/seaborn.histplot.html) from `seaborn`.

In [None]:
# your code

*Side note: There is another very useful function in `seaborn` that shows pairwise relationships in a dataset along with the distribution of each variable.*

In [None]:
_ = sns.pairplot(df, kind="reg", diag_kind="kde")

From your graph, it is clear that the y variable (*Orthodoxy*) is not normally distributed, whereas the features (independent variables) are approximately normal. For linear models, **a normal distribution of residuals (observed - predicted) is crucial, and this assumption is often violated when your variables have different distributions.**

## Exercise 2

Now you know that for a linear regression to be successful, you **might want to transform the non-normal data to have normal distributions**. Try again to model

*Orthodoxy ~ Extraversion + Agreeableness + Openness + Neuroticism + Conscientiousness*

but now, before fitting the model, transform your y data to have a more Gaussian-like distribution. Compare the `true vs predicted` plots, $R^2$ and MedianAE of the models before and after the transformation.

HINT: There are automatic methods to make the data more Gaussian-like. Try googling (e.g. [stackoverflow](https://stackoverflow.com/questions/53624804/how-to-normalize-a-non-normal-distribution)) or use ChatGPT for help.
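One such automatic method is scikit-learn's `PowerTransformer`. The sketch below applies it to a synthetic right-skewed sample, which is an assumption standing in for a skewed target and not the *Orthodoxy* column. The names `y_skewed`, `pt`, and `y_gauss` are made up for the example, and this is only one of several possible approaches.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample (an assumption for illustration, NOT the course data).
# scikit-learn transformers expect a 2D array, hence the reshape to one column.
rng = np.random.default_rng(0)
y_skewed = rng.exponential(scale=2.0, size=500).reshape(-1, 1)

# Yeo-Johnson estimates a power parameter from the data; with the default
# standardize=True the output is also rescaled to zero mean and unit variance.
pt = PowerTransformer(method="yeo-johnson")
y_gauss = pt.fit_transform(y_skewed)

print("skewness before:", skew(y_skewed.ravel()))
print("skewness after: ", skew(y_gauss.ravel()))
```

The fitted transformer also provides `inverse_transform()`, which maps values (for example, model predictions) back to the original scale.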

In [None]:
# note how large the skewness and kurtosis are for Orthodoxy
summary = df.agg(['skew', 'kurtosis', 'mean', 'std', 'min', 'max']).transpose()
summary

In [None]:
df_transformed = df.copy()
transformed_y = # your code

df_transformed['Orthodoxy'] = transformed_y

In [None]:
# see skewness and kurtosis after data transformation
summary = df_transformed.agg(['skew', 'kurtosis', 'mean', 'std', 'min', 'max']).transpose()
summary

In [None]:
X = df_transformed[[
    'Extraversion',
    'Agreeableness',
    'Conscientiousness',
    'Openness',
    'Neuroticism']]

y = df_transformed[['Orthodoxy']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# your code

scores = compute_score(y_test, y_pred)
print(scores)

In [None]:
_, ax = plt.subplots(figsize=(5, 5))

display_ = PredictionErrorDisplay.from_predictions(
    y_test.to_numpy(),
    y_pred,
    kind="actual_vs_predicted",
    ax=ax,
    scatter_kwargs={"alpha": 0.5}
)

ax.set_title("Linear model")
for name, score in scores.items():
    ax.plot([], [], " ", label=f"{name}: {score}")
ax.legend(loc="upper left")
plt.tight_layout()

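Now you should see that the model fitted to the transformed, more Gaussian-like target performs much better than the model fitted to the original, heavily skewed target. Keep in mind that MedianAE is now expressed in units of the transformed target, so compare it with the earlier value with care.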

## Exercise 3

Extract coefficients from exercise 2 and plot them to see them better. Which trait has the greatest impact on orthodoxy?
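As a rough sketch of how coefficients can be pulled out of a fitted scikit-learn linear model and labelled with feature names: the data below is synthetic, and the column names `trait_a`, `trait_b`, `trait_c` are hypothetical placeholders rather than columns of the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with hypothetical feature names (NOT the course dataset).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    rng.normal(size=(150, 3)), columns=["trait_a", "trait_b", "trait_c"]
)
y_demo = 2.0 * X_demo["trait_a"] - 1.0 * X_demo["trait_b"] + rng.normal(scale=0.3, size=150)

model = LinearRegression().fit(X_demo, y_demo)

# coef_ is 1D when y is 1D and 2D (one row per target) when y is a
# single-column DataFrame; np.ravel flattens both cases to one value per feature.
coefs = pd.Series(np.ravel(model.coef_), index=X_demo.columns, name="coefficient")
print(coefs.sort_values())
```

In a notebook, `coefs.plot.barh()` would then draw one horizontal bar per feature.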

In [None]:
# your code

## Interpreting coefficients: scale matters

Recall the scales of our features:

In [None]:
df_transformed.describe().transpose()

Do the means and standard deviations of all features look similar? If not, you CANNOT directly compare the coefficients of the model. Suppose variable a is 10 times larger than variable b. A 10% change in a is then not the same absolute change as a 10% change in b:

```
b = 1
a = 10 * b

a_01 = 0.1 * a  # 0.1 * 10 = 1
b_01 = 0.1 * b  # 0.1 * 1 = 0.1
```

Thus, for most models it is crucial that features have similar scales (i.e. means and standard deviations). We cannot compare the magnitudes of different coefficients when the features have different natural scales, and hence different value ranges, e.g. because of their different units of measure.
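The effect of scale on coefficients can be checked numerically. In the sketch below, which uses synthetic data and illustrative names (`a`, `b`, `coef_a`, `coef_b`), the feature `a` carries exactly the same information as `b` but is expressed on a scale 10 times larger, so its fitted coefficient comes out 10 times smaller.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y depends linearly on b.
rng = np.random.default_rng(2)
b = rng.normal(size=300)
y = 3.0 * b + rng.normal(scale=0.1, size=300)

# The same feature expressed in units that are 10 times larger.
a = 10 * b

coef_b = LinearRegression().fit(b.reshape(-1, 1), y).coef_[0]
coef_a = LinearRegression().fit(a.reshape(-1, 1), y).coef_[0]

print("coefficient for b:", coef_b)
print("coefficient for a:", coef_a)
print("ratio coef_b / coef_a:", coef_b / coef_a)
```

Both models make identical predictions; only the unit in which the coefficient is expressed differs, which is why raw coefficients of differently scaled features are not comparable.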

*NEO-FFI Openness* clearly has a different scale, so the coefficient of this feature is not comparable to the other coefficients. Look at the standard deviation plot below.



In [None]:
df_transformed.std(axis=0).plot.barh(figsize=(9, 7))
plt.title("Feature ranges")
plt.xlabel("Std. dev. of feature values")

Normalizing the feature set before modelling is an important step of data processing.

## Exercise 4

Create the same model as in exercises 1 and 2:

*Orthodoxy ~ Extraversion + Agreeableness + Openness + Neuroticism + Conscientiousness*

This time, before fitting, scale the feature set so that each feature has a similar scale; then compare coefficients of this model to coefficients from exercise 3. Use your transformed dataframe to correctly model a linear relationship.


----
Do you have an idea how to scale your data? Maybe you know some popular techniques?

One of the most common ways to scale a vector of data is to subtract the mean of that vector from each element and then divide each element by the standard deviation of the vector. The result is a vector with a mean of 0 and a standard deviation of 1. This kind of scaling can be done automatically with [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn. Use the [`fit_transform()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.fit_transform) method to learn the means and standard deviations of your features from the training dataset and scale it in one step. Then use the [`transform()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.transform) method to apply the same means and standard deviations to your testing data.
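The fit-on-train, transform-on-test pattern described above can be sketched as follows. The data is synthetic (two features with deliberately different scales), and the names `X_demo`, `X_tr`, `X_te`, `X_tr_scaled`, and `X_te_scaled` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with very different scales (NOT the course dataset):
# column 0 has mean 0 and std 1, column 1 has mean 50 and std 20.
rng = np.random.default_rng(3)
X_demo = rng.normal(loc=[0.0, 50.0], scale=[1.0, 20.0], size=(200, 2))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Learn the per-feature mean and std from the training rows and scale them.
X_tr_scaled = scaler.fit_transform(X_tr)
# Reuse the training mean and std for the test rows; do not refit here.
X_te_scaled = scaler.transform(X_te)

print("train mean/std:", X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
print("test mean/std: ", X_te_scaled.mean(axis=0).round(3), X_te_scaled.std(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The test columns will typically be close to, but not exactly at, those values, because they are scaled with statistics learned from the training rows.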

In [None]:
X = df_transformed[[
    'Extraversion',
    'Agreeableness',
    'Conscientiousness',
    'Openness',
    'Neuroticism']]

y = df_transformed[['Orthodoxy']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Create the model with scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train_transformed = # your code

# your code

X_test_transformed = # your code

scores = compute_score(y_test, y_pred)
print(scores)

Plot `true vs predicted`

In [None]:
_, ax = plt.subplots(figsize=(5, 5))

display_ = PredictionErrorDisplay.from_predictions(
    y_test.to_numpy(), y_pred, kind="actual_vs_predicted", ax=ax, scatter_kwargs={"alpha": 0.5}
)

ax.set_title("Linear model")
for name, score in scores.items():
    ax.plot([], [], " ", label=f"{name}: {score}")
ax.legend(loc="upper left")
plt.tight_layout()

Extract coefficients and then plot them

In [None]:
# your code