# Sources

- https://lifewithdata.com/2022/03/13/how-to-remove-highly-correlated-features-from-a-dataset/
- https://app.pluralsight.com/player?course=building-regression-models-scikit-learn&author=janani-ravi&name=1616b48f-65fd-4abd-b9fa-7a2560c9d5de&clip=3


# Notation

|General Notation | Description | Python (if applicable) |
|---|---|---|
| $a$ | scalar ||
| $\mathbf{a}$ | vector ||
| $A$ | matrix ||
| **Regression** | | |
| $X$ | training example matrix | `X_train` |
| $\mathbf{y}$ | training example targets | `y_train` |
| $\mathbf{x}^{(i)}$ | features of the $i^{th}$ training example | `X[i]` |
| $\mathbf{x}^{(i)}$, $y^{(i)}$ | $i^{th}$ training example | `X[i]`, `y[i]` |
| $m$ | number of training examples | `m` |
| $n$ | number of features in each example | `n` |
| $\mathbf{w}$ | parameter: weight | `w` |
| $b$ | parameter: bias | `b` |
| $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ | the result of the model evaluation at $\mathbf{x}^{(i)}$, parameterized by $\mathbf{w},b$: $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)}+b$ | `f_wb` |
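
As a minimal sketch of how the notation maps to NumPy, the model $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)}+b$ can be written as below. The weights, bias and example matrix are made-up values for illustration only; they are not taken from the dataset.

```python
import numpy as np

def f_wb(x, w, b):
    """Linear model prediction for a single example: w . x + b."""
    return np.dot(w, x) + b

# Hypothetical values, chosen only to illustrate the notation.
w = np.array([0.5, -1.0, 2.0])    # one weight per feature (n = 3)
b = 4.0                           # bias
X = np.array([[1.0, 2.0, 3.0],    # m = 2 training examples, one per row
              [0.0, 1.0, 0.5]])

pred_single = f_wb(X[0], w, b)    # prediction for the first example, X[0]
pred_all = X @ w + b              # vectorized prediction for all m examples
print(pred_single, pred_all)
```

The vectorized form `X @ w + b` computes the same value as calling `f_wb` on each row, without a Python loop.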

# Dataset

You are asked to predict a student's final grade in the math course (`G3`) from the information we have about that student. The dataset is provided in the accompanying file `student-mat.csv`. A full description of the dataset can be found in the file `metadata.txt`.

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
%matplotlib inline
from sklearn.model_selection import train_test_split

# Read from csv.
mathscores = pd.read_csv('./data/student-mat.csv', sep=';')
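# Correlation of each numeric column with the target G3, sorted in ascending order.
# (On pandas >= 2.0, use corr(numeric_only=True), because the frame still has string columns.)
print(mathscores.corr()['G3'].sort_values())
# Features G1 and G2 are highly correlated with the target (about 0.80 and 0.90).
# The information they contain is very similar to G3, so including them would be
# redundant; they are therefore dropped from the dataset.
mathscores_without_G1_G2 = mathscores.drop(['G1', 'G2'], axis = 'columns')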

mathscores_one_hot_encoded = pd.get_dummies(mathscores_without_G1_G2, columns = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic'])

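# 80% / 20% split into training and test sets: 316 of the 395 rows (0.8) go to training.
# A single split is assigned here; in your experiments, you want to do cross-validation.
# Note: the feature matrix passed below still contains the target column 'G3'
# (hence the 57 columns in the output). Drop it before fitting a model to avoid target leakage.
X_train, X_test, y_train, y_test = train_test_split(mathscores_one_hot_encoded, mathscores_one_hot_encoded['G3'], test_size = 0.2, random_state = 42)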

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

print(f"X shape: {X_train.shape}, X type:{type(X_train)})")
print(f"y shape: {y_train.shape}, y type:{type(y_train)})")

features = mathscores_one_hot_encoded.columns.to_numpy()

failures     -0.360415
age          -0.161579
goout        -0.132791
traveltime   -0.117142
health       -0.061335
Dalc         -0.054660
Walc         -0.051939
freetime      0.011307
absences      0.034247
famrel        0.051363
studytime     0.097820
Fedu          0.152457
Medu          0.217147
G1            0.801468
G2            0.904868
G3            1.000000
Name: G3, dtype: float64
X shape: (316, 57), X type:<class 'numpy.ndarray'>)
y shape: (316,), y type:<class 'numpy.ndarray'>)
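
The shape `(316, 57)` shows that `G3` is still one of the feature columns. A minimal sketch of separating the target from the features before splitting is given below. The small `toy` DataFrame and the `*_toy` variable names are hypothetical stand-ins for `student-mat.csv`, used only so the example runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for student-mat.csv.
toy = pd.DataFrame({
    "age":      [15, 16, 17, 18, 15, 16, 17, 18, 16, 17],
    "failures": [0, 1, 0, 2, 0, 0, 1, 3, 0, 1],
    "sex":      ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "G3":       [14, 10, 15, 6, 13, 12, 9, 5, 16, 11],
})

# One-hot encode the categorical column, as in the cell above.
encoded_toy = pd.get_dummies(toy, columns=["sex"])

# Separate the target from the features BEFORE splitting,
# so the model never sees G3 as an input.
y_toy = encoded_toy["G3"]
X_toy = encoded_toy.drop(columns="G3")

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

feature_names = X_toy.columns.to_numpy()
print(X_toy_train.shape, X_toy_test.shape, feature_names)
```

Applied to the real data, the same pattern would give 56 feature columns instead of 57.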


# Encoding and features scaling

In [16]:
# DataFrame.info() prints a summary of the DataFrame: the index dtype, each column's dtype and non-null count, and memory usage.
mathscores_one_hot_encoded[pd.Series(features)].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 57 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   age                395 non-null    int64
 1   Medu               395 non-null    int64
 2   Fedu               395 non-null    int64
 3   traveltime         395 non-null    int64
 4   studytime          395 non-null    int64
 5   failures           395 non-null    int64
 6   famrel             395 non-null    int64
 7   freetime           395 non-null    int64
 8   goout              395 non-null    int64
 9   Dalc               395 non-null    int64
 10  Walc               395 non-null    int64
 11  health             395 non-null    int64
 12  absences           395 non-null    int64
 13  G3                 395 non-null    int64
 14  school_GP          395 non-null    uint8
 15  school_MS          395 non-null    uint8
 16  sex_F              395 non-null    uint8
 17  sex_M           

In [17]:
print(f"Peak to peak range by column in raw X: {np.ptp(X_train,axis=0)}") 

Peak to peak range by column in raw X: [ 7  4  4  3  3  3  4  4  4  4  4  4 75 20  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1]
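
The ranges above differ by almost two orders of magnitude (75 for `absences` versus 1 for the one-hot columns), which is the usual motivation for feature scaling. One common option is z-score normalization, sketched below. The helper `zscore_normalize` and the array `X_raw` are hypothetical names and values introduced for illustration; this is not necessarily the scaling method used later in the notebook.

```python
import numpy as np

def zscore_normalize(X):
    """Scale each column to zero mean and unit standard deviation.

    Returns the scaled matrix together with the per-column mean and
    standard deviation, so the same transform can be reused on a test set.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Guard against division by zero for constant columns.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma, mu, sigma

# Hypothetical raw features: an age-like column, an absences-like column
# and a one-hot column, chosen to mimic the ranges printed above.
X_raw = np.array([[15.0,  0.0, 1.0],
                  [18.0, 75.0, 0.0],
                  [16.0, 10.0, 1.0],
                  [22.0,  4.0, 0.0]])

X_norm, mu, sigma = zscore_normalize(X_raw)

print(f"Peak to peak range by column in raw X: {np.ptp(X_raw, axis=0)}")
print(f"Peak to peak range by column in normalized X: {np.ptp(X_norm, axis=0)}")
```

In practice `mu` and `sigma` would be computed on the training set only and then applied to the test set as `(X_test - mu) / sigma`.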


In [18]:
# Describe target.
print(mathscores['G3'].describe())

count    395.000000
mean      10.415190
std        4.581443
min        0.000000
25%        8.000000
50%       11.000000
75%       14.000000
max       20.000000
Name: G3, dtype: float64
