# Sources

- https://lifewithdata.com/2022/03/13/how-to-remove-highly-correlated-features-from-a-dataset/
- https://app.pluralsight.com/player?course=building-regression-models-scikit-learn&author=janani-ravi&name=1616b48f-65fd-4abd-b9fa-7a2560c9d5de&clip=3


# Notation

| General Notation | Description | Python (if applicable) |
|---|---|---|
| $a$ | scalar | |
| $\mathbf{a}$ | vector | |
| $A$ | matrix | |
| **Regression** | | |
| $X$ | training example matrix | `X_train` |
| $\mathbf{y}$ | training example targets | `y_train` |
| $\mathbf{x}^{(i)}$ | features of the $i$-th training example | `X[i]` |
| $\mathbf{x}^{(i)}$, $y^{(i)}$ | $i$-th training example | `X[i]`, `y[i]` |
| $m$ | number of training examples | `m` |
| $n$ | number of features in each example | `n` |
| $\mathbf{w}$ | parameter: weight | `w` |
| $b$ | parameter: bias | `b` |
| $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ | result of the model evaluation at $\mathbf{x}^{(i)}$, parameterized by $\mathbf{w},b$: $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)}+b$ | `f_wb` |
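
To make the notation concrete, here is a minimal sketch of how the symbols map to NumPy objects. The values of `X`, `w`, and `b` are hypothetical toy numbers chosen only for illustration; they are unrelated to the student dataset.

```python
import numpy as np

# Hypothetical toy values, chosen only to illustrate the notation.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # m = 3 training examples, n = 2 features
w = np.array([0.5, -1.0])    # one weight per feature
b = 2.0                      # bias

m, n = X.shape

# f_wb for a single example x^(i), here i = 0: w . x^(i) + b
f_wb_single = np.dot(w, X[0]) + b

# f_wb for all m examples at once (matrix-vector product plus bias)
f_wb_all = X @ w + b

print(f"m = {m}, n = {n}")
print(f"f_wb(x^(0)) = {f_wb_single}")
print(f"f_wb for all examples = {f_wb_all}")
```

The vectorized form `X @ w + b` is the same expression used in the prediction check later in this notebook.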

# Dependencies and data import

The task is to predict a student's final grade in the math course (`G3`) from the information available about that student. The dataset is provided in the accompanying file `student-mat.csv`; a full description of the data set can be found in the file `metadata.txt`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Read from csv.
mathscores = pd.read_csv('./data/student-mat.csv', sep=';')

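# Features G1 and G2 are highly correlated with the target G3 and are therefore dropped from the dataset.
# Correlations are computed on the numeric columns only; string-valued columns cannot be correlated directly.
print(mathscores.select_dtypes(include='number').corr()['G3'].sort_values())
mathscores_without_G1_G2 = mathscores.drop(['G1', 'G2'], axis = 'columns')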
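
The correlation check above can also be turned into an explicit rule. The sketch below uses a small synthetic frame (not the real `student-mat.csv`) and a hypothetical cut-off of 0.8 to list the features whose absolute correlation with the target exceeds the threshold. Both the data and the threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic stand-in for the grades data (not the real dataset).
target = rng.normal(10, 3, n_rows)
toy = pd.DataFrame({
    "G1": target + rng.normal(0, 0.5, n_rows),   # near-copy of the target
    "absences": rng.normal(5, 2, n_rows),        # unrelated noise
    "school": rng.choice(["GP", "MS"], n_rows),  # non-numeric column
    "G3": target,
})

# Correlate numeric columns with the target and drop the target's self-correlation.
corr_with_target = toy.select_dtypes(include="number").corr()["G3"].drop("G3")

threshold = 0.8  # hypothetical cut-off, to be tuned for the problem at hand
highly_correlated = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

print(corr_with_target.sort_values())
print(f"Features above threshold: {highly_correlated}")
```

In this toy frame only `G1` is flagged, which mirrors the reasoning for dropping `G1` and `G2` in the cell above.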

# Model 1: Multiple variable linear regression

## Data preparation

In [None]:
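# Encode categorical features as ordinal values.
# Note: the encoder is applied to every column, so numeric columns (including the target G3)
# are also replaced by the rank of each value among that column's sorted unique values.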
drop_enc = preprocessing.OrdinalEncoder()
mathscores_encoded_ndarray = drop_enc.fit_transform(mathscores_without_G1_G2)
mathscores_encoded = pd.DataFrame(mathscores_encoded_ndarray, columns = mathscores_without_G1_G2.columns)

X_features = mathscores_encoded.columns.to_numpy()

# info() prints a summary of the DataFrame: index dtype, column dtypes, non-null counts and memory usage.
mathscores_encoded.info()

X = mathscores_encoded.drop('G3', axis = 'columns')
y = mathscores_encoded['G3']

# 80%/20% split into training and test sets: 316 of the 395 rows go to training (316/395 = 0.8).
# A single split is used here; in your own experiments, you want to do cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

print(f"X shape: {X_train.shape}, X type: {type(X_train)}")
print(f"y shape: {y_train.shape}, y type: {type(y_train)}")
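
The effect of running `OrdinalEncoder` over every column is easiest to see on a tiny example. The sketch below uses a hypothetical three-column frame (not the real dataset) and contrasts encoding all columns with an alternative that encodes only the non-numeric ones. The alternative is an illustration, not the approach taken in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame with one categorical and two numeric columns.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "age": [15, 18, 16, 15],
    "G3": [10, 0, 15, 6],
})

# Encoding every column: each numeric value is replaced by its rank
# among the sorted unique values of its column.
encoded_all = pd.DataFrame(OrdinalEncoder().fit_transform(toy), columns=toy.columns)

# Alternative: encode only the non-numeric columns and keep numeric values as they are.
categorical_cols = toy.select_dtypes(exclude="number").columns
encoded_cat_only = toy.copy()
encoded_cat_only[categorical_cols] = OrdinalEncoder().fit_transform(toy[categorical_cols])

print(encoded_all)
print(encoded_cat_only)
```

With all columns encoded, `age` and `G3` no longer hold their original values, which is why the encoded frame in this notebook shows small values such as `3.0` in the `age` column.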

In [29]:
mathscores_encoded

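| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 4.0 | 4.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 3.0 | 2.0 | 3.0 | 0.0 | 0.0 | 2.0 | 6.0 | 3.0 |
| 1 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | ... | 1.0 | 0.0 | 4.0 | 2.0 | 2.0 | 0.0 | 0.0 | 2.0 | 4.0 | 3.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | ... | 1.0 | 0.0 | 3.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | 10.0 | 7.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 4.0 | 2.0 | 1.0 | 3.0 | ... | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 4.0 | 2.0 | 12.0 |
| 4 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 3.0 | 3.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 3.0 | 2.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 7.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 390 | 1.0 | 1.0 | 5.0 | 1.0 | 1.0 | 0.0 | 2.0 | 2.0 | 3.0 | 3.0 | ... | 0.0 | 0.0 | 4.0 | 4.0 | 3.0 | 3.0 | 4.0 | 3.0 | 11.0 | 6.0 |
| 391 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 3.0 | 3.0 | ... | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 1.0 | 3.0 | 13.0 |
| 392 | 1.0 | 1.0 | 6.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 4.0 | 4.0 | 2.0 | 2.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| 393 | 1.0 | 1.0 | 3.0 | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | 3.0 | 2.0 | ... | 1.0 | 0.0 | 3.0 | 3.0 | 0.0 | 2.0 | 3.0 | 4.0 | 0.0 | 7.0 |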


In [None]:
# Describe target.
print(mathscores['G3'].describe())

## Create and fit the regression model - scikit-learn

In [None]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

linear_model.fit(X_train, y_train) 

### Parameters

In [None]:
# intercept_ is the fitted bias b; coef_ holds one fitted weight per feature.
b = linear_model.intercept_
w = linear_model.coef_
print(f"w = {w}, b = {b:0.2f}")

### Make predictions

In [None]:
print(f"Prediction on training set:\n {linear_model.predict(X_train)[:4]}")
print(f"Prediction using w, b:\n {(X_train @ w + b)[:4]}")
print(f"Target values:\n {y_train[:4]}")


### Evaluation on the test set

In [None]:
y_pred = linear_model.predict(X_test)

df_pred_actual = pd.DataFrame({'predicted': y_pred, 'actual': y_test})

df_pred_actual.head(10)
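
The cell above compares predictions with held-out targets row by row. A minimal sketch of how that comparison could be summarized with standard metrics, and how k-fold cross-validation differs from a single split, is given below. It runs on synthetic data (an assumption, so that the block is self-contained), and the choice of 5 folds and the R² scoring are illustrative rather than taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for the encoded student features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

# Single hold-out split: fit on the training part, score on the test part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Hold-out MSE: {mse:0.4f}, R^2: {r2:0.4f}")

# 5-fold cross-validation: the model is refit on each training fold
# and scored on the fold that was left out.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"Cross-validated R^2 per fold: {cv_scores}")
print(f"Mean cross-validated R^2: {cv_scores.mean():0.4f}")
```

Cross-validation gives one score per fold, so the spread of `cv_scores` indicates how much the estimate depends on which rows end up in the test set.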

### Plot

In [None]:
plt.figure(figsize = (12, 8))

plt.plot(y_pred, label='Predicted')
plt.plot(y_test, label='Actual')

plt.xlabel('Test sample index')
plt.ylabel('G3')

plt.legend()
plt.show()