![under_construction](figures/under_construction.gif)

The data used in this notebook come from the Analytics Vidhya competition [Practice Problem: Big Mart Sales III](https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/#data_dictionary).

### References

* Azzalini, A. & Scarpa, B. (2012), [Data Analysis and Data Mining: An Introduction](https://global.oup.com/academic/product/data-analysis-and-data-mining-9780199767106?q=Data%20Mining&lang=en&cc=it).
* Hastie, T.; Tibshirani, R. & Friedman, J. (2009), [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/).

# Stepwise, LASSO and Ridge regression


In [None]:
import os
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

In [None]:
import inspect
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%load_ext autoreload
%autoreload 2

> Note: for a description of the problem and of the data, see the notebook 07_analisi_esplorativa_e_preprocessamento_dei_dati.ipynb.

In [None]:
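# Folder with the splits saved by notebook 07
PATH = "output/07"

# os.path.join avoids the doubled slash produced by PATH + "/X_train.pkl"
X_train = pd.read_pickle(os.path.join(PATH, "X_train.pkl"))
X_val = pd.read_pickle(os.path.join(PATH, "X_val.pkl"))
X_test = pd.read_pickle(os.path.join(PATH, "X_test.pkl"))
y_train = pd.read_pickle(os.path.join(PATH, "y_train.pkl"))
y_val = pd.read_pickle(os.path.join(PATH, "y_val.pkl"))
y_test = pd.read_pickle(os.path.join(PATH, "y_test.pkl"))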

# Some useful definitions

Let $y_i$ be the $i$-th observation of the response variable, $\overline{y}$ the mean of the $y_i$, and $\hat{y}_i$ the model's estimate of $y_i$. The following quantities are then defined:

**Residual sum of squares**:
$$
\mathrm{RSS} = \sum\limits_{i=1}^n(y_i - \hat{y}_i)^2
$$

**Total sum of squares**:
$$
\mathrm{TSS} = \sum\limits_{i=1}^n(y_i - \overline{y})^2
$$

**Coefficient of determination**:
$$
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
$$

**Mean squared error / estimate of the residual variance**:
$$
\mathrm{MSE} = \hat{\sigma}^2 = \frac{\mathrm{RSS}}{n}
$$

**Root mean squared error**:
$$
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$
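
As a quick check of the definitions from RSS to RMSE, the sketch below computes them with NumPy on a few toy values. The function name `regression_metrics` and the numbers are hypothetical and chosen only for illustration; this is not the implementation found in `msbd.metriche`.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RSS, TSS, R^2, MSE and RMSE for a vector of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    mse = rss / y_true.size                      # RSS divided by n
    return {
        "RSS": rss,
        "TSS": tss,
        "R2": 1 - rss / tss,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }


# Toy values, chosen only so the arithmetic is easy to follow by hand
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 2.0])
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Only the last prediction is wrong (by 2), so RSS is 4; the mean of the observations is 2.5, which gives TSS equal to 5 and $R^2$ equal to 0.2.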

**Akaike information criterion** (with $k$ the number of estimated parameters and $\hat{L}$ the maximized value of the likelihood):
$$
\mathrm{AIC} = 2k - 2\ln(\hat{L})
$$

**Maximized log-likelihood, for i.i.d. errors $\sim{\mathcal{N}}(0,\sigma^2)$**:
$$
\ln{(\hat{L})} = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln({\hat{\sigma}}^2) - \frac{1}{2{\hat{\sigma}}^2}\mathrm{RSS}
$$
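
The two formulas above can be combined into a small function. The sketch below is a hypothetical helper (`gaussian_aic` is not the `gauss_aic` function from `msbd.metriche`, whose source is not shown here) and assumes that $k$ counts the regression coefficients plus the intercept; some conventions also count $\sigma^2$. Substituting $\hat{\sigma}^2 = \mathrm{RSS}/n$ turns the last term of the log-likelihood into $n/2$.

```python
import numpy as np


def gaussian_log_likelihood(rss, n):
    """Maximized Gaussian log-likelihood, with sigma^2 estimated as RSS / n."""
    sigma2_hat = rss / n
    return (
        -0.5 * n * np.log(2 * np.pi)
        - 0.5 * n * np.log(sigma2_hat)
        - rss / (2 * sigma2_hat)  # equals n / 2 once sigma2_hat = RSS / n
    )


def gaussian_aic(rss, n, k):
    """AIC = 2k - 2 ln(L_hat); k is the number of estimated parameters (assumed convention)."""
    return 2 * k - 2 * gaussian_log_likelihood(rss, n)


# Illustrative numbers: a larger model with a slightly lower RSS
aic_small = gaussian_aic(rss=10.0, n=100, k=3)
aic_large = gaussian_aic(rss=9.9, n=100, k=10)
print(aic_small < aic_large)
```

In this toy comparison the seven extra parameters reduce the RSS too little to offset the $2k$ penalty, so the smaller model has the lower AIC.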

# Choosing the metric used to evaluate the model

### Evaluation Metric

Your model performance will be evaluated on the basis of your prediction of the sales for the test data (test.csv), which contains similar data points to train, except for the sales to be predicted. Your submission needs to be in the format shown in "SampleSubmission.csv".

We, at our end, have the actual sales for the test dataset, against which your predictions will be evaluated. We will use the Root Mean Square Error value to judge your response.

$$
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^N(\mathrm{Predicted}_i - \mathrm{Actual}_i)^2}{N}}
$$

Where:

* $N$: total number of observations
* Predicted: the response entered by the user
* Actual: the actual values of sales

Also, note that the test data is further divided into Public (25%) and Private (75%) data. Your initial responses will be checked and scored on the Public data, but the final rankings will be based on the score on the Private data set. Since this is a practice problem, we will keep declaring winners after specific time intervals and refresh the competition.

In [None]:
from msbd.metriche import radice_errore_quadratico_medio

print(inspect.getsource(radice_errore_quadratico_medio))

# Creating a baseline

## `DummyRegressor()`

In [None]:
from sklearn.dummy import DummyRegressor

In [None]:
dr = DummyRegressor(strategy='mean')
dr.fit(X_train, y_train)

print("RMSE training: {:.2f}".format(radice_errore_quadratico_medio(y_train, dr.predict(X_train))))
print("RMSE validation: {:.2f}".format(radice_errore_quadratico_medio(y_val, dr.predict(X_val))))
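
To see what this baseline measures, the sketch below fits `DummyRegressor(strategy="mean")` on synthetic data (the data and variable names are assumptions, not the Big Mart dataset). Because every prediction is the training mean, the training RMSE coincides with the standard deviation of the response (computed with denominator $n$) and the training $R^2$ is zero.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Synthetic data, used only to illustrate the behaviour of the baseline
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 + 2.0 * X[:, 0] + rng.normal(size=200)

baseline = DummyRegressor(strategy="mean").fit(X, y)
pred = baseline.predict(X)  # the same value, mean(y), for every row

rmse_train = np.sqrt(np.mean((y - pred) ** 2))
r2_train = baseline.score(X, y)  # R^2 on the training data

print(np.isclose(rmse_train, np.std(y)), np.isclose(r2_train, 0.0))
```

Any model worth keeping should therefore reach an RMSE below the spread of the response itself.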

## `LinearRegression()`

> Note: for an implementation of the linear model and a summary closer to those of R, consider the [`OLS()`](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html) class from [`statsmodels`](https://www.statsmodels.org/stable/index.html).

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)

print("RMSE training: {:.2f}".format(radice_errore_quadratico_medio(y_train, lr.predict(X_train))))
print("RMSE validation: {:.2f}".format(radice_errore_quadratico_medio(y_val, lr.predict(X_val))))
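
As a sanity check on what `LinearRegression()` estimates, the sketch below compares its intercept and coefficients with an ordinary least-squares solution obtained from `np.linalg.lstsq` on a design matrix with an explicit column of ones. The data are synthetic and all names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: three predictors, one of which has no effect
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.5 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Same fit via least squares on a design matrix with a column of ones
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.allclose(beta[0], model.intercept_), np.allclose(beta[1:], model.coef_))
```

The first entry of `beta` plays the role of `intercept_` and the remaining entries correspond to `coef_`.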

In [None]:
from msbd.grafici import grafico_coefficienti

print(inspect.getsource(grafico_coefficienti))

In [None]:
plt.figure(figsize=(15, 3))

print("Intercept: {:.2f}".format(lr.intercept_))
grafico_coefficienti(lr.coef_, X_train.columns.tolist())

plt.show()

# Stepwise regression

In [None]:
from msbd.selezione_variabili import Stepwise

print(inspect.getsource(Stepwise))

In [None]:
from msbd.metriche import gauss_aic

print(inspect.getsource(gauss_aic))

In [None]:
sw = Stepwise(LinearRegression(), gauss_aic, 'avanti', verboso=True)

In [None]:
sw.fit(X_train, y_train)

In [None]:
print("RMSE training: {:.2f}".format(radice_errore_quadratico_medio(y_train, sw.predict(X_train))))
print("RMSE validation: {:.2f}".format(radice_errore_quadratico_medio(y_val, sw.predict(X_val))))
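
For readers without access to the `msbd` package, the sketch below shows one possible forward stepwise procedure driven by the Gaussian AIC. It is not the `Stepwise` class used above, whose source is not reproduced here: the helpers `aic_of_fit` and `forward_stepwise` are hypothetical, the data are synthetic, and $k$ is assumed to count the selected coefficients plus the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def aic_of_fit(X, y, columns):
    """Gaussian AIC of an OLS fit on the given columns (intercept always included)."""
    n = len(y)
    if columns:
        pred = LinearRegression().fit(X[:, columns], y).predict(X[:, columns])
    else:
        pred = np.full(n, y.mean())  # intercept-only model
    rss = np.sum((y - pred) ** 2)
    k = len(columns) + 1  # assumed convention: coefficients plus intercept
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * k - 2 * log_lik


def forward_stepwise(X, y):
    """Add one variable at a time, stopping when no addition lowers the AIC."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = aic_of_fit(X, y, selected)
    while remaining:
        candidate_aic, candidate = min(
            (aic_of_fit(X, y, selected + [j]), j) for j in remaining
        )
        if candidate_aic >= best_aic:
            break  # no candidate improves the criterion
        selected.append(candidate)
        remaining.remove(candidate)
        best_aic = candidate_aic
    return selected, best_aic


# Synthetic data: only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=300)

selected, best_aic = forward_stepwise(X, y)
print(selected, round(best_aic, 2))
```

The two informative columns are picked first; since AIC penalizes each extra parameter only mildly, a pure-noise column may occasionally be added afterwards.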

In [None]:
plt.figure(figsize=(15, 3))

print("Intercept: {:.2f}".format(sw.stimatore_.intercept_))
grafico_coefficienti(sw.stimatore_.coef_, sw.variabili_selezionate_)

plt.show()