<img src="https://www.bestdesigns.co/uploads/inspiration_images/4350/990__1511457498_404_walmart.png" alt="WALMART LOGO" />

# Walmart : predict weekly sales

## Company's Description 📇

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores in the United States, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962.

## Project 🚧

Walmart's marketing service has asked you to build a machine learning model that estimates the weekly sales in their stores as accurately as possible. Such a model would help them better understand how sales are influenced by economic indicators, and might be used to plan future marketing campaigns.

## Goals 🎯

The project can be divided into three steps:

- Part 1 : perform an EDA and all the preprocessing needed to prepare the data for machine learning
- Part 2 : train a **linear regression model** (baseline)
- Part 3 : avoid overfitting by training a **regularized regression model**

## Scope of this project 🖼️

For this project, you'll work with a dataset that contains information about weekly sales achieved by different Walmart stores, and other variables such as the unemployment rate or the fuel price, that might be useful for predicting the amount of sales. The dataset has been taken from a Kaggle competition, but we made some changes compared to the original data. Please make sure that you're using **our** custom dataset (available on JULIE). 🤓

## Deliverable 📬

To complete this project, your team should: 

- Create some visualizations
- Train at least one **linear regression model** on the dataset, that predicts the amount of weekly sales as a function of the other variables
- Assess the performance of the model by using a metric that is relevant for regression problems
- Interpret the coefficients of the model to identify which features are important for the prediction
- Train at least one model with **regularization (Lasso or Ridge)** to reduce overfitting


## Helpers 🦮

Here are a few tips to help you complete this project: 

### Part 1 : EDA and data preprocessing

Start your project by exploring your dataset : create figures, compute some statistics etc...

Then, you'll have to do some preprocessing on the dataset. You can follow the guidelines from the *preprocessing template*. Some transformations are specific to this dataset, for example on the *Date* column, which can't be included as it is in the model. Below are some hints that might help you 🤓

 #### Preprocessing to be planned with pandas

 **Drop rows where the target value is missing :**
 - Here, the target variable (Y) corresponds to the column *Weekly_Sales*, which contains some missing values.
 - We never use imputation techniques on the target : it might create some bias in the predictions !
 - So we will simply drop the rows in the dataset for which the value in *Weekly_Sales* is missing.
 
**Create usable features from the *Date* column :**
The *Date* column cannot be included as it is in the model. You can either drop this column or create new columns that contain the following numeric features : 
- *year*
- *month*
- *day*
- *day of week*

**Drop rows containing invalid values or outliers :**
In this project, a value of a numeric feature is considered an outlier if it falls outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$. This concerns the columns *Temperature*, *Fuel_Price*, *CPI* and *Unemployment*.
 


**Target variable (Y) that we will try to predict, to be separated from the other columns** : *Weekly_Sales*

 **------------**

 #### Preprocessings to be planned with scikit-learn

 **Explanatory variables (X)**
We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

 - Categorical variables : Store, Holiday_Flag
 - Numerical variables : Temperature, Fuel_Price, CPI, Unemployment, Year, Month, Day, DayOfWeek

### Part 2 : Baseline model (linear regression)
Once you've trained a first model, don't forget to assess its performance on the train and test sets. Are you satisfied with the results ?
Besides, it would be interesting to analyze the values of the model's coefficients to know which features are important for the prediction. To do so, the `.coef_` attribute of scikit-learn's LinearRegression class might be useful. Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

### Part 3 : Fight overfitting
In this last part, you'll have to train a **regularized linear regression model**. You'll find below some useful classes in scikit-learn's documentation :
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso

**Bonus question**

In regularized regression models, there's a hyperparameter called *the regularization strength* that can be fine-tuned to get the best generalized predictions on a given dataset. This fine-tuning can be done thanks to scikit-learn's GridSearchCV class : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Also, you'll find here some examples of how to use GridSearchCV together with Ridge or Lasso models : https://alfurka.github.io/2018-11-18-grid-search/

#################################################################################################################

### Part 1 : EDA and data preprocessing
Start your project by exploring your dataset : create figures, compute some statistics etc...

In [1]:
# Import libraries

In [2]:
!pip install plotly



In [1]:
import pandas as pd
import numpy as np

# only the classes actually used in this notebook are imported
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) # to avoid deprecation warnings
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# setting Jedha color palette as default
pio.templates["jedha"] = go.layout.Template(
    layout_colorway=["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]
)
pio.templates.default = "jedha"
pio.renderers.default = "iframe_connected" # to be replaced by "iframe" if working on JULIE


# from datetime import datetime

In [2]:
# read csv

df = pd.read_csv("Walmart_Store_sales.csv")

In [3]:
# .shape
df.shape

(150, 8)

In [4]:
# .describe

df.describe(include='all')

| | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| count | 150.0 | 132 | 136.0 | 138.0 | 132.0 | 136.0 | 138.0 | 135.0 |
| unique | | 85 | | | | | | |
| top | | 19-10-2012 | | | | | | |
| freq | | 4 | | | | | | |
| mean | 9.866667 | | 1249536.0 | 0.07971 | 61.398106 | 3.320853 | 179.898509 | 7.59843 |
| std | 6.231191 | | 647463.0 | 0.271831 | 18.378901 | 0.478149 | 40.274956 | 1.577173 |
| min | 1.0 | | 268929.0 | 0.0 | 18.79 | 2.514 | 126.111903 | 5.143 |
| 25% | 4.0 | | 605075.7 | 0.0 | 45.5875 | 2.85225 | 131.970831 | 6.5975 |
| 50% | 9.0 | | 1261424.0 | 0.0 | 62.985 | 3.451 | 197.908893 | 7.47 |
| 75% | 15.75 | | 1806386.0 | 0.0 | 76.345 | 3.70625 | 214.934616 | 8.15 |


In [5]:
# drop lines with target Y null values

In [6]:
# count the missing target values BEFORE dropping
df['Weekly_Sales'].isnull().sum()

14

In [7]:
df = df.dropna(subset=['Weekly_Sales'])


In [8]:
df.describe(include='all')

| | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| count | 136.0 | 118 | 136.0 | 125.0 | 121.0 | 124.0 | 125.0 | 122.0 |
| unique | | 79 | | | | | | |
| top | | 28-05-2010 | | | | | | |
| freq | | 3 | | | | | | |
| mean | 10.014706 | | 1249536.0 | 0.072 | 60.853967 | 3.316992 | 178.091144 | 7.665582 |
| std | 6.124614 | | 647463.0 | 0.259528 | 18.514432 | 0.47954 | 40.243105 | 1.619428 |
| min | 1.0 | | 268929.0 | 0.0 | 18.79 | 2.514 | 126.111903 | 5.143 |
| 25% | 4.0 | | 605075.7 | 0.0 | 45.22 | 2.8385 | 131.637 | 6.69 |
| 50% | 10.0 | | 1261424.0 | 0.0 | 62.25 | 3.451 | 196.919506 | 7.477 |
| 75% | 15.25 | | 1806386.0 | 0.0 | 75.95 | 3.724 | 214.878556 | 8.15 |


In [9]:
# count the non-missing target values AFTER dropping (should equal the number of remaining rows)
df['Weekly_Sales'].notnull().sum()

136

In [10]:
# correct Date


In [11]:
df.dtypes

Store           float64
Date             object
Weekly_Sales    float64
Holiday_Flag    float64
Temperature     float64
Fuel_Price      float64
CPI             float64
Unemployment    float64
dtype: object

In [12]:
df.head()


| | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 18-02-2011 | 1572117.54 | | 59.61 | 3.045 | 214.777523 | 6.858 |
| 1 | 13.0 | 25-03-2011 | 1807545.43 | 0.0 | 42.38 | 3.435 | 128.616064 | 7.47 |
| 3 | 11.0 | | 1244390.03 | 0.0 | 84.57 | | 214.556497 | 7.346 |
| 4 | 6.0 | 28-05-2010 | 1644470.66 | 0.0 | 78.89 | 2.759 | 212.412888 | 7.092 |
| 5 | 4.0 | 28-05-2010 | 1857533.7 | 0.0 | | 2.756 | 126.160226 | 7.896 |


In [13]:
df['Date'] = pd.to_datetime(df['Date'], format="%d-%m-%Y")

In [14]:
df.dtypes

Store                  float64
Date            datetime64[ns]
Weekly_Sales           float64
Holiday_Flag           float64
Temperature            float64
Fuel_Price             float64
CPI                    float64
Unemployment           float64
dtype: object

In [15]:
# remove rows with a missing Date so that the derived year/month/day columns stay integers (e.g. 2020 rather than 2020.0)

to_keep = df['Date'].notnull()
df = df.loc[to_keep,:]

**Create usable features from the *Date* column :**
The *Date* column cannot be included as it is in the model. You can either drop this column or create new columns that contain the following numeric features : 
- *year*
- *month*
- *day*
- *day of week*


In [16]:
df['year']=df['Date'].dt.year
df['month']=df['Date'].dt.month
df['day']=df['Date'].dt.day
df['weekday']=df['Date'].dt.weekday

In [17]:
type(df['year'].isnull())

pandas.core.series.Series

In [18]:
df.head()

| | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment | year | month | day | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 2011-02-18 | 1572117.54 | | 59.61 | 3.045 | 214.777523 | 6.858 | 2011 | 2 | 18 | 4 |
| 1 | 13.0 | 2011-03-25 | 1807545.43 | 0.0 | 42.38 | 3.435 | 128.616064 | 7.47 | 2011 | 3 | 25 | 4 |
| 4 | 6.0 | 2010-05-28 | 1644470.66 | 0.0 | 78.89 | 2.759 | 212.412888 | 7.092 | 2010 | 5 | 28 | 4 |
| 5 | 4.0 | 2010-05-28 | 1857533.7 | 0.0 | | 2.756 | 126.160226 | 7.896 | 2010 | 5 | 28 | 4 |
| 6 | 15.0 | 2011-06-03 | 695396.19 | 0.0 | 69.8 | 4.069 | 134.855161 | 7.658 | 2011 | 6 | 3 | 4 |


In [19]:
df.shape

(118, 12)

In [20]:
df['Store'].value_counts()

3.0     10
13.0     9
14.0     9
1.0      8
19.0     8
7.0      7
5.0      7
18.0     7
4.0      6
8.0      6
6.0      6
2.0      6
17.0     5
12.0     5
16.0     4
20.0     4
9.0      4
10.0     3
15.0     3
11.0     1
Name: Store, dtype: int64

In [21]:
# Drop lines containing invalid values or outliers

temp_tokeep = (df['Temperature'] > df['Temperature'].mean() - 3*df['Temperature'].std()) & (df['Temperature'] < df['Temperature'].mean() + 3*df['Temperature'].std())
fuelprice_tokeep = (df['Fuel_Price'] > df['Fuel_Price'].mean() - 3*df['Fuel_Price'].std()) & (df['Fuel_Price'] < df['Fuel_Price'].mean() + 3*df['Fuel_Price'].std())
cpi_tokeep = (df['CPI'] > df['CPI'].mean() - 3*df['CPI'].std()) & (df['CPI'] < df['CPI'].mean() + 3*df['CPI'].std())
unemployment_tokeep = (df['Unemployment'] > df['Unemployment'].mean() - 3*df['Unemployment'].std()) & (df['Unemployment'] < df['Unemployment'].mean() + 3*df['Unemployment'].std())


In [22]:
print(temp_tokeep.value_counts())
print(fuelprice_tokeep.value_counts())
print(cpi_tokeep.value_counts())
print(unemployment_tokeep.value_counts())

True     107
False     11
Name: Temperature, dtype: int64
True     107
False     11
Name: Fuel_Price, dtype: int64
True     109
False      9
Name: CPI, dtype: int64
True     102
False     16
Name: Unemployment, dtype: int64


In [23]:
to_keep = temp_tokeep & fuelprice_tokeep & cpi_tokeep & unemployment_tokeep
df = df.loc[to_keep,:]
df.shape

(80, 12)
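
Note that comparisons with NaN evaluate to False, so the masks above also discard every row that has a missing value in one of the four columns, not only the rows outside the 3-sigma range. Below is a minimal sketch of a variant that keeps missing values so that they can be imputed later in the pipeline. The helper name `three_sigma_mask` and the toy data are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

def three_sigma_mask(series: pd.Series) -> pd.Series:
    """Return True for values within mean +/- 3*std, and for missing values."""
    mean, std = series.mean(), series.std()
    within = series.between(mean - 3 * std, mean + 3 * std)
    # between() returns False for NaN, so missing values are added back explicitly.
    return within | series.isna()

# Toy data: 20 ordinary values, one missing value and one extreme value.
toy = pd.DataFrame({"Temperature": [60.0 + i for i in range(20)] + [np.nan, 500.0]})
mask = three_sigma_mask(toy["Temperature"])
filtered = toy.loc[mask]
print(len(toy), "->", len(filtered))
```

With this variant only the extreme value is removed, and the row with the missing value is left for `SimpleImputer` to handle.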

In [24]:
df.describe(include='all')





| | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment | year | month | day | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 80.0 | 80 | 80.0 | 71.0 | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 |
| unique | | 62 | | | | | | | | | | |
| top | | 2011-03-25 00:00:00 | | | | | | | | | | |
| freq | | 3 | | | | | | | | | | |
| first | | 2010-02-05 00:00:00 | | | | | | | | | | |
| last | | 2012-10-19 00:00:00 | | | | | | | | | | |
| mean | 9.575 | | 1221522.0 | 0.084507 | 61.12775 | 3.2907 | 181.077638 | 7.301775 | 2010.8875 | 6.3625 | 16.125 | 4.0 |
| std | 6.143382 | | 679927.0 | 0.280126 | 17.4476 | 0.491223 | 38.847021 | 0.955392 | 0.826672 | 3.028321 | 8.521566 | 0.0 |
| min | 1.0 | | 268929.0 | 0.0 | 18.79 | 2.548 | 126.1392 | 5.143 | 2010.0 | 1.0 | 1.0 | 4.0 |
| 25% | 4.0 | | 529510.7 | 0.0 | 45.5875 | 2.804 | 132.610242 | 6.52075 | 2010.0 | 4.0 | 10.0 | 4.0 |


In [25]:
# decision: drop rows with a missing 'Holiday_Flag', since there is no obvious value to impute

to_keep = df['Holiday_Flag'].notna()
df = df.loc[to_keep,:]
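
In the `describe` output above, `weekday` has a mean of 4.0 and a standard deviation of 0.0, meaning every remaining date falls on the same day of the week, so the column carries no information for the model. Here is a small sketch for spotting such constant columns, using toy values that are assumed for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [2, 3, 5, 6],
    "day": [18, 25, 28, 3],
    "weekday": [4, 4, 4, 4],  # every date falls on the same weekday
})

# A column with a single distinct value has zero variance.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]
print(constant_cols)

# Dropping them avoids feeding an uninformative feature to the model.
toy = toy.drop(columns=constant_cols)
```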


**Target variable (Y) that we will try to predict, to be separated from the other columns** : *Weekly_Sales*

In [26]:
target_name = 'Weekly_Sales'

Y = df.loc[:,target_name]
X = df.loc[:,[c for c in df.columns if c!=target_name]]
print("...Done.")
print(Y.head())
print()
print(X.head())
print()


...Done.
1     1807545.43
4     1644470.66
6      695396.19
7     2203523.20
10     895066.50
Name: Weekly_Sales, dtype: float64

    Store       Date  Holiday_Flag  Temperature  Fuel_Price         CPI  \
1    13.0 2011-03-25           0.0        42.38       3.435  128.616064   
4     6.0 2010-05-28           0.0        78.89       2.759  212.412888   
6    15.0 2011-06-03           0.0        69.80       4.069  134.855161   
7    20.0 2012-02-03           0.0        39.93       3.617  213.023623   
10    8.0 2011-08-19           0.0        82.92       3.554  219.070197   

    Unemployment  year  month  day  weekday  
1          7.470  2011      3   25        4  
4          7.092  2010      5   28        4  
6          7.658  2011      6    3        4  
7          6.961  2012      2    3        4  
10         6.425  2011      8   19        4  



In [27]:
X

| | Store | Date | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment | year | month | day | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13.0 | 2011-03-25 | 0.0 | 42.38 | 3.435 | 128.616064 | 7.470 | 2011 | 3 | 25 | 4 |
| 4 | 6.0 | 2010-05-28 | 0.0 | 78.89 | 2.759 | 212.412888 | 7.092 | 2010 | 5 | 28 | 4 |
| 6 | 15.0 | 2011-06-03 | 0.0 | 69.80 | 4.069 | 134.855161 | 7.658 | 2011 | 6 | 3 | 4 |
| 7 | 20.0 | 2012-02-03 | 0.0 | 39.93 | 3.617 | 213.023623 | 6.961 | 2012 | 2 | 3 | 4 |
| 10 | 8.0 | 2011-08-19 | 0.0 | 82.92 | 3.554 | 219.070197 | 6.425 | 2011 | 8 | 19 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 139 | 7.0 | 2012-05-25 | 0.0 | 50.60 | 3.804 | 197.588605 | 8.090 | 2012 | 5 | 25 | 4 |
| 143 | 3.0 | 2010-06-04 | 0.0 | 78.53 | 2.705 | 214.495838 | 7.343 | 2010 | 6 | 4 | 4 |
| 144 | 3.0 | 2012-10-19 | 0.0 | 73.44 | 3.594 | 226.968844 | 6.034 | 2012 | 10 | 19 | 4 |
| 145 | 14.0 | 2010-06-18 | 0.0 | 72.62 | 2.780 | 182.442420 | 8.899 | 2010 | 6 | 18 | 4 |


In [28]:
# Store	Date	Holiday_Flag	Temperature	Fuel_Price	CPI	Unemployment	year	month	day	weekday


In [29]:
Y

1      1807545.43
4      1644470.66
6       695396.19
7      2203523.20
10      895066.50
          ...    
139     532739.77
143     396968.80
144     424513.08
145    2248645.59
149    1255087.26
Name: Weekly_Sales, Length: 71, dtype: float64

#### Preprocessings to be planned with scikit-learn

 **Explanatory variables (X)**
We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

 - Categorical variables : Store, Holiday_Flag
 - Numerical variables : Temperature, Fuel_Price, CPI, Unemployment, Year, Month, Day, DayOfWeek

In [30]:
# idx = 0
# numeric_features = []
# categorical_features = []
# numeric_indices = []
# categorical_indices = []

# for i,t in X.dtypes.items():  # .iteritems() was removed in pandas 2.0
#     if ('float' in str(t)) or ('int' in str(t)) :
#         numeric_features.append(i)
#         numeric_indices.append(idx)
#     else :
#         categorical_features.append(i)
#         categorical_indices.append(idx)
#     idx = idx + 1


In [31]:
# print('numeric features are :',numeric_features,' at indeces : ',numeric_indices)
# print('')

# print('categorical features are :',categorical_features,' at indeces : ',categorical_indices)
# print('')

In [32]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, 
                                                    random_state=0)

In [33]:
numeric_transformer = Pipeline(
steps=[
    ('imputer',SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [34]:
categorical_transformer = Pipeline(
steps=[
      ('encoder', OneHotEncoder(drop='first'))
])

In [35]:
df['Holiday_Flag'].value_counts()

0.0    65
1.0     6
Name: Holiday_Flag, dtype: int64

In [36]:
# warning : 'holiday_flag' to be considered as categorical_features

type(df['Holiday_Flag'])

pandas.core.series.Series

In [37]:
categorical_features = ['Holiday_Flag']
categorical_indices = [2]  # position of 'Holiday_Flag' in X ('Date' at position 1 is left out)

numeric_features = ['Store', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'year', 'month', 'day', 'weekday']
numeric_indices = [0, 3, 4, 5, 6, 7, 8, 9, 10]  # positions of the numeric columns in X
# 
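
The guidelines list *Store* as a categorical variable, whereas the cell above scales it as a number. Below is a minimal sketch of the alternative, with *Store* one-hot encoded, on assumed toy data. The option `handle_unknown="ignore"` is an added choice (not part of the original notebook) so that a store absent from the training split does not raise an error at transform time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with assumed values, mimicking a few columns of the dataset.
toy = pd.DataFrame({
    "Store": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "Holiday_Flag": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "Temperature": [42.4, 78.9, 69.8, 39.9, 82.9, 50.6],
    "CPI": [128.6, 212.4, 134.9, 213.0, 219.1, 197.6],
})

categorical_features = ["Store", "Holiday_Flag"]
numeric_features = ["Temperature", "CPI"]

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

X_encoded = preprocessor.fit_transform(toy)
print(X_encoded.shape)  # 2 scaled columns + 3 store columns + 2 holiday columns
```

With 20 stores and only a few dozen rows, one-hot encoding *Store* adds many columns, which is one more reason to compare the baseline with a regularized model.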

In [38]:
df.isna().sum().sort_values(ascending=False)

weekday         0
day             0
month           0
year            0
Unemployment    0
CPI             0
Fuel_Price      0
Temperature     0
Holiday_Flag    0
Weekly_Sales    0
Date            0
Store           0
dtype: int64

In [39]:
df['Holiday_Flag'].groupby(df['Holiday_Flag']).count()

Holiday_Flag
0.0    65
1.0     6
Name: Holiday_Flag, dtype: int64

In [40]:
# df['Weekly_Sales'].sort_values(ascending = True)

In [41]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [42]:
#  df['day'].sort_values(ascending=True)

In [43]:
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5,:])
print()

...Done.
[[ 0.20223489  0.58946243 -1.01803247  0.85152091  0.31289936 -0.95553309
   1.37236059  1.22044325  0.          1.        ]
 [ 0.68076251  0.74245008 -0.92573711  0.0122789   1.66748597 -0.95553309
  -0.19605151  0.21721651  0.          0.        ]
 [ 0.5212533   0.92974745  0.7683955  -1.3575354  -0.38418004  0.31851103
   0.43131333 -0.53520356  0.          0.        ]
 [-0.11678353 -0.53207198 -0.94419618  0.86119186 -0.70583094 -0.95553309
   1.37236059  0.34261985  0.          0.        ]
 [-0.27629273  1.32177828  0.66174308  0.95466173 -0.84281161  0.31851103
   0.43131333  0.34261985  0.          0.        ]]



In [44]:
X_test = preprocessor.transform(X_test)

print('...Done.')
print(X_test[0:5,:])
print()

...Done.
[[ 1.47830854  0.38416651 -0.56065769 -1.27210896  0.85574868 -0.95553309
   0.11763091 -1.78923699  0.          0.        ]
 [ 0.99978092 -0.62600189  1.06374065  0.39745993 -1.10967025  1.59255514
  -1.13709878  1.72205663  0.          0.        ]
 [ 0.5212533   1.09117191 -0.85600284 -1.43632628  0.70557728 -0.95553309
   0.11763091 -1.78923699  0.          0.        ]
 [ 1.31879933 -0.50563661  1.62366583 -1.13172665  1.06375636  1.59255514
  -0.82341636  1.3458466   0.          0.        ]
 [ 0.04272568  1.54394784  0.89145598 -1.31847797 -0.08688126  1.59255514
   0.11763091 -1.28762362  0.          0.        ]]



### Part 2 : Baseline model (linear regression)
Once you've trained a first model, don't forget to assess its performance on the train and test sets. Are you satisfied with the results ?
Besides, it would be interesting to analyze the values of the model's coefficients to know which features are important for the prediction. To do so, the `.coef_` attribute of scikit-learn's LinearRegression class might be useful. Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html


In [45]:
# train
model = LinearRegression()
model.fit(X_train, Y_train)


print('done')

done


In [46]:
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)

In [47]:
print("R2 score on training set : ", r2_score(Y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(Y_test, Y_test_pred))

R2 score on training set :  0.22765959654474022
R2 score on test set :  0.06227021717098913


In [48]:
# assess performance : "Don't forget to assess its performances on the train and test sets."

In [49]:
# model's coefficients : analyze the values of the model's coefficients to know what features are important for the prediction.
# To do so, the .coef_ attribute of scikit-learn's LinearRegression class might be useful. Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

model.coef_

array([ -58017.55365787,   -5979.03933369, -162172.12752555,
       -366470.32424461,   36668.93676126,   81232.3016656 ,
          3153.7964688 ,  113941.87635632,       0.        ,
        315430.65577552])
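
Because the numeric features were standardized, the magnitudes of their coefficients can be compared with each other once they are matched to column names. The transformed columns follow the order in which the transformers were declared in the `ColumnTransformer` (numeric first, then the one-hot columns). Below is a sketch on assumed synthetic data, with the names rebuilt by hand:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (assumption): the target depends almost only on CPI.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Temperature": rng.normal(60, 18, size=40),
    "CPI": rng.normal(180, 40, size=40),
    "Holiday_Flag": rng.integers(0, 2, size=40).astype(float),
})
y = 1000 * toy["CPI"] + rng.normal(0, 1, size=40)

numeric_features = ["Temperature", "CPI"]
categorical_features = ["Holiday_Flag"]
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(drop="first"), categorical_features),
])
model = LinearRegression().fit(preprocessor.fit_transform(toy), y)

# Rebuild column names in the same order as the ColumnTransformer output.
encoder = preprocessor.named_transformers_["cat"]
cat_names = [f"{col}_{cat}"
             for col, cats in zip(categorical_features, encoder.categories_)
             for cat in cats[1:]]  # drop="first" removes the first category
coefs = pd.Series(model.coef_, index=numeric_features + cat_names)

# Sort by absolute value to see which features weigh most.
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```

Applied to the array above, the same ordering would place the zero coefficient on `weekday` (a constant column) and the last value on `Holiday_Flag`.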

### Part 3 : Fight overfitting
In this last part, you'll have to train a **regularized linear regression model**. You'll find below some useful classes in scikit-learn's documentation :
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso

In [50]:
params = {'alpha': np.arange(0,10000,100)} # determine the range of parameters to try
ridge = Ridge() # create an instance of the model

grid = GridSearchCV(ridge, params, cv = 10, verbose = 1)
grid_fit = grid.fit(X_train, Y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    2.5s finished


In [51]:
print("Optimal value for alpha : ", grid_fit.best_params_)

Optimal value for alpha :  {'alpha': 100}


In [52]:
print('Test score for the best model : ', grid_fit.best_estimator_.score(X_test,Y_test))

Test score for the best model :  -0.028740589125844096


In [53]:
scores = cross_val_score(grid_fit.best_estimator_, X_train, Y_train, cv = 10)

print('The cross-validated R2-score is : ', scores.mean())
print('The standard deviation is : ', scores.std())

The cross-validated R2-score is :  -0.6985000335986612
The standard deviation is :  1.447789814013879

