# Walmart sales project

Bloc3 : PROJECTS Supervised Machine Learning

https://app.jedha.co/course/projects-supervised-machine-learning-ft/walmart-sales-ft

## Company's Description 📇

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores from the United States, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962.

## Project 🚧

Walmart's marketing service has asked you to build a machine learning model able to estimate the weekly sales in their stores, with the best precision possible on the predictions made. Such a model would help them understand better how the sales are influenced by economic indicators, and might be used to plan future marketing campaigns.

## Goals 🎯

The project can be divided into three steps:

- Part 1 : make an EDA and all the necessary preprocessings to prepare data for machine learning
- Part 2 : train a **linear regression model** (baseline)
- Part 3 : avoid overfitting by training a **regularized regression model**

In [None]:
#import libraries for EDA
import pandas as pd
import numpy as np

# Visualization libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


#from src.eda import *
# import libraries for modeling
# preprocessing selection
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# model selection
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
# model evaluation

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score



In [None]:

# For reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [None]:
def how_null_is_it(df: pd.DataFrame):
        print()
        print(f"Overall missing values in dataset : {df.isnull().sum().sum()}")
        print( )
        print("missing values in dataset per column :")
        print(df.isnull().sum() )
        print( )

def summary(df: pd.DataFrame):
        """Print a summary of the dataset."""
        print("________________________________________________" )
        print("Data Start")
        display(df.head(10) )
        print()
        print("Data End")
        display(df.tail(10) )
        print()
        print("shape of the dataset : ")
        display(df.shape)
        print()
        print("columns of the dataset : ")
        display(df.columns)
        print()
        print("data describe : ")
        display(df.describe(include='all') )
        print()
        print("types in dataset :")
        display(df.dtypes)
        print()
        print(f"Overall missing values in dataset : {df.isnull().sum().sum()}")
        print( )
        print("missing values in dataset per column :")
        display(df.isnull().sum() )
        print("________________________________________________" )
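
def score_model(model, x_train, y_train, x_test, y_test):
    """Print the R² score of a fitted model on the train and test sets."""
    print(f"R² train : {model.score(x_train, y_train):.4f}")
    print(f"R² test  : {model.score(x_test, y_test):.4f}")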

## Part 0 : import dataset and first inspection

We import the CSV dataset into a pandas DataFrame and take a first look at its structure.
Since the original data come from Kaggle and were then modified by Jedha, we looked up the meaning of each column on Kaggle.
The historical data cover sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. The file contains the following fields:


| Column name | Description |
| --- | --- |
| Store | The store number |
| Date | The week of sales |
| Weekly_Sales | Sales for the given store |
| Holiday_Flag | Whether the week is a special holiday week <br /> 1 – Holiday week <br /> 0 – Non-holiday week |
| Temperature | Temperature on the day of sale |
| Fuel_Price | Cost of fuel in the region |
| CPI | Prevailing consumer price index |
| Unemployment | Prevailing unemployment rate |
| Holiday Events | Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13<br /> Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13<br /> Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13<br /> Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13 |

The consumer price index (CPI) measures the change in the average price level of the goods and services consumed by households, weighted by their share in average household consumption. The index (105, for example) makes it possible to measure inflation (here a 5 % rise in prices), or deflation when prices fall, over a period, and therefore the change in the value of money (the value of money decreases when prices rise). When no index is specified, the (annual) inflation rate generally refers to the percentage increase of this particular index (CPI) over one year.
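
As a small illustration of the definition above, the inflation rate between two index values is simply their relative change. The function name `inflation_rate_pct` and the sample values are assumptions made for this sketch; they do not come from the dataset.

```python
def inflation_rate_pct(cpi_start: float, cpi_end: float) -> float:
    """Relative change of the price index between two dates, in percent."""
    return (cpi_end / cpi_start - 1.0) * 100.0


# An index moving from 100 to 105 corresponds to +5 % inflation
rate = inflation_rate_pct(100.0, 105.0)
print(f"{rate:.1f} %")
```

A negative result would indicate deflation over the period.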

 

In [None]:
df= pd.read_csv("Data/Walmart_Store_sales.csv")
summary(df)

In [None]:
display(df)

We also noticed that Store is a numeric identifier of the store. We convert it to an integer, since the store is a category label rather than a continuous quantity to regress on.


In [None]:
# Drop every row with at least one missing value (needed before casting Store to int)
df = df.dropna().reset_index(drop=True)
df.shape


In [None]:
df = df.astype({"Store": int, "Weekly_Sales": float, "Temperature": float, "Fuel_Price": float, "CPI": float, "Unemployment": float},errors='ignore')

In [None]:
df['Holiday_Flag'] = df['Holiday_Flag'].astype('int64')
df.dtypes
df

The Date column has the `object` dtype (strings), so we convert it into real datetime values (new column dt) from the string format _day_-_month_-_year_ (`"%d-%m-%Y"`).

In [None]:
df['dt'] = pd.to_datetime(df["Date"], format = "%d-%m-%Y")
df.dtypes
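
The explicit format matters because a string such as "05-02-2010" is ambiguous. A minimal sketch on two made-up date strings (not taken from the file) shows how `"%d-%m-%Y"` is interpreted:

```python
import pandas as pd

dates = pd.Series(["05-02-2010", "12-02-2010"])

# Explicit day-month-year format: "05-02-2010" is read as 5 February 2010
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
print(parsed.dt.day.tolist(), parsed.dt.month.tolist())
```

Both dates fall in February, which would not be the case if the first field were read as the month.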

We split the date into year, month, day, ISO week and day of the week.

In [None]:
df['Year'] = df['dt'].dt.year.astype('Int64')
df['Month'] = df['dt'].dt.month.astype('Int64')
df['Day'] = df['dt'].dt.day.astype('Int64')
df['Week'] = df['dt'].dt.isocalendar().week.astype('Int64')
df['DayOfWeek'] = df['dt'].dt.weekday.astype('Int64')

df.dtypes
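
One caveat may be worth keeping in mind when grouping by `Year` and `Week` later on: `isocalendar().week` follows ISO 8601 numbering, so a date close to New Year can belong to a week of the adjacent ISO year. The two dates below are illustrative and are not claimed to be in the dataset.

```python
import pandas as pd

sample = pd.Series(pd.to_datetime(["2010-01-01", "2012-12-31"]))
iso = sample.dt.isocalendar()

# Calendar year versus ISO year and ISO week number
print(sample.dt.year.tolist())
print(iso["year"].tolist(), iso["week"].tolist())
```

Here 1 January 2010 falls in ISO week 53 of 2009, and 31 December 2012 falls in ISO week 1 of 2013.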



In [None]:
display(df)
print()
display(df.dtypes)

## Part 1 : Exploratory data analysis (EDA)



In [None]:
# Constant for the layout of the plots
WIDTH = 600
HEIGHT = 400
MARGIN = 30

In [None]:
# Number of weekly records available for each store
df_store = df.groupby('Store')['Week'].count().reset_index()
df_store = df_store.sort_values(by='Store', ascending=False)
fig = px.pie(df_store, names = "Store", values = "Week")
fig.show()

In [None]:
# Average weekly sales per store, year, month and ISO week
orders_store_by_weekly_sales = df.groupby(['Store'])[['Weekly_Sales']].mean()
orders_year_by_weekly_sales = df.groupby(['Year'])[['Weekly_Sales']].mean()
orders_month_by_weekly_sales = df.groupby(['Month'])[['Weekly_Sales']].mean()
orders_week_by_weekly_sales = df.groupby(['Week'])[['Weekly_Sales']].mean()

#px.bar(orders_year_by_weekly_sales, x= orders_year_by_weekly_sales.index, y='Weekly_Sales', 
#        title="Average Weekly Sales by Year", labels={"x":"Store","Weekly_Sales":"Average Weekly Sales per Year"}).show()
#px.bar(orders_store_by_weekly_sales, x= orders_store_by_weekly_sales.index, y='Weekly_Sales', title="Average Weekly Sales by Store", labels={"x":"Store","Weekly_Sales":"Average Weekly Sales per store"}).show()
#px.bar(orders_month_by_weekly_sales, x= orders_month_by_weekly_sales.index, y='Weekly_Sales', title="Average Weekly Sales by Month", labels={"x":"Month","Weekly_Sales":"Average Weekly Sales per month"}).show()
#px.bar(orders_week_by_weekly_sales, x= orders_week_by_weekly_sales.index, y='Weekly_Sales', title="Average Weekly Sales", labels={"x":"Month","Weekly_Sales":"Average Weekly Sales"}).show()

fig = make_subplots(rows = 4, cols = 1, subplot_titles = (["Walmart average weekly sales per store",
                                        "Walmart average weekly sales per year",
                                        "Walmart average weekly sales per month",
                                        "Walmart average weekly sales per week"] ))
fig.add_bar(

        x = orders_store_by_weekly_sales.index,
        y = orders_store_by_weekly_sales.Weekly_Sales,
        row = 1,
        col = 1

)    
fig.add_bar(
        x = orders_year_by_weekly_sales.index,
        y = orders_year_by_weekly_sales['Weekly_Sales'],
        row = 2,
        col = 1

)    
fig.add_bar(

        x = orders_month_by_weekly_sales.index,
        y = orders_month_by_weekly_sales['Weekly_Sales'],
        row = 3,
        col = 1
)    
fig.add_bar(
        x = orders_week_by_weekly_sales.index,
        y = orders_week_by_weekly_sales['Weekly_Sales'],
        row = 4,
        col = 1
)    
layout = go.Layout(
    title = go.layout.Title(text = "Average Weekly Sales", x = 1.0),
    showlegend = False,
    autosize=False,
    width=1000,
    height=2000,
    xaxis=go.layout.XAxis(linecolor="black", linewidth=1, mirror=True),
    yaxis=go.layout.YAxis(linecolor="black", linewidth=1, mirror=True),
    margin=go.layout.Margin(l=50, r=50, b=100, t=100, pad=4),
)

fig.update_layout(layout)
#px.bar(orders_store_by_weekly_sales, x= orders_store_by_weekly_sales.index, y='Weekly_Sales', title="Average Weekly Sales by Store", labels={"x":"Store","Weekly_Sales":"Average Weekly Sales"}).show()

In [None]:
# Total sales per calendar month (all stores and years combined)
orders_month_by_weekly_sales = df.groupby(['Month'])[['Weekly_Sales']].sum()
# Only the last figure assigned to `fig` is displayed, so the intermediate
# scatter, heatmap and box plots that were built and discarded are removed.
fig = px.line(orders_month_by_weekly_sales, x=orders_month_by_weekly_sales.index, y='Weekly_Sales', title="Total Weekly Sales by Month", width=WIDTH, height=HEIGHT)
fig.show()

In [None]:
orders_month_by_weekly_sales

In [None]:
# Total sales per ISO week (all stores and years combined)
orders_week_by_weekly_sales = df.groupby(['Week'])[['Weekly_Sales']].sum()
# Only the last figure assigned to `fig` is displayed, so the intermediate
# scatter, heatmap and box plots that were built and discarded are removed.
fig = px.line(orders_week_by_weekly_sales, x=orders_week_by_weekly_sales.index, y='Weekly_Sales', title="Total Weekly Sales by Week", width=WIDTH, height=HEIGHT)
fig.show()

In [None]:
orders_week_by_weekly_sales

In [None]:
fig = px.line(orders_week_by_weekly_sales, x=orders_week_by_weekly_sales.index, y='Weekly_Sales' , title='Weekly Sales according to week', width=WIDTH, height=HEIGHT)
fig.update_layout(margin=dict(l=MARGIN, r=MARGIN, t=MARGIN, b=MARGIN))

In [None]:
fig = px.line(orders_month_by_weekly_sales, x=orders_month_by_weekly_sales.index, y='Weekly_Sales' , title='Weekly Sales according to month', width=WIDTH, height=HEIGHT)
fig.update_layout(margin=dict(l=MARGIN, r=MARGIN, t=MARGIN, b=MARGIN))

In [None]:
dfcpi=df[df['CPI'].notna()]
cpi_over_store = dfcpi.groupby('Store')['CPI'].mean()
display(cpi_over_store)

In [None]:
px.density_heatmap(df, x='Store', y='Weekly_Sales', nbinsx=20, nbinsy=40, width=WIDTH, height=HEIGHT)

In [None]:
orders_month_by_weekly_sales=df.groupby(['Year','Month'])['Weekly_Sales'].sum()
display(orders_month_by_weekly_sales)

In [None]:
orders_month_by_weekly_sales.index

In [None]:
year=2010
orders_month_by_weekly_sales.loc[year]
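
The selection above relies on the fact that a groupby on two keys yields a Series with a two-level index. A minimal sketch on a made-up frame (the values are hypothetical) shows what `.loc[year]` returns in that case:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2010, 2010, 2011],
    "Month": [1, 2, 1],
    "Weekly_Sales": [10.0, 20.0, 30.0],
})
by_year_month = toy.groupby(["Year", "Month"])["Weekly_Sales"].sum()

# Selecting on the first index level returns a Series indexed by Month only
sales_2010 = by_year_month.loc[2010]
print(sales_2010)
```

This per-year Series is the natural input for one trace per year in the dropdown plots.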

In [None]:
# Total sales per (Year, Month); one trace per year, toggled by a dropdown menu
orders_month_by_weekly_sales = df.groupby(['Year', 'Month'])['Weekly_Sales'].sum()
dfm = orders_month_by_weekly_sales
years = [2010, 2011, 2012]

fig = go.Figure()

for i, year in enumerate(years):
    fig.add_trace(
        go.Scatter(
            x=dfm[year].index,
            y=dfm[year],        # values of the selected year only (the full series was passed before)
            name=str(year),
            visible=(i == 0),   # only the first year is shown initially
        )
    )

# With method='update', args is [trace updates, layout updates]
xaxis_layout = dict(range=[1, 12], title=dict(text="Month"), tick0=1, dtick=1)
buttons = [
    go.layout.updatemenu.Button(
        label=str(year),
        method='update',
        args=[{'visible': [other == year for other in years]}, {'xaxis': xaxis_layout}],
    )
    for year in years
]

fig.update_layout(updatemenus=[go.layout.Updatemenu(active=0, buttons=buttons)])
fig.update_layout(title=dict(text="Monthly Sales observations in a chosen year", x=0.5))

fig.show()

In [None]:
# Total sales per (Year, Week); one trace per year, toggled by a dropdown menu
orders_week_by_weekly_sales = df.groupby(['Year', 'Week'])['Weekly_Sales'].sum()
dfw = orders_week_by_weekly_sales
years = [2010, 2011, 2012]

fig = go.Figure()

for i, year in enumerate(years):
    fig.add_trace(
        go.Scatter(
            x=dfw[year].index,
            y=dfw[year],        # values of the selected year only (the full series was passed before)
            name=str(year),
            visible=(i == 0),   # only the first year is shown initially
        )
    )

# With method='update', args is [trace updates, layout updates]
xaxis_layout = dict(range=[1, 52], title=dict(text="Week"), tick0=1, dtick=4)
buttons = [
    go.layout.updatemenu.Button(
        label=str(year),
        method='update',
        args=[{'visible': [other == year for other in years]}, {'xaxis': xaxis_layout}],
    )
    for year in years
]

fig.update_layout(updatemenus=[go.layout.Updatemenu(active=0, buttons=buttons)])
fig.update_layout(title=dict(text="Weekly Sales observations in a chosen year", x=0.5))

fig.show()

In [None]:
df.describe()

In [None]:
df_c=df.drop(columns=['dt','Date'])
# Correlation
df_corr = df_c.corr().round(1)  
# Mask to matrix
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Viz
# Drop the rows, then the columns, that the mask left entirely empty
df_corr_viz = df_corr.mask(mask).dropna(how='all').dropna(axis=1, how='all')
fig = px.imshow(df_corr_viz, text_auto=True)
fig.show()

In [None]:
df_corr

In [None]:
px.imshow(df_c.corr())

# 0.0 - 0.3 = No correlation
# 0.3 - 0.5 = Weak correlation
# 0.5 - 0.7 = Moderate correlation
# 0.7 - 1.0 = Strong correlation
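
The rule-of-thumb bands listed in the comments above can be turned into a small helper. This is only a sketch: the function name `correlation_strength` is hypothetical, and the choice to put each boundary value in the upper band is an assumption.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the rule-of-thumb bands above."""
    strength = abs(r)  # the sign gives the direction, not the strength
    if strength < 0.3:
        return "none"
    if strength < 0.5:
        return "weak"
    if strength < 0.7:
        return "moderate"
    return "strong"


print(correlation_strength(-0.6))
```

Taking the absolute value means that a strongly negative correlation is reported as strong as well.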

In [None]:
Sales_per_week=df.groupby(['Year','Week'])[['Weekly_Sales','CPI',"Temperature"]].sum()
df_spw=Sales_per_week

In [None]:
px.scatter_matrix(df, width=1200, height=1000)

## Preprocessing - pandas part 🐼🐼 
Some features are removed from the dataset because they do not help the modelling.

### Remove rows where target values are missing

Here, the target variable (y) is the column Weekly_Sales, and the summary of the raw data above shows missing values in this column.
We never use imputation techniques on the target: it might bias the predictions!
We therefore simply drop the rows for which Weekly_Sales is missing.

We noticed that 14 Weekly_Sales values are missing in the raw dataset. Since the `dropna()` call in Part 0 already removed every incomplete row, the filter below acts as a safeguard.

Since this is the target value, we have no other choice than to remove these rows.
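
A minimal sketch of this rule on a made-up frame (the values are hypothetical): `dropna(subset=...)` removes only the rows whose target is missing and leaves missing features in place for later imputation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Weekly_Sales": [1500.0, np.nan, 1800.0],
    "CPI": [211.0, 212.0, np.nan],
})

# Drop only the rows whose target is missing; missing features are kept
# so that they can be imputed later in the preprocessing pipeline
toy_model = toy.dropna(subset=["Weekly_Sales"])
print(toy_model)
```

By contrast, a bare `dropna()` would also discard the last row, whose target is known.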



In [None]:
# .copy() makes df_model independent from df and avoids chained-assignment warnings
df_model = df[df['Weekly_Sales'].notna()].copy()
print(df_model['Weekly_Sales'].isnull().sum())
df_model.describe()

### Remove duplicate rows

In [None]:

# Shape before removing duplicates, used for the comparison below
rs, cs = df_model.shape

df_model = df_model.drop_duplicates()

if df_model.shape == (rs, cs):
    print("\n\033[1mInference:\033[0m The dataset doesn't have any duplicates")
else:
    print(f'\n\033[1mInference:\033[0m Number of duplicates dropped/fixed ---> {rs-df_model.shape[0]}')

### Remove rows with missing values

In [None]:
df_model = df_model.dropna().reset_index(drop=True)
df_model.shape


### Remove the Date columns


The raw `Date` string and the parsed `dt` column are dropped: the date information is now carried by the Year, Month, Week, Day and DayOfWeek features.

In [None]:
#df=df.dropna(subset=['Date'])

In [None]:
df_model = df_model.drop(columns=['Date', 'dt']).reset_index(drop=True)
df_model

### Holidays Flag analysis

There is a very small correlation between Weekly_Sales and Holiday_Flag.

In [None]:
print(f"Moreover, there are only {df_model['Holiday_Flag'].sum()} holiday rows in the dataset")
print(f"and {df_model['Holiday_Flag'].isna().sum()} missing values out of {df_model.shape[0]} rows.")

We could remove Holiday_Flag from this dataset, since it looks more like noise than an explanatory variable for the Weekly_Sales prediction. The drop is left commented out in the next cell, so the column is kept and used as a categorical feature.

In [None]:
#df_model = df_model.drop(columns=['Holiday_Flag'])

### Fuel Price analysis

There is no correlation between Weekly_Sales and Fuel Price.

In [None]:
print(f"Moreover, there are {df_model['Fuel_Price'].isna().sum()} missing values out of {df_model.shape[0]} rows.")

In [None]:
df_model = df_model.drop(columns=['Fuel_Price'])
how_null_is_it(df_model)

We removed Fuel_Price from the dataset, since it looks more like noise than an explanatory variable for the Weekly_Sales prediction.

In [None]:
y=df_model['Weekly_Sales']
print(y)

### Outlier analysis

We detect outliers with the interquartile range (IQR) rule so that the corresponding rows can be removed.



In [None]:
numeric_list = ['Weekly_Sales', 'Temperature', 'CPI', 'Unemployment']

def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers

for col in numeric_list:
    outliers = detect_outliers_iqr(df_model, col)
    print(f"{col} -> Outlier : {outliers.shape[0]}")
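
To make the rule concrete, here is the same computation on a tiny made-up array (the numbers are hypothetical and chosen so that the bounds are easy to check by hand):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])  # 2.0 and 4.0
iqr = q3 - q1                             # 2.0
lower_bound = q1 - 1.5 * iqr              # -1.0
upper_bound = q3 + 1.5 * iqr              # 7.0

outliers = values[(values < lower_bound) | (values > upper_bound)]
print(lower_bound, upper_bound, outliers)
```

Only the value 100 falls outside the interval from -1 to 7, so it is the single point flagged as an outlier.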

##### Unemployment outliers

In [None]:

outliers_unemp = detect_outliers_iqr(df_model, 'Unemployment')
# Drop the rows flagged as Unemployment outliers, then rebuild a clean index
df_model = df_model.drop(outliers_unemp.index, axis=0).reset_index(drop=True)
display(df_model)

## Preprocessing - scikit-learn part 🔬🔬
We will use ColumnTransformer and Pipeline from sklearn to preprocess the data before modeling.

In [None]:

from sklearn.model_selection import train_test_split
x=df_model.drop('Weekly_Sales', axis=1)
y=df_model['Weekly_Sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=RANDOM_STATE)
print("Train set:", x_train.shape, y_train.shape)
print("Test set:", x_test.shape, y_test.shape)

In [None]:
x

Create the preprocessing pipeline for numeric columns

* list of numerical columns
* impute numeric -> median
* standardise

In [None]:
# Store is an identifier, so it is handled as a categorical feature only
numerical_columns = ['Temperature', 'CPI', 'Unemployment', 'Year', 'Month', 'Week', 'Day', 'DayOfWeek']
numerical_columns

In [None]:

numeric_imputer = SimpleImputer(strategy='median')
numerical_scaler = StandardScaler()
numerical_pipeline = Pipeline(steps=[
    ('num_imputer', numeric_imputer),
    ('num_scaler', numerical_scaler)
])
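
To see what these two steps do, the same kind of pipeline can be run on a tiny made-up column. This is only a sketch; the array and the step names are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [np.nan], [3.0]])

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # NaN -> median of the column (2.0)
    ("scaler", StandardScaler()),                   # then zero mean and unit variance
])
scaled = toy_pipeline.fit_transform(toy)
print(scaled.ravel())
```

The imputed value sits exactly at the column mean, so it maps to 0 after scaling.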

Create the preprocessing pipeline for category columns

In [None]:
#categorical_columns = x.select_dtypes(include="object").columns #x.select_dtypes(exclude=np.number).columns.tolist()
categorical_columns = x[['Store','Holiday_Flag']].columns.tolist()
categorical_columns

In [None]:


categorical_imputer = SimpleImputer(strategy='most_frequent')
#categorical_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')

categorical_encoder = OneHotEncoder(drop='first')

categorical_pipeline = Pipeline(steps=[
    ('cat_imputer', categorical_imputer),
    ('cat_encoder', categorical_encoder)
])

In [None]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_pipeline, numerical_columns),
        ("cat", categorical_pipeline, categorical_columns),
    ]
)

In [None]:
print(x_test)
print()
print(x_train)

In [None]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
print(x_train.head())
x_train = preprocessor.fit_transform(x_train)
print("...Done.")
print(
    x_train[0:5]
)  # MUST use this syntax because X_train is a numpy array and not a pandas DataFrame anymore
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(x_test.head())
x_test = preprocessor.transform(x_test)  # Don't fit again !! The test set is used for validating decisions
# we made based on the training set, therefore we can only apply transformations that were parametered using the training set.
# Otherwise this creates what is called a leak from the test set which will introduce a bias in all your results.
print("...Done.")
print(
    x_test[0:5, :]
)  # MUST use this syntax because X_test is a numpy array and not a pandas DataFrame anymore
print()
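
The fit-on-train, transform-on-test rule can be sketched in plain NumPy. The numbers below are made up; the point is only that the test set is scaled with statistics learned from the training set.

```python
import numpy as np

train = np.array([0.0, 2.0])
test = np.array([4.0])

# "fit": the statistics come from the training data only
mean, std = train.mean(), train.std()

# "transform": the same statistics are reused for both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(train_scaled, test_scaled)
```

Recomputing the mean and standard deviation on the test set would let information from the test data leak into the preprocessing.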

In [None]:
x_train

## Model Training

We start by training a baseline model, then analyze its results and build an improved model.

###  Baseline model (linear regression)

Once you've trained a first model, don't forget to assess its performance on the train and test sets. Are you satisfied with the results?
Besides, it would be interesting to analyze the values of the model's coefficients to know which features are important for the prediction. To do so, the `.coef_` attribute of scikit-learn's LinearRegression class might be useful. Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html



In [None]:
# Model training
model = LinearRegression()
model.fit(x_train, y_train)

#### Model evaluation

In [None]:
y_pred = model.predict(x_test)



# Metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model Performance:")
score_model(model,x_train, y_train, x_test, y_test)
print(f"MAE  : {mae:.2f}")
print(f"MSE  : {mse:.2f}")
print(f"RMSE : {rmse:.2f}")
print(f"R²   : {r2:.2f}")
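
For reference, the four metrics can be recomputed by hand on a tiny example. The arrays below are made up and the variable names are hypothetical; the formulas are the standard definitions.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_hat
mae_toy = np.mean(np.abs(residuals))   # mean absolute error
mse_toy = np.mean(residuals ** 2)      # mean squared error
rmse_toy = np.sqrt(mse_toy)            # same unit as the target
# R² = 1 - (residual sum of squares) / (total sum of squares)
r2_toy = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mae_toy, mse_toy, rmse_toy, r2_toy)
```

MAE and RMSE are expressed in the unit of Weekly_Sales, while R² is unitless and compares the model with a constant prediction equal to the mean.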

In [None]:
print(x.columns)
print()
print(model.coef_)

In [None]:
preprocessor.get_feature_names_out()

In [None]:
coef_df = pd.DataFrame({
    'Feature': preprocessor.get_feature_names_out(),
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', ascending=False)

print("Feature Importance (Linear Regression Coefficients):")
print(coef_df)

In [None]:
px.scatter(x=y_test, y=y_pred, title="Actual vs Predicted Sales")

### Regularized models (Ridge and Lasso)

To limit overfitting, we now train Ridge and Lasso regressions and tune `alpha` by cross-validation.


In [None]:
regressor = Ridge(alpha=0.005)
regressor.fit(x_train, y_train)
scores = cross_val_score(regressor, x_train, y_train, cv=3)
print("Cross-validation scores:", scores)
print("Average cross-validation score:", np.mean(scores))
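
To give an intuition of what `alpha` does in Ridge, here is a sketch of the closed-form solution on synthetic data. It assumes no intercept and randomly generated features, so it is not the exact computation performed by scikit-learn, and `ridge_weights` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)


def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# The norm of the weight vector shrinks as alpha grows
for alpha in (0.0, 10.0, 1000.0):
    w = ridge_weights(X, y, alpha)
    print(alpha, np.linalg.norm(w))
```

A larger `alpha` pulls the coefficients towards zero, which trades some fit on the training data for lower variance.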

In [None]:
regressor = Ridge()

params = {
    "alpha": [0,0.05,0.2, 0.5, 1, 1.5, 2, 3]
}

gridsearch = GridSearchCV(regressor, param_grid = params, cv = 3) # cv : the number of folds to be used for CV
gridsearch.fit(x_train, y_train)

In [None]:
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best R2 score : ", gridsearch.best_score_)

In [None]:
pd.DataFrame.from_dict(gridsearch.cv_results_).T

In [None]:
regressor = Lasso()

params = {
    'alpha': [0.05,0.1,0.3,0.5,0.8, 1.4, 1.5, 1.7] 
}

gridsearch = GridSearchCV(regressor, param_grid = params, cv = 3) # cv : the number of folds to be used for CV
gridsearch.fit(x_train, y_train)

In [None]:
pd.DataFrame.from_dict(gridsearch.cv_results_).T.iloc[4:]

In [None]:
best_regressor = gridsearch.best_estimator_
y_test_pred = best_regressor.predict(x_test)

In [None]:
r2_score(y_test, y_test_pred)

In [None]:
best_feature_importance = pd.DataFrame(
    {
        'feature': preprocessor.get_feature_names_out(),
        'coef_linear': model.coef_,
        'coef_lasso': best_regressor.coef_
    }
)



In [None]:
# Lasso-selected features are those with a nonzero coefficient, whatever its sign
features_to_keep = best_feature_importance[best_feature_importance['coef_lasso'] != 0]['feature'].tolist()
features_to_keep
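
When Lasso is used for feature selection, the usual criterion is a nonzero coefficient, since a negative coefficient still means the feature is used by the model. A small sketch with hypothetical names and coefficients shows how this differs from keeping only positive coefficients:

```python
import numpy as np

feature_names = np.array(["f_a", "f_b", "f_c"])  # hypothetical feature names
lasso_coefs = np.array([1.5, 0.0, -2.0])         # hypothetical coefficients

kept_nonzero = feature_names[lasso_coefs != 0].tolist()
kept_positive = feature_names[lasso_coefs > 0].tolist()
print(kept_nonzero, kept_positive)
```

The feature with a negative coefficient is retained by the first filter and dropped by the second.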

In [None]:
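# features_to_keep holds names of transformed features (as returned by
# preprocessor.get_feature_names_out()), not columns of x, so we select the
# matching columns of the already preprocessed train and test matrices.
# y_train and y_test are unchanged because the split is the same.
selected_idx = np.flatnonzero(best_feature_importance['feature'].isin(features_to_keep).to_numpy())

xr_train = x_train[:, selected_idx]
xr_test = x_test[:, selected_idx]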

In [None]:
final_regressor = LinearRegression()

final_regressor.fit(xr_train, y_train)

print(final_regressor.score(xr_train, y_train))
print(final_regressor.score(xr_test, y_test))

In [None]:
final_lasso = Lasso(alpha=1)

final_lasso.fit(xr_train, y_train)

print(final_lasso.score(xr_train, y_train))
print(final_lasso.score(xr_test, y_test))