# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. These missing values will need to be treated accordingly.



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. A baseline model sets a benchmark against which to gauge the performance of later models. If a new model scores below the baseline, something has probably gone wrong, and you should check your data and pipeline.

To make a baseline model, run a simple regression model without altering the default parameters in sklearn. 
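
As a rough illustration of what a benchmark buys you, the sketch below compares a default `LinearRegression` against an even simpler reference point, a predictor that always returns the training mean (`DummyRegressor`). The data here is synthetic and the names `X_demo`, `y_demo`, `dummy`, and `linear` are hypothetical; it is not the sales dataset, only a stand-in to show the idea.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 samples, 3 features,
# a linear target plus Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

# Mean predictor: ignores the features entirely.
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
# Default linear regression, the kind of baseline this task asks for.
linear = LinearRegression().fit(X_tr, y_tr)

# .score() returns R^2 for regressors.
dummy_r2 = dummy.score(X_te, y_te)
linear_r2 = linear.score(X_te, y_te)
print(f"Mean predictor test R^2:    {dummy_r2:.3f}")
print(f"Linear regression test R^2: {linear_r2:.3f}")
```

A constant prediction can never score above zero on R^2, so any useful model should clear that floor comfortably; the linear baseline then becomes the bar for the regularized models.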

In [1]:
import pandas as pd
import numpy as np
import sklearn.model_selection as model_selection
from sklearn.linear_model import LinearRegression

In [2]:
data = pd.read_csv(r'..\..\..\Unit_3\Day_3\data_preparation_exercise-master\regression_exercise_cleaned_transformed.csv', index_col=0)

In [3]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Item_Outlet_Sales,Years_of_Operation,Item_Fat_Content_Low Fat,...,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Broad_Item_Type_DR,Broad_Item_Type_FD,Broad_Item_Type_NC
0,156,9.3,0.016047,249.8092,9,1999-01-01,2,3735.138,23.0,1,...,1,0,0,0,1,0,0,0,1,0
1,8,5.92,0.019278,48.2692,3,2009-01-01,2,443.4228,13.0,0,...,0,0,1,0,0,1,0,1,0,0
2,662,17.5,0.01676,141.618,9,1999-01-01,2,2097.27,23.0,1,...,1,0,0,0,1,0,0,0,1,0
3,1121,19.2,0.0,182.095,0,1998-01-01,1,732.38,24.0,0,...,0,0,1,1,0,0,0,0,1,0
4,1297,8.93,0.0,53.8614,1,1987-01-01,3,994.7052,35.0,0,...,0,0,1,0,1,0,0,0,0,1


In [4]:
data = data.drop(columns='Item_Identifier')

In [5]:
# Keep only the year from the establishment date
data['Outlet_Establishment_Year'] = pd.DatetimeIndex(data['Outlet_Establishment_Year']).year

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8523 entries, 0 to 8522
Data columns (total 37 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Item_Weight                      8523 non-null   float64
 1   Item_Visibility                  8523 non-null   float64
 2   Item_MRP                         8523 non-null   float64
 3   Outlet_Identifier                8523 non-null   int64  
 4   Outlet_Establishment_Year        8523 non-null   int64  
 5   Outlet_Size                      8523 non-null   int64  
 6   Item_Outlet_Sales                8523 non-null   float64
 7   Years_of_Operation               8523 non-null   float64
 8   Item_Fat_Content_Low Fat         8523 non-null   int64  
 9   Item_Fat_Content_NA              8523 non-null   int64  
 10  Item_Fat_Content_Regular         8523 non-null   int64  
 11  Item_Type_Baking Goods           8523 non-null   int64  
 12  Item_Type_Breads    

In [7]:
data.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Item_Outlet_Sales,Years_of_Operation,Item_Fat_Content_Low Fat,Item_Fat_Content_NA,...,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Broad_Item_Type_DR,Broad_Item_Type_FD,Broad_Item_Type_NC
0,9.3,0.016047,249.8092,9,1999,2,3735.138,23.0,1,0,...,1,0,0,0,1,0,0,0,1,0
1,5.92,0.019278,48.2692,3,2009,2,443.4228,13.0,0,0,...,0,0,1,0,0,1,0,1,0,0
2,17.5,0.01676,141.618,9,1999,2,2097.27,23.0,1,0,...,1,0,0,0,1,0,0,0,1,0
3,19.2,0.0,182.095,0,1998,1,732.38,24.0,0,0,...,0,0,1,1,0,0,0,0,1,0
4,8.93,0.0,53.8614,1,1987,3,994.7052,35.0,0,1,...,0,0,1,0,1,0,0,0,0,1


In [8]:
# Scale every column (features and target) with MinMaxScaler,
# which maps each column to feature_range=(0, 1) by default
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
scaled_data

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Item_Outlet_Sales,Years_of_Operation,Item_Fat_Content_Low Fat,Item_Fat_Content_NA,...,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Broad_Item_Type_DR,Broad_Item_Type_FD,Broad_Item_Type_NC
0,0.282525,0.048866,0.927507,1.000000,0.583333,0.5,0.283587,0.416667,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.081274,0.058705,0.072068,0.333333,1.000000,0.5,0.031419,0.000000,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,0.770765,0.051037,0.468288,1.000000,0.583333,0.5,0.158115,0.416667,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.871986,0.000000,0.640093,0.000000,0.541667,0.0,0.053555,0.458333,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.260494,0.000000,0.095805,0.111111,0.083333,1.0,0.073651,0.916667,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,0.137541,0.172914,0.777729,0.111111,0.083333,1.0,0.210293,0.916667,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
8519,0.227746,0.143069,0.326263,0.777778,0.708333,0.0,0.039529,0.291667,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
8520,0.359929,0.107148,0.228492,0.666667,0.791667,0.0,0.088850,0.208333,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
8521,0.158083,0.442219,0.304939,0.333333,1.000000,0.5,0.138835,0.000000,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
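
The cell above fits the scaler on the whole dataset before any train/test split. A common alternative, sketched below on synthetic data, is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set ranges do not inform preprocessing. The names `X_raw` and `scaler_demo` are hypothetical, and this is an illustration of the pattern rather than a replacement for the notebook's cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features (hypothetical): 100 rows, 2 columns in [0, 100).
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 100, size=(100, 2))

# Split first, so the test rows stay unseen during preprocessing.
X_tr_raw, X_te_raw = train_test_split(X_raw, train_size=0.8, random_state=0)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr_raw)  # learn min/max from training rows
X_te_scaled = scaler_demo.transform(X_te_raw)      # reuse the training min/max

# Training columns span exactly [0, 1]; test columns may fall slightly outside.
print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```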


In [9]:
# Features: every column except Item_Outlet_Sales (column index 6)
X = pd.concat((scaled_data.iloc[:,0:6],scaled_data.iloc[:,7:]),axis=1).to_numpy()
# Target: Item_Outlet_Sales
y = scaled_data.iloc[:,6].to_numpy()
print(X.shape)
print(y.shape)

(8523, 36)
(8523,)
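
Positional slicing works here but breaks silently if the column order changes. A minimal sketch of the same split done by column name follows; the tiny frame `demo` is hypothetical and only borrows a few column names and values from the table above.

```python
import pandas as pd

# Tiny hypothetical frame that mimics a few of the notebook's columns.
demo = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5],
    "Item_MRP": [249.8092, 48.2692, 141.618],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27],
    "Years_of_Operation": [23.0, 13.0, 23.0],
})

target = "Item_Outlet_Sales"
X_named = demo.drop(columns=target).to_numpy()  # all columns except the target
y_named = demo[target].to_numpy()               # the target column only
print(X_named.shape, y_named.shape)
```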


In [10]:
from sklearn.model_selection import train_test_split

train_ratio = .8

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, train_size=train_ratio)

print(f'{len(X_train)} training samples and {len(X_test)} test samples')

6818 training samples and 1705 test samples


In [11]:
# Train our model
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

In [12]:
# Check performance on train and test set
from sklearn.metrics import r2_score

y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f'Train R^2:\t{r2_train}\nTest R^2:\t{r2_test}')

Train R^2:	0.56106099193737
Test R^2:	0.5603853555237754


## Task
Split your data into an 80% training set and a 20% test set.

In [13]:
from sklearn.model_selection import train_test_split

train_ratio = .8

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, train_size=train_ratio)

print(f'{len(X_train)} training samples and {len(X_test)} test samples')

6818 training samples and 1705 test samples
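
Because `shuffle=True` draws a new random permutation on every call, rerunning the split produces different training and test rows each time. Passing a fixed `random_state` makes the split repeatable. The sketch below demonstrates this on a small synthetic array; the seed value 42 and the names `X_demo`, `y_demo` are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic arrays (hypothetical), 20 samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)
split_b = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

# Each call returns [X_train, X_test, y_train, y_test]; compare them pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

With a fixed seed, every model in the notebook can be trained and scored on the same rows, which makes their R^2 values directly comparable.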


## Task
Use grid search (`GridSearchCV`) to find the best value of the parameter `alpha` for Ridge and Lasso regressions from `sklearn`.
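
For reference, these are the objectives that `alpha` controls, as documented for scikit-learn's `Ridge` and `Lasso`. Here $X$ is the feature matrix, $y$ the target, $w$ the coefficient vector, and $n$ the number of training samples; the intercept is not penalized.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Note that the two losses are scaled differently (the Lasso loss is divided by $2n$), so the same numeric `alpha` does not mean the same amount of regularization in both models. This is likely part of the reason the selected values can differ by orders of magnitude.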

In [14]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

In [15]:
parameter_candidates = [
  {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1, 5, 10, 50, 100]},
]

In [16]:
# Create the ridge regression estimator
rr = Ridge()

# Create a grid search object with the estimator and parameter candidates
clf = GridSearchCV(estimator=rr, param_grid=parameter_candidates, n_jobs=-1)

# Fit the grid search on the training features and target
clf.fit(X_train, y_train)

In [17]:
# View the best mean cross-validated score (R^2 by default for regressors)
print('Best score for data:', clf.best_score_)

Best score for data: 0.5549293672237665


In [18]:
# View the best estimator found by grid search; its alpha is the selected value
print('Best alpha:',clf.best_estimator_)

Best alpha: Ridge(alpha=5)


In [19]:
# Create the lasso regression estimator
lasso = Lasso()

# Create a grid search object with the estimator and parameter candidates
clf = GridSearchCV(estimator=lasso, param_grid=parameter_candidates, n_jobs=-1)

# Fit the grid search on the training features and target
clf.fit(X_train, y_train)

In [20]:
# View the best mean cross-validated score (R^2 by default for regressors)
print('Best score for data:', clf.best_score_)

Best score for data: 0.5556393142680507


In [21]:
# View the best estimator found by grid search; its alpha is the selected value
print('Best alpha:',clf.best_estimator_)

Best alpha: Lasso(alpha=0.0001)


## Task
Using the model found by grid search, predict the values in the test set and compare against your benchmark.
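
With the default `refit=True`, `GridSearchCV` retrains the best parameter combination on the full training set, so the fitted search object can predict and score directly. The sketch below, on synthetic data with hypothetical names (`search`, `manual`, `X_demo`, `y_demo`), checks that this matches refitting a fresh `Ridge` by hand with the selected `alpha`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical): 200 samples, 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

# Grid search with the default 5-fold cross-validation and refit=True.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1, 10]})
search.fit(X_tr, y_tr)

# Manual refit with the selected alpha, as done in the cells below.
manual = Ridge(alpha=search.best_params_["alpha"]).fit(X_tr, y_tr)

search_r2 = search.score(X_te, y_te)  # uses the refitted best estimator
manual_r2 = manual.score(X_te, y_te)
print(search.best_params_, search_r2, manual_r2)
```

Either route gives the same model; refitting by hand, as this notebook does, just makes the chosen `alpha` explicit.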

In [22]:
# Train our Ridge Regression model
from sklearn.linear_model import Ridge

rr = Ridge(alpha=5)
rr.fit(X_train, y_train)

In [23]:
# Check performance on train and test set
from sklearn.metrics import r2_score

y_train_pred = rr.predict(X_train)
y_test_pred = rr.predict(X_test)

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f'Train R^2:\t{r2_train}\nTest R^2:\t{r2_test}')

Train R^2:	0.5605771758331033
Test R^2:	0.5709640622777961


In [24]:
# Train our Lasso Regression model
from sklearn.linear_model import Lasso

lassor = Lasso(alpha=0.0001)
lassor.fit(X_train, y_train)

In [25]:
# Check performance on train and test set
from sklearn.metrics import r2_score

y_train_pred = lassor.predict(X_train)
y_test_pred = lassor.predict(X_test)

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f'Train R^2:\t{r2_train}\nTest R^2:\t{r2_test}')

Train R^2:	0.5600603040031318
Test R^2:	0.5722578863393357


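R^2 summary:

|Model|alpha|Train R^2|Test R^2|
|:-------------|:-------------|:-------------|:-------------|
|LR|n/a|0.56106099193737|0.5603853555237754|
|Ridge|5|0.5605771758331033|0.5709640622777961|
|Lasso|0.0001|0.5600603040031318|0.5722578863393357|

The LR scores come from the first split (In [10]), while Ridge and Lasso were fitted after the data was split again (In [13]) without a fixed `random_state`. The test sets therefore differ, and the small gaps in test R^2 should be read with caution.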
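
One way to compare the three models on equal footing is to score them all on the same cross-validation folds. The sketch below does this on synthetic data with a shared `KFold` object; the data, the seed, and the names `cv`, `models`, and `cv_scores` are assumptions for illustration, and the `alpha` values are simply the ones selected above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical): 300 samples, 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# A seeded KFold yields identical folds for every model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "LR": LinearRegression(),
    "Ridge": Ridge(alpha=5),
    "Lasso": Lasso(alpha=0.0001),
}

# cross_val_score returns one R^2 value per fold for regressors.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv)
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean R^2 = {scores.mean():.4f} (std {scores.std():.4f})")
```

The per-fold standard deviation gives a sense of how much of a gap between models is just split-to-split variation.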