# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may have missing values, since some stores might not report all fields because of technical glitches. These values need to be treated before modelling.
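As a minimal sketch of one way to treat missing values, the snippet below counts them per column and fills them with the median (numeric) or the most frequent value (categorical). The toy values are hypothetical; only the column names come from the table above, and the imputation strategy is an assumption rather than the method used to prepare this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the table above.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.93],
    "Outlet_Size": ["Medium", None, "High", "Medium"],
})

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# One simple treatment (an assumption, not the notebook's method):
# median for numeric columns, most frequent value for categorical ones.
filled = toy.copy()
filled["Item_Weight"] = filled["Item_Weight"].fillna(filled["Item_Weight"].median())
filled["Outlet_Size"] = filled["Outlet_Size"].fillna(filled["Outlet_Size"].mode()[0])
print(filled)
```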



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [10]:
import numpy as np
import pandas as pd

In [11]:
data = pd.read_csv('data/nproduct_data.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 38 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Unnamed: 0                       8523 non-null   int64  
 1   Item_Fat_Content                 8523 non-null   int64  
 2   Item_Visibility                  8523 non-null   float64
 3   Item_MRP                         8523 non-null   float64
 4   Outlet_Size                      8523 non-null   int64  
 5   Outlet_Location_Type             8523 non-null   int64  
 6   Outlet_Type                      8523 non-null   int64  
 7   Item_Outlet_Sales                8523 non-null   float64
 8   Years_Operating                  8523 non-null   int64  
 9   Item_Type_Baking Goods           8523 non-null   int64  
 10  Item_Type_Breads                 8523 non-null   int64  
 11  Item_Type_Breakfast              8523 non-null   int64  
 12  Item_Type_Canned    

In [12]:
# split target variable from dataset
target = data['Item_Outlet_Sales']

# remove target from training
data = data.drop('Item_Outlet_Sales', axis=1)

In [159]:
data.head()

Unnamed: 0.1,Unnamed: 0,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Years_Operating,Item_Type_Baking Goods,Item_Type_Breads,...,Outlet_Identifier_OUT018,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Item_Category_Drink,Item_Category_Food,Item_Category_Non Consumable
0,0,1,0.016047,249.8092,2,1,2,10,0,0,...,0,0,0,0,0,0,1,0,1,0
1,1,2,0.019278,48.2692,2,3,3,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,2,1,0.01676,141.618,2,1,2,10,0,0,...,0,0,0,0,0,0,1,0,1,0
3,3,2,0.0,182.095,0,3,1,11,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4,1,0.0,53.8614,3,3,2,22,0,0,...,0,0,0,0,0,0,0,0,0,1


We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. Baseline models set a benchmark against which to gauge the performance of later models. If a new model performs worse than the baseline, something has gone wrong, and you should check your data.

To make a baseline model, run a simple regression model without altering the default parameters in sklearn. 

In [135]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn import svm


# fit model
reg = LinearRegression().fit(data,target)

# make prediction
baseline = reg.predict(data)

# evaluate with mean squared error (lower is better)
baseline_msescore = mean_squared_error(target, baseline)
baseline_msescore




1270207.9629626963
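The MSE above is expressed in squared sales units, which makes it hard to read. As a small worked example, taking the square root gives the RMSE, which is in the same units as `Item_Outlet_Sales` (roughly 1127 here). The variable names are illustrative.

```python
import numpy as np

# MSE printed by the baseline cell above.
baseline_mse = 1270207.9629626963

# RMSE is in the same units as the target, so it is easier to interpret.
baseline_rmse = np.sqrt(baseline_mse)
print(round(baseline_rmse, 2))
```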

## Task
Split your data in 80% train set and 20% test set.

In [136]:
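# Note: without test_size, train_test_split holds out 25% of the rows
# (hence the 6392 training rows below). Pass test_size=0.2 for an 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(data, target)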

In [137]:
X_train.shape

(6392, 37)

In [138]:
y_train.shape

(6392,)
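The task asks for an 80/20 split, while `train_test_split` holds out 25% of the rows unless `test_size` is given. A minimal sketch of an explicit 80/20 split follows, using a synthetic array with the same shape as the notebook's feature matrix. The `random_state` value is an arbitrary assumption added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows and columns as the notebook's data.
X_demo = np.zeros((8523, 37))
y_demo = np.zeros(8523)

# test_size=0.2 requests the 80/20 split; random_state is an arbitrary seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

With 8523 rows this yields 6818 training rows and 1705 test rows.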

In [156]:
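# fit model on the training split only
regtrain = LinearRegression().fit(X_train, y_train)

# make predictions with regtrain (not reg, which was fitted on all rows,
# including those that are now in the test set)
baseline_trainpred = regtrain.predict(X_train)
baseline_testpred = regtrain.predict(X_test)

# evaluate with mean squared error on both splits
baselinemse_trainscore = mean_squared_error(y_train, baseline_trainpred)
baselinemse_testscore = mean_squared_error(y_test, baseline_testpred)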

print(baselinemse_trainscore)
print(baselinemse_testscore)

1259909.6276599064
1301098.1362407035


## Task
Use a grid search (`GridSearchCV`) to find the best value of the parameter `alpha` for Ridge and Lasso regressions from `sklearn`.

In [139]:
# candidate values for alpha
parameter_candidates = [
    {'alpha': [1.0,2.5,5.0,7.5,10.0,12.5,15.0]}
]



In [144]:
# grid search over alpha for Ridge (a regressor, despite the name clf)
clf = GridSearchCV(estimator=Ridge(), param_grid=parameter_candidates, n_jobs=-1)
clf.fit(X_train, y_train)

ridge_alpha = clf.best_estimator_.alpha
ridge_alpha

15.0

In [147]:
# grid search over alpha for Lasso
clf = GridSearchCV(estimator=Lasso(), param_grid=parameter_candidates, n_jobs=-1)
clf.fit(X_train, y_train)

lasso_alpha = clf.best_estimator_.alpha
lasso_alpha

2.5
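Two details of the searches above are worth noting. `GridSearchCV` ranks candidates with the estimator's default score (R² for Ridge and Lasso) unless `scoring` is set, and the selected ridge `alpha` of 15.0 is the largest candidate, which suggests the optimum may lie beyond the grid. The sketch below, on synthetic data, selects `alpha` by MSE over a wider log-spaced grid. The grid bounds, `cv=5`, and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X_train / y_train.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=20.0, random_state=0
)

# Log-spaced candidates cover several orders of magnitude (an illustrative choice).
param_grid = {"alpha": np.logspace(-2, 3, 6)}

search = GridSearchCV(
    estimator=Ridge(),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # negated MSE: closer to 0 is better
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MSE for the best alpha
```

If the best value again sits at either end of the grid, extending the grid in that direction is a reasonable next step.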

## Task
Using the best `alpha` found by the grid search, predict the values in the test set and compare against your benchmark.

In [157]:
# fit the data for ridge
ridge = Ridge(alpha=ridge_alpha)
ridge.fit(X_train, y_train)

# predict target values
ridgetrain_pred = ridge.predict(X_train)
ridgetest_pred = ridge.predict(X_test)

# check MSE of model
ridgetrain_score = mean_squared_error(y_train, ridgetrain_pred)
ridgetest_score = mean_squared_error(y_test, ridgetest_pred)

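# difference to the baseline: positive = lower MSE than the baseline
print(baselinemse_trainscore - ridgetrain_score)
# a negative test difference means ridge is worse than the baseline on unseen data;
# this may indicate overfitting, consider removing features
print(baselinemse_testscore - ridgetest_score)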

937.3638203635346
-8315.42003893177


In [158]:
# fit the data for lasso
lasso = Lasso(alpha=lasso_alpha)
lasso.fit(X_train, y_train)

# predict target values
lassotrain_pred = lasso.predict(X_train)
lassotest_pred = lasso.predict(X_test)

# check MSE of model
lassotrain_score = mean_squared_error(y_train, lassotrain_pred)
lassotest_score = mean_squared_error(y_test, lassotest_pred)

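# difference to the baseline: positive = lower MSE than the baseline
print(baselinemse_trainscore - lassotrain_score)
# a negative test difference means lasso is worse than the baseline on unseen data;
# this may indicate overfitting, consider removing features
print(baselinemse_testscore - lassotest_score)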

-795.4324603101704
-6210.50917203608
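To make the comparison easier to read, the three models can be fitted on the same training split and reported side by side. The following is a minimal sketch on synthetic data, not a rerun of the notebook. The data, the seeds, and the `models` and `results` names are assumptions; the `alpha` values are the ones selected above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared sales features.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, noise=25.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

models = {
    "baseline": LinearRegression(),
    "ridge": Ridge(alpha=15.0),
    "lasso": Lasso(alpha=2.5),
}

# Fit every model on the same training rows and score it on both splits.
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "train_mse": mean_squared_error(y_tr, model.predict(X_tr)),
        "test_mse": mean_squared_error(y_te, model.predict(X_te)),
    }

for name, scores in results.items():
    print(f"{name:>8}  train={scores['train_mse']:.1f}  test={scores['test_mse']:.1f}")
```

Ordinary least squares minimises the training MSE, so a regularised model fitted on the same rows cannot score lower on the training split. The informative comparison is therefore the test MSE.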
