# Perdiction of sales

### Problem Statement
The dataset represents sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store are available. The aim is to build a predictive model and find out the sales of each product at a particular store.

|Variable|Description|
|: ------------- |:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particulat store. This is the outcome variable to be predicted.|

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.



### Explore the problem in following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import copy

In [2]:
df = pd.read_csv('C:/Users/Tim/Desktop/lighthouse/w5/d1/regression_exercise.csv', sep = ',')

In [3]:
data = copy.deepcopy(df)

In [4]:
def missing(x):
    n_missing = x.isnull().sum().sort_values(ascending=False)
    p_missing = (x.isnull().sum()/x.isnull().count()).sort_values(ascending=False)
    missing_ = pd.concat([n_missing, p_missing],axis=1, keys = ['number','percent'])
    return missing_
# missing(df)

In [5]:
data['Item_Weight'] = data.groupby("Item_Type").transform(lambda x: x.fillna(x.mean))
data['Outlet_Size']=data['Outlet_Size'].fillna("Empty")

In [6]:
missing(data)

Unnamed: 0,number,percent
Item_Outlet_Sales,0,0.0
Outlet_Type,0,0.0
Outlet_Location_Type,0,0.0
Outlet_Size,0,0.0
Outlet_Establishment_Year,0,0.0
Outlet_Identifier,0,0.0
Item_MRP,0,0.0
Item_Type,0,0.0
Item_Visibility,0,0.0
Item_Fat_Content,0,0.0


In [7]:
data.describe()

Unnamed: 0,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,8523.0,8523.0,8523.0,8523.0
mean,0.066132,140.992782,1997.831867,2181.288914
std,0.051598,62.275067,8.37176,1706.499616
min,0.0,31.29,1985.0,33.29
25%,0.026989,93.8265,1987.0,834.2474
50%,0.053931,143.0128,1999.0,1794.331
75%,0.094585,185.6437,2004.0,3101.2964
max,0.328391,266.8884,2009.0,13086.9648


In [8]:
# Moving to nominal (categorical) variable, lets have a look at the number of unique values in each of them.
cat_cols = ["Item_Fat_Content", "Item_Type", "Outlet_Identifier", "Outlet_Location_Type", "Outlet_Type", "Outlet_Size"]

for i in cat_cols:
    print(data[i].unique())
    
for i in cat_cols:
    print("{}: {}".format(i,data[i].nunique()))

['Low Fat' 'Regular' 'low fat' 'LF' 'reg']
['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']
['OUT049' 'OUT018' 'OUT010' 'OUT013' 'OUT027' 'OUT045' 'OUT017' 'OUT046'
 'OUT035' 'OUT019']
['Tier 1' 'Tier 3' 'Tier 2']
['Supermarket Type1' 'Supermarket Type2' 'Grocery Store'
 'Supermarket Type3']
['Medium' 'Empty' 'High' 'Small']
Item_Fat_Content: 5
Item_Type: 16
Outlet_Identifier: 10
Outlet_Location_Type: 3
Outlet_Type: 4
Outlet_Size: 4


In [9]:
# Item_Type variable has many categories which might prove to be very useful in analysis. Look at the Item_Identifier, i.e. the unique ID of each item, it starts with either FD, 
# DR or NC. If you see the categories, these look like being Food, Drinks and Non-Consumables. Use the Item_Identifier variable to create a new column

def labelcat(y):
    for i in y:
        if "FD" in i:
            return "FD"
        if "DR" in i:
            return "DR"
        return "NC"
    
y = data['Item_Identifier']
data.apply(lambda y: labelcat(y), axis=1)
data['Broad_Category'] = data.apply(lambda y: labelcat(y), axis=1)

In [10]:
# Make a new column depicting the years of operation of a store (i.e. how long the store exists).

from datetime import datetime
currentyear = datetime.now().year
currentyear

2021

In [11]:
data['YoO'] = currentyear - data['Outlet_Establishment_Year']
data['YoO']

0       22
1       12
2       22
3       23
4       34
        ..
8518    34
8519    19
8520    17
8521    12
8522    24
Name: YoO, Length: 8523, dtype: int64

In [12]:
# There are difference in representation in categories of Item_Fat_Content variable. This should be corrected.

data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({
    'low fat':'Low Fat',
    'LF':'Low Fat',
    'reg':'Regular'
})

In [13]:
# There are some non-consumables as well and a fat-content should not be specified for them. Create a separate category for such kind of observations.

data["Item_Fat_Content"].loc[data['Broad_Category']=="NC"]="NC"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


In [16]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Category,YoO
0,FDA15,FDA15,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,FD,22
1,DRC01,DRC01,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,DR,12
2,FDN15,FDN15,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,FD,22
3,FDX07,FDX07,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Empty,Tier 3,Grocery Store,732.38,FD,23
4,NCD19,NCD19,NC,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,NC,34


In [14]:
data.describe()

Unnamed: 0,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,YoO
count,8523.0,8523.0,8523.0,8523.0,8523.0
mean,0.066132,140.992782,1997.831867,2181.288914,23.168133
std,0.051598,62.275067,8.37176,1706.499616,8.37176
min,0.0,31.29,1985.0,33.29,12.0
25%,0.026989,93.8265,1987.0,834.2474,17.0
50%,0.053931,143.0128,1999.0,1794.331,22.0
75%,0.094585,185.6437,2004.0,3101.2964,34.0
max,0.328391,266.8884,2009.0,13086.9648,36.0


In [15]:
data.dtypes

Item_Identifier               object
Item_Weight                   object
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
Broad_Category                object
YoO                            int64
dtype: object

In [None]:
#Item_Outlet_Sales	Sales of the product in the particulat store. This is the outcome variable to be predicted.

We have covered data preparation and feature engineering two weeks ago. Now, it's time to do some predictive models.

## Model Building

## Task
Make a baseline model. Baseline model is the one which requires no predictive model and its like an informed guess. For instance, predict the sales as the overall average sales or just zero.
Making baseline models helps in setting a benchmark. If your predictive algorithm is below this, there is something going seriously wrong and you should check your data.

In [20]:
d_size = data['Item_Outlet_Sales'].count()
d_mean = data['Item_Outlet_Sales'].mean()
baseline = np.full((1,d_size), d_mean)
baseline.shape

(1, 8523)

## Task
Split your data in 80% train set and 20% test set.

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [33]:
data = pd.get_dummies(data,drop_first=True)

In [34]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data)

data_scale = pd.DataFrame(scaler.transform(data),columns = data.columns)

In [35]:
X = data.loc[:, data.columns != "Item_Outlet_Sales"]
y = data['Item_Outlet_Sales']

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X,y,shuffle = True, train_size = 0.8, test_size = 0.2)

## Task
Use grid_search to find the best value of parameter `alpha` for Ridge and Lasso regressions from `sklearn`.

In [41]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

In [38]:
paramgrid = {
    'alpha': [0.001, 0.01, 0.1, 1]
}
n = 5

model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=paramgrid, cv=n, scoring='r2', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train,y_train)

best_r2 = grid_result.best_score_
best_alpha = grid_result.best_params_['alpha']
print(f'The best hyperparameter settings achieve a cross-validated R^2 of: {best_r2}\nAlpha:\t{best_alpha}')

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   17.8s finished


The best hyperparameter settings achieve a cross-validated R^2 of: 0.4437996857222933
Alpha:	1


In [42]:
# Using the best hyperparameters, retrain on the entire train set and evaluate on the test set
best_model = grid_result.best_estimator_    # Sklearn automatically retrains the model on the whole training set following cross-validation using the best hyperparameters
y_pred = best_model.predict(X_test)
r2_test = metrics.r2_score(y_test, y_pred)
print(f'R^2 on the test set:\t{r2_test}')

R^2 on the test set:	0.4774935007708374


In [44]:
paramgrid = {
    'alpha': [0.001, 0.01, 0.1, 1]
}
n = 5

model = Lasso()
grid = GridSearchCV(estimator=model, param_grid=paramgrid, cv=n, scoring='r2', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train,y_train)

best_r2 = grid_result.best_score_
best_alpha = grid_result.best_params_['alpha']
print(f'The best hyperparameter settings achieve a cross-validated R^2 of: {best_r2}\nAlpha:\t{best_alpha}')

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  4.7min finished


The best hyperparameter settings achieve a cross-validated R^2 of: 0.5579758904015995
Alpha:	1


In [45]:
# Using the best hyperparameters, retrain on the entire train set and evaluate on the test set
best_model = grid_result.best_estimator_    # Sklearn automatically retrains the model on the whole training set following cross-validation using the best hyperparameters
y_pred = best_model.predict(X_test)
r2_test = metrics.r2_score(y_test, y_pred)
print(f'R^2 on the test set:\t{r2_test}')

R^2 on the test set:	0.5609916029035458


## Task
Using the model from grid_search, predict the values in the test set and compare against the benchmark.

In [50]:
r2_baseline = metrics.r2_score(y_test,baseline)
r2_baseline

-2.7167786784687564e-06

In [51]:
#basically 0

In [48]:
#baseline = np.full((y_test.shape), d_mean)