# Prediction of sales

### Problem Statement
The dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may have missing values, because some stores might not report all fields due to technical glitches. These values need to be treated before modelling.
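
As a minimal sketch of one way to treat such gaps, the snippet below fills missing numeric values with the mean of their group and gives missing categories an explicit label. The toy frame, its values, and the `"Unknown"` label are invented for illustration; only the column names come from the table above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) reusing column names from the table above.
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Dairy", "Meat", "Meat"],
    "Item_Weight": [10.0, np.nan, 14.0, 8.0, np.nan],
    "Outlet_Size": ["Medium", None, "Small", None, "High"],
})

# Numeric column: replace each NaN with the mean weight of its Item_Type group.
toy["Item_Weight"] = (
    toy.groupby("Item_Type")["Item_Weight"]
       .transform(lambda s: s.fillna(s.mean()))
)

# Categorical column: keep "missing" as its own explicit category.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna("Unknown")

print(toy)
```

Group-wise means are only one option; a median or a per-item lookup may suit skewed or sparse groups better.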



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import copy

In [2]:
df = pd.read_csv('C:/Users/Tim/Desktop/lighthouse/w5/d1/regression_exercise.csv', sep = ',')

In [3]:
data = copy.deepcopy(df)

In [4]:
def missing(x):
    n_missing = x.isnull().sum().sort_values(ascending=False)
    p_missing = (x.isnull().sum()/x.isnull().count()).sort_values(ascending=False)
    missing_ = pd.concat([n_missing, p_missing],axis=1, keys = ['number','percent'])
    return missing_
# missing(df)

In [5]:
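# Fill missing weights with the mean weight of the item's type; label missing outlet sizes explicitly.
data['Item_Weight'] = data.groupby("Item_Type")['Item_Weight'].transform(lambda x: x.fillna(x.mean()))
data['Outlet_Size']=data['Outlet_Size'].fillna("Empty")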

In [6]:
missing(data)

| |number|percent|
|:-------------|:-------------|:-------------|
|Item_Outlet_Sales|0|0.0|
|Outlet_Type|0|0.0|
|Outlet_Location_Type|0|0.0|
|Outlet_Size|0|0.0|
|Outlet_Establishment_Year|0|0.0|
|Outlet_Identifier|0|0.0|
|Item_MRP|0|0.0|
|Item_Type|0|0.0|
|Item_Visibility|0|0.0|
|Item_Fat_Content|0|0.0|


In [7]:
#data.describe()

In [8]:
# Moving on to the nominal (categorical) variables, let's look at the unique values in each of them.
cat_cols = ["Item_Fat_Content", "Item_Type", "Outlet_Identifier", "Outlet_Location_Type", "Outlet_Type", "Outlet_Size"]

for i in cat_cols:
    print(data[i].unique())
    
for i in cat_cols:
    print("{}: {}".format(i,data[i].nunique()))

['Low Fat' 'Regular' 'low fat' 'LF' 'reg']
['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']
['OUT049' 'OUT018' 'OUT010' 'OUT013' 'OUT027' 'OUT045' 'OUT017' 'OUT046'
 'OUT035' 'OUT019']
['Tier 1' 'Tier 3' 'Tier 2']
['Supermarket Type1' 'Supermarket Type2' 'Grocery Store'
 'Supermarket Type3']
['Medium' 'Empty' 'High' 'Small']
Item_Fat_Content: 5
Item_Type: 16
Outlet_Identifier: 10
Outlet_Location_Type: 3
Outlet_Type: 4
Outlet_Size: 4


In [9]:
# Item_Type has many categories, which might prove useful in the analysis. Each Item_Identifier
# (the unique ID of an item) starts with FD, DR or NC, which look like Food, Drinks and
# Non-Consumables. Use the Item_Identifier variable to create a new column.

# The first two characters of the identifier give the broad category code.
data['Broad_Category'] = data['Item_Identifier'].str[:2]
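
A small sketch of the prefix idea on its own may help. The identifiers below are hypothetical examples of the FD / DR / NC pattern, and the readable label names are an assumption based on the comment above, not something the dataset documents.

```python
import pandas as pd

# Hypothetical identifiers following the FD / DR / NC prefix pattern.
ids = pd.Series(["FDA15", "DRC01", "NCD19", "FDX07"], name="Item_Identifier")

# The first two characters carry the broad category code.
codes = ids.str[:2]

# Optional: map codes to readable labels (label names are an assumption).
labels = codes.map({"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"})

print(pd.DataFrame({"Item_Identifier": ids, "code": codes, "label": labels}))
```

The vectorised `.str` accessor avoids a row-wise `apply`, which matters little at this size but keeps the intent explicit.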

In [10]:
# Make a new column depicting the years of operation of a store (i.e. how long the store has existed).

from datetime import datetime
currentyear = datetime.now().year
currentyear

2021

In [11]:
data['YoO'] = currentyear - data['Outlet_Establishment_Year']
data['YoO']

0       22
1       12
2       22
3       23
4       34
        ..
8518    34
8519    19
8520    17
8521    12
8522    24
Name: YoO, Length: 8523, dtype: int64

In [12]:
# The categories of Item_Fat_Content are represented inconsistently. This should be corrected.

data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({
    'low fat':'Low Fat',
    'LF':'Low Fat',
    'reg':'Regular'
})

In [13]:
# Some items are non-consumables, for which a fat content should not be specified. Create a separate category for these observations.

# Assign through a single .loc call to avoid chained indexing on a copy.
data.loc[data['Broad_Category'] == "NC", 'Item_Fat_Content'] = "NC"


In [14]:
data = pd.get_dummies(data,drop_first=True)
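
Because `pd.get_dummies` expands every remaining text column, a high-cardinality identifier such as `Item_Identifier` yields one indicator column per product. The sketch below uses invented toy data to show the effect and one possible alternative, dropping the identifier before encoding. Whether that is appropriate for this dataset is a modelling choice that the notebook does not settle.

```python
import pandas as pd

# Toy data (hypothetical values) with one identifier-like column.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19", "FDX07"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1",
                    "Supermarket Type1", "Grocery Store"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Encoding everything: 3 identifier dummies + 1 outlet dummy + 1 numeric column.
wide = pd.get_dummies(toy, drop_first=True)

# Dropping the identifier first keeps only the low-cardinality dummy.
narrow = pd.get_dummies(toy.drop(columns="Item_Identifier"), drop_first=True)

print(wide.shape, narrow.shape)
```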

In [15]:
data.head()

| |Item_Visibility|Item_MRP|Outlet_Establishment_Year|Item_Outlet_Sales|YoO|Item_Identifier_DRA24|Item_Identifier_DRA59|Item_Identifier_DRB01|Item_Identifier_DRB13|Item_Identifier_DRB24|...|Outlet_Size_High|Outlet_Size_Medium|Outlet_Size_Small|Outlet_Location_Type_Tier 2|Outlet_Location_Type_Tier 3|Outlet_Type_Supermarket Type1|Outlet_Type_Supermarket Type2|Outlet_Type_Supermarket Type3|Broad_Category_FD|Broad_Category_NC|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|0.016047|249.8092|1999|3735.138|22|0|0|0|0|0|...|0|1|0|0|0|1|0|0|1|0|
|1|0.019278|48.2692|2009|443.4228|12|0|0|0|0|0|...|0|1|0|0|1|0|1|0|0|0|
|2|0.01676|141.618|1999|2097.27|22|0|0|0|0|0|...|0|1|0|0|0|1|0|0|1|0|
|3|0.0|182.095|1998|732.38|23|0|0|0|0|0|...|0|0|0|0|1|0|0|0|1|0|
|4|0.0|53.8614|1987|994.7052|34|0|0|0|0|0|...|1|0|0|0|1|1|0|0|0|1|


In [16]:
data.describe()

| |Item_Visibility|Item_MRP|Outlet_Establishment_Year|Item_Outlet_Sales|YoO|Item_Identifier_DRA24|Item_Identifier_DRA59|Item_Identifier_DRB01|Item_Identifier_DRB13|Item_Identifier_DRB24|...|Outlet_Size_High|Outlet_Size_Medium|Outlet_Size_Small|Outlet_Location_Type_Tier 2|Outlet_Location_Type_Tier 3|Outlet_Type_Supermarket Type1|Outlet_Type_Supermarket Type2|Outlet_Type_Supermarket Type3|Broad_Category_FD|Broad_Category_NC|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|count|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|...|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|8523.0|
|mean|0.066132|140.992782|1997.831867|2181.288914|23.168133|0.000821|0.000939|0.000352|0.000587|0.000469|...|0.109351|0.327702|0.280183|0.326763|0.393054|0.654347|0.108882|0.109703|0.718644|0.18761|
|std|0.051598|62.275067|8.37176|1706.499616|8.37176|0.028648|0.030625|0.018759|0.024215|0.02166|...|0.312098|0.469403|0.449115|0.469057|0.488457|0.475609|0.311509|0.312538|0.449687|0.390423|
|min|0.0|31.29|1985.0|33.29|12.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|0.026989|93.8265|1987.0|834.2474|17.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|50%|0.053931|143.0128|1999.0|1794.331|22.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|1.0|0.0|0.0|1.0|0.0|
|75%|0.094585|185.6437|2004.0|3101.2964|34.0|0.0|0.0|0.0|0.0|0.0|...|0.0|1.0|1.0|1.0|1.0|1.0|0.0|0.0|1.0|0.0|
|max|0.328391|266.8884|2009.0|13086.9648|36.0|1.0|1.0|1.0|1.0|1.0|...|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|


In [17]:
data.dtypes

Item_Visibility                  float64
Item_MRP                         float64
Outlet_Establishment_Year          int64
Item_Outlet_Sales                float64
YoO                                int64
                                  ...   
Outlet_Type_Supermarket Type1      uint8
Outlet_Type_Supermarket Type2      uint8
Outlet_Type_Supermarket Type3      uint8
Broad_Category_FD                  uint8
Broad_Category_NC                  uint8
Length: 3157, dtype: object

We covered data preparation and feature engineering two weeks ago, and we built Lasso and Ridge regressions on Monday. Now we will work on more complex ensemble models.

## Model Building

### Ensemble Models

Try different ensemble models (Random Forest Regressor, Gradient Boosting, XGBoost).

Calculate the mean squared error on the test set. Explore how different parameters of the model affect the results and the performance of the model.

- Use GridSearchCV to find optimal parameters of the models.
- Compare against the Lasso and Ridge Regression models from Monday.
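
One way to set up such a comparison is sketched below. It uses synthetic data from `make_regression` as a stand-in for the prepared sales features, and the hyperparameter values are arbitrary placeholders rather than tuned settings; the point is only to score linear baselines and a gradient-boosting model with the same split and metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the prepared sales features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.75, random_state=123)

# Candidate models with placeholder (untuned) hyperparameters.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
    "gbr": GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, est in candidates.items():
    est.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, est.predict(X_te))
    scores[name] = {"mse": mse, "rmse": float(np.sqrt(mse))}

for name, s in scores.items():
    print(f"{name}: MSE={s['mse']:.1f}  RMSE={s['rmse']:.1f}")
```

Sharing one split and one `random_state` across models keeps the comparison like-for-like.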

In [25]:
from sklearn import ensemble
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
import xgboost as xgb

In [26]:
X = data.drop(columns = 'Item_Outlet_Sales')
y = data['Item_Outlet_Sales']

In [31]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)

X_scale = pd.DataFrame(scaler.transform(X),columns = X.columns)


In [33]:
x_train, x_test, y_train, y_test = train_test_split(X_scale,y,train_size = 0.75,random_state=123)
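
Fitting the scaler on the full feature matrix before splitting lets statistics from the test rows influence the training features. Tree ensembles are largely insensitive to feature scaling, so the effect here is probably small, but linear baselines are affected. A minimal sketch of the split-then-scale order, on synthetic stand-in data with hypothetical variable names:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=5.0, size=(200, 3))  # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Split first so the test rows stay unseen while the scaler is fitted.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.75, random_state=123)

# Learn the mean and standard deviation from the training rows only ...
demo_scaler = StandardScaler().fit(X_tr)

# ... then apply that same transformation to both splits.
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero by construction
```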

In [36]:
# Baseline random forest with default hyperparameters.
rfr = ensemble.RandomForestRegressor()
rfr.fit(x_train,y_train)
rfr_pred = rfr.predict(x_test)
rmse = np.sqrt(metrics.mean_squared_error(y_test, rfr_pred))
print("RMSE: %f" % (rmse))

RMSE: 1137.112426


In [38]:
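# Search grid: 6 * 6 * 3 = 108 combinations, each evaluated with 5-fold cross-validation.
paramgrid = {
    'n_estimators':[50,100,150,200,250,300],
    'max_depth':[1,3,5,7,9,11],
    # 'auto' (all features for regressors) was deprecated in scikit-learn 1.1 and later removed; 1.0 is the equivalent there.
    'max_features':['auto','sqrt','log2'],
}
n = 5  # number of cross-validation folds

model = ensemble.RandomForestRegressor()
grid = GridSearchCV(estimator=model, param_grid=paramgrid, cv=n, scoring='r2', verbose=1, n_jobs=-1)
grid_result = grid.fit(x_train,y_train)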

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  9.0min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 42.5min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed: 56.1min finished


In [39]:
best_r2 = grid_result.best_score_
print(f'The best hyperparameter settings achieve a cross-validated R^2 of: {best_r2}')
print(f'The best hyperparameter settings:\n{grid_result.best_params_}')

The best hyperparameter settings achieve a cross-validated R^2 of: 0.5944408488280664
The best hyperparameter settings:
{'max_depth': 5, 'max_features': 'auto', 'n_estimators': 250}
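
The "108 candidates, totalling 540 fits" in the log follows directly from the grid size. As a small sketch, scikit-learn's `ParameterGrid` can count the combinations before a long search is launched; the grid below repeats the one defined above, and `n_folds` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
paramgrid = {
    "n_estimators": [50, 100, 150, 200, 250, 300],
    "max_depth": [1, 3, 5, 7, 9, 11],
    "max_features": ["auto", "sqrt", "log2"],
}
n_folds = 5

n_candidates = len(ParameterGrid(paramgrid))  # 6 * 6 * 3 combinations
n_fits = n_candidates * n_folds               # one fit per candidate per fold

print(n_candidates, n_fits)  # 108 540
```

Checking this count first makes it easier to judge whether a coarser grid or a randomized search would be a better use of the time budget.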


In [41]:
param_grid2 = {
    'n_estimators':[200,225,250,275,300],
    'max_depth':[4,5,6],
}
grid = GridSearchCV(estimator=model, param_grid=param_grid2, cv=n, scoring='r2', verbose=1, n_jobs=-1)
grid_result = grid.fit(x_train,y_train)
print(f'The best hyperparameter settings:\n{grid_result.best_params_}')

Fitting 5 folds for each of 15 candidates, totalling 75 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 14.0min
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed: 27.0min finished


The best hyperparameter settings:
{'max_depth': 6, 'n_estimators': 225}


In [42]:
# Refit with the best settings from the second grid search and evaluate on the test set.
rfr = ensemble.RandomForestRegressor(n_estimators = 225, max_depth = 6)
rfr.fit(x_train,y_train)
rfr_pred = rfr.predict(x_test)
rmse = np.sqrt(metrics.mean_squared_error(y_test, rfr_pred))
print("RMSE: %f" % (rmse))

r2_test = metrics.r2_score(y_test, rfr_pred)
print(f'R^2 on the test set:\t{r2_test}')

RMSE: 1076.361069
R^2 on the test set:	0.6024720202658449
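
For reference, both reported metrics can be computed by hand. The sketch below uses a handful of made-up target values and predictions, chosen so the arithmetic is easy to follow; it is not data from this notebook.

```python
import numpy as np

# Hypothetical true values and predictions, chosen to keep the arithmetic easy.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)   # mean squared error
rmse = np.sqrt(mse)             # same units as the target

ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")  # RMSE: 10.000, R^2: 0.992
```

RMSE is expressed in sales units, so it depends on the scale of the target, while R^2 is a unitless share of variance explained.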


In [None]:
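# DMatrix is XGBoost's native data container, used by xgb.cv below.
data_dmatrix = xgb.DMatrix(data=X,label=y)

# 'reg:linear' is deprecated in XGBoost; 'reg:squarederror' is its replacement.
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, reg_alpha = 10, n_estimators = 10)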

In [None]:
xg_reg.fit(x_train,y_train)

preds = xg_reg.predict(x_test)

rmse = np.sqrt(metrics.mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

In [None]:
params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)

cv_results.head()