# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may have missing values, as some stores might not report all of their data due to technical glitches. These values will need to be treated accordingly.
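As a rough illustration of what treating missing values can look like, the sketch below counts missing entries per column and fills missing weights with the median weight of the same product. The toy frame `raw` and its values are made up for the example, and median imputation is only one possible choice, not necessarily the one used to produce the cleaned file loaded later.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw sales data (values are made up).
raw = pd.DataFrame({
    "Item_Identifier": ["A1", "A1", "B2", "B2", "C3"],
    "Item_Weight": [9.3, np.nan, 5.9, 5.9, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "Small", "High"],
})

# Count missing values per column.
missing_counts = raw.isna().sum()
print(missing_counts)

# Median weight of each product, aligned row by row with the frame.
per_item_median = raw.groupby("Item_Identifier")["Item_Weight"].transform("median")
# Overall median, used when a product has no known weight at all.
overall_median = raw["Item_Weight"].median()

raw["Item_Weight"] = raw["Item_Weight"].fillna(per_item_median).fillna(overall_median)
print(raw)
```

Filling by product first keeps the imputed weight consistent with other rows for the same item; the overall median is only a fallback.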

---------------------

### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [75]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
# the target (Item_Outlet_Sales) is continuous, so use the regressor variants
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# only the permutation-importance helper is used below
from rfpimp import importances

import xgboost as xgb

In [57]:
data = pd.read_csv('data/nproduct_data.csv', index_col=0)
data

Unnamed: 0,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Years_Operating,Item_Type_Baking Goods,Item_Type_Breads,...,Outlet_Identifier_OUT018,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Item_Category_Drink,Item_Category_Food,Item_Category_Non Consumable
0,1,0.016047,249.8092,2,1,2,3735.1380,10,0,0,...,0,0,0,0,0,0,1,0,1,0
1,2,0.019278,48.2692,2,3,3,443.4228,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,1,0.016760,141.6180,2,1,2,2097.2700,10,0,0,...,0,0,0,0,0,0,1,0,1,0
3,2,0.000000,182.0950,0,3,1,732.3800,11,0,0,...,0,0,0,0,0,0,0,0,1,0
4,1,0.000000,53.8614,3,3,2,994.7052,22,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,1,0.056783,214.5218,3,3,2,2778.3834,22,0,0,...,0,0,0,0,0,0,0,0,1,0
8519,2,0.046982,108.1570,0,2,2,549.2850,7,1,0,...,0,0,0,0,1,0,0,0,1,0
8520,1,0.035186,85.1224,1,2,2,1193.1136,5,0,0,...,0,0,0,1,0,0,0,0,0,1
8521,2,0.145221,103.1332,2,3,3,1845.5976,0,0,0,...,1,0,0,0,0,0,0,0,1,0


In [58]:
drop_columns = data.columns[8:-3]
data = data.drop(drop_columns, axis=1)
data

Unnamed: 0,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Years_Operating,Item_Category_Drink,Item_Category_Food,Item_Category_Non Consumable
0,1,0.016047,249.8092,2,1,2,3735.1380,10,0,1,0
1,2,0.019278,48.2692,2,3,3,443.4228,0,1,0,0
2,1,0.016760,141.6180,2,1,2,2097.2700,10,0,1,0
3,2,0.000000,182.0950,0,3,1,732.3800,11,0,1,0
4,1,0.000000,53.8614,3,3,2,994.7052,22,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
8518,1,0.056783,214.5218,3,3,2,2778.3834,22,0,1,0
8519,2,0.046982,108.1570,0,2,2,549.2850,7,0,1,0
8520,1,0.035186,85.1224,1,2,2,1193.1136,5,0,0,1
8521,2,0.145221,103.1332,2,3,3,1845.5976,0,0,1,0


In [59]:
# separate the features (X) from the target (y)
X = data.drop(['Item_Outlet_Sales'], axis=1)
y = data['Item_Outlet_Sales']

In [60]:
# scale the numeric data that isn't a target
numeric_data = X[['Item_MRP', 'Item_Visibility', 'Years_Operating']]
num_columns = numeric_data.columns

# instantiate Standard Scaler
scaler = StandardScaler()

# scale
numeric_data_scaled = scaler.fit_transform(numeric_data)

# put it back in a frame, keeping the original index so the later concat aligns row by row
scaled_data = pd.DataFrame(numeric_data_scaled, columns=num_columns, index=numeric_data.index)
scaled_data


Unnamed: 0,Item_MRP,Item_Visibility,Years_Operating
0,1.747454,-0.970732,-0.139541
1,-1.489023,-0.908111,-1.334103
2,0.010040,-0.956917,-0.139541
3,0.660050,-1.281758,-0.020085
4,-1.399220,-1.281758,1.293934
...,...,...,...
8518,1.180783,-0.181193,1.293934
8519,-0.527301,-0.371154,-0.497909
8520,-0.897208,-0.599784,-0.736822
8521,-0.607977,1.532880,-1.334103


In [61]:
# drop the unscaled numeric columns from the original DataFrame
X = X.drop(num_columns, axis=1)

In [62]:
# add the scaled values back into features
X = pd.concat([X,scaled_data], axis=1)
X

Unnamed: 0,Item_Fat_Content,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Category_Drink,Item_Category_Food,Item_Category_Non Consumable,Item_MRP,Item_Visibility,Years_Operating
0,1,2,1,2,0,1,0,1.747454,-0.970732,-0.139541
1,2,2,3,3,1,0,0,-1.489023,-0.908111,-1.334103
2,1,2,1,2,0,1,0,0.010040,-0.956917,-0.139541
3,2,0,3,1,0,1,0,0.660050,-1.281758,-0.020085
4,1,3,3,2,0,0,1,-1.399220,-1.281758,1.293934
...,...,...,...,...,...,...,...,...,...,...
8518,1,3,3,2,0,1,0,1.180783,-0.181193,1.293934
8519,2,0,2,2,0,1,0,-0.527301,-0.371154,-0.497909
8520,1,1,2,2,0,0,1,-0.897208,-0.599784,-0.736822
8521,2,2,3,3,0,1,0,-0.607977,1.532880,-1.334103
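The scaling cells above fit `StandardScaler` on every row before the train/test split, so the test rows influence the scaler's mean and variance. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on made-up data; the `demo_*` names and the generated values are assumptions for illustration and are not part of this notebook. Tree-based ensembles are largely insensitive to this kind of scaling, so the distinction matters mostly for the Lasso and Ridge models.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data with the same numeric column names as the notebook.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, size=200),
    "Item_Visibility": rng.uniform(0, 0.3, size=200),
    "Years_Operating": rng.integers(0, 25, size=200).astype(float),
    "Outlet_Type": rng.integers(0, 4, size=200),
})
demo_y = 15 * demo_X["Item_MRP"] + rng.normal(0, 100, size=200)

num_cols = ["Item_MRP", "Item_Visibility", "Years_Operating"]

# Split first, so the test rows never touch the scaler's statistics.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0
)
demo_X_train = demo_X_train.copy()
demo_X_test = demo_X_test.copy()

# Fit on the training rows only, then apply the same transform to the test rows.
demo_scaler = StandardScaler()
demo_X_train[num_cols] = demo_scaler.fit_transform(demo_X_train[num_cols])
demo_X_test[num_cols] = demo_scaler.transform(demo_X_test[num_cols])

print(demo_X_train[num_cols].mean().round(6))
```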


We covered dataset preparation and feature engineering two weeks ago, and we built Lasso and Ridge regressions on Monday. Today, we will work with ensemble methods.

-------------------------
### Model Building: Ensemble Models

Try out the different ensemble models (Random Forest Regressor, Gradient Boosting, XGBoost).
- **Note:** Spend some time on the documentation for each of these models.
- **Note:** As you spend time on this challenge, it is suggested to review how each of these models works and how they compare to each other.
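One way to try several ensemble models side by side is to keep them in a dictionary and score each on the same held-out split. The sketch below does this on synthetic data from `make_regression`; the data, the variable names, and the hyperparameters are assumptions for illustration, and the resulting numbers say nothing about the sales dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the sales data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Candidate models, all sharing the scikit-learn fit/predict interface.
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training split and record its test-set MSE.
test_mse = {}
for name, model in candidates.items():
    model.fit(Xtr, ytr)
    test_mse[name] = mean_squared_error(yte, model.predict(Xte))

# Report the models from lowest to highest test MSE.
for name, mse in sorted(test_mse.items(), key=lambda kv: kv[1]):
    print(f"{name}: test MSE = {mse:.1f}")
```

An `xgboost.XGBRegressor` could be added to the same dictionary, since it follows the same `fit`/`predict` interface; it is left out here so the sketch depends only on scikit-learn.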

Calculate the **mean squared error** on the test set. Explore how different parameters of the model affect the results and the performance of the model. (*Stretch: Create a visualization to display this information*)

- Use GridSearchCV to find the optimal parameters of the models.
- Compare against the Lasso and Ridge Regression models from Monday.
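A minimal sketch of a grid search for a random forest, again on synthetic data. The parameter grid is an arbitrary assumption chosen to keep the run short; it is not a recommendation for this dataset. `scoring="neg_mean_squared_error"` makes the search minimize MSE, since scikit-learn always maximizes the score.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem standing in for the sales data.
X_gs, y_gs = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.25, random_state=0
)

# Small, arbitrary grid: 2 x 2 x 2 = 8 parameter combinations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    cv=3,
)
search.fit(X_gs_train, y_gs_train)

# best_score_ is the negated mean cross-validated MSE of the best combination.
cv_mse = -search.best_score_
# predict() uses the best estimator refit on the full training split.
grid_test_mse = mean_squared_error(y_gs_test, search.predict(X_gs_test))

print("best parameters:", search.best_params_)
print(f"cross-validated MSE: {cv_mse:.1f}")
print(f"test MSE: {grid_test_mse:.1f}")
```

The test split is only used once, after the search, so the reported test MSE is not influenced by the parameter selection.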

**Questions to answer:**
- Which ensemble model performed the best? 

# Random Forest

In [69]:
# split the data to start
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [70]:
# Start with a random forest regressor using the default hyperparameters
rfr = RandomForestRegressor()
rfr_model = rfr.fit(X_train,y_train)
rfr_pred = rfr.predict(X_test)

In [73]:
# Calculate MSE for RFR
rfr_mse = mean_squared_error(y_test, rfr_pred)
rfr_mse

1288382.7986937696
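MSE is expressed in squared sales units, which makes it hard to read on its own. Taking the square root gives the RMSE, which is in the same units as `Item_Outlet_Sales`. The sketch below applies this to the value printed above; the name `rfr_mse_reported` is introduced here only for illustration.

```python
import numpy as np

# MSE copied from the output above.
rfr_mse_reported = 1288382.7986937696

# RMSE is in the same units as the target.
rfr_rmse = np.sqrt(rfr_mse_reported)
print(f"RMSE = {rfr_rmse:.2f}")
```

The result is roughly 1,135, which is large relative to the sales values visible in the data preview (about 443 to 3,735).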

In [77]:
# Impurity-based importances, in the same order as X_train.columns (they sum to 1)
rfr_model.feature_importances_

array([0.01463304, 0.01581249, 0.0089661 , 0.23540323, 0.0071967 ,
       0.00867326, 0.00706714, 0.49414825, 0.15171747, 0.05638233])

In [76]:
# Check permutation importances with rfpimp (drop in score when a column is shuffled)
importances(rfr_model, X_train, y_train)

Feature,Importance
Item_MRP,1.117232
Outlet_Type,0.637776
Item_Visibility,0.27396
Years_Operating,0.144575
Outlet_Size,0.038527
Item_Fat_Content,0.036731
Outlet_Location_Type,0.024577
Item_Category_Food,0.016192
Item_Category_Non Consumable,0.013525
Item_Category_Drink,0.013246
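The two outputs above measure different things: `feature_importances_` is impurity-based and normalized to sum to 1, while `rfpimp`'s `importances` reports how much the score drops when a column is shuffled, which is why its values are on a different scale. scikit-learn ships a comparable permutation measure in `sklearn.inspection.permutation_importance`. The sketch below compares the two on synthetic data and scores the permutation importances on a held-out split, which is a common choice; the data and names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 2 of which drive the target.
X_imp, y_imp = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=0
)
X_imp_train, X_imp_test, y_imp_train, y_imp_test = train_test_split(
    X_imp, y_imp, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_imp_train, y_imp_train)

# Impurity-based importances: computed from the training data, sum to 1.
impurity_importance = forest.feature_importances_

# Permutation importances: mean drop in R^2 on the held-out split
# when each column is shuffled, averaged over n_repeats shuffles.
perm = permutation_importance(
    forest, X_imp_test, y_imp_test, n_repeats=10, random_state=0
)

# Print features from most to least important by permutation importance.
for i in perm.importances_mean.argsort()[::-1]:
    print(
        f"feature {i}: impurity = {impurity_importance[i]:.3f}, "
        f"permutation = {perm.importances_mean[i]:.3f}"
    )
```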


# XGBoost