# Prediction of Sales

### Problem Statement
The dataset contains sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store are also available. The aim is to build a model that predicts the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which the store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may have missing values, since some stores might not report all of their data because of technical glitches. These values need to be handled before modeling.
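
As a rough illustration of how missing values can be handled, here is a minimal sketch on a toy frame. Only the column names come from the table above; the values are made up, and mean/mode imputation is just one common choice, not necessarily the method used to prepare this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the table above.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Outlet_Size": ["Medium", "Medium", None, "High"],
})

# Numeric column: replace missing values with the column mean.
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())
# Categorical column: replace missing values with the most frequent category.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Size"].mode()[0])

print(toy)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is worth checking how many values were filled before relying on the result.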



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn import model_selection
from sklearn.linear_model import LinearRegression

In [2]:
data = pd.read_csv('regression_exercise_modified.csv')

In [3]:
data = data.drop(['Unnamed: 0'], axis = 1)
data

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Size,Item_Outlet_Sales,Outlet_Operation_Years,Drinks,Food,OUT010,...,OUT019,OUT027,OUT035,OUT045,OUT046,Tier_1,Tier_2,Grocery_Store,Supermarket_Type1,Supermarket_Type2
0,9.300,1,0.016047,249.8092,2,3735.1380,21,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,5.920,2,0.019278,48.2692,2,443.4228,11,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,17.500,1,0.016760,141.6180,2,2097.2700,21,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,19.200,2,0.000000,182.0950,0,732.3800,22,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,8.930,0,0.000000,53.8614,3,994.7052,33,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,6.865,1,0.056783,214.5218,3,2778.3834,33,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
8519,8.380,2,0.046982,108.1570,0,549.2850,18,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
8520,10.600,0,0.035186,85.1224,1,1193.1136,16,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
8521,7.210,2,0.145221,103.1332,2,1845.5976,11,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [4]:
y = data['Item_Outlet_Sales'].to_numpy()

We covered data preparation and feature engineering last week. Now it's time to build some predictive models.

## Model Building

We start with a simple baseline and then fit a linear regression model.

## Task
Make a baseline model. A baseline model requires no predictive modeling; it is more like an informed guess. For instance, predict the sales as the overall average sales, or simply as zero.
Baseline models set a benchmark: if your predictive algorithm scores below it, something is going seriously wrong and you should check your data.

In [5]:
mean = data['Item_Outlet_Sales'].mean()
# Predict the overall mean for every row (len(y) avoids hard-coding the row count).
baseline_mean = np.full(len(y), mean)
print(r2_score(y, baseline_mean))

0.0


In [6]:
# Predict zero sales for every row.
baseline_zero = np.zeros(len(y))
print(r2_score(y, baseline_zero))

-1.634048539244958
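
The two scores above follow directly from the definition $R^2 = 1 - SS_{res}/SS_{tot}$: predicting the mean makes the residual sum of squares equal to the total sum of squares, so the score is exactly 0, and any worse constant guess pushes it below 0. A minimal sketch with a hand-written `r2` helper and made-up toy values (neither is part of the original notebook):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up toy values, not taken from the sales dataset.
toy_sales = [100.0, 200.0, 300.0, 400.0]
toy_mean = sum(toy_sales) / len(toy_sales)

# Constant prediction at the mean: SS_res equals SS_tot.
r2_mean_baseline = r2(toy_sales, [toy_mean] * len(toy_sales))
# Constant prediction at zero: SS_res is larger than SS_tot.
r2_zero_baseline = r2(toy_sales, [0.0] * len(toy_sales))

print(r2_mean_baseline, r2_zero_baseline)
```

A negative score therefore means "worse than always guessing the mean", which is why the mean is the more useful benchmark of the two.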


## Task
Split your data into an 80% training set and a 20% test set.

In [7]:
data = data.drop('Item_Outlet_Sales', axis = 1)
X = np.array(data)

In [8]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.8, test_size=0.2)

print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)

X_train:  [[19.6         0.          0.09416166 ...  0.          1.
   0.        ]
 [12.30570501  1.          0.0684892  ...  0.          0.
   0.        ]
 [10.895       1.          0.13669689 ...  0.          1.
   0.        ]
 ...
 [13.14231395  0.          0.07904699 ...  0.          0.
   0.        ]
 [16.6         1.          0.12247536 ...  0.          1.
   0.        ]
 [15.7         1.          0.04516624 ...  0.          1.
   0.        ]]
y_train:  [4566.0564 4165.2448 5536.7928 ... 1480.0734  347.5476 2516.724 ]
X_test:  [[12.6         2.          0.         ...  0.          1.
   0.        ]
 [ 6.985       2.          0.13736688 ...  0.          1.
   0.        ]
 [14.3         1.          0.1311528  ...  0.          1.
   0.        ]
 ...
 [13.35        1.          0.0385194  ...  0.          1.
   0.        ]
 [13.5         1.          0.         ...  0.          1.
   0.        ]
 [13.3847365   0.          0.0794198  ...  0.          0.
   0.        ]]
y_test:  [6723.24
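
Because the split above does not fix a seed, each run shuffles the rows differently, and the scores reported later will vary slightly between runs. A minimal sketch of a reproducible split on synthetic stand-in data; the `random_state=42` value and the `X_demo`/`y_demo` names are arbitrary choices for illustration, not part of the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```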

## Task
Use grid search (`GridSearchCV`) to find the better value of the `normalize` parameter of `LinearRegression` from `sklearn`. The possible values are True and False.

In [31]:
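# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run as written.
parameters = {'normalize': (True, False)}
lr = LinearRegression()
# GridSearchCV fits one model per parameter value using cross-validation.
clf = model_selection.GridSearchCV(lr, parameters)
clf.fit(X_train, y_train)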

GridSearchCV(estimator=LinearRegression(),
             param_grid={'normalize': (True, False)})

In [32]:
print('Best mean cross-validated score: ', clf.best_score_)

Best mean cross-validated score:  0.5616441353858669


In [33]:
print('Best normalize:', clf.best_estimator_.normalize)

Best normalize: False
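
The `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so the search above only runs on older versions. On newer versions, a roughly comparable search can toggle a preprocessing step inside a `Pipeline`. The sketch below is an assumption about how one might do this, not the notebook's method: `StandardScaler` is not identical to the old `normalize=True` behaviour, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the sales dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])
# 'passthrough' skips the scaling step, so the grid compares scaled vs. unscaled.
param_grid = {"scaler": [StandardScaler(), "passthrough"]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

For ordinary least squares with an intercept, rescaling the features changes the coefficients but not the fitted predictions, so both settings should score almost identically. Scaling matters much more for regularized models such as ridge or lasso.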


## Task
Using the model from the grid search, predict the values in the test set and compare the result against the benchmark.

In [24]:
regressor = LinearRegression(normalize = False).fit(X_train, y_train)
y_pred = regressor.predict(X_test)

In [25]:
print(regressor.coef_)

[   -1.79402926    14.77199496  -122.9428991     15.31756423
   267.57763435     8.52516896    31.93437158    51.72225812
  -455.67487104  -622.11714969   222.62498938  -169.27317535
  -648.43686093  1193.47011487   -27.256397      36.15285953
   179.79251814  -177.9263707    231.52145191 -1104.11173197
    79.91479245  -169.27317535]


In [26]:
r2 = r2_score(y_test, y_pred)
print(r2)

0.550952453109344


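Compared with the baselines ($R^2 = 0$ for the mean and about $-1.63$ for zero), our model does much better with $R^2 \approx 0.55$. Still, it explains only about 55% of the variance in sales, so there is considerable room for improvement.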
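
The comparison above sets a test-set score against baselines that were scored on the full dataset. For a like-for-like comparison, the baseline can be fitted on the training split and scored on the same test split as the model. A minimal sketch using scikit-learn's `DummyRegressor` on synthetic data (the data and variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the sales dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The baseline always predicts the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

# Both are scored on the same held-out rows.
baseline_r2 = r2_score(y_te, baseline.predict(X_te))
model_r2 = r2_score(y_te, model.predict(X_te))
print(baseline_r2, model_r2)
```

Scored this way, the mean baseline lands at or slightly below 0, because the training mean rarely matches the test mean exactly.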