# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may have missing values, since some stores might not report all the data due to technical glitches. These missing values need to be treated before modeling.
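
A quick way to see how much is missing before choosing a treatment is to count nulls per column. The sketch below uses a tiny made-up frame (only the column names come from the table above; the values are illustrative) and fills numeric gaps with the column mean and categorical gaps with the most frequent value. Both choices are assumptions, not necessarily the cleaning applied to this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.2],
    "Outlet_Size": ["Medium", "Medium", None, "Small"],
})

# Count missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# One possible treatment (an assumption): mean for numeric columns,
# most frequent value for categorical ones.
filled = toy.copy()
filled["Item_Weight"] = filled["Item_Weight"].fillna(filled["Item_Weight"].mean())
filled["Outlet_Size"] = filled["Outlet_Size"].fillna(filled["Outlet_Size"].mode()[0])
print(filled)
```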



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

# Hypothesis Generation
1. A larger percentage of display area (`Item_Visibility`) will affect a product's sales.
2. A larger store (`Outlet_Size`) will have higher sales of its products.
3. Sales will go up or down depending on the product category (`Item_Type`): common everyday items will sell more than items that are not used every day.

In [12]:
# load libraries
import pandas as pd
import numpy as np

# load data
df = pd.read_csv('sales_data_cleaned.csv', sep = ',', index_col=0)

In [13]:
df.head(3)

|Unnamed: 0|Item_Weight|Item_Fat_Content|Item_Visibility|Item_MRP|Outlet_Identifier|Outlet_Establishment_Year|Outlet_Size|Outlet_Location_Type|Item_Outlet_Sales|Years_Operation|...|Item_Type_Snack Foods|Item_Type_Soft Drinks|Item_Type_Starchy Foods|Outlet_Type_Grocery Store|Outlet_Type_Supermarket Type1|Outlet_Type_Supermarket Type2|Outlet_Type_Supermarket Type3|Item_Category_DR|Item_Category_FD|Item_Category_NC|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|9.3|0|0.016047|249.8092|9|1999|1|0|3735.138|23|...|0|0|0|0|1|0|0|0|1|0|
|1|5.92|1|0.019278|48.2692|3|2009|1|2|443.4228|13|...|0|1|0|0|0|1|0|1|0|0|
|2|17.5|0|0.01676|141.618|9|1999|1|0|2097.27|23|...|0|0|0|0|1|0|0|0|1|0|


In [14]:
# separate the target from the features
y = df['Item_Outlet_Sales']
X = df.drop(columns='Item_Outlet_Sales')

We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. Baseline models set a benchmark against which to gauge the performance of future models. If a new model performs worse than the baseline, something has gone wrong, and you should check your data.

To make a baseline model, fit a simple linear regression model without altering the default parameters in sklearn.

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
reg = LinearRegression()

## Task
Split your data in 80% train set and 20% test set.

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=123)

reg.fit(X_train,y_train)
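
As a sanity check on the 80/20 split, the shapes of the returned arrays can be inspected. The sketch below does this on a small synthetic array (a stand-in, not the sales data); the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 3 features (values are arbitrary).
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so the RMSE values of different models are computed on the same test rows.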

In [21]:
y_pred = reg.predict(X_test)

# print(y_pred)
print(np.sqrt(mean_squared_error(y_test,y_pred)))

1115.6095976513993
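
The printed figure is the root mean squared error (RMSE) on the test set, expressed in the same units as `Item_Outlet_Sales`. As a minimal sketch of what `np.sqrt(mean_squared_error(...))` computes, here is the same quantity written out with NumPy on made-up numbers; `rmse` is a helper name introduced for illustration.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error: sqrt(mean((y_true - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# Made-up values: the errors are 0, 0 and -2, so RMSE = sqrt(4 / 3).
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example)
```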


## Task
Use grid search (`GridSearchCV`) to find the best value of the parameter `alpha` for the Ridge and Lasso regressions from `sklearn`.

In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, Lasso

params = {"alpha": (0, 0.5, 1, 3, 5, 10)}

ridge = Ridge()
lasso = Lasso()

grid_r = GridSearchCV(ridge, params)

grid_l = GridSearchCV(lasso, params)

In [24]:
grid_r.fit(X_train, y_train)
y_pred = grid_r.predict(X_test)

# print(y_pred)
print(np.sqrt(mean_squared_error(y_test,y_pred)))

1115.3684947475015
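
By default `GridSearchCV` refits the best estimator on the whole training set, which is why `predict` can be called on the search object directly. To see which `alpha` was selected, the fitted object exposes `best_params_` and `best_score_`. The sketch below shows this on synthetic data from `make_regression` (a stand-in, not the sales data) with a positive subset of the grid; the explicit `scoring` and `cv` arguments are assumptions, added so that the search optimizes RMSE instead of the default R² score.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the sales data.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=123
)

search = GridSearchCV(
    Ridge(),
    {"alpha": (0.5, 1, 3, 5, 10)},
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)   # the alpha with the best mean cross-validation score
print(-search.best_score_)   # its cross-validated RMSE
```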


In [25]:
grid_l.fit(X_train, y_train)

y_pred = grid_l.predict(X_test)

# print(y_pred)
print(np.sqrt(mean_squared_error(y_test,y_pred)))

  estimator.fit(X_train, y_train, **fit_params)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


1113.5438072596971
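
The fragments above are the source lines that scikit-learn prints alongside warnings raised while fitting `Lasso`; the warning messages themselves are not preserved in this output. One plausible cause, stated here as an assumption, is the `alpha = 0` entry in the grid: the scikit-learn documentation advises against using `alpha = 0` with `Lasso` and recommends `LinearRegression` for that case. A minimal sketch of a Lasso search restricted to positive `alpha` values, on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the sales data.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# Positive alphas only; max_iter is raised from its default of 1000 as a precaution.
lasso_search = GridSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": (0.5, 1, 3, 5, 10)},
)
lasso_search.fit(X_syn, y_syn)
print(lasso_search.best_params_)
```

Whether dropping `alpha = 0` changes the test RMSE on the sales data is not something this sketch can show; it only illustrates the adjusted grid.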


## Task
Using the best model found by the grid search, predict the values in the test set and compare against your benchmark.

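The test RMSE went down only slightly against our benchmark: from about 1115.61 for the baseline linear regression to 1115.37 for Ridge and 1113.54 for Lasso, an improvement of less than 0.2%.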
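
To put the comparison in numbers, the relative change against the baseline can be computed from the three RMSE values printed in the cells above. The dictionary keys are labels chosen for illustration.

```python
# Test-set RMSE values printed earlier in this notebook.
rmse_scores = {
    "LinearRegression (baseline)": 1115.6095976513993,
    "Ridge (grid search)": 1115.3684947475015,
    "Lasso (grid search)": 1113.5438072596971,
}

baseline = rmse_scores["LinearRegression (baseline)"]

# Percentage by which each model's RMSE is lower than the baseline's.
improvement_pct = {
    name: 100 * (baseline - score) / baseline
    for name, score in rmse_scores.items()
}

for name, pct in improvement_pct.items():
    print(f"{name}: {rmse_scores[name]:.2f} ({pct:.3f}% lower than baseline)")
```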