# Prediction of sales

### Problem Statement
[The dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) contains sales data for 1559 products across 10 stores in different cities. Attributes of each product and store are also available. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may contain missing values, since some stores might not report all of their data due to technical glitches. These values will need to be treated accordingly.



### Over the following weeks, we will explore the problem in these stages:

1. **Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome**
2. **Data Exploration – looking at categorical & continuous feature summaries and making inferences about the data**
3. **Data Cleaning – imputing missing values in the data and checking for outliers**
4. **Feature Engineering – modifying existing variables and/or creating new ones for analysis**
5. **Model Building – making predictive models on the data**
---------

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
data = pd.read_csv('data/regression_exercise.csv')

In [3]:
# Add Unknown categorical variable for missing outlet size
data['Outlet_Size'] = data["Outlet_Size"].fillna("Unknown")

In [4]:
# Drop Item_Weight: its distribution is uniform and it is not useful here
data = data.drop('Item_Weight', axis=1)

In [5]:
# missing data
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

Unnamed: 0,Total,Percent
Item_Identifier,0,0.0
Item_Fat_Content,0,0.0
Item_Visibility,0,0.0
Item_Type,0,0.0
Item_MRP,0,0.0
Outlet_Identifier,0,0.0
Outlet_Establishment_Year,0,0.0
Outlet_Size,0,0.0
Outlet_Location_Type,0,0.0
Outlet_Type,0,0.0
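
The same kind of summary can be written a little more compactly, because the mean of a boolean mask is the fraction of entries that are missing. A minimal sketch on a made-up frame (the `toy` data and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a few missing entries
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

missing = toy.isnull()
summary = pd.DataFrame({
    "Total": missing.sum(),     # number of missing values per column
    "Percent": missing.mean(),  # fraction of rows that are missing
}).sort_values("Total", ascending=False)
print(summary)
```

Note that `Percent` here, as in the cell above, is a fraction between 0 and 1 rather than a percentage.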


## 4. Feature Engineering

1. Resolve the issues in the data to make it ready for analysis.
2. Create some new variables from the existing ones.





### Create a broad category of Type of Item

The `Item_Type` variable has many categories, which might prove useful in analysis. Look at `Item_Identifier`, the unique ID of each item: it starts with FD, DR, or NC. Judging by the item types, these prefixes appear to stand for Food, Drinks, and Non-Consumables.

**Task:** Use the Item_Identifier variable to create a new column

In [6]:

def findCategory(row):
    """Map the two-letter Item_Identifier prefix to a broad item category."""
    prefix = row['Item_Identifier'][0:2]
    if prefix == 'FD':
        return 'Food'
    if prefix == 'DR':
        return 'Drink'
    if prefix == 'NC':
        return 'Non Consumable'

# Apply the function row by row to build the new column
data['Item_Category'] = data.apply(findCategory, axis=1)
data



Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Category
0,FDA15,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380,Food
1,DRC01,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,Drink
2,FDN15,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700,Food
3,FDX07,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,Unknown,Tier 3,Grocery Store,732.3800,Food
4,NCD19,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,Non Consumable
...,...,...,...,...,...,...,...,...,...,...,...,...
8518,FDF22,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834,Food
8519,FDS36,Regular,0.046982,Baking Goods,108.1570,OUT045,2002,Unknown,Tier 2,Supermarket Type1,549.2850,Food
8520,NCJ29,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136,Non Consumable
8521,FDN46,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976,Food


In [7]:
data['Item_Category'].value_counts()

Food              6125
Non Consumable    1599
Drink              799
Name: Item_Category, dtype: int64
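
The row-wise `apply` works, but the same mapping can also be done with vectorized string operations. A minimal sketch on toy identifiers (the `toy` frame and the `prefix_to_category` name are illustrative assumptions):

```python
import pandas as pd

# Toy identifiers (illustrative); the last prefix is deliberately unknown
toy = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01", "NCD19", "XX000"]})

prefix_to_category = {"FD": "Food", "DR": "Drink", "NC": "Non Consumable"}

# .str[:2] takes the first two characters; .map looks each prefix up in the dict
toy["Item_Category"] = toy["Item_Identifier"].str[:2].map(prefix_to_category)
print(toy)
```

Prefixes that are not in the dictionary map to a missing value, which makes unexpected identifiers easy to spot with `isnull()`.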

### Determine the years of operation of a store

**Task:** Make a new column depicting the years of operation of a store (i.e. how long the store has existed).

In [8]:
data.head()

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Category
0,FDA15,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,Food
1,DRC01,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,Drink
2,FDN15,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,Food
3,FDX07,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Unknown,Tier 3,Grocery Store,732.38,Food
4,NCD19,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,Non Consumable


In [9]:
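# Years of operation, relative to the latest establishment year in the data (so the newest store gets 0)
data['Years_Operating'] = data['Outlet_Establishment_Year'].max() - data['Outlet_Establishment_Year']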

In [10]:
data

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Category,Years_Operating
0,FDA15,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380,Food,10
1,DRC01,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,Drink,0
2,FDN15,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700,Food,10
3,FDX07,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,Unknown,Tier 3,Grocery Store,732.3800,Food,11
4,NCD19,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,Non Consumable,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,FDF22,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834,Food,22
8519,FDS36,Regular,0.046982,Baking Goods,108.1570,OUT045,2002,Unknown,Tier 2,Supermarket Type1,549.2850,Food,7
8520,NCJ29,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136,Non Consumable,5
8521,FDN46,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976,Food,0
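
Subtracting from the latest establishment year means the newest store gets 0 years, as the OUT018 rows above show. If the year in which the data was collected is known, it can serve as the reference instead. A sketch of both variants, where `reference_year` is a hypothetical value chosen for illustration and not taken from the dataset:

```python
import pandas as pd

toy = pd.DataFrame({"Outlet_Establishment_Year": [1999, 2009, 1987]})

# Relative to the newest store in the data (the approach used above)
toy["Years_Operating"] = (
    toy["Outlet_Establishment_Year"].max() - toy["Outlet_Establishment_Year"]
)

# Relative to a fixed reference year (hypothetical value, an assumption)
reference_year = 2013
toy["Years_Since_Opening"] = reference_year - toy["Outlet_Establishment_Year"]
print(toy)
```

The two columns differ only by a constant offset, so the choice mainly affects how the values read, not their relative ordering.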


### Modify categories of Item_Fat_Content

**Task:** The categories of the `Item_Fat_Content` variable are represented inconsistently. This should be corrected.

In [11]:
data['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [12]:
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

In [13]:
data['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

**Task:** Some items are non-consumables, and a fat content should not be specified for them. Create a separate category for these observations.

In [14]:
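# Caution: this is chained indexing. The assignment lands on a temporary copy, so `data` is not modified.
# A single call, data.loc[mask, 'Item_Fat_Content'] = value, would update `data` itself.
data.loc[data['Item_Category'] == 'Non Consumable']['Item_Fat_Content'] = 'Non Consumable'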

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[data['Item_Category'] == 'Non Consumable']['Item_Fat_Content'] = 'Non Consumable'


In [15]:
data['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
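
The counts are unchanged because `data.loc[mask]` returns a new DataFrame, and the following `['Item_Fat_Content'] = ...` assigns to that temporary object rather than to `data`; this is what the warning above reports. Passing the row mask and the column label to a single `.loc` call avoids the intermediate copy. A minimal sketch on a toy frame (the values are illustrative, not rows from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_Category": ["Food", "Non Consumable", "Drink", "Non Consumable"],
    "Item_Fat_Content": ["Low Fat", "Low Fat", "Regular", "Low Fat"],
})

mask = toy["Item_Category"] == "Non Consumable"

# Row selector and column label in one .loc call: the assignment reaches `toy`
toy.loc[mask, "Item_Fat_Content"] = "Non Consumable"
print(toy["Item_Fat_Content"].value_counts())
```

Applied to `data`, the same pattern would add `Non Consumable` as a third fat-content category, which the ordinal encoding further down already provides a code for.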

### Numerical and One-Hot Encoding of Categorical variables

Since scikit-learn algorithms accept only numerical variables, we need to **convert all categorical variables into numeric types.** 

- If the variable is ordinal, we can simply map its values to numbers.
- If the variable is nominal (its values cannot be ordered), we need to one-hot encode it, i.e. create dummy variables.
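
The two cases can be sketched on a toy frame as follows. The column names mirror the dataset, but the rows and the `size_order` mapping are illustrative assumptions:

```python
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Size": ["Small", "High", "Medium"],  # ordinal: has a natural order
    "Item_Type": ["Dairy", "Meat", "Dairy"],     # nominal: no natural order
})

# Ordinal: map each level to an integer that preserves the order
size_order = {"Small": 1, "Medium": 2, "High": 3}
toy["Outlet_Size"] = toy["Outlet_Size"].map(size_order)

# Nominal: one 0/1 indicator column per category
dummies = pd.get_dummies(toy["Item_Type"], prefix="Item_Type", dtype=int)
encoded = pd.concat([toy.drop(columns="Item_Type"), dummies], axis=1)
print(encoded)
```

An integer code implies that the gap between adjacent levels is meaningful to the model, which is why it is reserved for variables with a genuine order.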

In [16]:
data.dtypes

Item_Identifier               object
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
Item_Category                 object
Years_Operating                int64
dtype: object

In [17]:
# Drop the ID column; it is not used as a feature
clean_data = data.drop('Item_Identifier', axis=1)
clean_data.dtypes

Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
Item_Category                 object
Years_Operating                int64
dtype: object

In [18]:
#Categorical
clean_data.Item_Type.value_counts()

Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64

In [19]:
#categorical
data.Outlet_Identifier.value_counts()

OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OUT010    555
OUT019    528
Name: Outlet_Identifier, dtype: int64

In [20]:
#drop established year
clean_data = clean_data.drop('Outlet_Establishment_Year', axis=1)

In [21]:
clean_data.Outlet_Type.value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

In [22]:
# Encode ordinal variables as integers

clean_data = clean_data.replace({
    'Item_Fat_Content': {'Non Consumable': 0, 'Low Fat': 1, 'Regular': 2},
    'Outlet_Size': {'Unknown': 0, 'Small': 1, 'Medium': 2, 'High': 3},
    'Outlet_Location_Type': {'Tier 1': 1, 'Tier 2': 2, 'Tier 3': 3},
    'Outlet_Type' : {'Grocery Store': 1, 'Supermarket Type1': 2, 'Supermarket Type2': 3, 'Supermarket Type3': 4},
})

In [23]:
# make dummy variables
cat_feats = clean_data.dtypes[clean_data.dtypes == 'object'].index.to_list()
data_dummy = pd.get_dummies(clean_data[cat_feats])
data_dummy

Unnamed: 0,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,...,Outlet_Identifier_OUT018,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Item_Category_Drink,Item_Category_Food,Item_Category_Non Consumable
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
8519,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
8520,0,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,1
8521,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0


In [24]:
clean_data

Unnamed: 0,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Category,Years_Operating
0,1,0.016047,Dairy,249.8092,OUT049,2,1,2,3735.1380,Food,10
1,2,0.019278,Soft Drinks,48.2692,OUT018,2,3,3,443.4228,Drink,0
2,1,0.016760,Meat,141.6180,OUT049,2,1,2,2097.2700,Food,10
3,2,0.000000,Fruits and Vegetables,182.0950,OUT010,0,3,1,732.3800,Food,11
4,1,0.000000,Household,53.8614,OUT013,3,3,2,994.7052,Non Consumable,22
...,...,...,...,...,...,...,...,...,...,...,...
8518,1,0.056783,Snack Foods,214.5218,OUT013,3,3,2,2778.3834,Food,22
8519,2,0.046982,Baking Goods,108.1570,OUT045,0,2,2,549.2850,Food,7
8520,1,0.035186,Health and Hygiene,85.1224,OUT035,1,2,2,1193.1136,Non Consumable,5
8521,2,0.145221,Snack Foods,103.1332,OUT018,2,3,3,1845.5976,Food,0


In [25]:
#remove remaining objects
clean_data = clean_data.drop(cat_feats, axis=1)

In [26]:
#merge 
ndata = pd.concat([clean_data, data_dummy], axis=1)
ndata

Unnamed: 0,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Years_Operating,Item_Type_Baking Goods,Item_Type_Breads,...,Outlet_Identifier_OUT018,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Item_Category_Drink,Item_Category_Food,Item_Category_Non Consumable
0,1,0.016047,249.8092,2,1,2,3735.1380,10,0,0,...,0,0,0,0,0,0,1,0,1,0
1,2,0.019278,48.2692,2,3,3,443.4228,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,1,0.016760,141.6180,2,1,2,2097.2700,10,0,0,...,0,0,0,0,0,0,1,0,1,0
3,2,0.000000,182.0950,0,3,1,732.3800,11,0,0,...,0,0,0,0,0,0,0,0,1,0
4,1,0.000000,53.8614,3,3,2,994.7052,22,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,1,0.056783,214.5218,3,3,2,2778.3834,22,0,0,...,0,0,0,0,0,0,0,0,1,0
8519,2,0.046982,108.1570,0,2,2,549.2850,7,1,0,...,0,0,0,0,1,0,0,0,1,0
8520,1,0.035186,85.1224,1,2,2,1193.1136,5,0,0,...,0,0,0,1,0,0,0,0,0,1
8521,2,0.145221,103.1332,2,3,3,1845.5976,0,0,0,...,1,0,0,0,0,0,0,0,1,0


**All variables should now be numeric.**

---------
### Exporting Data

**Task:** Save the processed data to your local machine as a CSV file.

In [31]:
data.head()

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_Category,Years_Operating
0,FDA15,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,Food,10
1,DRC01,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,Drink,0
2,FDN15,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,Food,10
3,FDX07,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Unknown,Tier 3,Grocery Store,732.38,Food,11
4,NCD19,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,Non Consumable,22


In [27]:
ndata.to_csv('nproduct_data.csv')
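
By default `to_csv` also writes the row index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. Passing `index=False` leaves it out. A small round-trip sketch using a temporary file (the toy frame and the file name are illustrative assumptions):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Item_MRP": [249.8092, 48.2692], "Years_Operating": [10, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")

    toy.to_csv(path)               # default: index written as the first column
    with_index = pd.read_csv(path)

    toy.to_csv(path, index=False)  # index omitted
    without_index = pd.read_csv(path)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Whether to keep the index depends on whether the row labels carry information; for a default integer index they usually do not.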

In [28]:
ndata.dtypes

Item_Fat_Content                     int64
Item_Visibility                    float64
Item_MRP                           float64
Outlet_Size                          int64
Outlet_Location_Type                 int64
Outlet_Type                          int64
Item_Outlet_Sales                  float64
Years_Operating                      int64
Item_Type_Baking Goods               uint8
Item_Type_Breads                     uint8
Item_Type_Breakfast                  uint8
Item_Type_Canned                     uint8
Item_Type_Dairy                      uint8
Item_Type_Frozen Foods               uint8
Item_Type_Fruits and Vegetables      uint8
Item_Type_Hard Drinks                uint8
Item_Type_Health and Hygiene         uint8
Item_Type_Household                  uint8
Item_Type_Meat                       uint8
Item_Type_Others                     uint8
Item_Type_Seafood                    uint8
Item_Type_Snack Foods                uint8
Item_Type_Soft Drinks                uint8
Item_Type_S

In [30]:
# Separate the target (Item_Outlet_Sales) from the features
y = ndata['Item_Outlet_Sales']
ndata = ndata.drop('Item_Outlet_Sales', axis=1)