##  Stages of Exploring a Problem

1. **Hypothesis Generation** : Understanding the problem by listing the possible factors that could impact the outcome.
2. **Data Exploration** : Looking at data and making inferences about the data.
3. **Data Cleaning** : Imputing missing values, checking for outliers and correcting/removing them.
4. **Feature Engineering** : Modifying existing variables and creating new ones.
5. **Model Building** : Making Predictive Models on data.

## 1. Hypothesis Generation

### Store Level Hypotheses:
1. **City Type** : Stores in urban areas have more sales.
2. **Population** : Stores located in densely populated areas will have more sales.
3. **Store Capacity** : Big stores have more sales as they act like One-Stop-Stores and people prefer getting everything from one place.
4. **Competitors** : Stores with similar establishments nearby will have lower sales because of more competition.
5. **Marketing** : Stores which have a good marketing division should have higher sales, as they will be able to attract customers through the right offers and advertising.
6. **Ambiance** : Stores which are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales.

### Product Level Hypothesis:
1. **Brand** : Branded products have more sales because customers trust them more.
2. **Packaging** : Good packaging can attract more customers.
3. **Utility** : Daily-use products tend to sell more.
4. **Display Area** : Products which are given bigger shelves in the store are likely to catch attention first and sell more.
5. **Visibility in Store** : The location of a product in a store will impact sales. Products placed right at the entrance will catch the customer's eye before the ones at the back.
6. **Advertising** : Better advertising of products in the store should lead to higher sales in most cases.
7. **Promotional Offers** : Products accompanied by attractive offers and discounts will sell more.

## 2. Data Exploration

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(color_codes=True)
import matplotlib.pyplot as plt
from scipy.stats import mode
%matplotlib inline

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train,test],ignore_index=True)

In [3]:
print(train.shape, test.shape, data.shape)

(8523, 13) (5681, 12) (14204, 13)
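
The `source` column exists so that the combined frame can be split back once cleaning and feature engineering are done. A minimal sketch of that round trip on made-up frames (`toy_train`, `toy_test` and their values are illustrative stand-ins, not the real files):

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up).
toy_train = pd.DataFrame({"Item_MRP": [10.0, 20.0],
                          "Item_Outlet_Sales": [100.0, 250.0]})
toy_test = pd.DataFrame({"Item_MRP": [30.0]})

# Tag each row with its origin, then stack the two frames.
toy_train["source"] = "train"
toy_test["source"] = "test"
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# ... cleaning and feature engineering on `combined` would go here ...

# Split back using the marker and drop the helper column.
# The test part also drops the target, which is all NaN there.
train_back = combined[combined["source"] == "train"].drop(columns="source")
test_back = combined[combined["source"] == "test"].drop(
    columns=["source", "Item_Outlet_Sales"])

print(train_back.shape, test_back.shape)
```

Working on one combined frame keeps every transformation identical for both parts, at the cost of having to remember the split at the end.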


In [4]:
data.isnull().sum()

Item_Fat_Content                0
Item_Identifier                 0
Item_MRP                        0
Item_Outlet_Sales            5681
Item_Type                       0
Item_Visibility                 0
Item_Weight                  2439
Outlet_Establishment_Year       0
Outlet_Identifier               0
Outlet_Location_Type            0
Outlet_Size                  4016
Outlet_Type                     0
source                          0
dtype: int64

`Item_Outlet_Sales` is the target variable. Its 5681 missing values are the rows that come from `test`, where the target is not given.
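
A per-column share of missing values is often easier to read than raw counts. A minimal sketch on a made-up frame (the `toy` frame and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Small made-up frame with gaps in two columns.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "High"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# isnull() gives a boolean frame; sum() counts True, mean() gives the share.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)
```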

In [5]:
data.describe()

| | Item_MRP | Item_Outlet_Sales | Item_Visibility | Item_Weight | Outlet_Establishment_Year |
|---|---|---|---|---|---|
| count | 14204.0 | 8523.0 | 14204.0 | 11765.0 | 14204.0 |
| mean | 141.004977 | 2181.288914 | 0.065953 | 12.792854 | 1997.830681 |
| std | 62.086938 | 1706.499616 | 0.051459 | 4.652502 | 8.371664 |
| min | 31.29 | 33.29 | 0.0 | 4.555 | 1985.0 |
| 25% | 94.012 | 834.2474 | 0.027036 | 8.71 | 1987.0 |
| 50% | 142.247 | 1794.331 | 0.054021 | 12.6 | 1999.0 |
| 75% | 185.8556 | 3101.2964 | 0.094037 | 16.75 | 2004.0 |
| max | 266.8884 | 13086.9648 | 0.328391 | 21.35 | 2009.0 |


`Item_Visibility` has a minimum value of `0.0`. That is not plausible: every product stocked in a store must have some visibility, so the zeros most likely stand for missing values.

`Outlet_Establishment_Year` is not very informative as a raw year; it is more useful when converted into _how old a store is_.

In [6]:
data.describe(include=['O'])

| | Item_Fat_Content | Item_Identifier | Item_Type | Outlet_Identifier | Outlet_Location_Type | Outlet_Size | Outlet_Type | source |
|---|---|---|---|---|---|---|---|---|
| count | 14204 | 14204 | 14204 | 14204 | 14204 | 10188 | 14204 | 14204 |
| unique | 5 | 1559 | 16 | 10 | 3 | 3 | 4 | 2 |
| top | Low Fat | NCK18 | Fruits and Vegetables | OUT027 | Tier 3 | Medium | Supermarket Type1 | train |
| freq | 8485 | 10 | 2013 | 1559 | 5583 | 4655 | 9294 | 8523 |


`Item_Identifier` has 1559 unique values.

There are 16 types of `Item_Type`.

`Outlet_Location_Type` has 3 unique locations: `Tier 1`, `Tier 2` and `Tier 3`.

`Outlet_Size` also has 3 different values.

**Getting Unique values of Categorical Variables**

In [7]:
categorical_cols = [x for x in data.dtypes.index if data.dtypes[x]=='object' and x not in ['Item_Identifier','Outlet_Identifier','source']]
print(categorical_cols)

['Item_Fat_Content', 'Item_Type', 'Outlet_Location_Type', 'Outlet_Size', 'Outlet_Type']


In [8]:
for cols in categorical_cols:
    print('In %s' % cols)
    print(data[cols].value_counts())
    print('------------------------')

In Item_Fat_Content
Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64
------------------------
In Item_Type
Fruits and Vegetables    2013
Snack Foods              1989
Household                1548
Frozen Foods             1426
Dairy                    1136
Baking Goods             1086
Canned                   1084
Health and Hygiene        858
Meat                      736
Soft Drinks               726
Breads                    416
Hard Drinks               362
Others                    280
Starchy Foods             269
Breakfast                 186
Seafood                    89
Name: Item_Type, dtype: int64
------------------------
In Outlet_Location_Type
Tier 3    5583
Tier 2    4641
Tier 1    3980
Name: Outlet_Location_Type, dtype: int64
------------------------
In Outlet_Size
Medium    4655
Small     3980
High      1553
Name: Outlet_Size, dtype: int64
------------------------
In Outlet_Type
Supermarket Type1    92

1. **Item_Fat_Content**:
    1. Some `Low Fat` entries are miscoded as `low fat` and `LF`.
    2. Similarly, some `Regular` entries are miscoded as `reg`.

2. **Item_Type**: With 16 categories, some of them with very few rows, combining some of these categories may give better results.

## 3. Data Cleaning

### Imputing Missing Values
Two columns have missing values to impute:

1. `Item_Weight`
2. `Outlet_Size`

In [9]:
## Imputing Item_Weight with the average weight of that particular item.

# Select the Item_Weight column before transform, so the result is a Series
# aligned with the rows of data; fillna then only touches the missing weights.
item_mean_weight = data.groupby('Item_Identifier')['Item_Weight'].transform('mean')
data['Item_Weight'] = data['Item_Weight'].fillna(item_mean_weight)

In [10]:
data.Item_Weight.describe()

count    14204.000000
mean        12.793380
std          4.651716
min          4.555000
25%          8.710000
50%         12.600000
75%         16.750000
max         21.350000
Name: Item_Weight, dtype: float64
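
To see what a per-item mean imputation does row by row, here is a small sketch on made-up weights (the `toy` frame is illustrative; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Two items, each with one missing weight (values are made up).
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "DRC01"],
    "Item_Weight": [9.3, np.nan, 5.9, np.nan, 6.1],
})

# transform('mean') returns one value per row: the mean weight of that
# row's item, computed over the non-missing weights only.
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# fillna only replaces the gaps; observed weights are left untouched.
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_mean)
print(toy)
```

An item whose weight is missing in every row would still be `NaN` after this step, so a final check with `isnull().sum()` is worth keeping.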

In [11]:
## Imputing Outlet_Size values by mode of Outlet_Size in that Outlet_Type Category
data.pivot_table(values='Outlet_Size',columns='Outlet_Type', aggfunc=lambda x: x.mode()[0])

| Outlet_Type | Grocery Store | Supermarket Type1 | Supermarket Type2 | Supermarket Type3 |
|---|---|---|---|---|
| Outlet_Size | Small | Small | Medium | Medium |


In [12]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['Outlet_Size'] = data['Outlet_Size'].fillna(data.groupby('Outlet_Type')['Outlet_Size'].transform(lambda x: x.mode()[0]))

In [13]:
data.Outlet_Size.describe()

count     14204
unique        3
top       Small
freq       7996
Name: Outlet_Size, dtype: object
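
Taking `x.mode()[0]` fails when a group has no observed value at all, because the mode is then empty. A hedged sketch of a variant that tolerates such a group, on made-up data (the `Kiosk` outlet type is hypothetical and only added to show the empty case):

```python
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type2", "Supermarket Type2", "Kiosk"],
    "Outlet_Size": ["Small", "Small", None, "Medium", None, None],
})

# Most frequent size per outlet type; groups without any observed size
# are skipped, so they simply get no entry in the lookup.
mode_by_type = {
    outlet_type: sizes.mode().iloc[0]
    for outlet_type, sizes in toy.groupby("Outlet_Type")["Outlet_Size"]
    if not sizes.mode().empty
}

# map() looks up each row's outlet type; unknown types map to NaN,
# which leaves the original gap in place.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(mode_by_type))
print(toy)
```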

## 4. Feature Engineering

#### Combining `Outlet_Type`
First, check whether it is OK to combine `Supermarket Type2` and `Supermarket Type3`, i.e. whether their sales are similar.

In [14]:
data.pivot_table(values='Item_Outlet_Sales',index='Outlet_Type')

| Outlet_Type | Item_Outlet_Sales |
|---|---|
| Grocery Store | 339.8285 |
| Supermarket Type1 | 2316.181148 |
| Supermarket Type2 | 1995.498739 |
| Supermarket Type3 | 3694.038558 |


**NOPE!** Their mean sales differ a lot (about 1995 vs 3694), so leave 'em alone!
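
Means alone can hide differences in spread or group size. A small sketch of a fuller comparison, using made-up sales figures (the numbers in `toy` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Type": ["Supermarket Type2"] * 3 + ["Supermarket Type3"] * 3,
    "Item_Outlet_Sales": [1500.0, 2000.0, 2500.0, 3000.0, 3500.0, 4600.0],
})

# Several statistics per group in one call.
summary = toy.groupby("Outlet_Type")["Item_Outlet_Sales"].agg(
    ["count", "mean", "median"])
print(summary)
```

If the medians and means of two groups are both far apart, merging them would throw away a useful signal.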

#### Correcting `Item_Visibility`
The `0` values in it were suspicious, so treat them as `NaN` and fill them with the mean visibility of the corresponding product identifier.

In [15]:
data.pivot_table(values='Item_Visibility',index='Item_Identifier')

| Item_Identifier | Item_Visibility |
|---|---|
| DRA12 | 0.034938 |
| DRA24 | 0.045646 |
| DRA59 | 0.133384 |
| DRB01 | 0.079736 |
| DRB13 | 0.006799 |
| DRB24 | 0.020596 |
| DRB25 | 0.079407 |
| DRB48 | 0.023973 |
| DRC01 | 0.020653 |
| DRC12 | 0.037862 |


In [16]:
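# Select Item_Visibility before transform so a Series, not a whole DataFrame, is assigned
data.loc[data['Item_Visibility']==0,'Item_Visibility'] = data.groupby('Item_Identifier')['Item_Visibility'].transform('mean')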

In [17]:
data.pivot_table(values='Item_Visibility', index='Item_Identifier')

| Item_Identifier | Item_Visibility |
|---|---|
| DRA12 | 0.042702 |
| DRA24 | 0.045646 |
| DRA59 | 0.146722 |
| DRB01 | 0.089703 |
| DRB13 | 0.007554 |
| DRB24 | 0.020596 |
| DRB25 | 0.079407 |
| DRB48 | 0.026637 |
| DRC01 | 0.020653 |
| DRC12 | 0.037862 |
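
One possible variant, sketched here as an assumption rather than what the notebook does, is to turn the zeros into `NaN` first so that they do not pull the per-item mean down (the `toy` frame and its values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "A", "B", "B"],
    "Item_Visibility": [0.02, 0.0, 0.04, 0.10, 0.0],
})

# Mark zeros as missing so mean() skips them.
vis = toy["Item_Visibility"].replace(0, np.nan)

# Per-item mean over the non-missing values, one value per row.
item_mean = vis.groupby(toy["Item_Identifier"]).transform("mean")

# Fill only the former zeros.
toy["Item_Visibility"] = vis.fillna(item_mean)
print(toy)
```

Here item `A` is filled with the mean of its two observed values rather than with a mean that counts the zero as a real observation.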


#### Creating new Variable

Take the ratio of an item's `Item_Visibility` in a particular store to its average visibility across all stores.

This helps determine how much importance that particular store gives to the product.

In [18]:
visibility_avg = pd.pivot_table(data,values=['Item_Visibility'],index=['Item_Identifier'])

In [19]:
# Ratio of each row's visibility to the mean visibility of that item across all stores
data['Item_Visibility_MeanRatio'] = data['Item_Visibility'] / data['Item_Identifier'].map(visibility_avg['Item_Visibility'])
print(data['Item_Visibility_MeanRatio'].describe())

count    14204.000000
mean         1.000000
std          0.207021
min          0.600000
25%          0.879677
50%          0.928859
75%          0.999070
max          1.806056
Name: Item_Visibility_MeanRatio, dtype: float64
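
A worked toy example may make the ratio easier to interpret: a value above 1 means the store shows the item more prominently than the item's average, a value below 1 means less. The frame below is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B"],
    "Item_Visibility": [0.02, 0.06, 0.10, 0.10],
})

# Mean visibility of each item, repeated on every row of that item.
item_mean = toy.groupby("Item_Identifier")["Item_Visibility"].transform("mean")

# Row value divided by its item's mean.
toy["Item_Visibility_MeanRatio"] = toy["Item_Visibility"] / item_mean
print(toy)
```

Item `A` averages 0.04, so its two rows get ratios of about 0.5 and 1.5, while item `B` is shown equally in both stores and gets 1.0 twice.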


#### Creating a new Broad Category of `Item_Type`


In [21]:
data.Item_Identifier.apply(lambda x: x[0:2]).unique()

array(['FD', 'DR', 'NC'], dtype=object)

These Categories may be `FD=>Food`, `DR=>Drink`, `NC=>Non-Consumables`
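
If readable labels are preferred over the raw two-letter codes, the prefixes can be mapped to names. This is only a sketch: the meanings in `prefix_names` are the guesses stated above, not documented facts about the dataset, and the three identifiers are sample values.

```python
import pandas as pd

ids = pd.Series(["FDA15", "DRC01", "NCD19"], name="Item_Identifier")

# Assumed meanings of the prefixes (a guess, see the text above).
prefix_names = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}

# .str[:2] takes the first two characters of every identifier.
broad = ids.str[:2].map(prefix_names)
print(broad.tolist())
```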

In [24]:
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])

In [25]:
data.Item_Type_Combined.value_counts()

FD    10201
NC     2686
DR     1317
Name: Item_Type_Combined, dtype: int64

#### Determining Age of Store

In [27]:
data['Outlet_Years'] = 2013 - data['Outlet_Establishment_Year']
data.Outlet_Years.describe()

count    14204.000000
mean        15.169319
std          8.371664
min          4.000000
25%          9.000000
50%         14.000000
75%         26.000000
max         28.000000
Name: Outlet_Years, dtype: float64

####  Correcting Labels of Item_Fat_Content

In [28]:
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'LF':'Low Fat',
                                                           'reg':'Regular',
                                                            'low fat':'Low Fat'})

In [29]:
data.Item_Fat_Content.value_counts()

Low Fat    9185
Regular    5019
Name: Item_Fat_Content, dtype: int64

But for `Non-Consumables` item types a fat content makes no sense, so mark them as a separate category.

In [31]:
data.loc[data['Item_Type_Combined']=='NC','Item_Fat_Content'] = 'Non-Edible'

In [32]:
data.Item_Fat_Content.value_counts()

Low Fat       6499
Regular       5019
Non-Edible    2686
Name: Item_Fat_Content, dtype: int64

### One-Hot Encoding of Categorical variables

In [33]:
from sklearn.preprocessing import LabelEncoder

In [34]:
le = LabelEncoder()


In [35]:
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])

In [37]:
var_mod = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Item_Type_Combined','Outlet_Type','Outlet']

for i in var_mod:
    data[i] = le.fit_transform(data[i])

In [38]:
data.head()

| | Item_Fat_Content | Item_Identifier | Item_MRP | Item_Outlet_Sales | Item_Type | Item_Visibility | Item_Weight | Outlet_Establishment_Year | Outlet_Identifier | Outlet_Location_Type | Outlet_Size | Outlet_Type | source | Item_Visibility_MeanRatio | Item_Type_Combined | Outlet_Years | Outlet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | FDA15 | 249.8092 | 3735.138 | Dairy | 0.016047 | 9.3 | 1999 | OUT049 | 0 | 1 | 1 | train | 0.931078 | 1 | 14 | 9 |
| 1 | 2 | DRC01 | 48.2692 | 443.4228 | Soft Drinks | 0.019278 | 5.92 | 2009 | OUT018 | 2 | 1 | 2 | train | 0.93342 | 0 | 4 | 3 |
| 2 | 0 | FDN15 | 141.618 | 2097.27 | Meat | 0.01676 | 17.5 | 1999 | OUT049 | 0 | 1 | 1 | train | 0.87279 | 1 | 14 | 9 |
| 3 | 2 | FDX07 | 182.095 | 732.38 | Fruits and Vegetables | 0.017834 | 19.2 | 1998 | OUT010 | 2 | 2 | 0 | train | 0.818182 | 1 | 15 | 0 |
| 4 | 1 | NCD19 | 53.8614 | 994.7052 | Household | 0.00978 | 8.93 | 1987 | OUT013 | 2 | 0 | 1 | train | 0.75 | 2 | 26 | 1 |


In [39]:
data = pd.get_dummies(data, columns=var_mod)
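
Label-encoding first produces dummy columns named after integer codes, such as `Outlet_Size_0`. As a possible alternative, `pd.get_dummies` can be applied to the string column directly, which keeps the category names in the column headers. A minimal sketch on made-up rows (the `toy` frame is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Size": ["Small", "Medium", "High", "Small"],
    "Item_MRP": [48.3, 249.8, 141.6, 53.9],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Outlet_Size"], dtype=int)
print(encoded.columns.tolist())
```

The resulting columns are `Outlet_Size_High`, `Outlet_Size_Medium` and `Outlet_Size_Small`, which are easier to read in model coefficients or feature-importance plots than numbered codes.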

In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14204 entries, 0 to 14203
Data columns (total 37 columns):
Item_Identifier              14204 non-null object
Item_MRP                     14204 non-null float64
Item_Outlet_Sales            8523 non-null float64
Item_Type                    14204 non-null object
Item_Visibility              14204 non-null float64
Item_Weight                  14204 non-null float64
Outlet_Establishment_Year    14204 non-null int64
Outlet_Identifier            14204 non-null object
source                       14204 non-null object
Item_Visibility_MeanRatio    14204 non-null float64
Outlet_Years                 14204 non-null int64
Item_Fat_Content_0           14204 non-null uint8
Item_Fat_Content_1           14204 non-null uint8
Item_Fat_Content_2           14204 non-null uint8
Outlet_Location_Type_0       14204 non-null uint8
Outlet_Location_Type_1       14204 non-null uint8
Outlet_Location_Type_2       14204 non-null uint8
Outlet_Size_0               