### Automated Feature Engineering

The data consists of a train set (8,523 rows) and a test set (5,681 rows). The train set contains both the input variables and the output variable; the task is to predict sales for the test set.

 

| Variable | Description |
|---|---|
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or not |
| Item_Visibility | The % of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which the store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of city in which the store is located |
| Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store. This is the outcome variable to be predicted. |

In [14]:
# load libraries

import numpy as np
import pandas as pd

import featuretools as ft


In [2]:
# load dataframe
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [3]:
# saving identifiers
test_Item_Identifier = test['Item_Identifier']
test_Outlet_Identifier = test['Outlet_Identifier']
sales = train['Item_Outlet_Sales']
train.drop(['Item_Outlet_Sales'], axis=1, inplace=True)

In [7]:
# combining train and test
comb = pd.concat([train, test], ignore_index=True)

In [8]:
comb.sample(4)

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 13420 | FDN60 | 15.1 | Low Fat | 0.095079 | Baking Goods | 159.6604 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 |
| 3895 | NCK05 | 20.1 | Low Fat | 0.0 | Health and Hygiene | 61.3536 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store |
| 8985 | FDH52 | 9.42 | Regular | 0.0 | Frozen Foods | 61.3194 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 6044 | NCM06 | 7.475 | Low Fat | 0.126753 | Household | 154.2656 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store |


In [9]:
# missing check
comb.isnull().sum()

Item_Identifier                 0
Item_Weight                  2439
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  4016
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

In [10]:
# imputing missing data
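# assign the result back instead of calling fillna(inplace=True) on a column selection
comb['Item_Weight'] = comb['Item_Weight'].fillna(comb['Item_Weight'].mean())
comb['Outlet_Size'] = comb['Outlet_Size'].fillna("missing")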

In [11]:
comb['Item_Fat_Content'].value_counts()

Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64

In [12]:
# dictionary to replace the categories
fat_content_dict = {'Low Fat':0, 'Regular':1, 'LF':0, 'reg':1, 'low fat':0}

# exact-match replacement; the keys are literal labels, so regex matching is not needed
comb['Item_Fat_Content'] = comb['Item_Fat_Content'].replace(fat_content_dict)
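
The label consolidation above can be checked on a small stand-in series. This is a minimal sketch, not part of the original notebook: the toy values are illustrative, and it uses `Series.map` instead of `replace` because `map` leaves `NaN` for any label the dictionary does not cover, which makes unmapped spellings easy to detect.

```python
import pandas as pd

# Toy stand-in for comb['Item_Fat_Content'] (illustrative values, not the real column).
fat = pd.Series(["Low Fat", "Regular", "LF", "reg", "low fat"])

# Same mapping as in the notebook: three spellings of low fat -> 0, two of regular -> 1.
fat_content_dict = {"Low Fat": 0, "Regular": 1, "LF": 0, "reg": 1, "low fat": 0}

encoded = fat.map(fat_content_dict)

# map() yields NaN for labels missing from the dictionary,
# so this check fails loudly if a new spelling shows up.
assert encoded.notna().all()
print(encoded.tolist())
```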

In [13]:
comb['id'] = comb['Item_Identifier'] + comb['Outlet_Identifier']
comb.drop(['Item_Identifier'], axis=1, inplace=True)
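
The concatenated `id` is later used as the index of the entity, so it has to be unique per row. A minimal sketch of that check on toy data follows; the frame `toy` and its values are assumptions made for illustration, and on the real data the same check would be `comb['id'].is_unique`.

```python
import pandas as pd

# Toy rows: the same item sold in two outlets, plus a second item (illustrative values).
toy = pd.DataFrame({
    "Item_Identifier": ["FDN60", "FDN60", "NCK05"],
    "Outlet_Identifier": ["OUT013", "OUT010", "OUT010"],
})

# Neither column is unique on its own, but the item/outlet pair is.
toy["id"] = toy["Item_Identifier"] + toy["Outlet_Identifier"]

print(toy["Item_Identifier"].is_unique)  # the item repeats across outlets
print(toy["id"].is_unique)               # the combined key does not repeat
```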

In [None]:
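# creating an entity set 'es'
# note: entity_from_dataframe / normalize_entity are the pre-1.0 featuretools API;
# featuretools 1.0+ renamed them to add_dataframe / normalize_dataframe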
es = ft.EntitySet(id='sales')

# adding a dataframe 
es.entity_from_dataframe(entity_id='bigmart', dataframe=comb, index='id')

In [None]:
# move the outlet-level columns into their own 'outlet' entity, keyed by Outlet_Identifier
es.normalize_entity(base_entity_id='bigmart', new_entity_id='outlet',
                    index='Outlet_Identifier',
                    additional_variables=['Outlet_Establishment_Year', 'Outlet_Size',
                                          'Outlet_Location_Type', 'Outlet_Type'])
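
Conceptually, normalizing the entity splits one wide table into a parent table with one row per outlet and a child table that keeps only the outlet key. The following pandas-only sketch imitates that split on toy data. It is an illustration of the idea, not what featuretools does internally, and the frame names `bigmart_slim` and `outlet` are hypothetical.

```python
import pandas as pd

# Toy version of the combined table (illustrative values, subset of columns).
bigmart = pd.DataFrame({
    "id": ["FDN60OUT013", "NCK05OUT010", "NCM06OUT010"],
    "Item_MRP": [159.66, 61.35, 154.27],
    "Outlet_Identifier": ["OUT013", "OUT010", "OUT010"],
    "Outlet_Establishment_Year": [1987, 1998, 1998],
    "Outlet_Type": ["Supermarket Type1", "Grocery Store", "Grocery Store"],
})
outlet_cols = ["Outlet_Establishment_Year", "Outlet_Type"]

# Parent table: one row per outlet, carrying the outlet-level attributes.
outlet = (
    bigmart[["Outlet_Identifier"] + outlet_cols]
    .drop_duplicates(subset="Outlet_Identifier")
    .reset_index(drop=True)
)

# Child table: item-level rows that keep only the key pointing at their outlet.
bigmart_slim = bigmart.drop(columns=outlet_cols)

print(outlet)
print(bigmart_slim)
```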

In [None]:
# summary 
es

In [None]:
feature_matrix, feature_names = ft.dfs(entityset=es, target_entity='bigmart',
                                       max_depth=2, verbose=1, n_jobs=3)
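
To get a feel for the kind of feature this produces, the sketch below builds two outlet-level aggregates by hand with pandas and attaches them to each item row. It assumes toy data, and the column names only imitate the style of names featuretools generates; the actual set of features depends on the primitives DFS applies.

```python
import pandas as pd

# Toy item-level table (illustrative values).
bigmart = pd.DataFrame({
    "id": ["FDN60OUT013", "NCK05OUT010", "NCM06OUT010"],
    "Outlet_Identifier": ["OUT013", "OUT010", "OUT010"],
    "Item_MRP": [100.0, 60.0, 150.0],
})

grouped = bigmart.groupby("Outlet_Identifier")["Item_MRP"]

# Aggregate over all rows of the same outlet, then broadcast back to every row.
bigmart["outlet.MEAN(bigmart.Item_MRP)"] = grouped.transform("mean")
bigmart["outlet.COUNT(bigmart)"] = grouped.transform("count")

print(bigmart)
```

Each row now carries information about its outlet as a whole, which is the effect of stacking an aggregation on the `outlet` entity and reading it back from `bigmart`.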

In [None]:
feature_matrix.columns

In [None]:
feature_matrix.head()

In [None]:
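# restore the row order of comb (train rows first, then test rows)
feature_matrix = feature_matrix.reindex(index=comb['id'])
# turn the 'id' index back into a regular column
feature_matrix = feature_matrix.reset_index()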
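
Once the rows are back in the order of `comb`, the matrix can be cut into train and test parts by position, since the train rows were concatenated first. The notebook section ends before this step, so the following is only a sketch under that assumption, on toy data; `n_train`, `train_fm` and `test_fm` are hypothetical names, and on the real data `n_train` would be the length of `sales` (8523).

```python
import pandas as pd

# Toy feature matrix in the same order as comb: train rows first, then test rows.
feature_matrix = pd.DataFrame({
    "id": ["a", "b", "c"],
    "feature_1": [1.0, 2.0, 3.0],
})
sales = pd.Series([10.0, 20.0])  # target values exist for the train rows only
n_train = len(sales)

# Positional split; the id column is an identifier, not a predictor, so it is dropped.
train_fm = feature_matrix.iloc[:n_train].drop(columns="id")
test_fm = feature_matrix.iloc[n_train:].drop(columns="id")

# The train part must line up one-to-one with the saved target.
assert len(train_fm) == len(sales)
print(train_fm.shape, test_fm.shape)
```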