# Problem Statement 

- BigMart has collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim is **to build a predictive model and predict the sales of each product at a particular outlet.**

- Using this model, we will try to understand which properties of products and outlets play a key role in increasing sales.

__We will explore the problem in the following stages:__

- __Hypothesis Generation :__ understanding the problem better by brainstorming possible factors that can impact the outcome

- __Data Exploration :__ looking at categorical and continuous feature summaries and making inferences about the data.

- __Data Cleaning :__ imputing missing values in the data and checking for outliers

- __Feature Engineering :__ modifying existing variables and creating new ones for analysis

- __Model Building :__ making predictive models on the data

##  Hypothesis Generation 

- Before looking at the data, we try to understand the problem and make some hypotheses about what could potentially have a strong impact on the outcome.

- So, we should find out which properties of a product and a store affect the sales of that product.

### The Hypotheses


- Some example hypotheses:

#### Store Level Hypotheses: 



__City type:__ Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there.

__Population Density:__ Stores located in densely populated areas should have higher sales because of more demand.

__Store Capacity:__ Very large stores should have higher sales, as they act like one-stop shops and people prefer getting everything from one place.

__Competitors:__ Stores with similar establishments nearby should have lower sales because of more competition.

__Marketing:__ Stores which have a good marketing division should have higher sales as it will be able to attract customers through the right offers and advertising.

__Location:__ Stores located within popular marketplaces should have higher sales because of better access to customers.

__Customer Behavior:__ Stores keeping the right set of products to meet the local needs of customers will have higher sales.

__Ambiance:__ Stores which are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales.

#### Product Level Hypotheses: 



__Brand:__ Branded products should have higher sales because customers place more trust in them.

__Packaging:__ Products with good packaging can attract customers and sell more.

__Utility:__ Daily use products should have a higher tendency to sell as compared to the specific use products.

__Display Area:__ Products which are given bigger shelves in the store are likely to catch attention first and sell more.

__Visibility in Store:__ The location of a product in a store will impact sales. Products placed right at the entrance will catch the customer's eye before those at the back.

__Advertising:__ Better advertising of products in the store should lead to higher sales in most cases.

__Promotional Offers:__ Products accompanied with attractive offers and discounts will sell more.


__NOTE:__ Other hypotheses are possible as well. We will try to test these hypotheses to the extent the data allows.

# Data Exploration 

<img src="Data Dictionary.png" />

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv("train.csv")
test = pd.read_csv('test.csv')

In [2]:
# Let's make a copy of the data 
train_original = train.copy()
test_original = test.copy()

- It's generally a good idea to combine the train and test data sets into one, perform feature engineering, and then split them again later. This saves the trouble of performing the same steps twice, once on train and once on test.

In [3]:
train["source"] = 'train'
test['source'] = 'test'

data = pd.concat([train, test], ignore_index=True)

In [4]:
print(train.shape, test.shape, data.shape)

(8523, 13) (5681, 12) (14204, 13)


In [5]:
# Let's look at the null values 

data.apply(lambda x: sum(x.isnull())) # data.isnull().sum()

Item_Identifier                 0
Item_Weight                  2439
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  4016
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales            5681
source                          0
dtype: int64

- Item_Outlet_Sales is the target variable; its 5681 missing values are exactly the rows from the test set. Item_Weight and Outlet_Size are the only features with missing values.

In [6]:
# Let's look at some basic statistics for numerical variables

data.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 11765.0 | 14204.0 | 14204.0 | 14204.0 | 8523.0 |
| mean | 12.792854 | 0.065953 | 141.004977 | 1997.830681 | 2181.288914 |
| std | 4.652502 | 0.051459 | 62.086938 | 8.371664 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 8.71 | 0.027036 | 94.012 | 1987.0 | 834.2474 |
| 50% | 12.6 | 0.054021 | 142.247 | 1999.0 | 1794.331 |
| 75% | 16.75 | 0.094037 | 185.8556 | 2004.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |


- **Item_Visibility** has a minimum value of zero. This makes no practical sense, because a product that is on sale in a store cannot have zero visibility.

- **Outlet_Establishment_Year** varies from 1985 to 2009. The values might not be useful in this form. If we convert them to the age of each store, they should relate to sales more directly.



In [7]:
# Let's look at the number of unique values in each column

data.apply(lambda x : len(x.unique()))

Item_Identifier               1559
Item_Weight                    416
Item_Fat_Content                 5
Item_Visibility              13006
Item_Type                       16
Item_MRP                      8052
Outlet_Identifier               10
Outlet_Establishment_Year        9
Outlet_Size                      4
Outlet_Location_Type             3
Outlet_Type                      4
Item_Outlet_Sales             3494
source                           2
dtype: int64

- So, we can see that there are __1559 products and 10 outlets.__

In [8]:
categorical_columns = [x for x in data.columns if data.dtypes[x]=='object']
categorical_columns

['Item_Identifier',
 'Item_Fat_Content',
 'Item_Type',
 'Outlet_Identifier',
 'Outlet_Size',
 'Outlet_Location_Type',
 'Outlet_Type',
 'source']

In [9]:
# Let's explore further using the frequency of the different categories in each nominal variable.

# Filter categorical variables

categ_columns = [x for x in data.dtypes.index if data.dtypes[x]=='object']

# Exclude ID cols and source

categ_columns = [x for x in categ_columns if x not in ['Item_Identifier','Outlet_Identifier', 'source' ]]

# print Frequency of categories 

for col in categ_columns :
    print('\nFrequency of Categories for variables',col)
    print(data[col].value_counts())


Frequency of Categories for variables Item_Fat_Content
Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64

Frequency of Categories for variables Item_Type
Fruits and Vegetables    2013
Snack Foods              1989
Household                1548
Frozen Foods             1426
Dairy                    1136
Baking Goods             1086
Canned                   1084
Health and Hygiene        858
Meat                      736
Soft Drinks               726
Breads                    416
Hard Drinks               362
Others                    280
Starchy Foods             269
Breakfast                 186
Seafood                    89
Name: Item_Type, dtype: int64

Frequency of Categories for variables Outlet_Size
Medium    4655
Small     3980
High      1553
Name: Outlet_Size, dtype: int64

Frequency of Categories for variables Outlet_Location_Type
Tier 3    5583
Tier 2    4641
Tier 1    3980
Name: Outlet_Location_Type, dtype: int64

The output gives us following observations:

- __Item_Fat_Content:__ Some ‘Low Fat’ values are mis-coded as ‘low fat’ and ‘LF’. Also, some ‘Regular’ values are recorded as ‘reg’.

- __Item_Type:__ Not all categories have substantial numbers. It looks like combining them can give better results.

- __Outlet_Type:__ Supermarket Type2 and Type3 can be combined. But we should check if that’s a good idea before doing it.

# Data Cleaning 

- This step typically involves imputing missing values and treating outliers. Though outlier removal is very important in regression techniques, advanced tree-based algorithms are largely robust to outliers. So we will focus on the imputation step: Item_Weight is filled with its mean and Outlet_Size with its mode.

In [10]:
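# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about and which may not modify `data`
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])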
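
A finer-grained alternative to the global mean is to fill each missing weight from the other rows of the same product. The sketch below is not what this notebook does; it assumes that a product has the same weight in every store, and it uses a small made-up `toy` frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined data; the values are made up for illustration
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, np.nan, 5.9, np.nan, np.nan],
})

# Mean weight of each item, broadcast back to the rows of that item
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill from the item's own mean first ...
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_mean)

# ... then fall back to the overall mean for items whose weight
# is never recorded (NCD19 in this toy frame)
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())

print(toy)
```

The fallback step matters: an item whose weight is missing in every row has no per-item mean to borrow from.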

In [11]:
data.isnull().sum()

Item_Identifier                 0
Item_Weight                     0
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                     0
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales            5681
source                          0
dtype: int64

# Feature Engineering 

- We explored some nuances of the data in the data exploration section. Let's resolve them.

- We also create some new variables using the existing ones in this section.

__1. Consider combining Outlet_Type__

During exploration, we considered combining the Supermarket Type2 and Type3 categories. But is that a good idea? A quick way to check is to compare the mean sales by type of store. If the two types have similar sales, then keeping them separate won’t help much.


In [12]:
data.pivot_table(values='Item_Outlet_Sales',index='Outlet_Type')

| Outlet_Type | Item_Outlet_Sales |
|---|---|
| Grocery Store | 339.8285 |
| Supermarket Type1 | 2316.181148 |
| Supermarket Type2 | 1995.498739 |
| Supermarket Type3 | 3694.038558 |


- This shows a significant difference between them, so we’ll leave them as they are.

__2. Modify Item_Visibility__ 

We noticed that the minimum value here is 0, which makes no practical sense. Let's treat it as missing information and impute it. The code below uses the overall mean visibility; the mean visibility of that particular product would be a finer-grained choice.

In [13]:
# Assign the result back instead of using inplace=True on a selected column
data['Item_Visibility'] = data['Item_Visibility'].replace(0, data['Item_Visibility'].mean())

In [14]:
data['Item_Visibility'].min()

0.003574698
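
Imputing zeros with the mean visibility of the same product could be sketched as follows. This is an illustration on a made-up `toy` frame, not the notebook's actual step, and it assumes that the zeros should be excluded when computing each product's mean.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined data; the values are made up for illustration
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "FDA15", "DRC01", "DRC01"],
    "Item_Visibility": [0.02, 0.0, 0.04, 0.0, 0.10],
})

# Treat zeros as missing so they do not pull the per-product mean down
visibility = toy["Item_Visibility"].replace(0, np.nan)

# Mean of the non-zero visibilities of each product, aligned to the rows
product_mean = visibility.groupby(toy["Item_Identifier"]).transform("mean")

# Keep observed values and fill only the former zeros
toy["Item_Visibility"] = visibility.fillna(product_mean)

print(toy)
```

A product whose visibility is zero in every row would still be missing after this step and would need a fallback such as the overall mean.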

__3. Create a broad category of Type of Item__


In [15]:
data[['Item_Identifier', 'Item_Type']].head(10)

| | Item_Identifier | Item_Type |
|---|---|---|
| 0 | FDA15 | Dairy |
| 1 | DRC01 | Soft Drinks |
| 2 | FDN15 | Meat |
| 3 | FDX07 | Fruits and Vegetables |
| 4 | NCD19 | Household |
| 5 | FDP36 | Baking Goods |
| 6 | FDO10 | Snack Foods |
| 7 | FDP10 | Snack Foods |
| 8 | FDH17 | Frozen Foods |
| 9 | FDU28 | Frozen Foods |


- The __Item_Type__ variable has 16 categories, and several of them have few observations, so combining them into broader groups might prove useful in the analysis.
If you look at Item_Identifier, i.e. the unique ID of each item, it starts with either FD, DR or NC. Judging by the item types, these appear to stand for Food, Drinks and Non-Consumables.

In [16]:
# Get the first two characters of ID 
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])

# Rename them to more intuitive categories 
data['Item_Type_Combined'] = data['Item_Type_Combined'].map({ 'FD':'Food',
                                                               'NC':'Non-Consumable',
                                                                'DR':'Drinks' })
 
data['Item_Type_Combined'].value_counts()    

Food              10201
Non-Consumable     2686
Drinks             1317
Name: Item_Type_Combined, dtype: int64

__4. Determine the years of operation of a store__

- Let's make a new column depicting the years of operation of a store.

In [17]:
data['Outlet_Years'] = 2013 - data['Outlet_Establishment_Year']
data['Outlet_Years'].head()

0    14
1     4
2    14
3    15
4    26
Name: Outlet_Years, dtype: int64

__5. Modify categories of Item_Fat_Content__ 

- We found typos and inconsistent representations among the categories of the Item_Fat_Content variable:


'Low Fat', 'Regular', 'LF', 'reg', 'low fat'

In [18]:
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'LF': 'Low Fat',
                                                            'reg': 'Regular',
                                                            'low fat': 'Low Fat'})
data['Item_Fat_Content'].value_counts()


Low Fat    9185
Regular    5019
Name: Item_Fat_Content, dtype: int64

- This makes more sense. But in step 3 we saw that there are some non-consumables, and a fat content should not be specified for them.
- Let's create a separate category for them.

In [19]:
data.loc[data['Item_Type_Combined']=='Non-Consumable', 'Item_Fat_Content'] = 'Non-Edible'
data['Item_Fat_Content'].value_counts()

Low Fat       6499
Regular       5019
Non-Edible    2686
Name: Item_Fat_Content, dtype: int64

__6. Numerical and One-Hot Coding of Categorical Variables__


- Since scikit-learn accepts only numerical variables, we will convert all categories of nominal variables into numeric types.
- Also, I wanted Outlet_Identifier as a variable as well. So I created a new variable ‘Outlet’, the same as Outlet_Identifier, and encoded that. Outlet_Identifier should remain as it is, because it will be required in the submission file.

In [20]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])

In [21]:
var_mod = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Item_Type_Combined','Outlet_Type','Outlet']
le = LabelEncoder()
for i in var_mod:
    data[i] = le.fit_transform(data[i])

__One Hot Coding__

It refers to creating dummy variables, one for each category of a categorical variable. 
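
As a minimal illustration of that idea, the sketch below one-hot encodes a single made-up column with `pd.get_dummies`. The `toy` frame is hypothetical. Because it keeps the string labels, the dummy columns are named after the categories, whereas in this notebook the columns are label-encoded first, so the dummy names end in `_0`, `_1`, and so on.

```python
import pandas as pd

# Toy frame with one categorical column; the values are made up for illustration
toy = pd.DataFrame({"Outlet_Size": ["Small", "Medium", "High", "Small"]})

# One indicator column per category; the original column is replaced
dummies = pd.get_dummies(toy, columns=["Outlet_Size"])

print(dummies.columns.tolist())
print(dummies)
```

Each row has exactly one indicator set, which is what lets a linear model give every category its own coefficient instead of treating the integer codes as an ordered quantity.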

In [22]:
data = pd.get_dummies(data, columns=var_mod)

In [23]:
data.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Item_Outlet_Sales            float64
source                        object
Outlet_Years                   int64
Item_Fat_Content_0             uint8
Item_Fat_Content_1             uint8
Item_Fat_Content_2             uint8
Outlet_Location_Type_0         uint8
Outlet_Location_Type_1         uint8
Outlet_Location_Type_2         uint8
Outlet_Size_0                  uint8
Outlet_Size_1                  uint8
Outlet_Size_2                  uint8
Item_Type_Combined_0           uint8
Item_Type_Combined_1           uint8
Item_Type_Combined_2           uint8
Outlet_Type_0                  uint8
Outlet_Type_1                  uint8
Outlet_Type_2                  uint8
Outlet_Type_3                  uint8
Outlet_0                       uint8
...

__Exporting Data__
- Convert the data back into train and test data sets.

In [24]:
# Drop the columns which have been converted to different types
data = data.drop(['Item_Type', 'Outlet_Establishment_Year'], axis=1)

# Divide into train and test datasets; .copy() gives independent frames
# and avoids pandas' SettingWithCopyWarning on the drops below
train = data.loc[data['source'] == 'train'].copy()
test = data.loc[data['source'] == 'test'].copy()

# Drop unnecessary columns
test = test.drop(['source', 'Item_Outlet_Sales'], axis=1)
train = train.drop(['source'], axis=1)

# Drop the ID columns, which are not used as model features
train = train.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1)
test = test.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1)

# Export the files as modified versions
train.to_csv("train_modified.csv", index=False)
test.to_csv("test_modified.csv", index=False)

# Model Building 

## Linear Regression 

In [25]:
dtrain = pd.read_csv('train_modified.csv')
dtest = pd.read_csv('test_modified.csv')


In [27]:
print(dtrain.shape, dtest.shape)

(8523, 31) (5681, 30)


In [81]:
# First, let's make predictions using only the train dataset.
# We split it into two parts: train and validation.

X = dtrain.drop('Item_Outlet_Sales', axis=1)
y = dtrain.Item_Outlet_Sales

from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(X, y, test_size = 0.3)

In [85]:
# Fit the model and predict on the validation set

from sklearn import linear_model

model = linear_model.LinearRegression()

model.fit(x_train, y_train)

pred_val = model.predict(x_val)

model.score(x_val, y_val)  # coefficient of determination (R^2) on the validation set

0.5794373737992737
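
`LinearRegression.score` returns R², which is unitless. To express the validation error in the units of the target, the root mean squared error (RMSE) can be computed alongside it. The sketch below uses synthetic data rather than the BigMart features, and fixes `random_state` so the split is reproducible; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# A fixed random_state makes the split, and so the scores, repeatable
x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(x_train, y_train)
pred_val = model.predict(x_val)

r2 = model.score(x_val, y_val)                        # unitless, at most 1
rmse = np.sqrt(mean_squared_error(y_val, pred_val))   # same units as y

print(f"R^2: {r2:.3f}, RMSE: {rmse:.3f}")
```

Without a fixed `random_state`, each run of `train_test_split` produces a different split, so the validation score changes from run to run.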

In [86]:
# Let's make prediction for the test dataset

pred_test= model.predict(dtest)

In [91]:
# Let's load the sample submission file and fill in our results

submission = pd.read_csv('sample_submission.csv')

# We need the 'Item_Identifier', 'Outlet_Identifier' and 'Item_Outlet_Sales' columns 

submission['Item_Outlet_Sales'] = pred_test 
submission['Item_Identifier'] = test_original['Item_Identifier']
submission['Outlet_Identifier'] = test_original['Outlet_Identifier']

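# A linear model can predict negative sales; take the absolute value so no negative sales are submitted
submission['Item_Outlet_Sales'] = abs(submission['Item_Outlet_Sales'])

# Finally, write the submission to a CSV file to submit it and check the score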

pd.DataFrame(submission, columns=['Item_Identifier','Outlet_Identifier','Item_Outlet_Sales']).to_csv('submission_1.csv', index=False)
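
Taking the absolute value turns a negative prediction into a positive one of the same size. A common alternative, shown here only as a sketch on made-up numbers, is to clip negative predictions to zero with `np.clip`. Which treatment scores better on this data is not something the sketch establishes.

```python
import numpy as np

# Made-up predictions, including one negative value
pred = np.array([-120.5, 35.0, 2100.75])

# Clip: negative predictions become 0, all others are unchanged
clipped = np.clip(pred, 0, None)

# Absolute value: the approach used in the notebook above
flipped = np.abs(pred)

print(clipped)
print(flipped)
```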