## Exploratory Data Analysis

## Big Mart Sales Data

### Introduction
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales. 

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

### Data Dictionary

Column Position | Attribute Name | Definition | Data Type | Example | Null Ratio (%)
 --- | --- | --- | --- | --- | ---
1 | Item_Identifier | It is a unique product ID assigned to every distinct item. It consists of an alphanumeric string of length 5 | Alphanumeric | FDN15 | 0
2 | Item_Weight | The weight of the product | Numeric (float) | 17.5 | 17.16531738
3 | Item_Fat_Content | This attribute is categorical and describes whether the product is low fat or not. It has 2 categories: ['Low Fat', 'Regular']. Note, however, that 'Low Fat' also appears as 'low fat' and 'LF' in the dataset, and 'Regular' also appears as 'reg' | Alpha | Low Fat | 0
4 | Item_Visibility | This field mentions the percentage of total display area of all products in a store allocated to the particular product | Numeric (float) | 0.01676 | 0
5 | Item_Type | This is a categorical attribute and describes the food category to which the item belongs. There are 16 different categories listed as follows: ['Dairy', 'Soft Drinks', 'Meat', 'Fruits and Vegetables', 'Household', 'Baking Goods', 'Snack Foods', 'Frozen Foods', 'Breakfast', 'Health and Hygiene', 'Hard Drinks', 'Canned', 'Breads', 'Starchy Foods', 'Others', 'Seafood'] | Alpha | Meat | 0
6 | Item_MRP | This is the Maximum Retail Price (list price) of the product | Numeric (float) | 141.618 | 0
7 | Outlet_Identifier | It is a unique store ID assigned. It consists of an alphanumeric string of length 6 | Alphanumeric | OUT049 | 0
8 | Outlet_Establishment_Year | This attribute mentions the year in which store was established | Numeric (Integer) | 1998 | 0
9 | Outlet_Size | The attribute tells the size of the store in terms of ground area covered. It is a categorical value and described in 3 categories: ['High', 'Medium', 'Small'] | Alpha | Medium | 28.27642849
10 | Outlet_Location_Type | This field has categorical data and tells about the size of the city in which the store is located through 3 categories: ['Tier 1', 'Tier 2', 'Tier 3'] | Alpha | Tier 3 | 0
11 | Outlet_Type | This field contains categorical value and tells whether the outlet is just a grocery store or some sort of supermarket. Following are the 4 categories in which the data is divided: ['Supermarket Type1', 'Supermarket Type2', 'Grocery Store','Supermarket Type3'] | Alpha | Supermarket Type2 | 0
12 | Item_Outlet_Sales | This is the outcome variable to be predicted. It contains the sales of the product in the particular store | Numeric (float) | 2097.27 | 0



------------------------

# Hypothesis Generation

**Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.**

So the idea is to find out which properties of a product and of a store affect the sales of a product. Let’s think about some of the analyses that can be done and come up with some hypotheses.

**The Hypotheses**  
I came up with the following hypotheses while thinking about the problem. These are just my thoughts, and you can come up with many more. Since we’re talking about stores and products, let’s make a separate set for each.

**Store Level Hypotheses:**  

`City type:` Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there.  

`Population Density:` Stores located in densely populated areas should have higher sales because of more demand.  

`Store Capacity:` Stores which are very big in size should have higher sales as they act like one-stop shops and people would prefer getting everything from one place.  

`Competitors:` Stores having similar establishments nearby should have lower sales because of more competition.  

`Marketing:` Stores which have a good marketing division should have higher sales as it will be able to attract customers through the right offers and advertising. 

`Location:` Stores located within popular marketplaces should have higher sales because of better access to customers.  

`Customer Behavior:` Stores keeping the right set of products to meet the local needs of customers will have higher sales.  

`Ambiance:` Stores which are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales.  

**Product Level Hypotheses:**  

`Brand:` Branded products should have higher sales because customers trust them more.  

`Packaging:` Products with good packaging can attract customers and sell more.  

`Utility:` Daily use products should have a higher tendency to sell as compared to the specific use products.   

`Display Area:` Products which are given bigger shelves in the store are likely to catch attention first and sell more. 

`Visibility in Store:` The location of product in a store will impact sales. Ones which are right at entrance will catch the eye of customer first rather than the ones in back. 

`Advertising:` Better advertising of products in the store should lead to higher sales in most cases.

`Promotional Offers:` Products accompanied with attractive offers and discounts will sell more.  


# OR

### Following are some of the hypotheses based on the problem statement.

1. Sales are higher during weekends.
2. Higher sales during morning and late evening.
3. Higher sales during end of the year.
4. Store size affects the sales.
5. Location of the store affects the sales.
6. Items with more shelf space sell more.

You can come up with more hypotheses of your own, the more the better. Let’s begin exploring the dataset and try to find interesting patterns.

We’ll first load the required packages.



# Data Exploration

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Read the data set

In [2]:
df = pd.read_csv('Train_UWu5bXk.csv')

In [3]:
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


In [4]:
df.tail(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
8521,FDN46,7.21,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976
8522,DRG01,14.8,Low Fat,0.044878,Soft Drinks,75.467,OUT046,1997,Small,Tier 1,Supermarket Type1,765.67


In [5]:
df.shape

(8523, 12)

In [6]:
df.shape[0] # Number of Rows

8523

In [7]:
df.shape[1] # Number of Columns

12

In [8]:
df.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales'],
      dtype='object')

In [9]:
df.describe().columns # describe() covers numeric columns by default, so these are the numerical ones

Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year', 'Item_Outlet_Sales'],
      dtype='object')

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [10]:
df.describe(include='O').columns # 'O' selects object (string) columns, i.e. the categorical ones

Index(['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')

In [11]:
df.select_dtypes(include='float')[1:3]

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Item_Outlet_Sales
1,5.92,0.019278,48.2692,443.4228
2,17.5,0.01676,141.618,2097.27


In [12]:
df.select_dtypes(include='O')[1:3]

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
1,DRC01,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2
2,FDN15,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1


**One of the key challenges in any data set is missing values. Let's start by checking which columns contain missing values.**

In [13]:
#df.apply(lambda x: sum(x.isnull()))

In [14]:
df.isnull().sum()               

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

We will impute the missing values in `Item_Weight` and `Outlet_Size` in the data cleaning section.

**Let's look at some basic statistics for numerical variables.**

In [None]:
df.describe()

**Some observations:**    

- Item_Visibility has a min value of zero. This makes no practical sense because when a product is being sold in a store, its visibility cannot be 0.  
- Outlet_Establishment_Year varies from 1985 to 2009. The values might not be useful in this form. If we convert them to the age of the store instead, the feature is likely to relate to sales more directly.  
- The lower 'count' of Item_Weight (7060 against 8523 rows) confirms the findings from the missing value check. Outlet_Size, the other column with missing values, is categorical and so does not appear in this summary.  
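
The first two observations suggest two small transformations. The following is a minimal sketch on a made-up stand-in frame, not the actual data: `sample`, `REFERENCE_YEAR`, and `Outlet_Age` are illustrative names, and measuring age from 2013 is an assumption based on the introduction stating that the sales data is from 2013.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with the same column names as the Big Mart data;
# the values are made up for illustration only.
sample = pd.DataFrame({
    "Item_Visibility": [0.016, 0.0, 0.044, 0.0],
    "Outlet_Establishment_Year": [1999, 2009, 1985, 1997],
})

# Assumption: sales were recorded in 2013, so store age is measured from that year.
REFERENCE_YEAR = 2013
sample["Outlet_Age"] = REFERENCE_YEAR - sample["Outlet_Establishment_Year"]

# Treat a visibility of exactly zero as "not recorded" rather than a real value,
# so that it can be imputed later like any other missing value.
sample["Item_Visibility"] = sample["Item_Visibility"].replace(0, np.nan)

print(sample)
```

Marking the zeros as missing, rather than overwriting them immediately, keeps the choice of imputation strategy open.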

Statistics for categorical variables

In [None]:
df.describe(include='O')

In [None]:
#OR
#df.apply(lambda x: len(x.unique()))

We can see that, Item_Type has 16 unique values.   
Let’s explore further using the frequency of different categories in each nominal variable. I will exclude the ID variables for obvious reasons.

In [None]:
#Filter categorical variables
categorical_columns = [x for x in df.dtypes.index if df.dtypes[x]=='object']

#Exclude ID columns:
categorical_columns = [x for x in categorical_columns if x not in ['Item_Identifier','Outlet_Identifier']]

#Print frequency of categories
for col in categorical_columns:
    print(f'\nFrequency of Categories for variable {col}')
    print(df[col].value_counts())
    print("******************************************************")

**The output gives us following observation:**

- Item_Fat_Content: Some 'Low Fat' values are mis-coded as 'low fat' and 'LF'. Also, some 'Regular' values are recorded as 'reg'.


# Data Cleaning
This step typically involves imputing missing values and treating outliers. Outlier treatment matters a great deal for regression techniques, whereas tree-based algorithms are generally much less sensitive to outliers. So I'll leave it to you to try it out. We'll focus on the imputation step here, which is a very important step.
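
If you do want to try outlier treatment, one common starting point is the interquartile-range (IQR) rule. The sketch below is only one possible approach and is not part of the original analysis; `iqr_outlier_mask` is a hypothetical helper, the multiplier `k = 1.5` is the conventional default rather than a tuned value, and the sales figures are made up.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up sales figures with one obviously extreme value.
sales = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 5000.0],
                  name="Item_Outlet_Sales")

mask = iqr_outlier_mask(sales)
print(sales[mask])  # rows flagged as potential outliers
```

Flagged rows can then be inspected, capped, or dropped, depending on whether they look like data errors or genuine extreme sales.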



### Imputing Missing Values
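We found two variables with missing values: Item_Weight and Outlet_Size. A natural choice for Item_Weight is the average weight of the particular item. The cells below use a simpler baseline: numeric columns are filled with the column mean and categorical columns with the column mode.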

In [None]:
# Check the percentage of missing values in each variable

In [None]:
def null_value_percent(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (total/df.shape[0]*100)
    return pd.concat([total,percent],axis=1,keys=['total','percent'])

In [None]:
null_value_percent(df)

In [None]:
for i in df.describe().columns:
    df[i] = df[i].fillna(df[i].mean())
for j in df.describe(include='O').columns:
    df[j] = df[j].fillna(df[j].mode()[0])

In [None]:
# check  the null values after imputation

In [None]:
df.isnull().sum()

This confirms that the columns have no missing values now.
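
As an alternative to the column-wide mean used above, Item_Weight can be imputed by the average weight of the particular item. The following is a rough sketch on a made-up frame (`sample` and `item_mean` are illustrative names); it assumes the same product has the same weight in every store and falls back to the overall mean when an item has no recorded weight at all.

```python
import numpy as np
import pandas as pd

# Made-up rows: the same item appears in several stores, some weights missing.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "FDN15"],
    "Item_Weight": [9.3, np.nan, 5.92, np.nan, np.nan],
})

# Mean weight of each item, broadcast back to every row of that item.
item_mean = sample.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill from the item's own mean first, then fall back to the overall mean
# for items that have no recorded weight at all (FDN15 in this sample).
sample["Item_Weight"] = (
    sample["Item_Weight"].fillna(item_mean).fillna(sample["Item_Weight"].mean())
)

print(sample)
```

This has to run before any column-wide imputation, because once the missing weights are filled with the global mean there is nothing left for the per-item step to fill.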

# Data Visualization 

### Univariate Analysis

**numeric variables**  
Now let’s check the numeric variables. We'll use histograms because they help us visualize the distribution of individual variables.

In [None]:
#df.describe().columns

In [None]:
plt.hist(df.Item_Outlet_Sales)
#sns.histplot(df.Item_Outlet_Sales, kde=True)

As you can see, it is a right-skewed variable and would need some data transformation to treat its skewness. 

In [None]:
a = np.sqrt(df.Item_Outlet_Sales)

In [None]:
#sns.histplot(a, kde=True)
plt.hist(a)

In [None]:
plt.hist(df.Item_MRP)

In [None]:
plt.hist(df.Item_Weight)

In [None]:
plt.hist(df.Item_Visibility)

In [None]:
b = np.sqrt(df.Item_Visibility)

In [None]:
sns.histplot(b, kde=True)  # distplot is deprecated; histplot with kde=True is the replacement

As you can see, there is no clear pattern in Item_Weight and Item_MRP. However, Item_Visibility is right-skewed and should be transformed to curb its skewness.
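
Skewness can also be quantified rather than judged by eye. The sketch below uses synthetic right-skewed data as a stand-in for columns such as Item_Visibility or Item_Outlet_Sales (the exponential sample and the variable names are assumptions for illustration) and compares `Series.skew()` before and after a square-root and a `log1p` transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a skewed column.
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=1.0, size=5000))

# Sample skewness: 0 means symmetric, positive means a long right tail.
raw_skew = values.skew()
sqrt_skew = np.sqrt(values).skew()
log_skew = np.log1p(values).skew()  # log1p is safe for zeros

print(f"raw: {raw_skew:.2f}, sqrt: {sqrt_skew:.2f}, log1p: {log_skew:.2f}")
```

On the real data, the same three lines applied to `df.Item_Visibility` or `df.Item_Outlet_Sales` would show which transform brings the skewness closest to zero.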

In [None]:
sns.boxplot(x=df.Item_Outlet_Sales)

In [None]:
sns.pairplot(df[['Item_Weight', 'Item_Visibility', 'Item_MRP']])
plt.show()

**categorical variables**   
Now we will try to explore and gain some insights from the categorical variables. A categorical variable or feature can have only a finite set of values. Let’s first plot Item_Fat_Content.

In [None]:
df.Item_Fat_Content.value_counts().plot.bar()

In the figure above, 'LF', 'low fat', and 'Low Fat' are the same category and can be combined into one. Similarly, 'reg' and 'Regular' can be combined into one. After making these corrections we will plot the same figure again.

In [None]:
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF':'Low Fat',
                                                             'reg':'Regular',
                                                             'low fat':'Low Fat'})
df['Item_Fat_Content'].value_counts()

In [None]:
df.Item_Fat_Content.value_counts().plot.bar()

In [None]:
sns.countplot(x=df['Item_Fat_Content'])

Now let’s check the other categorical variables.

In [None]:
sns.countplot(x=df['Outlet_Size'])

In [None]:
#df.Outlet_Type.value_counts().plot.barh(color='orange')
plt.figure(figsize=(8,4))
sns.countplot(x=df['Outlet_Type'])

Supermarket Type 1 seems to be the most popular category of Outlet_Type.

In [None]:
plt.figure(figsize=(25,8))
sns.countplot(x=df['Item_Type'])
#df.Item_Type.value_counts().plot.bar()

### Bivariate Analysis
After looking at every feature individually, let’s now explore them again with respect to the target variable. Here we will use scatter plots for the continuous or numeric variables and strip, box, and bar plots for the categorical variables.

In [None]:
sns.jointplot(x=df['Item_Weight'], y=df['Item_Outlet_Sales'])

In [None]:
sns.jointplot(x=df['Item_Visibility'], y=df['Item_Outlet_Sales'])

In [None]:
plt.figure(figsize=(10,10))
plt.xlabel('MRP')
plt.ylabel('Sales')
plt.title('MRP vs Sales')
plt.scatter(df.Item_MRP,df.Item_Outlet_Sales)

Item_Outlet_Sales is spread well across the entire range of the Item_Weight without any obvious pattern. In the Item_Visibility vs Item_Outlet_Sales, there is a string of points at Item_Visibility = 0.0 which seems strange as item visibility cannot be completely zero. 

In the third plot of Item_MRP vs Item_Outlet_Sales, we can clearly see 4 segments of prices that can be used in feature engineering to create a new variable.
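
One way to turn those price segments into a feature is `pd.cut`. This is only a sketch: the cut points in `bin_edges` and the band `labels` are hypothetical placeholders, not values derived from the data, and should be replaced with the gaps you actually read off the scatter plot.

```python
import pandas as pd

# A few MRP values standing in for df['Item_MRP'].
mrp = pd.Series([48.27, 103.13, 141.62, 249.81], name="Item_MRP")

# Hypothetical cut points: replace them with the gaps visible in the scatter plot.
bin_edges = [0, 70, 136, 203, 300]
labels = ["low", "medium", "high", "very_high"]

# Each value is assigned to the half-open interval (left, right] it falls into.
mrp_band = pd.cut(mrp, bins=bin_edges, labels=labels)

print(pd.concat([mrp, mrp_band.rename("Item_MRP_Band")], axis=1))
```

The resulting categorical column can be used alongside, or instead of, the raw price in later modelling steps.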

Now we’ll visualize the categorical variables with respect to Item_Outlet_Sales. We will check the distribution of the target variable across the categories of each categorical variable.

In [None]:
#sns.boxplot(df.Item_MRP)
sns.stripplot(x=df['Item_Fat_Content'], y=df['Item_Outlet_Sales'])

In [None]:
plt.figure(figsize=(25,4))
sns.boxplot(x=df['Item_Type'], y=df['Item_Outlet_Sales'])

In [None]:
sns.barplot(x=df['Outlet_Size'], y=df['Item_Outlet_Sales'])

In [None]:
plt.figure(figsize=(8,4))
sns.barplot(x=df['Outlet_Type'], y=df['Item_Outlet_Sales'])

In the Outlet_Type, Grocery Store has most of its data points around the lower sales values as compared to the other categories. 

In [None]:
plt.figure(figsize=(10,4))
sns.barplot(x=df['Outlet_Identifier'], y=df['Item_Outlet_Sales'])

The distribution of Item_Outlet_Sales across the categories of Item_Type is not very distinct, and the same is true of Item_Fat_Content. However, the distributions for the OUT010 and OUT019 categories of Outlet_Identifier are quite similar to each other and very different from the rest of the categories of Outlet_Identifier.
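
A numeric summary can back up what the plots suggest about the outlets. The sketch below groups a made-up frame by Outlet_Identifier and reports the count, mean, and median of sales; the values are invented for illustration and `summary` is an illustrative name.

```python
import pandas as pd

# Made-up rows; on the real data, use df in place of sample.
sample = pd.DataFrame({
    "Outlet_Identifier": ["OUT010", "OUT010", "OUT019", "OUT049", "OUT049"],
    "Item_Outlet_Sales": [200.0, 400.0, 350.0, 2000.0, 3000.0],
})

# One row per outlet, ordered from lowest to highest average sales.
summary = (
    sample.groupby("Outlet_Identifier")["Item_Outlet_Sales"]
    .agg(["count", "mean", "median"])
    .sort_values("mean")
)

print(summary)
```

Outlets that behave alike in the plots should end up next to each other in this table, which makes it easier to decide whether to group them in feature engineering.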

These are the kinds of insights that we can extract by visualizing our data. Hence, data visualization should be an important part of any kind of data analysis.

In [None]:
pd.crosstab(df.Item_Fat_Content,df.Item_Type)