# Big Mart Sales Predictions

In this notebook, I build a sales predictor based on BigMart outlet data collected by their data scientists. First, I try to understand every feature of the data (Exploratory Data Analysis, EDA). Then I move to the preprocessing step to prepare the data for the final step, model training.

#### Steps:
- Information about the data (with a reminder of the goal of this work)
- Exploratory Data Analysis (EDA)
- Data preprocessing
- Model training
- Conclusion


In [75]:
# Set up the notebook by importing the necessary libraries
import numpy as np
import pandas as pd
pd.plotting.register_matplotlib_converters() # Register pandas formatters and converters with matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import SelectKBest

## Information about the dataset
This notebook has two main goals: analyse the data and build a Machine Learning model to predict the sales of each item at a particular outlet. We have 12 columns in our dataset.

| Column | Description | Data Type |
|--------|-------------|-----------|
| Item_Identifier | Unique product ID | Alphanumeric |
| Item_Weight | Weight of product | Numeric |
| Item_Fat_Content | Whether the product is low fat or not | Alphanumeric |
| Item_Visibility | The percent of total display area of all products in a store allocated to a particular product | Numeric |
| Item_Type | Category of product | Alphanumeric |
| Item_MRP | Maximum Retail Price of product | Numeric |
| Outlet_Identifier | Unique store ID | Alphanumeric |
| Outlet_Establishment_Year | The year in which the store was created | Numeric |
| Outlet_Size | Size of the store in terms of ground area covered | Alphanumeric |
| Outlet_Location_Type | The type of city in which the store is located | Alphanumeric |
| Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket | Alphanumeric |
| Item_Outlet_Sales | Sales of the product in the particular store. This is the outcome variable to be predicted. | Numeric |

## Exploratory Data Analysis (EDA)
Let's make some hypotheses about the features of the data. Which elements can increase or decrease item sales for a supermarket?

- The usefulness of the product for the vast majority of customers. It is defined by the product category, the brand, and so on. Based on our dataset, we can use the "Item_Type" feature to estimate which kinds of products customers like.
- The price of the product can influence sales. To validate this hypothesis, we will check the correlation between "Item_MRP" and "Item_Outlet_Sales".
- The visibility of the product in the store: column "Item_Visibility".
- The place where the store is established: depending on the location of the store, the product's price may differ, so column "Outlet_Location_Type" is important.

These hypotheses are subjective, but with further exploration of the data we will accept or reject each of them.
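A hypothesis like the price one can be checked with a correlation coefficient. Here is a minimal sketch on synthetic data: the `toy` frame and its values are made up for illustration, only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: illustrative values only)
rng = np.random.default_rng(0)
price = rng.uniform(30, 270, size=500)
sales = 15 * price + rng.normal(0, 800, size=500)
toy = pd.DataFrame({"Item_MRP": price, "Item_Outlet_Sales": sales})

# Pearson correlation measures the linear relationship
pearson = toy.corr(method="pearson").loc["Item_MRP", "Item_Outlet_Sales"]
# Spearman rank correlation is less sensitive to skew and outliers
spearman = toy.corr(method="spearman").loc["Item_MRP", "Item_Outlet_Sales"]
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A value close to 0 would lead us to reject the hypothesis, while a value close to 1 or -1 would support it.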

### Data overview

Let's read the dataset and take a quick look at it.

In [4]:
# Read data
TRAIN_FILE_PATH = "../input/bigmart-sales-data/Train.csv"
data = pd.read_csv(TRAIN_FILE_PATH)
print("Shape of data: ", data.shape)

In [5]:
data.head()

In [6]:
data.describe()

As we can see, "Item_Outlet_Sales" is the target we must predict with our future model.

In [7]:
target = 'Item_Outlet_Sales'
features = [col for col in data.columns if col!=target]
print("We have", len(features), "features:", features)

In [9]:
# make two lists, one for categorical feature names and one for the other features
cat_features = list(data[features].select_dtypes(include='object').columns)
# a list comprehension keeps the original column order (a set difference does not)
num_features = [col for col in features if col not in cat_features]
print("Categorical features (", len(cat_features), "):", cat_features)
print("Numerical features (", len(num_features), "):", num_features)

In [67]:
# print the columns that contain null values and their share of null entries
def summarize_null_cols(X, cols):
    null_counts = X[cols].isnull().sum()
    null_columns = list(null_counts[null_counts > 0].index)
    print("Columns which contain null:", null_columns)
    for col in null_columns:
        percent = null_counts[col] / len(X) * 100
        print("For", col, "column we have", round(percent, 2), "% of null values")
all_cols = features + [target]
summarize_null_cols(data, all_cols)

The first observations we can make are:
- There are 4 numerical features, plus the target
- The "Item_Weight" and "Outlet_Size" columns have some null entries, so we must impute them in the preprocessing step


### Numerical features exploration

I plot a correlation heatmap to analyse how the numerical features interact with the target.

In [10]:
# plot correlation between numerical features and target
fig = plt.figure()
corr = data[num_features+[target]].corr()
# the built-in bool is a valid dtype in every NumPy version, unlike the np.bool alias
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(240,10,as_cmap=True),
            square=True)

In [11]:
plt.figure()
plt.scatter(data['Item_MRP'], data[target])
plt.xlabel("Item_MRP")
plt.ylabel("Item_Outlet_Sales")

Recall one of my assumptions above: item sales can depend on the price of the items. With this graph, we can notice the strong correlation between "Item_MRP" and "Item_Outlet_Sales".
<br>
But one thing seems odd to me about the correlation graph. I expected the correlation between "Item_Visibility" and "Item_Outlet_Sales" to be higher than what I get, because the more visible a product is, the more likely it is to be sold, and vice versa. Could it be due to the many 0.0 values in the "Item_Visibility" column? My (subjective) answer is yes: a display area of 0.0% for a product that is sold is inconceivable, so I guess these are missing values that were set to 0.0% by default.
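If the 0.0 values are indeed missing data, one possible treatment (not applied in this notebook) is to mark them as NaN and fill them with the mean visibility of the same item in other stores. A minimal sketch on a made-up frame, where the `toy` data and the choice of the per-item mean are assumptions:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the dataset
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "A", "B", "B"],
    "Item_Visibility": [0.02, 0.0, 0.04, 0.0, 0.10],
})

# Assumption: a visibility of exactly 0.0 means "not recorded"
vis = toy["Item_Visibility"].replace(0.0, np.nan)
# Mean visibility of each item over the rows where it is known
per_item_mean = vis.groupby(toy["Item_Identifier"]).transform("mean")
toy["Item_Visibility"] = vis.fillna(per_item_mean)
print(toy)
```

An item whose visibility is 0.0 in every store would stay NaN with this approach and would need a second fallback, such as the overall mean.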

In [55]:
plt.figure()
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(data[target], kde=True, color="blue")
plt.show()
print("Skew (is positive): ", data[target].skew())

Our target is positively skewed. A transformation such as the logarithm could reduce the skewness; I come back to this point in the conclusion.
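To illustrate the effect of a log transformation on a right-skewed variable, here is a small sketch on synthetic data. The log-normal sample is an assumption standing in for the real sales values.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (assumption: stands in for the sales column)
rng = np.random.default_rng(0)
toy_sales = pd.Series(rng.lognormal(mean=7, sigma=1, size=2000))

skew_before = toy_sales.skew()
# log1p computes log(1 + x), which stays defined when x is 0
skew_after = np.log1p(toy_sales).skew()
print("Skew before:", skew_before)
print("Skew after log1p:", skew_after)

# expm1 is the inverse of log1p, needed to bring predictions back to sales units
restored = np.expm1(np.log1p(toy_sales))
```

If a model is trained on the transformed target, its predictions must be mapped back with `np.expm1` before computing a score in the original units.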

### Categorical features exploration

In [12]:
for col in cat_features:
    unique_val = data[col].unique()
    print(col,'(', len(unique_val),'):',unique_val)

Among the categorical features, some are ordinal. An ordinal feature is a categorical feature whose values can be ordered. In our dataset we have:
- 'Item_Fat_Content'
- 'Outlet_Size'
- 'Outlet_Type'
- 'Outlet_Location_Type'

The following analysis is based on the median value. When a distribution is skewed, the median is a better measure of central tendency than the mean.
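A quick worked example with made-up numbers shows why: a single very large value pulls the mean far away from the typical value, while the median barely moves.

```python
import pandas as pd

# Made-up sales values with one extreme observation
toy_sales = pd.Series([100, 120, 130, 150, 5000])

mean_value = toy_sales.mean()      # (100 + 120 + 130 + 150 + 5000) / 5
median_value = toy_sales.median()  # middle value of the sorted series
print("Mean:", mean_value, "Median:", median_value)
```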

In [17]:
plt.figure()
data[['Outlet_Type', target]].groupby('Outlet_Type').median().plot(kind='bar')
plt.show()

In [19]:
plt.figure()
data[['Outlet_Location_Type', target]].groupby('Outlet_Location_Type').median().plot(kind='bar')
plt.show()

In [22]:
plt.figure()
data[['Item_Fat_Content', target]].groupby('Item_Fat_Content').median().plot.barh()
plt.show()

Here we can see that this feature needs preprocessing to make its values uniform: the same category appears under several spellings.

In [23]:
plt.figure()
data[['Outlet_Size', target]].groupby('Outlet_Size').median().plot(kind='bar')
plt.show()

In [25]:
# note: pivot_table aggregates with the mean by default
data.pivot_table(values='Item_Outlet_Sales', index='Item_Type').sort_values(by='Item_Outlet_Sales').plot(kind="barh")

In [53]:
for otype in data['Outlet_Type'].unique():
    print("\n",otype)
    print(data[data['Outlet_Type']==otype]['Outlet_Size'].value_counts(dropna=False))

We can notice that Supermarket Type1 is the only outlet type with 'High' as Outlet_Size, and that Supermarket Type2 and Type3 outlets are all of Medium size. This analysis will help in the preprocessing step to impute sensible values for the Outlet_Size column.
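The same information can be read from a single table with `pd.crosstab`. This is a sketch on a made-up frame (the `toy` rows are an assumption), with the missing sizes given an explicit "Missing" label so that they are counted too:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names and category labels come from the dataset
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Supermarket Type1",
                    "Supermarket Type1", "Supermarket Type2"],
    "Outlet_Size": ["Small", np.nan, "High", "Medium", "Medium"],
})

# Rows: outlet type, columns: outlet size, cells: number of rows
table = pd.crosstab(toy["Outlet_Type"], toy["Outlet_Size"].fillna("Missing"))
print(table)
```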

In [54]:
data[['Outlet_Identifier', target]].groupby('Outlet_Identifier').median().plot(kind='bar')
plt.show()

## Data Preprocessing

As we saw, our dataset contains null values in two columns: Item_Weight and Outlet_Size. The exploration gave us some ideas about how to deal with the NaN values in these columns.
<br>
For the 'Item_Weight' column, the weight of an item can be recovered from the other rows of the same item, except for 4 items whose weight is never recorded (as you can see below). For these 4 items we will use the overall mean of 'Item_Weight'.
<br>

In [63]:
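# mean weight per item, computed from the rows where the weight is known
items_weight_mean = data[['Item_Identifier', 'Item_Weight']].groupby('Item_Identifier').mean()
# items whose weight is missing in every row
print(items_weight_mean[items_weight_mean['Item_Weight'].isnull()])
# fall back to the overall mean weight for these items
items_weight_mean['Item_Weight'] = items_weight_mean['Item_Weight'].fillna(data['Item_Weight'].mean())
print(items_weight_mean[items_weight_mean['Item_Weight'].isnull()])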

In [68]:
def impute_item_weight(row):
    item_id = row['Item_Identifier']
    item_weight = row['Item_Weight']

    # keep the weight when it is already known
    if not pd.isnull(item_weight):
        return item_weight
    # otherwise look up the mean weight of this item (returns a scalar, not a Series)
    return items_weight_mean.loc[item_id, 'Item_Weight']

# impute item_weight
data['Item_Weight'] = data.apply(impute_item_weight, axis=1).astype(float)
summarize_null_cols(data, features)
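The same imputation can also be written without a row-wise `apply`, which is usually faster on large frames. This is a sketch of an alternative on a made-up frame, not the code used in this notebook; the `toy` values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows: item A has one known weight, item C has none
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B", "C"],
    "Item_Weight": [9.3, np.nan, 5.0, 5.0, np.nan],
})

# Mean weight of each item, broadcast back to every row of that item
per_item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")
# Overall mean, used for items whose weight is never recorded
overall_mean = toy["Item_Weight"].mean()

toy["Item_Weight"] = toy["Item_Weight"].fillna(per_item_mean).fillna(overall_mean)
print(toy)
```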

For the 'Outlet_Size' column, we previously found a relation between Outlet_Size and Outlet_Type. We will use it to impute Outlet_Size with the most frequent size for each outlet type.

In [69]:
most_outlet_size_by_type = data.pivot_table(values='Outlet_Size', columns='Outlet_Type', aggfunc=(lambda x: x.mode()[0]))

def impute_outlet_size(row):
    outlet_type = row['Outlet_Type']
    outlet_size = row['Outlet_Size']
    
    if not pd.isnull(outlet_size):
        return outlet_size
    # label-based lookup of the most frequent size for this outlet type
    return most_outlet_size_by_type.loc['Outlet_Size', outlet_type]

# impute outlet_size

data['Outlet_Size'] = data.apply(impute_outlet_size, axis=1)
summarize_null_cols(data, features)
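For comparison, here is a vectorized sketch of the same idea (most frequent size per outlet type) on a made-up frame. It is an alternative formulation under the assumption that every outlet type has at least one known size; the `toy` rows are invented.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names and category labels come from the dataset
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1",
                    "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", "Small", np.nan, "Medium", "Medium", "High", np.nan],
})

# Most frequent known size for each outlet type (mode ignores NaN by default)
mode_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])
# Fill each missing size with the mode of its outlet type
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(mode_by_type))
print(toy)
```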

Let's make the values of the "Item_Fat_Content" column uniform.

In [70]:
data.replace({'Item_Fat_Content': {'low fat':'Low Fat','LF':'Low Fat', 'reg':'Regular'}}, inplace=True)

Check whether we still have null values in our dataset.

Now, let's encode the categorical columns with LabelEncoder.
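A minimal sketch of such a check, shown on a small made-up frame (the `toy` data is an assumption; in the notebook the same lines would be applied to `data`):

```python
import pandas as pd

# Made-up frame standing in for the imputed dataset
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.0, 6.4],
    "Outlet_Size": ["Small", "Medium", "High"],
})

# Number of null entries per column, keeping only the columns that have some
null_counts = toy.isnull().sum()
remaining = null_counts[null_counts > 0]
print("Columns that still contain null values:", list(remaining.index))
```

An empty list means that the imputation covered every missing entry.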

In [72]:
for col in cat_features:
    encoder = LabelEncoder()
    data[col] = encoder.fit_transform(data[col])
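One thing to keep in mind: LabelEncoder assigns integer codes following the sorted order of the labels, which does not necessarily match the natural order of an ordinal feature. The sketch below compares it with an explicit mapping; the `size_order` dictionary is my assumption about the natural order of Outlet_Size and is not used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"Outlet_Size": ["Small", "High", "Medium", "Small"]})

# Assumed natural order of the sizes (hypothetical mapping)
size_order = {"Small": 0, "Medium": 1, "High": 2}
toy["Outlet_Size_Ordinal"] = toy["Outlet_Size"].map(size_order)

# LabelEncoder sorts the labels: High -> 0, Medium -> 1, Small -> 2
toy["Outlet_Size_Label"] = LabelEncoder().fit_transform(toy["Outlet_Size"])
print(toy)
```

Tree-based models such as random forests are fairly tolerant of arbitrary integer codes, so this matters more for linear models.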

In [73]:
data.head()

## Model Training and Parameter Tuning

I chose RandomForestRegressor as the model. Let's prepare the data. I will use GridSearchCV to find the best parameters for the model.

In [90]:
X_train, X_valid, y_train, y_valid = train_test_split(data[features], data[target], test_size=0.3, random_state=1)

Let's try a simple model on our data

In [100]:
model = RandomForestRegressor(n_estimators=1000, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_valid)
score = r2_score(y_valid, preds)
print('R2:', score)
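R2 has no unit, so it can help to also report an error in sales units, such as the RMSE. Here is a small sketch with made-up values (`y_true` and `y_pred` are assumptions, not outputs of the model above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# RMSE: square root of the mean squared error, in the same unit as the target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print("RMSE:", rmse)
print("R2:", r2)
```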

Now, we will try to improve model by tuning model parameters

In [104]:
model = RandomForestRegressor(random_state=0)

my_pipeline = Pipeline(steps=[
                              ('rfr', model)
                             ])

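param_grid = [
# 1.0 uses all features, which is what "auto" meant for RandomForestRegressor
# before that option was removed from scikit-learn
{'rfr__n_estimators': [100, 1000], 'rfr__max_features': [1.0, "log2"], 
 'rfr__max_depth': [None, 25]}
]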

grid_search_rfr = GridSearchCV(my_pipeline, param_grid, cv=4, scoring='r2', n_jobs=-1)
grid_search_rfr.fit(X_train, y_train)

In [105]:
print("Best parameters (CV score=%0.3f):" % grid_search_rfr.best_score_)
print(grid_search_rfr.best_params_)

In [95]:
model = RandomForestRegressor(n_estimators=1000, random_state=0, max_depth=None, max_features='log2')
model.fit(X_train, y_train)
preds = model.predict(X_valid)
score = r2_score(y_valid, preds)
print('R2:', score)

This is the end of the notebook. We started by making some assumptions, analyzed the data, preprocessed it, and finally trained the model and tuned its parameters. The score could be improved by using another model such as XGBRegressor, and also by doing some feature engineering, which matters just as much for a better score. In this work, I did not apply any transformation to the skewed target; this is a point I will try to improve. I have seen a few articles that deal with this, and I tried the logarithmic transformation, but it did not give good results, which is why I did not keep it in the final version of the notebook.
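For anyone who wants to retry the log transformation, scikit-learn's `TransformedTargetRegressor` applies the transformation before fitting and inverts it at prediction time, so scores stay on the original scale. This is a sketch on synthetic data, not the experiment I ran: the data, the small `n_estimators` and the choice of `log1p`/`expm1` are all assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with a right-skewed, strictly positive target
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = np.exp(3 * X[:, 0] + rng.normal(0, 0.3, size=400))

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

# The regressor is trained on log1p(y); predictions are mapped back with expm1
wrapped = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
wrapped.fit(X_tr, y_tr)
preds = wrapped.predict(X_va)
print("R2 on the original scale:", wrapped.score(X_va, y_va))
```

Whether this helps depends on the data and on the metric, so it is worth comparing against the untransformed model on the same validation split.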

If you have a suggestion or a question, please feel free to write a comment. Thank you :)