# The BigMart Sales Project

The goal of this project is to predict the sales for each product in BigMart. The data for 1559 products across 10 outlets in different cities has been collected.

Let's first import all the libraries we are going to need:

In [1]:
from pandas import read_csv, get_dummies
from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score


In [2]:
filename = 'Train.csv'
Dataset = read_csv(filename)

We can start by exploring the data and trying to make sense of it.
Different items from different outlets have been collected, and multiple features have been recorded for each.
The target for this data is the sales of the item in a particular outlet (Item_Outlet_Sales).

## Feature Engineering

The features and their types are:

In [3]:
Types = Dataset.dtypes
Values_Table = Dataset.count()

In [4]:
print(Types)

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object


In [5]:
print(Values_Table)

Item_Identifier              8523
Item_Weight                  7060
Item_Fat_Content             8523
Item_Visibility              8523
Item_Type                    8523
Item_MRP                     8523
Outlet_Identifier            8523
Outlet_Establishment_Year    8523
Outlet_Size                  6113
Outlet_Location_Type         8523
Outlet_Type                  8523
Item_Outlet_Sales            8523
dtype: int64


As we can see, Item_Weight and Outlet_Size have fewer than 8523 values, so some are missing. Let's confirm that:

In [6]:
Missing_Values_Table = Dataset.isnull().sum()
print(Missing_Values_Table)

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64


Before we move any further, let's take care of the missing values.

We have some missing values in Item_Weight. These are easy to fill, because products with the same
identifier must have the same weight, so each gap takes the mean weight recorded for that item:

In [10]:
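# Fill each missing weight with the mean weight of the same Item_Identifier.
# (Calling .mean() on the whole DataFrame fails in recent pandas versions
# because of the text columns, so only Item_Weight is averaged.)
Dataset['Item_Weight'] = Dataset['Item_Weight'].fillna(
        Dataset.groupby('Item_Identifier')['Item_Weight'].transform('mean'))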
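
To illustrate the idea on something small, here is a minimal sketch on a toy table. The values and the variable names (`toy`, `per_item_mean`, `filled`) are made up for the example, and the fallback to the overall mean for an item with no recorded weight at all is an assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): item "A" has one known weight, item "C" has none.
toy = pd.DataFrame({
    'Item_Identifier': ['A', 'A', 'B', 'B', 'C'],
    'Item_Weight': [9.3, np.nan, 5.0, 5.0, np.nan],
})

# Mean weight of each item, broadcast back to every row of that item.
per_item_mean = toy.groupby('Item_Identifier')['Item_Weight'].transform('mean')
filled = toy['Item_Weight'].fillna(per_item_mean)

# Assumption: an item with no recorded weight falls back to the overall mean.
filled = filled.fillna(toy['Item_Weight'].mean())
print(filled.tolist())
```

The second `fillna` only matters when every row of an item is missing its weight; otherwise the per-item mean already covers the gap.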

Now let's take care of Item_Fat_Content. The LF and low fat values need to be replaced by Low Fat,
and the reg values need to be replaced by Regular:

In [15]:
Item_Counting = Dataset['Item_Fat_Content'].value_counts(ascending=True)
print(Item_Counting)

low fat     112
reg         117
LF          316
Regular    2889
Low Fat    5089
Name: Item_Fat_Content, dtype: int64


In [16]:
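# Normalise the label variants, touching only the Item_Fat_Content column
Dataset['Item_Fat_Content'] = Dataset['Item_Fat_Content'].replace(
        {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})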

Now let's take care of the missing values in Outlet_Size.
For grocery stores the size will be small, and it looks like most of the missing sizes are small as well, so we fill every gap with Small:

In [18]:
Dataset.Outlet_Size = Dataset.Outlet_Size.fillna('Small')
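
One way to look at where the gaps sit before filling them is to tabulate outlet size against outlet type. This is only a sketch on a toy table with invented rows. The `'Missing'` label and the names `toy` and `size_table` are assumptions for the example.

```python
import pandas as pd

# Toy data (hypothetical rows): one grocery store and one supermarket lack a size.
toy = pd.DataFrame({
    'Outlet_Type': ['Grocery Store', 'Grocery Store', 'Supermarket Type1',
                    'Supermarket Type1', 'Supermarket Type1'],
    'Outlet_Size': ['Small', None, 'Small', 'Medium', None],
})

# Label the gaps so they appear as their own column in the table.
size_table = pd.crosstab(toy['Outlet_Type'], toy['Outlet_Size'].fillna('Missing'))
print(size_table)

# Fill the gaps with 'Small', as the notebook does.
toy['Outlet_Size'] = toy['Outlet_Size'].fillna('Small')
```

A table like this shows which outlet types the missing sizes belong to, which helps judge whether a single fill value is reasonable.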

Let's recheck for the missing values:

In [19]:
Missing_Values_Table = Dataset.isnull().sum()
print(Missing_Values_Table)

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64


The number of unique products we have:

In [20]:
Item_Id = Dataset['Item_Identifier'].value_counts(ascending=True)
No_Unique_Items = Item_Id.size
print(No_Unique_Items)

1559


As for Item_Identifier, I don't think it adds any value to the data, so I drop it:

In [21]:
Dataset = Dataset.drop(['Item_Identifier'], axis=1)

I think the Outlet_Identifier column can be dropped as well, as it's highly correlated with the outlet location type and outlet type:

In [22]:
Dataset = Dataset.drop(['Outlet_Identifier'], axis=1)
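
A check along these lines, run before the drop, could support that judgement: if every outlet identifier maps to exactly one outlet type and one location type, those two columns are fully determined by the identifier. The toy rows and the names `toy`, `distinct` and `is_determined` are invented for the sketch.

```python
import pandas as pd

# Toy data (hypothetical outlets): each outlet has a single type and location tier.
toy = pd.DataFrame({
    'Outlet_Identifier': ['OUT_A', 'OUT_A', 'OUT_B', 'OUT_B'],
    'Outlet_Type': ['Grocery Store', 'Grocery Store',
                    'Supermarket Type1', 'Supermarket Type1'],
    'Outlet_Location_Type': ['Tier 3', 'Tier 3', 'Tier 1', 'Tier 1'],
})

# Number of distinct values of each column within every outlet.
distinct = toy.groupby('Outlet_Identifier')[['Outlet_Type', 'Outlet_Location_Type']].nunique()

# True when each outlet has exactly one type and one location tier.
is_determined = bool((distinct == 1).all().all())
print(is_determined)
```

Note that the reverse does not have to hold: several outlets can share the same type and tier, so dropping the identifier may still discard some outlet-specific information.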

## The Encoding Part

In [23]:
encoder = LabelEncoder()

In [24]:
print(Dataset['Item_Fat_Content'].value_counts(ascending=True))

Regular    3006
Low Fat    5517
Name: Item_Fat_Content, dtype: int64


There are only two values, so we can use the LabelEncoder:

In [25]:
Dataset['Item_Fat_Content']=encoder.fit_transform(Dataset['Item_Fat_Content'])
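
To see which integer each label ends up with, the fitted encoder can be inspected. This is a small standalone sketch; `enc`, `codes` and `mapping` are names chosen for the example.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(['Low Fat', 'Regular', 'Low Fat'])

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
print(enc.inverse_transform(codes).tolist())
```

With two categories the order of the codes does not matter much, which is why a plain LabelEncoder is enough here.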

In [26]:
print(Dataset['Outlet_Size'].value_counts(ascending=True))

High       932
Medium    2793
Small     4798
Name: Outlet_Size, dtype: int64


For Outlet_Size an ordinal encoding is most suitable because the sizes are comparable.
LabelEncoder would assign the codes alphabetically, so the mapping is set explicitly: Small: 1, Medium: 2, High: 3

In [27]:
# Map the sizes to their ordinal codes, touching only the Outlet_Size column
Dataset['Outlet_Size'] = Dataset['Outlet_Size'].map({'Small': 1, 'Medium': 2, 'High': 3})
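
The difference between an alphabetical label encoding and an explicit ordinal mapping can be seen on a few values. This is a minimal sketch; `sizes`, `alphabetical`, `size_order` and `ordinal` are names chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(['Small', 'Medium', 'High', 'Small'])

# LabelEncoder sorts the labels, so High=0, Medium=1, Small=2: the size order is lost.
alphabetical = LabelEncoder().fit_transform(sizes)
print(alphabetical.tolist())

# An explicit mapping keeps Small < Medium < High.
size_order = {'Small': 1, 'Medium': 2, 'High': 3}
ordinal = sizes.map(size_order)
print(ordinal.tolist())
```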

Now we are going to apply one-hot encoding to the remaining columns:

In [28]:
# One call encodes all three columns; dtype=int keeps the dummy columns numeric
Dataset = get_dummies(Dataset,
        columns=['Item_Type', 'Outlet_Location_Type', 'Outlet_Type'], dtype=int)
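
As a quick illustration of what one-hot encoding produces, here is a sketch on a toy table with invented values. Passing `dtype=int` is a choice made for the example so that the new columns hold 0/1 integers.

```python
import pandas as pd

# Toy data (hypothetical values) with one categorical column.
toy = pd.DataFrame({
    'Item_MRP': [249.8, 48.3, 141.6],
    'Outlet_Location_Type': ['Tier 1', 'Tier 3', 'Tier 1'],
})

# Each category becomes its own 0/1 column named <column>_<category>.
encoded = pd.get_dummies(toy, columns=['Outlet_Location_Type'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The original categorical column is removed and the untouched columns stay in front, which is why the sales column ends up in the middle of the table after this step.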

Move the sales column to the end

In [30]:
cols = Dataset.columns.tolist()
# Move the target to the last position without hard-coding the column count
cols.append(cols.pop(cols.index('Item_Outlet_Sales')))
Dataset = Dataset.reindex(columns=cols)

## The Machine Learning Part

In [31]:
array = Dataset.values
X = array[:,0:-1]
Y = array[:,-1]

validation_size = 0.2
seed = 7

X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y,
        test_size=validation_size, random_state=seed)

In [32]:
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = 'neg_mean_squared_error'

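# Linear regression baseline: R^2 on the held-out validation set
reg = LinearRegression().fit(X_train, Y_train)
print(r2_score(Y_validation, reg.predict(X_validation)))

# Gradient boosting: this is R^2 on the training set,
# so it is not directly comparable to the validation score above
GBR = GradientBoostingRegressor(random_state = seed)
GBR.fit(X_train,Y_train)
print(GBR.score(X_train,Y_train))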

0.543416484472155
0.6406342445748947
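
The fold count and the imported cross-validation helpers are not used above. A sketch of how the two models could be compared on equal terms is given below. It runs on synthetic data from `make_regression` rather than on the BigMart table, and the `'r2'` scoring, the sample sizes and the names `models` and `results` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

seed = 7
num_folds = 10

# Synthetic stand-in for the real feature matrix and target.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=seed)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)

kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
models = {'LR': LinearRegression(),
          'GBR': GradientBoostingRegressor(random_state=seed)}

results = {}
for name, model in models.items():
    # Cross-validated R^2 on the training split only.
    cv_scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='r2')
    model.fit(X_train, y_train)
    results[name] = {
        'cv_r2': cv_scores.mean(),
        'train_r2': model.score(X_train, y_train),  # optimistic: data already seen
        'val_r2': model.score(X_val, y_val),        # held-out estimate
    }
    print(name, results[name])
```

Reporting the training, cross-validated and validation scores side by side makes it easier to tell a genuinely better model from one that is simply fitting the training data more closely.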
