# BigMart Sales Prediction
Data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities.
Certain attributes of each product and store have also been defined.
The aim of this project is to build a **predictive model that estimates the sales of each product at a particular store.**

# Imports

In [125]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns

# Get Data and Info

In [126]:
data = pd.read_csv('C:\\Users\\kruth\\OneDrive\\Desktop\\DS Lab Case study datasets\\BigMartSales\\Train.csv')
data.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [127]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [128]:
print("Target Variable: Item_Outlet_Sales")
data.iloc[:,-1]

Target Variable: Item_Outlet_Sales


0       3735.1380
1        443.4228
2       2097.2700
3        732.3800
4        994.7052
          ...    
8518    2778.3834
8519     549.2850
8520    1193.1136
8521    1845.5976
8522     765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64

## Observations
From the above information, we can note that:
1. There are 11 features and 1 target variable (Item_Outlet_Sales).
2. This is a supervised learning problem, specifically regression, because the target is continuous.
3. There are 7 categorical features and 4 numerical features.
4. The numerical features are on different scales, so the data is not scaled.
5. There are missing values in Item_Weight (7060 of 8523 non-null) and Outlet_Size (6113 of 8523 non-null).
6. There might be some outliers in the data.

# Data Preprocessing

## Identify Categorical and Numerical Features

In [129]:
cat_cols = data.select_dtypes(include='object').columns
num_cols = data.select_dtypes(exclude='object').columns
print("Categorical Columns: \n",cat_cols)
print("Numerical Columns: \n",num_cols)

Categorical Columns: 
 Index(['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')
Numerical Columns: 
 Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year', 'Item_Outlet_Sales'],
      dtype='object')


In [130]:
# separate target variable
y = data.iloc[:,-1]

In [131]:
all_features = data.columns

In [132]:
for col in cat_cols:
    print(data[col].value_counts(),"\n") 

Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: count, Length: 1559, dtype: int64 

Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: count, dtype: int64 

Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: count, dtype: int64 

Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OUT010    555
OUT019    528
Name: cou

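Observations for categorical features:
1. Item_Identifier - alphanumeric product codes (e.g. FDA15) with 1559 unique values; the numeric part is stored inside a string
2. Item_Fat_Content - 5 labels but only 2 real categories ("LF" and "low fat" are spellings of "Low Fat", "reg" of "Regular"); treated as Ordinal
3. Item_Type - 16 categories, Nominal
4. Outlet_Identifier - 10 categories, Nominal
5. Outlet_Size - 3 categories, Ordinal
6. Outlet_Location_Type - 3 categories, Ordinal
7. Outlet_Type - 4 categories, treated as Ordinal here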
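
The duplicate spellings in Item_Fat_Content can be merged before encoding, so that each real category gets a single code. The following is a minimal sketch on a toy column; the name `fat_map` and the choice of "Low Fat" and "Regular" as the canonical labels are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy column with the five spellings seen in the value_counts() output.
fat = pd.Series(
    ["Low Fat", "Regular", "LF", "reg", "low fat"],
    name="Item_Fat_Content",
)

# Hypothetical mapping: collapse variant spellings onto two canonical labels.
fat_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
fat_clean = fat.replace(fat_map)

print(fat_clean.value_counts())
```

On the real data the same `replace` call would be applied to `data["Item_Fat_Content"]` before the ordinal encoding step.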


In [133]:
data[num_cols].head()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| 0 | 9.3 | 0.016047 | 249.8092 | 1999 | 3735.138 |
| 1 | 5.92 | 0.019278 | 48.2692 | 2009 | 443.4228 |
| 2 | 17.5 | 0.01676 | 141.618 | 1999 | 2097.27 |
| 3 | 19.2 | 0.0 | 182.095 | 1998 | 732.38 |
| 4 | 8.93 | 0.0 | 53.8614 | 1987 | 994.7052 |


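These numerical features are on very different scales (Item_Visibility stays below 1 while Item_MRP ranges from tens to hundreds), so they need to be scaled. Item_Outlet_Sales is the target, not a feature.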
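
As a rough illustration of what scaling does, the sketch below standardizes the first five Item_Visibility and Item_MRP values from the table above with `StandardScaler`. The toy array is only for illustration; after the transform each column has mean 0 and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five rows of Item_Visibility and Item_MRP, copied from the table above.
X = np.array([
    [0.016047, 249.8092],
    [0.019278, 48.2692],
    [0.01676, 141.618],
    [0.0, 182.095],
    [0.0, 53.8614],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column is now centred on 0 with (population) standard deviation 1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```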

## Duplicates

In [134]:
data.duplicated().sum()

0

No duplicates found

## Missing Values

In [135]:
data.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

Item_Weight is numerical and Outlet_Size is categorical. Both have missing values: 1463 and 2410 respectively.
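
Because the two columns have different types, one option is to impute them with different strategies: the median for the numerical column and the most frequent label for the categorical one. This is a sketch on a toy frame, not the approach the notebook takes later (it imputes everything with the median after encoding); the names `toy`, `num_imputer` and `cat_imputer` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column.
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, np.nan, 19.2, 8.93],
    "Outlet_Size": pd.Series(["Medium", "Medium", "Medium", np.nan, "High"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")         # numerical column
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical column

toy["Item_Weight"] = num_imputer.fit_transform(toy[["Item_Weight"]]).ravel()
toy["Outlet_Size"] = cat_imputer.fit_transform(toy[["Outlet_Size"]]).ravel()

print(toy)
```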

## Transform all features

### Cat to Float 

In [136]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer # missing values
from sklearn.base import TransformerMixin, BaseEstimator
# define a transformer for Item_Identifier - numerical values exist but as strings
class CatToFloat(TransformerMixin, BaseEstimator):
    """Extract the digits from each string and return them as a float column."""

    # X holds strings with embedded digits, e.g. "FDA15";
    # a list, a 1D array/Series, or a single-column 2D input is accepted
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        transformed = []
        # flatten so that a single-column DataFrame works as well as a Series
        for instance in np.asarray(X, dtype=object).ravel():
            # keep only the digit characters and join them, e.g. "CAT0203" -> "0203"
            numeric_part = "".join(ch for ch in str(instance) if ch.isdigit())

            if numeric_part:  # digits were found
                transformed.append(float(numeric_part))
            else:  # no digits, e.g. a missing value
                transformed.append(np.nan)

        # return a 2D column vector, as scikit-learn expects
        return np.array(transformed).reshape(-1, 1)
    
# test the working of transformer

trans = CatToFloat()
x = ["CAT09", "CAT0203", "CA099", "088C"]
trans.fit(x)
transformed = trans.transform(x)
print(transformed)


[[  9.]
 [203.]
 [ 99.]
 [ 88.]]


This works!
Now apply it to Item_Identifier, then define the column transformer.

In [138]:
# cat to float
cat_to_float = CatToFloat()
cat_to_float.fit(data["Item_Identifier"])
res = cat_to_float.transform(data["Item_Identifier"])
print(res)

[[15.]
 [ 1.]
 [15.]
 ...
 [29.]
 [46.]
 [ 1.]]
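
Keeping only the digits means that different products can end up with the same number: FDA15 and FDN15 (rows 0 and 2) both become 15. One possible alternative, sketched below on the first five identifiers, is to keep the two-letter prefix as its own categorical feature next to the number. The names `item_category` and `item_number` are hypothetical, and `str.extract` takes the first run of digits rather than joining every digit as CatToFloat does (the two agree for identifiers of this shape).

```python
import pandas as pd

# First five identifiers from the data preview.
ids = pd.Series(["FDA15", "DRC01", "FDN15", "FDX07", "NCD19"], name="Item_Identifier")

# Hypothetical derived features (not part of the original notebook).
item_category = ids.str[:2]  # two-letter prefix, e.g. "FD"
item_number = ids.str.extract(r"(\d+)", expand=False).astype(float)  # first digit run

print(item_category.tolist())
print(item_number.tolist())
```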


In [137]:
'''from sklearn.compose import ColumnTransformer
col_trans = ColumnTransformer(
    transformers= [('nominal', OneHotEncoder(), ['Item_Type', 'Outlet_Identifier']),
                   ('ordinal', OrdinalEncoder(), ["Item_Fat_Content","Outlet_Size","Outlet_Location_Type","Outlet_Type"]),
                   ('cat_to_num', CatToFloat(), ["Item_Identifier"]),
                   ('scale', StandardScaler(), ["Item_Weight","Item_Visibility","Item_MRP","Outlet_Establishment_Year"])
    ],
    remainder="drop",  # drop the target so it is not passed through as a feature
    verbose_feature_names_out=True,
    n_jobs=-1
)'''
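
A `ColumnTransformer` applies each transformer to the original columns independently, so imputing, encoding and scaling have to be chained per column group with a `Pipeline`. The sketch below shows one way this could look on a toy frame. The column grouping, the pipeline names, and the category order `Small < Medium < High` are assumptions (the label "Small" does not appear in the preview above).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy frame with one numerical and one categorical missing value.
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, np.nan, 19.2],
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095],
    "Item_Type": ["Dairy", "Soft Drinks", "Meat", "Dairy"],
    "Outlet_Size": pd.Series(["Medium", "Medium", np.nan, "High"], dtype=object),
})

# Numerical columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Ordinal column: impute with the most frequent label, then encode in an explicit order.
ordinal = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(categories=[["Small", "Medium", "High"]])),
])

col_trans = ColumnTransformer(
    transformers=[
        ("numeric", numeric, ["Item_Weight", "Item_MRP"]),
        ("ordinal", ordinal, ["Outlet_Size"]),
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["Item_Type"]),
    ]
)

X = col_trans.fit_transform(toy)
print(X.shape)  # 2 numeric + 1 ordinal + 3 one-hot columns
```

Fitting such a transformer on the training split only, and reusing it on the test split, keeps the imputation and scaling statistics from seeing test data.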

In [139]:
# integrate
data["Item_Identifier"] = res
data.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | 1.0 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | 15.0 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | 7.0 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | 19.0 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


### Ordinal Encoding

In [140]:
# ordinal values
ordinal = OrdinalEncoder()
ordinal.fit(data[["Item_Fat_Content","Outlet_Size","Outlet_Location_Type","Outlet_Type"]])
res = ordinal.transform(data[["Item_Fat_Content","Outlet_Size","Outlet_Location_Type","Outlet_Type"]])
print(res)

[[1. 1. 0. 1.]
 [2. 1. 2. 2.]
 [1. 1. 0. 1.]
 ...
 [1. 2. 1. 1.]
 [2. 1. 2. 2.]
 [1. 2. 0. 1.]]


In [141]:
# integrate
data[["Item_Fat_Content","Outlet_Size","Outlet_Location_Type","Outlet_Type"]] = res
data.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | 9.3 | 1.0 | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | 1.0 | 0.0 | 1.0 | 3735.138 |
| 1 | 1.0 | 5.92 | 2.0 | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | 1.0 | 2.0 | 2.0 | 443.4228 |
| 2 | 15.0 | 17.5 | 1.0 | 0.01676 | Meat | 141.618 | OUT049 | 1999 | 1.0 | 0.0 | 1.0 | 2097.27 |
| 3 | 7.0 | 19.2 | 2.0 | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | 2.0 | 0.0 | 732.38 |
| 4 | 19.0 | 8.93 | 1.0 | 0.0 | Household | 53.8614 | OUT013 | 1987 | 0.0 | 2.0 | 1.0 | 994.7052 |
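
By default `OrdinalEncoder` orders the labels of each column lexicographically, which is why High became 0.0 and Medium became 1.0 in Outlet_Size above. If the codes should follow the natural order of the sizes, the order can be passed through the `categories` parameter. This is a sketch on a toy column; the order `Small < Medium < High` is an assumption about the intended ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"Outlet_Size": ["Medium", "High", "Small", "Medium"]})

default_enc = OrdinalEncoder()  # lexicographic: High, Medium, Small
ordered_enc = OrdinalEncoder(categories=[["Small", "Medium", "High"]])  # explicit order

default_codes = default_enc.fit_transform(sizes).ravel()
ordered_codes = ordered_enc.fit_transform(sizes).ravel()

print(default_codes)  # High=0, Medium=1, Small=2
print(ordered_codes)  # Small=0, Medium=1, High=2
```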


### One Hot Encoding

In [142]:
# nominal values
nominal = OneHotEncoder()
nominal.fit(data[['Item_Type', 'Outlet_Identifier']])
res = nominal.transform(data[['Item_Type', 'Outlet_Identifier']])
print(res)

  (0, 4)	1.0
  (0, 25)	1.0
  (1, 14)	1.0
  (1, 19)	1.0
  (2, 10)	1.0
  (2, 25)	1.0
  (3, 6)	1.0
  (3, 16)	1.0
  (4, 9)	1.0
  (4, 17)	1.0
  (5, 0)	1.0
  (5, 19)	1.0
  (6, 13)	1.0
  (6, 17)	1.0
  (7, 13)	1.0
  (7, 21)	1.0
  (8, 5)	1.0
  (8, 23)	1.0
  (9, 5)	1.0
  (9, 18)	1.0
  (10, 6)	1.0
  (10, 25)	1.0
  (11, 4)	1.0
  (11, 24)	1.0
  (12, 6)	1.0
  :	:
  (8510, 22)	1.0
  (8511, 5)	1.0
  (8511, 19)	1.0
  (8512, 4)	1.0
  (8512, 17)	1.0
  (8513, 10)	1.0
  (8513, 22)	1.0
  (8514, 3)	1.0
  (8514, 23)	1.0
  (8515, 0)	1.0
  (8515, 19)	1.0
  (8516, 11)	1.0
  (8516, 19)	1.0
  (8517, 5)	1.0
  (8517, 24)	1.0
  (8518, 13)	1.0
  (8518, 17)	1.0
  (8519, 0)	1.0
  (8519, 23)	1.0
  (8520, 8)	1.0
  (8520, 22)	1.0
  (8521, 13)	1.0
  (8521, 19)	1.0
  (8522, 14)	1.0
  (8522, 24)	1.0


In [143]:
# integrate
# drop the nominal columns and add the transformed columns
data = data.drop(['Item_Type', 'Outlet_Identifier'], axis=1)
data = pd.concat([data, pd.DataFrame(res.toarray())], axis=1)
data.head()

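| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | 9.3 | 1.0 | 0.016047 | 249.8092 | 1999 | 1.0 | 0.0 | 1.0 | 3735.138 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 5.92 | 2.0 | 0.019278 | 48.2692 | 2009 | 1.0 | 2.0 | 2.0 | 443.4228 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 15.0 | 17.5 | 1.0 | 0.01676 | 141.618 | 1999 | 1.0 | 0.0 | 1.0 | 2097.27 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 7.0 | 19.2 | 2.0 | 0.0 | 182.095 | 1998 | NaN | 2.0 | 0.0 | 732.38 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 19.0 | 8.93 | 1.0 | 0.0 | 53.8614 | 1987 | 0.0 | 2.0 | 1.0 | 994.7052 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |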


In [144]:
# check for no of columns
data.shape

(8523, 36)

### Missing Values

In [145]:
# impute missing values
data.columns = data.columns.astype(str)
imputer = SimpleImputer(strategy="median")
imputer.fit(data)
data = imputer.transform(data)

### Split Data

In [146]:
# split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=42)

### Scale Data

In [147]:
# scale
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Model

In [148]:
# model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# predict
y_pred = model.predict(X_test)

# evaluate
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print("MSE: ",mse)
print("score: ",model.score(X_test, y_test)*100, "%")

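MSE:  2.4439200752211653e-23
score:  100.0 %

This result is too good to be true. The imputed data array still contains the Item_Outlet_Sales column, so the model was trained with the target among its features (target leakage) and simply learned to copy it. Item_Outlet_Sales has to be dropped from data before imputing, splitting, and scaling for the score to mean anything.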
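
Leaving the target column in the feature matrix produces exactly this kind of perfect score. The sketch below reproduces the effect on synthetic data and shows the score once the target is removed. The data, the coefficient and noise level, and the helper `holdout_r2` are all invented for illustration and say nothing about the score the real BigMart data would give.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for the dataset (illustrative values only).
toy = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, n),
    "Item_Visibility": rng.uniform(0, 0.3, n),
})
# Synthetic target: depends on Item_MRP plus noise.
toy["Item_Outlet_Sales"] = 15 * toy["Item_MRP"] + rng.normal(0, 800, n)

y = toy["Item_Outlet_Sales"]


def holdout_r2(X):
    """Fit a linear regression on 80% of the rows and return R^2 on the rest."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)


leaky_r2 = holdout_r2(toy)  # target still among the features
clean_r2 = holdout_r2(toy.drop(columns=["Item_Outlet_Sales"]))  # target removed

print(f"with target column:    R^2 = {leaky_r2:.4f}")
print(f"without target column: R^2 = {clean_r2:.4f}")
```

In the notebook, the equivalent change would be to drop `Item_Outlet_Sales` from `data` before the imputation cell, so that only feature columns reach `train_test_split`.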
