# Sales Predictive Modeling: Project

In [17]:
import numpy as np
import pandas as pd
import pandas.api.types as ptype
import matplotlib.pyplot as plt
import seaborn as sns

## Business Requirements

**Problem Statement:**

Big Mart is a retail store chain that sells a range of products across different cities. The company wants more accurate sales forecasts to support inventory management, marketing strategy, and operational decisions. The sales data describes the items, the stores (outlets), and the sales of each item at each outlet over a period.

### Goal:
The goal is to build a predictive model that can forecast the sales for each product at a particular store. This will help Big Mart make better stocking decisions, reduce wastage, and maximize profit margins.

### Business Implications:
Accurate sales predictions will allow Big Mart to:
1. **Optimize inventory management**: Keep high-demand products in stock and minimize excess inventory.
2. **Increase profit margins**: Reduce the costs associated with overstocking and understocking.
3. **Improve customer satisfaction**: Make products available when and where they are needed.
4. **Guide marketing and promotions**: Focus targeted promotions on the products that are likely to perform well.

## Data Collection

In [2]:
# collect data from bigmart sales
bigmart = pd.read_csv("bigmart_sales.csv")
bigmart.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [3]:
# inspect column dtypes and non-null counts
bigmart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [4]:
# summary statistics
bigmart.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 7060.0 | 8523.0 | 8523.0 | 8523.0 | 8523.0 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.643456 | 0.051598 | 62.275067 | 8.37176 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 8.77375 | 0.026989 | 93.8265 | 1987.0 | 834.2474 |
| 50% | 12.6 | 0.053931 | 143.0128 | 1999.0 | 1794.331 |
| 75% | 16.85 | 0.094585 | 185.6437 | 2004.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |


In [5]:
# summary statistics for categorical variables
bigmart.describe(include="object")

| | Item_Identifier | Item_Fat_Content | Item_Type | Outlet_Identifier | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|
| count | 8523 | 8523 | 8523 | 8523 | 6113 | 8523 | 8523 |
| unique | 1559 | 5 | 16 | 10 | 3 | 3 | 4 |
| top | FDW13 | Low Fat | Fruits and Vegetables | OUT027 | Medium | Tier 3 | Supermarket Type1 |
| freq | 10 | 5089 | 1232 | 935 | 2793 | 3350 | 5577 |


+ **Numerical Variables**: Item_Weight, Item_Visibility, Item_MRP, Outlet_Establishment_Year, Item_Outlet_Sales
+ **Categorical Variables**: Item_Identifier, Item_Fat_Content, Item_Type, Outlet_Identifier, Outlet_Size, Outlet_Location_Type, Outlet_Type
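
The same split can be derived from the column dtypes instead of being typed by hand. The following is a minimal sketch, assuming a small hypothetical frame `sample` that stands in for `bigmart` (its values are copied from the `head()` output above):

```python
import pandas as pd

# Hypothetical three-row stand-in for the bigmart frame.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15"],
    "Item_Weight": [9.3, 5.92, 17.5],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
    "Item_MRP": [249.8092, 48.2692, 141.618],
    "Outlet_Establishment_Year": [1999, 2009, 1999],
})

# select_dtypes splits the columns by dtype: numeric vs. everything else.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that a dtype-based split treats Outlet_Establishment_Year as numerical because it is stored as an integer, even though it is a calendar year.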

## Data Preprocessing

The Big Mart sales data is preprocessed in four steps:

   - **Handling Missing Values:** Check for any missing values in the dataset and decide how to handle them (e.g., imputation, removal).
   - **Feature Engineering:** Create new features if needed (e.g., age of the store from the establishment year).
   - **Normalization/Standardization:** Scale features to ensure that numerical values are on a similar scale.
   - **Categorical Encoding:** Convert categorical variables into numerical formats using methods such as one-hot encoding or label encoding.
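
The four steps above can be sketched end to end with pandas alone. This is an illustrative sketch rather than the pipeline used later in this notebook: the frame `toy`, the helper `preprocess`, the derived column `Outlet_Age`, and the value of `REFERENCE_YEAR` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the dataset, and the missing
# entries were placed by hand for illustration.
toy = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, 19.2],
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095],
    "Outlet_Establishment_Year": [1999, 2009, 1999, 1998],
    "Outlet_Size": ["Medium", "Medium", "Medium", None],
})

# Assumed reference year for computing store age (not stated in the source).
REFERENCE_YEAR = 2013


def preprocess(df, reference_year):
    """Apply the four preprocessing steps to a copy of df."""
    out = df.copy()

    # 1. Missing values: mean for the numeric column, a placeholder category otherwise.
    out["Item_Weight"] = out["Item_Weight"].fillna(out["Item_Weight"].mean())
    out["Outlet_Size"] = out["Outlet_Size"].fillna("Unknown size")

    # 2. Feature engineering: age of the store from its establishment year.
    out["Outlet_Age"] = reference_year - out["Outlet_Establishment_Year"]

    # 3. Standardization: z-score the continuous features.
    for col in ["Item_Weight", "Item_MRP"]:
        out[col] = (out[col] - out[col].mean()) / out[col].std()

    # 4. Categorical encoding: one-hot encode the outlet size.
    return pd.get_dummies(out, columns=["Outlet_Size"])


processed = preprocess(toy, REFERENCE_YEAR)
print(processed.columns.tolist())
```

In a real modeling workflow the mean and standard deviation used for scaling would normally be computed on the training split only and then reused on the test split.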

In [7]:
# count missing values per column
bigmart.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
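
To judge how much of each column is affected, the counts can be expressed as a share of all rows. A small sketch using the two non-zero counts from the output above and the 8523 rows reported by `bigmart.info()`; the names `missing_counts`, `total_rows`, and `missing_share` are introduced only for this example:

```python
import pandas as pd

# Non-zero missing-value counts, copied from the isnull().sum() output.
missing_counts = pd.Series({"Item_Weight": 1463, "Outlet_Size": 2410})
total_rows = 8523  # number of entries reported by bigmart.info()

# Percentage of rows with a missing value, rounded to one decimal place.
missing_share = (missing_counts / total_rows * 100).round(1)
print(missing_share)
```

On the full frame, `bigmart.isnull().mean()` gives the same fractions directly.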

In [8]:
# identify duplicated rows
bigmart.duplicated().sum()

0

In [9]:
# columns with the missing values
columns_with_missval = ["Item_Weight", "Outlet_Size"]
bigmart_missval = bigmart[columns_with_missval]
bigmart_missval.head()

| | Item_Weight | Outlet_Size |
|---|---|---|
| 0 | 9.3 | Medium |
| 1 | 5.92 | Medium |
| 2 | 17.5 | Medium |
| 3 | 19.2 | NaN |
| 4 | 8.93 | High |


+ **Item Weight**: replace missing values with the mean of the feature.
+ **Outlet Size**: replace missing values with the placeholder category "Unknown size".
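
A possible refinement of the mean strategy, shown here only as an alternative and not as the approach taken in this notebook, is to impute per item. Item_Identifier has 1559 unique values across 8523 rows, so most items appear at several outlets. The sketch below assumes that an item's weight is the same wherever it is sold, and uses a hypothetical frame `toy` with hand-placed gaps:

```python
import pandas as pd

# Hypothetical rows: the same item can appear at several outlets.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, None, 5.92, None, None],
})

# Fallback for items that have no recorded weight at all.
overall_mean = toy["Item_Weight"].mean()

# Mean weight per item, aligned row by row with the original frame.
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Prefer the per-item mean; fall back to the overall mean when it is missing.
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_mean).fillna(overall_mean)
print(toy)
```

Here the two known items recover their own weights, while NCD19, which has no recorded weight, falls back to the overall mean.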

In [10]:
# impute missing item weights with the column mean
item_weight_mean = bigmart["Item_Weight"].mean()
bigmart["Item_Weight"] = bigmart["Item_Weight"].fillna(item_weight_mean)

# label missing outlet sizes as their own category
bigmart["Outlet_Size"] = bigmart["Outlet_Size"].fillna("Unknown size")

In [11]:
# confirm that no missing values remain
bigmart.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

In [21]:
print(ptype.is_object_dtype(bigmart["Outlet_Size"]))

True


In [23]:
# split the columns into categorical (object dtype) and numerical features
def seperate_features(sales_data):
  """Return (categorical_columns, numerical_columns) for a DataFrame."""
  item_cat_properties = []
  item_num_properties = []
  for column in sales_data.columns:
    if ptype.is_object_dtype(sales_data[column]):
      item_cat_properties.append(column)
    else:
      item_num_properties.append(column)

  return item_cat_properties, item_num_properties

# keep only the names of the categorical columns
item_qualities, _ = seperate_features(bigmart)
bigmart[item_qualities].head()

| | Item_Identifier | Item_Fat_Content | Item_Type | Outlet_Identifier | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|
| 0 | FDA15 | Low Fat | Dairy | OUT049 | Medium | Tier 1 | Supermarket Type1 |
| 1 | DRC01 | Regular | Soft Drinks | OUT018 | Medium | Tier 3 | Supermarket Type2 |
| 2 | FDN15 | Low Fat | Meat | OUT049 | Medium | Tier 1 | Supermarket Type1 |
| 3 | FDX07 | Regular | Fruits and Vegetables | OUT010 | Unknown size | Tier 3 | Grocery Store |
| 4 | NCD19 | Low Fat | Household | OUT013 | High | Tier 3 | Supermarket Type1 |


In [38]:
whitespace_left.sum() == 0

True

In [40]:
# check for leading or trailing whitespace in the categorical (string) columns
wspace_rx = r'^\s+|\s+$'
columns_with_wspace = []

for column in item_qualities:
  # count the values that start or end with whitespace
  whitespace_left = bigmart[column].str.contains(r'^\s').sum()
  whitespace_right = bigmart[column].str.contains(r'\s$').sum()

  # flag the column if either side has padding
  if (whitespace_left > 0) or (whitespace_right > 0):
    columns_with_wspace.append(column)
  else:
    print(f"{column}: No white spaces are found")

Item_Identifier: No white spaces are found
Item_Fat_Content: No white spaces are found
Item_Type: No white spaces are found
Outlet_Identifier: No white spaces are found
Outlet_Size: No white spaces are found
Outlet_Location_Type: No white spaces are found
Outlet_Type: No white spaces are found
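
Had any column been flagged, the padding could be removed with `str.strip()`. The sketch below runs the check and the cleanup on a hypothetical frame `toy` with deliberately padded values; the helper `columns_with_padding` is introduced only for this example and is not part of the notebook.

```python
import pandas as pd

# Hypothetical frame with deliberately padded values in one column.
toy = pd.DataFrame({
    "Item_Type": ["Dairy", " Meat", "Household "],
    "Outlet_Size": ["Medium", "High", "Medium"],
})


def columns_with_padding(df, columns):
    """Return the columns whose string values have leading or trailing whitespace."""
    padded = []
    for column in columns:
        values = df[column].astype(str)
        # A value is padded if stripping it changes it.
        if (values != values.str.strip()).any():
            padded.append(column)
    return padded


padded_columns = columns_with_padding(toy, toy.columns)
print(padded_columns)

# Strip the padding in the flagged columns only.
for column in padded_columns:
    toy[column] = toy[column].str.strip()
```

Comparing each value with its stripped form catches padding on either side in one pass, so a column is flagged even if only its leading or only its trailing side is affected.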


### Feature Engineering