<a href="https://colab.research.google.com/github/patelmedha/Prediction-of-Product-Sales/blob/main/PREDICTION_OF_PRODUCT_SALES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Prediction of Product Sales**
**Author: Medha Patel**

##**Project Overview**

  This project aims to help retailers understand the properties of their products and outlets that play crucial roles in increasing sales. Using data analysis and machine learning techniques, the project will identify the most significant factors that influence sales performance and explore segmentation strategies to group similar products and outlets together. The goal is to provide retailers with a comprehensive understanding of their sales data and enable them to make data-driven decisions that increase revenue, improve customer satisfaction, and drive long-term success. An interactive dashboard will allow retailers to visualize and explore the data, experiment with input parameters, and generate custom reports.

###Data Dictionary

  - **Item_Identifier**: Unique product ID
  - **Item_Weight**: Weight of product
  - **Item_Fat_Content**: Whether the product is low fat or regular
  - **Item_Visibility**: The percentage of total display area of all products in store allocated to the particular product
  - **Item_Type**: The category to which the product belongs
  - **Item_MRP**: Maximum Retail Price (list price) of the product
  - **Outlet_Identifier**: Unique store ID
  - **Outlet_Establishment_Year**: The year in which store was established
  - **Outlet_Size**: The size of the store in terms of ground area covered
  - **Outlet_Location_Type**: The type of area in which the store is located
  - **Outlet_Type**: Whether the outlet is a grocery store or some sort of supermarket
  - **Item_Outlet_Sales**: Sales of product in particular store. This is the target variable to be predicted





##**Load and Inspect Data**

Import Libraries

In [1]:
#Imports
## Pandas
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## Numpy
import numpy as np

##Seaborn
import seaborn as sns

##Matplotlib
import matplotlib.pyplot as plt

## Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline

#Regression Model IMPORTS
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor

## Models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

## Regression Metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

## Set global scikit-learn configuration
from sklearn import set_config
## Display estimators as a diagram
set_config(display='diagram') # 'text' or 'diagram'

### **Load Data**

In [2]:
#Load Data
file_url = '/content/drive/MyDrive/CodingDojo/01-Fundamentals/PROJECT: PREDICTION OF PRODUCT SALES/Data/sales_predictions_2023.csv'

df = pd.read_csv(file_url)

#Copy of Dataframe
df_ml = df.copy()


###**Inspect Data**

####**Head()**

In [None]:
#Head()
df.head()

####**shape()**

In [None]:
df.shape
print(f'There are {df.shape[0]} rows, and {df.shape[1]} columns.')

####**dtypes**

In [None]:
df.dtypes

####**Info()**

In [None]:
#Info()
df.info()

####**describe()**

In [None]:
#Descriptive statistics for numeric columns
df.describe(include='number')

In [None]:
#Descriptive statistics for categoric columns
df.describe(include='object')

##**Clean Data**

####**Dropping/Replacing Columns**

In [None]:
df.info()

####**Duplicated Data**


In [None]:
dup_rows = df.duplicated().sum()
print(f'There are {dup_rows} duplicate rows.')

####**Unique Values**

In [None]:
df.nunique()

In [None]:
#Unique Value percentage
unique_percentage = df.nunique()/len(df) * 100
unique_percentage

####**Missing Values**

In [None]:
#Finding number of missing values
missing_values = df.isna().sum()
missing_values

In [None]:
#percent of missing_values
missing_values_percent = missing_values/len(df) * 100
missing_values_percent

####**Fixing Data Types**

In [None]:
df.info()

####**Visualizing Missing Values with Missingno**

In [None]:
import missingno as msno
msno.matrix(df);

#####**Address the Null Values**


######**Null Values in Categorical Columns**

In [None]:
# save list of categorical column name.
categorical_col = df.select_dtypes('object').columns
categorical_col

In [None]:
# Check value counts for categorical columns
for col in categorical_col:
  print(f'Value Counts for {col}')
  print(df[col].value_counts())
  print('\n')

####**Data Consistency**

##### **Data Consistency- Categorical Columns**




In [None]:
# Item_Fat_Content- fix the values
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(['low fat' , 'LF'], 'Low Fat')
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('reg' , 'Regular')
df['Item_Fat_Content'].value_counts()

###### **Replacing Data in Categorical Columns**
- **Drop the column**
 - con: This would result in a significant loss of data.
 - More than 50% missing values in the column would be significant enough to justify this option.

- **Drop rows with missing values**
 - con: This would result in a significant loss of data.
 - More than 5% missing values in a column would be too great to justify this option; less than 5% missing values would justify it.

- **Replace missing values with the value 'Unknown'**
 - pro: This typically will not create bias in favor of a specific label or class.
 - con: This will not factor correlations between features.

- **Impute missing values using the most frequent value, 'mode', value of the column**
 - pro: This typically works well with small numeric datasets.
 - con: This may create bias in favor of a specific label or class.
 - con: This will not factor correlations between features.
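
The rules of thumb above could be expressed as a small helper. This is only a minimal sketch: `suggest_strategy` is a hypothetical function, the 5% and 50% thresholds are taken from the list above, the choice made between those thresholds is an assumption, and the toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd


def suggest_strategy(column: pd.Series) -> str:
    """Suggest a missing-value strategy from the share of missing values (hypothetical helper)."""
    missing_pct = column.isna().mean() * 100
    if missing_pct == 0:
        return "nothing to do"
    if missing_pct > 50:
        return "drop column"
    if missing_pct < 5:
        return "drop rows"
    # Between 5% and 50% (assumed rule): keep the data and fill the gaps instead.
    if pd.api.types.is_numeric_dtype(column):
        return "impute median"
    return "fill with 'Unknown'"


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "size": ["Small", None, "Medium", None, "High", "Small", "Medium", "Small", "High", "Small"],
    "weight": [9.3, np.nan, 17.5, 19.2, 8.9, 10.4, 13.7, 16.2, 19.2, 11.8],
    "mostly_empty": [np.nan] * 8 + [1.0, 2.0],
})
suggestions = {name: suggest_strategy(toy[name]) for name in toy.columns}
print(suggestions)
```

A helper like this only encodes the thresholds; the final choice still depends on how important the column is for the analysis.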


In [None]:
#Replace missing categorical column values with 'Unknown'

df['Outlet_Size'] = df['Outlet_Size'].fillna('Unknown')


- Replacing with 'Unknown', since about 28% of the data is missing. This is to avoid any bias or significant loss of data.

#####**Data Consistency- Numerical Columns**

In [None]:
# Save list of numerical column names ('number' matches every int and float dtype)
numerical_col = df.select_dtypes('number').columns
numerical_col

In [None]:
# Check value counts for numerical columns
for col in numerical_col:
  print(f'Value Counts for {col}')
  print(df[col].value_counts())
  print('\n')

###### **Replacing Data in Numeric Columns**
- **Drop the column**
 - con: This would result in a significant loss of data.
 - More than 50% missing values in the column would be significant enough to justify this option.

- **Drop rows with missing values**
 - con: This would result in a significant loss of data.
 - More than 5% missing values in a column would be too great to justify this option; less than 5% missing values would justify it.

- **Impute missing values using the 'mean' value of the column**
 - pro: This typically works well with small numeric datasets.
 - con: This can introduce bias and is affected by skew and outliers more than the 'median' value.
 - con: This will not factor correlations between features.

- **Impute missing values using the 'median' value of the column**
 - pro: This typically works well with small numeric datasets.
 - pro: This is less affected by outliers than strategy = 'mean'.
 - con: This will not factor correlations between features.

In [None]:
# Median of 'Item_Weight' (pandas skips missing values by default)
weight_median = df['Item_Weight'].median()

In [None]:
# Impute missing values using 'median' value
df['Item_Weight'] = df['Item_Weight'].fillna(weight_median)


- Replacing with the 'median' value, since about 17% of the data is missing, to avoid bias or losing valuable data. The median is used to replace missing values since it is less affected by skew and outliers.
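
To illustrate why the median is less sensitive to outliers than the mean, here is a minimal sketch on a made-up series (the values are not from the sales data):

```python
import pandas as pd

# Toy weights with one extreme value, made up for illustration only.
weights = pd.Series([10.0, 11.0, 12.0, 13.0, 100.0, None])

mean_value = weights.mean()      # pulled upward by the 100.0 outlier
median_value = weights.median()  # stays near the bulk of the data

# Fill the single missing value with each statistic for comparison.
filled_mean = weights.fillna(mean_value)
filled_median = weights.fillna(median_value)
print(f"mean = {mean_value:.1f}, median = {median_value:.1f}")
```

With the mean, the imputed value lands far from every typical observation; with the median, it sits in the middle of them.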

In [None]:
#checking Data types
df.dtypes

In [None]:
#Descriptive data for numerical columns
df.describe(include='number')

#### Summary Statistic for Numerical Columns
For any numerical columns, obtain the summary statistics of each (min, max, mean).

In [None]:
df.describe(include = 'number')

##**Exploratory Data Analysis**

###**Categorical Columns**

In [None]:
## Display the descriptive statistics for the non-numeric columns
df.describe(include='object')

#### **'Item_Fat_Content' column**

In [None]:
## Display the value counts for the column
df['Item_Fat_Content'].value_counts()

In [None]:
## Display the descriptive statistics for the column
df['Item_Fat_Content'].describe()

In [None]:
# Define ax using seaborn.countplot()
ax = sns.countplot(data=df, x ='Item_Fat_Content')
#Set title, and axis label name
ax.set_xlabel('Item Fat Content')
ax.set_ylabel('Count')
ax.set_title('Item Fat Content');

Interpretation:
  - Low Fat Content has a higher count of instances compared to Regular Fat Content.

#### **'Item_Type' column**

In [None]:
## Display the descriptive statistics for the column
df['Item_Type'].describe()

In [None]:
## Display the value counts for the column in descending order
count_item_type = df['Item_Type'].value_counts().sort_values(ascending = False)
count_item_type

In [None]:
#Define fig and ax first so that figsize applies to this plot
fig, ax = plt.subplots(figsize = (18,8))
sns.barplot(x= count_item_type.index, y= count_item_type.values, ax = ax)

#Rotate x-axis labels to avoid overlap
ax.tick_params(axis='x', rotation = 90)
#Set the title
ax.set_title("Item Type Distribution")
#Set x-axis and y-axis labels
ax.set_xlabel('Item Type')
ax.set_ylabel('Count');

Interpretation:
  - Fruits and Vegetables has the highest count of instances.
  - Seafood has the lowest count of instances.

#### **'Outlet_Identifier' column**

In [None]:
## Display the value counts for the column
count_outlet_identifier = df['Outlet_Identifier'].value_counts().sort_values( ascending = False)
count_outlet_identifier


In [None]:
## Display the descriptive statistics for the column
df['Outlet_Identifier'].describe()

In [None]:
# Define ax using seaborn.countplot()
ax = sns.countplot(data=df, x = 'Outlet_Identifier')
#Rotate x-axis labels to avoid overlap
ax.tick_params(axis='x', rotation = 45)
#Set title, and axis label name
ax.set_xlabel('Outlet Identifier')
ax.set_ylabel('Count')
ax.set_title('Outlet Identifier Distribution');

Interpretation:
  - OUT027 has the highest count of instances.
  - OUT019 has the lowest count of instances.

#### **'Outlet_Size' column**

In [None]:
## Display the value counts for the column
count_outlet_size = df['Outlet_Size'].value_counts()
count_outlet_size

In [None]:
## Display the descriptive statistics for the column
df['Outlet_Size'].describe()

In [None]:
# Define ax using seaborn.countplot()
ax = sns.countplot(data=df, x = 'Outlet_Size')
#Set title, and axis label name
ax.set_xlabel('Outlet Size')
ax.set_ylabel('Count')
ax.set_title('Outlet Size Distribution');

Interpretation:
- Medium Outlet Size has the highest count of instances.
- High Outlet Size has the lowest count of instances.

#### **'Outlet_Location_Type' column**

In [None]:
## Display the value counts for the column
df['Outlet_Location_Type'].value_counts().sort_values()

In [None]:
## Display the descriptive statistics for the column
df['Outlet_Location_Type'].describe()

In [None]:
# Define ax using seaborn.countplot()
ax = sns.countplot(data=df, x = 'Outlet_Location_Type')
#Set title, and axis label name
ax.set_xlabel('Outlet Location Type')
ax.set_ylabel('Count')
ax.set_title('Outlet Location Type Distribution');

Interpretation:
  - Tier 3 Outlet Location Type has the highest count of instances.
  - Tier 1 Outlet Location Type has the lowest count of instances.


#### **'Outlet_Type' column**

In [None]:
## Display the value counts for the column
df['Outlet_Type'].value_counts()

In [None]:
## Display the descriptive statistics for the column
df['Outlet_Type'].describe()

In [None]:
# Define ax using seaborn.countplot()
ax = sns.countplot(data=df, x = 'Outlet_Type')
#Rotate x-axis labels to avoid overlap
ax.tick_params(axis='x', rotation = 90)
#Set title, and axis label name
ax.set_xlabel('Outlet Type')
ax.set_ylabel('Count')
ax.set_title('Outlet Type Distribution');

Interpretation:
  - Supermarket Type 1 has the highest count of instances.
  - Supermarket Type 2 has the lowest count of instances.


### **Numerical Columns**

In [None]:
## Display the descriptive statistics for the numeric columns
df.describe(include=('number'))

#### **'Item_Weight' column**

In [None]:
### Display the value counts for the column
df['Item_Weight'].value_counts()

In [None]:
## Display the descriptive statistics for the column
df['Item_Weight'].describe()

- **'Item Weight' Histogram**

In [None]:
## Define a MatplotLib ax object using seaborn.histplot()
## Default Bins = 'auto'
fig , ax = plt.subplots()

ax= sns.histplot(data = df, x = 'Item_Weight')




#Set title name and axis names
ax.set_title('Distribution of Weight')
ax.set_xlabel('Item Weight')
ax.set_ylabel("Count");

'Item Weight' Histogram Interpretation:
  - Values range from 4.55 to 21.35.
  - The median value is 12.6.
  - The data is very slightly negatively skewed.


- **'Item Weight' Boxplot**

In [None]:
## Define a MatplotLib ax object using seaborn.boxplot()
## Use x = for horizontal
ax = sns.boxplot(data = df,
                 x = 'Item_Weight')

## Set the Title
ax.set_title('Item Weight')
ax.set_xlabel('Item Weight');

'Item Weight' Boxplot Interpretation:
  - The data is very slightly negatively skewed.
  - No outliers are noted.

#### **'Item_Visibility' column**

In [None]:
### Display the value counts for the column
df['Item_Visibility'].value_counts()

In [None]:
## Display the descriptive statistics for the column
df['Item_Visibility'].describe()

- **'Item Visibility' Histogram**

In [None]:
## Define a MatplotLib ax object using seaborn.histplot()
## Default Bins = 'auto'
fig , ax = plt.subplots()

ax= sns.histplot(data = df, x = 'Item_Visibility')




#Set title name and axis names
ax.set_title('Distribution of Item Visibility')
ax.set_xlabel('Item Visibility')
ax.set_ylabel("Count");

'Item Visibility' Histogram Interpretation:
 - Values range from 0.000 to 0.328.
 - The median value is 0.053.
 - The data is very positively skewed.



- **'Item Visibility' Boxplot**

In [None]:
## Define a MatplotLib ax object using seaborn.boxplot()
## Use x = for horizontal
ax = sns.boxplot(data = df,
                 x = 'Item_Visibility')

## Set the Title and x-axis label
ax.set_title('Item Visibility')
ax.set_xlabel('Item Visibility');


'Item Visibility' Boxplot Interpretation:
 - The data is very positively skewed.
 - Outliers are noted on the high side.
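
The visual reading of skew and outliers can be cross-checked numerically. The sketch below uses made-up values and the 1.5 × IQR rule, which is the default whisker length for matplotlib and seaborn boxplots; the variable names are illustrative only.

```python
import pandas as pd

# Toy visibility-like values with a long right tail, made up for illustration only.
values = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.30])

# A skewness above 0 suggests a right (positive) skew.
skewness = values.skew()

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers on a boxplot.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"skew = {skewness:.2f}, outliers = {outliers.tolist()}")
```

The same calls could be applied to `df['Item_Visibility']` to confirm what the histogram and boxplot show.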


#### **'Item_MRP' column**

In [None]:
### Display the value counts for the column
df['Item_MRP'].value_counts()

In [None]:
## Display the descriptive statistics for the column
df['Item_MRP'].describe()

- **'Item MRP' Histogram**

In [None]:
## Define a MatplotLib ax object using seaborn.histplot()
## Default Bins = 'auto'
fig , ax = plt.subplots()

ax= sns.histplot(data = df, x = 'Item_MRP')


#Set title name and axis names
ax.set_title('Distribution of Item MRP')
ax.set_xlabel('Item MRP')
ax.set_ylabel("Count");

'Item MRP' Histogram Interpretation:
  - Values range from  31.29 to 266.88.
  - The median value is 143.01.
  - The data is very slightly negatively skewed.


**'Item MRP' Boxplot**

In [None]:
## Define a MatplotLib ax object using seaborn.boxplot()
## Use x = for horizontal
ax = sns.boxplot(data = df,
                 x = 'Item_MRP')

## Set the Title
ax.set_title('Item MRP')
ax.set_xlabel('Item MRP');

'Item MRP' Boxplot Interpretation:
  - The data is very slightly negatively skewed.
  - No outliers are noted.

#### **'Outlet_Establishment_Year' column**

In [None]:
### Display the value counts for the column
df['Outlet_Establishment_Year'].value_counts()

In [None]:
## Display the descriptive statistics for the column
df['Outlet_Establishment_Year'].describe()

**'Outlet Establishment Year' Histogram**

In [None]:
## Define a MatplotLib ax object using seaborn.histplot()
fig , ax = plt.subplots()

ax= sns.histplot(data = df, x = 'Outlet_Establishment_Year', bins = (20))


#Set title name and axis names
ax.set_title('Distribution of Outlet Establishment Year')
ax.set_xlabel('Outlet Establishment Year')
ax.set_ylabel("Count");

'Outlet Establishment Year' Histogram Interpretation:
  - Values range from  1985 to 2009.
  - The median value is 1999.
  - The data is very slightly negatively skewed.


**'Outlet Establishment Year' Boxplot**

In [None]:
## Define a MatplotLib ax object using seaborn.boxplot()
## Use x = for horizontal
ax = sns.boxplot(data = df,
                 x = 'Outlet_Establishment_Year')

## Set the Title
ax.set_title('Outlet Establishment Year')
ax.set_xlabel('Outlet Establishment Year');

'Outlet Establishment Year' Boxplot Interpretation:
  - The data is very slightly negatively skewed.
  - No outliers are noted.

#### **'Item_Outlet_Sales' column**

In [None]:
### Display the value counts for the column
df['Item_Outlet_Sales'].value_counts()

In [None]:
## Display the descriptive statistics for the column
df['Item_Outlet_Sales'].describe()

**'Item Outlet Sales' Histogram**

In [None]:
## Define a MatplotLib ax object using seaborn.histplot()
## Default Bins = 'auto'
fig , ax = plt.subplots()

ax= sns.histplot(data = df, x = 'Item_Outlet_Sales')


#Set title name and axis names
ax.set_title('Distribution of Item Outlet Sales')
ax.set_xlabel('Item Outlet Sales ')
ax.set_ylabel("Count");

'Item Outlet Sales' Histogram Interpretation:
  - Values range from 33.29 to 13086.96.
  - The median value is 1794.33.
  - The data is very positively skewed.


**'Item Outlet Sales' Boxplot**

In [None]:
## Define a MatplotLib ax object using seaborn.boxplot()
## Use x = for horizontal
ax = sns.boxplot(data = df,
                 x = 'Item_Outlet_Sales')

## Set the Title
ax.set_title('Item Outlet Sales')
ax.set_xlabel('Item Outlet Sales');

'Item Outlet Sales' Boxplot Interpretation:
  - The data is very positively skewed.
  - Outliers are noted on the high end.

###**Correlation**

####.corr() method

In [None]:
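#To check all numeric features in the dataframe for correlations, use df.corr()
#numeric_only=True skips the text columns, which would otherwise raise an error in pandas 2.0+
corr = df.corr(numeric_only=True)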

corr

####Heatmap of Correlations

- Heatmap of correlations will display any correlation between numeric features.

In [None]:
##Define Matplotlib fig and ax objects using plt.subplots()
## Pass figsize= to plt.subplots() to change the size of the figure
fig, ax = plt.subplots()

##Define Matplotlib ax object using sns.heatmap()
##Use cmap= to define the color map
##Use annot= to annotate the correlation values
ax = sns.heatmap(corr, cmap = 'viridis', annot = True);

Interpretation of Heatmap:
  - The highest correlation is between Item_MRP and Item_Outlet_Sales.
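
Besides reading the heatmap, the features can be ranked by their correlation with the target directly. This is a minimal sketch on made-up data; only the column names are borrowed from the data dictionary.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Item_MRP": [50.0, 100.0, 150.0, 200.0, 250.0],
    "Item_Weight": [12.0, 9.0, 15.0, 8.0, 13.0],
    "Item_Outlet_Sales": [500.0, 1100.0, 1400.0, 2100.0, 2400.0],
    "Item_Type": ["Dairy", "Meat", "Dairy", "Seafood", "Meat"],
})

target = "Item_Outlet_Sales"
# numeric_only=True skips the text column instead of raising an error.
corr = toy.corr(numeric_only=True)
# Absolute correlation of every numeric feature with the target, strongest first.
target_corr = corr[target].drop(target).abs().sort_values(ascending=False)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top as well.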

##**Explanatory Data Analysis**

##### **Impact of Item MRP on Item Outlet Sales**

In [None]:
## Define a MatplotLib ax object using sns.regplot()

scatter_kws = dict(edgecolor='white')
ax = sns.regplot(data = df,
                 x ='Item_MRP',
                 y = 'Item_Outlet_Sales',
                 scatter_kws= scatter_kws,
                 line_kws = {'color':'yellow'})
## Set the Title
ax.set_title('Item Outlet Sales vs Item MRP', fontsize=12, fontweight = 'bold')
# Set Axes Labels
ax.set_xlabel('Item MRP', fontsize=10, fontweight = 'bold')
ax.set_ylabel('Item Outlet Sales', fontsize=10, fontweight = 'bold');

- Prediction: Positive Correlation- Item MRP Influences Item Outlet Sales.

##### **Impact of Outlet Size on Outlet Sales**

In [None]:
count_outlet_size = df['Outlet_Size'].value_counts()
count_outlet_size

In [None]:
## Define label_order
outlet_size_sales_mean = df.groupby('Outlet_Size')['Item_Outlet_Sales'].mean().sort_values(ascending=False)
outlet_size_sales_mean
## Define a MatplotLib ax object using sns.barplot()
fig, ax = plt.subplots()
ax = sns.barplot(data = df,
                 x ='Outlet_Size',
                 y = 'Item_Outlet_Sales',
                 order = outlet_size_sales_mean.index,
                 errorbar = None)
## Set the Title
ax.set_title('Item Outlet Sales by Outlet Size', fontsize=12, fontweight = 'bold')
# Set Axes Labels
ax.set_xlabel('Outlet Size', fontsize=10, fontweight = 'bold')
ax.set_ylabel('Item Outlet Sales', fontsize=10, fontweight = 'bold');

- Prediction: Medium outlets have the highest average Item Outlet Sales, while outlets of Unknown size have the lowest.

##### **Impact of Item Type on Outlet Sales**

In [None]:
## Display the value counts for the column
df['Item_Type'].value_counts()

In [None]:
#Item type sales in percent using groupby
total_item_type_sales = df.groupby('Item_Type')['Item_Outlet_Sales'].sum().round(2)
total_sales = df["Item_Outlet_Sales"].sum()
percent_item_type_sales = ((total_item_type_sales / total_sales)*100).sort_values(ascending=False)
percent_item_type_sales

In [None]:
percent_item_type_sales = ((total_item_type_sales / total_sales) * 100).reset_index()

## Define a Matplotlib ax object using sns.barplot()
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.barplot(data=percent_item_type_sales,
                 y='Item_Type',
                 x='Item_Outlet_Sales',
                 order=percent_item_type_sales.sort_values('Item_Outlet_Sales', ascending=False)['Item_Type'],
                 errorbar=None,
                 palette = 'plasma')
## Set the Title
ax.set_title('Percentage of Outlet Sales by Item Type', fontsize=12, fontweight='bold')
# Set Axes Labels
ax.set_ylabel('Item Type', fontsize=10, fontweight='bold')
ax.set_xlabel('Percentage of Outlet Sales', fontsize=10, fontweight='bold');


- Prediction: Fruits and Vegetables (15%) exhibit the highest sales, whereas Seafood (0.8%) demonstrates the lowest sales.
  - The top three Items Types to impact most on Item Outlet sales are:
      - Fruits and Vegetables: 15%
      - Snack Foods: 14%
      - Household: 11%
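
The percentage calculation behind these figures is a grouped sum divided by the grand total. A minimal sketch with made-up numbers, so the arithmetic is easy to check by hand:

```python
import pandas as pd

# Toy sales, made up for illustration only (grand total = 1000).
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Seafood", "Snack Foods", "Snack Foods"],
    "Item_Outlet_Sales": [100.0, 300.0, 100.0, 200.0, 300.0],
})

# Total sales per item type, then each type's share of the grand total.
type_sales = toy.groupby("Item_Type")["Item_Outlet_Sales"].sum()
share_pct = (type_sales / type_sales.sum() * 100).sort_values(ascending=False)
print(share_pct.round(1))
```

Because every share is divided by the same total, the percentages always add up to 100.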

##### Impact of Outlet Location on Outlet Sales

In [None]:
## Display the value counts for the column
df['Outlet_Location_Type'].value_counts()

In [None]:
location_sales= df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].sum().sort_values(ascending = False)
location_sales_percent = ((location_sales / total_sales)*100)
location_sales_percent

In [None]:
location_sales_percent = ((location_sales / total_sales)*100).reset_index()
## Define a Matplotlib ax object using sns.barplot()
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.barplot(data=location_sales_percent,
                 y='Outlet_Location_Type',
                 x='Item_Outlet_Sales',
                 order=location_sales_percent.sort_values('Item_Outlet_Sales', ascending=False)['Outlet_Location_Type'],
                 errorbar=None,
                 palette = 'plasma')
## Set the Title
ax.set_title('Percentage of Outlet Sales by Outlet Location Type', fontsize=12, fontweight='bold')
# Set Axes Labels
ax.set_ylabel('Outlet Location Type', fontsize=10, fontweight='bold')
ax.set_xlabel('Percentage of Outlet Sales', fontsize=10, fontweight='bold');

- Prediction: Tier 3 Outlet Location Types make the biggest contribution to total sales, with 41% of the sales coming from these outlets. On the other hand, Tier 1 Outlet Location Types have the smallest contribution, accounting for only 24% of the total sales.

##**Preprocessing for Machine Learning**

###**Inspect Data**

####**shape()**

In [None]:
df_ml.shape
print(f'There are {df_ml.shape[0]} rows, and {df_ml.shape[1]} columns.')

####**Info()**

In [None]:
#Info()
df_ml.info()

####**head()**

In [None]:
#Head()
df_ml.head()

####**describe()**

In [None]:
#Descriptive statistics for numeric columns
df_ml.describe(include='number')

In [None]:
#Descriptive statistics for categoric columns
df_ml.describe(include='object')

###**Performing Preprocessing Data**

In [None]:
# Checking for Duplicates
df_ml.duplicated().sum()

- There are 0 duplicates.

In [None]:
# Checking missing values
df_ml.isna().sum()

####**Data Consistency**

##### **Data Consistency- Categorical Columns**




In [None]:
# save list of categorical column name.
categorical_col = df_ml.select_dtypes('object').columns
categorical_col

In [None]:
# Check for nunique for categorical columns
for col in categorical_col:
  print(f'Value Counts for {col}')
  print(df_ml[col].value_counts())
  print('\n')

In [None]:
#Drop 'Item_Identifier'
df_ml.drop(columns = 'Item_Identifier', inplace = True)
df_ml.info()

**Cardinality**
- The column "Item_Identifier" has 1559 unique values.
- High cardinality will create a very sparse dataset when it is One Hot Encoded, which can negatively impact the models' metrics, and greatly increase processing times.
- It will be better to drop this column.
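
The effect of cardinality on one-hot encoding can be seen on a tiny example. This sketch uses made-up identifiers and `pd.get_dummies` as a stand-in for the `OneHotEncoder` used later in the notebook:

```python
import pandas as pd

# Toy data, made up for illustration only: every identifier is unique.
toy = pd.DataFrame({
    "Item_Identifier": ["ID_1", "ID_2", "ID_3", "ID_4", "ID_5", "ID_6"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type2", "Supermarket Type1"],
})

# Number of unique values per column.
cardinality = toy.nunique()
# One-hot encoding creates one new column per unique value.
encoded = pd.get_dummies(toy)
print(cardinality)
print(f"Columns after one-hot encoding: {encoded.shape[1]}")
```

Here the identifier column alone contributes six of the nine encoded columns; with 1559 unique values the encoded matrix would be dominated by it.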

In [None]:
# Item_Fat_Content- fix the values
df_ml['Item_Fat_Content'] = df_ml['Item_Fat_Content'].replace(['low fat' , 'LF'], 'Low Fat')
df_ml['Item_Fat_Content'] = df_ml['Item_Fat_Content'].replace('reg' , 'Regular')
df_ml['Item_Fat_Content'].value_counts()

In [None]:
df_ml.info()

#####**Data Consistency- Numerical Columns**

In [None]:
# Have list of numerical column name.
numerical_col = df_ml.select_dtypes(['int', 'float']).columns
numerical_col

In [None]:
# check for nunique for numerical columns
for col in numerical_col:
  print(f'Value Counts for {col}')
  print(df_ml[col].value_counts())
  print('\n')

###**Defining X and y**

In [None]:
#Check for null values in target column Item_Outlet_Sales
df_ml['Item_Outlet_Sales'].isna().sum()

- There are 0 null values in target column.

#### Define X and y

In [None]:
## Define X and y
target = 'Item_Outlet_Sales'

X = df_ml.drop(columns=target).copy()
y = df_ml[target].copy()
X.head()

####Train-Test-Split

In [None]:
# Performing a train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [None]:
X_test.shape

In [None]:
X_train.dtypes

### Create 3 Pipelines

The data is going to be divided into three subsets, and each subset will be preprocessed differently:

- numeric columns:
  - Item_Weight, Item_Visibility, Item_MRP, Outlet_Establishment_Year
- ordinal categorical columns:
  - Item_Fat_Content, Outlet_Size, Outlet_Location_Type
- nominal categorical columns:
  - Item_Type, Outlet_Identifier, Outlet_Type


#### 1. Numeric

In [None]:
# PREPROCESSING PIPELINE FOR NUMERIC DATA

# Save list of number column names
num_cols = X_train.select_dtypes("number").columns
print("Numeric Columns:", num_cols)

# Transformers
impute_mean = SimpleImputer(strategy='mean')
scaler = StandardScaler()

# Pipeline
num_pipe = make_pipeline(impute_mean, scaler)
num_pipe

# Tuple
numeric_tuple = ('numeric',num_pipe, num_cols)

#### 2. Ordinal

In [None]:
# PREPROCESSING PIPELINE FOR ORDINAL DATA

# Save list of ordinal column names
ordinal_cols = ['Item_Fat_Content', 'Outlet_Size', 'Outlet_Location_Type']

# Ordered Category Lists
Item_Fat_Content_list = ['Low Fat', 'Regular']
Outlet_Size_list = ['Small', 'Medium', 'High']
Outlet_Location_list = ['Tier 1', 'Tier 2', 'Tier 3']


# Transformers

ord_encoder = OrdinalEncoder(categories=[Item_Fat_Content_list, Outlet_Size_list, Outlet_Location_list])
freq_imputer = SimpleImputer(strategy='most_frequent')

# Ordinal codes can span a wide range when a feature has many categories, so scale them as well
scaler_ord = StandardScaler()

# Pipeline
ord_pipe = make_pipeline(freq_imputer, ord_encoder, scaler_ord)

# Tuple
ord_tuple = ('ordinal',ord_pipe, ordinal_cols)

#### 3. Nominal

In [None]:
# PREPROCESSING PIPELINE FOR ONE-HOT-ENCODED DATA

# Save list of nominal column names
nominal_cols = X_train.select_dtypes('object').drop(columns=ordinal_cols).columns

# Transformers

missing_imputer = SimpleImputer(strategy='constant', fill_value='missing')
# 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Pipeline
nom_pipe = make_pipeline(missing_imputer , ohe_encoder)

# Tuple
ohe_tuple = ('categorical',nom_pipe, nominal_cols)

###Column Transformer

In [None]:
#Preprocessing ColumnTransformer
preprocessor = ColumnTransformer([numeric_tuple, ord_tuple, ohe_tuple], verbose_feature_names_out=False)
preprocessor

##**Machine Learning - Training the Models**

####Evaluate Model Performance

In [None]:
#function for true and predicted values
# print MAE, MSE, RMSE, and R2 metrics for the model
def eval_regression(y_true, y_pred, name='model'):
  """Uses true targets and predictions from a regression model and prints the metrics(MAE, MSE, RMSE and R2 Score)
  set 'name' to name of model and 'train' or 'test' as appropriate"""
  mae= mean_absolute_error(y_true, y_pred)
  mse= mean_squared_error(y_true,y_pred)
  rmse= np.sqrt(mse)
  r2= r2_score(y_true,y_pred)

  print(f'{name} Scores')
  print(f'MAE: {mae:,.4f} \nMSE: {mse:,.4f} \nRMSE: {rmse:,.4f} \nR2: {r2:,.4f}\n')
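
As a sanity check on what these four metrics measure, they can be computed by hand on three made-up points using only NumPy:

```python
import numpy as np

# Toy targets and predictions, made up for illustration only.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_pred - y_true                         # [-1, 0, 2]
mae = np.mean(np.abs(errors))                    # (1 + 0 + 2) / 3 = 1
mse = np.mean(errors ** 2)                       # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares = 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 8
r2 = 1 - ss_res / ss_tot                         # 1 - 5/8 = 0.375
print(f"MAE: {mae:.4f}  MSE: {mse:.4f}  RMSE: {rmse:.4f}  R2: {r2:.4f}")
```

RMSE penalizes the single error of 2 more heavily than MAE does, which is why it is larger than the MAE here.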

####Linear Regression Model

In [None]:
#Make and fit model
lr_pipe = make_pipeline(preprocessor,LinearRegression())
lr_pipe.fit(X_train, y_train)

In [None]:
#Make predictions using training and testing data
training_preds = lr_pipe.predict(X_train)
testing_preds = lr_pipe.predict(X_test)
training_preds

In [None]:
##Evaluate Model's Performance
eval_regression(y_train, training_preds, name='Training')
eval_regression(y_test, testing_preds, name='Testing')

**Observations**
  - According to the MAE scores, the model seems to be slightly underfitting.
  - However, as per the R2 scores, the model performs consistently on both the training and test data.
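
One way to put such scores in context is to compare them with a baseline that always predicts the mean. `DummyRegressor` is already imported at the top of the notebook; the sketch below uses synthetic data rather than the sales data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(42)
X_toy = rng.uniform(30, 270, size=(200, 1))               # price-like feature
y_toy = 15 * X_toy[:, 0] + rng.normal(0, 500, size=200)   # noisy linear target

# The baseline ignores the features and always predicts the mean of y.
baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
linear = LinearRegression().fit(X_toy, y_toy)

baseline_r2 = r2_score(y_toy, baseline.predict(X_toy))
linear_r2 = r2_score(y_toy, linear.predict(X_toy))
print(f"baseline R2: {baseline_r2:.4f}, linear R2: {linear_r2:.4f}")
```

The mean baseline scores an R2 of zero on the data it was fitted on, so any useful model should score clearly above that.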

####Decision Tree Model

In [None]:
#Make and fit model
dt_pipe = make_pipeline(preprocessor,DecisionTreeRegressor(random_state=42))
dt_pipe.fit(X_train, y_train)

#Make predictions using training and testing data
training_preds = dt_pipe.predict(X_train)
testing_preds = dt_pipe.predict(X_test)

In [None]:
##Evaluate Model's Performance
eval_regression(y_train, training_preds, name='Training')
eval_regression(y_test, testing_preds, name='Testing')

##### Tuning Decision Tree Regressor Model

In [None]:
#Create range of max_depth value
depths = range(1, dt_pipe['decisiontreeregressor'].get_depth())

#create a dataframe to store train and test scores.
scores = pd.DataFrame(columns=['Train', 'Test'], index = depths)

#loop over the values in depths
for n in depths:
  #fit a new model with max_depth
  tree = DecisionTreeRegressor(random_state = 42, max_depth=n)

  #put the model into a pipeline
  tree_pipe = make_pipeline(preprocessor, tree)

  #fit the model
  tree_pipe.fit(X_train, y_train)

  #create prediction arrays
  train_pred = tree_pipe.predict(X_train)
  test_pred = tree_pipe.predict(X_test)

  #evaluate the model using R2 Score
  train_r2score = r2_score(y_train, train_pred)
  test_r2score = r2_score(y_test, test_pred)

  #store the scores in the scores dataframe
  scores.loc[n, 'Train'] = train_r2score
  scores.loc[n, 'Test'] = test_r2score

In [None]:
scores

In [None]:
#plot the scores to visually determine the best max_depth
plt.plot(depths, scores['Train'], label = 'train')
plt.plot(depths, scores['Test'], label = 'test')
plt.ylabel('R2 Scores')
plt.xlabel('Max Depths')
plt.legend()
plt.show()

In [None]:

#sort the dataframe by test scores and save the index (max_depth) of the best score
best_depth = scores.sort_values(by='Test', ascending=False).index[0]
best_depth

- Best Depth for Decision Tree Regressor Model is 5.
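
The loop above picks the depth by looking at the test scores. An alternative, shown here as a sketch and not as the method used in this notebook, is to choose `max_depth` with cross-validation on the training data so the test set stays untouched until the final evaluation. The data below is synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) * 3 + X_toy[:, 1] + rng.normal(0, 0.5, size=300)

# Try depths 1 to 10 and score each with 5-fold cross-validated R2.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": list(range(1, 11))},
    scoring="r2",
    cv=5,
)
search.fit(X_toy, y_toy)
best_depth_cv = search.best_params_["max_depth"]
print(best_depth_cv, round(search.best_score_, 3))
```

With a pipeline built by `make_pipeline`, the parameter would be addressed through the step name, for example `decisiontreeregressor__max_depth`.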

In [None]:
#Reevaluate Decision Tree using the best_depth
best_dt = DecisionTreeRegressor(random_state=42, max_depth = best_depth)

best_dt_pipe = make_pipeline(preprocessor, best_dt)

best_dt_pipe.fit(X_train, y_train)

print('Training Scores for Tuned Decision Tree')
eval_regression(y_train, best_dt_pipe.predict(X_train), name = 'training')

print('Testing Scores for Tuned Decision Tree')
eval_regression(y_test, best_dt_pipe.predict(X_test), name = 'testing')

**Observations**
  - Tuning max_depth on the decision tree improved the results for the testing data.
  - This model has high bias after tuning max_depth.

#### Random Forest Tree Model

In [None]:
#Make and fit model
rf_pipe = make_pipeline(preprocessor,RandomForestRegressor())
rf_pipe.fit(X_train, y_train)

#Make predictions using training and testing data
training_preds = rf_pipe.predict(X_train)
testing_preds = rf_pipe.predict(X_test)
training_preds

##Evaluate Model's Performance
eval_regression(y_train, training_preds, name='Training')
eval_regression(y_test, testing_preds, name='Testing')

**Observations**
  - This model seems to have improved results on the testing data.
  - The test R2 score is about 0.55, meaning the model explains about 55% of the variance in sales; however, the test RMSE is 1107.2359.

##### Tuning Random Forest Tree Model

In [None]:
#create a list of n_estimators values to try
n_estimators = [2000]

#create a dataframe to store train and test scores.
scores = pd.DataFrame(columns=['Train', 'Test'], index=n_estimators)

#loop over the values in n_estimators
for n in n_estimators:
  #fit a new model with n_estimators
  rf = RandomForestRegressor(random_state = 42, n_estimators=n)

  #put the model into a pipeline
  rf_pipe = make_pipeline(preprocessor, rf)

  #fit the model
  rf_pipe.fit(X_train, y_train)

  #create prediction arrays
  train_pred = rf_pipe.predict(X_train)
  test_pred = rf_pipe.predict(X_test)

  #evaluate the model using R2 Score
  train_r2score = r2_score(y_train, train_pred)
  test_r2score = r2_score(y_test, test_pred)

  #store the scores in the scores dataframe
  scores.loc[n, 'Train'] = train_r2score
  scores.loc[n, 'Test'] = test_r2score

In [None]:
scores

In [None]:
#Best n_estimator
best_estimators = scores.sort_values(by='Test', ascending=False).index[0]
best_estimators


In [None]:
#Re-evaluating Random Forest Model using best n_estimator

best_rf = RandomForestRegressor(random_state = 42, n_estimators=best_estimators)

best_rf_pipe = make_pipeline(preprocessor, best_rf)

best_rf_pipe.fit(X_train, y_train)

print('Training Scores for Tuned Random Forest')
eval_regression(y_train, best_rf_pipe.predict(X_train), name = 'training')

print('\n')

print('Testing Scores for Tuned Random Forest')
eval_regression(y_test, best_rf_pipe.predict(X_test), name = 'testing')



**Observations**
  - The tuned model is biased. However, it produces the best Random Forest performance on the test data.
  - The R2 score for the test data is 55.89%, while the RMSE for the test data is about 1103.2075.

##**Overall Recommendation**
- Model Performance:
    - Overall, the best model is the Linear Regression Model. This model avoids the bias seen in the tuned tree-based models.
    - The Linear Regression Model performed best, with a test R2 score of 56.71% and a test RMSE of 1092.8631, compared with 55.89% and 1103.2075 for the tuned Random Forest.
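
The reported test scores can be gathered into one table to make the comparison explicit. This sketch only restates the figures quoted in the observations above; the column names are illustrative.

```python
import pandas as pd

# Test-set scores as reported in the observations above.
results = pd.DataFrame(
    {
        "model": ["Linear Regression", "Random Forest (tuned)"],
        "test_r2": [0.5671, 0.5589],
        "test_rmse": [1092.8631, 1103.2075],
    }
).set_index("model")

# Rank by R2 (higher is better); RMSE is lower for the same model.
ranked = results.sort_values("test_r2", ascending=False)
print(ranked)
```

Both metrics agree on the ranking here, which makes the recommendation straightforward.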