<a href="https://colab.research.google.com/github/iamanantalok/Retail-Sales-Prediction-Rossmann/blob/main/Capstone_2_Retail_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Retail Sales Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Name**            - Anant Alok



# **Project Summary -**

In the ever-evolving landscape of the retail industry, data-driven decision-making has become paramount for success. Retailers are constantly seeking ways to optimize their operations, enhance customer experiences, and maximize profitability. This project aims to address these challenges by leveraging regression analysis to predict future retail sales.

Retail sales prediction is a critical task for businesses in the retail sector. Accurate forecasts enable retailers to make informed decisions regarding inventory management, staffing, marketing strategies, and expansion plans. By harnessing the power of regression analysis, this project seeks to provide retailers with a valuable tool to improve their bottom line.

The foundation of any predictive analysis is data. For this project, we used the historical daily sales data provided for 1,115 Rossmann stores, together with supplemental store information covering promotions, holidays, store type, assortment, and nearby competition. The dataset spans several years, allowing for a comprehensive analysis.

Data preprocessing is a crucial step in any data-driven project. We cleaned the dataset by handling missing values, outliers, and duplicate entries. Feature engineering was performed to create relevant variables, such as lag features to account for seasonality, and dummy variables to encode categorical variables like product categories and store locations.

EDA was conducted to gain a deeper understanding of the data. We visualized key trends, patterns, and correlations between variables. EDA revealed insights into sales behavior, seasonality, and the impact of promotions on sales.

For this project, we employed multiple regression techniques, including linear regression, decision tree regression, and random forest regression. We assessed the performance of each model using metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared to determine the best-fitting model.
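
As a minimal sketch of how the three metrics named above are defined, the snippet below computes MAE, MSE, and R-squared with plain NumPy. The helper name `regression_metrics` and the sample numbers are illustrative assumptions, not values from this project.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and R-squared for two equally sized sequences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                      # average absolute error
    mse = np.mean(errors ** 2)                         # average squared error
    ss_res = np.sum(errors ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "R2": r2}

# Made-up daily sales and predictions, for illustration only
actual = [5000, 6000, 7000, 8000]
predicted = [5200, 5900, 7100, 7600]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

In the notebook itself the same quantities come from `sklearn.metrics`; the manual version is only meant to make the definitions explicit.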

The dataset was split into training and testing sets to evaluate the model's performance. Cross-validation techniques were employed to avoid overfitting and ensure generalizability. The model's ability to predict future sales was rigorously assessed, and hyperparameters were fine-tuned to optimize performance.

To identify the key drivers of retail sales, a feature importance analysis was conducted using the selected regression model. This analysis revealed which factors had the most significant impact on sales, enabling retailers to focus their efforts on these influential variables.

Model Deployment:
The final regression model was deployed into a user-friendly interface or integrated into the retailer's existing systems for real-time sales prediction. This allows retailers to make data-driven decisions on pricing, inventory management, and marketing strategies.

Results and Recommendations:
The predictive model achieved a high level of accuracy in forecasting retail sales, with a low Mean Absolute Error and high R-squared value. The feature importance analysis highlighted the importance of factors such as promotions, seasonality, and economic indicators in driving sales. Based on these insights, we recommend the following actions for retailers:

1. Optimize promotional strategies by identifying the most effective types and timing of promotions.
2. Align inventory management with predicted sales to minimize stockouts and overstock situations.
3. Tailor marketing efforts to leverage seasonal trends and external economic conditions.
4. Continuously monitor and update the model to adapt to changing market dynamics.

Conclusion:
In conclusion, this project demonstrates the power of regression analysis in predicting retail sales. By harnessing historical data and leveraging advanced analytics, retailers can gain a competitive edge in a dynamic industry. Accurate sales forecasts enable retailers to make data-driven decisions that enhance customer satisfaction, increase profitability, and drive business growth. As the retail landscape continues to evolve, predictive analytics will remain an essential tool for success.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Rossmann, a retail chain with over 3,000 drug stores across seven European countries, is currently dealing with the issue of predicting their daily sales for up to six weeks ahead. This prediction task is complicated by numerous factors, including promotions, competition, holidays, seasons, and the unique characteristics of each store's location. Store managers face varying challenges in accurately forecasting sales due to these diverse circumstances.**

**To address this challenge, historical sales data for 1,115 Rossmann stores has been provided. The goal is to generate sales forecasts for the "Sales" column in the test dataset, while accounting for the fact that some stores in the dataset were temporarily closed for renovations.**

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Import numpy for numerical computations and array manipulation
import numpy as np

# Import pandas for data manipulation and analysis with DataFrames
import pandas as pd

# Import plotly express for interactive data visualization
import plotly.express as px

# Import matplotlib.pyplot for static plotting and visualization
import matplotlib.pyplot as plt

# Import seaborn for higher-level statistical visualizations
import seaborn as sns

# Import datetime for working with date and time data
from datetime import datetime

# Import warnings to handle and filter warning messages
import warnings
warnings.filterwarnings('ignore')

# Import scipy.stats for statistical functions and probability distributions
import scipy.stats as stats

# Import SelectKBest, f_regression for feature selection based on statistical tests
from sklearn.feature_selection import SelectKBest, f_regression

# Import StandardScaler for feature scaling
from sklearn.preprocessing import StandardScaler

# Import train_test_split for splitting data into training and testing sets
from sklearn.model_selection import train_test_split

# Import regression models: LinearRegression, Ridge, Lasso
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Import regression metrics: r2_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score as r2, mean_squared_error as mse, mean_absolute_error as mae

# Import math module for basic mathematical operations
import math

# Import GridSearchCV, RandomizedSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Import xgboost for gradient boosting algorithms
import xgboost as xgb

# Import XGBRegressor, a scikit-learn compatible wrapper for XGBoost's regression model
from xgboost.sklearn import XGBRegressor

# Import DecisionTreeRegressor for regression based on decision trees
from sklearn.tree import DecisionTreeRegressor

# Import RandomForestRegressor for regression based on random forests
from sklearn.ensemble import RandomForestRegressor


### Dataset Loading

In [None]:
# Load Dataset(Mounting google drive)
from google.colab import drive
drive.mount('/content/drive')

In [None]:

path = "/content/drive/MyDrive/Capstone-project-2-Retail-Sales-Prediction/"
rossmann_sales = pd.read_csv(path + "Rossmann Stores Data.csv", low_memory=False)
stores = pd.read_csv(path + "store.csv")

### Dataset First View

In [None]:
# Displays all dataframe columns
pd.set_option('display.max_columns', None)

# Creating a copy of the original dataframe 'rossmann_sales' , 'stores' and assigns it to the variables 'rossmann_sales_df' and 'stores_df'.
rossmann_sales_df = rossmann_sales.copy()
stores_df = stores.copy()


In [None]:
# Printing first 5 rows and last 5 rows of rossmann_sales_df dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
pd.concat([rossmann_sales_df.head(), rossmann_sales_df.tail()])

In [None]:
# Printing first 5 rows and last 5 rows of stores_df dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
pd.concat([stores_df.head(), stores_df.tail()])

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'Rossmann sales dataset has {rossmann_sales_df.shape} rows and columns respectively.')
print(f'Stores dataset has {stores_df.shape} rows and columns respectively.')

### Dataset Information

In [None]:
#  Rossmann Sales Dataset Info
rossmann_sales_df.info()

In [None]:
#Stores dataset info
stores_df.info()

#### Duplicate Values

In [None]:
# Count duplicate values in rossmann_sales_df
print(f"Duplicate count in Rossmann Sales DataFrame: {rossmann_sales_df.duplicated().sum()}")

# Count duplicate values in stores_df
print(f"Duplicate count in Stores DataFrame: {stores_df.duplicated().sum()}")

#### Missing Values/Null Values

In [None]:
# Missing values count in rossmann_sales_df
sales_missing_count = rossmann_sales_df.isnull().sum()

# Missing values count in stores_df
stores_missing_count = stores_df.isnull().sum()

# Printing count of missing values for both dataframes
print("Null Value Counts in Rossmann Sales DataFrame:\n",sales_missing_count)
print("\nNull Value Counts in Stores DataFrame:\n",stores_missing_count)


In [None]:
# Visualizing the missing values

# Calculate the percentage of missing values in each column for stores_df
stores_missing_value_percent = (stores_df.isnull().sum() / len(stores_df)) * 100

# Create a bar plot to visualize the missing values
plt.figure(figsize=(12, 6))
plt.xticks(rotation=90)
sns.barplot(x=stores_missing_value_percent.index, y=stores_missing_value_percent.values,palette='coolwarm')
plt.xlabel('Features')
plt.ylabel('% of Missing Values')
plt.title('Percentage of Missing Values in Stores DataFrame')
plt.show()

### What did you know about your dataset?

**CompetitionDistance:** There are only 3 missing values for the CompetitionDistance column, suggesting that the majority of stores have recorded information about their nearest competitor's distance.

**CompetitionOpenSinceMonth** and **CompetitionOpenSinceYear:** Both of these columns have 354 missing values each, indicating that a considerable number of stores lack information regarding the month and year when their nearest competitor opened.

**Promo2SinceWeek** and **Promo2SinceYear:** These columns have 544 missing values each. The most likely explanation is that these stores do not participate in Promo2, the ongoing and consecutive promotion, so there is no start week or year to record.

**PromoInterval:** The PromoInterval column also contains 544 missing values, which is consistent with the same stores having no Promo2 intervals because they do not take part in Promo2.





**We are now focusing on addressing the issue of missing values within the stores dataset.**

1. The "**CompetitionDistance**" variable represents the distance in meters to the nearest competitor store. Examining the distribution of these distances shows which values are typical and how skewed they are, which helps us decide how to fill in the missing values for this variable.

In [None]:
#distribution plot of competition distance

# Create a distribution plot
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.histplot(stores_df['CompetitionDistance'], kde=True, color="skyblue", bins=30)

# Add labels and title
plt.xlabel('Competition Distance (meters)')
plt.ylabel('Frequency')
plt.title('Distribution of Competition Distances Among Stores')

# Show the plot
plt.show()

The distribution of the CompetitionDistance variable is right-skewed (positively skewed): most values are clustered towards the lower end, with a long tail of large distances. In such cases, the median is a more robust measure of central tendency than the mean because it is less influenced by outlier values.
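
The reasoning above can be sketched on a small, made-up series of distances (the numbers are illustrative assumptions, not values from `store.csv`). A positive skewness and a mean far above the median are the usual signs that median imputation is the safer default.

```python
import numpy as np
import pandas as pd

# Hypothetical competition distances in meters, with one missing value
distances = pd.Series([50, 100, 250, 500, 1000, 2000, 5000, 20000, 75000, np.nan])

skewness = distances.skew()            # > 0 means a long right tail
mean_distance = distances.mean()       # pulled upwards by the few large values
median_distance = distances.median()   # robust to those large values

# Impute the missing entry with the median, assigning the result back
filled = distances.fillna(median_distance)

print(f"skewness={skewness:.2f}, mean={mean_distance:.1f}, median={median_distance:.1f}")
```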

In [None]:
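# filling competition distance with the median value
# (assigning the result back avoids chained-assignment issues with inplace=True)
stores_df['CompetitionDistance'] = stores_df['CompetitionDistance'].fillna(stores_df['CompetitionDistance'].median())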

# Convert the column to 'int64'
stores_df['CompetitionDistance'] = stores_df['CompetitionDistance'].astype('int64')

2. Filling missing values in the **'CompetitionOpenSinceMonth'** and **'CompetitionOpenSinceYear'** columns with their respective modes is appropriate because it captures the most common values, preserving the typical patterns for competition opening dates across stores. This approach maintains the central tendency of the data, ensures data integrity, and is simple to understand. However, the choice of imputation method should always consider the dataset's specific context and problem at hand, as alternative methods may be more suitable in certain scenarios.
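
As a hedged illustration of mode imputation and of one possible alternative mentioned above, the sketch below fills a toy column with its mode while also keeping a flag that records which rows were originally missing. The toy values and the `_missing` column name are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stores table (values are made up)
toy = pd.DataFrame({"CompetitionOpenSinceMonth": [9, 9, 4, np.nan, 11, np.nan]})

# Keep a record of which rows were missing before imputation (hypothetical extra column)
toy["CompetitionOpenSinceMonth_missing"] = toy["CompetitionOpenSinceMonth"].isna().astype(int)

# mode() returns a Series because ties are possible; [0] takes the first mode
month_mode = toy["CompetitionOpenSinceMonth"].mode()[0]
toy["CompetitionOpenSinceMonth"] = toy["CompetitionOpenSinceMonth"].fillna(month_mode)

print(toy)
```

Keeping the indicator lets a downstream model distinguish real values from imputed ones, at the cost of one extra column.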

In [None]:
# Fill missing values with the mode (most frequent value) of each respective column
# (assigning the result back avoids chained-assignment issues with inplace=True)

stores_df['CompetitionOpenSinceMonth'] = stores_df['CompetitionOpenSinceMonth'].fillna(stores_df['CompetitionOpenSinceMonth'].mode()[0])
stores_df['CompetitionOpenSinceYear'] = stores_df['CompetitionOpenSinceYear'].fillna(stores_df['CompetitionOpenSinceYear'].mode()[0])


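3. Filling NaN values in the **'Promo2'** related columns (Promo2SinceWeek, Promo2SinceYear, and PromoInterval) with 0 is a reasonable choice because these values are expected to be missing only for stores that do not participate in Promo2 (Promo2 = 0). A 0 therefore acts as a consistent, easily interpretable marker for "no Promo2 start date or interval". Note that PromoInterval otherwise holds month strings, so after imputation it mixes strings with the number 0.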
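
Before imputing, it can be worth checking that the missing Promo2 fields really do line up with non-participating stores. The sketch below does this on a toy frame (the rows are made up) and then fills the gaps while keeping `PromoInterval` as a string column; the `"None"` placeholder is an assumption of this example, not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy stand-in for stores_df (values are illustrative)
toy_stores = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Promo2": [0, 1, 0, 1],
    "Promo2SinceWeek": [np.nan, 13, np.nan, 40],
    "Promo2SinceYear": [np.nan, 2010, np.nan, 2014],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", None, "Feb,May,Aug,Nov"],
})

promo2_cols = ["Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]

# True for rows where at least one Promo2 detail is missing
missing_any = toy_stores[promo2_cols].isna().any(axis=1)

# The details should be missing exactly when the store is not in Promo2
nan_only_when_not_participating = bool((missing_any == (toy_stores["Promo2"] == 0)).all())
print("NaNs coincide with Promo2 == 0:", nan_only_when_not_participating)

# Fill numeric columns with 0 and the interval with a string placeholder
filled_stores = toy_stores.fillna(
    {"Promo2SinceWeek": 0, "Promo2SinceYear": 0, "PromoInterval": "None"}
)
```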

In [None]:
# Impute NaN values in Promo2 related columns with 0
columns_to_impute = ['Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']
stores_df[columns_to_impute] = stores_df[columns_to_impute].fillna(0)


In [None]:
# Check for missing values and display the count for each column
missing_values = stores_df.isnull().sum()
print(missing_values)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# Columns of rossmann sales dataframe
print('Columns of Rossmann Sales Dataset are:\n',rossmann_sales_df.columns)

# Columns of stores dataframe
print('Columns of Stores Dataset are:\n',stores_df.columns)

In [None]:
# Dataset Describe

# Describe rossmann dataset
print('Description of Rossmann Dataset:\n',rossmann_sales_df.describe().T)

# Printing separation line between two datasets
print('-'*100)

# Describe stores dataset
print('Description of Stores Dataset:\n',stores_df.describe().T)

### Variables Description

**Rossmann Stores Data.csv** - historical data including Sales

**store.csv** - supplemental information about the stores

**Most of the fields are self-explanatory.**

**1.Id** - an Id that represents a (Store, Date) duple within the set.

**2.Store** - a unique Id for each store.

**3.Sales** - the turnover for any given day (Dependent Variable).

**4.Customers** - the number of customers on a given day.

**5.Open** - an indicator for whether the store was open: 0 = closed, 1 = open.

**6.StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None.

**7.SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools.

**8.StoreType** - differentiates between 4 different store models: a, b, c, d.

**9.Assortment** - describes an assortment level: a = basic, b = extra, c = extended. An assortment strategy in retailing involves the number and type of products that stores display for purchase by consumers.

**10.CompetitionDistance** - distance in meters to the nearest competitor store.

**11.CompetitionOpenSince**[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened.

**12.Promo** - indicates whether a store is running a promo on that day.

**13.Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating.

**14.Promo2Since**[Year/Week] - describes the year and calendar week when the store started participating in Promo2.

**15.PromoInterval** - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.

### Check Unique Values for each variable.

In [None]:
# Unique Values for each variable in rossmann_sales_df (excluding 'Date')
for column in rossmann_sales_df.columns:
    if column != 'Date':
        unique_values = rossmann_sales_df[column].unique()
        print(f"Unique Values for {column}:\n{unique_values}\n")

In [None]:
# Display the count of unique values for each variable (excluding 'Date')
for column in rossmann_sales_df.columns:
    if column != 'Date':
        unique_values_count = rossmann_sales_df[column].nunique()
        print(f"Unique Value Count for {column}: {unique_values_count}")



In [None]:
# Unique Values for each variable in stores_df
for column in stores_df.columns:
    unique_values = stores_df[column].unique()
    print(f"Unique Values for {column}:\n{unique_values}\n")

In [None]:
# Display the count of unique values for each variable in stores_df
# (stores_df has no 'Date' column, so no column needs to be skipped)
for column in stores_df.columns:
    unique_values_count = stores_df[column].nunique()
    print(f"Unique Value Count for {column}: {unique_values_count}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert "Date" column to datetime datatype
rossmann_sales_df['Date'] = pd.to_datetime(rossmann_sales_df['Date'])

# Extract year, month, day of month, and ISO week of year
rossmann_sales_df['Year'] = rossmann_sales_df['Date'].dt.year
rossmann_sales_df['Month'] = rossmann_sales_df['Date'].dt.month
rossmann_sales_df['DayOfMonth'] = rossmann_sales_df['Date'].dt.day
# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week is its replacement
rossmann_sales_df['WeekOfYear'] = rossmann_sales_df['Date'].dt.isocalendar().week.astype('int64')

# Display the updated Rossmann Sales DataFrame with added date-related columns
print("Rossmann Sales DataFrame with Date Information:")
rossmann_sales_df.head()


In [None]:
# Count how many rows have 0 sales value
num_rows_with_zero_sales = len(rossmann_sales_df[rossmann_sales_df['Sales'] == 0])
print("Number of rows with 0 sales value:", num_rows_with_zero_sales)

# Count how many rows have 0 customer value
num_rows_with_zero_customers = len(rossmann_sales_df[rossmann_sales_df['Customers'] == 0])
print("Number of rows with 0 customer value:", num_rows_with_zero_customers)

# Display the Rossmann Sales DataFrame (rows with 0 sales or 0 customers are only counted above, not removed)
print("Rossmann Sales DataFrame (zero-sales and zero-customer rows counted, not removed):")
rossmann_sales_df.head()


In [None]:
# Map 'StateHoliday' values to a binary format (1 for holidays, 0 for non-holidays)
rossmann_sales_df['StateHoliday'] = rossmann_sales_df['StateHoliday'].replace({'a': 1, 'b': 1, 'c': 1, '0': 0})

# Display updated Rossmann Sales DataFrame with 'StateHoliday' converted to binary
print("Rossmann Sales DataFrame with 'StateHoliday' converted to binary:")
rossmann_sales_df.head()


In [None]:
# Data Type Conversion for Specific Columns
stores_df['CompetitionOpenSinceMonth'] = stores_df['CompetitionOpenSinceMonth'].astype('Int64')
stores_df['CompetitionOpenSinceYear'] = stores_df['CompetitionOpenSinceYear'].astype('Int64')
stores_df['Promo2SinceWeek'] = stores_df['Promo2SinceWeek'].astype('Int64')
stores_df['Promo2SinceYear'] = stores_df['Promo2SinceYear'].astype('Int64')


In [None]:
# Data Filtering: Remove rows with 'CompetitionOpenSinceYear' values 1900 and 1961
stores_df = stores_df[~stores_df['CompetitionOpenSinceYear'].isin([1900, 1961])]

# Reset Index: Reorganize the DataFrame index after filtering
stores_df.reset_index(drop=True, inplace=True)

# Display the updated Rossmann Stores DataFrame after data filtering and index reset
print("Rossmann Stores DataFrame after filtering and index reset:")
stores_df.head()


In [None]:
# Data Merging: Merge the datasets on the 'Store' column using an inner join
merged_df = pd.merge(rossmann_sales_df, stores_df, on='Store', how='inner')

# Display the merged DataFrame containing sales and store information
print("Merged DataFrame for EDA:")
merged_df.head()


In [None]:
#Sales vs. Competition Distance
sales_competition_data = merged_df[['Sales', 'CompetitionDistance']]

# Print the extracted data to inspect it
print(sales_competition_data)

# Calculate the correlation coefficient between Sales and CompetitionDistance
correlation_coefficient = sales_competition_data['Sales'].corr(sales_competition_data['CompetitionDistance'])

# Print the correlation coefficient
print("Correlation Coefficient between Sales and CompetitionDistance is:", correlation_coefficient)


In [None]:
# Group the data by 'Year' and calculate the mean sales for each year
sales_by_year = merged_df.groupby('Year')['Sales'].mean().reset_index()

# Display the resulting DataFrame
sales_by_year


### What all manipulations have you done and insights you found?

1. Date - Changed the datatype from object to datetime and extracted the year, month, day of month, and week of year.

2. Sales & Customers - Identified the rows that have zero sales or zero customers.

3. StateHoliday - Converted the feature into a binary variable: the values 'a', 'b', and 'c' became 1, and '0' became 0. This turns the categorical codes into numeric values that can be used in numerical calculations and analysis.

4. Converting Data Types: The columns CompetitionOpenSinceMonth, CompetitionOpenSinceYear, Promo2SinceWeek, and Promo2SinceYear were stored as floats. Because they represent months, weeks, and years, which are whole numbers, they were converted to integer data types.
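
The date and holiday steps listed above can be sketched on a three-row toy frame (the rows are made up for illustration). The sketch normalizes `StateHoliday` through strings first, on the assumption that the raw column can mix the text `'0'` with the integer `0`.

```python
import pandas as pd

# Toy stand-in for the sales table (values are illustrative)
toy_sales = pd.DataFrame({
    "Date": ["2015-07-31", "2014-12-25", "2013-01-01"],
    "StateHoliday": ["0", "c", 0],   # mixed types, as can happen when reading the CSV
})

# Parse dates and derive calendar features
toy_sales["Date"] = pd.to_datetime(toy_sales["Date"])
toy_sales["Year"] = toy_sales["Date"].dt.year
toy_sales["Month"] = toy_sales["Date"].dt.month
toy_sales["DayOfMonth"] = toy_sales["Date"].dt.day
toy_sales["WeekOfYear"] = toy_sales["Date"].dt.isocalendar().week.astype(int)

# 1 for any holiday code ('a', 'b', 'c'), 0 otherwise
toy_sales["StateHoliday"] = (
    toy_sales["StateHoliday"].astype(str).isin(["a", "b", "c"]).astype(int)
)

print(toy_sales)
```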





## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **Sales Distribution**

In [None]:
# Set a larger figure size for better visualization
plt.figure(figsize=(12, 8))

# Create a histogram with a kernel density estimate (KDE) overlay
# The label gives plt.legend() below an entry to display
sns.histplot(merged_df['Sales'], kde=True, color='blue', label='Sales')

# Add a clear and descriptive title
plt.title('Distribution of Sales in Rossmann Stores', fontsize=16, fontweight='bold', color='navy')

# Label the x and y axes
plt.xlabel('Sales', fontsize=14)
plt.ylabel('Frequency', fontsize=14)

# Display a legend
plt.legend()

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?


I chose to create a histogram with a kernel density estimate (KDE) plot for visualizing the distribution of the 'Sales' variable because it is a suitable choice for several reasons. Firstly, it effectively displays how sales values are distributed across different ranges, providing insight into their frequency distribution. Secondly, as 'Sales' is a continuous numerical variable, a histogram is an appropriate visualization method for such data, allowing us to see the distribution's shape and range. Additionally, it helps identify any skewness or outliers in the data, which can be crucial for understanding sales patterns. Lastly, it enables the visualization of the central tendency of sales, which can be further validated with statistical measures like mean and median.

##### 2. What is/are the insight(s) found from the chart?

1. The histogram analysis reveals a positively skewed distribution for 'Sales,' indicating that there are more occurrences of lower sales values and fewer instances of exceptionally high sales.
2. The central tendency of the data, represented by the peak of the histogram, is observed to be in the range of 6,000 to 8,000 sales, indicating where the most common sales values are concentrated.
3. The width of the distribution illustrates the spread and variability in the sales data, with a moderate spread covering a range from approximately 0 to 40,000 sales values.
4. Notably, there are potential outliers in the data, visible as sales values that significantly deviate from the central region and extend to the right in the histogram's long tail. These outliers represent unusually high sales values compared to the majority of the data points.
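
The visual impressions above (positive skew, a long right tail of outliers) can be backed by numbers. The sketch below uses a small made-up sales series and the common 1.5 x IQR rule to flag unusually high values; both the data and the choice of rule are assumptions for illustration.

```python
import pandas as pd

# Made-up daily sales values, including a closed day (0) and two very high days
sales = pd.Series([0, 3000, 4000, 5000, 5500, 6000, 6500, 7000, 9000, 15000, 38000])

skewness = sales.skew()                 # positive for a right-skewed distribution

q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr            # values above this are flagged as high outliers

high_outliers = sales[sales > upper_fence]
print(f"skewness={skewness:.2f}, upper fence={upper_fence}, outliers={high_outliers.tolist()}")
```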





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Business Impact:

Gaining Insights into Sales Distribution: The histogram and KDE plot provide valuable insights into the distribution of sales, enabling businesses to identify common sales figures and assess the overall sales spread. This understanding aids in setting realistic sales targets, optimizing inventory management, and allocating resources effectively.

Negative Business Impact:

Skewed Distribution Challenges: A heavily skewed sales distribution, primarily concentrated at lower sales values, may pose challenges for achieving significant sales growth or expanding market share. In response, businesses may need to devise strategies to stimulate demand and attract a larger customer base.

#### Customer Engagement and Spending

In [None]:
plt.figure(figsize=(14, 6))

# Overall Title
plt.suptitle('Exploring Customer Engagement and Spending Patterns', fontsize=16, fontweight='bold', color='navy')


# Average Number of Visits per Customer
plt.subplot(1, 2, 1)
sns.histplot(merged_df.groupby('Customers')['Date'].count(), kde=True, color='purple')
plt.title('Distribution of Number of Visits per Customer')
plt.xlabel('Number of Visits')
plt.ylabel('Frequency')

# Average Spending per Customer
plt.subplot(1, 2, 2)
sns.histplot(merged_df.groupby('Customers')['Sales'].mean(), kde=True, color='orange')
plt.title('Distribution of Average Spending per Customer')
plt.xlabel('Average Spending')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Histograms with a KDE overlay, created here with Seaborn's histplot, are a natural choice for showing how a single numeric quantity is distributed. Note that `Customers` in this dataset is the number of customers a store served on a given day, not an identifier for an individual customer, so grouping by `Customers` summarizes store-days that share the same daily customer count rather than the behaviour of individual shoppers.

##### 2. What is/are the insight(s) found from the chart?

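Average Number of Visits per Customer: The analysis reports an average of about 206 "visits" per `Customers` group over the dataset's timeframe. Since `Customers` is a daily customer count rather than a customer ID, this figure describes how many store-days share a given traffic level, not how often an individual customer returns.

Average Spending per Customer: The analysis reports an average of roughly 13,917 units of currency per `Customers` group. This is the mean daily sales for store-days at a given traffic level, so it indicates how sales scale with customer traffic rather than what a single customer spends.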
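
Because the data has no customer-level identifiers, one way to approximate spending per customer is to divide daily sales by the daily customer count on days when the store was open. The sketch below shows this on a made-up frame; the `SalesPerCustomer` column name is an assumption introduced for this example.

```python
import pandas as pd

# Toy stand-in for merged_df (values are illustrative)
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Open": [1, 0, 1, 1],
    "Sales": [5000, 0, 9000, 6000],
    "Customers": [500, 0, 1000, 750],
})

# Keep open days with at least one customer to avoid dividing by zero
open_days = toy[(toy["Open"] == 1) & (toy["Customers"] > 0)].copy()

# Average basket value on each store-day
open_days["SalesPerCustomer"] = open_days["Sales"] / open_days["Customers"]

# Mean basket value per store
avg_spend_by_store = open_days.groupby("Store")["SalesPerCustomer"].mean()
print(avg_spend_by_store)
```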

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Number of Visits per Customer:

Positive Impact: High visit frequency suggests customer loyalty and satisfaction, potentially boosting customer lifetime value and positive word-of-mouth.

Negative Impact: Low visit frequency may indicate customer dissatisfaction, reducing retention and damaging the business's reputation.

Average Spending per Customer:

Positive Impact: High spending per customer increases revenue and profitability, making these customers more valuable to the business.

Negative Impact: Low spending per customer results in reduced revenue and profitability.

#### **Sales and Customer Traffic**

In [None]:
# Create subplots with two side-by-side axes
fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(20, 4))

# Promo vs. Sales
sns.barplot(x='Promo', y='Sales', data=merged_df, ax=axis1)
axis1.set_title('Impact of Promo on Sales')
axis1.set_xlabel('Promo')
axis1.set_ylabel('Sales')

# Promo vs. Customers
sns.barplot(x='Promo', y='Customers', data=merged_df, ax=axis2)
axis2.set_title('Impact of Promo on Customer Traffic')
axis2.set_xlabel('Promo')
axis2.set_ylabel('Customers')

# Overall Title
plt.suptitle('Impact of Promo on Sales and Customer Traffic', fontsize=16, fontweight='bold', color='navy')

plt.tight_layout()
plt.show()







##### 1. Why did you pick the specific chart?

The chart, a pair of side-by-side bar plots comparing average sales and customer traffic on promotional and non-promotional days, was selected because it illustrates the impact of promotions on both sales and customer engagement in a straightforward, visually informative manner.

##### 2. What is/are the insight(s) found from the chart?

We can observe a substantial increase in both sales and customer traffic during promotional periods. This indicates that promotions have a positive impact on store performance.
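
To put a number on the increase, the group means behind the bars can be compared directly. The sketch below computes the percentage uplift on promo days for a made-up frame; the values are illustrative and do not come from the Rossmann data.

```python
import pandas as pd

# Toy stand-in for merged_df (values are illustrative)
toy = pd.DataFrame({
    "Promo": [0, 0, 1, 1],
    "Sales": [4000, 5000, 7000, 8000],
    "Customers": [500, 600, 700, 800],
})

# Mean sales and customers with and without a promotion
means = toy.groupby("Promo")[["Sales", "Customers"]].mean()

# Percentage change of promo days relative to non-promo days
uplift_pct = (means.loc[1] / means.loc[0] - 1) * 100
print(means)
print(uplift_pct.round(2))
```

A difference in means like this shows association only; promo days may also differ in weekday or season.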

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights that promotions lead to higher sales and increased customer traffic can have a positive business impact. They enable businesses to boost revenue, engage customers more effectively, make data-driven decisions for marketing, optimize inventory management, gain a competitive edge, and foster customer loyalty.

#### **Average Sales and Sales Growth Trends by Week of Year**

In [None]:
# Group by week of year and get average sales and week-over-week change in total sales
average_sales = merged_df.groupby('WeekOfYear')["Sales"].mean()
pct_change_sales = merged_df.groupby('WeekOfYear')["Sales"].sum().pct_change()

# Create subplots
fig, (axis1, axis2) = plt.subplots(2, 1, sharex=True, figsize=(15, 8))

# Plot average sales by week of year
ax1 = average_sales.plot(legend=True, ax=axis1, marker='o', title="Average Sales Per Week.")
# Place ticks at the actual week numbers so the labels line up with the data points
ax1.set_xticks(average_sales.index.tolist())
ax1.set_ylabel('Sales', size=12)

# Plot week-over-week change in total sales (pct_change returns a fraction, e.g. 0.05 = 5%)
ax2 = pct_change_sales.plot(legend=True, ax=axis2, marker='o', colormap="summer", title="Sales Percent Change Per Week.")
ax2.set_xlabel('Week Of Year', size=12)
ax2.set_ylabel('Change in Sales (fraction)', size=12)

# Overall Title
plt.suptitle('Trends and Growth Insights', fontsize=16, fontweight='bold', color='navy')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was selected for visualizing average sales and sales growth by week of year because it effectively displays trends, patterns, and comparisons in time-series data. Since the data is grouped by week number only, each point aggregates the same calendar week across all years in the dataset. This choice offers clarity, simplicity, and a straightforward representation of how sales metrics change over weeks, making it accessible to a broad audience.

##### 2. What is/are the insight(s) found from the chart?

The chart helps identify sales patterns, seasonality, anomalies, and growth trends, which can guide inventory management and marketing decisions. Weekly sales fluctuate noticeably, with some weeks performing exceptionally well while others show a decline. Towards the end of the year, particularly in the final few weeks, there is a notable surge in sales.
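
The weekly aggregation behind these observations can be sketched on a tiny made-up frame. It also contrasts grouping by week number alone, which pools all years together, with grouping by year and week, which keeps each year separate; the numbers are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for merged_df (values are illustrative)
toy = pd.DataFrame({
    "Year": [2013, 2013, 2014, 2014],
    "WeekOfYear": [1, 2, 1, 2],
    "Sales": [100, 200, 300, 600],
})

# Pools the same calendar week across years
by_week = toy.groupby("WeekOfYear")["Sales"].mean()

# Keeps each (year, week) pair separate, giving a true time series
by_year_week = toy.groupby(["Year", "WeekOfYear"])["Sales"].mean()

# Week-over-week change as a fraction (first entry is NaN)
weekly_change = by_week.pct_change()

print(by_week)
print(by_year_week)
print(weekly_change)
```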

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from these charts can be valuable for making informed decisions in retail, leading to a positive business impact by optimizing operations, marketing efforts, and resource allocation. However, it's essential to analyze the underlying reasons for negative growth and take appropriate actions to mitigate any adverse effects.

#### **Sales Analysis**


In [None]:
# Create a grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

#Sales vs. StateHoliday
sns.barplot(data=merged_df, x='StateHoliday', y='Sales', palette='pastel', ci=None, ax=axes[0, 0])
axes[0, 0].set_title('Sales vs. StateHoliday')
axes[0, 0].set_xlabel('StateHoliday')
axes[0, 0].set_ylabel('Sales')

#Sales vs. SchoolHoliday
sns.barplot(data=merged_df, x='SchoolHoliday', y='Sales', palette='Set2', ci=None, ax=axes[0, 1])
axes[0, 1].set_title('Sales vs. SchoolHoliday')
axes[0, 1].set_xlabel('SchoolHoliday')
axes[0, 1].set_ylabel('Sales')

#Sales vs. StoreType
sns.barplot(data=merged_df, x='StoreType', y='Sales', palette='tab10', ci=None, ax=axes[1, 0])
axes[1, 0].set_title('Sales vs. StoreType')
axes[1, 0].set_xlabel('StoreType')
axes[1, 0].set_ylabel('Sales')

#Sales vs. Assortment
sns.barplot(data=merged_df, x='Assortment', y='Sales', palette='Set1', ci=None, ax=axes[1, 1])
axes[1, 1].set_title('Sales vs. Assortment')
axes[1, 1].set_xlabel('Assortment')
axes[1, 1].set_ylabel('Sales')

# Add a bit more space between subplots
plt.tight_layout(pad=3)

plt.show()





##### 1. Why did you pick the specific chart?


The bar plots show the average sales for each category of the respective variable, aiding the comparison of sales across groups. The color and style choices enhance visualization and category differentiation. Bar plots are apt for this categorical variable analysis, offering insights into how sales are impacted by these factors.

##### 2. What is/are the insight(s) found from the chart?

StateHoliday: Average sales are far lower on State Holidays (258.64, StateHoliday = 1.0) than on non-holidays (5945.92, StateHoliday = 0.0).

SchoolHoliday: Average sales are higher on School Holidays (6474.89, SchoolHoliday = 1.0) than on non-holidays (5619.54, SchoolHoliday = 0.0).

StoreType: StoreType b has the highest sales (10058.84) among all types, followed by StoreType a (5736.60) and StoreType c (5723.63). StoreType d has sales averaging 5639.35.

Assortment: Stores with Assortment type b have the highest sales (8553.93) compared to type a (5479.04) and type c (6057.87).
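
Since `sns.barplot` uses the mean as its default estimator, the bar heights quoted above are group means and can be cross-checked with a plain `groupby`. The following is a minimal sketch on a hypothetical frame (`demo_df` and its numbers are made up for illustration); the same pattern should apply to `merged_df` and any of the four categorical columns.

```python
import pandas as pd

# Hypothetical store-day records (illustrative values, not the Rossmann data)
demo_df = pd.DataFrame(
    {
        "StoreType": ["a", "a", "b", "b", "c", "c"],
        "Sales": [5000, 6000, 9000, 11000, 5500, 5900],
    }
)

# Mean sales per category, highest first: these are the bar heights
mean_sales_by_type = (
    demo_df.groupby("StoreType")["Sales"].mean().sort_values(ascending=False)
)

print(mean_sales_by_type)
```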

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights from the chart offer opportunities for positive business impact:

StateHoliday: Lower sales on State Holidays suggest a chance to boost revenue through special promotions and discounts on those days, attracting more customers and increasing sales.

SchoolHoliday: Higher sales during School Holidays indicate a potential market. Targeted marketing and optimized inventory can tap into this demand.

StoreType: Variations in sales by StoreType highlight successful formats (e.g., StoreType b). Businesses can replicate winning strategies across stores.

Assortment: Stores with Assortment type b have the highest sales, signaling customer preferences. Adapting product offerings to match this assortment can enhance sales.

#### **Average number of Customers**

In [None]:
# Group the DataFrame by 'DayOfWeek' and calculate the mean of 'Customers' for each group
average_customers_per_day = merged_df.groupby('DayOfWeek')[['Customers']].mean()

# Create a plot with a specified figure size, marker style, and color
axis = average_customers_per_day.plot(figsize=(10, 5), marker='^', color='b')

# Set the title for the plot
axis.set_title('Average Number of Customers per Day of the Week')

# Label the x-axis
axis.set_xlabel('Day of the Week')

# Label the y-axis
axis.set_ylabel('Average Number of Customers')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I chose the line plot because it's a useful method for illustrating trends or fluctuations in a numerical metric, such as the average number of customers, across an ordered variable like the day of the week. The '^' markers represent the average for each day, and the connecting line visually displays the overall pattern. The use of the color blue enhances the chart's aesthetics and readability.

##### 2. What is/are the insight(s) found from the chart?

The examination shows that Mondays typically experience the highest number of visitors, potentially because it marks the beginning of the workweek, and customers may be more inclined to make purchases after the weekend. In contrast, Sundays witness notably lower foot traffic, possibly because many businesses are closed or have reduced operating hours on Sundays. These findings can inform decisions about staffing and promotional tactics for different days of the week to accommodate the fluctuating patterns of customer demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact:

Efficient Staffing: Optimize staff levels for peak and off-peak days.

Tailored Promotions: Customize promotions to boost traffic on slower days.

Resource Efficiency: Allocate resources effectively for maximum returns.

Negative Insights:

Sunday Footfall Drop: Consider viability and attracting more customers on Sundays.

Weekday Decline: Analyze and address the gradual weekday footfall decrease.

#### **Average Number of Customers on State Holidays and School Holidays**

In [None]:
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Violin plot for customers vs. SchoolHoliday
sns.violinplot(data=merged_df, x='SchoolHoliday', y='Customers', ax=axes[0])
axes[0].set_title('Customers vs. SchoolHoliday')
axes[0].set_xlabel('SchoolHoliday')
axes[0].set_ylabel('Number of Customers')

# Violin plot for customers vs. StateHoliday
sns.violinplot(data=merged_df, x='StateHoliday', y='Customers', ax=axes[1])
axes[1].set_title('Customers vs. StateHoliday')
axes[1].set_xlabel('StateHoliday')
axes[1].set_ylabel('Number of Customers')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?


Violin plots are valuable for visualizing how a numerical variable is distributed within categorical groups. They offer a compact summary of data distributions, facilitating straightforward group comparisons. Furthermore, the distinct default colors assigned to each category help emphasize distinctions among groups.

##### 2. What is/are the insight(s) found from the chart?


On State Holidays, the average number of customers drops significantly to 40.13, showcasing a notable decrease in foot traffic compared to regular days. Conversely, on standard non-holiday days, customer attendance rises to an average of 651.84, indicating higher visitation rates. During School Holidays, there is a slight uptick in customer numbers, averaging 704.44, compared to the regular days, where the average customer count slightly dips to 617.67.

In summary, these insights emphasize that State Holidays have a more substantial impact on reducing customer footfall compared to School Holidays, with regular non-holiday days drawing the highest number of customers. This information can be instrumental for businesses in tailoring their staffing, inventory management, and promotional strategies around holidays to better cater to customer needs and maximize sales potential.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Business Impact:

Efficient Staffing: Adjust staffing to minimize costs on State Holidays and meet demand during School Holidays.

Tailored Promotions: Customize offers for State Holidays to attract shoppers and focus on upselling during School Holidays.

Inventory Management: Optimize stock levels for cost savings on State Holidays and ensure adequate supply during School Holidays.

These strategies enhance operational efficiency and can boost sales.

#### **Total Customers in Store Type**

In [None]:
# Create a subplot with 1 row and 2 columns, specifying the figure size
fig, axes = plt.subplots(1, 2, figsize=(8, 6))

# Plot 1: Share of Store Types
store_type_counts = merged_df["StoreType"].value_counts()
axes[0].pie(store_type_counts, labels=store_type_counts.index, autopct='%1.1f%%')
axes[0].set_title('Share of Store Types')  # Set the title for the first pie chart

# Plot 2: Customer Share by Store Type
customer_by_store_type = merged_df.groupby('StoreType')['Customers'].sum()
axes[1].pie(customer_by_store_type, labels=customer_by_store_type.index, autopct='%1.1f%%')
axes[1].set_title('Customer Share by Store Type')  # Set the title for the second pie chart

# Adjust the layout for better spacing
plt.tight_layout()

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Pie charts are great for depicting parts-to-whole relationships and are especially beneficial for highlighting the contribution of separate categories to a total or comparing the relative sizes of different categories. They are visually pleasing and provide an easy-to-understand representation of facts.

##### 2. What is/are the insight(s) found from the chart?

Store Type 'a' holds the largest portion of customers, making up the majority of the total customer base, and it also accounts for the largest share of store records.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
- Understanding customer distribution by store type helps tailor marketing efforts, enhancing customer loyalty, especially for Store Type 'a.'
- Identifying Store Type 'd' as popular allows replicating successful strategies for overall growth.

Negative Growth:
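- Store Type 'b' has small customer (4.89%) and store type (1.56%) shares. Its customer share is roughly three times its store share, so the low totals reflect how few type 'b' stores there are rather than weak per-store performance.
- Store Type 'c' has a customer share (14.33%) roughly in line with its store type share (13.48%), prompting competitive analysis for improvement.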

#### **The Impact of Competition Distance on Sales**

In [None]:
# Create a scatter plot to visualize the relationship between Sales and CompetitionDistance
plt.figure(figsize=(8, 6))
sns.scatterplot(data=sales_competition_data, x='CompetitionDistance', y='Sales')

# Label the x-axis and y-axis
plt.xlabel('Competition Distance')
plt.ylabel('Sales')

# Set the title for the plot and include the correlation coefficient
plt.title(f'Correlation between Sales and Competition Distance\nCorrelation Coefficient: {correlation_coefficient:.2f}')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?


The scatter plot serves as a tool to help us visually understand how the data points are distributed and whether there might be a connection between sales and competition distance. Meanwhile, the correlation coefficient provides us with a numeric way to gauge both the strength and direction of this relationship. If the correlation coefficient is positive, it indicates a positive connection, whereas a negative coefficient suggests a negative relationship. On the other hand, a value close to zero implies a weak or negligible correlation.

##### 2. What is/are the insight(s) found from the chart?


A correlation coefficient of -0.0189 suggests that there is a weak or almost non-existent linear connection between sales and competition distance within the dataset. From a business perspective, this implies that competition distance by itself is unlikely to be a strong indicator of sales performance, and there are likely other factors that exert a more substantial influence on retail store sales.
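
The plotting cell above relies on `sales_competition_data` and `correlation_coefficient` being defined elsewhere in the notebook. As a hedged sketch of how such a coefficient can be computed, the example below drops rows with a missing distance and uses pandas' Pearson correlation. The frame `demo_df` is synthetic, with independent random columns, so its coefficient is not the notebook's value.

```python
import numpy as np
import pandas as pd

# Synthetic data: distance and sales are generated independently (assumption)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "CompetitionDistance": rng.uniform(50, 20000, size=500),
        "Sales": rng.normal(6000, 1500, size=500),
    }
)
# Simulate a few missing distances (every 50th row)
demo_df.loc[::50, "CompetitionDistance"] = np.nan

# Keep only rows where both values are present, then compute Pearson's r
pairs = demo_df[["CompetitionDistance", "Sales"]].dropna()
r = pairs["CompetitionDistance"].corr(pairs["Sales"])

print(f"Pearson r = {r:.4f}")
```

Pearson's r only measures linear association, so a value near zero does not rule out a non-linear relationship between the two columns.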

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The negligible correlation between sales and competition distance underscores the need for a more comprehensive approach to enhancing sales performance. Competition distance is just one of several factors that can affect sales, so depending solely on it is unlikely to yield substantial business improvements. Instead, companies should embrace a holistic strategy that emphasizes understanding customer preferences, leveraging competitive advantages, and addressing store-specific variables to achieve meaningful growth.

#### **Sales Variations : Daily Trends within a Month and Monthly Trends within a Year**

In [None]:
# Create subplots with 1 row and 2 columns, specifying the figure size
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Sales vs. Day of Month
sns.lineplot(data=merged_df, x='DayOfMonth', y='Sales', ax=axes[0])

# Set title and labels for the first subplot
axes[0].set_title('Sales vs. Day of Month')
axes[0].set_xlabel('Day of Month')
axes[0].set_ylabel('Sales')
axes[0].grid(True)  # Add grid lines

# Plot 2: Sales vs. Month
sns.lineplot(data=merged_df, x='Month', y='Sales', ax=axes[1])

# Set title and labels for the second subplot
axes[1].set_title('Sales vs. Month')
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Sales')
axes[1].grid(True)  # Add grid lines

# Adjust the layout for better spacing
plt.tight_layout()

# Display the subplots
plt.show()


##### 1. Why did you pick the specific chart?

When examining the relationship between Sales and Day of Month, a line plot is useful for revealing any notable sales patterns or fluctuations throughout the month. This enables us to detect potential high or low points in sales on specific days.

On the other hand, when analyzing Sales vs. Month, a line plot is valuable for presenting the broader sales trend across various months. It allows us to identify any recurring seasonal patterns or variations in sales over the course of the year.

##### 2. What is/are the insight(s) found from the chart?

Sales vary throughout the month with notable fluctuations. For instance, the 30th day sees higher sales (7295.48), while the 25th (4822.33) and 26th (4835.85) have lower averages.

Across months, sales also differ. December (6824.83) and July (6063.74) show higher averages, while January (5463.71) and May (5488.73) have lower sales.

This suggests seasonal patterns, particularly with higher sales during December's holiday season and lower sales in January and May. Daily fluctuations may be influenced by factors like weekends, paydays, or promotions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Business Impact:

Seasonal Planning: Recognizing seasonal patterns enables proactive preparation for high-demand periods, boosting sales and customer satisfaction.

Targeted Promotions: Identifying sales fluctuations on specific days facilitates effective promotions to attract customers and increase revenue.

Resource Allocation: Insights into sales variations optimize resource allocation, from staff scheduling to inventory management, reducing operational costs.

Negative Growth Mitigation:

Addressing Low-Sales Periods: Identifying low-sales months allows for cost-saving measures and innovative marketing strategies to mitigate revenue decline.

Inventory Management: Understanding sales patterns prevents overstocking during slow periods, reducing wastage and costs.

Adapting Business Strategies: Adapting strategies based on insights ensures competitiveness and resilience in changing market conditions.

#### **Average Sales Per Year**

In [None]:
# Create a line plot to visualize Sales vs. Year
plt.figure(figsize=(8, 6))
sns.lineplot(data=sales_by_year, x='Year', y='Sales', marker='o', color='b')

# Set the title for the plot
plt.title('Average Sales vs. Year')

# Label the x-axis and y-axis
plt.xlabel('Year')
plt.ylabel('Average Sales')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

I opted for a line plot to depict Sales against Year because it's well suited to showing how a continuous variable (Sales) changes across ordered time periods (Years). Line plots are especially useful when the goal is to track how a variable evolves over time.

##### 2. What is/are the insight(s) found from the chart?

Sales have been consistently increasing over the three-year period, with a noticeable upward trend. The growth was observed from 2013 to 2014 and continued to rise in 2015. This indicates positive business growth and an expanding customer base. However, potential seasonal fluctuations in sales throughout each year would require further investigation for a comprehensive understanding.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The analysis reveals a consistent upward trend in average sales from 2013 to 2015, indicating business growth, improved customer attraction, and increased market demand. This positive trend can boost revenues, enhance the brand's reputation, and guide the business in meeting customer preferences.

#### **Fourier Analysis for seasonality**

In [None]:
# Store Type 'A': Select sales data for Store 11
df_store_sales = merged_df.loc[merged_df['Store'] == 11]['Sales']

# Perform Fast Fourier Transform (FFT) on the sales data
Y = np.fft.fft(df_store_sales.values)   #FFT is applied to the sales data using np.fft.fft, which converts the time-domain signal into the frequency domain.

# Calculate the corresponding frequencies
freq = np.fft.fftfreq(len(Y), 1)    #The frequencies corresponding to each data point are obtained using np.fft.fftfreq

# Get the number of data points
n = len(freq)

# Create a figure and plot the frequency domain representation
plt.figure()
plt.plot(freq[:int(n / 2)], np.abs(Y)[:int(n / 2)])
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()


In [None]:
# Store Type 'B': Select sales data for Store 259
df_store_259_sales = merged_df.loc[merged_df['Store'] == 259]['Sales']

# Perform Fast Fourier Transform (FFT) on the sales data
Y = np.fft.fft(df_store_259_sales.values)

# Calculate the corresponding frequencies
freq = np.fft.fftfreq(len(Y), 1)

# Get the number of data points
n = len(freq)

# Create a figure and plot the frequency domain representation
plt.figure()
plt.plot(freq[:int(n / 2)], np.abs(Y)[:int(n / 2)])
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()


In [None]:
# Store Type 'C': Select sales data for Store 4
df_store_4_sales = merged_df.loc[merged_df['Store'] == 4]['Sales']

# Perform Fast Fourier Transform (FFT) on the sales data
Y = np.fft.fft(df_store_4_sales.values)

# Calculate the corresponding frequencies
freq = np.fft.fftfreq(len(Y), 1)

# Get the number of data points
n = len(freq)

# Create a figure and plot the frequency domain representation
plt.figure()
plt.plot(freq[:int(n / 2)], np.abs(Y)[:int(n / 2)])
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()


In [None]:
# Store Type 'D': Select sales data for Store 15
df_store_15_sales = merged_df.loc[merged_df['Store'] == 15]['Sales']

# Perform Fast Fourier Transform (FFT) on the sales data
Y = np.fft.fft(df_store_15_sales.values)

# Calculate the corresponding frequencies
freq = np.fft.fftfreq(len(Y), 1)

# Get the number of data points
n = len(freq)

# Create a figure and plot the frequency domain representation
plt.figure()
plt.plot(freq[:int(n / 2)], np.abs(Y)[:int(n / 2)])
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()


##### 1. Why did you pick the specific chart?

The charts above are frequency-domain representations (amplitude spectra) of each store's sales. This choice makes it possible to identify peaks and periodic patterns in the sales data, serving three purposes:

Frequency Analysis: Using Fast Fourier Transform (FFT) to transform time-domain data into the frequency domain.

Peak Detection: Identifying dominant frequencies or periodic patterns in the original data by observing peaks in the frequency domain.

Visualization: Providing a clear and intuitive visualization of frequency components, making it easier to spot patterns.

##### 2. What is/are the insight(s) found from the chart?

The presence of spikes at specific frequencies in the above graphs suggests that there is a recurring pattern or seasonality in the store sales data. Therefore, we can leverage these Fourier features to capture and represent the seasonality inherent in the data.
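
To make the reading of such spikes concrete, the sketch below builds a synthetic daily series with a known 7-day cycle, removes its mean so the zero-frequency term does not dominate, and converts the strongest peak back into a period in days. The series, its amplitude, and the noise level are assumptions chosen for illustration; `np.fft.rfft` is used here instead of the full `np.fft.fft` because the input is real-valued.

```python
import numpy as np

# Synthetic daily sales: constant level + weekly cycle + noise (assumed values)
n_days = 364  # 52 whole weeks
t = np.arange(n_days)
rng = np.random.default_rng(42)
sales = 5000 + 1200 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 100, n_days)

# Subtract the mean so the zero-frequency (DC) component does not dominate
spectrum = np.abs(np.fft.rfft(sales - sales.mean()))
freqs = np.fft.rfftfreq(n_days, d=1)  # in cycles per day

# Strongest non-zero frequency and the period it corresponds to
peak = np.argmax(spectrum[1:]) + 1
dominant_period_days = 1 / freqs[peak]

print(f"Dominant period: {dominant_period_days:.2f} days")
```

A peak at frequency f (in cycles per day) corresponds to a pattern repeating every 1/f days, so a spike near 0.143 indicates a weekly cycle.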

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying Seasonal Trends: If the frequency analysis reveals strong, recurring patterns or seasonal trends in sales data, businesses can use this information to optimize inventory, marketing campaigns, and staffing levels. For example, if there is a clear holiday sales peak, the business can plan promotional events and inventory stocking accordingly to maximize revenue during those periods.

Improving Forecasting: Understanding the dominant frequencies in sales data can enhance sales forecasting accuracy. This, in turn, can lead to better inventory management, reduced waste, and improved customer satisfaction through product availability.

Marketing and Promotion: Recognizing patterns in sales can inform marketing and promotional strategies. For instance, if there is a monthly or quarterly sales peak, businesses can plan targeted marketing efforts during those periods to capitalize on consumer behavior.

#### **Correlation Heatmap**

In [None]:
# Define the columns to drop from the DataFrame to focus on meaningful numeric columns
columns_to_drop = ['Store', 'Year', 'Month', 'DayOfMonth']

# Create a new DataFrame 'corr_df' by dropping the specified columns
corr_df = merged_df.drop(columns=columns_to_drop)

# Create a correlation heatmap
plt.figure(figsize=(16, 10))  # Set the figure size
# numeric_only=True skips any remaining non-numeric columns (needed in pandas >= 2.0)
sns.heatmap(corr_df.corr(numeric_only=True), cmap="coolwarm", annot=True)

# Display the heatmap
plt.show()


##### 1. Why did you pick the specific chart?


A heatmap serves as a valuable visual tool for swiftly detecting patterns of association among numerical attributes within the dataset. It aids in grasping which attributes exhibit more pronounced connections with one another, offering insights that can be beneficial for subsequent analysis or modeling endeavors.

##### 2. What is/are the insight(s) found from the chart?

1. There is a strong positive correlation between Sales and Customers, with a correlation coefficient close to 1.00. This implies that as the number of customers increases, sales tend to rise as well. It's a logical relationship where higher customer traffic leads to increased sales.

2. Sales and Promo show a positive correlation, but it's not very strong, with a correlation coefficient of approximately 0.38. This suggests that promotional activities (Promo) have a positive effect on sales, but other factors also contribute significantly to sales variations.

3. A weak negative correlation exists between Sales and CompetitionDistance, with a correlation coefficient of around -0.12. This implies that stores located farther from their competitors tend to have slightly lower sales. However, it's important to note that this correlation is not very strong, suggesting that other factors have a more substantial impact on sales.






#### **Pair Plot**

In [None]:
# Select the important features from the DataFrame
selected_features = ['Sales', 'Customers', 'Promo', 'CompetitionDistance', 'Month', 'Year']
selected_data = merged_df[selected_features]

# Create a pairplot
sns.pairplot(selected_data, diag_kind='kde')
plt.show()


##### 1. Why did you pick the specific chart?

Pairplot is employed to construct a matrix of scatter plots that illustrates the connections between various variables within a dataset simultaneously. It serves as a valuable tool for investigating the interrelationships among variables and uncovering any noteworthy patterns or trends present in the data.

Utilizing pairplot can aid in pinpointing specific aspects of the data that warrant more in-depth examination, potentially revealing valuable insights that can guide decision-making in retail planning and marketing endeavors.

##### 2. What is/are the insight(s) found from the chart?

*Sales and Customers:*

The scatter plot reveals a positive linear relationship, meaning that as the number of customers increases, sales tend to rise. This aligns with the strong positive correlation seen in the correlation heatmap.

*Sales and Open:*

`Open` is not among the features selected for this pair plot, but the relationship is straightforward: sales are higher when stores are open (Open=1), as closed stores (Open=0) typically have lower or zero sales.

*Sales and CompetitionDistance:*

There is a weak correlation between Sales and CompetitionDistance, implying that competition distance has a limited impact on sales.

*Sales and Month:*

Sales vary throughout the year, with some months experiencing higher sales than others, as shown in the pair plot.

*Sales and Year:*

Average sales have consistently increased over the years, with the highest sales recorded in 2015, as indicated in the pair plot.

## Conclusions of EDA:

- Store Type 'a' has the largest share of stores and customers, while Store Type 'b' has the highest average sales.

- Sales show a strong positive correlation with the number of customers.

- Promotions are associated with higher sales and customer numbers, although the Sales-Promo correlation (about 0.38) is only moderate.

- Stores that remain open during School Holidays achieve higher sales compared to regular days.

- More stores operate during School Holidays than during State Holidays.

- Sales spike during Christmas week, possibly because people purchase more beauty products during the holiday season.

- Fourier decomposition analysis of sales data reveals a seasonality component.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Statement 1:** Sales tend to be higher, on average, during school holidays compared to the average sales during days that are not school holidays.

**Statement 2:** The sales at stores categorized as StoreType 'b' are notably greater than those at stores categorized as StoreType 'a'.

**Statement 3:** There exists a notable and positive relationship between the quantity of customers and the sales.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):**

H0: The average sales on school holidays are equal to or less than the average sales on non-school holidays.

**Alternative Hypothesis (H1):**

H1: The average sales on school holidays are higher than the average sales on non-school holidays.

#### 2. Perform an appropriate statistical test.

In [None]:
# Import the required library for statistical tests
import scipy.stats as stats

# Filter the data for school holidays and non-school holidays
sales_school_holiday = merged_df[merged_df['SchoolHoliday'] == 1]['Sales']
sales_non_school_holiday = merged_df[merged_df['SchoolHoliday'] == 0]['Sales']

# Perform a t-test to assess if the average sales on school holidays are significantly greater than non-school holidays
t_stat, p_value = stats.ttest_ind(sales_school_holiday, sales_non_school_holiday, alternative='greater')

# Output the results of the t-test
print("T-Statistic:", t_stat)
print("P-Value:", p_value)

# Interpretation
alpha = 0.05  # Set your desired significance level (usually 0.05)
if p_value < alpha:
    print("Reject the null hypothesis: The average sales on school holidays are significantly greater than non-school holidays.")
else:
    print("Fail to reject the null hypothesis: There is insufficient evidence to conclude that the average sales on school holidays are greater than non-school holidays.")


##### Which statistical test have you done to obtain P-Value?

An independent two-sample, one-sided t-test (`scipy.stats.ttest_ind` with `alternative='greater'`) was performed to obtain the p-value.

##### Why did you choose the specific statistical test?

It's an independent two-sample t-test with the alternative set to 'greater'. This type of t-test compares the means of two independent groups (in this case, sales on school holidays and on non-school holidays), and the one-sided alternative asks specifically whether the school-holiday mean is higher. The p-value obtained from the t-test indicates whether the observed difference is statistically significant at the chosen level.
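
By default `ttest_ind` assumes both groups share the same variance. When group sizes and spreads differ noticeably, Welch's variant (`equal_var=False`) is a common alternative. The sketch below shows it on synthetic samples; the group means, spreads, and sizes are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal sizes and variances (assumed values)
rng = np.random.default_rng(1)
holiday_sales = rng.normal(6500, 3000, size=400)
regular_sales = rng.normal(5600, 2000, size=2000)

# One-sided Welch's t-test: H1 is that the holiday mean is greater
t_stat, p_value = stats.ttest_ind(
    holiday_sales, regular_sales, equal_var=False, alternative="greater"
)

print(f"T-Statistic: {t_stat:.3f}")
print(f"P-Value: {p_value:.3g}")
```

With samples as large as those in this project, the two variants usually lead to the same conclusion, but Welch's test avoids relying on the equal-variance assumption.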

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):**

H0: The sales for stores with StoreType 'b' are equal to or less than the sales for stores with StoreType 'a'.

**Alternative Hypothesis (H1):**

H1: The sales for stores with StoreType 'b' are significantly higher than the sales for stores with StoreType 'a'.

#### 2. Perform an appropriate statistical test.

In [None]:
# Import the necessary libraries
import scipy.stats as stats

# Extract sales data for stores with StoreType 'b'
sales_store_type_b = merged_df[merged_df['StoreType'] == 'b']['Sales']

# Extract sales data for stores with StoreType 'a'
sales_store_type_a = merged_df[merged_df['StoreType'] == 'a']['Sales']

# Perform an independent t-test
t_statistic, p_value = stats.ttest_ind(sales_store_type_b, sales_store_type_a, alternative='greater')

# Output the results
print("Independent T-Test Results:")
print("-----------------------------")
print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")

# Interpret the results (one-sided test: H1 is that 'b' sales are greater than 'a' sales)
alpha = 0.05  # Significance level
print("\nInterpretation:")
if p_value < alpha:
    print("The p-value is less than the significance level (alpha), so we reject the null hypothesis.")
    print("There is evidence that sales at StoreType 'b' are significantly higher than at StoreType 'a'.")
else:
    print("The p-value is not less than alpha, so we fail to reject the null hypothesis.")
    print("There is insufficient evidence that sales at StoreType 'b' are higher than at StoreType 'a'.")


##### Which statistical test have you done to obtain P-Value?

I conducted an independent two-sample, one-sided t-test to assess whether the mean sales of StoreType 'b' are significantly higher than those of StoreType 'a'. The resulting p-value from this test aids in determining the statistical significance of this mean difference.

##### Why did you choose the specific statistical test?


I chose the independent two-sample t-test for Statement 2 because it's suitable for comparing the means of two distinct groups (StoreType 'b' and StoreType 'a'). This test is commonly used when you have two groups, and you want to assess if there's a statistically significant difference between their averages.

In our case, we're comparing sales for two different store types, 'b' and 'a.' The t-test helps us evaluate if the average sales of 'b' stores are significantly higher than those of 'a' stores by assessing the statistical significance of the observed sales difference.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):**

H0: There is no significant correlation between the number of customers and sales (correlation coefficient equals zero or is negative).

**Alternative Hypothesis (H1):**

H1: There is a significant positive correlation between the number of customers and sales (correlation coefficient is greater than zero).

#### 2. Perform an appropriate statistical test.

In [None]:
import scipy.stats as stats

# Calculate the Pearson correlation coefficient and a one-sided p-value
# alternative='greater' matches H1 (correlation > 0); requires SciPy >= 1.9
customers = merged_df['Customers']
sales = merged_df['Sales']
correlation_coefficient, p_value = stats.pearsonr(customers, sales, alternative='greater')

# Display the results
print("Pearson Correlation Coefficient:", correlation_coefficient)
print("P-value:", p_value)

# Interpret the results
alpha = 0.05  # Set the significance level
if p_value < alpha:
    print("Conclusion: There is a significant positive correlation between the number of customers and sales.")
else:
    print("Conclusion: There is insufficient evidence of a positive correlation between the number of customers and sales.")


##### Which statistical test have you done to obtain P-Value?

I employed the Pearson correlation coefficient test to calculate the p-value for Statement 3. This coefficient assesses the linear association between two continuous variables, making it appropriate for discerning any significant positive or negative correlation between customer count and sales in our dataset.

##### Why did you choose the specific statistical test?

We chose the Pearson correlation coefficient test because both 'Customers' and 'Sales' are continuous variables, and we aim to assess their linear relationship. By calculating the correlation coefficient and its p-value, we can ascertain if there is a significant correlation between customer count and sales.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Check for missing values in each column of the merged dataset
missing_values_count = merged_df.isnull().sum()
missing_values_df = pd.DataFrame({'Column Name': missing_values_count.index, 'Missing Values': missing_values_count.values})
print(missing_values_df)

#### What all missing value imputation techniques have you used and why did you use those techniques?

I utilized median imputation for numerical data and mode imputation for categorical data to handle missing values. Median imputation maintains the data's central tendency and is robust against outliers, making it suitable for skewed numerical data. On the other hand, mode imputation, which replaces missing values with the most frequent category, is ideal for categorical data with a limited number of unique categories, preserving the prevalent category distribution.
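The imputation strategy described above could look roughly like the sketch below. The toy frame and its missing entries are hypothetical. Only the column names are borrowed from the Rossmann store table.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical gaps (illustration only)
toy = pd.DataFrame({
    "CompetitionDistance": [500.0, np.nan, 1200.0, 30000.0, np.nan],
    "PromoInterval": ["Jan,Apr,Jul,Oct", None, "Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", None],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Median is robust to the extreme 30000 value
        fill_value = imputed[col].median()
    else:
        # Mode is the most frequent category
        fill_value = imputed[col].mode().iloc[0]
    imputed[col] = imputed[col].fillna(fill_value)

print(imputed)
```

In this toy example the numeric gaps are filled with 1200.0 and the categorical gaps with "Jan,Apr,Jul,Oct".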

### 2. Handling Outliers

In [None]:
# Select only numerical columns for box plot visualization
numerical_cols = merged_df.select_dtypes(include='number').columns

# Create a larger and more visually appealing figure
plt.figure(figsize=(14, 8))

# Create a box plot for numerical columns to visualize potential outliers
sns.boxplot(data=merged_df[numerical_cols], orient="v", palette="Set2")

# Set the title and adjust font size
plt.title("Box Plot for Numerical Columns (Identifying Outliers)", fontsize=16)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Label the y-axis
plt.ylabel("Values", fontsize=12)

# Add a grid for better visualization
plt.grid(axis="y", linestyle="--", alpha=0.7)

# Show the plot
plt.tight_layout()  # Ensures that the labels fit within the figure area
plt.show()


In [None]:
# Identify numerical columns with potential outliers
numerical_cols = merged_df.select_dtypes(include='number').columns

# Set the z-score threshold for identifying outliers
z_score_threshold = 3

# Dictionary to store the percentage of outliers for each numerical column
percentage_of_outliers = {}

# Loop through each numerical column and calculate the percentage of outliers
for col in numerical_cols:
    col_mean = merged_df[col].mean()
    col_std = merged_df[col].std()
    z_scores = np.abs((merged_df[col] - col_mean) / col_std)
    num_outliers = len(merged_df[z_scores > z_score_threshold])
    percentage = (num_outliers / len(merged_df)) * 100
    percentage_of_outliers[col] = percentage

# Print the percentage of outliers for each numerical column
for col, percentage in percentage_of_outliers.items():
    print(f"Percentage of outliers in {col}: {percentage:.2f}%")


**Removing Outliers from Key Retail Metrics: Sales, Customers, StateHoliday, CompetitionDistance, and CompetitionOpenSinceYear**

In [None]:
# Create a copy of the DataFrame without outliers
outlier_free_df = merged_df.copy()

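# Loop through each numerical column and remove the outliers
for col in numerical_cols:

    # Compute this column's own mean and standard deviation
    # (otherwise the values left over from the previous cell's last column are reused)
    col_mean = merged_df[col].mean()
    col_std = merged_df[col].std()

    # Get the z-scores for all the values in the column
    z_scores = np.abs((outlier_free_df[col] - col_mean) / col_std)

    # Identify the outlier indices
    outlier_indices = z_scores[z_scores > z_score_threshold].index

    # Remove the outliers from the DataFrame
    outlier_free_df = outlier_free_df.drop(outlier_indices)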

# Calculate the number of rows in the outlier-free DataFrame
num_rows_outlier_free = len(outlier_free_df)

# Print the result
print(f"The outlier-free DataFrame contains {num_rows_outlier_free} rows")



##### What all outlier treatment techniques have you used and why did you use those techniques?

**Outlier Treatment Techniques Used:**

1. **Z-score Method:** This technique calculates the z-score for each data point, measuring its deviation from the column mean in units of standard deviation. Points with z-scores greater than 3 or less than -3 are treated as outliers. We chose this method for its simplicity. Its main limitation is that the mean and standard deviation are themselves pulled by extreme values, so it works best on roughly symmetric, bell-shaped columns.

2. **Shapiro-Wilk Test:** This test assesses whether a column is consistent with a normal distribution. A p-value below 0.05 indicates non-normality, which may stem from outliers or from skewness; the test does not identify individual outliers. We used it as a check on the normality assumption behind the z-score cut-off rather than as a removal method in its own right.
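The z-score rule can be sketched as a single boolean mask built from per-column statistics, so the data are filtered once instead of repeatedly. The helper name `zscore_outlier_mask` and the synthetic values are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

def zscore_outlier_mask(df, columns, threshold=3.0):
    """Return a boolean Series: True where |z| > threshold in any listed column."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        std = df[col].std()
        if std == 0 or np.isnan(std):
            continue  # constant or empty column: z-score is undefined
        z = (df[col] - df[col].mean()) / std
        mask |= z.abs() > threshold
    return mask

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Sales": rng.normal(6000, 1500, size=1000),
    "Customers": rng.normal(700, 150, size=1000),
})
toy.loc[0, "Sales"] = 40000      # planted extreme value
toy.loc[1, "Customers"] = 5000   # planted extreme value

mask = zscore_outlier_mask(toy, ["Sales", "Customers"])
cleaned = toy.loc[~mask]
print(f"Flagged {mask.sum()} of {len(toy)} rows")
```

Because every column's statistics are computed on the same unfiltered frame, the result does not depend on the order in which columns are processed.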

### 3. Categorical Encoding

In [None]:
# Get the names of categorical columns
categorical_columns = merged_df.select_dtypes(include='object').columns

# Print the list of categorical columns
print(f"Categorical Columns: {list(categorical_columns)}")


In [None]:
# Define a list of columns to one-hot encode
columns_to_encode = ['StoreType', 'Assortment', 'PromoInterval']

# Use the get_dummies function with the prefix parameter to add a prefix to the new columns
merged_df = pd.get_dummies(merged_df, columns=columns_to_encode, drop_first=True, prefix=columns_to_encode)


#### What all categorical encoding techniques have you used & why did you use those techniques?

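One-hot encoding is chosen for non-ordinal categorical variables, creating binary columns that mark the presence or absence of each category in the original data. Because `drop_first=True` is set, the first category of each variable is dropped and acts as the baseline, which avoids perfectly collinear columns in linear models. This ensures machine learning models interpret the categories correctly, especially for nominal variables with no inherent order.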
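To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The values are made up. Only the `StoreType` levels mirror the dataset.

```python
import pandas as pd

# Toy data: four store types, as in the Rossmann store table
toy = pd.DataFrame({"StoreType": ["a", "b", "c", "d", "a"], "Sales": [5, 9, 6, 7, 4]})

# drop_first=True removes the first level ('a'), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["StoreType"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
# ['Sales', 'StoreType_b', 'StoreType_c', 'StoreType_d']
```

A row with `StoreType == 'a'` is represented by zeros in all three dummy columns.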

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Drop the 'Store' identifier column from the DataFrame
merged_df.drop('Store', axis=1, inplace=True)


In [None]:
# Drop the 'Date' column from the DataFrame
merged_df.drop('Date', axis=1, inplace=True)


In [None]:
# Drop the 'CompetitionOpenSinceMonth' and 'CompetitionOpenSinceYear' columns from the DataFrame
merged_df.drop(['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear'], axis=1, inplace=True)


In [None]:
# Remove the 'Promo2', 'Promo2SinceWeek', and 'Promo2SinceYear' columns from the DataFrame
merged_df.drop(['Promo2', 'Promo2SinceWeek', 'Promo2SinceYear'], axis=1, inplace=True)

# Calculate the number of observations for closed stores with zero sales
closed_stores_with_zero_sales = merged_df[(merged_df['Open'] == 0) & (merged_df['Sales'] == 0)]
num_closed_stores_with_zero_sales = closed_stores_with_zero_sales.shape[0]

# Display the number of observations
print(f"Number of observations for closed stores with zero sales: {num_closed_stores_with_zero_sales}")

In [None]:
# Closed stores have zero sales, so keep only the rows where the store was open
# (.copy() avoids a SettingWithCopyWarning on the in-place drop below)
merged_df = merged_df[merged_df['Open'] != 0].copy()
merged_df.drop('Open', axis=1, inplace=True)

In [None]:
# Display the first few rows of the DataFrame 'merged_df'
merged_df.head()

#### 2. Feature Selection

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# Separate the feature matrix 'X' and the target variable 'y'
X = merged_df.drop(columns=['Sales'])  # Features
y = merged_df['Sales']  # Target variable

# Number of top features to select
k = 10

# Perform feature selection using the univariate F-test for regression (f_regression)
selector = SelectKBest(score_func=f_regression, k=k)
X_selected = selector.fit_transform(X, y)

# Get the indices of the selected features
selected_feature_indices = selector.get_support(indices=True)

# Get the names of the selected features
selected_feature_names = X.columns[selected_feature_indices]

# Get the F-scores of the selected features
selected_feature_scores = selector.scores_[selected_feature_indices]

# Now, 'X_selected' contains only the selected features, and 'selected_feature_names' contains their names.
# Note: the data-splitting cell below passes the full 'X' to train_test_split, not 'X_selected'.
# Print the selected features and their corresponding F-scores
print("Selected Features:")
for feature, score in zip(selected_feature_names, selected_feature_scores):
    print(f"{feature}: F-score = {score}")


##### What all feature selection methods have you used and why?

In our merged dataset, we used SelectKBest with the univariate F-test for regression (`f_regression`) for feature selection. This scoring function is suitable for regression tasks, like predicting 'Sales,' a continuous target variable, because it ranks each feature by the strength of its linear association with the target. It helps us choose the most relevant features, reducing model complexity and the risk of overfitting.

##### Which all features you found important and why?

The F-scores highlight the 10 features most strongly associated with 'Sales':

1. DayOfWeek: Consumer behavior varies by day.
2. Customers: Foot traffic significantly impacts sales.
3. Promo: Promotions affect sales.
4. SchoolHoliday: Sales differ during school holidays.
5. StoreType: Different store types have varying sales patterns.
6. Assortment: Product offerings impact sales.
7. CompetitionDistance: Proximity to competitors influences sales.
8. PromoInterval_Feb,May,Aug,Nov: Promotions in specific months boost sales.
9. PromoInterval_Jan,Apr,Jul,Oct: Certain months' promotions also matter.
10. PromoInterval_Mar,Jun,Sept,Dec: Different months' promotions have diverse effects.
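The ranking step behind such a list can be sketched as follows. The data are synthetic and the `Noise1`/`Noise2` columns are hypothetical. The sketch only illustrates how `SelectKBest` with `f_regression` separates informative features from uninformative ones.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 500

# Synthetic features: two informative, two pure noise
X_toy = pd.DataFrame({
    "Customers": rng.normal(700, 150, n),
    "Promo": rng.integers(0, 2, n),
    "Noise1": rng.normal(size=n),
    "Noise2": rng.normal(size=n),
})
y_toy = 8 * X_toy["Customers"] + 1000 * X_toy["Promo"] + rng.normal(0, 300, n)

# Score every feature with a univariate F-test and keep the top two
selector = SelectKBest(score_func=f_regression, k=2).fit(X_toy, y_toy)
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
selected = X_toy.columns[selector.get_support()].tolist()

print(scores.round(1))
print("Selected:", selected)
```

Because each feature is scored on its own, this test only captures linear, one-at-a-time relationships. Interactions between features are not reflected in the scores.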

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Keep only rows with positive 'Sales', since log(0) is negative infinity
merged_df = merged_df[merged_df['Sales'] > 0].copy()

# Take the natural logarithm (base e) of the 'Sales' column in the 'merged_df' DataFrame
merged_df['Sales'] = np.log(merged_df['Sales'])

# Note: 'X' and 'y' were created in the Feature Selection cell above, so they must be
# re-created from 'merged_df' for this transformation to reach the models


We used a log transformation on 'Sales' because it had positive skewness (skewed towards higher values) and a long right tail (presence of extreme values). This transformation helps normalize the data, making it more suitable for modeling and less sensitive to outliers.
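When a model is trained on a log-transformed target, its predictions need to be exponentiated before errors are reported in sales units. The sketch below illustrates that round trip on synthetic data. The power-law relationship and the noise level are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
customers = rng.uniform(200, 1500, size=400)

# Hypothetical right-skewed target: multiplicative noise around a power-law trend
sales = 9.0 * customers ** 0.95 * rng.lognormal(mean=0.0, sigma=0.15, size=400)

features = np.log(customers).reshape(-1, 1)
log_sales = np.log(sales)          # safe here because every value is > 0

model = LinearRegression().fit(features, log_sales)
pred_log = model.predict(features)
pred_sales = np.exp(pred_log)      # back-transform before reporting errors

mae_original_units = mean_absolute_error(sales, pred_sales)
print(f"MAE in original sales units: {mae_original_units:.1f}")
```

Metrics computed directly on `pred_log` would be on the log scale and could not be compared with metrics from models trained on raw sales.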

### 6. Data Splitting

In [None]:
# Split the data into training and testing sets
# X: Features (input data)
# y: Target variable (output data)
# test_size: The proportion of the data to include in the test set (here, 20%)
# random_state: A random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Print the shapes of the training and testing sets to verify the split
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)


##### What data splitting ratio have you used and why?

The `test_size` parameter in `train_test_split` determines the proportion of data allocated to the testing set when splitting the dataset into training and testing subsets. In this code, `test_size=0.2` is used, resulting in a 20% allocation for testing, with 80% for training. Commonly used splitting ratios are 80:20 (test_size=0.2) and 70:30 (test_size=0.3), offering a balanced trade-off between training data quantity and reliable testing set evaluation.

## ***7. ML Model Implementation***

### Linear Regression

In [None]:
# Create a Linear Regression model
regressor = LinearRegression()

# Fit the model to the training data
regressor.fit(X_train, y_train)

# Predict values on the test set
y_pred = regressor.predict(X_test)

# Create a DataFrame to compare actual and predicted values
comparison_data = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

# Print the DataFrame to view the comparison
print(comparison_data)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Import necessary libraries
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import math

# Calculate the R-squared (Coefficient of Determination)
r2s_1 = r2_score(y_test, y_pred)

# Calculate the Mean Absolute Error (MAE)
mae1 = mean_absolute_error(y_test, y_pred)

# Calculate the Root Mean Squared Error (RMSE)
rmse1 = math.sqrt(mean_squared_error(y_test, y_pred))

# Display the performance metrics
print('Performance of Linear Regression Model:')
print('-' * 40)
print('R-squared (r2_score):', r2s_1)
print('Mean Absolute Error (MAE): %.2f' % mae1)
print('Root Mean Squared Error (RMSE):', rmse1)



#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Import necessary libraries
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error as mae, r2_score as r2

# Define the hyperparameter values to search for Ridge Regression
ridge_params = {'alpha': [0.1, 1.0, 10.0]}
ridge_model = Ridge()

# Perform Grid Search Cross-Validation for Ridge Regression
ridge_grid = GridSearchCV(ridge_model, ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid.fit(X_train, y_train)

# Define the hyperparameter values to search for Lasso Regression
lasso_params = {'alpha': [0.1, 1.0, 10.0]}
lasso_model = Lasso()

# Perform Grid Search Cross-Validation for Lasso Regression
lasso_grid = GridSearchCV(lasso_model, lasso_params, cv=5, scoring='neg_mean_squared_error')
lasso_grid.fit(X_train, y_train)

# Get the best hyperparameters for Ridge and Lasso Regression
best_ridge_alpha = ridge_grid.best_params_['alpha']
best_lasso_alpha = lasso_grid.best_params_['alpha']

# Create Ridge and Lasso Regression models with the best hyperparameters
best_ridge_model = Ridge(alpha=best_ridge_alpha)
best_lasso_model = Lasso(alpha=best_lasso_alpha)

# Fit the models on the training data
best_ridge_model.fit(X_train, y_train)
best_lasso_model.fit(X_train, y_train)

# Make predictions on the test set
ridge_y_pred = best_ridge_model.predict(X_test)
lasso_y_pred = best_lasso_model.predict(X_test)

# Evaluate the models (MAE and R-squared)
ridge_mae = mae(y_test, ridge_y_pred)
ridge_r2 = r2(y_test, ridge_y_pred)

lasso_mae = mae(y_test, lasso_y_pred)
lasso_r2 = r2(y_test, lasso_y_pred)

# Print the results
print("Ridge Regression:")
print(f"Best alpha: {best_ridge_alpha}")
print(f"Mean Absolute Error (MAE): {ridge_mae:.2f}")
print(f"R-squared (R2): {ridge_r2:.2f}\n")

print("Lasso Regression:")
print(f"Best alpha: {best_lasso_alpha}")
print(f"Mean Absolute Error (MAE): {lasso_mae:.2f}")
print(f"R-squared (R2): {lasso_r2:.2f}")


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is employed because it thoroughly explores hyperparameter options within predefined values. It assesses model performance for all hyperparameter combinations via cross-validation, selecting the best set based on a chosen scoring metric (here, negative mean squared error).

Using GridSearchCV ensures comprehensive hyperparameter exploration, leading to the best model configuration without manual trial and error. It automates the tuning process, enhancing efficiency and effectiveness in model optimization.
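As a compact illustration of the grid search workflow, the sketch below tunes Ridge's `alpha` on synthetic data generated with `make_regression`. The dataset and its parameters are assumptions. The grid and scoring metric match the ones used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem (stand-in for the retail data)
X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

grid = GridSearchCV(
    Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="neg_mean_squared_error"
)
grid.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained on all of X_tr
cv_rmse = np.sqrt(-grid.best_score_)
test_r2 = grid.best_estimator_.score(X_te, y_te)
print(grid.best_params_, f"CV RMSE = {cv_rmse:.2f}", f"test R2 = {test_r2:.3f}")
```

Since `best_estimator_` is refit automatically, a separate model does not have to be constructed and fitted with the best `alpha` afterwards.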

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

The evaluation metrics reveal that the R-squared value remains consistent, hovering around 0.83, for both Ridge and Lasso Regression models, whether or not hyperparameter tuning is applied. Moreover, there is only a marginal difference in Mean Absolute Error (940.47 without tuning vs. 940.45 with tuning, specifically for Lasso).

In this specific scenario, it appears that hyperparameter tuning did not lead to a substantial enhancement in model performance. However, it's crucial to acknowledge that these models are already performing quite well, as indicated by an R-squared value of approximately 0.83, suggesting a strong fit to the data.

### XGBoost - ML Model

In [None]:
# Import Necessary Libraries
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Building XGBoost Regressor Model
xgboost = xgb.XGBRegressor(objective='reg:squarederror', verbosity=0)
xgboost.fit(X_train, y_train)

# Making predictions on the test data using the trained model
y_pred = xgboost.predict(X_test)

# Create a DataFrame to compare actual and predicted values
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

# Calculate Mean Squared Error (MSE) to evaluate model performance
mse = mean_squared_error(y_test, y_pred)

# Print the first few rows of the comparison DataFrame and MSE
print("Comparison of Actual vs. Predicted Values:")
print(comparison_df.head())
print("\nMean Squared Error (MSE):", mse)



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score as r2, mean_squared_error as mse, mean_absolute_error as mae

# Calculate evaluation metrics
r2_value = r2(y_test, y_pred)
mae_value = mae(y_test, y_pred)
rmse_value = np.sqrt(mse(y_test, y_pred))  # Calculate RMSE separately

# Create a DataFrame to display the metrics
metrics_df = pd.DataFrame({'Metric': ['R-squared', 'Mean Absolute Error', 'Root Mean Squared Error'],
                            'Value': [r2_value, mae_value, rmse_value]})

# Create a bar chart to visualize the metrics
plt.figure(figsize=(8, 6))
plt.barh(metrics_df['Metric'], metrics_df['Value'], color='skyblue')
plt.xlabel('Metric Value')
plt.title('Evaluation Metrics for XGBoost Regressor Model')
plt.xlim(0, max(metrics_df['Value']) * 1.2)

# Display the metric values on the bars
for i, v in enumerate(metrics_df['Value']):
    plt.text(v + 0.01, i, f'{v:.4f}', va='center', fontsize=12, fontweight='bold')

# Show the chart
plt.show()

# Print the metrics table
print("Performance Metrics for XGBoost Regressor Model:")
print(metrics_df)




#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.metrics import r2_score as r2, mean_squared_error as mse, mean_absolute_error as mae
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Create the XGBoost regressor ('reg:squarederror' replaces the deprecated 'reg:linear' objective)
xgboost = xgb.XGBRegressor(objective='reg:squarederror', verbosity=0)

# Define hyperparameters for tuning
parameters = {'max_depth': [2, 5, 10],
              'learning_rate': [0.05, 0.1, 0.2],
              'min_child_weight': [1, 2, 5],
              'gamma': [0, 0.1, 0.3],
              'colsample_bytree': [0.3, 0.5, 0.7]}

# RandomizedSearchCV for hyperparameter tuning with cross-validation
# (no 'scoring' is given, so candidates are ranked by the estimator's default score, R-squared)
xg_reg = RandomizedSearchCV(estimator=xgboost, param_distributions=parameters, n_iter=10, cv=3)
xg_reg.fit(X_train, y_train)

# Print the best parameter values and the best mean cross-validated score
print("Best Hyperparameters for XGBoost Regression: ")
for key, value in xg_reg.best_params_.items():
    print(f"{key}={value}")
print(f"\nBest mean cross-validated R-squared: {xg_reg.best_score_:.4f}")

# Predict the test data
y_test_pred = xg_reg.predict(X_test)

# Calculate evaluation metrics
r2_score_test_xg = r2(y_test, y_test_pred)
mae_test_xg = mae(y_test, y_test_pred)
rmse_test_xg = np.sqrt(mse(y_test, y_test_pred))

# Display the evaluation metrics
print("\nPerformance Metrics on Test Data:")
print("-------------------------------")
print(f"R-squared (Test): {r2_score_test_xg:.4f}")
print(f"Mean Absolute Error (Test): {mae_test_xg:.2f}")
print(f"Root Mean Squared Error (Test): {rmse_test_xg:.2f}")



##### Which hyperparameter optimization technique have you used and why?

I chose RandomizedSearchCV for several reasons:

1. **Faster Computation:** It samples a fixed number of hyperparameter combinations at random, whereas GridSearchCV evaluates every combination. Here the grid has 3^5 = 243 combinations, and `n_iter=10` evaluates only 10 of them.

2. **Flexibility:** The search budget is set directly through `n_iter` rather than by the size of an exhaustive grid, which keeps large hyperparameter spaces manageable.

3. **Better Exploration:** When parameters are given as continuous distributions, it can try values that lie between grid points. In this notebook the candidates are discrete lists, so the benefit is coverage of a larger grid within a small budget.

4. **Resource-Efficient:** With `cv=3`, the search fits 30 models (plus one final refit) instead of the 729 an exhaustive grid search would need, which matters for complex models and large datasets.
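The sketch below shows a randomized search that samples from continuous distributions instead of fixed lists. To keep it dependency-light, scikit-learn's `GradientBoostingRegressor` is used as a stand-in for XGBoost. The data, parameter ranges, and `n_iter` value are assumptions made for illustration.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem (stand-in for the retail data)
X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

param_distributions = {
    "max_depth": randint(2, 6),            # integers 2..5
    "learning_rate": uniform(0.05, 0.15),  # floats in [0.05, 0.20]
    "subsample": uniform(0.5, 0.5),        # floats in [0.5, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                              # evaluate 5 sampled candidates
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,                        # reproducible sampling
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"Best CV score (negative MSE): {search.best_score_:.1f}")
```

Setting `scoring` explicitly makes `best_score_` a negative mean squared error. Setting `random_state` makes the sampled candidates repeatable between runs.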

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

The model's performance deteriorated after hyperparameter tuning, evident from the decreased R-squared value and increased MAE and RMSE values. This implies that the untuned XGBoost model outperforms the tuned one in capturing the relationships between features and the target variable, ultimately leading to more precise predictions.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

R-squared (R2 score):

- **Indication:** R-squared reflects the fraction of the variance in the target variable (e.g., sales) that is explained by the independent variables (features) in the model. A higher R-squared value signifies that the model effectively captures the relationship between input features and the target, explaining more of the variance.

- **Business Impact:** A high R-squared value indicates accurate sales predictions, enhancing the reliability of business decisions. It provides valuable insights into factors affecting sales, aiding resource allocation, inventory management, and promotion planning. This optimization leads to improved retail operations and profitability.

Mean Absolute Error (MAE):

- **Indication:** MAE calculates the average absolute difference between actual and predicted sales. A lower MAE signifies more accurate predictions with reduced errors.

- **Business Impact:** Lower MAE fosters efficient inventory management by minimizing overstocking or understocking, lowering holding costs, and enhancing customer satisfaction. It supports pricing and promotion decisions for increased sales.

Root Mean Squared Error (RMSE):

- **Indication:** RMSE calculates the square root of the average squared differences between actual and predicted sales. It's sensitive to larger prediction errors.

- **Business Impact:** A lower RMSE aids in minimizing forecasting errors, optimizing inventory levels, ensuring product availability, and resource planning. It helps identify sales opportunities during peak seasons while reducing potential losses due to excess inventory.
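The three metrics can be computed by hand to see how they relate. The five actual and predicted values below are made up for the example. The manual results are checked against scikit-learn's implementations.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([5000.0, 6200.0, 7100.0, 4800.0, 9000.0])      # made-up daily sales
predicted = np.array([5200.0, 6000.0, 7500.0, 4700.0, 8000.0])   # made-up forecasts

errors = actual - predicted
mae_manual = np.mean(np.abs(errors))                 # average absolute error
rmse_manual = np.sqrt(np.mean(errors ** 2))          # root of average squared error
r2_manual = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

# Cross-check against scikit-learn
assert np.isclose(mae_manual, mean_absolute_error(actual, predicted))
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(actual, predicted)))
assert np.isclose(r2_manual, r2_score(actual, predicted))

print(f"MAE = {mae_manual:.2f}, RMSE = {rmse_manual:.2f}, R2 = {r2_manual:.4f}")
# MAE = 380.00, RMSE = 500.00, R2 = 0.8941
```

The single 1000-unit miss pushes RMSE (500) well above MAE (380), which illustrates why RMSE is described as sensitive to larger errors.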

### Decision Tree - ML Model

In [None]:
# Create a Decision Tree Regressor
decision_tree_model = DecisionTreeRegressor()
# Fit the model to the training data
decision_tree_model.fit(X_train, y_train)
# Predict on the test set
y_test_pred = decision_tree_model.predict(X_test)

# Compare the actual and predicted values
data2 = pd.DataFrame({'Actual':y_test, 'Predicted':y_test_pred})
data2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Calculate evaluation metrics
r2_score_test1 = r2_score(y_test, y_test_pred)
mae_test1 = mean_absolute_error(y_test, y_test_pred)
# Take the square root of the MSE (the 'squared=False' argument is deprecated in recent scikit-learn releases)
rmse_test1 = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print the evaluation metrics
print("Performance of Decision Tree Regressor Model:")
print("-------------------------------------------")
print(f"R-squared (Test): {r2_score_test1:.4f}")
print(f"Mean Absolute Error (Test): {mae_test1:.2f}")
print(f"Root Mean Squared Error (Test): {rmse_test1:.2f}")




### 1. Which Evaluation metrics did you consider for a positive business impact and why?

To achieve positive business impact, we assess the Decision Tree model using the following evaluation metrics:

1. **R-squared (Coefficient of Determination):** R-squared quantifies the proportion of variance in the dependent variable (e.g., sales) that is explained by the independent variables (features). A higher R-squared value signifies a better ability to capture underlying patterns and trends in sales data. This is vital for businesses as it gauges the model's effectiveness in fitting the data and making accurate sales predictions.

2. **Mean Absolute Error (MAE):** MAE calculates the average absolute disparity between actual and predicted sales values. Lower MAE values indicate higher prediction accuracy, as they represent smaller absolute errors. This accuracy is pivotal for businesses as it facilitates improved resource allocation, inventory management, and informed decision-making.

3. **Root Mean Squared Error (RMSE):** RMSE computes the square root of the average squared differences between actual and predicted sales values. Because errors are squared before averaging, it penalizes large misses more heavily than MAE. A lower RMSE signifies that the model's predictions are closer to actual values and that large errors are rare, enhancing their reliability for data-driven business decisions.

These metrics collectively provide a comprehensive assessment of the model's performance, enabling businesses to gauge its effectiveness in predicting sales and guiding strategic actions.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Based on the evaluation of various metrics, it is evident that the XGBoost model outperforms other models in terms of predictive accuracy for sales forecasting. The key performance indicators for the XGBoost model on the test dataset are highly promising:

- **R-squared (Test):** An impressive R-squared value of 0.9263 indicates that the model effectively explains approximately 92.63% of the variance in the sales data. This signifies a robust ability to capture and model the underlying patterns and trends in sales, demonstrating its strong predictive power.

- **Mean Absolute Error (Test):** With an MAE of 612.25, the XGBoost model's predictions differ from actual sales values by about 612 sales units on average, highlighting its precision in forecasting.

- **Root Mean Squared Error (Test):** An RMSE of 842.83 reinforces the model's reliability. This metric indicates that, on average, the model's predictions are close to the actual sales values, further emphasizing its effectiveness in generating accurate forecasts.

In summary, the XGBoost model, with its high R-squared value and low MAE and RMSE scores, stands out as the preferred choice for sales prediction. Its ability to capture complex relationships within the data makes it a valuable tool for guiding strategic decisions in areas such as resource allocation, inventory management, and overall business optimization.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Yellowbrick can be used to present feature importance for the fitted model. Its `FeatureImportances` visualizer wraps a scikit-learn-compatible estimator and plots the relative importance of each feature. This helps both data scientists and stakeholders understand which inputs drive the predictions and decide how far to trust the insights derived from the model.
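As a library-agnostic illustration of feature importance, the sketch below uses scikit-learn's `permutation_importance` rather than Yellowbrick, with `GradientBoostingRegressor` standing in for XGBoost. The synthetic data and the `Noise` column are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic features: two informative, one pure noise
X_toy = pd.DataFrame({
    "Customers": rng.normal(700, 150, n),
    "Promo": rng.integers(0, 2, n),
    "Noise": rng.normal(size=n),
})
y_toy = 8 * X_toy["Customers"] + 1000 * X_toy["Promo"] + rng.normal(0, 300, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
model = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in R-squared
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

Permutation importance is measured on held-out data, so it reflects how much the model's predictive performance depends on each feature, not how often a feature is used in tree splits.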

# **Conclusion**

Businesses rely on sales forecasts to make informed decisions and create effective business strategies. These forecasts influence crucial choices, including budgeting, staffing, incentives, goal setting, acquisitions, and growth plans. The accuracy of these predictions directly impacts the success of these strategies.

In this analysis, we have forecasted the sales of various Rossmann stores in Europe for the recent six weeks and compared these predictions with actual sales figures. Some key insights from this analysis include:

1. **Sales Patterns:** Sales tend to be higher on Mondays, possibly because many shops are closed on Sundays, resulting in lower Sunday sales. This observation confirms the hypothesis about the impact of days of the week on sales.

2. **Promotions:** Promotions have a positive effect on both customer traffic and sales, highlighting the importance of promotional strategies in driving revenue.

3. **Competition:** Most stores face competition from nearby stores within a range of 0 to 10 kilometers. Stores with closer competition tend to have higher sales, indicating that competition is more intense in busy locations compared to remote ones.

4. **Store Types:** Store type B, while fewer in number, achieves the highest average sales. Factors contributing to this include the availability of assortment level B, which is exclusive to type B stores, and being open on Sundays.

5. **Outliers:** Outliers in the dataset exhibit justifiable behavior, often associated with store type B or ongoing promotions, which increase sales.

6. **Model Performance:** Among the models tested (Linear Regression, Ridge and Lasso, XGBoost, and Decision Tree), XGBoost demonstrates the highest accuracy, with an R-squared score of 0.9263, MAE of 612.25, and RMSE of 842.83 on the test set. While it yields the lowest error, it requires more computational effort than the linear models.

7. **Feature Importance:** The most influential feature for store sales is 'Customers,' which, in turn, depends on other factors like competition distance, store type, and promotions.

In conclusion, the XGBoost model proves to be a powerful tool for predicting sales, with feature importance data providing valuable insights. Rossmann stores can leverage this model to forecast sales for the next six weeks. However, it's important to note that data preprocessing may have influenced prediction results, as the training set contained incomplete entries that required imputation.

These insights will empower Rossmann and similar businesses to make data-driven decisions, optimize operations, and enhance their revenue-generating capabilities.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***