# <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 110%; text-align: center; border-bottom: 0.3px solid #004466;"> Walmart Sales Analysis </div>
    

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores in the United States, headquartered in Bentonville, Arkansas. 

![](https://corporate.walmart.com/content/corporate/en_us/about/jcr:content/par/grid_4_copy_copy/parsys_tab_1/image.img.jpg/1693432306522.jpg)

# <div style="background: #6b8272; font-family: monospace; font-weight: bold; font-size: 110%; text-align: center; border-bottom: 0.3px solid #004466;"> PROJECT OVERVIEW </div>
    
* [INTRODUCTION](#1)
* [CONFIGURATION](#2)
    * [LIBRARY DEPENDENCIES](#2.1)
    * [DATA SOURCE & PREPARATION](#2.2)    
* [PREPROCESSING](#3)
    * [DATA CLEANING](#3.1)
    * [DATA EXPLORATION](#3.2)
* [FEATURE ENGINEERING](#4)
* [EXPLORATORY DATA ANALYSIS](#5)

<a id="1"></a>
# <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 110%; text-align: center; border-bottom: 0.3px solid #004466;"> INTRODUCTION </div>


This project analyzes Walmart retail sales data, focusing on store performance across different locations. Using Python for data exploration and visualization, I look for sales trends, seasonal variation, and the influence of external factors such as holidays, temperature, fuel prices, CPI, and unemployment rates. The goal is to provide actionable insights for retail management decisions and a better understanding of consumer behavior and market dynamics.
<br><br>   

<b>Summary of the Dataset:</b>
1. Store: Identifier for the retail store.
2. Date: Date of sales record.
3. Weekly_Sales: Sales recorded for the store in that week.
4. Holiday_Flag: Indicator for holiday week (1) or non-holiday week (0).
5. Temperature: Temperature in the region of the store.
6. Fuel_Price: Fuel price in the region.
7. CPI: Consumer Price Index.
8. Unemployment: Unemployment rate.
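
As a quick sanity check on the columns named in this notebook, a sketch like the following could confirm that a loaded frame has the expected schema. The `missing_columns` helper and the two-row `sample` frame are hypothetical, and the values are made up for illustration; only the column names come from the dataset description.

```python
import pandas as pd

# Column names as described for the raw CSV (before lowercasing)
EXPECTED_COLUMNS = [
    "Store", "Date", "Weekly_Sales", "Holiday_Flag",
    "Temperature", "Fuel_Price", "CPI", "Unemployment",
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]

# Hypothetical two-row sample standing in for the real file (made-up values)
sample = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Weekly_Sales": [1500000.0, 1600000.0],
    "Holiday_Flag": [0, 1],
    "Temperature": [42.3, 38.5],
    "Fuel_Price": [2.57, 2.55],
    "CPI": [211.1, 211.2],
    "Unemployment": [8.1, 8.1],
})

print(missing_columns(sample))                        # nothing missing
print(missing_columns(sample.drop(columns=["CPI"])))  # reports the dropped column
```

Running such a check right after loading would surface a renamed or missing column before any later cell depends on it.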

<a id="2"></a>
# <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 110%; text-align: center; border-bottom: 0.3px solid #004466;"> CONFIGURATION </div>
    
* <b>Libraries: </b>The analysis uses the Python libraries pandas, NumPy, Matplotlib, Seaborn, and SciPy for data manipulation, visualization, and statistics.
* <b>Dataset:</b> The dataset used in this analysis includes variables such as Store, Date, Weekly_Sales, Holiday_Flag, Temperature, Fuel_Price, CPI, and Unemployment, providing a comprehensive view of Walmart sales data.




<a id="2.1"></a>
## <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 70%; text-align: center; border-bottom: 0.3px solid #004466;"> Essential Library Imports for Data Analysis </div>

In [None]:
%%time

# Import the libraries used in this notebook
from gc import collect; # garbage collection to free up memory
from warnings import filterwarnings; # handle warning messages

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra

import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization

# Set the plot style to 'fivethirtyeight'
plt.style.use("fivethirtyeight")

from datetime import datetime  # Importing the datetime class from the datetime module

from scipy import stats # statistical functions

filterwarnings('ignore'); # Ignore warning messages
from IPython.display import display_html, clear_output; # displaying HTML content


clear_output();
print();
collect();

<a id="2.2"></a>
## <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 70%; text-align: center; border-bottom: 0.3px solid #004466;"> Loading and Preparing the Dataset </div>

In [None]:
%%time

# Error Handling When Loading Dataset with Pandas read_csv

try:
    # Attempt to read the dataset
    df = pd.read_csv('/kaggle/input/walmart-sales/Walmart_sales.csv')
    print("Dataset loaded successfully.")
    
except FileNotFoundError:
    # The file does not exist at the given path; re-raise so later cells do not run without df
    print("Error: File not found. Please check the file path.")
    raise

except Exception as e:
    # Report any other loading problem, then re-raise it
    print("An error occurred while loading the dataset:", e)
    raise

print();
collect();

<a id="3"></a>
# <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 110%; text-align: center; border-bottom: 0.3px solid #004466;"> DATA PREPROCESSING </div>
    
1. <b>Data Cleaning: </b>
* Handling Missing Values
* Duplicate Values
* Data Type Conversions

This stage involves preparing the dataset for analysis by addressing missing values and ensuring uniform data types.

2. <b>Data Exploration:</b>
* Descriptive Statistics
* Dataset Shape Analysis

This phase focuses on exploring the dataset's characteristics through statistical summaries and understanding its structure and dimensions.
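
The cleaning checks listed above can be sketched on a tiny frame. The `toy` data below is made up, and dropping rows is only one possible way to handle missing or duplicated records; the notebook itself only inspects them.

```python
import pandas as pd

# Made-up toy data: one missing value and one fully duplicated row
toy = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "date": ["2010-02-05", "2010-02-05", "2010-02-12", "2010-02-19"],
    "weekly_sales": [100.0, 100.0, None, 250.0],
})

missing_per_column = toy.isna().sum()        # missing values per column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row

# One possible cleaning pipeline: drop duplicates, drop rows without sales,
# then convert the date strings to a datetime dtype
cleaned = (
    toy.drop_duplicates()
       .dropna(subset=["weekly_sales"])
       .assign(date=lambda d: pd.to_datetime(d["date"], format="%Y-%m-%d"))
       .reset_index(drop=True)
)

print(missing_per_column)
print(n_duplicates)
print(cleaned.dtypes)
```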


<a id="3.1"></a>
## <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 70%; text-align: center; border-bottom: 0.3px solid #004466;"> Data Cleaning (column formatting, missing values, duplicate values, data types) </div>

In [None]:
# Check the column names
df.columns

In [None]:
%%time

# Rename columns to lowercase
df.columns = df.columns.str.lower()

# Verify the new column names
print("\nNew column names:")
print(df.columns)

print();
collect();

In [None]:
# Checking the null values in the dataset
df.isna().sum()

In [None]:
# Information about the dataset
df.info()

In [None]:
# Attempt to convert 'date' column to datetime format
try:
    df['date'] = pd.to_datetime(df['date'], format='mixed')
    print("All values in 'date' column are valid dates.")
except ValueError as e:
    print("Error:", e)
    print("There are non-date values present in the 'date' column.")
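
One caveat worth checking: if the raw file stores dates day-first (for example `DD-MM-YYYY`; this is an assumption, since the raw values are not shown here), strings such as `05-02-2010` are ambiguous, and a month-first reading silently yields a different date instead of raising an error. A minimal sketch with two made-up strings:

```python
import pandas as pd

# Hypothetical day-first date strings (5 Feb 2010 and 19 Feb 2010)
raw = pd.Series(["05-02-2010", "19-02-2010"])

# Explicit format: every value is read as day-month-year
explicit = pd.to_datetime(raw, format="%d-%m-%Y")

# Default parsing of an ambiguous string reads the month first
ambiguous = pd.to_datetime("05-02-2010")

# dayfirst=True resolves the ambiguity in favor of day-month-year
day_first = pd.to_datetime("05-02-2010", dayfirst=True)

print(explicit.dt.month.tolist())
print(ambiguous.month, day_first.month)
```

When the layout of the column is known, passing an explicit `format` keeps the month-based features (season, month, day of week) consistent across all rows.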

In [None]:
# Count fully duplicated rows in the data
duplicate_rows = df.duplicated().sum()
print(f'The data contains {duplicate_rows} duplicate rows')

<a id="3.2"></a>
## <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 70%; text-align: center; border-bottom: 0.3px solid #004466;"> Data Exploration </div>

In [None]:
# Checking the data shape
print(f'The dataset contains {df.shape[0]} rows and {df.shape[1]} columns')

In [None]:
# Statistics about the data set
df.describe().style.background_gradient(cmap='bone_r')

<a id="4"></a>
# <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 110%; text-align: center; border-bottom: 0.3px solid #004466;"> FEATURE ENGINEERING </div>
    
* <b>Date Features: </b>Extract additional information from the 'date' column, such as season, year, month, month name, day, and day of the week. Together with the existing holiday flag, these features help capture seasonality and holiday effects.

List of Numerical Features
* weekly sales
* temperature
* fuel price
* cpi
* unemployment

In [None]:
# Function to map a date to its season (meteorological, Northern Hemisphere)
def date_to_season(date):
    # Read the month directly from the Timestamp
    month = date.month
    
    # Map months to seasons
    if month in [3, 4, 5]:
        return "Spring"
    elif month in [6, 7, 8]:
        return "Summer"
    elif month in [9, 10, 11]:
        return "Autumn"
    else:
        return "Winter"

# Apply date_to_season to each parsed date to create a new column
df["season"] = df["date"].apply(date_to_season)
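
As an alternative to applying a function row by row, the same month-to-season mapping can be expressed as a dictionary lookup on `dt.month`. This is only a sketch: the `MONTH_TO_SEASON` name and the sample dates are illustrative, not part of the dataset.

```python
import pandas as pd

# Month number -> season name, mirroring the mapping used in this notebook
MONTH_TO_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Illustrative dates, one per season
dates = pd.Series(pd.to_datetime(
    ["2010-02-05", "2010-04-16", "2011-07-01", "2012-10-26"]
))

# Vectorized lookup: extract the month, then map it to a season label
seasons = dates.dt.month.map(MONTH_TO_SEASON)
print(seasons.tolist())
```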

In [None]:
# Extract year from the 'date' column
df['year'] = df['date'].dt.year

# Extract month from the 'date' column
df['month'] = df['date'].dt.month

# Extract month name from the 'date' column
df['month_name'] = df['date'].dt.month_name()

# Extract day from the 'date' column
df['day'] = df['date'].dt.day

# Extract day of the week (0 = Monday, 1 = Tuesday, ..., 6 = Sunday) from the 'date' column
df['day_of_week'] = df['date'].dt.dayofweek

df.sample(5)

<a id="5"></a>
# <div style="background: #6b8272;font-family: monospace; font-weight: bold; font-size: 110%; text-align: center; border-bottom: 0.3px solid #004466;"> EXPLORATORY DATA ANALYSIS (EDA) </div>
    
Univariate Analysis:

* Visualize the distribution of numerical variables using histograms or density plots.
* Explore categorical variables using bar plots or count plots to understand their frequency distribution.

Bivariate Analysis:

* Analyze relationships between pairs of variables using scatter plots (for numerical variables) or grouped bar plots (for categorical variables).
* Calculate correlation coefficients between numerical variables to measure the strength and direction of linear relationships.


Techniques:

* Visualizing distributions of variables (histograms, kernel density plots)
* Summary statistics (box plots, violin plots)
* Pairwise relationships (scatter plots, pair plots)

In [None]:
# List of features
features = ['weekly_sales', 'temperature', 'fuel_price', 'cpi', 'unemployment']

# Set the figure size
plt.figure(figsize=(18, 20))

# Loop through each numerical feature
for i, col in enumerate(features):
    # Select the next panel in a 3x3 grid of subplots
    plt.subplot(3, 3, i+1)
    
    # Plot histogram for the current column
    sns.histplot(data=df, x=col, kde=True)

plt.tight_layout()
plt.show()

<div style="background: #4d4d00; padding: 20px; font-family: monospace; font-weight: bold; font-size: 90%; text-align: left; width: 100%; box-sizing: border-box; color: white;">
    <u><strong>Conclusion</strong></u><br>
    <br>
    <b>Weekly Sales:</b> The histogram of weekly sales indicates a right-skewed distribution, suggesting that there are relatively fewer instances of very high sales compared to lower sales amounts. This could indicate occasional spikes in sales or a few high-performing periods.<br>
    <br>
    <b>Temperature and Unemployment:</b> The histograms of temperature and unemployment show approximately normal distributions, indicating that the majority of the data points are centered around the mean with relatively few outliers. This suggests that these factors may follow typical patterns without significant deviations.<br>
    <br>
    <b>Fuel Price, CPI:</b> The histograms of fuel price and CPI exhibit bimodal distributions, suggesting the presence of two distinct peaks or modes in the data. This could imply the existence of different market conditions or states within the dataset, potentially indicating varying economic situations or consumer behaviors.
</div>
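
Shape descriptions read off a histogram can be backed by a number. The sketch below uses `scipy.stats.skew` and `scipy.stats.normaltest` on synthetic samples (not the Walmart data) to show what a right-skewed and a roughly symmetric variable look like numerically; the sample sizes and distributions are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples, seeded for reproducibility
rng = np.random.default_rng(0)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2000)
symmetric = rng.normal(loc=0.0, scale=1.0, size=2000)

# Sample skewness: clearly positive for a right tail, near zero for symmetry
skew_right = stats.skew(right_skewed)
skew_sym = stats.skew(symmetric)

# D'Agostino-Pearson test: a small p-value is evidence against normality
p_right = stats.normaltest(right_skewed).pvalue

print(f"skewness (right-skewed sample): {skew_right:.2f}")
print(f"skewness (symmetric sample):    {skew_sym:.2f}")
print(f"normality test p-value (right-skewed sample): {p_right:.3g}")
```

Applied to columns such as `weekly_sales` or `temperature`, the same calls would give a quantitative check on the visual impressions above.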


In [None]:
correlation_matrix = df[['weekly_sales', 'temperature', 'fuel_price', 'cpi', 'unemployment']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

<div style="background: #4d4d00; padding: 20px; font-family: monospace; font-weight: bold; font-size: 90%; text-align: left; width: 100%; box-sizing: border-box; color: white;">
    <u><strong>Conclusion</strong></u><br>
    <ul>
        <li>Weekly sales have a weak negative correlation with temperature, fuel price, and unemployment. This means that as these variables increase, weekly sales tend to decrease slightly.</li>
        <li>Weekly sales have a weak positive correlation with CPI. This means that as CPI increases, weekly sales tend to increase slightly.</li>
        <li>Temperature has a weak positive correlation with fuel price, CPI, and unemployment.</li>
        <li>Fuel price has a weak negative correlation with CPI.</li>
        <li>CPI has a moderate negative correlation with unemployment.</li>
    </ul>
</div>
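
A heatmap shows the size of each coefficient but not whether it is distinguishable from zero. As a sketch, `scipy.stats.pearsonr` returns the coefficient together with a p-value. The data below is synthetic, with a weak negative relationship built in by construction; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data with a weak negative linear relationship
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = -0.1 * x + rng.normal(size=500)

# Pearson correlation coefficient and its two-sided p-value
r, p_value = stats.pearsonr(x, y)

# The coefficient matches what pandas reports in a correlation matrix
r_pandas = pd.Series(x).corr(pd.Series(y))

print(f"r = {r:.3f}, p-value = {p_value:.3g}")
print(f"pandas corr = {r_pandas:.3f}")
```

For weak correlations like those described above, the p-value helps separate a small but consistent association from noise, though neither number implies causation.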


In [None]:
pd.pivot_table(data = df,
              index = 'year',
              columns = 'season',
              values = 'weekly_sales',
              aggfunc = 'sum')

In [None]:
# Create a dictionary to store season-wise weekly sales for each year
seasonwise_weekly_sales = {}

# Iterate over unique seasons
for season in df['season'].unique():
    # Group by year and sum the weekly sales for the current season
    season_sales = df[df['season'] == season].groupby('year')['weekly_sales'].sum()
    # Store the season-wise weekly sales in the dictionary
    seasonwise_weekly_sales[season] = season_sales

# Create an empty list to store data
plot_data = []

# Populate the list with data
for season, sales in seasonwise_weekly_sales.items():
    for year, weekly_sales in sales.items():
        plot_data.append({'Year': year, 'Season': season, 'Weekly Sales': weekly_sales})

# Convert the list to a DataFrame
plot_data = pd.DataFrame(plot_data)

# Create the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Plot a grouped bar plot (one bar per season within each year) using seaborn
sns.barplot(data=plot_data, x='Year', y='Weekly Sales', hue='Season', ax=ax, ci=None)

# Set labels and title
ax.set_xlabel('Year')
ax.set_ylabel('Total Weekly Sales')
ax.set_title('Yearly Season-wise Total Weekly Sales')
# Place the legend just outside the axes, to the right, so it does not cover the bars
ax.legend(title='Season', loc='upper left', bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.show()


<div style="background: #4d4d00; padding: 20px; font-family: monospace; font-weight: bold; font-size: 90%; text-align: left; width: 100%; box-sizing: border-box; color: white;">
    <u><strong>Conclusion: Best Season by Year</strong></u><br>
    <ul>
        <li><strong>2010:</strong> Spring</li>
        <li><strong>2011:</strong> Autumn</li>
        <li><strong>2012:</strong> Summer</li>
    </ul>
</div>
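
The best season per year can also be read programmatically instead of from the chart. A minimal sketch, using a made-up `toy` frame with the same column names as the notebook:

```python
import pandas as pd

# Made-up records; only the column names match the notebook
toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2011, 2011, 2011],
    "season": ["Spring", "Summer", "Spring", "Autumn", "Winter", "Autumn"],
    "weekly_sales": [10.0, 15.0, 8.0, 7.0, 12.0, 9.0],
})

# Total sales per (year, season), reshaped to one row per year
totals = toy.groupby(["year", "season"])["weekly_sales"].sum().unstack("season")

# For each year, the season (column label) with the largest total
best_season = totals.idxmax(axis=1)
print(best_season)
```

Because the totals are sums, a year with fewer recorded weeks in a given season will show a lower value for it, which is worth keeping in mind when comparing seasons across years.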

In [None]:
# Group by year and holiday_flag to get counts
holiday_counts = df.groupby(['year', 'holiday_flag']).size().unstack(fill_value=0).reset_index()

# Melt DataFrame to long format
holiday_counts_melted = pd.melt(holiday_counts, id_vars='year', var_name='Holiday Flag', value_name='Count')

# Plot using Seaborn
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# Bar plot
sns.barplot(data=holiday_counts_melted, x='year', y='Count', hue='Holiday Flag', ax=ax[0])
ax[0].set_title('Holiday Distribution Over the Years')
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Count')

# Get legend handles
handles, _ = ax[0].get_legend_handles_labels()

ax[0].legend(handles=handles, labels=['Not Holiday', 'Holiday'], title='Holiday Flag', loc='upper left', bbox_to_anchor=(1, 1))

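# Sort by flag value (0, 1) so the slices always line up with the labels
holiday_share = df['holiday_flag'].value_counts().sort_index()
ax[1].pie(holiday_share.values, labels=['Not Holiday', 'Holiday'], autopct='%1.2f%%')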
ax[1].set_title('Overall Holiday Distribution')

plt.tight_layout()
plt.show()

<div style="background: #4d4d00; padding: 20px; font-family: monospace; font-weight: bold; font-size: 90%; text-align: left; width: 100%; box-sizing: border-box; color: white;">
    <u><strong>Conclusion</strong></u><br>
    The holiday share varies across years. Overall, holiday weeks account for 6.99% of the observations, a relatively small proportion of the data. Further investigation of the trends within each year may give additional insight into holiday occurrences and their effect on sales.
</div>
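
As a starting point for that investigation, the holiday share and the average sales per flag can be computed directly. The sketch below uses a made-up `toy` frame, so the numbers are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Made-up records: 2 holiday weeks out of 8
toy = pd.DataFrame({
    "holiday_flag": [0, 0, 0, 1, 0, 1, 0, 0],
    "weekly_sales": [100.0, 110.0, 90.0, 150.0, 105.0, 130.0, 95.0, 100.0],
})

# The mean of a 0/1 flag is the fraction of holiday weeks
holiday_share_pct = toy["holiday_flag"].mean() * 100

# Average weekly sales for non-holiday (0) and holiday (1) weeks
mean_sales_by_flag = toy.groupby("holiday_flag")["weekly_sales"].mean()

print(f"Holiday share: {holiday_share_pct:.2f}%")
print(mean_sales_by_flag)
```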

In [None]:
# Calculate the count of each year
year_counts = df['year'].value_counts()

# Plotting
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# Countplot for the distribution of years
sns.countplot(data=df, x='year', ax=ax[0])
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Count')

# Pie chart for the distribution of years
ax[1].pie(year_counts.values, labels=year_counts.index, autopct='%1.2f%%')

# Set a single title for the entire figure
plt.suptitle('Distribution of Years', fontsize=16, y=1.05)

plt.show()

# Remaining work
* More visualizations
* Machine learning model
