Dataset Description:

Walmart runs several promotional markdown events throughout the year, particularly around four major holidays: the Super Bowl, Labor Day, Thanksgiving, and Christmas. Historical sales data for 45 Walmart stores located in different regions from 2010-02-05 to 2012-11-01 are provided.


The dataset includes the following fields:

-Store: The store number. 
-Date: The week of sales. 
-Weekly_Sales: Sales for the given store. 
-Holiday_Flag: Whether the week is a special holiday week (1) or a non-holiday week (0).
-Temperature: Temperature on the day of sale.
-Fuel_Price: Cost of fuel in the region. 
-CPI: Prevailing consumer price index. 
-Unemployment: Prevailing unemployment rate.

Problem Statement and Analysis Questions:

-Perform exploratory data analysis: 
-Import data.
-Display data.
-Visualize quantitative variables distributions.
-Perform data cleaning. 
-Answer the following questions:

-Which store has maximum sales?
-Which store has maximum standard deviation in sales (i.e., the sales vary a lot)? 
-Find holidays that have higher sales than the mean sales in the non-holiday season for all stores together.
-Provide a monthly and semester view of sales in units and give insights. 
-Plot the relationships between weekly sales and other numeric features, and provide insights.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

In [None]:
# Ignore warnings to keep the output clean
warnings.filterwarnings('ignore')

In [None]:
# Load the dataset from the specified path
df = pd.read_csv("walmart-dataset.csv")

In [None]:
# Check the data types and presence of missing values
df.info()

In [None]:
# Convert the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
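The explicit day-first format matters because a string such as 05-02-2010 is ambiguous. A minimal sketch on made-up values (the `raw` series is hypothetical, not taken from the CSV) shows how `errors="coerce"` can surface rows that do not match the expected format instead of raising an error:

```python
import pandas as pd

# Hypothetical stand-in for the raw 'Date' column; the last value is deliberately malformed
raw = pd.Series(["05-02-2010", "12-02-2010", "2010/02/19"])

# Strings that do not match day-month-year become NaT instead of raising
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

print(parsed)
print("Rows that failed to parse:", int(parsed.isna().sum()))
```

If the count is zero on the real data, the stricter call used in the cell above is sufficient.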

In [None]:
# Summary statistics for the numeric columns of the Walmart dataset
df.describe()
# Tip: df.describe().round() gives a more compact, rounded table

In [None]:
# Count the number of missing values in each column
print(df.isnull().sum())

In [None]:
# Count the number of duplicate rows in the data
print(df.duplicated().sum())
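The two checks above only report problems. If either count were non-zero, one possible cleaning step is sketched below on a toy frame (the `toy` data and the choice to drop rather than impute are assumptions for illustration, not something the dataset necessarily requires):

```python
import pandas as pd

# Toy frame with one exact duplicate row and one missing sales value
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2],
    "Weekly_Sales": [100.0, 100.0, None, 250.0],
})

cleaned = (
    toy.drop_duplicates()                 # remove exact duplicate rows
       .dropna(subset=["Weekly_Sales"])   # drop rows with no sales figure
       .reset_index(drop=True)
)

print(cleaned)
```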

In [None]:
# Save the cleaned dataset for later use in Power BI (uncomment to run)
# df.to_csv('clean_walmart_set.csv', index=False)

In [None]:
# Plot histograms for distributions of quantitative variables
df[['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']].hist(bins=20, figsize=(15, 10))
plt.show()

In [None]:
# Plot boxplots for each quantitative variable
for col in ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']:
    sns.boxplot(x=col, data=df)
    plt.show()
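Box plots show outliers visually. To count them, one option is the conventional 1.5 x IQR rule, sketched here with a hypothetical helper `iqr_outlier_mask` and a toy series:

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside k * IQR of the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values with one obvious outlier
toy = pd.Series([10, 12, 11, 13, 12, 11, 50])
mask = iqr_outlier_mask(toy)

print(toy[mask])
print("Number of outliers:", int(mask.sum()))
```

On the real data the same helper could be applied to each quantitative column, for example `iqr_outlier_mask(df['Unemployment']).sum()`.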

In [None]:
# Calculate total sales for each store and sort them in descending order
total_sales = df.groupby('Store')['Weekly_Sales'].sum().sort_values(ascending=False)
print("Top 5 stores with the highest total sales:")
print(total_sales.head())

# From the output: store 20 has the highest total sales

In [None]:
# Calculate the standard deviation of weekly sales for each store
total_std = df.groupby('Store')['Weekly_Sales'].std().sort_values(ascending=False)
print("Top 5 stores with the highest volatility in weekly sales:")
print(total_std.head())

# From the output: store 14 has the highest standard deviation, so its weekly sales are the most variable
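Standard deviation grows with the scale of sales, so large stores tend to rank high on it. A complementary measure is the coefficient of variation (std divided by mean). The sketch below uses toy numbers, not the real dataset, to show how the two rankings can differ:

```python
import pandas as pd

# Toy data: store 1 is large and steady, store 2 is small but erratic
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Weekly_Sales": [1000.0, 1100.0, 900.0, 100.0, 150.0, 50.0],
})

stats = toy.groupby("Store")["Weekly_Sales"].agg(["mean", "std"])
stats["cv"] = stats["std"] / stats["mean"]  # relative variability

print(stats.sort_values("cv", ascending=False))
```

In this toy case store 1 has the larger standard deviation while store 2 has the larger coefficient of variation.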

In [None]:
# Identify holidays with higher sales than the mean non-holiday sales
mean_non_holiday_sales = df[df['Holiday_Flag'] == 0]['Weekly_Sales'].mean()
high_sales_holidays = df[(df['Holiday_Flag'] == 1) & (df['Weekly_Sales'] > mean_non_holiday_sales)]
print("Holidays with higher sales than mean non-holiday sales:")
print(high_sales_holidays[['Date', 'Weekly_Sales']])
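The filter above returns individual store-week rows. To answer the question for all stores together, one possible approach is to average sales per holiday date and compare each date with the non-holiday mean. A minimal sketch on toy data (the `toy` frame and its values are made up):

```python
import pandas as pd

# Toy data: two stores, one non-holiday week and two holiday weeks
toy = pd.DataFrame({
    "Store": [1, 2, 1, 2, 1, 2],
    "Date": pd.to_datetime(
        ["2010-02-05"] * 2 + ["2010-02-12"] * 2 + ["2010-09-10"] * 2
    ),
    "Holiday_Flag": [0, 0, 1, 1, 1, 1],
    "Weekly_Sales": [100.0, 200.0, 180.0, 220.0, 90.0, 110.0],
})

# Baseline: mean weekly sales across all non-holiday rows
baseline = toy.loc[toy["Holiday_Flag"] == 0, "Weekly_Sales"].mean()

# Mean sales across stores for each holiday week
holiday_means = (
    toy[toy["Holiday_Flag"] == 1].groupby("Date")["Weekly_Sales"].mean()
)

# Keep only the holiday weeks that beat the baseline
above = holiday_means[holiday_means > baseline]
print(above)
```

The resulting dates can then be matched by hand to the holiday they fall on.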

In [None]:
# Calculate and plot monthly sales
# Note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end)
monthly_sales = df.resample('M', on='Date')['Weekly_Sales'].sum()
print("Monthly Sales:")
print(monthly_sales.head())

# Plot monthly sales as a bar plot
plt.figure(figsize=(10, 6))
monthly_sales.plot(kind='bar', color='skyblue')
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Total Weekly Sales")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Find the month with the highest sales
max_sales_month = monthly_sales.sort_values(ascending=False).head(1)
print("Month with the highest sales:")
print(max_sales_month)
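The series above has one entry per calendar month of each year. To compare months of the year with each other regardless of year, one option is to group by month number and average, as in this sketch on toy values:

```python
import pandas as pd

# Toy data covering November and December of two different years
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2010-11-26", "2010-12-24", "2011-11-25", "2011-12-23"]
    ),
    "Weekly_Sales": [300.0, 400.0, 320.0, 420.0],
})

# Average weekly sales per month of the year (1 = January, ..., 12 = December)
by_month = toy.groupby(toy["Date"].dt.month)["Weekly_Sales"].mean()

print(by_month)
print("Peak month number:", by_month.idxmax())
```

Averaging rather than summing avoids favoring months that happen to appear in more years of the data.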


In [None]:
# Calculate and plot semesterly sales
semesterly_sales = df.resample('6M', on='Date')['Weekly_Sales'].sum()
print("\nSemesterly Sales:")
print(semesterly_sales.head())

# Plot semesterly sales as a bar plot
plt.figure(figsize=(10, 6))
semesterly_sales.plot(kind='bar', color='skyblue')
plt.title("Semesterly Sales")
plt.xlabel("Semester")
plt.ylabel("Total Weekly Sales")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Find the semester with the highest sales
max_sales_semester = semesterly_sales.sort_values(ascending=False).head(1)
print("Semester with the highest sales:")
print(max_sales_semester)
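Resampling with '6M' anchors its bins to month ends relative to the first observation, so the bins may not line up with calendar halves (January to June, July to December). If calendar semesters are wanted, a possible alternative is to build a semester label by hand. The label format `YYYY-H1`/`YYYY-H2` below is an assumption for illustration:

```python
import pandas as pd

# Toy data spanning three calendar semesters
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2010-02-05", "2010-06-25", "2010-07-02", "2010-12-31", "2011-01-07"]
    ),
    "Weekly_Sales": [100.0, 150.0, 200.0, 300.0, 50.0],
})

# Months 1-6 map to half 1, months 7-12 map to half 2
half = (toy["Date"].dt.month - 1) // 6 + 1
semester = toy["Date"].dt.year.astype(str) + "-H" + half.astype(str)

semester_sales = toy.groupby(semester)["Weekly_Sales"].sum()
print(semester_sales)
```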

In [None]:

# Define the numeric features
numeric_features = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']

# Create a figure with a specified size
plt.figure(figsize=(12, 8))

# Loop through each numeric feature and plot the relationship with Weekly_Sales
# enumerate(..., 1) starts counting at 1 because plt.subplot positions in the 2x2 grid are 1-based
for i, feature in enumerate(numeric_features, 1):
    plt.subplot(2, 2, i)
    
    # sns.regplot draws a scatter plot of Weekly_Sales against the current feature
    # and overlays a fitted regression line (drawn in red here)
    sns.regplot(x=df[feature], y=df['Weekly_Sales'], line_kws={'color': 'red'})
    
    # Add a title to the subplot
    plt.title(f"Weekly Sales vs. {feature}")
    
    # Label the x-axis and y-axis
    plt.xlabel(feature)
    plt.ylabel("Weekly Sales")

# Adjust the layout of the subplots
plt.tight_layout()

# Display the plots
plt.show()
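Regression lines that look nearly flat are hard to read by eye. A simple numeric check is the Pearson correlation of each feature with Weekly_Sales, sketched here on toy values chosen to give one perfectly positive and one perfectly negative relationship:

```python
import pandas as pd

# Toy data: Temperature rises with sales, Unemployment falls as sales rise
toy = pd.DataFrame({
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0],
    "Temperature": [1.0, 2.0, 3.0, 4.0],
    "Unemployment": [8.0, 6.0, 4.0, 2.0],
})

# Pearson correlation of every other column with Weekly_Sales
corr = toy.corr()["Weekly_Sales"].drop("Weekly_Sales")
print(corr)
```

On the real data, `df[numeric_features + ['Weekly_Sales']].corr()['Weekly_Sales']` would give the sign and strength of each relationship plotted above.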


Insights from the Analysis

    1-Store with Maximum Sales:
    Store 20 has the highest total sales.
    
    2-Store with Maximum Standard Deviation in Sales:
    Store 14 has the highest standard deviation in weekly sales, meaning its sales vary more from week to week than those of any other store.
    
    3-Holidays with Higher Sales:
    Some holiday weeks recorded sales above the mean non-holiday weekly sales, which suggests that certain holidays have a positive impact on sales.
    
    4-Monthly and Semester View of Sales:
    December has the highest sales in the monthly view, showing a significant seasonal influence.
    The summer of 2012 has the highest sales in the semester view, indicating a peak sales period.
    
    5-Relationships Between Weekly Sales and Other Numeric Features:
    Weekly Sales vs. Temperature and Fuel Price: Both features show a very weak relationship with weekly sales, as indicated by the wide spread of points and the nearly horizontal regression lines.

    Weekly Sales vs. CPI: A subtle decline in sales is observed as CPI increases, suggesting that higher prices might slightly deter consumer spending.

    Weekly Sales vs. Unemployment: A negative relationship is observed between unemployment and weekly sales, as seen in the scatter plot and regression line.

