
Analysing Walmart sales data to find out whether there is a relationship between weather and sales


mlk500/ML-Walmart-data


Sales Data Analysis and Weather Prediction

This project focuses on analyzing sales data for 111 products sold at 45 different Walmart locations, and predicting daily sales figures and rainy days based on weather data. The project utilizes various data preprocessing techniques, machine learning models, and data visualization to gain insights and make predictions.

Data Sources

The project uses three datasets:

  1. sales.csv: Contains sales data for 111 products sold at 45 different Walmart locations from January 2012 to October 2014.
  2. key.csv: Indicates the weather station associated with each store.
  3. weather.csv: Contains daily weather data for each weather station.
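Since key.csv is the lookup that ties each store to a weather station, the three files can be combined in two joins. The following is a minimal sketch on tiny stand-in frames; the column names (`store_nbr`, `station_nbr`, `item_nbr`, `units`, `date`, `tmax`) and the helper `merge_sales_weather` are assumptions for illustration and are not confirmed by this README.

```python
import pandas as pd

# Tiny stand-ins for the three files. Column names are assumed, not taken from the README.
sales = pd.DataFrame({
    "date": ["2012-01-01", "2012-01-01", "2012-01-02"],
    "store_nbr": [1, 2, 1],
    "item_nbr": [5, 5, 9],
    "units": [3, 0, 7],
})
key = pd.DataFrame({"store_nbr": [1, 2], "station_nbr": [10, 20]})
weather = pd.DataFrame({
    "station_nbr": [10, 20, 10],
    "date": ["2012-01-01", "2012-01-01", "2012-01-02"],
    "tmax": [41.0, 55.0, 38.0],
})


def merge_sales_weather(sales, key, weather):
    """Attach a station to every sales row, then that station's weather for the day."""
    # Each store maps to exactly one station, so this is a many-to-one join.
    with_station = sales.merge(key, on="store_nbr", how="left", validate="many_to_one")
    # Weather is keyed by (station, date); many sales rows share one weather row.
    return with_station.merge(
        weather, on=["station_nbr", "date"], how="left", validate="many_to_one"
    )


merged = merge_sales_weather(sales, key, weather)
print(merged)
```

The `validate` argument makes pandas raise an error if a store maps to more than one station or a station has duplicate rows for a day, which would otherwise silently multiply sales rows.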

Data Exploration and Visualization

The data exploration and visualization phase involved creating informative plots and tables to gain insights into the data. Some key findings include:

  • The number of stores associated with each weather station varies.
  • Average monthly sales per year show seasonal patterns and variations across different years.
  • Scatter plots of units sold against weather variables like snowfall, maximum temperature, and minimum temperature reveal weak relationships between sales and weather variables.

The weak correlations observed in the scatter plots suggest that there might not be a strong direct relationship between sales and individual weather variables. This could indicate that the relationship between sales and weather is more complex and may involve interactions between multiple variables.
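One way to put a number on what the scatter plots show is a pairwise correlation between units sold and each weather variable. The sketch below uses made-up values and assumed column names (`units`, `tmax`, `tmin`, `snowfall`); it illustrates the calculation only and does not reproduce the project's figures.

```python
import pandas as pd

# Made-up rows standing in for the merged sales/weather table (assumed column names).
merged = pd.DataFrame({
    "units": [3, 0, 7, 2, 5, 1],
    "tmax": [41, 55, 38, 60, 47, 52],
    "tmin": [30, 40, 25, 45, 33, 39],
    "snowfall": [0.0, 0.0, 1.2, 0.0, 0.1, 0.0],
})

# Pearson measures linear association; Spearman is rank-based and catches
# monotonic but non-linear relationships.
pearson = merged.corr(method="pearson")["units"].drop("units")
spearman = merged.corr(method="spearman")["units"].drop("units")

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

Coefficients near zero for every variable would be consistent with the weak relationships described above, although a pairwise coefficient cannot reveal interactions between several variables.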

Data Preprocessing

The data preprocessing phase involved several steps:

  1. Handling Missing Values: Columns with more than 50% missing values were dropped, and the remaining missing values were imputed using the K-Nearest Neighbors (KNN) algorithm.

  2. Outlier Detection and Removal: Outliers in the 'units' feature were identified using the Interquartile Range (IQR) method and removed from the dataset.

  3. Feature Engineering: Additional features like day of the week, day, month, and year were extracted from the date column.

  4. Merging Datasets: The sales, key, and weather datasets were merged based on store and date information.
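The first three preprocessing steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebooks' actual code: the helper names, the `n_neighbors` value, the 1.5 × IQR fence, and the demo columns (`tmax`, `depth`) are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df.loc[:, keep]


def knn_impute(df, n_neighbors=5):
    """Fill remaining gaps in numeric columns with KNN imputation."""
    numeric_cols = df.select_dtypes(include="number").columns
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputed = pd.DataFrame(
        imputer.fit_transform(df[numeric_cols]), columns=numeric_cols, index=df.index
    )
    out = df.copy()
    for col in numeric_cols:
        out[col] = imputed[col]
    return out


def remove_iqr_outliers(df, column="units", k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


def add_date_features(df, date_col="date"):
    """Derive day of week, day, month, and year from the date column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["day_of_week"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out["day"] = dates.dt.day
    out["month"] = dates.dt.month
    out["year"] = dates.dt.year
    return out


# Made-up demo data: "depth" is 75% missing, and 500 is an obvious outlier in "units".
demo = pd.DataFrame({
    "date": pd.date_range("2012-01-01", periods=8, freq="D"),
    "units": [1, 2, 2, 3, 3, 4, 2, 500],
    "tmax": [40.0, np.nan, 44.0, 45.0, 43.0, np.nan, 41.0, 42.0],
    "depth": [np.nan] * 6 + [0.0, 1.0],
})

clean = drop_sparse_columns(demo)
clean = knn_impute(clean, n_neighbors=2)
clean = remove_iqr_outliers(clean)
clean = add_date_features(clean)
print(clean)
```

Dropping sparse columns before imputing matters here: `KNNImputer` discards columns that are entirely empty, which would break the column alignment in `knn_impute`.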

Unit Sales Prediction

To predict, per store, the daily total of units sold across products 5, 6, 9, 16, and 45 (key_sum), two machine learning models were used: a Gradient Boosting Regressor and a Decision Tree Regressor. The models were trained on data from 2012-2013 and tested on data from 2014.

  • Gradient Boosting Regressor: The model achieved an MSE of 959.98 before parameter tuning and 991.66 after tuning using GridSearchCV, so tuning did not lower the error for this model.

  • Decision Tree Regressor: The model achieved an MSE of 1835.20 before parameter tuning and 1757.07 after tuning using GridSearchCV.
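The setup described above (build key_sum, split by year, tune with GridSearchCV, score with MSE) can be sketched as below. The data is synthetic, and the column names, feature set, parameter grid, and 3-fold cross-validation are assumptions for illustration; the README does not state which settings the project used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

KEY_ITEMS = [5, 6, 9, 16, 45]  # product numbers named in the README


def build_key_sum(sales, items=KEY_ITEMS):
    """Sum units over the selected items for each (date, store) pair."""
    subset = sales[sales["item_nbr"].isin(items)]
    return (
        subset.groupby(["date", "store_nbr"], as_index=False)["units"]
        .sum()
        .rename(columns={"units": "key_sum"})
    )


def split_by_year(df, test_year=2014):
    """Train on every year before test_year, test on test_year itself."""
    years = pd.to_datetime(df["date"]).dt.year
    return df[years < test_year], df[years == test_year]


# Synthetic sales for one store (assumed columns), spanning late 2013 to early 2014.
rng = np.random.default_rng(0)
dates = pd.date_range("2013-10-01", "2014-03-31", freq="D")
sales = pd.DataFrame({
    "date": dates.repeat(len(KEY_ITEMS)),
    "store_nbr": 1,
    "item_nbr": np.tile(KEY_ITEMS, len(dates)),
})
sales["units"] = rng.poisson(5, size=len(sales))

daily = build_key_sum(sales)
daily["month"] = daily["date"].dt.month
daily["day_of_week"] = daily["date"].dt.dayofweek
features = ["store_nbr", "month", "day_of_week"]  # hypothetical feature set

train, test = split_by_year(daily)

# Small illustrative grid; the project's actual grid is not given in the README.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(train[features], train["key_sum"])

mse = mean_squared_error(test["key_sum"], search.predict(test[features]))
print(search.best_params_, round(mse, 2))
```

Plain k-fold cross-validation ignores the order of the dates. For time-ordered data, `TimeSeriesSplit` from scikit-learn is a common alternative that keeps each validation fold later than its training fold.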


The prediction plots indicate that both models struggle to capture the full range of variability in the sales data, with predictions clustering around the mean. This suggests that the models might be underfitting the data and that there could be other important factors influencing sales that are not captured in the current feature set.

Feature importance analysis was conducted for both models to identify the most influential variables in predicting unit sales. The results show that weather-related features have relatively low importance compared to other features like store and item information.

Rainy Day Prediction

To predict whether it rained or not for store number 11 on a given day, two machine learning models were used: AdaBoost Classifier and Random Forest Classifier. The models were trained on data from 2012-2013 and tested on data from 2014.

  • AdaBoost Classifier: The model achieved an accuracy of 0.602, sensitivity of 0.705, and specificity of 0.504 after parameter tuning using GridSearchCV.
  • Random Forest Classifier: The model achieved an accuracy of 0.606, sensitivity of 0.557, and specificity of 0.654 after parameter tuning using RandomizedSearchCV.

Confusion matrices and feature importance analysis were used to evaluate and interpret the model results. Both models show moderate performance in predicting rainy days, with accuracy close to 0.60. The AdaBoost classifier has higher sensitivity (it correctly identifies a larger share of the days on which it actually rained), while the Random Forest classifier has higher specificity (it correctly identifies a larger share of the days without rain).
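Accuracy, sensitivity, and specificity all follow from the four cells of a binary confusion matrix. The sketch below assumes rainy days are labeled 1 and dry days 0; the helper `rain_metrics` and the example labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix


def rain_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity, with 1 = rain and 0 = no rain."""
    # With labels=[0, 1], ravel() returns the cells in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of rainy days that were caught
        "specificity": tn / (tn + fp),  # share of dry days that were caught
    }


# Made-up labels: 4 rainy days (3 caught) and 6 dry days (4 caught).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = rain_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity separately shows how a model trades missed rainy days against false alarms, which a single accuracy figure hides.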

The feature importance analysis reveals that temporal features like day, month, and day of the week are more influential in predicting rainy days compared to sales-related features. This suggests that there might be seasonal patterns in rainfall that the models are capturing.

Repository Structure

  • data/: Contains the raw and preprocessed datasets.
  • notebooks/: Jupyter notebooks used for data exploration, preprocessing, modeling, and analysis.
  • plots/: Generated plots and visualizations.
  • README.md: Overview of the project, methodology, and findings.

Dependencies

  • Python 3.x
  • pandas
  • numpy
  • matplotlib
  • seaborn
  • scikit-learn

Acknowledgements

This project was completed as part of the Machine Learning course at the University of Haifa. The dataset used in this project was obtained from the Kaggle competition Walmart Recruiting - Sales in Stormy Weather.
