# Walmart Recruiting II: Sales in Stormy Weather

Author: Jingwen ZHENG

Update: 2020-07-01

https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/overview

## Content

- Project understanding
- Objective
- Python packages to be applied
- Import data
- Data description
- Data cleaning
- Data analysis
- Build preprocessing pipeline
- Train data
- Reference

## Project understanding

Walmart operates 11,450 stores in 27 countries, managing inventory across varying climates and cultures. Extreme weather events, like hurricanes, blizzards, and floods, can have a huge impact on sales at the store and product level.

In their second Kaggle recruiting competition, Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.

Intuitively, we may expect an uptick in the sales of umbrellas before a big thunderstorm, but it is difficult for replenishment managers to predict the level of inventory needed to avoid running out of stock or overstocking during and after that storm. Walmart relies on a variety of vendor tools to predict sales around extreme weather events, but it is an ad hoc and time-consuming process that lacks a systematic measure of effectiveness.


## Objective

The goal is to predict how snow and rain affect the sales of weather-sensitive products.
Better predictions help Walmart keep these products in stock when customers need them most.

## Python packages to be applied

In [13]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


## Import data

In [5]:
key_df = pd.read_csv('data/key.csv')
weather_df = pd.read_csv('data/weather.csv')
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

## Data description

In [15]:
key_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 2 columns):
store_nbr      45 non-null int64
station_nbr    45 non-null int64
dtypes: int64(2)
memory usage: 800.0 bytes


In [16]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20517 entries, 0 to 20516
Data columns (total 20 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           20517 non-null object
tmin           20517 non-null object
tavg           20517 non-null object
depart         20517 non-null object
dewpoint       20517 non-null object
wetbulb        20517 non-null object
heat           20517 non-null object
cool           20517 non-null object
sunrise        20517 non-null object
sunset         20517 non-null object
codesum        20517 non-null object
snowfall       20517 non-null object
preciptotal    20517 non-null object
stnpressure    20517 non-null object
sealevel       20517 non-null object
resultspeed    20517 non-null object
resultdir      20517 non-null object
avgspeed       20517 non-null object
dtypes: int64(1), object(19)
memory usage: 3.1+ MB
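Every weather column except `station_nbr` is read as `object`, which suggests that the raw file mixes numbers with text placeholders. The merged preview in the Data cleaning section shows `M` and `-` in several columns. The following is a minimal sketch of one way to convert such columns to numbers, not necessarily the approach used later in this project. It runs on a tiny made-up sample, and the helper `to_number`, the `T` (trace amount) marker, and the choice to map it to `0.0` are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for data/weather.csv. The first row mirrors values shown in
# this notebook; the second row is made up to include a 'T' (trace) reading,
# which is an assumption about the raw file.
sample_csv = io.StringIO(
    "station_nbr,date,tmax,depart,snowfall,preciptotal\n"
    "1,2012-01-01,52,M,M,0.05\n"
    "1,2012-01-02,48,M,0.0,T\n"
)
sample_weather_df = pd.read_csv(sample_csv)


def to_number(series, trace_value=0.0):
    """Hypothetical helper: convert a text column to floats.

    Anything that is not a number (for example 'M' or '-') becomes NaN.
    A 'T' marker is replaced by trace_value (an assumed convention).
    """
    text = series.astype(str).str.strip()
    numbers = pd.to_numeric(text, errors="coerce")
    return numbers.mask(text == "T", trace_value)


numeric_cols = ["tmax", "depart", "snowfall", "preciptotal"]
cleaned_weather_df = sample_weather_df.copy()
cleaned_weather_df[numeric_cols] = cleaned_weather_df[numeric_cols].apply(to_number)
cleaned_weather_df["date"] = pd.to_datetime(cleaned_weather_df["date"])

print(cleaned_weather_df.dtypes)
```

Whether a trace amount should become `0.0`, a small positive value, or stay missing is a modelling decision; the sketch only shows where that decision would be made.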


In [17]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4617600 entries, 0 to 4617599
Data columns (total 4 columns):
date         object
store_nbr    int64
item_nbr     int64
units        int64
dtypes: int64(3), object(1)
memory usage: 140.9+ MB
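With about 4.6 million rows, the training table takes roughly 140 MB, with `date` stored as text and the three integer columns as `int64`. As a hedged sketch of an optional step, the dates can be parsed and the integer columns downcast with `pd.to_numeric(..., downcast="integer")`. The toy frame below is made up for illustration; the original notebook does not apply this step.

```python
import pandas as pd

# Made-up rows with the same columns as train_df.
toy_train_df = pd.DataFrame({
    "date": ["2012-01-01", "2012-01-01", "2012-01-02"],
    "store_nbr": [1, 45, 1],
    "item_nbr": [1, 111, 2],
    "units": [0, 29, 3],
})

compact_df = toy_train_df.copy()
# Parse the text dates into datetime64 values.
compact_df["date"] = pd.to_datetime(compact_df["date"])
# Let pandas pick the smallest integer type that can hold each column.
for col in ["store_nbr", "item_nbr", "units"]:
    compact_df[col] = pd.to_numeric(compact_df[col], downcast="integer")

print(compact_df.dtypes)
print(toy_train_df.memory_usage(deep=True).sum(), "bytes before")
print(compact_df.memory_usage(deep=True).sum(), "bytes after")
```

On real data the chosen integer width depends on the largest value in each column, so `units` may end up wider than `store_nbr` or `item_nbr`.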


## Data cleaning

In [11]:
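# Attach each store to the weather readings of its station (keeps every weather row).
train_cplt_df = pd.merge(key_df, weather_df, on='station_nbr', how='right')
# Add the sales rows; store/date pairs without sales data get NaN in item_nbr and units.
train_cplt_df = pd.merge(train_cplt_df, train_df, on=['date', 'store_nbr'], how='left')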

In [14]:
train_cplt_df.head()

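| | store_nbr | station_nbr | date | tmax | tmin | tavg | depart | dewpoint | wetbulb | heat | cool | sunrise | sunset | codesum | snowfall | preciptotal | stnpressure | sealevel | resultspeed | resultdir | avgspeed | item_nbr | units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2012-01-01 | 52 | 31 | 42 | M | 36 | 40 | 23 | 0 | - | - | RA FZFG BR | M | 0.05 | 29.78 | 29.92 | 3.6 | 20 | 4.6 | 1.0 | 0.0 |
| 1 | 1 | 1 | 2012-01-01 | 52 | 31 | 42 | M | 36 | 40 | 23 | 0 | - | - | RA FZFG BR | M | 0.05 | 29.78 | 29.92 | 3.6 | 20 | 4.6 | 2.0 | 0.0 |
| 2 | 1 | 1 | 2012-01-01 | 52 | 31 | 42 | M | 36 | 40 | 23 | 0 | - | - | RA FZFG BR | M | 0.05 | 29.78 | 29.92 | 3.6 | 20 | 4.6 | 3.0 | 0.0 |
| 3 | 1 | 1 | 2012-01-01 | 52 | 31 | 42 | M | 36 | 40 | 23 | 0 | - | - | RA FZFG BR | M | 0.05 | 29.78 | 29.92 | 3.6 | 20 | 4.6 | 4.0 | 0.0 |
| 4 | 1 | 1 | 2012-01-01 | 52 | 31 | 42 | M | 36 | 40 | 23 | 0 | - | - | RA FZFG BR | M | 0.05 | 29.78 | 29.92 | 3.6 | 20 | 4.6 | 5.0 | 0.0 |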
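In the preview above, `item_nbr` and `units` appear as floats (`1.0`, `0.0`) although they are `int64` in `train_df`. This usually means the left merge produced rows with no matching sales record, since pandas stores the resulting `NaN` values in float columns. The sketch below reproduces the effect on made-up toy frames and uses the `indicator=True` option of `pd.merge` to count unmatched rows. Dropping them is only one possible follow-up, not a step taken from this notebook.

```python
import pandas as pd

# Hypothetical toy frames with the same key columns as the real data.
toy_key_df = pd.DataFrame({"store_nbr": [1, 2], "station_nbr": [1, 2]})
toy_weather_df = pd.DataFrame({
    "station_nbr": [1, 2],
    "date": ["2012-01-01", "2012-01-01"],
    "tmax": [52, 48],
})
# Only store 1 has sales rows on this date.
toy_train_df = pd.DataFrame({
    "date": ["2012-01-01", "2012-01-01"],
    "store_nbr": [1, 1],
    "item_nbr": [1, 2],
    "units": [0, 3],
})

# Same two merges as in the notebook, plus an indicator column on the second.
step1 = pd.merge(toy_key_df, toy_weather_df, on="station_nbr", how="right")
step2 = pd.merge(step1, toy_train_df, on=["date", "store_nbr"],
                 how="left", indicator=True)

# 'left_only' marks store/date pairs that have weather but no sales rows.
print(step2["_merge"].value_counts())
print(step2[["item_nbr", "units"]].dtypes)  # float64 because of the NaN row

# One option: keep only matched rows and restore the integer dtypes.
matched = (
    step2[step2["_merge"] == "both"]
    .drop(columns="_merge")
    .astype({"item_nbr": "int64", "units": "int64"})
)
```

If the unmatched rows correspond to dates that belong to the test set, it may be preferable to keep them aside rather than drop them.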
