# Walmart Sales in Stormy Weather

### Overview 

Walmart operates 11,450 stores in 27 countries, managing inventory across varying climates and cultures. Extreme weather events, like hurricanes, blizzards, and floods, can have a huge impact on sales at the store and product level. 

In their second Kaggle recruiting competition, Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations. 

Intuitively, we may expect an uptick in the sales of umbrellas before a big thunderstorm, but it's difficult for replenishment managers to correctly predict the level of inventory needed to avoid being out-of-stock or overstock during and after that storm. Walmart relies on a variety of vendor tools to predict sales around extreme weather events, but it's an ad-hoc and time-consuming process that lacks a systematic measure of effectiveness. 

Helping Walmart better predict sales of weather-sensitive products will keep valued customers out of the rain. It could also earn you a position at one of the most data-driven retailers in the world! 

Please note: You must compete as an individual in recruiting competitions. You may only use the data provided to make your predictions.

### Data Description

You have been provided with sales data for 111 products whose sales may be affected by the weather (such as milk, bread, and umbrellas). These 111 products are sold at 45 different Walmart locations. Some of the products may be a similar item (such as milk) but have a different id in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e. some of the stores are near one another and share a weather station).

The competition task is to predict the amount of each product sold around the time of major weather events. For the purposes of this competition, we have defined a weather event as any day in which more than an inch of rain or two inches of snow was observed. You are asked to predict the units sold for a window of ±3 days surrounding each storm.
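The event definition and the ±3-day window can be sketched in pandas. This is a minimal illustration on a small made-up frame for a single station, not the organizers' code. It assumes `preciptotal` and `snowfall` are already numeric, that there is one row per consecutive day, and that "more than" means a strict inequality. The helper name `flag_event_windows` is hypothetical.

```python
import pandas as pd

def flag_event_windows(station_weather: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Mark storm days and the days within +/- `window` days of a storm.

    Expects one station's rows, one per consecutive day, with a `date`
    column and numeric `preciptotal` / `snowfall` columns (inches).
    """
    df = station_weather.sort_values("date").reset_index(drop=True).copy()
    # Event as described in the text: more than 1 inch of rain or 2 inches of snow.
    df["event"] = (df["preciptotal"] > 1.0) | (df["snowfall"] > 2.0)
    # A centered rolling max over 2*window+1 rows spreads each event to its neighbours.
    df["in_window"] = (
        df["event"].astype(int)
        .rolling(2 * window + 1, center=True, min_periods=1)
        .max()
        .astype(bool)
    )
    return df

# Toy data (made up): ten consecutive days with a single rain event on the fifth day.
toy = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=10, freq="D"),
    "preciptotal": [0.0, 0.0, 0.0, 0.0, 1.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    "snowfall": [0.0] * 10,
})
flagged = flag_event_windows(toy)
print(flagged[["date", "event", "in_window"]])
```

The rolling window works on row positions, so gaps in the dates would need to be filled (for example by reindexing to a daily calendar) before this sketch gives correct windows.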

The following graphic shows the layout of the test windows. The green dots are the training set days, the red dots are the test set days, and the days marked event=True are the days with storms. Note that this plot covers the 20 weather stations. All days prior to 2013-04-01 are provided as training data.

You are provided with the full observed weather covering the entire data set. You do not need to forecast weather in addition to sales (it's as though you have a perfect weather forecast at your disposal).

You will not be provided with more information about the products, store locations, or other details.

Because the storms occur at variable times and in variable locations, use the test set file (or sample submission) as your guide to know which days and stores you must forecast.

The sales data does not capture the difference between stock and demand. In other words, a sales figure of 0 doesn't necessarily mean there was no demand for the product; it may mean it was in stock but none were sold, or it could mean that the product was out of stock, or discontinued and not available.


#### Field descriptions

date - the day of sales or weather  
store_nbr - an id representing one of the 45 stores  
station_nbr - an id representing one of 20 weather stations  
item_nbr - an id representing one of the 111 products  
units - the quantity sold of an item on a given day  
id - a triplet representing a store_nbr, item_nbr, and date. Form the id by concatenating these (in that order) with an underscore. E.g. "2_1_2013-04-01" represents store 2, item 1, sold on 2013-04-01.
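As a small illustration of the id format, the following sketch builds an id and splits it back into its parts. The function names `make_id` and `parse_id` are hypothetical; only the underscore-joined layout comes from the description above.

```python
def make_id(store_nbr: int, item_nbr: int, date: str) -> str:
    """Join store, item, and date (YYYY-MM-DD) with underscores, in that order."""
    return f"{store_nbr}_{item_nbr}_{date}"

def parse_id(row_id: str) -> tuple:
    """Split an id back into (store_nbr, item_nbr, date)."""
    # The date uses hyphens, so splitting on the first two underscores is safe.
    store, item, date = row_id.split("_", 2)
    return int(store), int(item), date

example_id = make_id(2, 1, "2013-04-01")
print(example_id)
print(parse_id(example_id))
```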


#### File descriptions

key.csv - the relational mapping between stores and the weather stations that cover them  
sampleSubmission.csv - file that gives the prediction format  
train.csv - sales data for all stores & dates in the training set  
test.csv - stores & dates for forecasting (missing 'units', which you must predict). NOTE: This file has been encrypted. To get the password, please fill out Walmart's Recruiting Survey  
weather.csv - a file containing the NOAA weather information for each station and day  
noaa_weather_qclcd_documentation.pdf - a guide to understand the data provided in the weather.csv file  

## Weather.csv

In [7]:
import pandas as pd

weather = pd.read_csv('data/weather.csv')
print(weather.shape)
weather

(20517, 20)


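| | station_nbr | date | tmax | tmin | tavg | depart | dewpoint | wetbulb | heat | cool | sunrise | sunset | codesum | snowfall | preciptotal | stnpressure | sealevel | resultspeed | resultdir | avgspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012-01-01 | 52 | 31 | 42 | M | 36 | 40 | 23 | 0 | - | - | RA FZFG BR | M | 0.05 | 29.78 | 29.92 | 3.6 | 20 | 4.6 |
| 1 | 2 | 2012-01-01 | 48 | 33 | 41 | 16 | 37 | 39 | 24 | 0 | 0716 | 1626 | RA | 0.0 | 0.07 | 28.82 | 29.91 | 9.1 | 23 | 11.3 |
| 2 | 3 | 2012-01-01 | 55 | 34 | 45 | 9 | 24 | 36 | 20 | 0 | 0735 | 1720 | | 0.0 | 0.00 | 29.77 | 30.47 | 9.9 | 31 | 10.0 |
| 3 | 4 | 2012-01-01 | 63 | 47 | 55 | 4 | 28 | 43 | 10 | 0 | 0728 | 1742 | | 0.0 | 0.00 | 29.79 | 30.48 | 8.0 | 35 | 8.2 |
| 4 | 6 | 2012-01-01 | 63 | 34 | 49 | 0 | 31 | 43 | 16 | 0 | 0727 | 1742 | | 0.0 | 0.00 | 29.95 | 30.47 | 14.0 | 36 | 13.8 |
| 5 | 7 | 2012-01-01 | 50 | 33 | 42 | M | 26 | 35 | 23 | 0 | - | - | | 0.0 | 0.00 | 29.15 | 30.54 | 10.3 | 32 | 10.2 |
| 6 | 8 | 2012-01-01 | 66 | 45 | M | M | 34 | 46 | M | M | - | - | RA BR | M | 0.00 | 30.05 | M | 11.0 | 36 | 10.9 |
| 7 | 9 | 2012-01-01 | 34 | 19 | 27 | M | 17 | 23 | 38 | 0 | - | - | UP | M | T | 29.34 | 30.09 | 22.8 | 30 | 22.5 |
| 8 | 10 | 2012-01-01 | 73 | 53 | 63 | M | 55 | 58 | 2 | 0 | 0723 | 1738 | FG+ FG BR | M | 0.00 | 30.16 | 30.19 | 5.1 | 24 | 5.5 |
| 9 | 11 | 2012-01-01 | 72 | 48 | 60 | 7 | 54 | 56 | 5 | 0 | 0724 | 1737 | FG+ FG BR | 0.0 | 0.00 | 30.15 | 30.18 | 4.6 | 23 | 4.8 |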


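- Temperatures are given in degrees Fahrenheit.
- Wind speeds are given in mph.

  **Terminology**
    - date : date of the observation
    - tmax : maximum temperature
    - tmin : minimum temperature
    - tavg : average temperature
    - depart : departure of the average temperature from `normal`
    - dewpoint : average dew point
    - wetbulb : average wet-bulb temperature
    - heat : heating degree days (base 65°F)
    - cool : cooling degree days (base 65°F)
    - sunrise : sunrise time
    - sunset : sunset time
    - codesum : weather codes
        - +FC : TORNADO/WATERSPOUT
        - FC : FUNNEL CLOUD
        - TS : THUNDERSTORM
        - GR : HAIL
        - RA : RAIN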
        - DZ : DRIZZLE
        - SN : SNOW
        - SG : SNOW GRAINS
        - GS : SMALL HAIL &/OR SNOW PELLETS
        - PL : ICE PELLETS
        - IC : ICE CRYSTALS
        - FG+: HEAVY FOG (FG & LE.25 MILES VISIBILITY)
        - FG : FOG
        - BR : MIST
        - UP : UNKNOWN PRECIPITATION
        - HZ : HAZE
        - FU : SMOKE
        - VA : VOLCANIC ASH
        - DU : WIDESPREAD DUST
        - DS : DUSTSTORM
        - PO : SAND/DUST WHIRLS
        - SA : SAND
        - SS : SANDSTORM
        - PY : SPRAY
        - SQ : SQUALL
        - DR : LOW DRIFTING
        - SH : SHOWER
        - FZ : FREEZING
        - MI : SHALLOW
        - PR : PARTIAL
        - BC : PATCHES
        - BL : BLOWING
        - VC : VICINITY
        - (-) : LIGHT
        - (+) : HEAVY
        - "NO SIGN" : MODERATE
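    - snowfall : snowfall amount (inches)
    - preciptotal : total precipitation over 24 hours (inches)
    - stnpressure : average station pressure
    - sealevel : average sea-level pressure, computed using the average temperature from 12 hours earlier rather than the current average temperature
    - resultspeed : resultant wind speed (mph)
    - resultdir : resultant wind direction (in tens of degrees)
    - avgspeed : average wind speed (mph)
    - M : Missing Data
    - T : Trace (an amount too small to measure)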
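Because missing values are coded as `M` and trace amounts as `T`, the measurement columns load as strings rather than numbers. A minimal cleaning sketch follows. The helper name `clean_weather`, the list of columns to convert, and the choice to map `T` to 0.0 are all assumptions; the sample rows are taken from the table above.

```python
import pandas as pd

# Columns to convert in this sketch; extend as needed.
NUMERIC_COLS = ["tmax", "tmin", "tavg", "depart", "snowfall", "preciptotal"]

def clean_weather(weather: pd.DataFrame, trace_value: float = 0.0) -> pd.DataFrame:
    """Convert 'M' (missing) to NaN and 'T' (trace) to `trace_value`."""
    out = weather.copy()
    for col in NUMERIC_COLS:
        values = out[col].astype(str).str.strip()
        is_trace = values == "T"
        # errors="coerce" turns anything non-numeric ("M", "T", "-") into NaN.
        numeric = pd.to_numeric(values, errors="coerce")
        # Restore trace readings as a small numeric value instead of NaN.
        out[col] = numeric.mask(is_trace, trace_value)
    # Split the space-separated weather codes into one list per day.
    out["codes"] = out["codesum"].fillna("").str.split()
    return out

sample = pd.DataFrame({
    "station_nbr": [1, 2, 9],
    "tmax": ["52", "48", "34"],
    "tmin": ["31", "33", "19"],
    "tavg": ["42", "41", "27"],
    "depart": ["M", "16", "M"],
    "codesum": ["RA FZFG BR", "RA", "UP"],
    "snowfall": ["M", "0.0", "M"],
    "preciptotal": ["0.05", "0.07", "T"],
})
cleaned = clean_weather(sample)
print(cleaned[["station_nbr", "depart", "snowfall", "preciptotal", "codes"]])
```

Mapping `T` to 0.0 discards the distinction between "no precipitation" and "a trace"; a small positive constant is an equally defensible choice.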

## Train.csv

In [8]:
sales = pd.read_csv("train.csv")
print(sales.shape)
sales.head()

(4617600, 4)


| | date | store_nbr | item_nbr | units |
|---|---|---|---|---|
| 0 | 2012-01-01 | 1 | 1 | 0 |
| 1 | 2012-01-01 | 1 | 2 | 0 |
| 2 | 2012-01-01 | 1 | 3 | 0 |
| 3 | 2012-01-01 | 1 | 4 | 0 |
| 4 | 2012-01-01 | 1 | 5 | 0 |


## Key.csv
This file maps each store (store_nbr) to the weather station (station_nbr) that records the weather data for the store's location.

In [3]:
keys = pd.read_csv("key.csv")
keys = pd.DataFrame(keys, columns=["station_nbr", "store_nbr"])
keys = keys.sort_values(by=["station_nbr"], ascending=True).reset_index(drop=True)
print(keys.shape)
keys

(45, 2)


| | station_nbr | store_nbr |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 16 |
| 2 | 3 | 29 |
| 3 | 3 | 21 |
| 4 | 3 | 33 |
| 5 | 4 | 8 |
| 6 | 5 | 35 |
| 7 | 6 | 13 |
| 8 | 6 | 7 |
| 9 | 7 | 3 |
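The key file is what links a sales row to its weather: store to station, then station plus date to the weather record. A minimal sketch of that two-step join is shown below on small made-up frames that mirror the column names of train.csv, key.csv, and weather.csv. The `units` values are invented for illustration.

```python
import pandas as pd

# Toy frames (made up) with the same column names as the real files.
sales = pd.DataFrame({
    "date": ["2012-01-01", "2012-01-01"],
    "store_nbr": [1, 16],
    "item_nbr": [1, 1],
    "units": [0, 3],
})
keys = pd.DataFrame({"station_nbr": [1, 2], "store_nbr": [1, 16]})
weather = pd.DataFrame({
    "station_nbr": [1, 2],
    "date": ["2012-01-01", "2012-01-01"],
    "tmax": [52, 48],
})

# Step 1: store -> station (several stores can share one station).
# Step 2: (station, date) -> weather observation.
merged = (
    sales
    .merge(keys, on="store_nbr", how="left", validate="many_to_one")
    .merge(weather, on=["station_nbr", "date"], how="left", validate="many_to_one")
)
print(merged)
```

Using `validate="many_to_one"` makes pandas raise an error if a store maps to more than one station, or a station has two weather rows for the same date, which would otherwise silently duplicate sales rows.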


## First Place Entry for reference and inspiration


by : threecourse
https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/discussion/14452

Thank you to everyone involved in this competition. I'm a newbie in data science and this was my first non-playground competition, so I'm really surprised and glad to win.

I'm not great at English, so I wrote this method description as an itemized list.

Train model

1. Exclude item/stores whose units are all zeros.

2. For each item/stores,
apply curve fitting by R ppr function (projection pursuit regression).
y = log1p_units, x = days from 2012-01-01

here, data on 2013-12-25 are excluded. (because units are almost all zeros)

3. Train linear model with lasso using vowpal wabbit.
y = log1p_units - ppr_fitted

features :
- A : weekday, is_weekend, is_holiday, is_holiday_and_weekday, is_holiday_and_weekend
- B : item_nbr
- C : store_nbr
- D : date
- E : year, month, day
- F : is_BlackFriday-3days, -2days, -1day, is_BlackFriday, +1day, +2days, +3days
- G : weather features (is preciptotal > 0.2, depart > 8, depart < -8)
- interactions A*B A*C B*E C*E B*F C*F

here, below are excluded:
- on 2013-12-25
- moving average(21 elements, centered) is zero.

4. Mark dates as "too much zeros" where both sides are many successive zeros.
4-1. for dates whose units are not zero, calculate minimum of both side successive zeros (= min_side_zeros).
4-2. for each item/stores, 
calculate maximum of min_side_zeros (= max_min_side_zeros), floor and ceiling by 1 and 9.
4-3. for each item/stores,
mark dates as "too much zeros" where both sides are successive zeros more than max_min_side_zeros.
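Step 4 is compact, so here is one possible reading of it in plain Python. It is a sketch of an interpretation, not the author's code (which is in R). The function names are hypothetical, and treating unknown test days (`None`) as days to skip when counting zero runs is an assumption.

```python
from typing import List, Optional

def side_zero_runs(units: List[Optional[int]], i: int) -> int:
    """Smaller of the zero-run lengths immediately left and right of day i.

    Unknown days (None) are skipped: they neither add to a run nor end it.
    That handling is an assumption of this sketch.
    """
    def run(step: int) -> int:
        count, j = 0, i + step
        while 0 <= j < len(units):
            if units[j] is None:
                pass              # unknown day: skip it
            elif units[j] == 0:
                count += 1        # extend the run of zeros
            else:
                break             # a day with sales ends the run
            j += step
        return count
    return min(run(-1), run(+1))

def mark_too_many_zeros(units: List[Optional[int]]) -> List[bool]:
    # 4-1: min side zeros for every day with known, non-zero sales.
    mins = [side_zero_runs(units, i) for i, u in enumerate(units) if u]
    # 4-2: take the maximum and clamp it to the range [1, 9].
    threshold = min(max(max(mins, default=0), 1), 9)
    # 4-3: flag days whose zero runs on both sides exceed the threshold.
    return [side_zero_runs(units, i) > threshold for i in range(len(units))]

# One item/store series; None stands for a test day with unknown units.
series = [5, 0, 4, 0, 0, 0, None, 0, 0, 0, 6]
flags = mark_too_many_zeros(series)
print(flags)
```

In this toy series the unknown day sits in the middle of a long run of zeros, so it is flagged and would be predicted as zero.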

Prediction on test set

predicted_log1p = ppr_fitted(train-2) + linear model predicted(train-3)
predicted = exp(predicted_log1p) - 1

here, below are predicted as zero.
- item/stores whose units are all zeros. 
- on 2013-12-25
- moving average(21 elements, centered) is zero.
- "too much zeros" (train-4)
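The prediction rule above amounts to adding the two model outputs in log1p space and inverting the transform. A minimal sketch follows; the function name `predict_units` and the `force_zero` flag (standing in for the four zero rules listed above) are hypothetical, and clipping negative results to zero is an added assumption not stated in the post.

```python
import math

def predict_units(ppr_fitted: float, linear_residual: float, force_zero: bool) -> float:
    """Combine baseline and residual model in log1p space, then invert."""
    if force_zero:
        # Covers the cases listed above: all-zero item/stores, 2013-12-25,
        # zero moving average, and "too much zeros".
        return 0.0
    predicted_log1p = ppr_fitted + linear_residual
    # expm1 is the inverse of log1p; the clip at zero is an assumption of this sketch.
    return max(math.expm1(predicted_log1p), 0.0)

print(predict_units(math.log1p(10), 0.0, force_zero=False))
print(predict_units(2.0, 0.5, force_zero=True))
```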

Comments

The core idea is very simple:
1. Create a baseline for each item/stores.
2. Apply linear regression using vowpal wabbit with many features.

As for baseline:
- R ppr functions fit really nice on almost all item/stores (can be improved on some item/stores). 
- At first I used a moving average. It worked, but it fluctuated too much or picked up values that were too distant.
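For reference, the moving-average baseline mentioned above could look like the sketch below. It is not the author's code and it does not reproduce R's `ppr`. The 21-day centered window is borrowed from the exclusion rule in the training steps, and averaging in log1p space is an assumption. The function name `moving_average_baseline` is hypothetical.

```python
import numpy as np
import pandas as pd

def moving_average_baseline(units: pd.Series, window: int = 21) -> pd.Series:
    """Centered moving average of log1p(units) for one item/store series."""
    log_units = np.log1p(units.astype(float))
    # min_periods=1 keeps the ends of the series, where the window is incomplete.
    return log_units.rolling(window, center=True, min_periods=1).mean()

# Toy series (made up): constant sales of 3 units for 30 days.
units = pd.Series([3] * 30)
baseline = moving_average_baseline(units)
# The residual is what a second-stage model would be trained on.
residual = np.log1p(units.astype(float)) - baseline
print(baseline.head())
```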

As for features:
- weekday is the most important
- some store/items show monthly periodicity
- sales fluctuate a lot around Black Friday
- weather features have almost no effect
   In the data, people go shopping as usual however much it rains. 
   That's not natural, so I guess the weather data came from different stations.
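A few of the feature groups named in the post (weekday, month, Black Friday offsets, and the weather thresholds from group G) can be sketched with the standard library. This is an illustration only, not the author's feature code. The function and key names are hypothetical, the holiday flags are omitted, and Black Friday is taken to be the day after the fourth Thursday of November.

```python
import datetime as dt

def black_friday(year: int) -> dt.date:
    """Day after US Thanksgiving (the fourth Thursday of November)."""
    first = dt.date(year, 11, 1)
    # weekday(): Monday=0 ... Thursday=3
    first_thursday = first + dt.timedelta(days=(3 - first.weekday()) % 7)
    return first_thursday + dt.timedelta(weeks=3, days=1)

def calendar_features(day: dt.date) -> dict:
    offset = (day - black_friday(day.year)).days
    return {
        "weekday": day.weekday(),
        "is_weekend": day.weekday() >= 5,
        "month": day.month,
        # Offset in days from Black Friday when within +/-3 days, else None.
        "black_friday_offset": offset if abs(offset) <= 3 else None,
    }

def weather_features(preciptotal: float, depart: float) -> dict:
    """Threshold flags corresponding to feature group G in the post."""
    return {
        "rain_gt_0.2": preciptotal > 0.2,
        "depart_gt_8": depart > 8,
        "depart_lt_-8": depart < -8,
    }

print(calendar_features(dt.date(2013, 11, 27)))
print(weather_features(preciptotal=0.07, depart=16))
```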

Considering successive zeros was my final push; it slightly improved the score.

Code

Uploaded to GitHub: https://github.com/threecourse/kaggle-walmart-recruiting-sales-in-stormy-weather