## World Wide Products!

This week, we will look at forecasting models, using a sample dataset of historical product demand.

In [1]:
# import
import pandas as pd
import numpy as np

In [2]:
# read the data and drop rows with a missing date
df = pd.read_csv("../data/external/Historical Product Demand.csv")
df = df.dropna(subset=["Date"])
df.head(5)

Unnamed: 0,Product_Code,Warehouse,Product_Category,Date,Order_Demand
0,Product_0993,Whse_J,Category_028,2012/7/27,100
1,Product_0979,Whse_J,Category_028,2012/1/19,500
2,Product_0979,Whse_J,Category_028,2012/2/3,500
3,Product_0979,Whse_J,Category_028,2012/2/9,500
4,Product_0979,Whse_J,Category_028,2012/3/2,500


In [3]:
# convert Date to datetime and Order_Demand to integer;
# values that cannot be parsed as numbers become NaN and are dropped
df['Date'] = pd.to_datetime(df['Date'])
df['Order_Demand'] = pd.to_numeric(df['Order_Demand'], errors='coerce')
df = df.dropna().copy()
df['Order_Demand'] = df['Order_Demand'].astype('int')
df.head(5)

Unnamed: 0,Product_Code,Warehouse,Product_Category,Date,Order_Demand
0,Product_0993,Whse_J,Category_028,2012-07-27,100
1,Product_0979,Whse_J,Category_028,2012-01-19,500
2,Product_0979,Whse_J,Category_028,2012-02-03,500
3,Product_0979,Whse_J,Category_028,2012-02-09,500
4,Product_0979,Whse_J,Category_028,2012-03-02,500


In [4]:
# create new year and month features 
df['Year'], df['Month'] = df['Date'].dt.year, df['Date'].dt.month
df.head(5)

Unnamed: 0,Product_Code,Warehouse,Product_Category,Date,Order_Demand,Year,Month
0,Product_0993,Whse_J,Category_028,2012-07-27,100,2012,7
1,Product_0979,Whse_J,Category_028,2012-01-19,500,2012,1
2,Product_0979,Whse_J,Category_028,2012-02-03,500,2012,2
3,Product_0979,Whse_J,Category_028,2012-02-09,500,2012,2
4,Product_0979,Whse_J,Category_028,2012-03-02,500,2012,3


In [5]:
# check how many rows and columns we are working with
df.shape

(1031437, 7)

In [6]:
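# label-encode the categorical columns as integers (the model is fit in a later cell)
# note: reusing one encoder means only the last column's mapping is kept on `e`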
from sklearn.preprocessing import LabelEncoder
e = LabelEncoder()
df['Product_Code'] = e.fit_transform(df['Product_Code'])
df['Warehouse'] = e.fit_transform(df['Warehouse'])
df['Product_Category'] = e.fit_transform(df['Product_Category'])

In [7]:
df.head(5)

Unnamed: 0,Product_Code,Warehouse,Product_Category,Date,Order_Demand,Year,Month
0,982,2,27,2012-07-27,100,2012,7
1,968,2,27,2012-01-19,500,2012,1
2,968,2,27,2012-02-03,500,2012,2
3,968,2,27,2012-02-09,500,2012,2
4,968,2,27,2012-03-02,500,2012,3
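
Because the encoding cell reuses a single `LabelEncoder`, only the mapping for `Product_Category` remains available afterwards, so the integer codes for `Product_Code` and `Warehouse` cannot be turned back into labels. A minimal sketch of one way to keep every mapping, assuming a hypothetical `encoders` dict and a toy frame standing in for the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Product_Code": ["Product_0993", "Product_0979", "Product_0979"],
    "Warehouse": ["Whse_J", "Whse_J", "Whse_A"],
})

# Hypothetical helper dict: one fitted encoder per column.
encoders = {}
for col in ["Product_Code", "Warehouse"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Codes can now be decoded with the encoder that produced them.
decoded = encoders["Product_Code"].inverse_transform(toy["Product_Code"])
print(list(decoded))
```

Keeping the encoders around also makes it possible to apply the same mapping to new rows with `transform` instead of refitting.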


In [8]:
# split into features and target, then hold out 20% of the rows for testing
from sklearn.model_selection import train_test_split
decision = df[['Order_Demand']]
features = df[['Product_Code','Warehouse', 'Product_Category', 'Year', 'Month']]
train, test, train_d, test_d = train_test_split(features,
                                                decision,
                                                test_size = 0.2,
                                                random_state = 14)

In [9]:
# Try random forest regression
from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor()
rfr = RF.fit(train, train_d.values.ravel())

In [10]:
# Test
test_rfr = RF.predict(test)
test_rfr

array([2.30586018e+03, 1.78960317e+01, 1.87364231e+03, ...,
       3.30425056e+04, 1.19870775e+04, 8.00000000e+03])

In [11]:
# R-squared on the held-out test set (score() returns R^2 for regressors, not accuracy)
rfr.score(test, test_d)

0.15245315408617266

With an R-squared of only about 0.15, the random forest regressor performs poorly on this dataset. I want to see whether a different model does better.
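
For reference, `score` on a scikit-learn regressor returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, so a value near 0.15 means the model accounts for roughly 15% of the variance in test-set demand. The sketch below works through the calculation on made-up numbers (the arrays are illustrative, not taken from the dataset) and compares it with `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only, not drawn from the demand data.
y_true = np.array([100.0, 500.0, 500.0, 1000.0, 2000.0])
y_pred = np.array([300.0, 400.0, 700.0, 900.0, 1500.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# Predicting the mean for every row gives R^2 = 0 by construction.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_pred), r2_mean)
```

A score of 0 is therefore the level of a constant mean prediction, and anything below 0 is worse than that baseline.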

In [12]:
# SVR training scales poorly with the number of rows, so take a small random sample
# (frac=0.0001 of ~1.03M rows is about 100 rows; this overwrites df) and resplit
df = df.sample(frac=0.0001)
decision = df[['Order_Demand']]
features = df[['Product_Code','Warehouse', 'Product_Category', 'Year', 'Month']]
train, test, train_d, test_d = train_test_split(features,
                                                decision,
                                                test_size = 0.2,
                                                random_state = 14)

In [13]:
# adapted from the scikit-learn SVR examples (scikit-learn.org)
# y_rbf and y_lin hold R-squared scores on the test set, not predictions
from sklearn.svm import SVR
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_lin = SVR(kernel='linear', C=1e3)
y_rbf = svr_rbf.fit(train, train_d.values.ravel()).score(test, test_d)
y_lin = svr_lin.fit(train, train_d.values.ravel()).score(test, test_d)

In [14]:
print(y_rbf)
print(y_lin)

-0.019834762709035392
-0.1947895555116128
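
Both scores are negative, meaning each SVR does worse on this test split than a constant prediction of the mean. The very small sample and the unscaled integer codes used as features may both contribute, since SVR with an RBF kernel is sensitive to feature scale. One thing that could be tried, which this notebook does not do, is standardizing the features before fitting. The sketch below uses synthetic data with scikit-learn's `make_pipeline` and `StandardScaler`; the data, seed, and variable names are assumptions for illustration, and it makes no claim about how the real dataset would score:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: 200 rows, 5 features on very different scales.
rng = np.random.default_rng(14)
X = rng.normal(size=(200, 5)) * np.array([1000.0, 3.0, 30.0, 2.0, 12.0])
y = 0.01 * X[:, 0] + 2.0 * X[:, 4] + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=14
)

# The scaler is fit on the training split only, then applied to the test split.
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1e3, gamma=0.1))
scaled_svr.fit(X_train, y_train)
scaled_score = scaled_svr.score(X_test, y_test)  # R^2 on held-out rows
print(scaled_score)
```

Wrapping the scaler and the model in one pipeline keeps test rows out of the scaling statistics, which matters when comparing scores across models.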
