# World Wide Products Inc.

## Introduction
Demand for a product often varies with the season, so some products are in higher demand than others at a given time of year. Given a dataset of order dates and order quantities, can future demand be forecast?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

from sklearn.ensemble import GradientBoostingRegressor as GBR #GBM algorithm
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn import metrics


#Import dataset
products=pd.read_csv('../data/external/Historical Product Demand.csv')

## Transforming Data

The dataset contains five attributes: product code, warehouse, product category, date needed and order quantity. An inspection with Google Facets shows no missing values apart from a few dates labeled NA; those rows are dropped.

Order demand also needs normalization because of its wide range: most orders are under 400, while the largest reach about 4 million. Some entries contain characters other than digits; these are stripped before the column is converted to integers.

The models need numerical input. The "category_" prefix is dropped from product category and the "Product_" prefix from product code, leaving only the digits. These codes are identifiers rather than quantities, so they may need further adjustment to keep them from affecting scaling. Lastly, the warehouse name is hashed into an integer.


In [2]:
#Drop NA dates
products=products[products['Date'] != 'NA']
products=products.dropna(subset=['Date'])

#Normalize order demand
products['Order_Demand']=products['Order_Demand'].str.replace('[^0-9]', '', regex=True)
products['Order_Demand']=products['Order_Demand'].astype("int")
products['Order_Demand']=(products['Order_Demand']-products['Order_Demand'].min())/(products['Order_Demand'].max()-products['Order_Demand'].min())

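#Keep only the digits of the category and product identifiers
products['Product_Category']=products['Product_Category'].str.replace('[^0-9]','',regex=True)
products['Product_Code']=products['Product_Code'].str.replace('[^0-9]','',regex=True)

#Note: the built-in hash() of a string changes between Python processes unless PYTHONHASHSEED is fixed
products['Warehouse']=products['Warehouse'].apply(hash)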
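
The cleaning steps above can be tried on a small frame. This is a minimal sketch, not the notebook's code: the toy values and the helper names `clean_and_scale` and `encode_stable` are made up for illustration, and `pd.factorize` is swapped in for `hash` as one way to get integer codes that stay the same across runs.

```python
import pandas as pd

# Hypothetical toy frame mimicking two of the raw columns; the values are made up.
raw = pd.DataFrame({
    "Warehouse": ["Whse_J", "Whse_S", "Whse_J", "Whse_A"],
    "Order_Demand": ["100 ", "(500)", "2000", "0"],
})

def clean_and_scale(demand: pd.Series) -> pd.Series:
    """Strip non-digit characters, cast to int, then min-max scale to [0, 1]."""
    digits = demand.str.replace(r"[^0-9]", "", regex=True).astype("int64")
    span = digits.max() - digits.min()
    if span == 0:
        # A constant column has no range to scale; return zeros instead of dividing by zero.
        return pd.Series(0.0, index=demand.index)
    return (digits - digits.min()) / span

def encode_stable(labels: pd.Series) -> pd.Series:
    """Map each distinct label to an integer code that is identical on every run."""
    codes, _ = pd.factorize(labels, sort=True)
    return pd.Series(codes, index=labels.index)

scaled = clean_and_scale(raw["Order_Demand"])
warehouse_codes = encode_stable(raw["Warehouse"])
print(scaled.tolist())
print(warehouse_codes.tolist())
```

With `sort=True` the codes follow the alphabetical order of the labels, so rerunning the notebook yields the same warehouse column each time.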

## Feature Extraction

A date carries more information than a single value, so several features are derived from it and stored in separate columns. Seasons can also be extracted, under the assumption that the orders are placed in the USA.

The seasons are defined as follows:

* Spring(1): Mar(3) 20 - Jun(6) 20

* Summer(2): Jun(6) 21 - Sept(9) 21

* Fall(3): Sept(9) 22 - Dec(12) 20

* Winter(4): Dec(12) 21 - Mar(3) 19

However, these date ranges are awkward to program because the day of the month resets every month. Labeling each date with its day of the year gives one numerical range per season, matching the bounds used in the code:

* Spring(1): [80-172)

* Summer(2): [172-265)

* Fall(3): [265-356)

* Winter(4): All else
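
The day-of-year ranges can also be expressed as a single lookup. The sketch below is an alternative to the chained `np.where` calls in the next cell, not the notebook's own method; it assumes the boundaries 80, 172, 265 and 356 from that cell, and the names `SEASON_EDGES`, `SEASON_BY_BIN` and `season_from_day_of_year` are hypothetical.

```python
import numpy as np

# Day-of-year boundaries taken from the notebook's code cell.
SEASON_EDGES = np.array([80, 172, 265, 356])
# np.digitize returns a bin index 0..4; bin 0 (before day 80) and bin 4 (day 356 onward) are both winter.
SEASON_BY_BIN = np.array([4, 1, 2, 3, 4])

def season_from_day_of_year(day_of_year):
    """Return 1=spring, 2=summer, 3=fall, 4=winter for each day of the year."""
    bins = np.digitize(day_of_year, SEASON_EDGES)
    return SEASON_BY_BIN[bins]

days = np.array([1, 79, 80, 171, 172, 264, 265, 355, 356, 366])
seasons = season_from_day_of_year(days)
print(seasons)
```

Because the boundaries are fixed day numbers, the season starts shift by one calendar day in leap years.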

Aside from the season, the day of the week, the week of the year and a weekday flag are extracted from the date.

Lastly, all data needs to be numerical, so the date is written out in YearMonthDay format.

In [3]:
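products['Date']=pd.to_datetime(products['Date'])

#ISO day of week: Monday=1 ... Sunday=7
products['DayofWeek']=products.Date.dt.dayofweek+1
products['DayofYear']=products.Date.dt.dayofyear
#Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
products['Week']=products.Date.dt.isocalendar().week.astype('int64')
#dt.weekday is 0 for Monday, so this flag is 0 only on Mondays and 1 on every other day
products['Isweekday']=products.Date.dt.weekday
products['Isweekday']=np.where(products['Isweekday'] >0,1,0)

#Season from day of year: 1=spring, 2=summer, 3=fall, 4=winter (default)
products['Season']=np.where((products.DayofYear>79)&(products.DayofYear<172), 1, 4)
products['Season']=np.where((products.DayofYear>171)&(products.DayofYear<265), 2, products['Season'])
products['Season']=np.where((products.DayofYear>264)&(products.DayofYear<356), 3, products['Season'])

#Store the date as a YearMonthDay string, e.g. 20120727
products['Date'] = products.Date.dt.strftime('%Y%m%d')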

| | Product_Code | Warehouse | Product_Category | Date | Order_Demand | DayofWeek | DayofYear | Week | Isweekday | Season |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0993 | -8354174887402860729 | 028 | 20120727 | 2.500000e-05 | 5 | 209 | 30 | 1 | 2 |
| 1 | 0979 | -8354174887402860729 | 028 | 20120119 | 1.250000e-04 | 4 | 19 | 3 | 1 | 4 |
| 2 | 0979 | -8354174887402860729 | 028 | 20120203 | 1.250000e-04 | 5 | 34 | 5 | 1 | 4 |
| 3 | 0979 | -8354174887402860729 | 028 | 20120209 | 1.250000e-04 | 4 | 40 | 6 | 1 | 4 |
| 4 | 0979 | -8354174887402860729 | 028 | 20120302 | 1.250000e-04 | 5 | 62 | 9 | 1 | 4 |
| 5 | 0979 | -8354174887402860729 | 028 | 20120419 | 1.250000e-04 | 4 | 110 | 16 | 1 | 1 |
| 6 | 0979 | -8354174887402860729 | 028 | 20120605 | 1.250000e-04 | 2 | 157 | 23 | 1 | 1 |
| 7 | 0979 | -8354174887402860729 | 028 | 20120627 | 1.250000e-04 | 3 | 179 | 26 | 1 | 2 |
| 8 | 0979 | -8354174887402860729 | 028 | 20120723 | 1.250000e-04 | 1 | 205 | 30 | 0 | 2 |
| 9 | 0979 | -8354174887402860729 | 028 | 20120829 | 1.250000e-04 | 3 | 242 | 35 | 1 | 2 |
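
The date features can also be derived entirely with the `.dt` accessor. The sketch below is a possible variant rather than the notebook's code: it assumes `Isweekday` is meant to be 1 for Monday through Friday, which differs from the cell above (where only Monday is flagged 0), and it stores the date as an integer instead of a string. The four sample dates are chosen for illustration.

```python
import pandas as pd

# Four illustrative dates: a Friday, a Thursday, a Monday and a Saturday.
dates = pd.to_datetime(pd.Series(["2012-07-27", "2012-01-19", "2012-07-23", "2012-07-28"]))

features = pd.DataFrame({
    # ISO weekday: Monday=1 ... Sunday=7
    "DayofWeek": dates.dt.dayofweek + 1,
    "DayofYear": dates.dt.dayofyear,
    # ISO week number; isocalendar() returns UInt32, so cast to a plain integer dtype
    "Week": dates.dt.isocalendar().week.astype("int64"),
    # Assumption: a weekday is Monday (0) through Friday (4)
    "Isweekday": (dates.dt.dayofweek < 5).astype("int64"),
    # YearMonthDay as an integer, e.g. 20120727
    "Date": dates.dt.strftime("%Y%m%d").astype("int64"),
})
print(features)
```

Under this definition the Monday row gets `Isweekday` 1 and the Saturday row gets 0.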


## Setting up Data
Before a model is applied, the data is split into training, testing and validation sets. The split is based on dates: the most recent 10% of dates form the validation set, the next most recent 10% form the testing set, and the remaining dates are used for training.

The dataset has 1,037,336 rows, so 10 percent is 103,733 rows, and testing and validation together take 207,466 rows. The split is made by row position, which assumes the rows are already sorted by date.

For comparison, the data is also randomly sampled into the same three categories.

In [8]:
#Split by row position: oldest rows for training, next 10% for testing, last 10% for validation
training=products.iloc[0:829870,:]
testing=products.iloc[829870:933603,:]
validation=products.iloc[933603:1037336,:]

train_y=training['Order_Demand']
train_x=training.drop(columns=['Order_Demand'])

test_y=testing['Order_Demand']
test_x=testing.drop(columns=['Order_Demand'])

val_y=validation['Order_Demand']
val_x=validation.drop(columns=['Order_Demand'])

featu=products.drop(columns=['Order_Demand'])

#Convert to numpy array
features=np.array(featu)
label=products['Order_Demand']

#---------Random split: 10% testing, then 15% of the remainder for validation--------
x, x_test, y, y_test = train_test_split(features,label,test_size=0.1,train_size=0.9)
x_train, x_val, y_train, y_val = train_test_split(x,y,test_size = 0.15,train_size =0.85)
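
The positional split only reflects time if the rows are in date order. A minimal sketch of a split that sorts first and derives the cut points from fractions, instead of hard-coded row numbers, is shown below; the helper `chronological_split` and the toy frame are hypothetical.

```python
import pandas as pd

def chronological_split(frame, date_col, test_frac=0.1, val_frac=0.1):
    """Sort by date, then slice: oldest rows train, next rows test, most recent rows validate."""
    ordered = frame.sort_values(date_col, kind="stable").reset_index(drop=True)
    n = len(ordered)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_test - n_val
    train = ordered.iloc[:n_train]
    test = ordered.iloc[n_train:n_train + n_test]
    val = ordered.iloc[n_train + n_test:]
    return train, test, val

# Toy frame with 20 daily rows stored in reverse date order.
toy = pd.DataFrame({
    "Date": pd.date_range("2012-01-01", periods=20, freq="D")[::-1],
    "Order_Demand": range(20),
})
train, test, val = chronological_split(toy, "Date")
print(len(train), len(test), len(val))
```

For 1,037,336 rows the same arithmetic gives 829,870 training rows and 103,733 rows each for testing and validation, matching the slice bounds used above.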

## Data Modeling

Determining the demand for each product requires forecasting. With many products to forecast, two models are implemented: Gradient Boosting and Random Forests.

In [7]:
#Gradient Boosting
algorithm=GBR()
algorithm.fit(train_x,train_y)

#Score on the date-based test and validation sets
predictions=algorithm.predict(test_x)
print('Parsed on date relevance:')
print('R-squared Test: ', algorithm.score(test_x,test_y))
predVal=algorithm.predict(val_x)
print('R-squared Validation: ', algorithm.score(val_x,val_y))

#NOTE: the model is not refit on x_train here; the model fit on train_x is scored on the random split
predictions=algorithm.predict(x_test)
print('Parsed randomly:')
print('R-squared Test: ', algorithm.score(x_test,y_test))
predVal=algorithm.predict(x_val)
print('R-squared Validation: ', algorithm.score(x_val,y_val))


Parsed on date relevance:
R-squared Test:  0.19493815243899082
R-squared Validation:  0.07422608927122298
Parsed randomly:
R-squared Test:  0.1815793016813353
R-squared Validation:  0.1781623552256183


In [6]:
#Random Forests
rf=RFR()
rf.fit(x_train,y_train)

#NOTE: rf is fit on the random split only; x_train can contain rows that also appear in test_x and val_x
predictions=rf.predict(test_x)
print('Parsed on date relevance:')
print('R-squared Test: ', rf.score(test_x,test_y))
predVal=rf.predict(val_x)
print('R-squared Validation: ', rf.score(val_x,val_y))

#Score on the random test and validation sets
predictions=rf.predict(x_test)
print('Parsed randomly:')
print('R-squared Test: ', rf.score(x_test,y_test))
predictions=rf.predict(x_val)
print('R-squared Validation: ', rf.score(x_val,y_val))

Parsed on date relevance:
R-squared Test:  0.5607536166298456
R-squared Validation:  0.25305422457369176
Parsed randomly:
R-squared Test:  0.08849291713482443
R-squared Validation:  0.08947666765184914
R-squared Test:  0.08947666765184914


## Analysis
Of the two models, random forests scored best on the date-based test set (R-squared of 0.561) but much lower on the validation set (0.253), which contains the most recent orders. Performance is measured with the R-squared score, where a value closer to 1 means the model explains more of the variance in demand. In the runs shown above, random forests ranged between 0.088 and 0.561, and gradient boosting ranged between 0.074 and 0.195. The random forest was fit on the random split, which can include rows from the date-based test and validation sets, so its date-based scores may be optimistic.
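
For reference, the R-squared score reported by `score` on scikit-learn regressors is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers; the function name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; undefined when y_true is constant (SS_tot = 0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                      # predictions equal the targets
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])      # always predicting the mean
reversed_fit = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])   # worse than predicting the mean
print(perfect, mean_only, reversed_fit)
```

A score of 0 means the model does no better than always predicting the mean demand, and negative scores are possible when it does worse.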

The low share of explained variance could be due to the small number of features. Factors such as weather conditions and warehouse reputation can also affect orders. With only date, order and warehouse information as input, these two models do not forecast demand well.


## Conclusion
As stated before, orders vary with supply and demand. Gradient boosting and random forests can be used to forecast that demand. However, with R-squared scores of at most about 0.56 for either model, neither forecast is reliable on this dataset.


## References
#### How to determine seasons
https://www.almanac.com/content/first-day-seasons 
https://stackoverflow.com/questions/16139306/determine-season-given-timestamp-in-python-using-datetime?rq=1 

#### To understand how all products can be used for forecasting and what works the best for this dataset
https://datascience.stackexchange.com/questions/31267/demand-forecasting-for-multiple-products-across-thousands-of-stores

#### How to use gradient boosting
https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/ 

https://shankarmsy.github.io/stories/gbrt-sklearn.html

