*Going for a [walk](https://xcitech.github.io/tutorials/heroku_tutorial/) in the (random) forest.*

# Random forests

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import *
from sklearn.metrics import r2_score
import pickle
import requests
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

#### Load data
Recall that `low_memory=False` tells pandas to read in the whole file before inferring each column's data type, rather than inferring types chunk by chunk. I'm not using it now, but may later.
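As a minimal sketch of the two usual ways to avoid mixed-type columns, the snippet below reads a tiny in-memory CSV once with `low_memory=False` and once with explicit dtypes. The CSV text and the `df_inferred` / `df_typed` names are made up for illustration and are not part of this project.

```python
import io

import pandas as pd

# Toy CSV standing in for a real file (illustrative values only)
csv_text = "brand,price\nlululemon,37.95\nnike,15.0\n"

# Option 1: parse the file in one pass, then infer one dtype per column
df_inferred = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the dtypes up front so nothing has to be inferred
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"brand": str, "price": "float64"})

print(df_inferred.dtypes)
print(df_typed.dtypes)
```

Declaring dtypes is usually the cheaper option on large files, since `low_memory=False` trades memory for consistent inference.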

In [2]:
cleaned_df = pd.read_csv('cleaned_df.csv')

In [3]:
# Dropping Summary column- not useful at this point- and auto-generated indexing column ('Unnamed: 0')
cleaned_df = cleaned_df.drop(['Unnamed: 0'], axis=1)
cleaned_df = cleaned_df.drop(['summary'], axis=1)

In [4]:
cleaned_df.head()

| | brand | category | condition | price | site |
|---|---|---|---|---|---|
| 0 | lululemon | bottoms | new | 37.95 | ebay |
| 1 | lululemon | bras | new | 15.0 | ebay |
| 2 | lululemon | bottoms | new | 69.0 | ebay |
| 3 | lululemon | bottoms | new | 22.5 | ebay |
| 4 | lululemon | tops | new | 14.99 | ebay |


### One hot encoding
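Here, I take the categorical data and convert it to a numerical representation: one binary (0/1) indicator column per category level, with the first level of each variable dropped as the reference.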

In [5]:
# One-hot encode the data using pandas get_dummies
cleaned_df = pd.concat([cleaned_df,pd.get_dummies(cleaned_df['brand'],drop_first=True,prefix="brand")],axis=1)
cleaned_df = pd.concat([cleaned_df,pd.get_dummies(cleaned_df['condition'],drop_first=True,prefix="condition")],axis=1)
cleaned_df = pd.concat([cleaned_df,pd.get_dummies(cleaned_df['category'],drop_first=True,prefix="category")],axis=1)
cleaned_df = pd.concat([cleaned_df,pd.get_dummies(cleaned_df['site'],drop_first=True,prefix="site")],axis=1)

I then remove the original columns, which all contain strings. The model can't use those, thus the bother with OHE.

In [6]:
cleaned_df.drop(['brand', 'condition', 'category', 'site'],axis=1,inplace=True)
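The encode-then-drop steps above can also be sketched as a single `get_dummies` call on a toy frame. The rows below are invented for illustration; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy listings; the values are made up for illustration
toy = pd.DataFrame({
    "brand": ["lululemon", "nike", "lululemon"],
    "site": ["ebay", "poshmark", "ebay"],
    "price": [37.95, 15.0, 69.0],
})

# Passing columns= encodes each listed column and removes the string originals.
# drop_first=True drops the first (alphabetically sorted) level of each variable.
encoded = pd.get_dummies(toy, columns=["brand", "site"], drop_first=True)

print(encoded.columns.tolist())
```

Here `lululemon` and `ebay` become the all-zeros reference levels, so only `brand_nike` and `site_poshmark` remain alongside `price`.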

### Training/testing split

I first drop any rows with missing values. I then set the random state (equivalent to `set.seed` in the R universe) to 29, which means the split will be the same each time I run it.

In [7]:
cleaned_df = cleaned_df.dropna()

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    cleaned_df.drop('price', axis=1),
    cleaned_df['price'],
    test_size=0.2,
    random_state=29)

In [9]:
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)

Training Features Shape: (170432, 15)
Training Labels Shape: (170432,)
Testing Features Shape: (42608, 15)
Testing Labels Shape: (42608,)
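To illustrate what fixing the random state buys, here is a small sketch on synthetic arrays (not the notebook's data): two calls with the same `random_state` return identical partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the price column
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

split_a = train_test_split(X, y, test_size=0.2, random_state=29)
split_b = train_test_split(X, y, test_size=0.2, random_state=29)

# Compare X_train, X_test, y_train, y_test pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
print("Test rows:", len(split_a[1]))
```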


*For another time, perhaps- dealing with time series, or [cyclical features](http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).*

Moving on...

### Establish baseline

The baseline is the error I would get if I simply predicted the historical average sale price for all items. If I can reduce the error by using my model, that's a good sign the approach is valid.

Very confusingly, [this](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0) tutorial calls the independent variables *and* the df 'features'.

In [10]:
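# The baseline prediction is the historical average price, taken from the training set
baseline_pred = y_train.mean()

# Baseline errors against the withheld prices, and display average baseline error
baseline_errors = abs(baseline_pred - y_test)
print('Average baseline error: $', round(np.mean(baseline_errors), 2))
# Caution: the figure printed below came from a version of this cell that averaged
# abs(y_train), so it is the mean training price rather than an error against the mean.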

Average baseline error: $ 107.79
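The "predict the historical average for everything" baseline can also be sketched with scikit-learn's `DummyRegressor`, which avoids hand-rolling the arithmetic. The toy prices below are invented so the result is easy to check by hand; they are not from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Toy prices: the training mean is 20
y_train_toy = np.array([10.0, 20.0, 30.0])
y_test_toy = np.array([20.0, 40.0])

# DummyRegressor ignores the features, so placeholders are enough
X_train_toy = np.zeros((3, 1))
X_test_toy = np.zeros((2, 1))

baseline = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
baseline_mae = mean_absolute_error(y_test_toy, baseline.predict(X_test_toy))

# Errors are |20 - 20| = 0 and |40 - 20| = 20, so the MAE is 10
print(baseline_mae)
```

Because it follows the estimator API, the same object can be dropped into any code that later scores the random forest, which keeps the comparison like for like.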


### Model training

In [11]:
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 29)

# Train the model on training data
rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=1000,
                      n_jobs=None, oob_score=False, random_state=29, verbose=0,
                      warm_start=False)

### Predicting to the withheld data

In [12]:
#Predicting on the Test Set
predictions = rf.predict(X_test)

# Calculate the absolute errors
errors = abs(predictions - y_test)

# Print out the mean absolute error (mae)
print('Mean Absolute Error: $', round(np.mean(errors), 2))

#r2 = r2_score(y_test, predictions)
#print(r2)

Mean Absolute Error: $ 42.67


Hot damn, an improvement.

In [42]:
# True prices from the test set; reset the index for ease of merging
# (the prior index isn't useful here, so drop it)
df1 = pd.DataFrame(y_test).reset_index(drop=True)
df1.columns = ['ytrue']

df2 = pd.DataFrame(predictions)
df2.columns = ['ypred']

rf_train_test_df = pd.merge(df1, df2, right_index=True, left_index=True) # merge based on index
rf_train_test_df.head()

In [66]:
# Site indicator for each test row, with the index reset to match the frame above
df3 = pd.DataFrame(X_test.site_poshmark).reset_index(drop=True)

In [70]:
rf_train_test_df = pd.merge(rf_train_test_df, df3, right_index=True, left_index=True) # merge based on index
rf_train_test_df.to_csv('rf_train_test_df.csv')
rf_train_test_df.head()

## Pickle that
I messed up *a lot* here. If you're adapting this tutorial, pay attention to all the little details that come next.

In [17]:
# My model is called 'rf' (as you see above)- whatever you called your model, substitute that in for 'rf', below.
# Don't change anything else unless you really want to.
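with open('model.pkl', 'wb') as fid:
    pickle.dump(rf, fid, 2)  # the third argument is the pickle protocol version

# Load the model from disk
with open('model.pkl', 'rb') as fid:
    loaded_model = pickle.load(fid)
result = loaded_model.score(X_test, y_test)  # for a regressor, score() returns r^2
print(result)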

# The script below gives the same value- r^2:
#r2 = r2_score(y_test, predictions)
#print(r2)

0.6252812783640533
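A quick way to convince yourself the pickle round trip is lossless is to compare predictions before and after reloading. The sketch below does that on a small synthetic forest written to a temporary directory; the data and variable names are illustrative, not the notebook's.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem
rng = np.random.default_rng(29)
X_toy = rng.random((50, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0])

model = RandomForestRegressor(n_estimators=10, random_state=29).fit(X_toy, y_toy)

# Write and read back inside a temp dir so no files are left behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as fid:
        pickle.dump(model, fid)
    with open(path, "rb") as fid:
        restored = pickle.load(fid)

round_trip_ok = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(round_trip_ok)
```

One caveat worth keeping in mind: a pickled scikit-learn model is generally only safe to load with the same scikit-learn version that wrote it, so the deployment environment should pin that version.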


### Tuning: following [this](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) tutorial

I could probably get that r^2 value of 0.63 up just a bit.

In [14]:
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 1000,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 29,
 'verbose': 0,
 'warm_start': False}


The scikit-learn documentation tells us the most important settings are the number of trees in the forest (n_estimators) and the number of features considered for splitting at each node (max_features).

The main hyperparameters are listed below. Of these, I'm going to vary n_estimators, max_features and bootstrap:

- n_estimators = number of trees in the forest
- max_features = max number of features considered for splitting a node
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points placed in a node before the node is split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = method for sampling data points (with or without replacement)

#### Random Hyperparameter Grid

In [28]:
# Number of trees in the forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Method of selecting samples for training each tree (with or without replacement)
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'bootstrap': bootstrap}

print(random_grid)

{'bootstrap': [True, False],
 'max_features': ['auto', 'sqrt'],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
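To see how large this search space is, the grid can be expanded with scikit-learn's `ParameterGrid`. This is just a counting sketch; it rebuilds the same dictionary as above and does not fit any models.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_features": ["auto", "sqrt"],
    "bootstrap": [True, False],
}

# 10 tree counts x 2 max_features options x 2 bootstrap options
n_combinations = len(ParameterGrid(random_grid))
print(n_combinations)
```

With 40 combinations in total, a random search with `n_iter = 10` tries a quarter of the grid.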


In [None]:
'''
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()

# Random search of parameters, using 3 fold cross validation, 
# Sample 10 of the 40 possible combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, cv = 3, verbose=2, random_state=42, n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train, y_train)

# The best combination is stored on the fitted search object
rf_random.best_params_
'''
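Since the full search above is slow on 170,000 rows, here is a scaled-down sketch of the same pattern on synthetic data that runs in seconds. The dataset, the `toy_grid` values and the `search` name are assumptions for illustration. I use `1.0` in place of `'auto'` because, as far as I know, newer scikit-learn releases no longer accept `'auto'` for `max_features`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the listings
X_toy, y_toy = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=29)

# Deliberately tiny grid so the search finishes quickly
toy_grid = {
    "n_estimators": [10, 20, 30],
    "max_features": ["sqrt", 1.0],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=29),
    param_distributions=toy_grid,
    n_iter=4,          # sample 4 of the 12 combinations
    cv=3,              # 3-fold cross validation
    random_state=42,
)
search.fit(X_toy, y_toy)

# best_params_ lives on the fitted search object, not on its fit method
print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a forest refit on all the training data with the winning parameters, so it can be used for prediction directly.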

In [43]:
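#Predicting on the Test Set
# Note: rf_200_s_t is not defined in any cell shown here; it is presumably the forest refit with the tuned hyperparameters
predictions = rf_200_s_t.predict(X_test)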

# Calculate the absolute errors
errors = abs(predictions - y_test)

# Print out the mean absolute error (mae)
print('Mean Absolute Error: $', round(np.mean(errors), 2))

r2_cv = r2_score(y_test, predictions)
print(r2_cv)

Mean Absolute Error: $ 42.67
0.6252659565602762


Tuning appears to have done basically nothing here. Perhaps I'll give XGBoost a go?

Moving on..

We need to create our...

### Feature vector 

...of the exact same dimension as our training set. To convert our user input into dummy variables, we should save a dict that maps each dummy-variable column name to its position in the feature vector. Later we can populate our feature vector for prediction using this dict.

In [15]:
# Create a Dataframe with only the dummy variables. Mine was called 'cleaned_df' and my dependent variable is 'price'.

cat = cleaned_df.drop('price',axis=1)
index_dict = dict(zip(cat.columns,range(cat.shape[1])))

with open('cat', 'wb') as fid:
    pickle.dump(index_dict, fid, 2)  

In [16]:
cat.shape

(213040, 15)
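To make the purpose of that dict concrete, here is a minimal sketch of how a user's selections might be turned into a feature vector with it. The four-column `index_dict`, the `build_feature_vector` helper and the input values are all hypothetical; the real dict has 15 entries and the Flask app may do this differently.

```python
import numpy as np

# Hypothetical column-to-position mapping (the real one is loaded from the 'cat' pickle)
index_dict = {"brand_nike": 0, "condition_used": 1, "category_tops": 2, "site_poshmark": 3}


def build_feature_vector(user_input, index_dict):
    """Return a 1 x n_features array with a 1 for each selected dummy column."""
    vector = np.zeros((1, len(index_dict)))
    for field, value in user_input.items():
        column = f"{field}_{value}"
        # Reference levels dropped by drop_first=True have no column and stay all zeros
        if column in index_dict:
            vector[0, index_dict[column]] = 1
    return vector


vec = build_feature_vector(
    {"brand": "nike", "condition": "new", "category": "tops", "site": "ebay"},
    index_dict,
)
print(vec)
```

Note that this sketch silently treats an unrecognised value as the reference level; a real app would probably want to validate the input first.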

You may, at this point, want to open the troubleshooting_scratchpad.ipynb for instructions on how to use your pickled model to talk to the Flask app.

Last, but not least...
## Evaluation of model performance

Super, the model has run, but is it decent? To check, I'm calculating accuracy by subtracting the mean absolute percentage error (MAPE) from 100.

In [18]:
mape = 100 * (errors / y_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 10.42 %.
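A tiny worked example may help in reading that number. With the made-up prices below, both predictions miss by the same $10, yet the cheap item contributes a 100% error and the expensive one only 10%. This may be part of why the percentage-based accuracy looks so much worse than the dollar MAE, though I have not checked that on the real data.

```python
import numpy as np

# Made-up prices: one cheap item, one expensive item
y_true = np.array([10.0, 100.0])
y_pred = np.array([20.0, 110.0])

errors = np.abs(y_pred - y_true)   # $10 off in both cases
mape = 100 * errors / y_true       # 100% for the cheap item, 10% for the expensive one
accuracy = 100 - np.mean(mape)     # 100 - 55

print(accuracy)
```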


#### eBay and poshmark random forest models
I'm running these for ease of coding my application- the performance is fine when I load in two pickled models. Trying to code a binary site selection doesn't yield proper results.

In [None]:
# Keep only the eBay rows by filtering out every poshmark listing.
# This needs the original 'site' column, so run it before the one-hot encoding step.
cleaned_df_ebay = cleaned_df[cleaned_df.site != 'poshmark']
cleaned_df_ebay.to_csv('cleaned_df_ebay.csv')
# Same procedure to subset the poshmark dataframe...
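As an alternative sketch, both subsets can be produced in one pass with `groupby`, which avoids repeating the filter for each site. The toy frame and the `per_site` name are illustrative only.

```python
import pandas as pd

# Toy listings; the values are made up for illustration
toy = pd.DataFrame({
    "site": ["ebay", "poshmark", "ebay"],
    "price": [37.95, 15.0, 69.0],
})

# One frame per site, each with a fresh 0..n-1 index
per_site = {site: frame.reset_index(drop=True) for site, frame in toy.groupby("site")}

print({site: len(frame) for site, frame in per_site.items()})
```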

In [75]:
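# index_col=0 reads the saved index back as the index, so it does not
# reappear as an 'Unnamed: 0' column and end up as a model feature
cleaned_df_ebay = pd.read_csv('cleaned_df_ebay.csv', index_col=0)
cleaned_df_pm = pd.read_csv('cleaned_df_pm.csv', index_col=0)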

In [None]:
# One-hot encode the data using pandas get_dummies
cleaned_df_ebay = pd.concat([cleaned_df_ebay,pd.get_dummies(cleaned_df_ebay['brand'],drop_first=True,prefix="brand")],axis=1)
cleaned_df_ebay = pd.concat([cleaned_df_ebay,pd.get_dummies(cleaned_df_ebay['condition'],drop_first=True,prefix="condition")],axis=1)
cleaned_df_ebay = pd.concat([cleaned_df_ebay,pd.get_dummies(cleaned_df_ebay['category'],drop_first=True,prefix="category")],axis=1)
cleaned_df_ebay = pd.concat([cleaned_df_ebay,pd.get_dummies(cleaned_df_ebay['site'],drop_first=True,prefix="site")],axis=1)

cleaned_df_pm = pd.concat([cleaned_df_pm,pd.get_dummies(cleaned_df_pm['brand'],drop_first=True,prefix="brand")],axis=1)
cleaned_df_pm = pd.concat([cleaned_df_pm,pd.get_dummies(cleaned_df_pm['condition'],drop_first=True,prefix="condition")],axis=1)
cleaned_df_pm = pd.concat([cleaned_df_pm,pd.get_dummies(cleaned_df_pm['category'],drop_first=True,prefix="category")],axis=1)
cleaned_df_pm = pd.concat([cleaned_df_pm,pd.get_dummies(cleaned_df_pm['site'],drop_first=True,prefix="site")],axis=1)

cleaned_df_ebay.drop(['brand', 'condition', 'category', 'site'],axis=1,inplace=True)
cleaned_df_pm.drop(['brand', 'condition', 'category', 'site'],axis=1,inplace=True)

In [None]:
cleaned_df_ebay = cleaned_df_ebay.dropna()
cleaned_df_pm = cleaned_df_pm.dropna()

#### First, the eBay model...

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cleaned_df_ebay.drop('price', axis=1),
    cleaned_df_ebay['price'],
    test_size=0.2,
    random_state=29)

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 29)

# Train the model on training data
rf.fit(X_train,y_train)

# Pickle that!
with open('ebmodel.pkl', 'wb') as fid:
    pickle.dump(rf, fid,2)  
    
# A lovely feature vector
cateb = cleaned_df_ebay.drop('price',axis=1)
index_dict = dict(zip(cateb.columns,range(cateb.shape[1])))

with open('cateb', 'wb') as fid:
    pickle.dump(index_dict, fid, 2) 

#### And now, the poshmark model...

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cleaned_df_pm.drop('price', axis=1),
    cleaned_df_pm['price'],
    test_size=0.2,
    random_state=29)

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 29)

# Train the model on training data
rf.fit(X_train,y_train)

# Pickle that!
with open('pmmodel.pkl', 'wb') as fid:
    pickle.dump(rf, fid,2)  
    
# A lovely feature vector
catpm = cleaned_df_pm.drop('price',axis=1)
index_dict = dict(zip(catpm.columns,range(catpm.shape[1])))

with open('catpm', 'wb') as fid:
    pickle.dump(index_dict, fid, 2) 

# Can XGBoost beat the random forest?

I'm looking to see if XGBoost gives me superior model accuracy and a better r^2.

### Imports

In [76]:
import pandas as pd
import numpy as np
import xgboost as xgb
import csv as csv
from xgboost import plot_importance
from matplotlib import pyplot
from sklearn import *
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from scipy.stats import skew
from collections import OrderedDict
from sklearn.metrics import r2_score
import pickle
import requests

In [77]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', alpha=10, learning_rate = 0.2, max_depth = 15)

xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)


In [78]:
r2 = r2_score(y_test, preds)
print(r2) # close, but...

0.6252776760499661
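When comparing two model families like this, it helps to score them on the same split with the same metrics. The sketch below shows that pattern on synthetic data. To keep it runnable without extra installs, it swaps in scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, so the numbers say nothing about the real comparison.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data and a single shared split
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=29)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=29)

models = {
    "random forest": RandomForestRegressor(n_estimators=50, random_state=29),
    "gradient boosting (stand-in)": GradientBoostingRegressor(random_state=29),
}

# Fit each model on the same training rows and score on the same test rows
scores = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {"r2": r2_score(y_te, preds), "mae": mean_absolute_error(y_te, preds)}
    print(name, scores[name])
```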
