*Going for a [walk](https://xcitech.github.io/tutorials/heroku_tutorial/) in the (random) forest.*

# Random forests

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import *
import pickle
import requests

#### Load data
Recall that low_memory=False commands the module to read in the file before assigning data types. Not using it now, may later.

In [2]:
cleaned_df = pd.read_csv('cleaned_df.csv')

In [3]:
# Dropping Summary column- not useful at this point- and auto-generated indexing column ('Unnamed: 0')
cleaned_df = cleaned_df.drop(['Unnamed: 0'], axis=1)
cleaned_df = cleaned_df.drop(['Summary'], axis=1)

In [4]:
cleaned_df.head()

Unnamed: 0,Price,Brand,Condition,Category,Site
0,48.0,lululemon,New,bottoms,eBay
1,41.0,lululemon,New,bottoms,eBay
2,16.0,lululemon,New,tops,eBay
3,37.95,lululemon,New,bottoms,eBay
4,15.0,lululemon,New,bras,eBay


### One hot encoding
Here, I take categorical data and convert it to arbitrary numerical representation.

In [5]:
# One-hot encode the data using pandas get_dummies
cleaned_df = pd.concat([cleaned_df,pd.get_dummies(cleaned_df['Brand'],drop_first=True,prefix="Brand")],axis=1)
cleaned_df = pd.concat([cleaned_df,pd.get_dummies(cleaned_df['Condition'],drop_first=True,prefix="Condition")],axis=1)
cleaned_df = pd.concat([cleaned_df,pd.get_dummies(cleaned_df['Category'],drop_first=True,prefix="Category")],axis=1)
cleaned_df = pd.concat([cleaned_df,pd.get_dummies(cleaned_df['Site'],drop_first=True,prefix="Site")],axis=1)

I then remove the original columns, which all contain strings. The model can't use those, thus the bother with OHE.

In [6]:
cleaned_df.drop(['Brand', 'Condition', 'Category', 'Site'],axis=1,inplace=True)

### Training/testing split

I am setting the random state (equivalent to set.seed in the R universe) to 29- which means the results will be the same each time I run the split.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(cleaned_df.drop('Price', axis=1), 
                                                                            cleaned_df['Price'], test_size = 0.2, 
                                                                            random_state = 29)

In [8]:
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)

Training Features Shape: (48356, 7)
Training Labels Shape: (48356,)
Testing Features Shape: (12090, 7)
Testing Labels Shape: (12090,)


*For another time, perhaps- dealing with time series, or [cyclical features](http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).*

Moving on...

### Establish baseline

The baseline is the error I would get if I simply predicted the historical average sale price for all items. If I can reduce the error by using my model, that's a good sign the approach is valid.

I'll now separate the data into the feature/independent/predictor variables and target/label/dependent/predicted variable. 

I'll also convert the df to numpy arrays for easily digestion by the algorithm.

In [13]:
# Create an array representing the y (dependent variable) values, also known as labels
labels = np.array(cleaned_df['Price']) 

# Remove the labels from the x (independent variable) values, also known as features
# axis = 1 refers to the columns
features = cleaned_df.drop('Price', axis = 1)

# Saving feature names to a list for use in a moment
feature_list = list(cleaned_df.columns)

# Convert to numpy array
features = np.array(features)

baseline_preds = test_features[:, feature_list.index('Price')]

# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - 
                      test_labels)
print('Average baseline error: $', round(np.mean(baseline_errors), 2))

Average baseline error: $ 33.86


### Model training

In [14]:
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 29)

# Train the model on training data
rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=1000,
                      n_jobs=None, oob_score=False, random_state=29, verbose=0,
                      warm_start=False)

### Predicting to the withheld data

In [15]:
#Predicting on the Test Set
predictions = rf.predict(X_test)

# Calculate the absolute errors
errors = abs(predictions - y_test)

# Print out the mean absolute error (mae)
print('Mean Absolute Error: $', round(np.mean(errors), 2))

Mean Absolute Error: $ 12.58


Hot damn, an improvement.

## Pickle that
I messed up *a lot* here. If you're adapting this tutorial, pay attention to all the little details that come next.

In [17]:
# My model is called 'rf' (as you see above)- whatever you called your model, substitute that in for 'rf', below.
# Don't change anything else unless you really want to.
with open('model.pkl', 'wb') as fid:
    pickle.dump(rf, fid,2)  

# Load the model from disk
loaded_model = pickle.load(open('model.pkl', 'rb'))
result = loaded_model.score(X_test, y_test)
print(result)

0.51278380621452


### Tuning

That is pretty terrible performance! 

Moving on..

We need to create our feature vector of exact same dimension as our training set. To convert our user input into dummy variables, we should save a dict of the the dummy variables. Later we can populate our feature vector for prediction using this dict.

In [18]:
# Create a Dataframe with only the dummy variables. Mine was called 'cleaned_df' and my dependent variable is 'Price'.

cat = cleaned_df.drop('Price',axis=1)
index_dict = dict(zip(cat.columns,range(cat.shape[1])))

with open('cat', 'wb') as fid:
    pickle.dump(index_dict, fid, 2)  

In [19]:
cat.shape

(60446, 7)

## Evaluation of model performance

Super, model has run, but is it decent? To check, I'm calculating accuracy by substracting the mean average percentage error (MAPE) from 100.

In [20]:
mape = 100 * (errors / y_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 39.01 %.


This is kind of bad. I'll need to tune, I think..

### Model tuning
I'm saving [this](https://scikit-learn.org/stable/modules/grid_search.html) for later.

### Interpreting results: variable importance

In [21]:
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

Variable: Category_bras        Importance: 0.56
Variable: Category_outerwear   Importance: 0.2
Variable: Category_tops        Importance: 0.07
Variable: Price                Importance: 0.06
Variable: Brand_lululemon      Importance: 0.06
Variable: Condition_PreOwned   Importance: 0.05
Variable: Category_dresses     Importance: 0.0


[None, None, None, None, None, None, None]

Er, not much looking great here, but it just tells me I need more data, probably.

### Updated random forest

In [23]:
# New random forest with only the two most important variables
rf_most_important = RandomForestRegressor(n_estimators= 1000, random_state=22)

### EDIT THIS NEXT SECTION
# Extract the two most important features
important_indices = [feature_list.index('Category_bras'), feature_list.index('Category_outerwear')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]

# Train the random forest
rf_most_important.fit(train_important, train_labels)

# Make predictions and determine the error
predictions = rf_most_important.predict(test_important)

errors = abs(predictions - test_labels)

# Display the performance metrics
print('Mean Absolute Error: $', round(np.mean(errors), 2))

mape = np.mean(100 * (errors / test_labels))

accuracy = 100 - mape

print('Accuracy:', round(accuracy, 2), '%.')

Mean Absolute Error: $ 13.53
Accuracy: 33.66 %.


It's worse! Oh god, it's even worse. Oh dear. Well, I just need more data, this is not surprising, I expected it. Onwards!

You may, at this point, want to open the troubleshooting_scratchpad.ipynb for instructions on how to use your pickled model to talk to the Flask app.