*Going for a [walk](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0) in the (random) forest.*

# Random forests

### Imports

In [26]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import *

#### Load data
Recall that low_memory=False commands the module to read in the file before assigning data types. Not using it now, may later.

In [35]:
cleaned_df = pd.read_csv('cleaned_df.csv')

# Dropping Summary column
cleaned_df = cleaned_df_new.drop(['Summary'], axis=1)
cleaned_df.head()

Unnamed: 0,Price,Brand,Condition,Category,Site
0,48.0,lululemon,New,bottoms,eBay
1,41.0,lululemon,New,bottoms,eBay
2,16.0,lululemon,New,tops,eBay
3,37.95,lululemon,New,bottoms,eBay
4,15.0,lululemon,New,bras,eBay


### One hot encoding
Here, we take categorical data and convert it to arbitrary numerical representation.

In [38]:
# One-hot encode the data using pandas get_dummies
cleaned_df = pd.get_dummies(cleaned_df) 
cleaned_df.head()

Unnamed: 0,Price,Brand_Reformation,Brand_lululemon,Condition_New,Condition_PreOwned,Category_bottoms,Category_bras,Category_dresses,Category_outerwear,Category_tops,Site_Poshmark,Site_eBay
0,48.0,0,1,1,0,1,0,0,0,0,0,1
1,41.0,0,1,1,0,1,0,0,0,0,0,1
2,16.0,0,1,1,0,0,0,0,0,1,0,1
3,37.95,0,1,1,0,1,0,0,0,0,0,1
4,15.0,0,1,1,0,0,1,0,0,0,0,1


The shape of the data is now different, and all of the column are numbers.

### Features, targets, array conversion

I'll now separate the data into the feature/independent/predictor variables and target/label/dependent/predicted variable. 

I'll also convert the df to numpy arrays for easily digestion by the algorithm.

I'm also saving the column headers, which are the names of the features, to a list to use for later visualization.

In [39]:
labels = np.array(cleaned_df['Price'])

# Remove the labels from the features
# axis = 1 refers to the columns
features = cleaned_df.drop('Price', axis = 1)

# Saving feature names for later use
feature_list = list(cleaned_df.columns)

# Convert to numpy array
features = np.array(features)

### Training/testing split

I am setting the random state (equivalent to set.seed) to 22- which means the results will be the same each time I run the split.

In [40]:
train_features, test_features, train_labels, test_labels = train_test_split(features, 
                                                                            labels, test_size = 0.2, 
                                                                            random_state = 22)

In [41]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (48356, 11)
Training Labels Shape: (48356,)
Testing Features Shape: (12090, 11)
Testing Labels Shape: (12090,)


*For another time, perhaps- dealing with time series, or [cyclical features](http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).*

Moving on...

### Establish baseline

The baseline is the error I would get if I simply predicted the historical average sale price for all items.

In [43]:
baseline_preds = test_features[:, feature_list.index('Price')]

# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - 
                      test_labels)
print('Average baseline error: ', round(np.mean(baseline_errors), 2))

Average baseline error:  34.76


### Model training

In [44]:
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 22)

# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=1000,
                      n_jobs=None, oob_score=False, random_state=22, verbose=0,
                      warm_start=False)

### Predicting to the withheld data

In [45]:
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'dollars.')

Mean Absolute Error: 12.62 dollars.


## Evaluation of model performance

Super, model has run, but is it decent? To check, I'm calculating accuracy by substracting the mean average percentage error (MAPE) from 100.

In [46]:
mape = 100 * (errors / test_labels)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 40.18 %.


This is kind of bad. I'll need to tune, I think..

### Model tuning
I'm saving [this](https://scikit-learn.org/stable/modules/grid_search.html) for later.

### Interpreting results: variable importance

In [47]:
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

Variable: Category_bras        Importance: 0.56
Variable: Condition_PreOwned   Importance: 0.22
Variable: Brand_Reformation    Importance: 0.05
Variable: Condition_New        Importance: 0.04
Variable: Category_dresses     Importance: 0.04
Variable: Category_tops        Importance: 0.04
Variable: Site_Poshmark        Importance: 0.03
Variable: Price                Importance: 0.01
Variable: Brand_lululemon      Importance: 0.01
Variable: Category_bottoms     Importance: 0.0
Variable: Category_outerwear   Importance: 0.0


[None, None, None, None, None, None, None, None, None, None, None]

Er, not much looking great here, but it just tells me I need more data, probably.

### Updated random forest

In [49]:
# New random forest with only the three most important variables
rf_most_important = RandomForestRegressor(n_estimators= 1000, random_state=22)

### EDIT THIS NEXT SECTION
# Extract the two most important features
important_indices = [feature_list.index('Category_bras'), feature_list.index('Condition_PreOwned'), feature_list.index('Brand_Reformation')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]

# Train the random forest
rf_most_important.fit(train_important, train_labels)

# Make predictions and determine the error
predictions = rf_most_important.predict(test_important)

errors = abs(predictions - test_labels)

# Display the performance metrics
print('Mean Absolute Error:', round(np.mean(errors), 2), 'dollars.')

mape = np.mean(100 * (errors / test_labels))

accuracy = 100 - mape

print('Accuracy:', round(accuracy, 2), '%.')

Mean Absolute Error: 13.86 dollars.
Accuracy: 34.68 %.


It's worse! Oh god, it's even worse. Oh dear. Well, I just need more data, this is not surprising, I expected it. Onwards!