# Background

Perhaps you've played [Twenty Questions](https://en.wikipedia.org/wiki/Twenty_Questions) before:  it's a game where one player (the _answerer_) thinks of a person, place, or thing, and other players ask yes-or-no questions to guess the object of the answerer's thoughts.  Since the answerer probably knows about a lot of different people and objects, a good strategy for the other players involves devising questions that reduce the space of possible answers as much as possible no matter how they are answered.  

Given a labeled collection of examples, you could use a technique that [learns a _decision tree_](https://en.wikipedia.org/wiki/Decision_tree_learning) of questions, classifying the examples by asking as few questions as possible.  However, such a technique necessarily depends heavily on the exact examples available.  (In other words, these techniques are prone to _overfitting_.)  As a simple illustration, consider the case where your set of example objects is `{'ant', 'elephant'}`.  In this case, the question "is it smaller than a typical adult human?" would differentiate between the examples optimally.  However, that question would be useless if your set of example objects were the set of all domesticated dog breeds.
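
To make "a good question" concrete, here is a minimal sketch that scores a yes-or-no question by how much it lowers the Gini impurity of the labels, which is one common criterion used in decision tree learning. The body-mass values, the `examples` list, and the helper names `gini` and `impurity_reduction` are hypothetical and exist only for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance that two labels drawn at random (with replacement) differ."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(examples, question):
    """How much a yes/no question lowers the weighted Gini impurity of the labels."""
    yes = [label for value, label in examples if question(value)]
    no = [label for value, label in examples if not question(value)]
    n = len(examples)
    before = gini([label for _, label in examples])
    after = (len(yes) / n) * gini(yes) + (len(no) / n) * gini(no)
    return before - after

# Hypothetical toy data: (approximate body mass in kg, label)
examples = [
    (0.000003, "ant"),
    (0.000005, "ant"),
    (4000.0, "elephant"),
    (6000.0, "elephant"),
]

# Two candidate questions, expressed as predicates on body mass
smaller_than_human = lambda kg: kg < 70
heavier_than_5000kg = lambda kg: kg > 5000

print(impurity_reduction(examples, smaller_than_human))
print(impurity_reduction(examples, heavier_than_5000kg))
```

The first question separates the two labels completely, so it removes all of the impurity; the second leaves one elephant mixed in with the ants and scores lower. A tree learner repeats this kind of comparison at every node, which is why the questions it picks depend so strongly on the examples it sees.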

[Random decision forest models](https://en.wikipedia.org/wiki/Random_forest) work by training an _ensemble_ of imprecise decision trees that only consider subsets of features or examples and then aggregating the results from the ensemble.  By learning and aggregating an ensemble of trees, random decision forests can be more accurate than individual decision trees _and_ are less likely to overfit.  In this notebook, we'll use a random decision forest to classify documents as either "spam" (based on food reviews) or "legitimate" (based on Jane Austen).
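
As a rough intuition for why aggregation helps, the sketch below computes how often a simple majority vote is right when every voter is right with the same probability. It assumes the voters err independently, which real trees in a forest do not (they are trained on overlapping data), so treat the numbers as an optimistic upper-level illustration rather than a description of how any particular library combines its trees. The function name `majority_vote_accuracy` and the 0.7 per-voter accuracy are hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Assumed per-voter accuracy of 0.7; odd ensemble sizes avoid tied votes
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under the independence assumption, accuracy climbs as the ensemble grows. Training each tree on a different subset of features or examples is what pushes the trees toward making different mistakes, which is the condition this calculation relies on.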

We'll begin by loading in the feature vectors which we generated in either [the simple summaries feature extraction notebook](03-feature-engineering-summaries.ipynb) or [the TF-IDF feature extraction notebook](03-feature-engineering-tfidf.ipynb). 

In [None]:
import pandas as pd

features = pd.read_parquet("data/features.parquet")

In [None]:
features.sample(5)

In [None]:
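from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection

# Hold out a test set; train_test_split defaults to a 75/25 train/test split
train, test = model_selection.train_test_split(features)

# An ensemble of 25 trees, seeded so the forest itself is reproducible
rfc = RandomForestClassifier(n_estimators=25, random_state=404)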

In [None]:
import sys
sys.getsizeof(train)

In [None]:
MAX_FEATURES=512
# Train on the feature columns; this assumes the first two columns are not features
rfc.fit(X=train.iloc[:, 2:], y=train["label"])

In [None]:
from mlworkflows import plot

# Predict on the same feature columns of the held-out test set
predictions = rfc.predict(test.iloc[:, 2:])
df, chart = plot.binary_confusion_matrix(test["label"], predictions)

In [None]:
chart

In [None]:
df

One interesting aspect of random decision forests is that they provide a metric for how important each feature was to the ultimate conclusion.  This is a useful property both for having _explainable models_ (i.e., so you can explain to a human why the model made a particular prediction) and for guiding further experiments (i.e., so you can learn more about the real world based on what the model has identified as likely to be correlated with what you're trying to predict).

In [None]:
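importances = list(enumerate(rfc.feature_importances_))

In [None]:
# Sort (column position, importance) pairs from most to least important
importances.sort(key=lambda pair: pair[1], reverse=True)
importances[:20]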
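
The pairs above identify features only by position, which is hard to read when explaining the model to someone. A minimal sketch of pairing importances with names instead is shown here; the helper `top_features`, the feature names, and the importance values are all hypothetical stand-ins, not output from this notebook's model.

```python
def top_features(names, importances, k=3):
    """Return the k (name, importance) pairs with the highest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical feature names and importances, for illustration only
names = ["word_count", "avg_word_length", "punctuation_ratio", "uppercase_ratio"]
importances = [0.10, 0.45, 0.30, 0.15]

for name, score in top_features(names, importances, k=2):
    print(f"{name}: {score:.2f}")

# In this notebook, assuming the features start at column 2 as in the fit call,
# the equivalent would be:
#   top_features(train.columns[2:], rfc.feature_importances_, k=20)
```

Whether the resulting names are meaningful depends on how the feature vectors were generated; columns from a hashed or otherwise transformed representation may not map back to anything human-readable.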

In [None]:
import pickle

filename = 'model.sav'
# Use a context manager so the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(rfc, f)