<a href="https://colab.research.google.com/github/mrdbourke/m1-machine-learning-test/blob/main/03_random_forest_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# California Housing Random Forest Benchmark

This notebook measures how quickly a given device can train a Random Forest model on the California Housing dataset (which uses features of an area to predict median house value), using `RandomizedSearchCV` with 5-fold cross-validation.

It's designed as a simple test for comparing Apple's M1 chips (standard, Pro, Max) with each other and with other sources of compute.

| Model | Dataset | Dataset Size |
| ----- | ----- | ----- |
| Random Forest (Scikit-Learn) + Random Search + Cross-validation | [California Housing](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) | ~20,000 samples, 8 features, 1 target variable |

## Resources
* Code on GitHub: https://github.com/mrdbourke/m1-machine-learning-test
* Code in this notebook adapted from: https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn.ipynb

## Setup Hyperparameters

The main setting we're concerned with is which device this test is running on.

Since it'll be many different machines, we'll note the current one here.

We'll also list the dataset name and other attributes about the data.

In [1]:
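BATCH_SIZE = None # not used in this benchmark
EPOCHS = None # not used in this benchmark
DATASET_NAME = "california_housing"
DEVICE = "Intel_Mac" # change this to match the machine running the test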

## Get helper functions and import dependencies

The code below downloads the helper functions if necessary (when running this notebook in Google Colab, it's easier to download a single file than to clone the whole repo).

In [2]:
# Get helper functions
import os
import requests

if not os.path.exists("helper_functions.py"):
  print("Downloading helper functions...")
  r = requests.get("https://raw.githubusercontent.com/mrdbourke/m1-machine-learning-test/main/helper_functions.py")
  r.raise_for_status() # stop here if the download failed
  print("Writing helper functions to file...")
  with open("helper_functions.py", "wb") as f: # context manager closes the file
    f.write(r.content)
else:
  print("Helper functions already downloaded, skipping redownload.")

Helper functions already downloaded, skipping redownload.


In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from timeit import default_timer as timer 
from helper_functions import print_train_time

# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing; # returned as a dictionary-like Bunch object

## View data

In [4]:
# Setup dataframe
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df["target"] = pd.Series(housing["target"])
housing_df.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


## Setup Random Search Grid

To make the benchmark run a little longer, we'll fit 5 random combinations of hyperparameters.
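
For a sense of scale, the search space defined in the next cell contains 3 × 4 × 2 × 2 × 2 = 96 possible combinations, so 5 candidates cover only a small fraction of it. The sketch below uses only the standard library to enumerate that grid and draw 5 distinct combinations. It illustrates the idea of random search and is not how `RandomizedSearchCV` is implemented internally; the seed and the variable names are assumptions made for this example.

```python
import itertools
import random

# Same search space as the notebook's `grid` dictionary
grid = {"n_estimators": [100, 200, 500],
        "max_depth": [None, 5, 10, 20],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4],
        "min_samples_leaf": [1, 2]}

# Enumerate every possible combination of hyperparameter values
names = list(grid)
all_combinations = [dict(zip(names, values))
                    for values in itertools.product(*grid.values())]
print(f"Total combinations in grid: {len(all_combinations)}")

# Draw 5 distinct combinations at random (the seed is an arbitrary choice)
rng = random.Random(42)
sampled = rng.sample(all_combinations, k=5)
for params in sampled:
    print(params)
```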

In [5]:
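# Hyperparameter grid RandomizedSearchCV will search over
# Note: max_features="auto" was deprecated in Scikit-Learn 1.1 and removed in 1.3;
# on newer versions replace it with 1.0 (all features, the equivalent for regressors)
grid = {"n_estimators": [100, 200, 500],
        "max_depth": [None, 5, 10, 20],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4],
        "min_samples_leaf": [1, 2]}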

## Model data

We'll model the data with Scikit-Learn's Random Forest, setting `n_jobs=-1` to use all available processors.

The model will be:
* [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) from Scikit-Learn
* We'll use [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) to search for different hyperparameters (this will ensure the modelling takes a little longer)
  * For each set of hyperparameters, we'll do 5-fold cross-validation (fitting the same model 5 times on different splits of the data, which again adds to the run time).
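
As a rough sketch of the work this adds up to: 5 sampled candidates times 5 folds gives 25 cross-validation fits, plus one final refit of the best candidate on the full training set (the `refit=True` default of `RandomizedSearchCV`). The standard-library snippet below counts the fits and shows how the 16,512 training samples used later in this notebook divide into 5 validation folds. The `fold_sizes` helper is hypothetical and only mirrors the near-equal splitting that k-fold cross-validation performs.

```python
def fold_sizes(n_samples: int, n_folds: int) -> list[int]:
    """Split n_samples into n_folds near-equal validation fold sizes."""
    base, remainder = divmod(n_samples, n_folds)
    # The first `remainder` folds receive one extra sample
    return [base + 1 if i < remainder else base for i in range(n_folds)]

n_candidates = 5  # n_iter in the random search
n_folds = 5       # cv in the random search
n_train = 16512   # training samples after the 80/20 split

cv_fits = n_candidates * n_folds
total_fits = cv_fits + 1  # plus one refit of the best candidate on all training data

print(f"Cross-validation fits: {cv_fits}")
print(f"Total fits including the final refit: {total_fits}")

for fold, n_val in enumerate(fold_sizes(n_train, n_folds), start=1):
    print(f"Fold {fold}: train on {n_train - n_val}, validate on {n_val}")
```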

In [6]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor
# Import data splitting and random search CV function
from sklearn.model_selection import RandomizedSearchCV, train_test_split

In [7]:
# Start time
start_time = timer()

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model (it gets fitted on the training set by the random search below)
model = RandomForestRegressor(n_jobs=-1) # set to use all processors

# Setup RandomizedSearchCV
rs_model = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=5, # try 5 models total
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print out results

# Fit the random search model
rs_model.fit(X_train, y_train)

# End timer
end_time = timer()
train_time = print_train_time(start_time, 
                              end_time, 
                              device=DEVICE)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=200; total time=   2.2s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=200; total time=   1.2s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=200; total time=   1.2s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=200; total time=   1.3s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=200; total time=   1.2s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=200; total time=   0.5s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=200; total time=   0.5s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=200;
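
The timing pattern used above (read `timer()` before and after the work, then report the difference) can be sketched in isolation as follows. `report_elapsed` is a hypothetical stand-in written for this example; the notebook's actual `print_train_time` lives in `helper_functions.py` and may format or return its result differently.

```python
from timeit import default_timer as timer

def report_elapsed(start: float, end: float, device: str) -> float:
    """Print and return the elapsed seconds between two timer readings."""
    elapsed = end - start
    print(f"Elapsed time on {device}: {elapsed:.3f} seconds")
    return elapsed

start_time = timer()
total = sum(i * i for i in range(100_000))  # placeholder workload
end_time = timer()

elapsed = report_elapsed(start_time, end_time, device="example_device")
```

Because the start reading is taken before the data is split, the reported time in this notebook covers the split as well as the search itself.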

In [8]:
# Check the score of the model (on the test set)
# The default score metric of regression algorithms is R^2
rs_model.score(X_test, y_test)

0.8128037523299034
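
For reference, R^2 (the coefficient of determination) compares the model's squared error with that of a baseline that always predicts the mean of the targets: `R^2 = 1 - SS_res / SS_tot`. The sketch below computes it with the standard library on made-up numbers. The function name `r_squared` and the sample values are assumptions for illustration; in practice `sklearn.metrics.r2_score` does the same job.

```python
def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]  # made-up targets

print(r_squared(y_true, y_true))                # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r_squared(y_true, [1.5, 2.0, 2.5, 4.5]))  # imperfect predictions
```

Read this way, a test score of about 0.81 means the model's squared error is roughly 19% of what the predict-the-mean baseline would produce on the same test targets.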

In [9]:
# Find the best hyperparameters found by RandomizedSearchCV
rs_model.best_params_

{'n_estimators': 200,
 'min_samples_split': 4,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': None}

## Track results and save to file

In [10]:
results = {
    "device": DEVICE,
    "dataset_name": DATASET_NAME,
    "epochs": EPOCHS,
    "batch_size": BATCH_SIZE,
    "num_train_samples": len(X_train),
    "num_test_samples": len(X_test),
    "total_train_time": round(train_time, 3),
    "time_per_epoch": None,
    "model": "RandomForestCV"
    }
results_df = pd.DataFrame(results, index=[0])
results_df

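|   | device | dataset_name | epochs | batch_size | num_train_samples | num_test_samples | total_train_time | time_per_epoch | model |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | Intel_Mac | california_housing | None | None | 16512 | 4128 | 20.424 | None | RandomForestCV |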


In [11]:
# Create the results directory if needed, then write the CSV to file
os.makedirs("results", exist_ok=True)

results_df.to_csv(f"results/{DATASET_NAME}_{DEVICE}.csv")
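
Once several machines have each written a CSV like this, the files can be compared side by side. The following is a minimal sketch, assuming each file follows the `{DATASET_NAME}_{DEVICE}.csv` naming used above. To stay self-contained it writes two made-up rows to a temporary directory; the device names and times are placeholders, not real measurements.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder rows standing in for CSVs produced on different machines
fake_results = [
    {"device": "device_a", "dataset_name": "california_housing", "total_train_time": 30.0},
    {"device": "device_b", "dataset_name": "california_housing", "total_train_time": 15.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    results_dir = Path(tmp_dir)

    # Write one CSV per device, mirroring the {DATASET_NAME}_{DEVICE}.csv naming
    for row in fake_results:
        path = results_dir / f"{row['dataset_name']}_{row['device']}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            writer.writeheader()
            writer.writerow(row)

    # Read every CSV back and collect the rows
    combined = []
    for path in sorted(results_dir.glob("*.csv")):
        with open(path, newline="") as f:
            combined.extend(csv.DictReader(f))

# Sort fastest first and print a simple comparison
combined.sort(key=lambda r: float(r["total_train_time"]))
for r in combined:
    print(f"{r['device']}: {r['total_train_time']} s")
```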