# Project #4 - Random Forest
Data file:
* https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/temps_extended.csv

## Project #4 Requirements
* Part 1: 
  * Load and examine data
  * Clean and prepare data for model training
  * Display RandomForestRegressor model default hyperparameters
  * Train RandomForestRegressor model with default hyperparameters
  * Print model accuracy
* Part 2: The objective is to improve performance above that of the default hyperparameters
  * Prepare hyperparameter variables for random search
  * Set up and print random search hyperparameter variables grid
  * Set up and execute random search with k-fold cross-validation using the hyperparameter variables grid
  * Print best hyperparameters combination
  * Print model accuracy from the best hyperparameters combination

In [1]:
from datetime import datetime
print(f'Run time: {datetime.now().strftime("%D %T")}')

Run time: 12/01/23 10:13:44


### Import libraries

In [2]:
import numpy as np
import pandas as pd
from pprint import pprint
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

### Function to evaluate model accuracy
#### You will use this function twice in this notebook.
Please review the comments below carefully to find out when to invoke this function.

In [3]:
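def evaluate(model, model_string, test_features, test_labels):
    """Print and return model accuracy, defined here as 100 minus MAPE."""
    predictions = model.predict(test_features)
    # Absolute error of each prediction
    errors = abs(predictions - test_labels)
    # Mean absolute percentage error (assumes no label equals zero)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance: {}'.format(model_string))
    print('Accuracy = {:0.2f}%.'.format(accuracy))

    return accuracy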
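
To make the accuracy definition concrete, here is a minimal sketch that applies the same arithmetic to three made-up values. The helper name `mape_accuracy` and the toy numbers are assumptions for illustration only; they are not part of the assignment.

```python
import numpy as np

def mape_accuracy(predictions, actuals):
    """Return 100 minus the mean absolute percentage error (MAPE)."""
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    errors = np.abs(predictions - actuals)   # 5, 4, 0 for the toy data
    mape = 100 * np.mean(errors / actuals)   # mean of 10%, 10%, 0%
    return 100 - mape

# Hypothetical values, chosen so the arithmetic is easy to follow
toy_actuals = [50, 40, 80]
toy_predictions = [45, 44, 80]
toy_accuracy = mape_accuracy(toy_predictions, toy_actuals)
print(f'Accuracy = {toy_accuracy:0.2f}%.')   # Accuracy = 93.33%.
```

Because each error is divided by its own label, a miss of 4 degrees on a 40-degree day costs as much as a miss of 5 degrees on a 50-degree day.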

## Part 1

### Load data

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/temps_extended.csv")

### Examine data

In [5]:
print(df.head())

   year  month  day weekday  ws_1  prcp_1  snwd_1  temp_2  temp_1  average  \
0  2011      1    1     Sat  4.92    0.00       0      36      37     45.6   
1  2011      1    2     Sun  5.37    0.00       0      37      40     45.7   
2  2011      1    3     Mon  6.26    0.00       0      40      39     45.8   
3  2011      1    4    Tues  5.59    0.00       0      39      42     45.9   
4  2011      1    5     Wed  3.80    0.03       0      42      38     46.0   

   actual  friend  
0      40      40  
1      39      50  
2      42      42  
3      38      59  
4      45      39  


In [6]:
print(df.columns)

Index(['year', 'month', 'day', 'weekday', 'ws_1', 'prcp_1', 'snwd_1', 'temp_2',
       'temp_1', 'average', 'actual', 'friend'],
      dtype='object')


In [7]:
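# df.sample is a method; printing it without calling it shows the bound-method repr.
# Print the frame itself to see its first and last rows (df.sample(5) would draw 5 random rows).
print(df)

      year  month  day weekday   ws_1  prcp_1  snwd_1  temp_2  temp_1  \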
0     2011      1    1     Sat   4.92    0.00       0      36      37   
1     2011      1    2     Sun   5.37    0.00       0      37      40   
2     2011      1    3     Mon   6.26    0.00       0      40      39   
3     2011      1    4    Tues   5.59    0.00       0      39      42   
4     2011      1    5     Wed   3.80    0.03       0      42      38   
...    ...    ...  ...     ...    ...     ...     ...     ...     ...   
2186  2016     12   28     Wed  15.21    0.05       0      42      44   
2187  2016     12   29   Thurs   8.72    0.00       0      44      47   
2188  2016     12   30     Fri   8.50    0.05       0      47      48   
2189  2016     12   31     Sat   6.93    0.02       0      48      45   
2190  2017      1    1     Sun   8.05    0.03       0      45      38   

      average  actual  friend  
0        45.6      40      40  
1        45.7      39      

### Clean up data

In [8]:
# Drop unnecessary columns: year, month, day, weekday
df_cleaned = df.drop(columns=['year', 'month', 'day', 'weekday'])

In [9]:
# Display first few rows of updated dataframe
print(df_cleaned.head())

   ws_1  prcp_1  snwd_1  temp_2  temp_1  average  actual  friend
0  4.92    0.00       0      36      37     45.6      40      40
1  5.37    0.00       0      37      40     45.7      39      50
2  6.26    0.00       0      40      39     45.8      42      42
3  5.59    0.00       0      39      42     45.9      38      59
4  3.80    0.03       0      42      38     46.0      45      39


### Separate independent variables and dependent variable
* Independent variables: all remaining variables except 'actual'
* Dependent variable: 'actual'

In [10]:
X = df_cleaned.drop('actual', axis=1)
y = df_cleaned['actual']

### Split into training and test sets

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
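
As a rough check on what this split produces, the sketch below runs `train_test_split` on placeholder arrays of the same length. The row count of 2191 is an assumption inferred from the 0 to 2190 index in the printed data, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 2191  # assumed row count, inferred from the 0..2190 index in the output
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
y_demo = np.arange(n_rows)                         # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same rows
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te2))
```

The test set size is rounded up, so 20% of 2191 rows gives 439 test rows and leaves 1752 for training.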

### Instantiate the RandomForestRegressor model

In [12]:
# RandomForestRegressor was already imported in the "Import libraries" cell
rf_model = RandomForestRegressor(random_state=42)

### Print RandomForestRegressor default hyperparameters

In [13]:
pprint(rf_model.get_params())

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}


### Fit RandomForestRegressor model using the default hyperparameters

In [14]:
rf_model.fit(X_train, y_train)

### Print accuracy for RandomForestRegressor model using the default hyperparameters
#### NOTE: Use "evaluate" function defined at top of this notebook.
For example, assuming the following variable values:
* model = rf_model
* model_string = 'With default hyperparameters'
* test_features = X_test
* test_labels = y_test

rfr_base_accuracy = evaluate(rf_model, 'With default hyperparameters', X_test, y_test)

In [15]:
accuracy_default = evaluate(rf_model, "With Default Hyperparameters", X_test, y_test)

Model Performance: With Default Hyperparameters
Accuracy = 93.49%.


## Part 2

### NOTE: The objective of the hyperparameter search is to improve model performance above the default hyperparameters

### Prepare variables for hyperparameter search
* Using sklearn.ensemble.RandomForestRegressor documentation [https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html] choose at least 3 hyperparameters for random search
* For each hyperparameter selected, set up an array of values
  * For example: max_features = ['log2', 'sqrt']

In [16]:
# Candidate values for each hyperparameter in the random search
n_estimators = [100, 200, 300, 400, 500]   # number of trees in the forest
max_depth = [10, 20, 30, 40, 50, None]     # None places no limit on tree depth
max_features = ['sqrt', 'log2']            # features considered at each split

### Create the hyperparameter grid for the random search
Use the variables prepared above

In [17]:
random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
}

### Print the hyperparameter grid for the random search

In [18]:
print(random_grid)

{'n_estimators': [100, 200, 300, 400, 500], 'max_features': ['sqrt', 'log2'], 'max_depth': [10, 20, 30, 40, 50, None]}
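
The number of distinct combinations in this grid is the product of the list lengths, 5 × 2 × 6 = 60. One way to confirm that, sketched here with scikit-learn's `ParameterGrid` on a copy of the grid (the name `example_grid` is hypothetical):

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid printed above, repeated so the snippet runs on its own
example_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [10, 20, 30, 40, 50, None],
}

# ParameterGrid enumerates every combination; len() counts them
n_combinations = len(ParameterGrid(example_grid))
print(n_combinations)   # 60
```

This count is the upper bound on how many distinct candidates a random search over list-valued hyperparameters can try.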


### Set up random search with k-fold cross validation using the hyperparameter grid

In [19]:
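# Random search of hyperparameters using 5-fold cross-validation.
# The grid holds only 5 * 2 * 6 = 60 combinations, so n_iter=100 cannot be met:
# scikit-learn warns and evaluates all 60, which makes this search exhaustive.
rf = RandomForestRegressor(random_state=42)
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)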

### Fit the random search model
Be patient; this might take a minute or longer.

In [20]:
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 60 candidates, totalling 300 fits




### Print the best hyperparameters found by the random search

In [21]:
print(rf_random.best_params_)

{'n_estimators': 500, 'max_features': 'sqrt', 'max_depth': 10}
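
`best_params_` reports only the winning combination. To see how close the other candidates came, the fitted search also exposes `cv_results_`. The sketch below shows the idea on a small synthetic problem so that it runs quickly; the data from `make_regression`, the tiny grid, and all `demo_` names are assumptions for illustration, not the notebook's data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption; not temps_extended.csv)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42)

demo_grid = {'n_estimators': [10, 20], 'max_depth': [2, None]}
demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=demo_grid,
    n_iter=3, cv=3, random_state=42)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, ordered from best to worst
results = pd.DataFrame(demo_search.cv_results_)
columns = ['params', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```

With no `scoring` argument, the search ranks candidates by the regressor's default score, R², on the held-out folds. That is a different metric from the MAPE-based accuracy printed by `evaluate`, so the best cross-validated candidate is not guaranteed to have the highest accuracy on the test set.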


### Print best random search model accuracy
#### NOTE: Use "evaluate" function defined at top of this notebook.

In [22]:
best_random_model = rf_random.best_estimator_
best_random_model_accuracy = evaluate(best_random_model, 'Best Random Search Model', X_test, y_test)
print(f"Accuracy of the Best Random Search Model: {best_random_model_accuracy}%")

Model Performance: Best Random Search Model
Accuracy = 93.75%.
Accuracy of the Best Random Search Model: 93.75121651808202%
[CV] END ..max_depth=10, max_features=sqrt, n_estimators=100; total time=   0.5s
[CV] END ..max_depth=10, max_features=sqrt, n_estimators=300; total time=   1.4s
[CV] END ..max_depth=10, max_features=sqrt, n_estimators=400; total time=   2.2s
[CV] END ..max_depth=10, max_features=log2, n_estimators=100; total time=   0.4s
[CV] END ..max_depth=10, max_features=log2, n_estimators=200; total time=   0.9s
[CV] END ..max_depth=10, max_features=log2, n_estimators=300; total time=   1.1s
[CV] END ..max_depth=10, max_features=log2, n_estimators=500; total time=   1.9s
[CV] END ..max_depth=20, max_features=sqrt, n_estimators=200; total time=   0.9s
[CV] END ..max_depth=20, max_features=sqrt, n_estimators=400; total time=   1.8s
[CV] END ..max_depth=20, max_features=sqrt, n_estimators=500; total time=   2.1s
[CV] END ..max_depth=20, max_features=log2, n_estimators=300; tota