# Predicting with a Random Forest Regressor

This notebook predicts meter readings with a Random Forest Regressor.

Before running it, change the paths in the second code cell to the folder where you store the CSV files.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_val_score

import pandas as pd
import numpy as np

### Import data and train the model

First, we load all the necessary CSV files.

In [None]:
train_preprocessed = pd.read_csv("../CSV/train_preprocessed_with_Zeros.csv")
train_target = pd.read_csv("../CSV/train_target_with_Zeros.csv")

test_preprocessed = pd.read_csv("../CSV/test_preprocessed.csv")
test_row = pd.read_csv("../CSV/test_row.csv")

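Next, we split train_preprocessed and train_target into training and test sets.  
Then, we train a RandomForestRegressor and display its score on both sets. The score method of the regressor returns the coefficient of determination (R²), so values closer to 1 are better.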

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_preprocessed, train_target, random_state=0)

RFregr = RandomForestRegressor(random_state=0, n_estimators=5)
RFregr.fit(X_train, y_train)

print("Score on training set: {:.4f}".format(RFregr.score(X_train, y_train)))
print("Score on test set: {:.4f}".format(RFregr.score(X_test, y_test)))
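
To make the reported score easier to interpret, here is a minimal sketch on synthetic data. The arrays X and y are made-up stand-ins, not the notebook's CSV files. It shows the default 75/25 split of train_test_split and that score matches r2_score computed from the predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the notebook's CSV files).
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 4))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(scale=0.1, size=500)

# With no test_size given, 25% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=5, random_state=0)
model.fit(X_train, y_train)

# score() returns R^2, the same value r2_score gives for the predictions.
test_r2 = model.score(X_test, y_test)
manual_r2 = r2_score(y_test, model.predict(X_test))

print("Held-out rows: {}".format(X_test.shape[0]))
print("R^2 via score():   {:.4f}".format(test_r2))
print("R^2 via r2_score: {:.4f}".format(manual_r2))
```

A large gap between the training and test scores would suggest that the forest is overfitting the training rows.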

Next, we try to tune the hyperparameters with a randomized search. Unfortunately, even with the smallest possible setting of a single iteration (n_iter = 1) and 2-fold cross-validation, our laptop's CPU and memory cannot handle it.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_preprocessed, train_target, random_state=0)

#Wide range of parameters to evaluate
param_grid = {'n_estimators':  [7,10,13],
              'max_depth': [ 4, 12, None],
              'max_features': [1.0, 'sqrt'],  # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
              'min_samples_split': [ 2, 5, 7],
              'min_samples_leaf': [1, 2, 3],
              'bootstrap': [True, False]   
               }

rf = RandomForestRegressor()
random_grid_search = RandomizedSearchCV(estimator = rf, param_distributions = param_grid,
                               n_iter = 1, cv = 2, verbose=2, random_state=42, n_jobs = -1)

# Fit the search on the training data
random_grid_search.fit(X_train, y_train)

print("Best parameters: {}".format(random_grid_search.best_params_))
print("Best cross-validation score: {:.3f}".format(random_grid_search.best_score_))
print("Best estimator:\n{}".format(random_grid_search.best_estimator_))
print("Test set score: {:.3f}".format(random_grid_search.score(X_test, y_test)))
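
One possible way around the resource limit, which this notebook does not do, is to run the search on a random subset of the rows and with a single worker. The sketch below uses synthetic data, and sample_size is a hypothetical value chosen only for illustration. Parameters found on a subset may not be the best ones for the full dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the notebook's CSV files).
rng = np.random.RandomState(0)
X = rng.uniform(size=(2000, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=2000)

# Draw a random subset of rows without replacement to shrink every fit.
sample_size = 300  # hypothetical value; pick what the machine can handle
idx = rng.choice(len(X), size=sample_size, replace=False)
X_small, y_small = X[idx], y[idx]

param_distributions = {
    "n_estimators": [7, 10, 13],
    "max_depth": [4, 12, None],
    "min_samples_leaf": [1, 2, 3],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=2,         # number of sampled parameter combinations
    cv=2,             # 2-fold cross-validation
    random_state=42,
    n_jobs=1,         # run one fit at a time instead of one per CPU core
)
search.fit(X_small, y_small)

print("Best parameters: {}".format(search.best_params_))
print("Best cross-validation score: {:.3f}".format(search.best_score_))
```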

A regular grid search with 5-fold cross-validation (GridSearchCV) exceeds the laptop's resources in the same way.

In [None]:
param_grid = {'n_estimators':  [9,10], 
              'max_depth': [12, None],
              'max_features': [1.0, 'sqrt'],  # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
              'min_samples_split': [ 4, 5, 6 ],
              'min_samples_leaf': [1, 2, 3],
              'bootstrap': [True, False]                
               }

grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)

X_trainval, X_test, y_trainval, y_test = train_test_split(train_preprocessed, train_target, random_state=0)
grid_search.fit(X_trainval, y_trainval)

print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.3f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))
print("Test set score: {:.3f}".format(grid_search.score(X_test, y_test)))
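
To see why the grid search is so expensive, it helps to count the model fits it requests. The sketch below uses ParameterGrid from scikit-learn to enumerate a grid of the same size as the one above. The count depends only on how many values each parameter has, not on the values themselves.

```python
from sklearn.model_selection import ParameterGrid

# Same number of values per parameter as the grid search above.
param_grid = {
    "n_estimators": [9, 10],
    "max_depth": [12, None],
    "max_features": [1.0, "sqrt"],
    "min_samples_split": [4, 5, 6],
    "min_samples_leaf": [1, 2, 3],
    "bootstrap": [True, False],
}

# Every combination of values is one candidate: 2 * 2 * 2 * 3 * 3 * 2.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold.
cv_folds = 5
n_fits = n_candidates * cv_folds

print("Candidates: {}".format(n_candidates))  # 144
print("Cross-validation fits: {}".format(n_fits))  # 720
```

With the default refit=True, GridSearchCV adds one more fit of the best candidate on the whole training set, so this grid trains 721 forests in total.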

### Predict and save prediction to CSV file

Finally, we make predictions for the test_preprocessed dataset with the previously trained RandomForestRegressor (RFregr).  
The final predictions are saved as a CSV file.

In [None]:
PredictionRF = RFregr.predict(test_preprocessed)
PredictionRF = pd.DataFrame(PredictionRF, columns = ["meter_reading"])

PredictionRFCombined = pd.concat([test_row,PredictionRF],axis=1)

PredictionRFCombined.to_csv('../PredictionRandomForestRegressor.csv', index = False)
PredictionRFCombined.head()
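
The combination step relies on pd.concat with axis=1, which pairs rows by index label rather than by position. The sketch below illustrates this on toy data. The row_id column and the prediction values are hypothetical, since the real contents of test_row are not shown here. Resetting the index is a cautious extra step that guards against mismatched labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_row and the model output.
test_row = pd.DataFrame({"row_id": [0, 1, 2]})
predictions = np.array([10.5, 0.0, 3.2])

# Wrap the raw predictions in a DataFrame with a named column.
pred_frame = pd.DataFrame(predictions, columns=["meter_reading"])

# concat(axis=1) aligns on the index, so reset it to a plain 0..n-1 range
# to be sure each prediction lands next to its own row.
combined = pd.concat([test_row.reset_index(drop=True), pred_frame], axis=1)

print(combined)
```

If the two frames had different index labels, concat would fill the unmatched cells with NaN, so checking that the combined frame has no missing values before saving is a cheap safeguard.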