## Introduction

This is one last attempt at building a model for this dataset. As I've already explained, a comprehensive model would need quite a bit of information that is still missing, but I'm going to try a non-parametric algorithm. Up to this point, I had only been using different iterations of a linear regression model, and the data has shown that it does not follow a linear trend.

#### Libraries

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
import pydot
from sklearn import tree
import graphviz

import warnings
warnings.filterwarnings('ignore')

#### Data

In [2]:
#MTBS Data
mtbs = pd.read_csv('MTBS_clean.csv')
mtbs.drop(columns='Unnamed: 0', inplace=True)

#Ecoregion Level 3 - Weather Data
weather = pd.read_csv('Ecoregion Level 3 Weather.csv', skiprows=[1])
weather.drop(columns='Unnamed: 0', inplace=True)

#### Processing

In [3]:
#Converting to date
weather['Date'] = pd.to_datetime(weather['Date'], format='%Y-%m-%d')
mtbs['Date'] = pd.to_datetime(mtbs['Date'], format='%Y-%m-%d')

In [4]:
#Joining Data
mtbs_weather_3 = pd.merge(mtbs, weather, how='left', on=['NA_L3NAME', 'Date'])
mtbs_weather_3.drop(columns=['wind_speed_ms_30', 'wind_speed_ms_90', 'wind_speed_ms_180', 'wind_speed_ms_365'],
                   inplace=True)
mtbs_weather_3.dropna(inplace=True)
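
The left join followed by `dropna` can silently discard fires that have no matching weather row. As a rough sketch of how that loss could be measured, the example below uses two small made-up tables (`fires` and `weather_toy` are hypothetical stand-ins, not the real data) and the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the fire and weather tables (all values are made up).
fires = pd.DataFrame({
    "NA_L3NAME": ["Idaho Batholith", "Idaho Batholith", "Arkansas Valley"],
    "Date": pd.to_datetime(["2010-07-01", "2010-08-15", "2011-03-02"]),
    "Acres": [1200, 5400, 800],
})
weather_toy = pd.DataFrame({
    "NA_L3NAME": ["Idaho Batholith", "Arkansas Valley"],
    "Date": pd.to_datetime(["2010-07-01", "2011-03-02"]),
    "max_temp_C_90": [24.1, None],  # second reading is missing
})

# indicator=True adds a `_merge` column recording where each row came from.
merged = pd.merge(fires, weather_toy, how="left",
                  on=["NA_L3NAME", "Date"], indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())

# dropna removes unmatched fires as well as matched fires with missing readings.
rows_before = len(merged)
rows_after = len(merged.drop(columns="_merge").dropna())
print(f"unmatched fires: {unmatched}, rows dropped: {rows_before - rows_after}")
```

Running the same two counts on the real tables would show how much of the MTBS data actually reaches the model.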

## Modeling

In [65]:
# Predictor columns plus the target column (Acres)
col_keep = ['Fire_Type', 'Low_T', 'Mod_T', 'High_T', 'NA_L3NAME', 'state', 'month_ig', 'max_temp_C_90',
           'precipitation_cm_90', 'Acres']

df_model_1 = mtbs_weather_3[col_keep]
# Keep wildfires only
df_model_1 = df_model_1[df_model_1['Fire_Type'] == 'Wildfire']
# Target: burn size in acres
target_1 = np.array(df_model_1['Acres'])
predictors_1 = df_model_1.drop(columns=['Acres'])
# One-hot encode the categorical predictors
predictors_1 = pd.get_dummies(predictors_1, drop_first = True)
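
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies(..., drop_first=True)` does to a frame shaped like the predictors. The `toy` frame and its values are made up for illustration. Note that a column with a single category, like `Fire_Type` after the wildfire filter, contributes no columns at all.

```python
import pandas as pd

# Made-up rows with the same kinds of columns as the predictors.
toy = pd.DataFrame({
    "Fire_Type": ["Wildfire", "Wildfire", "Wildfire"],  # constant after filtering
    "state": ["AZ", "OR", "TX"],
    "max_temp_C_90": [30.5, 22.0, 33.1],
})

# drop_first=True drops the first level of each categorical column,
# so `state` becomes state_OR and state_TX, with AZ as the implied baseline.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```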

In [75]:
X_train, X_test, y_train, y_test = train_test_split(predictors_1, target_1, test_size=0.2)

# Instantiate model with 500 decision trees
rf = RandomForestRegressor(n_estimators = 500, random_state = 42)
# Train the model on training data
rf.fit(X_train, y_train)

RandomForestRegressor(n_estimators=500, random_state=42)

In [76]:
rf.score(X_train, y_train)

0.8549118000962644

In [80]:
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 500,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
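
The parameter dump shows `oob_score` left at `False` and `max_depth` at `None`, so the trees grow until the leaves are pure. One way I could get a less optimistic score than the training r-squared, without touching the test set, is the out-of-bag estimate. The sketch below runs on synthetic data (the arrays `X` and `y` are made up), so only the pattern carries over, not the numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one informative feature plus noise (not the fire data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + rng.normal(scale=2.0, size=300)

# oob_score=True scores each sample using only the trees that did not see it.
rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X, y)

train_r2 = rf_oob.score(X, y)
oob_r2 = rf_oob.oob_score_
print(f"training R^2: {train_r2:.3f}, out-of-bag R^2: {oob_r2:.3f}")
```

A large gap between the two numbers is a sign that the training score flatters the model.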

In [77]:
predictions = rf.predict(X_test)
errors = abs(predictions - y_test)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'acres.')

Mean Absolute Error: 10866.49 acres.
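
A mean absolute error in acres is hard to judge without something to compare it to. A simple reference point would be a baseline that always predicts the training median. The sketch below shows the idea on synthetic, heavily skewed values (the arrays are made up and only stand in for burn sizes), using `DummyRegressor` and `mean_absolute_error` from scikit-learn.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic, right-skewed stand-ins for burn sizes (not the MTBS data).
rng = np.random.default_rng(0)
y_train_toy = rng.lognormal(mean=8, sigma=1.5, size=800)
y_test_toy = rng.lognormal(mean=8, sigma=1.5, size=200)

# The baseline ignores the features, so placeholder columns are enough.
X_train_toy = np.zeros((800, 1))
X_test_toy = np.zeros((200, 1))

# Always predict the median of the training target.
baseline = DummyRegressor(strategy="median").fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)
baseline_mae = mean_absolute_error(y_test_toy, baseline_pred)
print(f"Baseline MAE: {baseline_mae:.2f} acres")
```

If the forest's error on the real test set is not clearly below the baseline's, the predictors are adding little.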


In [70]:
importances = list(rf.feature_importances_)
# Pair each feature with its rounded importance and sort from most to least important
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train.columns, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# A plain loop prints the pairs without returning a list of None values
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: precipitation_cm_90  Importance: 0.19
Variable: max_temp_C_90        Importance: 0.17
Variable: Low_T                Importance: 0.15
Variable: Mod_T                Importance: 0.15
Variable: High_T               Importance: 0.12
Variable: month_ig             Importance: 0.04
Variable: state_OK             Importance: 0.02
Variable: state_OR             Importance: 0.02
Variable: NA_L3NAME_Arizona/New Mexico Mountains Importance: 0.01
Variable: NA_L3NAME_Idaho Batholith Importance: 0.01
Variable: NA_L3NAME_Northern Basin and Range Importance: 0.01
Variable: NA_L3NAME_Southwestern Tablelands Importance: 0.01
Variable: state_AZ             Importance: 0.01
Variable: state_ID             Importance: 0.01
Variable: state_NV             Importance: 0.01
Variable: state_TX             Importance: 0.01
Variable: NA_L3NAME_Arizona/New Mexico Plateau Importance: 0.0
Variable: NA_L3NAME_Arkansas Valley Importance: 0.0
Variable: NA_L3NAME_Aspen Parkland/Northern Glaciated Plains Import

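
Because `NA_L3NAME` and `state` are spread over many dummy columns, each one shows up with a tiny importance even if the original variable matters more as a whole. A rough way to look at this is to sum the importances back to their source columns. The sketch below uses synthetic data, and the `categorical` list and `source_column` helper are hypothetical names of my own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with one numeric and one categorical predictor.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "precipitation_cm_90": rng.gamma(2.0, 5.0, size=n),
    "state": rng.choice(["AZ", "ID", "OR", "TX"], size=n),
})
y = toy["precipitation_cm_90"] * 10 + (toy["state"] == "OR") * 50 + rng.normal(size=n)
X = pd.get_dummies(toy, drop_first=True, dtype=float)

rf_toy = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
importances = pd.Series(rf_toy.feature_importances_, index=X.columns)

categorical = ["state"]  # hypothetical list of one-hot-encoded source columns

def source_column(col):
    """Map a dummy column such as 'state_OR' back to 'state'."""
    for name in categorical:
        if col.startswith(name + "_"):
            return name
    return col

# Sum the per-dummy importances by source column and sort.
grouped = importances.groupby(source_column).sum().sort_values(ascending=False)
print(grouped.to_string())
```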

## Conclusion

Well, the Random Forest Regressor was a mixed bag. Its r-squared value was much better than any of the r-squared values returned by the linear regression models, but that score was computed on the training data, and the mean absolute error on the test data didn't look that good. Below are the performance metrics for my model:
1. R-squared (training set): `0.855`
2. Mean absolute error (test set): `10,866` acres

This looks like a classic example of an overfit model. My regressor fit the training data really well, but did a poor job predicting the burn size of the test data.
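
The most direct check for overfitting would be to put the training r-squared next to the r-squared on the held-out data, since the score above only covers the training set. The sketch below shows the comparison on synthetic data with a weak signal (all arrays and the `rf_demo` name are made up), so it illustrates the pattern rather than my actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: a weak linear signal buried in noise (not the fire data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] + rng.normal(scale=3.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Same metric on both splits; a large gap points to overfitting.
train_r2 = rf_demo.score(X_tr, y_tr)
test_r2 = rf_demo.score(X_te, y_te)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

On the real model, the equivalent call would be `rf.score(X_test, y_test)` alongside the existing `rf.score(X_train, y_train)`.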