# Predicting Forest Fires


In this notebook I will explore different machine learning regression algorithms to see how well they can predict forest fire outcomes. In this notebook I will examine the different merits and drawbacks of using multiple linear regression, polynomial regression, SVM regression, and random forest regression.


All X features are scaled and y has been transformed.

# Model

I will perform a series of operations on this dataset to understand the different pros and cons of each type of regression algorithm

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [24]:
from sklearn.preprocessing import StandardScaler

In [25]:
# Import data from flat file

# Set path
path = 'DRV_ForestFires.csv'
data = pd.read_csv(path)
data.head()

X_train = pd.read_csv('DRV_ForestFires_XTrain.txt').values
X_test = pd.read_csv('DRV_ForestFires_XTest.txt').values
y_train = pd.read_csv('DRV_ForestFires_yTrain.txt').values
y_test = pd.read_csv('DRV_ForestFires_yTest.txt').values

In [32]:
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1,1)))
y_test = np.ravel(sc_y.fit_transform(y_test.reshape(-1,1)))

## Support Vector Regression

In [3]:
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
import statsmodels.formula.api as sm

In [60]:
regressor = SVR(kernel='rbf', gamma='auto', epsilon=.1, C=.1)
regressor.fit(X_train, y_train)

regressor.score(X_train, y_train)

regressor.score(X_test, y_test)


-0.2680358226094195

In [45]:
data.columns

Index(['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain', 'day_sat',
       'area'],
      dtype='object')

SVR model does not work well. Part of the problem may be that there are some obvious outliers in the outputs.  

Try again  with outliers removed.

To remove outliers:
    - Concat X_train with y_train  
    - Remove rows where y_train greater than specified z_score
    - Split back into X_train, y_train
    - Same process for test

In [78]:
train_data = np.concatenate((X_train, y_train.reshape(-1,1)), axis=1)
train_data[(train_data[:, -1] < 2.5)]

X_train = train_data[:, :-1]
y_train = train_data[:, -1]

In [81]:
test_data = np.concatenate((X_test, y_test.reshape(-1,1)), axis=1)
test_data[(test_data[:, -1] < 2.5)]

X_test = test_data[:, :-1]
y_test = test_data[:, -1]

## SVR Attempt II

In [84]:
regressor = SVR(kernel='rbf', gamma='auto')
regressor.fit(X_train, y_train)

regressor.score(X_test, y_test)

-0.29349417063794037

Not the result I was hoping for. Model still seems to perform worse than if it was just randomly guessing the outcomes.

## Attempt III

In [92]:
from sklearn.metrics import explained_variance_score, mean_absolute_error

In [91]:
explained_variance_score(y_train, regressor.predict(X_train))

0.17785183731369725

In [93]:
mean_absolute_error(y_train, regressor.predict(X_train))

0.6306978359712209