# Predicting Forest Fires


In this notebook I will explore several machine learning regression algorithms to see how well they can predict forest fire outcomes, here the burned area. I will examine the merits and drawbacks of multiple linear regression, polynomial regression, SVM regression, and random forest regression.


# Preprocessing

The dataset is supplied by:

    P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. 
    In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, 
    Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December,
    Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. 
    Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf

In [2]:
import pandas as pd
import numpy as np

In [36]:
# Set path
path = 'forestfires.csv'

data = pd.read_csv(path)
data.head()

,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


In [37]:
month_to_qtr = {'jan': 1, 'feb': 1, 'mar': 1, 
                'apr': 2, 'may': 2, 'jun': 2, 
                'jul': 3, 'aug': 3, 'sep': 3, 
                'oct': 4, 'nov': 4, 'dec': 4}

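# Bin the twelve months into calendar quarters, then store the quarter
# as a string so it is treated as a categorical feature
data['month'] = data['month'].map(month_to_qtr)
data = data.rename(columns={'month': 'qtr'}).astype(dtype={'qtr': 'str'})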

The original dataset has the following attributes (the cell above bins month into quarters and renames it qtr):
 1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
 2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
 3. month - month of the year: "jan" to "dec" 
 4. day - day of the week: "mon" to "sun"
 5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
 6. DMC - DMC index from the FWI system: 1.1 to 291.3 
 7. DC - DC index from the FWI system: 7.9 to 860.6 
 8. ISI - ISI index from the FWI system: 0.0 to 56.10
 9. temp - temperature in Celsius degrees: 2.2 to 33.30
 10. RH - relative humidity in %: 15.0 to 100
 11. wind - wind speed in km/h: 0.40 to 9.40 
 12. rain - outside rain in mm/m2: 0.0 to 6.4 
 13. area - the burned area of the forest (in ha): 0.00 to 1090.84 
 
The data has two categorical attributes: qtr (the month, binned into quarters above) and day. For now I will one-hot encode them. Whether they are statistically relevant to the regressor will be determined later.

In [39]:
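# First separate the independent variables from the dependent variable,
# then one-hot encode X. drop_first=True drops one level per categorical
# column; dtype=float keeps the array numeric on pandas versions whose
# dummies default to bool.
X = pd.get_dummies(data.iloc[:, :-1], drop_first=True, dtype=float).values
y = data.iloc[:, -1].values

# Verified shapes:
# X.shape = (517, 19)
# y.shape = (517,)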
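To make the column layout after encoding concrete, here is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not rows of the real dataset). It shows that `pd.get_dummies` keeps the numeric columns first and appends the dummy columns at the end, which is why the numeric features can later be selected by position.

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and two categorical columns
toy = pd.DataFrame({
    'qtr': ['1', '4', '3', '1'],
    'temp': [8.2, 18.0, 14.6, 8.3],
    'day': ['fri', 'tue', 'sat', 'fri'],
})

# drop_first=True drops the first sorted level of each categorical column
# (qtr_1 and day_fri here); dtype=float keeps every column numeric
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# Numeric columns come first, dummy columns are appended afterwards
print(list(encoded.columns))
```

Dropping one level per category avoids perfectly collinear dummy columns, which matters for the linear models examined later.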

## Using ColumnTransformer

In [27]:
# import preprocessing modules
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer

Standardize the numeric features in X.  
I will do this with an sklearn pipeline: any missing values in the numeric features are imputed with the column mean, and the features are then standardized to zero mean and unit variance. The one-hot columns are passed through unchanged.

In [42]:
# Split X into numeric and categorical features
X_num = X[:, :10]
X_cat = X[:, 10:]

# Set up the numeric transformer:
# impute missing values with the column mean, then standardize
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Define the preprocessor: transform the numeric columns and
# pass the one-hot columns through unchanged
num_cols = list(range(X_num.shape[1]))
preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_cols)],
                                 remainder='passthrough')
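As a quick sanity check of what this kind of preprocessor does, the following sketch applies the same imputer-then-scaler pipeline to a tiny made-up array (the array `toy` and its values are assumptions for illustration only). The missing entry is replaced by its column mean, the numeric columns end up with zero mean, and the dummy column is left untouched.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy array (made-up values): two numeric columns, one 0/1 dummy column
toy = np.array([[1.0, 10.0, 0.0],
                [2.0, np.nan, 1.0],
                [3.0, 30.0, 0.0]])

num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Columns 0 and 1 are imputed and standardized; column 2 is passed through
ct = ColumnTransformer(transformers=[('num', num_pipeline, [0, 1])],
                       remainder='passthrough')
out = ct.fit_transform(toy)

print(out[:, :2].mean(axis=0))  # column means of the standardized features
print(out[:, 2])                # dummy column, unchanged
```

Note that `ColumnTransformer` places the transformed columns first and the passthrough columns after them, so the column order is preserved only because the numeric features already come first.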

In [43]:
# Let X equal preprocessed version of X 
X = preprocessor.fit_transform(X)

# Take ln(y + 1) of each element; unlike a plain log, this is defined at area = 0
y = np.log1p(y)
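Because the target is now on a log scale, any model trained on it predicts ln(area + 1) rather than hectares. A small sketch of the transform and its inverse, using a few illustrative area values (assumed for the example), may help when interpreting predictions later:

```python
import numpy as np

# Illustrative burned areas in ha (values chosen for the example)
area = np.array([0.0, 0.36, 6.44, 1090.84])

y_log = np.log1p(area)        # ln(area + 1); an area of 0 ha maps to 0
area_back = np.expm1(y_log)   # exp(y) - 1 undoes the transform

print(y_log)
print(area_back)
```

Predictions and error metrics computed on the log scale need `np.expm1` applied before they can be read in hectares.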

## Splitting the Dataset into Training and Test Sets

In [44]:
from sklearn.model_selection import train_test_split

In [45]:
# Tweak test_size if applicable
# random_state (arbitrary value) makes the shuffled split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print('Training Data: %3d' % len(X_train))
print('Test Data: %7d' % len(X_test))

Training Data: 413
Test Data:     104


The numeric features in X are now standardized, the one-hot columns are passed through unchanged, and y has been log-transformed.
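One caveat about the order of operations: in the cells above the preprocessor is fit on all 517 rows before the split, so the imputation means and scaling statistics also reflect the test rows. A minimal sketch of the alternative ordering, splitting first and fitting on the training rows only, is shown below on synthetic data. `X_raw` and `y_raw` are hypothetical stand-ins for the unscaled feature matrix and target, and `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 10 numeric columns followed by 9 dummy columns
X_raw = np.hstack([rng.normal(5.0, 2.0, size=(517, 10)),
                   rng.integers(0, 2, size=(517, 9)).astype(float)])
y_raw = rng.exponential(10.0, size=517)

# Split before any statistics are computed
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=0)

num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[('num', num_transformer, list(range(10)))],
    remainder='passthrough')

# Learn means and standard deviations from the training rows only,
# then apply the same fitted transform to the test rows
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

print(X_train.shape, X_test.shape)
```

With this ordering the test set plays no part in the preprocessing, so test scores are a cleaner estimate of performance on unseen fires.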

Lastly, export the training and test splits as CSV-formatted files.

In [46]:
pd.DataFrame(X_train).to_csv('ForestFires_XTrain.txt', index=False)
pd.DataFrame(X_test).to_csv('ForestFires_XTest.txt', index=False)
pd.DataFrame(y_train).to_csv('ForestFires_yTrain.txt', index=False)
pd.DataFrame(y_test).to_csv('ForestFires_yTest.txt', index=False)
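For later notebooks that consume these files, here is a hedged sketch of the round trip. It writes small made-up arrays to a temporary directory (the names `X_demo`, `y_demo`, and the file names are assumptions for illustration) in the same way as above, then reads them back into NumPy arrays.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small made-up arrays standing in for one split
X_demo = np.array([[0.5, 1.0], [-0.5, 0.0]])
y_demo = np.array([0.0, 1.2])

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, 'X_demo.txt')
    y_path = os.path.join(tmp, 'y_demo.txt')
    pd.DataFrame(X_demo).to_csv(x_path, index=False)
    pd.DataFrame(y_demo).to_csv(y_path, index=False)

    # The header row holds the integer column labels; .values drops it
    X_loaded = pd.read_csv(x_path).values
    # The target was written as a one-column frame, so flatten it back to 1-D
    y_loaded = pd.read_csv(y_path).values.ravel()

print(X_loaded.shape, y_loaded.shape)
```

The `.ravel()` call matters because most sklearn regressors expect a 1-D target array.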