# Predicting Forest Fires


In this notebook I will conduct some statistical analysis motivated by the data visualizations I created in the previous step.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from statsmodels.formula.api import ols

In [6]:
# Import data from flat file

path = 'forestfires.csv'
data = pd.read_csv(path)
data.head()

,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


# Analysis of Variance

## X and Y coordinates:

First, check whether the X and Y coordinates can be excluded as factors in my model.

In [12]:
anova = ols('area ~ C(X)', data=data).fit()

anova.summary()

Dep. Variable:,area,R-squared:,0.011
Model:,OLS,Adj. R-squared:,-0.004
Method:,Least Squares,F-statistic:,0.7235
Date:,"Thu, 07 Mar 2019",Prob (F-statistic):,0.671
Time:,11:15:24,Log-Likelihood:,-2877.5
No. Observations:,517,AIC:,5773.0
Df Residuals:,508,BIC:,5811.0
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,13.3923,9.208,1.454,0.146,-4.698,31.482
C(X)[T.2],-3.8217,11.854,-0.322,0.747,-27.112,19.468
C(X)[T.3],-10.9357,12.600,-0.868,0.386,-35.691,13.820
C(X)[T.4],-3.0071,11.380,-0.264,0.792,-25.365,19.350
C(X)[T.5],-10.3466,14.847,-0.697,0.486,-39.516,18.822
C(X)[T.6],6.7227,11.494,0.585,0.559,-15.858,29.303
C(X)[T.7],-2.2996,12.353,-0.186,0.852,-26.570,21.970
C(X)[T.8],11.0746,12.308,0.900,0.369,-13.107,35.256
C(X)[T.9],5.1546,19.945,0.258,0.796,-34.031,44.340

Omnibus:,979.012,Durbin-Watson:,1.651
Prob(Omnibus):,0.0,Jarque-Bera (JB):,792971.002
Skew:,12.68,Prob(JB):,0.0
Kurtosis:,193.179,Cond. No.,11.1


In [13]:
anova = ols('area ~ C(Y)', data=data).fit()

anova.summary()

Dep. Variable:,area,R-squared:,0.02
Model:,OLS,Adj. R-squared:,0.008
Method:,Least Squares,F-statistic:,1.71
Date:,"Thu, 07 Mar 2019",Prob (F-statistic):,0.116
Time:,11:15:58,Log-Likelihood:,-2875.3
No. Observations:,517,AIC:,5765.0
Df Residuals:,510,BIC:,5794.0
Df Model:,6,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,15.5134,9.557,1.623,0.105,-3.263,34.290
C(Y)[T.3],-6.4034,12.415,-0.516,0.606,-30.794,17.987
C(Y)[T.4],-7.1006,10.542,-0.674,0.501,-27.812,13.611
C(Y)[T.5],0.2452,11.113,0.022,0.982,-21.587,22.077
C(Y)[T.6],4.8725,12.068,0.404,0.687,-18.837,28.582
C(Y)[T.8],170.2466,64.111,2.656,0.008,44.293,296.201
C(Y)[T.9],-14.7684,27.589,-0.535,0.593,-68.970,39.434

Omnibus:,990.319,Durbin-Watson:,1.641
Prob(Omnibus):,0.0,Jarque-Bera (JB):,849262.057
Skew:,12.981,Prob(JB):,0.0
Kurtosis:,199.851,Cond. No.,25.9


So there does appear to be a statistically significant difference between location 'Y'=2 and 'Y'=8 (p = 0.008), although the overall F-test for 'Y' is not significant (p = 0.116). However, as noted before, there is only one data point with 'Y' coordinate 8, so while its coefficient is statistically significant relative to the baseline level 'Y'=2, I will ignore it because of the risk of overfitting that would come with including it as a factor.

Conclusion:

I can justifiably exclude both the X and Y columns as inputs to my model.
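The concern about a level with only one observation can be checked directly by counting rows per factor level. The sketch below is a minimal illustration: the helper name `sparse_levels`, the `min_count` threshold, and the toy frame are assumptions made for demonstration and are not part of the original analysis.

```python
import pandas as pd

def sparse_levels(df, column, min_count=5):
    """Return the levels of `column` that have fewer than `min_count` rows."""
    counts = df[column].value_counts()
    return counts[counts < min_count].sort_index()

# Toy data for illustration only (not the forest fire data)
toy = pd.DataFrame({
    'Y': [2, 2, 2, 4, 4, 4, 8],
    'area': [0.0, 1.2, 0.0, 3.4, 0.0, 0.5, 170.0],
})

# Levels with fewer than two observations
sparse = sparse_levels(toy, 'Y', min_count=2)
print(sparse)
```

Applied to the real frame, `sparse_levels(data, 'Y')` would list any coordinates that are too thinly populated to support a reliable coefficient estimate.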

## Quarter and Day of Week:

Based on the earlier inspection, I decided to convert the month column into a quarter column because the number of observations varies widely from month to month.

In [15]:
month_to_qtr = {'jan': 1, 'feb': 1, 'mar': 1, 
                'apr': 2, 'may': 2, 'jun': 2, 
                'jul': 3, 'aug': 3, 'sep': 3, 
                'oct': 4, 'nov': 4, 'dec': 4}

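# Replace each month abbreviation with its calendar quarter, then rename the
# column and store it as a string so it is treated as a categorical label
data['month'] = data['month'].apply(lambda x: month_to_qtr[x])
data = data.rename(columns={'month': 'qtr'}).astype(dtype={'qtr': 'str'})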

In [17]:
anova = ols('area ~ C(qtr)', data=data).fit()

anova.summary()

Dep. Variable:,area,R-squared:,0.004
Model:,OLS,Adj. R-squared:,-0.002
Method:,Least Squares,F-statistic:,0.6586
Date:,"Thu, 07 Mar 2019",Prob (F-statistic):,0.578
Time:,11:30:44,Log-Likelihood:,-2879.5
No. Observations:,517,AIC:,5767.0
Df Residuals:,513,BIC:,5784.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4.7468,7.309,0.649,0.516,-9.613,19.106
C(qtr)[T.2],3.0317,14.086,0.215,0.830,-24.642,30.706
C(qtr)[T.3],10.3149,7.993,1.291,0.197,-5.388,26.018
C(qtr)[T.4],4.0348,14.691,0.275,0.784,-24.827,32.897

Omnibus:,982.81,Durbin-Watson:,1.649
Prob(Omnibus):,0.0,Jarque-Bera (JB):,807981.439
Skew:,12.783,Prob(JB):,0.0
Kurtosis:,194.975,Cond. No.,8.13


In [18]:
anova = ols('area ~ C(day)', data=data).fit()

anova.summary()

Dep. Variable:,area,R-squared:,0.01
Model:,OLS,Adj. R-squared:,-0.002
Method:,Least Squares,F-statistic:,0.8593
Date:,"Thu, 07 Mar 2019",Prob (F-statistic):,0.525
Time:,11:31:01,Log-Likelihood:,-2877.8
No. Observations:,517,AIC:,5770.0
Df Residuals:,510,BIC:,5799.0
Df Model:,6,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.2616,6.910,0.761,0.447,-8.314,18.837
C(day)[T.mon],4.2861,10.129,0.423,0.672,-15.614,24.186
C(day)[T.sat],20.2724,9.801,2.068,0.039,1.016,39.528
C(day)[T.sun],4.8429,9.512,0.509,0.611,-13.844,23.530
C(day)[T.thu],11.0843,10.690,1.037,0.300,-9.918,32.087
C(day)[T.tue],7.3601,10.544,0.698,0.485,-13.354,28.074
C(day)[T.wed],5.4532,11.087,0.492,0.623,-16.328,27.234

Omnibus:,976.258,Durbin-Watson:,1.645
Prob(Omnibus):,0.0,Jarque-Bera (JB):,775077.886
Skew:,12.611,Prob(JB):,0.0
Kurtosis:,191.001,Cond. No.,7.43


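Of all quarters and days of the week, only Saturday has a statistically significant coefficient (p = 0.039, relative to the Friday baseline), and the overall F-test for day of week is not significant (p = 0.525). I find the lack of a quarter effect a bit surprising considering the difference in climate between winter and the warmer spring and summer months.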
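The summary tables above report two different things: a t-test for each level against the baseline, and a single overall F-test (`Prob (F-statistic)`) for the factor as a whole. As a rough sketch, the overall test can also be run directly with `scipy.stats.f_oneway`. The helper name `one_way_f_test` and the synthetic data below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

def one_way_f_test(df, factor, response):
    """One-way ANOVA F-test of `response` across the levels of `factor`."""
    groups = [g[response].to_numpy() for _, g in df.groupby(factor)]
    return stats.f_oneway(*groups)

# Synthetic data for illustration only (not the forest fire data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'day': np.repeat(['fri', 'sat', 'sun'], 30),
    'area': rng.exponential(scale=10.0, size=90),
})

result = one_way_f_test(toy, 'day', 'area')
print(result.statistic, result.pvalue)
```

For a model with a single categorical predictor, this F-test is the same quantity that the OLS summary reports as the F-statistic, so `one_way_f_test(data, 'day', 'area')` should agree with the table above.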

In [19]:
data.groupby('qtr')['area'].describe()

qtr,count,mean,std,min,25%,50%,75%,max
1,76.0,4.746842,9.947936,0.0,0.0,0.0,5.43,51.78
2,28.0,7.778571,18.070853,0.0,0.0,0.0,3.7925,70.32
3,388.0,15.061727,73.017183,0.0,0.0,0.755,6.4325,1090.84
4,25.0,8.7816,11.742596,0.0,0.0,5.44,11.19,49.37


While the actual damage caused by fires may not vary significantly by quarter, the frequency certainly does. Quarter 3 alone has 388 data points out of the 517 entries!
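To put the frequency imbalance in proportion, the counts from the `describe()` output above can be turned into shares. This is a small sketch that hard-codes those counts; the variable names are chosen here for illustration.

```python
import pandas as pd

# Observation counts per quarter, copied from the describe() output above
counts = pd.Series({'1': 76, '2': 28, '3': 388, '4': 25}, name='count')

# Fraction of all observations that fall in each quarter
shares = counts / counts.sum()
print(shares.round(3))
```

On the DataFrame itself, `data['qtr'].value_counts(normalize=True)` gives the same proportions without copying the counts by hand.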

In [22]:
data.groupby('day')['area'].describe()

day,count,mean,std,min,25%,50%,75%,max
fri,85.0,5.261647,10.012083,0.0,0.0,0.33,5.97,43.32
mon,74.0,9.547703,33.703562,0.0,0.0,0.745,6.0325,278.53
sat,84.0,25.534048,122.69884,0.0,0.0,0.34,7.55,1090.84
sun,95.0,10.104526,26.076032,0.0,0.0,0.0,6.815,196.48
thu,61.0,16.345902,95.351052,0.0,0.0,0.9,4.95,746.28
tue,64.0,12.621719,33.568193,0.0,0.0,0.655,8.85,212.88
wed,54.0,10.714815,30.285914,0.0,0.0,0.76,5.7825,185.76


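This supports the notion that fires on Saturday tend to be more destructive than on any other day of the week: the mean burned area on Saturdays (25.5) is the highest of any day, although the Saturday median (0.34) is in line with the other days and the Saturday maximum (1090.84) is the largest fire in the data. The data come from a public park, so perhaps Saturdays tend to be more destructive because of increased traffic or even the potential for man-made fires.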

Either way, the Saturday coefficient is statistically significant at the 5% level, so day of week will remain as a factor.
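One way to probe how much the Saturday mean depends on its single largest fire is to recompute it with that observation removed. The sketch below uses only the count, mean, and max from the `groupby('day')` table above; it is a quick sensitivity check with variable names chosen for illustration, not a replacement for the model fit.

```python
# Figures copied from the groupby('day') describe() output above
sat_count = 84
sat_mean = 25.534048
sat_max = 1090.84

# Total burned area on Saturdays, then the mean with the largest fire removed
sat_total = sat_count * sat_mean
mean_without_max = (sat_total - sat_max) / (sat_count - 1)
print(round(mean_without_max, 2))
```

The result is about 12.7, which is close to the means for the other days in the same table, so the Saturday effect leans heavily on one observation.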