# US Wildfire Analysis

## Introduction

**Wildfires** are a common problem in the United States, and wildfires that run out of control have severe consequences.

## Problems:
<ul>
    <li>Q1: Have wildfires become more or less frequent over time?</li>
    <li>Q2: What counties are the most and least fire-prone?</li>
    <li>Q3: Given the size, location and date, can you predict the cause of a wildfire?</li>
</ul>

### Import Libraries and Connect Database


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlite3
import seaborn as sns
sns.set()

In [None]:
conn = sqlite3.connect('FPA_FOD_20170508.sqlite')

### Q1: Have wildfires become more or less frequent over time?

In [None]:
wildfires = pd.read_sql_query('SELECT FIRE_YEAR FROM Fires;', con=conn)

Create a histogram of fire incidents by year.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
# One bar per bin; rwidth leaves a small gap between bars
ax.hist(wildfires['FIRE_YEAR'], rwidth=0.9, bins=24);
ax.set_xlabel('Year')
ax.set_ylabel('Number of Fire Incidents')
ax.set_title('Incidents by Year')

In [None]:
wildfires['FIRE_YEAR'].value_counts()

In [None]:
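# Sorted array of the years present in the data
years = np.sort(wildfires['FIRE_YEAR'].unique())

# Average number of incidents per day for each year
# (365 days is used for every year, so leap years are slightly overstated)
freqs = []
for year in years:
    freq = float((wildfires['FIRE_YEAR'] == year).sum() / 365)
    freqs.append(freq)
    print(f'Frequency: {year} -> ', freq)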

In [None]:
freqs_dict = {'year': years, 'frequency': freqs}
df_freq = pd.DataFrame(data=freqs_dict)
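
The loop above divides every yearly count by 365. A more compact sketch of the same idea uses `value_counts` and the standard-library `calendar.isleap` to account for leap years. The `sample` frame and the name `df_freq_sketch` are made up for illustration and are not part of the original analysis.

```python
import calendar

import pandas as pd

# Made-up sample standing in for the FIRE_YEAR column (illustration only).
sample = pd.DataFrame({'FIRE_YEAR': [1992] * 732 + [1993] * 365})

# Count incidents per year, ordered by year.
counts = sample['FIRE_YEAR'].value_counts().sort_index()

# Days in each year: 366 for leap years, 365 otherwise.
days = pd.Series(
    [366 if calendar.isleap(int(y)) else 365 for y in counts.index],
    index=counts.index,
)

# Average number of incidents per day in each year.
daily_rate = counts / days

df_freq_sketch = pd.DataFrame(
    {'year': daily_rate.index.to_numpy(), 'frequency': daily_rate.to_numpy()}
)
print(df_freq_sketch)
```

On the real data, `sample` would be replaced by the `wildfires` frame loaded earlier.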

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.kdeplot(x=df_freq['year'], y=df_freq['frequency'], fill=True, cbar=True)
plt.title('Density of Frequencies by Year')

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
sns.regplot(x=df_freq['year'], y=df_freq['frequency'])
plt.title('Wildfire Frequency Trend')

From these three graphs, we can see a large increase in the number of fire incidents in 2006. The minimum number of incidents per year also rose over time, so we can say that wildfires are becoming more frequent.
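
The regression plot shows the trend visually. One way to put a number on it is the slope of a least-squares line, sketched here with `np.polyfit`. The arrays `demo_years` and `demo_freqs` hold made-up values for illustration only; they are not results from the dataset.

```python
import numpy as np

# Made-up yearly frequencies (incidents per day), for illustration only.
demo_years = np.array([2000, 2001, 2002, 2003, 2004], dtype=float)
demo_freqs = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Center the years to keep the fit well conditioned; the slope is unchanged.
centered = demo_years - demo_years.mean()

# Fit frequency = slope * centered_year + intercept by least squares.
slope, intercept = np.polyfit(centered, demo_freqs, deg=1)

# A positive slope means incidents per day grow from year to year.
print(f'slope: {slope:.3f} incidents/day per year')
```

With the real data, `df_freq['year']` and `df_freq['frequency']` would take the place of the demo arrays.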

### Q2: What counties are the most and least fire-prone?

#### Q2.1: Fire Size

In [None]:
wildfires = pd.read_sql_query("SELECT FIPS_CODE, FIPS_NAME, STATE, FIRE_SIZE FROM Fires;", conn).dropna().sort_values(by='FIRE_SIZE', ascending=False)
wildfires.set_index('FIPS_CODE', inplace=True)
wildfires['AREA'] = wildfires['FIPS_NAME'] + ', ' + wildfires['STATE']
wildfires.drop(['FIPS_NAME', 'STATE'], axis=1, inplace=True)
wildfires

In [None]:
area = wildfires.groupby('AREA').sum().sort_values(by='FIRE_SIZE', ascending=False)
area[:15].plot(kind='bar', figsize=(12, 6))

According to the data, Elko, Nevada has the largest total wildfire size of any county, and Dewey, South Dakota has the smallest. By total fire size, this makes Elko, Nevada one of the most fire-prone counties and Dewey, South Dakota one of the least fire-prone.
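
The bar chart only shows the top 15 counties, so the smallest total has to be read from the grouped data itself. A minimal sketch of how both extremes can be pulled out with `nlargest` and `nsmallest` follows. The `toy` frame and its county names are made up for illustration.

```python
import pandas as pd

# Made-up fire records (illustration only); AREA mirrors the column built above.
toy = pd.DataFrame({
    'AREA': ['A, XX', 'A, XX', 'B, YY', 'C, ZZ', 'C, ZZ'],
    'FIRE_SIZE': [100.0, 50.0, 0.5, 10.0, 5.0],
})

# Total fire size per county.
totals = toy.groupby('AREA')['FIRE_SIZE'].sum()

print(totals.nlargest(1))   # county with the largest total
print(totals.nsmallest(1))  # county with the smallest total
```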

#### Q2.2: Fire Frequency

In [None]:
incidents = pd.read_sql_query("SELECT FIPS_CODE, FIPS_NAME, STATE FROM Fires;", conn).dropna()
incidents.set_index('FIPS_CODE', inplace=True)
incidents['AREA'] = incidents['FIPS_NAME'] + ', ' + incidents['STATE']
incidents.drop(['FIPS_NAME', 'STATE'], axis=1, inplace=True)
incidents

In [None]:
incidents.value_counts().rename_axis('AREA').to_frame('counts')[:15].plot(kind='bar', figsize=(12, 6))

In this graph, we can see that Coconino, Arizona has the most wildfire incidents of any county, which makes it one of the most fire-prone counties.

In [None]:
incident_counts = incidents.value_counts().rename_axis('AREA').to_frame('counts')
incident_min = incident_counts[incident_counts['counts'] < 2]
incident_min.index.to_list()

The counties listed above each have only a single recorded wildfire, which makes them the least fire-prone by frequency.

### Q3: Given the size, location and date, can you predict the cause of a wildfire?

In [None]:
wildfires = pd.read_sql_query("SELECT FIRE_SIZE, STATE, LATITUDE, LONGITUDE, CONT_DOY, CONT_TIME, STAT_CAUSE_DESCR FROM Fires;", conn).dropna()
wildfires

#### Data Preparation

##### Check correlations

In [None]:
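# Restrict to numeric columns; text columns such as STATE cannot be correlated
corr = wildfires.select_dtypes(include='number').corr()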

In [None]:
sns.heatmap(corr, annot=True)

##### Encoding & Train Test Split

In [None]:
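from sklearn.preprocessing import LabelEncoder

# Use one encoder per column so each mapping can be inverted later
state_le = LabelEncoder()
cause_le = LabelEncoder()

wildfires['STATE'] = state_le.fit_transform(wildfires['STATE'])
wildfires['STAT_CAUSE_DESCR'] = cause_le.fit_transform(wildfires['STAT_CAUSE_DESCR'])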

Labeled classes:

    0 -> Arson
    1 -> Campfire
    2 -> Children
    3 -> Debris Burning
    4 -> Equipment Use
    5 -> Fireworks
    6 -> Lightning
    7 -> Miscellaneous
    8 -> Missing/Undefined
    9 -> Powerline
    10 -> Railroad
    11 -> Smoking
    12 -> Structure
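
The numbering above follows from how `LabelEncoder` works: it sorts the distinct labels and assigns codes in that order. A small sketch with a subset of the cause labels shows how the mapping can be read back from `classes_`. The names `causes` and `cause_encoder` are introduced here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# A few cause labels in arbitrary order (subset, for illustration only).
causes = ['Lightning', 'Arson', 'Campfire', 'Arson']

cause_encoder = LabelEncoder()
codes = cause_encoder.fit_transform(causes)

# classes_ holds the sorted labels; position i is the label encoded as i.
mapping = {code: str(label) for code, label in enumerate(cause_encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
print(cause_encoder.inverse_transform(codes))
```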

In [None]:
X = wildfires.iloc[:, :-1].to_numpy()
y = wildfires['STAT_CAUSE_DESCR'].to_numpy()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

#### Standardize the data

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
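
The scaler is fitted on the training split only, so the test split is transformed with the training mean and standard deviation. A tiny worked sketch with made-up numbers (the arrays `demo_train` and `demo_test` are illustrative, not wildfire data) makes this visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (illustration only).
demo_train = np.array([[0.0], [2.0], [4.0]])  # mean 2, variance 8/3
demo_test = np.array([[2.0], [6.0]])

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean and std
demo_test_scaled = demo_scaler.transform(demo_test)        # reuses them

print(demo_scaler.mean_, demo_scaler.scale_)
print(demo_test_scaled.ravel())
```

The test value 2.0 maps to 0 because it equals the training mean, not the mean of the test split.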

#### Search on Models

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
import time

In [None]:
# dtc = DecisionTreeClassifier()
# extc = ExtraTreeClassifier()
# bc = BaggingClassifier()
# rfc = RandomForestClassifier()
# knn = KNeighborsClassifier()
# gb = GradientBoostingClassifier()

# clf_list = [dtc, extc, bc, rfc, knn, gb]
# models = []

# for clf in clf_list:
#     start = time.time()
#     clf.fit(X_train, y_train)
#     end = time.time()
#     print(end - start)
#     print(f"Model: {clf}")
#     print(f"train score: {clf.score(X_train, y_train)}, test score: {clf.score(X_test, y_test)}")
#     models.append(clf)
#     print("---------------------")

Of the models above, Gradient Boosting was the only one that did not overfit, scoring 52% accuracy on both the training and test sets.
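
Overfitting is judged here by the gap between training and test accuracy. The sketch below illustrates that check on synthetic data from `make_classification`, not on the wildfire dataset. The helper `score_gap` and all `demo`-style names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (illustration only, not the wildfire dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, flip_y=0.3, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

def score_gap(model):
    """Fit the model and return (train accuracy, test accuracy, gap)."""
    model.fit(Xd_train, yd_train)
    train_acc = model.score(Xd_train, yd_train)
    test_acc = model.score(Xd_test, yd_test)
    return train_acc, test_acc, train_acc - test_acc

# An unpruned tree can memorize the training set; a depth-limited one cannot.
deep = score_gap(DecisionTreeClassifier(random_state=0))
shallow = score_gap(DecisionTreeClassifier(max_depth=2, random_state=0))
print(f'deep tree:    train={deep[0]:.2f} test={deep[1]:.2f} gap={deep[2]:.2f}')
print(f'shallow tree: train={shallow[0]:.2f} test={shallow[1]:.2f} gap={shallow[2]:.2f}')
```

A large positive gap suggests the model fits noise in the training split; a gap near zero, as reported for Gradient Boosting above, suggests it does not.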

#### Search on Gradient Boosting for the best parameters

In [None]:
gb = GradientBoostingClassifier(n_estimators=600)

start = time.time()
gb.fit(X_train, y_train)
end = time.time()

print(end - start)
print(f"Model: {gb}")
print(f"train score: {gb.score(X_train, y_train)}, test score: {gb.score(X_test, y_test)}")

In [None]:
from sklearn.metrics import classification_report
y_pred = gb.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
fig, ax = plt.subplots(figsize=(16,12))

cfmatrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cfmatrix, annot=True, fmt='.0f')


labels = ['Arson', 'Campfire', 'Children', 'Debris Burning', 
          'Equipment Use', 'Fireworks', 'Lightning', 'Miscellaneous', 
          'Missing/Undefined', 'Powerline', 'Railroad', 'Smoking', 
          'Structure']

ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

ax.set_yticklabels(labels, rotation=0)
ax.set_xticklabels(labels, rotation=90);
plt.title('Confusion Matrix of Fire Cause Classification')

In [None]:
FP = cfmatrix.sum(axis=0) - np.diag(cfmatrix)  
FN = cfmatrix.sum(axis=1) - np.diag(cfmatrix)
TP = np.diag(cfmatrix)
TN = cfmatrix.sum() - (FP + FN + TP)

In [None]:
for i in range(len(TP)):
    print(f'Class: {i} \t TP: {TP[i]} \t TN: {TN[i]} \t FP: {FP[i]} \t FN: {FN[i]}')

In [None]:
for i in range(len(TP)):
    print(f'Class: {i} \t Sensitivity: {TP[i]/(TP[i]+FN[i])} \t Specificity: {TN[i]/(TN[i]+FP[i])}')
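
The per-class counts above treat each class as "one versus the rest": for class i, row i of the matrix holds its actual members and column i holds its predictions. A small worked sketch on a made-up 3x3 matrix (the values in `cm` are illustrative only) shows the arithmetic:

```python
import numpy as np

# Made-up confusion matrix: rows are actual classes, columns are predictions.
cm = np.array([
    [5, 1, 0],
    [2, 3, 1],
    [0, 2, 6],
])

tp = np.diag(cm)                # correctly predicted members of each class
fp = cm.sum(axis=0) - tp        # predicted as the class but actually another
fn = cm.sum(axis=1) - tp        # members of the class predicted as another
tn = cm.sum() - (tp + fp + fn)  # everything not involving the class

sensitivity = tp / (tp + fn)    # share of actual members that were found
specificity = tn / (tn + fp)    # share of non-members correctly rejected

for i in range(len(tp)):
    print(f'Class {i}: TP={tp[i]} TN={tn[i]} FP={fp[i]} FN={fn[i]} '
          f'sens={sensitivity[i]:.3f} spec={specificity[i]:.3f}')
```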

In [None]:
!pip install yellowbrick

In [None]:
from yellowbrick.classifier import ROCAUC
fig, ax = plt.subplots(figsize=(12, 12))

visualizer = ROCAUC(gb, classes=labels)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()

<b>Given the size, location and date, it is hard to predict the cause of a wildfire.</b>

In [None]:
!jupyter nbconvert --to html report.ipynb