<a href="https://colab.research.google.com/github/mnocerino23/Wildfire-Forecaster/blob/main/Classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [112]:
import sklearn
import numpy as np
import pandas as pd

In [113]:
from google.colab import drive
drive.mount('/content/drive')

# Read in the two final datasets. The first contains over 110,000 fires from 2001-2015; the second contains roughly 1,200 more recent, larger fires.
wildfire_set1 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/wildfires1_w_snow.csv')
wildfire_set2 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/wildfires2_w_snow.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




Now that we have gathered data from Kaggle, NOAA, and the California Department of Water Resources into two consistent datasets, we will drop the columns that are not needed for training our models and take care of some final preprocessing.

In this notebook, I will start to build multi-class classifiers using Support Vector Machine, Gaussian Naive Bayes, Decision Tree, Random Forest, KNN, Gradient Boosting, and Neural Network models.

The target feature will be ***Fire Size Rank*** (the fire size class, A through G), as we want to predict the size class of a fire, and hence the risk of a large one, given certain weather and snow conditions.


I will deploy the following techniques:

1.   One-Hot Encoding of Categorical Variables
2.   Feature Selection
3.   Splitting the training and testing data
4.   Cross-Validation



In [114]:
print(wildfire_set1.columns)

Index(['Unnamed: 0', 'Year', 'Name', 'AcresBurned', 'Fire Size Rank', 'Cause',
       'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn', 'Discovery Month',
       'Discovered DOY', 'Contained Month', 'Contained DOY', 'Latitude',
       'Longitude', 'County', 'CountyIds', 'State', 'OWNER_DESCR',
       'NOAA Station', 'Link', 'AWND', 'CLDD', 'DP10', 'DX90', 'PRCP', 'TAVG',
       'TMAX', 'TMIN', 'PRCP_6M', 'PRCP_RS', 'DX90_2M', 'DP10_2M',
       'Receives Snow', 'Snow Station', 'River Basin', 'Mar_SP', 'Mar_WC',
       'Mar_Dens'],
      dtype='object')


In [115]:
print(wildfire_set2.columns)

Index(['Unnamed: 0', 'Year', 'Name', 'AcresBurned', 'Fire Size Rank', 'Cause',
       'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn', 'Discovery Month',
       'Discovered DOY', 'Contained Month', 'Contained DOY', 'Latitude',
       'Longitude', 'County', 'CountyIds', 'State', 'OWNER_DESCR',
       'NOAA Station', 'Link', 'AWND', 'CLDD', 'DP10', 'DX90', 'PRCP', 'TAVG',
       'TMAX', 'TMIN', 'PRCP_6M', 'PRCP_RS', 'DX90_2M', 'DP10_2M',
       'Receives Snow', 'Snow Station', 'River Basin', 'Mar_SP', 'Mar_WC',
       'Mar_Dens'],
      dtype='object')


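Drop all columns that are not relevant to our classification task:
Unnamed: 0, Year, Name, Cause, SOURCE_REPORTING_UNIT_NAME, DaysBurn, Discovered DOY, Contained Month, Contained DOY, Latitude, Longitude, CountyIds, State, OWNER_DESCR, NOAA Station, Link, Snow Station, River Basin. AcresBurned is dropped from wildfire_set2 only; it remains in wildfire_set1 for now and should be kept out of the model inputs, since Fire Size Rank appears to be derived directly from acreage.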

In [116]:
wildfire_set1.drop(columns = ['Unnamed: 0', 'Year', 'Name', 'Cause',
                      'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn', 'Discovered DOY', 'Contained Month',
                      'Contained DOY','Latitude','Longitude','CountyIds','State','OWNER_DESCR',
                      'NOAA Station', 'Link', 'Snow Station', 'River Basin'], inplace = True)

In [117]:
wildfire_set2.drop(columns = ['Unnamed: 0', 'Year', 'Name', 'AcresBurned', 'Cause',
                      'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn', 'Discovered DOY', 'Contained Month',
                      'Contained DOY','Latitude','Longitude','CountyIds','State','OWNER_DESCR',
                      'NOAA Station', 'Link', 'Snow Station', 'River Basin'], inplace = True)

Taking a look at both datasets now that we have dropped the irrelevant columns:

In [118]:
wildfire_set1.head(5)

Unnamed: 0,AcresBurned,Fire Size Rank,Discovery Month,County,AWND,CLDD,DP10,DX90,PRCP,TAVG,TMAX,TMIN,PRCP_6M,PRCP_RS,DX90_2M,DP10_2M,Receives Snow,Mar_SP,Mar_WC,Mar_Dens
0,0.1,A,Feb,Plumas,5.6,0.0,12.0,0.0,5.33,38.9,43.9,33.9,27.89,49.06,0.0,19.0,1.0,79.6,34.0,0.43
1,0.25,A,May,Placer,6.9,0.0,2.0,0.0,0.81,47.3,63.0,31.6,14.37,14.76,0.0,3.0,1.0,108.6,38.1,0.35
2,0.1,A,Jun,El Dorado,5.6,36.0,0.0,0.0,0.0,63.1,70.2,56.0,36.71,40.37,0.0,11.0,1.0,108.6,38.1,0.35
3,0.1,A,Jun,Alpine,5.6,0.0,1.0,0.0,0.29,54.7,72.9,36.5,13.63,14.76,0.0,3.0,1.0,87.2,28.4,0.33
4,0.1,A,Jun,Alpine,5.6,0.0,1.0,0.0,0.29,54.7,72.9,36.5,13.63,14.76,0.0,3.0,1.0,87.2,28.4,0.33


In [119]:
wildfire_set1.shape

(114558, 20)

In [120]:
wildfire_set2.head(5)

Unnamed: 0,Fire Size Rank,Discovery Month,County,AWND,CLDD,DP10,DX90,PRCP,TAVG,TMAX,TMIN,PRCP_6M,PRCP_RS,DX90_2M,DP10_2M,Receives Snow,Mar_SP,Mar_WC,Mar_Dens
0,G,Jul,Monterey,6.5,0.0,0.0,0.0,0.0,58.8,65.2,52.4,14.11,21.42,0.0,1.0,0,0.0,0.0,0.0
1,G,Jun,Kern,6.7,529.0,0.0,22.0,0.0,82.6,96.6,68.6,4.68,4.88,15.0,4.0,1,36.0,16.0,0.44
2,G,Aug,San Luis Obispo,6.9,237.0,0.0,23.0,0.0,72.6,92.6,52.6,2.52,8.09,43.0,0.0,0,0.0,0.0,0.0
3,G,Aug,San Bernardino,6.5,455.0,0.0,28.0,0.0,79.7,94.6,64.7,3.41,6.45,43.0,0.0,0,0.0,0.0,0.0
4,G,Aug,Siskiyou,4.5,0.0,0.0,0.0,0.02,56.4,62.9,49.9,18.03,54.17,0.0,2.0,1,77.0,34.0,0.44


In [121]:
wildfire_set2.shape

(1197, 19)

In [122]:
print(wildfire_set1.isnull().sum())

AcresBurned           0
Fire Size Rank        0
Discovery Month       0
County               45
AWND               6254
CLDD               3335
DP10               3042
DX90               3333
PRCP               3042
TAVG               3335
TMAX               3333
TMIN               3285
PRCP_6M            2470
PRCP_RS            2811
DX90_2M            2215
DP10_2M            1599
Receives Snow       102
Mar_SP              102
Mar_WC              102
Mar_Dens            102
dtype: int64


In [123]:
wildfire_set1 = wildfire_set1.dropna()

In [124]:
wildfire_set1.shape

(106024, 20)

In [125]:
print(wildfire_set1['Fire Size Rank'].value_counts())

A    55462
B    43118
C     5097
D     1096
E      597
F      406
G      248
Name: Fire Size Rank, dtype: int64


In [126]:
print(wildfire_set2.isnull().sum())

Fire Size Rank      0
Discovery Month     0
County              0
AWND               37
CLDD               37
DP10               32
DX90               34
PRCP               32
TAVG               37
TMAX               34
TMIN               37
PRCP_6M             7
PRCP_RS            10
DX90_2M             2
DP10_2M             1
Receives Snow       0
Mar_SP              0
Mar_WC              0
Mar_Dens            0
dtype: int64


In [128]:
wildfire_set2 = wildfire_set2.dropna()

In [129]:
wildfire_set2.shape

(1148, 19)

In [130]:
print(wildfire_set2['Fire Size Rank'].value_counts())

C    577
D    230
E    112
F    106
G     97
A     22
B      4
Name: Fire Size Rank, dtype: int64


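Double-check our datatypes before we proceed with preprocessing and model building. As we can see, every column besides Fire Size Rank, Discovery Month, and County is numerical (float64, plus an int64 Receives Snow column in the second dataset). Fire Size Rank is the target, so the only features we have to one-hot encode are the two categorical ones: Discovery Month and County.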

In [97]:
wildfire_set1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106024 entries, 0 to 114557
Data columns (total 20 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   AcresBurned      106024 non-null  float64
 1   Fire Size Rank   106024 non-null  object 
 2   Discovery Month  106024 non-null  object 
 3   County           106024 non-null  object 
 4   AWND             106024 non-null  float64
 5   CLDD             106024 non-null  float64
 6   DP10             106024 non-null  float64
 7   DX90             106024 non-null  float64
 8   PRCP             106024 non-null  float64
 9   TAVG             106024 non-null  float64
 10  TMAX             106024 non-null  float64
 11  TMIN             106024 non-null  float64
 12  PRCP_6M          106024 non-null  float64
 13  PRCP_RS          106024 non-null  float64
 14  DX90_2M          106024 non-null  float64
 15  DP10_2M          106024 non-null  float64
 16  Receives Snow    106024 non-null  floa

In [98]:
wildfire_set2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1148 entries, 0 to 1196
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Fire Size Rank   1148 non-null   object 
 1   Discovery Month  1148 non-null   object 
 2   County           1148 non-null   object 
 3   AWND             1148 non-null   float64
 4   CLDD             1148 non-null   float64
 5   DP10             1148 non-null   float64
 6   DX90             1148 non-null   float64
 7   PRCP             1148 non-null   float64
 8   TAVG             1148 non-null   float64
 9   TMAX             1148 non-null   float64
 10  TMIN             1148 non-null   float64
 11  PRCP_6M          1148 non-null   float64
 12  PRCP_RS          1148 non-null   float64
 13  DX90_2M          1148 non-null   float64
 14  DP10_2M          1148 non-null   float64
 15  Receives Snow    1148 non-null   int64  
 16  Mar_SP           1148 non-null   float64
 17  Mar_WC        

In [99]:
wildfire_set1.describe()

Unnamed: 0,AcresBurned,AWND,CLDD,DP10,DX90,PRCP,TAVG,TMAX,TMIN,PRCP_6M,PRCP_RS,DX90_2M,DP10_2M,Receives Snow,Mar_SP,Mar_WC,Mar_Dens
count,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0,106024.0
mean,79.390867,6.294384,224.058279,0.940853,12.191391,0.457943,69.135184,83.236084,55.033364,9.410894,16.386058,18.299178,3.139883,0.48871,27.645534,11.249224,0.186137
std,2456.30882,1.875589,215.023594,1.810402,11.505412,1.135417,10.958172,13.094602,9.797339,9.3014,13.181156,19.939243,4.419945,0.499875,40.044686,16.943957,0.223842
min,0.001,0.4,0.0,0.0,0.0,0.0,18.3,33.0,3.7,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.1,4.9,19.0,0.0,0.0,0.0,61.4,73.8,49.0,2.87,7.04,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.25,6.0,180.0,0.0,9.0,0.04,70.4,85.1,55.8,6.59,12.42,10.0,2.0,0.0,0.0,0.0,0.0
75%,1.0,7.4,377.0,1.0,23.0,0.41,77.4,93.8,61.8,12.91,22.33,35.0,4.0,1.0,56.5,20.6,0.38
max,315578.8,17.2,1113.0,24.0,31.0,33.7,100.9,114.3,89.0,85.91,88.29,62.0,38.0,1.0,215.0,113.0,1.56


In [100]:
wildfire_set2.describe()

Unnamed: 0,AWND,CLDD,DP10,DX90,PRCP,TAVG,TMAX,TMIN,PRCP_6M,PRCP_RS,DX90_2M,DP10_2M,Receives Snow,Mar_SP,Mar_WC,Mar_Dens
count,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0,1148.0
mean,6.647909,271.436411,0.567944,14.698606,0.246977,72.098606,87.083362,57.11507,11.065462,20.15088,21.529617,2.364983,0.430314,29.864373,13.209495,0.174861
std,1.787315,216.499359,1.341998,11.316857,0.804377,9.65092,11.292646,9.178351,9.882451,15.915638,20.114363,3.531657,0.495336,46.300128,21.06606,0.217942
min,1.3,0.0,0.0,0.0,0.0,29.0,43.5,14.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.4,67.75,0.0,3.0,0.0,65.65,79.975,52.1,4.2,9.84,2.0,0.0,0.0,0.0,0.0,0.0
50%,6.5,264.5,0.0,15.0,0.0,73.3,89.3,57.6,8.325,15.37,15.0,1.0,0.0,0.0,0.0,0.0
75%,7.8,428.25,1.0,26.0,0.12,78.9,95.4,62.9,14.16,24.98,39.0,3.0,1.0,51.5,23.0,0.42
max,15.0,1005.0,12.0,31.0,10.46,97.4,111.6,85.0,67.97,87.18,62.0,25.0,1.0,178.5,85.0,0.58


In [101]:
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

Address Null Values and Data Smoothing

One-hot encode categorical features (e.g. Discovery Month)
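
As a minimal sketch of this step, assuming pandas `get_dummies` is the encoder I end up using: the small frame below is stand-in data that only borrows the notebook's column names, not the real dataset.

```python
import pandas as pd

# Tiny stand-in frame using the notebook's categorical column names (values are illustrative).
demo = pd.DataFrame({
    "Discovery Month": ["Feb", "May", "Jun", "Jun"],
    "County": ["Plumas", "Placer", "El Dorado", "Alpine"],
    "TAVG": [38.9, 47.3, 63.1, 54.7],
})

# Replace each categorical column with one indicator column per category.
encoded = pd.get_dummies(demo, columns=["Discovery Month", "County"])
print(encoded.columns.tolist())
```

If both datasets are encoded separately, a county that appears in only one of them would produce mismatched columns, so the two encoded frames would need to be aligned to a shared column list before modeling.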

Bar Charts and Class Definition

Split the data into train-test sets

In [102]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
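
Because the size classes are heavily imbalanced (class A has tens of thousands of rows while G has a few hundred), a stratified split seems like the safer choice. The sketch below uses synthetic labels as a stand-in; the 80/20 ratio and `random_state` are my assumptions, not settled choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 4 features, imbalanced labels in the spirit of Fire Size Rank.
X = rng.normal(size=(200, 4))
y = np.array(["A"] * 120 + ["B"] * 60 + ["C"] * 20)

# stratify=y keeps each class's share the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```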

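Normalize the data with MinMaxScaler after splitting into train and test sets, fitting the scaler on the training set only so that no information from the test set leaks into it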
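
A small sketch of why the order matters, with made-up numbers: the scaler learns its minimum and maximum from the training rows and reuses them on the test rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only: two features, three training rows, one test row.
X_train = np.array([[40.0, 0.0], [60.0, 5.0], [80.0, 10.0]])
X_test = np.array([[90.0, 2.5]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows
print(X_test_scaled)
```

Here the test row scales to 1.25 and 0.25, which shows that test values can fall outside [0, 1] when they exceed the training range.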

Feature Selection

In [103]:
from sklearn.feature_selection import RFE, SelectKBest
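
One possible way to use `SelectKBest`, sketched on synthetic data from `make_classification`. The ANOVA F-score (`f_classif`) and `k=5` are assumptions for illustration; the right score function and number of features still have to be chosen for the wildfire data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 10 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Keep the 5 features with the highest ANOVA F-score against the class label.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # column indices of the retained features
print(kept)
```

`RFE` follows the same fit/transform pattern but needs an estimator, since it ranks features by repeatedly refitting a model and discarding the weakest ones.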

Cross-Validation
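
A sketch of stratified 5-fold cross-validation with `cross_val_score`. The data is synthetic, and the decision tree, the fold count, and macro-F1 scoring are placeholder choices; macro-F1 is used here because plain accuracy can look good on imbalanced classes while ignoring the rare ones.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced three-class stand-in for the wildfire data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

# Each fold preserves the class proportions of the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="f1_macro"
)
print(scores.mean(), scores.std())
```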

Model Building:

SVM Classifier

In [104]:
from sklearn import svm
from sklearn.svm import SVC
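
A hedged sketch of how the SVM could be set up: scaling and the classifier are chained in a pipeline so the scaler is always fit on training data only. The RBF kernel, `C=1.0`, and `class_weight="balanced"` are assumptions to be tuned, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic, imbalanced three-class stand-in.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency.
svm_clf = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0, class_weight="balanced"))
svm_clf.fit(X_train, y_train)
svm_pred = svm_clf.predict(X_test)
```

Kernel SVM training time grows quickly with the number of rows, so on the larger dataset a subsample or a linear SVM may be more practical.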

KNN Classifier

In [105]:
from sklearn.neighbors import KNeighborsClassifier
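
For KNN the main knob is the number of neighbors. The sketch below compares a few candidate values by cross-validated accuracy on synthetic data; the candidate list `[3, 5, 11]` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Mean 5-fold accuracy for each candidate k; keep the best-scoring one.
ks = [3, 5, 11]
mean_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in ks
}
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```

Since KNN is distance-based, the features should be scaled first on the real data.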

Naive Bayes Classifier

In [106]:
from sklearn.naive_bayes import GaussianNB
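
Gaussian Naive Bayes has essentially nothing to tune, which makes it a convenient baseline. A minimal sketch on synthetic data, showing that it also returns per-class probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

nb = GaussianNB().fit(X_train, y_train)
proba = nb.predict_proba(X_test)  # one row per sample, one column per class
```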

Decision Tree Classifier

In [107]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
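
A sketch of a depth-limited decision tree on synthetic data. The `max_depth=3` cap is an assumption to keep the tree readable, and the feature names `f0`..`f3` are placeholders rather than columns from the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Limiting depth curbs overfitting and keeps the printed rules short.
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names
rules = export_text(dt, feature_names=feature_names)
print(rules)
```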

Random Forest Classifier

In [108]:
from sklearn.ensemble import RandomForestClassifier
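
Besides predictions, a random forest gives impurity-based feature importances, which could complement the feature selection step. A sketch on synthetic data; `n_estimators=100` and `class_weight="balanced"` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
).fit(X, y)

# Importances are non-negative and sum to 1 across features.
importances = rf.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first
print(ranking)
```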

Gradient Boosting Classifier

In [109]:
from sklearn.ensemble import GradientBoostingClassifier
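
A sketch of fitting a gradient boosting model and evaluating it with the metrics imported earlier (confusion matrix and macro-F1). The data is synthetic and the hyperparameters are placeholder values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, max_depth=3, random_state=0
).fit(X_train, y_train)
gb_pred = gb.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, gb_pred, labels=[0, 1, 2])
macro_f1 = f1_score(y_test, gb_pred, average="macro")
print(cm, macro_f1)
```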

Neural Network Multiclass classifier (TensorFlow)
- Experiment by changing number of hidden layers and activation functions (sigmoid, relu, softmax)
- Change number of epochs and add more hidden layers
- Size of input = number of features in the dataset
- Size of output = number of classes in the multiclass classification problem

In [110]:
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, BatchNormalization, Dropout
from tensorflow.keras import optimizers
import matplotlib.pyplot as plt
import seaborn as sns
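
A minimal sketch of the network described above, assuming a single hidden layer as a starting point. The input size (`n_features = 16`) is a placeholder that depends on how many columns remain after encoding; the output size of 7 corresponds to the classes A through G. The training data here is random noise, used only to show the shapes.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 16, 7  # placeholder input size; 7 size classes (A-G)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, n_features)).astype("float32")  # random stand-in features
y = rng.integers(0, n_classes, size=64)                  # integer-coded labels 0..6

model = keras.Sequential([
    keras.Input(shape=(n_features,)),               # input size = number of features
    layers.Dense(32, activation="relu"),            # hidden layer to experiment with
    layers.Dense(n_classes, activation="softmax"),  # output size = number of classes
])
# sparse_categorical_crossentropy accepts integer labels directly.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)

probs = model.predict(X, verbose=0)  # one probability per class for each row
```

The letter labels in Fire Size Rank would need to be mapped to integers (or one-hot vectors with `categorical_crossentropy`) before training.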

Voting Classifier

In [111]:
from sklearn.ensemble import VotingClassifier
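
A sketch of how the individual models could be combined. Soft voting averages the predicted class probabilities, so every member must support `predict_proba`; the three members chosen here and their settings are assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

voter = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",  # average class probabilities instead of counting votes
)
voter.fit(X_train, y_train)
vote_pred = voter.predict(X_test)
```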