<a href="https://colab.research.google.com/github/mnocerino23/Wildfire-Forecaster/blob/main/Classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Now that we have gathered data from Kaggle, NOAA, and the California Department of Water Resources into two consistent datasets, we will drop the columns that are not needed for training our models and take care of some final preprocessing.

In this file, I will start to build multi-class classifiers using Support Vector Machine, Gaussian Naive Bayes, Decision Tree, Random Forest, KNN, Gradient Boosting, and Neural Networks. 

The target feature will be ***Fire Size Class*** (the `Fire Size Rank` column in both datasets), as we want to predict the size/risk of a large fire given certain weather and snow conditions.


I will deploy the following techniques:

1.   One-Hot Encoding of Categorical Variables
2.   Feature Selection
3.   Splitting the training and testing data
4.   Cross-Validation



In [1]:
import sklearn
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

#Read in the two final datasets. The first contains over 110,000 fires from 2001-2015 while the second has 1,000 more recent, larger fires.
wildfire_set1 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/wildfires1_w_snow.csv')
wildfire_set2 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/wildfires2_w_snow.csv')

Mounted at /content/drive



Before building the classifiers, I take care of a few small issues and add one additional feature. Inspecting the datasets showed that some rows carry invalid coordinates (latitude = 0, longitude = 0), so the code below drops them:

In [2]:
# Drop rows with the invalid placeholder coordinates (0, 0) using a boolean
# mask instead of a row-by-row loop.
for df in (wildfire_set1, wildfire_set2):
  invalid = (df['Latitude'] == 0) & (df['Longitude'] == 0)
  df.drop(index=df[invalid].index, inplace=True)

# reset_index() returns a new frame rather than modifying in place, so this
# only displays wildfire_set2 with a fresh index; nothing is stored.
wildfire_set2.reset_index()

Unnamed: 0.1,index,Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,...,PRCP_6M,PRCP_RS,DX90_2M,DP10_2M,Receives Snow,Snow Station,River Basin,Mar_SP,Mar_WC,Mar_Dens
0,0,0,2016,Soberanes Fire,132127.0,G,,,83.0,Jul,...,14.11,21.42,0.0,1.0,0,,,0.0,0.0,0.00
1,1,1,2016,Erskine Fire,48019.0,G,,,18.0,Jun,...,4.68,4.88,15.0,4.0,1,mineral_king,Kaweah,36.0,16.0,0.44
2,2,2,2016,Chimney Fire,46344.0,G,,,24.0,Aug,...,2.52,8.09,43.0,0.0,0,,,0.0,0.0,0.00
3,3,3,2016,Blue Cut Fire,36274.0,G,,,7.0,Aug,...,3.41,6.45,43.0,0.0,0,,,0.0,0.0,0.00
4,4,4,2016,Gap Fire,33867.0,G,,,1.0,Aug,...,18.03,54.17,0.0,2.0,1,parks_creek,Shasta,77.0,34.0,0.44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1151,1192,1192,2019,Eagle Fire,9.0,B,,,,Oct,...,0.49,12.66,48.0,0.0,0,,,0.0,0.0,0.00
1152,1193,1193,2019,Long Fire,2.0,B,,,,Jun,...,67.97,69.29,0.0,17.0,1,eureka_lake,Feather,110.0,48.0,0.44
1153,1194,1194,2019,Cashe Fire,,B,,,,Nov,...,3.29,21.47,13.0,0.0,0,,,0.0,0.0,0.00
1154,1195,1195,2019,Oak Fire,,B,,,,Oct,...,0.00,0.00,0.0,0.0,0,,,0.0,0.0,0.00


Add one final feature, Elevation. We make a POST request to the Open-Elevation API (https://developer.mapquest.com/documentation/open/elevation-api/#:~:text=The%20Open%20Elevation%20API%20provides,by%20the%20lat%2Flng%20collection), which returns the elevation for a given latitude and longitude. Below, we create a dictionary whose key `locations` maps to a list of dictionaries, each holding one fire's coordinates; this is the request format described in the API's GitHub documentation (https://github.com/Jorl17/open-elevation/blob/master/docs/api.md).

In [141]:
coordinates = []
for index, row in wildfire_set1.iterrows():
  if len(coordinates) < 1500:
    d = {}
    d["latitude"] = wildfire_set1.at[index,"Latitude"]
    d["longitude"] = wildfire_set1.at[index,"Longitude"]
    coordinates.append(d)
  else:
    break
print(len(coordinates))
final = {}
final["locations"] = coordinates

1500
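
The same request body can be built without an explicit loop. The sketch below assumes only the `Latitude` and `Longitude` columns used above; the helper name `build_payloads`, its `batch_size` parameter, and the three-row demo frame are illustrative and not part of the notebook or the API. Splitting into batches would allow the whole dataset, rather than only the first 1,500 rows, to be sent over several requests.

```python
import pandas as pd

def build_payloads(df, batch_size=1500):
    """Yield request bodies of the form {"locations": [...]}, one per batch.

    Hypothetical helper for this sketch; batch_size is an assumed limit.
    """
    coords = (
        df[["Latitude", "Longitude"]]
        .rename(columns={"Latitude": "latitude", "Longitude": "longitude"})
        .to_dict("records")
    )
    for start in range(0, len(coords), batch_size):
        yield {"locations": coords[start:start + batch_size]}

# Tiny stand-in frame; in the notebook this would be wildfire_set1.
demo = pd.DataFrame({
    "Latitude": [40.03694444, 38.93305556, 38.98416667],
    "Longitude": [-121.00583333, -120.40444444, -120.73555556],
})
payloads = list(build_payloads(demo, batch_size=2))
print(len(payloads), [len(p["locations"]) for p in payloads])
```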


In [142]:
final

{'locations': [{'latitude': 40.03694444, 'longitude': -121.00583333},
  {'latitude': 38.93305556, 'longitude': -120.40444444},
  {'latitude': 38.98416667, 'longitude': -120.73555556},
  {'latitude': 38.55916667, 'longitude': -119.91333333},
  {'latitude': 38.55916667, 'longitude': -119.93305556},
  {'latitude': 38.63527778, 'longitude': -120.10361111},
  {'latitude': 38.68833333, 'longitude': -120.15333333},
  {'latitude': 40.96805556, 'longitude': -122.43388889},
  {'latitude': 41.23361111, 'longitude': -122.28333333},
  {'latitude': 38.54833333, 'longitude': -120.14916667},
  {'latitude': 38.69166667, 'longitude': -120.15972222},
  {'latitude': 38.5275, 'longitude': -120.10611111},
  {'latitude': 38.78666667, 'longitude': -120.19333333},
  {'latitude': 38.43333333, 'longitude': -120.51},
  {'latitude': 38.67583333, 'longitude': -120.27972222},
  {'latitude': 38.56416667, 'longitude': -120.54222222},
  {'latitude': 38.52333333, 'longitude': -120.21166667},
  {'latitude': 38.78, 'longi

In [143]:
import requests
import json

coord = final
j = json.dumps(coord)
print(type(j))
json_object = json.loads(j)
print(type(json_object))
r = requests.post(url= 'https://api.open-elevation.com/api/v1/lookup', json= json_object, timeout = 30)
r.text

<class 'str'>
<class 'dict'>


'{"results": [{"latitude": 40.03694444, "longitude": -121.00583333, "elevation": 904}, {"latitude": 38.93305556, "longitude": -120.40444444, "elevation": 1892}, {"latitude": 38.98416667, "longitude": -120.73555556, "elevation": 1053}, {"latitude": 38.55916667, "longitude": -119.91333333, "elevation": 2365}, {"latitude": 38.55916667, "longitude": -119.93305556, "elevation": 2316}, {"latitude": 38.63527778, "longitude": -120.10361111, "elevation": 2507}, {"latitude": 38.68833333, "longitude": -120.15333333, "elevation": 2020}, {"latitude": 40.96805556, "longitude": -122.43388889, "elevation": 399}, {"latitude": 41.23361111, "longitude": -122.28333333, "elevation": 869}, {"latitude": 38.54833333, "longitude": -120.14916667, "elevation": 2052}, {"latitude": 38.69166667, "longitude": -120.15972222, "elevation": 2000}, {"latitude": 38.5275, "longitude": -120.10611111, "elevation": 2554}, {"latitude": 38.78666667, "longitude": -120.19333333, "elevation": 1663}, {"latitude": 38.43333333, "long

In [146]:
r.text
y = json.loads(r.text)
for item in y['results']:
  print(item['elevation'])

904
1892
1053
2365
2316
2507
2020
399
869
2052
2000
2554
1663
724
1990
1092
1872
1526
1580
924
490
328
842
1818
1850
379
2321
509
1213
2655
824
2467
2288
1003
98
388
803
2244
2610
2570
2078
1880
1219
1879
127
773
1963
768
611
718
406
910
784
784
797
2166
786
783
2005
792
888
2294
2733
811
2937
374
1950
468
2279
2290
2321
1736
2394
698
456
1169
874
410
1012
1571
848
492
562
1511
1176
401
661
1765
405
1752
2599
2546
2049
1573
1740
764
2402
815
1897
2460
2125
1007
2239
443
638
815
798
805
2614
2278
2117
2480
2361
2073
2174
1667
2420
2343
2196
2390
2565
2788
1990
1536
1993
2271
1949
2176
2213
1475
1343
76
2188
1096
334
1063
2162
704
2686
707
529
2123
409
431
343
1740
959
1152
992
1724
1164
659
1038
1180
1112
1602
291
670
493
473
758
579
410
979
505
1194
1056
1443
1136
707
623
697
924
1475
1010
466
1355
1029
485
2040
1202
2523
1042
1463
1068
1092
1546
1575
757
2472
1036
2781
1018
2184
823
2791
385
2350
2464
1900
1905
685
1402
1085
2189
2221
1602
230
231
178
1150
1331
572
273
877
561
537
157
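
One way to carry these values back into the dataset is sketched below. It assumes the API returns one result per submitted location, in the same order, which matches the truncated response shown above. The helper name `attach_elevation` and the two-row sample are illustrative. The metre-to-foot conversion is an inference from the `Elevation` values printed further down (904 m corresponds to about 2965.88 ft), not something the API does itself.

```python
import json
import pandas as pd

METERS_TO_FEET = 3.28084

def attach_elevation(df, response_text, to_feet=True):
    """Return a copy of df with an 'Elevation' column parsed from the response.

    Hypothetical helper: assumes one result per row of df, in row order.
    """
    results = json.loads(response_text)["results"]
    if len(results) != len(df):
        raise ValueError("response length does not match number of rows")
    elevations = pd.Series(
        [item["elevation"] for item in results], index=df.index, dtype=float
    )
    out = df.copy()
    out["Elevation"] = elevations * METERS_TO_FEET if to_feet else elevations
    return out

# First two results from the response above, used as a stand-in for r.text.
sample_text = (
    '{"results": ['
    '{"latitude": 40.03694444, "longitude": -121.00583333, "elevation": 904}, '
    '{"latitude": 38.93305556, "longitude": -120.40444444, "elevation": 1892}]}'
)
demo = pd.DataFrame({
    "Latitude": [40.03694444, 38.93305556],
    "Longitude": [-121.00583333, -120.40444444],
})
with_elevation = attach_elevation(demo, sample_text)
print(with_elevation["Elevation"].round(2).tolist())
```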

In [87]:
import pandas as pd
import requests

# Return the elevation for a single (lat, long) pair using Open-Elevation data,
# which in turn is based on SRTM.
def get_elevation(lat, long):
    query = ('https://api.open-elevation.com/api/v1/lookup'
             f'?locations={lat},{long}')
    # The query-string form of the lookup endpoint is a GET request.
    r = requests.get(query, timeout=30).json()
    # One approach is to use pandas' JSON functionality to extract the value:
    elevation = pd.json_normalize(r, 'results')['elevation'].values[0]
    return elevation

In [16]:
print(wildfire_set1['Elevation'].value_counts())

                      114551
2965.87936                 1
6207.34928                 1
3454.7245199999998         1
7759.1866                  1
7598.42544                 1
8225.06588                 1
6627.2968                  1
Name: Elevation, dtype: int64


In [None]:
wildfire_set2.head(5)

In [57]:
print(wildfire_set2.columns)

Index(['Unnamed: 0', 'Year', 'Name', 'AcresBurned', 'Fire Size Rank', 'Cause',
       'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn', 'Discovery Month',
       'Discovered DOY', 'Contained Month', 'Contained DOY', 'Latitude',
       'Longitude', 'County', 'CountyIds', 'State', 'OWNER_DESCR',
       'NOAA Station', 'Link', 'AWND', 'CLDD', 'DP10', 'DX90', 'PRCP', 'TAVG',
       'TMAX', 'TMIN', 'PRCP_6M', 'PRCP_RS', 'DX90_2M', 'DP10_2M',
       'Receives Snow', 'Snow Station', 'River Basin', 'Mar_SP', 'Mar_WC',
       'Mar_Dens'],
      dtype='object')


In [None]:
print(wildfire_set1.columns)

Drop all columns that are not relevant to our classification task:
Unnamed: 0, Year, Name, AcresBurned (dropped from wildfire_set2 only), Cause, SOURCE_REPORTING_UNIT_NAME, DaysBurn, Discovered DOY, Contained Month, Contained DOY, Latitude, Longitude, CountyIds, State, OWNER_DESCR, NOAA Station, Link, Snow Station, River Basin

In [None]:
wildfire_set1.drop(columns = ['Unnamed: 0', 'Year', 'Name', 'Cause',
                      'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn', 'Discovered DOY', 'Contained Month',
                      'Contained DOY','Latitude','Longitude','CountyIds','State','OWNER_DESCR',
                      'NOAA Station', 'Link', 'Snow Station', 'River Basin'], inplace = True)

In [None]:
wildfire_set2.drop(columns = ['Unnamed: 0', 'Year', 'Name', 'AcresBurned', 'Cause',
                      'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn', 'Discovered DOY', 'Contained Month',
                      'Contained DOY','Latitude','Longitude','CountyIds','State','OWNER_DESCR',
                      'NOAA Station', 'Link', 'Snow Station', 'River Basin'], inplace = True)

Take a look at both datasets now that we have dropped the irrelevant columns:

In [None]:
wildfire_set1.head(5)

In [None]:
wildfire_set1.shape

In [None]:
wildfire_set2.head(5)

In [None]:
wildfire_set2.shape

In [None]:
print(wildfire_set1.isnull().sum())

In [None]:
wildfire_set1 = wildfire_set1.dropna()

In [None]:
wildfire_set1.shape

In [None]:
print(wildfire_set1['Fire Size Rank'].value_counts())

In [None]:
print(wildfire_set2.isnull().sum())

In [None]:
wildfire_set2 = wildfire_set2.dropna()

In [None]:
wildfire_set2.shape

In [None]:
print(wildfire_set2['Fire Size Rank'].value_counts())

Double-check our datatypes before we proceed with preprocessing and model building. As we can see, all features besides County and Fire Size Rank are numerical (of type float), so all we have to do is one-hot encode these two categorical features.

In [None]:
wildfire_set1.info()

In [None]:
wildfire_set2.info()

In [None]:
wildfire_set1.describe()

In [None]:
wildfire_set2.describe()

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

Address Null Values and Data Smoothing

One-hot encode categorical features (e.g. Month discovered)
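
A minimal sketch of this step using `pandas.get_dummies`, assuming the `Discovery Month` column is the categorical predictor. The four-row frame reuses values from the table shown earlier and stands in for the real datasets. Leaving `Fire Size Rank` as string labels is a choice made for this sketch: scikit-learn classifiers accept string class labels directly, so only predictors are encoded here.

```python
import pandas as pd

# Stand-in rows; the real notebook would use wildfire_set1 / wildfire_set2.
demo = pd.DataFrame({
    "Discovery Month": ["Jul", "Jun", "Aug", "Aug"],
    "PRCP_6M": [14.11, 4.68, 2.52, 3.41],
    "Fire Size Rank": ["G", "G", "G", "G"],
})

# One indicator column per month; the original column is replaced.
encoded = pd.get_dummies(demo, columns=["Discovery Month"], dtype=int)
print(encoded.columns.tolist())
```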

Bar Charts and Class Definition

Split the data into train-test sets

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
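
A possible shape for the split, sketched on synthetic data because the real frames are not reproduced here. The feature names are borrowed from the dataset, but the values, the class proportions, and the 80/20 ratio are assumptions. Stratifying on the target keeps the rarer size classes represented in both partitions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features and the Fire Size Rank target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=["PRCP", "TAVG", "DX90", "Mar_SP"])
y = pd.Series(rng.choice(["B", "C", "D"], size=200, p=[0.6, 0.3, 0.1]),
              name="Fire Size Rank")

# stratify=y preserves the class proportions in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```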

Normalize the data with MinMaxScaler after splitting into train and test sets
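
The reason for scaling after the split is that the scaler should learn its minimum and maximum from the training data only. A small sketch with made-up numbers (the arrays are illustrative, not wildfire data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min/max from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train min/max

print(X_test_scaled)
```

Test values outside the training range fall outside [0, 1] after scaling, which is expected and shows that no information from the test set was used.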

Feature Selection

In [None]:
from sklearn.feature_selection import RFE, SelectKBest
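
As a rough illustration of univariate selection, the sketch below runs `SelectKBest` with the ANOVA F-test (`f_classif`) on synthetic data in which only one column depends on the class. The data, the choice of `f_classif`, and `k=1` are assumptions for the example; `RFE` would be an alternative wrapper-style approach.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)                      # three balanced classes
informative = y + rng.normal(scale=0.1, size=150)  # tracks the class closely
noise = rng.normal(size=(150, 3))                  # unrelated to the class
X = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X, y)

# Indices of the columns that were kept.
print(selector.get_support(indices=True))
```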

Cross-Validation
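
A sketch of stratified k-fold cross-validation, again on synthetic data. Wrapping the scaler and the classifier in a pipeline means the scaler is refit inside each fold, so the held-out fold never influences the scaling. The dataset, the KNN model, and the five folds are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class problem standing in for the wildfire data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```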

Model Building:

SVM Classifier

In [None]:
from sklearn import svm
from sklearn.svm import SVC

KNN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB

Decision Tree Classifier

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
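
The scikit-learn classifiers imported above can share a single fit-and-score loop. The sketch below uses synthetic data and mostly default hyperparameters, so the printed accuracies say nothing about how these models would perform on the wildfire datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and Fire Size Rank labels.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and record held-out accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f}")
```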

Neural Network Multiclass classifier (TensorFlow)
- Experiment by changing number of hidden layers and activation functions (sigmoid, relu, softmax)
- Change number of epochs and add more hidden layers
- Size of input = number of features in the dataset
- Size of output = number of classes in the multiclass classification problem

In [None]:
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, BatchNormalization, Dropout
from tensorflow.keras import optimizers
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import seaborn as sns
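
A minimal Keras sketch of the architecture described in the notes above: the input width equals the number of features and the softmax output width equals the number of classes. The layer sizes, `n_features = 10`, `n_classes = 4`, the epoch count, and the random training data are all placeholders, and the labels are assumed to be integer-encoded (hence `sparse_categorical_crossentropy`).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 10, 4  # placeholder sizes, not taken from the datasets

model = keras.Sequential([
    keras.Input(shape=(n_features,)),                # one input per feature
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),   # one output per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random stand-in data with integer class labels 0..n_classes-1.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, n_features)).astype("float32")
y = rng.integers(0, n_classes, size=64)

model.fit(X, y, epochs=2, batch_size=16, verbose=0)
probs = model.predict(X[:3], verbose=0)
print(probs.shape)
```

Each row of `probs` is a probability distribution over the classes; `probs.argmax(axis=1)` gives the predicted class index.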

Voting Classifier

In [None]:
from sklearn.ensemble import VotingClassifier
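
One way the voting ensemble could be assembled is sketched below with three of the base models and soft voting, which averages predicted class probabilities. The choice of base estimators, the synthetic data, and the hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem standing in for the wildfire data.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predict_proba outputs across the estimators
)
voter.fit(X_train, y_train)

predictions = voter.predict(X_test)
ensemble_accuracy = voter.score(X_test, y_test)
print(f"{ensemble_accuracy:.3f}")
```

Soft voting requires every base estimator to implement `predict_proba`; an `SVC` would need `probability=True` to take part.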