<a href="https://colab.research.google.com/github/mnocerino23/Wildfire-Forecaster/blob/main/Regression_LargeDataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Because the classification models struggled to classify the fires accurately, we will instead build regression models that predict the AcresBurned feature and see whether they perform better.

We will apply the following steps:

1.   One-Hot Encoding of Categorical Variables
2.   Splitting the Training and Testing data
3.   Normalizing Data
4.   Feature Selection
5.   Regressions




In [60]:
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

#Read in the smaller of the final datasets. The dataset contains around 1200 fires from 2016-2019
wildfire_set2 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/wildfire_set2_w_allfeatures.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [61]:
wildfire_set2.head()

Unnamed: 0.1,Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,...,DX90_2M,DP10_2M,Receives Snow,Snow Station,River Basin,Mar_SP,Mar_WC,Mar_Dens,Has_Elevation,Elevation
0,0,2016,Soberanes Fire,132127.0,G,,,83.0,Jul,,...,0.0,1.0,0,,,0.0,0.0,0.0,1,961.2744
1,1,2016,Erskine Fire,48019.0,G,,,18.0,Jun,,...,15.0,4.0,1,mineral_king,Kaweah,36.0,16.0,0.44,1,3389.0664
2,2,2016,Chimney Fire,46344.0,G,,,24.0,Aug,,...,43.0,0.0,0,,,0.0,0.0,0.0,1,1049.856
3,3,2016,Blue Cut Fire,36274.0,G,,,7.0,Aug,,...,43.0,0.0,0,,,0.0,0.0,0.0,1,4192.8624
4,4,2016,Gap Fire,33867.0,G,,,1.0,Aug,,...,0.0,2.0,1,parks_creek,Shasta,77.0,34.0,0.44,1,3244.7112


In [62]:
print(wildfire_set2.columns)

Index(['Unnamed: 0', 'Year', 'Name', 'AcresBurned', 'Fire Size Rank', 'Cause',
       'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn', 'Discovery Month',
       'Discovered DOY', 'Contained Month', 'Contained DOY', 'Latitude',
       'Longitude', 'County', 'CountyIds', 'State', 'OWNER_DESCR',
       'NOAA Station', 'Link', 'AWND', 'CLDD', 'DP10', 'DX90', 'PRCP', 'TAVG',
       'TMAX', 'TMIN', 'PRCP_6M', 'PRCP_RS', 'DX90_2M', 'DP10_2M',
       'Receives Snow', 'Snow Station', 'River Basin', 'Mar_SP', 'Mar_WC',
       'Mar_Dens', 'Has_Elevation', 'Elevation'],
      dtype='object')


Drop all columns that are not relevant to our regression task: Unnamed: 0, Year, Name, Cause, Fire Size Rank, SOURCE_REPORTING_UNIT_NAME, DaysBurn, Contained Month, Discovered DOY, Contained DOY, Latitude, Longitude, County, CountyIds, State, OWNER_DESCR, NOAA Station, Link, Snow Station, River Basin, Has_Elevation

In [63]:
wildfire_set2.drop(columns = ['Unnamed: 0', 'Year', 'Name', 'Cause', 'Fire Size Rank',
                      'SOURCE_REPORTING_UNIT_NAME', 'DaysBurn','Contained Month', 'Discovered DOY',
                      'Contained DOY','Latitude','Longitude','County','CountyIds','State','OWNER_DESCR',
                      'NOAA Station', 'Link', 'Snow Station', 'River Basin','Has_Elevation'], inplace = True)

Take a look at the dataset now that we have dropped the irrelevant columns

In [64]:
wildfire_set2.head(5)

Unnamed: 0,AcresBurned,Discovery Month,AWND,CLDD,DP10,DX90,PRCP,TAVG,TMAX,TMIN,PRCP_6M,PRCP_RS,DX90_2M,DP10_2M,Receives Snow,Mar_SP,Mar_WC,Mar_Dens,Elevation
0,132127.0,Jul,6.5,0.0,0.0,0.0,0.0,58.8,65.2,52.4,14.11,21.42,0.0,1.0,0,0.0,0.0,0.0,961.2744
1,48019.0,Jun,6.7,529.0,0.0,22.0,0.0,82.6,96.6,68.6,4.68,4.88,15.0,4.0,1,36.0,16.0,0.44,3389.0664
2,46344.0,Aug,6.9,237.0,0.0,23.0,0.0,72.6,92.6,52.6,2.52,8.09,43.0,0.0,0,0.0,0.0,0.0,1049.856
3,36274.0,Aug,6.5,455.0,0.0,28.0,0.0,79.7,94.6,64.7,3.41,6.45,43.0,0.0,0,0.0,0.0,0.0,4192.8624
4,33867.0,Aug,4.5,0.0,0.0,0.0,0.02,56.4,62.9,49.9,18.03,54.17,0.0,2.0,1,77.0,34.0,0.44,3244.7112


In [65]:
wildfire_set2.shape

(1156, 19)

# Investigate the presence of null values in the dataset:
Below is the count of null values in each column of wildfire_set2.

In [66]:
print(wildfire_set2.isnull().sum())

AcresBurned         3
Discovery Month     0
AWND               37
CLDD               37
DP10               32
DX90               34
PRCP               32
TAVG               37
TMAX               34
TMIN               37
PRCP_6M             7
PRCP_RS            10
DX90_2M             2
DP10_2M             1
Receives Snow       0
Mar_SP              0
Mar_WC              0
Mar_Dens            0
Elevation           0
dtype: int64


In [67]:
wildfire_set2 = wildfire_set2.dropna()

In the cell above, we drop every row that contains a null value. This is acceptable here because each column has relatively few nulls compared with the size of the dataset, so little data is lost. The shape of the dataframe after the drop confirms this: 1105 of the original 1156 rows remain, so 51 rows (about 4.4%) were removed.

In [68]:
wildfire_set2.shape

(1105, 19)
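
The effect of `dropna()` on the row count can be checked directly. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions and are not taken from the wildfire data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for wildfire_set2
toy = pd.DataFrame({
    "AcresBurned": [100.0, np.nan, 50.0, 75.0],
    "AWND": [6.5, 7.0, np.nan, 5.5],
    "TMAX": [90.0, 85.0, np.nan, 88.0],
})

rows_before = len(toy)
# Total nulls summed over columns (a row with two nulls is counted twice)
nulls_per_column_total = int(toy.isnull().sum().sum())
# Rows that contain at least one null in any column
rows_with_null = int(toy.isnull().any(axis=1).sum())

toy_clean = toy.dropna()
rows_after = len(toy_clean)

print(f"dropped {rows_before - rows_after} of {rows_before} rows "
      f"({(rows_before - rows_after) / rows_before:.1%})")
```

Because several nulls can fall in the same row, the number of rows dropped can be much smaller than the sum of the per-column null counts.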

# For our regressions, we will try to predict the number of Acres Burned in the fire

In [69]:
print(wildfire_set2.columns)

Index(['AcresBurned', 'Discovery Month', 'AWND', 'CLDD', 'DP10', 'DX90',
       'PRCP', 'TAVG', 'TMAX', 'TMIN', 'PRCP_6M', 'PRCP_RS', 'DX90_2M',
       'DP10_2M', 'Receives Snow', 'Mar_SP', 'Mar_WC', 'Mar_Dens',
       'Elevation'],
      dtype='object')


Double-check the datatypes before we proceed with preprocessing and model building. As we can see, every feature except Discovery Month is numerical (of type float or int), so all we have to do is one-hot encode the discovery month of the fire.

In [70]:
wildfire_set2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1105 entries, 0 to 1152
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   AcresBurned      1105 non-null   float64
 1   Discovery Month  1105 non-null   object 
 2   AWND             1105 non-null   float64
 3   CLDD             1105 non-null   float64
 4   DP10             1105 non-null   float64
 5   DX90             1105 non-null   float64
 6   PRCP             1105 non-null   float64
 7   TAVG             1105 non-null   float64
 8   TMAX             1105 non-null   float64
 9   TMIN             1105 non-null   float64
 10  PRCP_6M          1105 non-null   float64
 11  PRCP_RS          1105 non-null   float64
 12  DX90_2M          1105 non-null   float64
 13  DP10_2M          1105 non-null   float64
 14  Receives Snow    1105 non-null   int64  
 15  Mar_SP           1105 non-null   float64
 16  Mar_WC           1105 non-null   float64
 17  Mar_Dens      

# Below we use the describe function to get a general outlook on mean, max, min, and percentiles for each of the numerical features

In [71]:
wildfire_set2.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
AcresBurned,1105.0,5205.01448,31284.394585,0.0,32.0,87.0,370.0,410203.0
AWND,1105.0,6.550679,1.735019,1.3,5.4,6.5,7.6,15.0
CLDD,1105.0,279.669683,216.201422,0.0,77.0,272.0,432.0,1005.0
DP10,1105.0,0.509502,1.296648,0.0,0.0,0.0,0.0,12.0
DX90,1105.0,14.961991,11.36955,0.0,3.0,16.0,26.0,31.0
PRCP,1105.0,0.231213,0.80656,0.0,0.0,0.0,0.1,10.46
TAVG,1105.0,72.610679,9.188062,34.0,66.0,73.8,79.3,97.4
TMAX,1105.0,87.442172,11.012137,49.3,80.1,89.6,95.7,111.6
TMIN,1105.0,57.779186,8.497258,18.7,52.7,57.9,63.4,85.0
PRCP_6M,1105.0,11.244308,10.023728,0.0,4.2,8.84,14.58,67.97


# One-Hot Encoding Categorical Variables


*   Encode the discovery month (the only categorical variable) in the dataset



In [72]:
#One-hot encode the discovery month and join the dummy columns to the dataset on the row index
dummy_month = pd.get_dummies(wildfire_set2['Discovery Month'])
wildfire_set2 = pd.merge(left = wildfire_set2, right = dummy_month, left_index = True, right_index = True)
wildfire_set2 = wildfire_set2.drop(columns = ['Discovery Month'])
wildfire_set2.head(10)

Unnamed: 0,AcresBurned,AWND,CLDD,DP10,DX90,PRCP,TAVG,TMAX,TMIN,PRCP_6M,...,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep
0,132127.0,6.5,0.0,0.0,0.0,0.0,58.8,65.2,52.4,14.11,...,0,0,0,1,0,0,0,0,0,0
1,48019.0,6.7,529.0,0.0,22.0,0.0,82.6,96.6,68.6,4.68,...,0,0,0,0,1,0,0,0,0,0
2,46344.0,6.9,237.0,0.0,23.0,0.0,72.6,92.6,52.6,2.52,...,0,0,0,0,0,0,0,0,0,0
3,36274.0,6.5,455.0,0.0,28.0,0.0,79.7,94.6,64.7,3.41,...,0,0,0,0,0,0,0,0,0,0
4,33867.0,4.5,0.0,0.0,0.0,0.02,56.4,62.9,49.9,18.03,...,0,0,0,0,0,0,0,0,0,0
5,29322.0,6.0,629.0,0.0,31.0,0.0,85.3,99.2,71.4,2.15,...,0,0,0,0,0,0,0,0,0,0
6,12518.0,6.3,53.0,0.0,3.0,0.0,64.1,76.2,52.0,3.84,...,0,0,0,0,0,0,0,0,0,1
7,8110.0,6.5,455.0,0.0,28.0,0.0,79.7,94.6,64.7,3.41,...,0,0,0,0,0,0,0,0,0,0
8,7609.0,8.3,216.0,0.0,18.0,0.0,71.2,91.5,50.9,7.63,...,0,0,0,0,1,0,0,0,0,0
9,7474.0,5.6,19.0,0.0,0.0,0.01,63.1,71.0,55.3,9.62,...,0,0,0,0,1,0,0,0,0,0
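
An equivalent, more compact way to one-hot encode a single column is to pass `columns=` to `pd.get_dummies`, which replaces the column in one call. The sketch below uses a made-up frame (`toy` is hypothetical). The empty `prefix` and `prefix_sep` are chosen here so the dummy columns are named after the bare month, and `dtype=int` is set so the dummies are 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame with the same two kinds of columns as the notebook
toy = pd.DataFrame({
    "AcresBurned": [132127.0, 48019.0, 46344.0],
    "Discovery Month": ["Jul", "Jun", "Aug"],
})

# Replace 'Discovery Month' with one indicator column per month
encoded = pd.get_dummies(
    toy,
    columns=["Discovery Month"],
    prefix="",
    prefix_sep="",
    dtype=int,
)

print(encoded.columns.tolist())
```

This avoids the separate merge and drop steps, at the cost of hiding the intermediate dummy frame.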


In [73]:
#Randomly shuffle the dataframe to remove any pre-existing ordering by fire size
wildfire_set2 = wildfire_set2.sample(frac=1).reset_index(drop=True)

In [74]:
#Write the clean, encoded, and randomly shuffled data to csv before continuing
#index=False keeps the row index from being saved as an extra 'Unnamed: 0' column
wildfire_set2.to_csv('wildfire2_clean.csv', index=False)

Split the data into train-test sets

In [75]:
#We will train our models using the more recent dataset

In [76]:
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler

In [77]:
#Regression metrics (the classification metrics do not apply to a continuous target)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

Use an 80-20 train-test split on our large dataset

In [78]:
train_data, test_data = train_test_split(wildfire_set2, test_size = 0.2)

Print the shape of training and testing datasets after the split to make sure we have done this correctly

In [79]:
print(train_data.shape)
print(test_data.shape)

(884, 30)
(221, 30)
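
The split can be made repeatable, and the target separated from the features at the same time, by passing both to `train_test_split` with a fixed seed. This is a sketch under assumptions: `toy` is a made-up stand-in for the encoded wildfire frame, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the encoded wildfire frame
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AcresBurned": rng.uniform(0, 1000, size=50),
    "AWND": rng.uniform(1, 15, size=50),
    "TMAX": rng.uniform(50, 110, size=50),
})

# Separate the target from the features before splitting
X = toy.drop(columns=["AcresBurned"])
y = toy["AcresBurned"]

# Any fixed integer for random_state makes the split identical on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With the features and target split together, the rows of `X_train` and `y_train` stay aligned, so they can be passed straight to a model's `fit` method.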


Now we normalize the training and testing sets

In [80]:
#Use min-max scaler normalization
scaler = MinMaxScaler()

#Fit the scaler on the training set only, then apply the same scaling to the test set
#so that no information from the test set leaks into preprocessing
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
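
A scaler is normally fitted on the training rows and then reused, unchanged, on the test rows. The sketch below illustrates this with two small made-up arrays (`train` and `test` are hypothetical values, not wildfire data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy arrays: two features, three training rows, one test row
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns each column's min and max from the training rows
test_scaled = scaler.transform(test)        # reuses the training min and max

print(test_scaled)
```

A test value that lies outside the training range is mapped outside [0, 1], as the second test feature is here. That is expected behaviour and shows that the test set played no part in fitting the scaler.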

In [84]:
print('Train:')
print('train_data dimensions: ', train_data.shape)

print('Test:')
print('test_data dimensions: ', test_data.shape)

Train:
train_data dimensions:  (884, 30)
Test:
test_data dimensions:  (221, 30)


With the data min-max scaled after the train-test split, we move on to feature selection and regression

In [85]:
from sklearn import linear_model

In [89]:
from sklearn.feature_selection import SelectKBest, f_regression

In [90]:
best_features = SelectKBest(f_regression, k=10)

In [91]:
top_features_ordered = []

In [92]:
train_target = wildfire_set2['AcresBurned']
train_features = wildfire_set2.drop(columns = ['AcresBurned'])

In [93]:
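#Grow k one feature at a time and record each feature the first time it is selected,
#which orders the features from highest to lowest f_regression score.
#k starts at 1 because k=0 selects no features.
for i in range(1, 29):
  best_features = SelectKBest(f_regression, k=i)
  best_features.fit(train_features, train_target)
  mask = best_features.get_support()
  new_features = train_features.columns[mask]
  for item in new_features:
    if item not in top_features_ordered:
      top_features_ordered.append(item)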
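
The same ranking can be obtained from a single fit, because `SelectKBest` stores one score per feature in its `scores_` attribute. The following is a sketch on synthetic data; `toy_features`, `toy_target`, and the coefficients are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical synthetic data: the target depends mostly on the 'strong' column
rng = np.random.default_rng(0)
n = 200
toy_features = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_target = (
    5.0 * toy_features["strong"]
    + 2.0 * toy_features["weak"]
    + rng.normal(scale=0.1, size=n)
)

# k="all" keeps every feature; we only want the per-feature F scores
selector = SelectKBest(f_regression, k="all").fit(toy_features, toy_target)

# Sort the features from highest to lowest score
ranking = pd.Series(selector.scores_, index=toy_features.columns).sort_values(ascending=False)
print(ranking.index.tolist())
```

Note that in the notebook the selector is fitted on the full `wildfire_set2` frame. Fitting it on the training split only would keep the test rows out of feature selection.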



In [94]:
#Print the features in ranked order, numbering from 1
for count, feature in enumerate(top_features_ordered, start=1):
  print(count, ': ', feature)

1 :  Dec
2 :  Jul
3 :  AWND
4 :  DP10_2M
5 :  Jun
6 :  May
7 :  Nov
8 :  Receives Snow
9 :  Sep
10 :  TMIN
11 :  Mar_Dens
12 :  Mar_WC
13 :  Mar_SP
14 :  DX90
15 :  PRCP_RS
16 :  TMAX
17 :  Aug
18 :  Apr
19 :  DX90_2M
20 :  Oct
21 :  DP10
22 :  Feb
23 :  PRCP_6M
24 :  Mar
25 :  CLDD
26 :  Jan
27 :  TAVG
28 :  PRCP


In [95]:
multivariate_regression = linear_model.LinearRegression()
#fit requires the feature matrix and the target vector
multivariate_regression.fit(train_features, train_target)
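
`LinearRegression.fit` requires a feature matrix and a target vector, and a fitted model is then scored on held-out rows with regression metrics. The sketch below shows that workflow on synthetic data. The arrays, the coefficients, and the choice of R^2 and mean absolute error are illustrative assumptions, not results from the wildfire dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a linear target with a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows, then predict the held-out rows
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Score the predictions against the true held-out targets
r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
print(f"R^2: {r2:.3f}, MAE: {mae:.3f}")
```

Evaluating on rows the model never saw during fitting gives a more honest picture of how it would perform on new fires than scoring on the training rows.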