<a href="https://colab.research.google.com/github/mnocerino23/Wildfire-Forecaster/blob/main/NOAA_weatherdata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I requested monthly weather data for 2001-2019 from weather stations across California. In this notebook, I engineer additional weather features, map each station to its coordinates, and then find the closest station to each fire based on those coordinates.

In [77]:
#Read in the csv file from my google drive
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
weather = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/NOAA_California.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Weather Features:


1.   Station - Weather Station in California
2.   Date - Year and month of the report
3.   AWND - Average Monthly Wind Speed
4.   CDSD - Cooling Degree Days Season to Date
5.   CLDD - Cooling Degree Days
6.   DP10 - Days with over 0.10 inches of precipitation
7.   DX90 - Days with temperature over 90
8.   DYTS - Number of days with thunderstorms
9.   EMXP - Extreme max precipitation
10.  EMXT - Extreme max temperature
11.  PRCP - Monthly Precipitation
12.  SNOW - Total Monthly Snowfall
13.  TAVG - Average Temperature
14.  TMAX - Max Temperature
15.  TMIN - Min Temperature






In [78]:
weather.shape

(19239, 15)

In [79]:
weather.head(8)

Unnamed: 0,STATION,DATE,AWND,CDSD,CLDD,DP10,DX90,DYTS,EMXP,EMXT,PRCP,SNOW,TAVG,TMAX,TMIN
0,USW00023129,2000-08,5.6,953.0,323.0,0.0,6.0,,0.0,97.0,0.0,,75.4,84.3,66.5
1,USW00023129,2000-09,5.1,1175.0,221.0,0.0,5.0,,0.0,96.0,0.0,,72.4,82.5,62.3
2,USW00023129,2000-10,4.7,1208.0,33.0,4.0,0.0,1.0,1.81,79.0,2.3,,64.7,71.9,57.5
3,USW00023129,2000-11,3.6,1208.0,0.0,0.0,0.0,1.0,0.0,80.0,0.0,,56.9,68.1,45.8
4,USW00023129,2000-12,3.1,1209.0,0.0,0.0,0.0,,0.0,80.0,0.0,,57.1,68.3,46.0
5,USW00023129,2001-01,3.6,0.0,0.0,5.0,0.0,1.0,0.71,82.0,2.11,,52.4,63.1,41.7
6,USW00023129,2001-02,5.1,1.0,1.0,8.0,0.0,2.0,1.93,87.0,5.79,,52.5,61.5,43.4
7,USW00023129,2001-03,5.1,7.0,6.0,1.0,0.0,1.0,0.25,85.0,0.26,,59.0,66.3,51.8


Read in a CSV file that lists the California NOAA weather stations with their coordinates, then split the combined coordinates field into latitude and longitude.

In [80]:
stations = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/NOAA_Stations.csv')

In [81]:
stations.rename({'Unnamed: 3': 'Longitude', 'Coordinates':'Latitude'}, axis = 1, inplace = True)
for index, row in stations.iterrows():
  s = stations.at[index,'Latitude'].split(' ')
  stations.at[index,'Longitude'] = s[1]
  stations.at[index,'Latitude'] = s[0]
stations.head(10)

Unnamed: 0.1,Unnamed: 0,Station,Latitude,Longitude
0,0,USW00023129,33.8117,-118.1464
1,1,USW00093111,34.1167,-119.1167
2,2,USW00093112,32.7,-117.2
3,3,USW00093115,32.5667,-117.1167
4,4,USW00093116,33.25,-119.45
5,5,USW00003167,33.9228,-118.3342
6,6,USW00023293,37.3592,-121.9239
7,7,USW00023174,33.9381,-118.3889
8,8,USW00023130,34.2097,-118.4892
9,9,USW00093241,38.3775,-121.9575


In [82]:
#Dictionaries mapping the station to its latitude and longitude
lat = dict(zip(stations['Station'],stations['Latitude']))
long = dict(zip(stations['Station'],stations['Longitude']))

In [83]:
weather['Latitude'] = ''
weather['Longitude'] = ''
for index, row in weather.iterrows():
  weather.at[index, 'Latitude'] = lat[weather.at[index,'STATION']] 
  weather.at[index, 'Longitude'] = long[weather.at[index,'STATION']] 
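
The row-by-row lookup above can also be written with `Series.map`, which applies the station-to-coordinate dictionaries to the whole `STATION` column at once. This is a minimal sketch on toy data: the coordinates are copied from the stations table above, while the three-row `weather_demo` frame and the `*_demo` names are hypothetical.

```python
import pandas as pd

# Station coordinates copied from the first rows of the stations table above
stations_demo = pd.DataFrame({
    'Station': ['USW00023129', 'USW00093111'],
    'Latitude': ['33.8117', '34.1167'],
    'Longitude': ['-118.1464', '-119.1167'],
})
# Hypothetical weather rows; only the STATION column matters here
weather_demo = pd.DataFrame({'STATION': ['USW00093111', 'USW00023129', 'USW00093111']})

lat_demo = dict(zip(stations_demo['Station'], stations_demo['Latitude']))
long_demo = dict(zip(stations_demo['Station'], stations_demo['Longitude']))

# map() looks up every station ID in the dictionary in one pass;
# astype(float) converts the coordinate strings to numbers
weather_demo['Latitude'] = weather_demo['STATION'].map(lat_demo).astype(float)
weather_demo['Longitude'] = weather_demo['STATION'].map(long_demo).astype(float)
print(weather_demo)
```

A station ID missing from the dictionary would produce NaN here, whereas the dictionary lookup in the loop would raise a `KeyError`.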

In [84]:
weather.head()

Unnamed: 0,STATION,DATE,AWND,CDSD,CLDD,DP10,DX90,DYTS,EMXP,EMXT,PRCP,SNOW,TAVG,TMAX,TMIN,Latitude,Longitude
0,USW00023129,2000-08,5.6,953.0,323.0,0.0,6.0,,0.0,97.0,0.0,,75.4,84.3,66.5,33.8117,-118.1464
1,USW00023129,2000-09,5.1,1175.0,221.0,0.0,5.0,,0.0,96.0,0.0,,72.4,82.5,62.3,33.8117,-118.1464
2,USW00023129,2000-10,4.7,1208.0,33.0,4.0,0.0,1.0,1.81,79.0,2.3,,64.7,71.9,57.5,33.8117,-118.1464
3,USW00023129,2000-11,3.6,1208.0,0.0,0.0,0.0,1.0,0.0,80.0,0.0,,56.9,68.1,45.8,33.8117,-118.1464
4,USW00023129,2000-12,3.1,1209.0,0.0,0.0,0.0,,0.0,80.0,0.0,,57.1,68.3,46.0,33.8117,-118.1464


Use the haversine package to calculate the great-circle distance between two coordinate pairs, which lets us find the closest weather station to each fire.

In [85]:
!pip install haversine
import haversine as hs
#each argument is a (lat, long) tuple; the distance is returned in kilometers by default
h = hs.haversine((33.8117, -118.1464), (33.3000,-117.35))
print(h)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
93.18494458973653
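
To make the distance calculation concrete, here is a small sketch of the haversine formula in plain Python. The function name `haversine_km` and the constant are my own, and the mean Earth radius of 6371.0088 km is an assumption, so the result should land close to, but not necessarily exactly on, the value printed above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    # Haversine of the central angle between the two points
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Same pair of points as in the cell above
d = haversine_km((33.8117, -118.1464), (33.3000, -117.35))
print(round(d, 2))
```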


In [86]:
wildfires1 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/CA_Wildfires_2001_2015.csv')

In the code below, I find the closest NOAA weather station to every fire and add it as a feature in both the 2001-2015 dataset and the smaller 2016-2019 dataset.

In [87]:
wildfires1['NOAA Station'] = ''
for index1, row in wildfires1.iterrows():
  #initialize to a super large number so that condition of being less than this will be met right away
  min_dist = 100000000
  w = []
  w.append(float(wildfires1.at[index1, 'Latitude']))
  w.append(float(wildfires1.at[index1, 'Longitude']))
  wildfire_coordinates = tuple(w)

  for index2, row in stations.iterrows():
    station = stations.at[index2, 'Station']
    s = []
    s.append(float(stations.at[index2, 'Latitude']))
    s.append(float(stations.at[index2, 'Longitude']))
    station_coordinates = tuple(s)

    dist = hs.haversine(station_coordinates, wildfire_coordinates)
    if dist < min_dist:
      min_dist = dist
      min_station = station
  wildfires1.at[index1, 'NOAA Station'] = min_station

The second dataset had a few occurrences of invalid coordinates, so I write a function to check whether a fire's coordinates are valid and drop the rows where they are not.

In [88]:
wildfires2 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/CA_Wildfires_recent.csv')

In [89]:
def are_valid_coordinates(latitude, longitude):
  if latitude < -90 or latitude > 90:
    return False
  elif longitude < -180 or longitude > 180:
    return False
  else:
    return True

In [90]:
#Drop entries in the second dataset that were causing issues due to incorrect coordinates.

#Build a boolean mask rather than dropping rows while iterating over the dataframe
valid = wildfires2.apply(lambda row: are_valid_coordinates(row['Latitude'], row['Longitude']), axis = 1)
wildfires2 = wildfires2.loc[valid].copy()

In [91]:
wildfires2['NOAA Station'] = ''
for index1, row in wildfires2.iterrows():
  #initialize to a super large number so that condition of being less than this will be met right away
  min_dist = 100000000
  w = []
  w.append(float(wildfires2.at[index1, 'Latitude']))
  w.append(float(wildfires2.at[index1, 'Longitude']))
  wildfire_coordinates = tuple(w)

  for index2, row in stations.iterrows():
    station = stations.at[index2, 'Station']
    s = []
    s.append(float(stations.at[index2, 'Latitude']))
    s.append(float(stations.at[index2, 'Longitude']))
    station_coordinates = tuple(s)

    dist = hs.haversine(station_coordinates, wildfire_coordinates)
    if dist < min_dist:
      min_dist = dist
      min_station = station
  wildfires2.at[index1, 'NOAA Station'] = min_station
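
The nested loops above compute one distance at a time. As a possible alternative, the sketch below builds the full fire-by-station distance matrix with NumPy broadcasting and takes the row-wise minimum. The function name `nearest_station`, the toy frames, and the Earth radius constant are assumptions for illustration; the station coordinates are copied from the stations table above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def nearest_station(fires, stations):
    """Return, for each fire, the ID of the station with the smallest haversine distance."""
    f_lat = np.radians(fires['Latitude'].astype(float).to_numpy())[:, None]
    f_lon = np.radians(fires['Longitude'].astype(float).to_numpy())[:, None]
    s_lat = np.radians(stations['Latitude'].astype(float).to_numpy())[None, :]
    s_lon = np.radians(stations['Longitude'].astype(float).to_numpy())[None, :]

    # Broadcasting builds an (n_fires, n_stations) distance matrix
    a = (np.sin((s_lat - f_lat) / 2) ** 2
         + np.cos(f_lat) * np.cos(s_lat) * np.sin((s_lon - f_lon) / 2) ** 2)
    dist = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

    # argmin picks the column (station) with the smallest distance in each row
    return stations['Station'].to_numpy()[dist.argmin(axis=1)]

stations_demo = pd.DataFrame({
    'Station': ['USW00023129', 'USW00023293'],
    'Latitude': ['33.8117', '37.3592'],
    'Longitude': ['-118.1464', '-121.9239'],
})
# Hypothetical fire locations
fires_demo = pd.DataFrame({'Latitude': [33.3, 37.0], 'Longitude': [-117.35, -122.0]})
fires_demo['NOAA Station'] = nearest_station(fires_demo, stations_demo)
print(fires_demo)
```

The distance matrix holds one float per fire-station pair, so for very large inputs it may be worth processing the fires in chunks.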

In [92]:
wildfires1.head(5)

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,Contained DOY,Latitude,Longitude,County,CountyIds,State,OWNER_DESCR,NOAA Station
0,2005,FOUNTAIN,0.1,A,Miscellaneous,Plumas National Forest,1.0,Feb,33.0,Feb,33.0,40.036944,-121.005833,Plumas,32.0,CA,USFS,USW00023225
1,2004,PIGEON,0.25,A,Lightning,Eldorado National Forest,1.0,May,133.0,May,133.0,38.933056,-120.404444,Placer,31.0,CA,USFS,USW00093230
2,2004,SLACK,0.1,A,Debris Burning,Eldorado National Forest,1.0,Jun,152.0,Jun,152.0,38.984167,-120.735556,El Dorado,9.0,CA,STATE OR PRIVATE,USW00023225
3,2004,DEER,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,185.0,38.559167,-119.913333,Alpine,2.0,CA,USFS,USW00093230
4,2004,STEVENOT,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,185.0,38.559167,-119.933056,Alpine,2.0,CA,USFS,USW00093230


In [93]:
wildfires2.head(5)

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,Contained DOY,Latitude,Longitude,County,CountyIds,State,OWNER_DESCR,NOAA Station
0,2016,Soberanes Fire,132127.0,G,,,83.0,Jul,,Oct,287.0,36.45994,-121.89938,Monterey,27,CA,,USW00023259
1,2016,Erskine Fire,48019.0,G,,,18.0,Jun,,Jul,193.0,35.6115,-118.45628,Kern,15,CA,,USW00023155
2,2016,Chimney Fire,46344.0,G,,,24.0,Aug,,Sep,250.0,35.70595,-120.98316,San Luis Obispo,40,CA,,USW00093209
3,2016,Blue Cut Fire,36274.0,G,,,7.0,Aug,,Aug,236.0,34.30372,-117.49342,San Bernardino,36,CA,,USW00003102
4,2016,Gap Fire,33867.0,G,,,1.0,Aug,,Aug,241.0,41.851,-123.118,Siskiyou,47,CA,,USW00024283


The datasets are good to go, with each fire having its closest weather station as a feature. One more issue to resolve quickly: drop fires where our engineered feature DaysBurn is less than 0 (some incorrect data went into this calculation).

In [94]:
w1 = wildfires1.loc[wildfires1['DaysBurn'] >= 0]
w2 = wildfires2.loc[wildfires2['DaysBurn'] >= 0]

In [95]:
#Save each dataset to a new CSV so that we don't have to rerun the expensive calculation to find the closest station to every fire
w1.to_csv('wildfires1_with_station.csv', index = False)
w2.to_csv('wildfires2_with_station.csv', index = False)

Each fire is now mapped to its closest weather station, which will ultimately let us add features describing the temperature and weather conditions around the time it burned. Next, we clean the NOAA weather data and engineer some additional weather-related features before linking the two wildfire datasets to it.

In [96]:
#Import the datetime class and convert DATE in the weather dataframe from a string into a pandas Timestamp
from datetime import datetime

print(type(weather.at[0,'DATE']))
weather['DATE'] = pd.to_datetime(weather['DATE'], format = '%Y-%m')
print(type(weather.at[0,'DATE']))

<class 'str'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [97]:
weather.head()

Unnamed: 0,STATION,DATE,AWND,CDSD,CLDD,DP10,DX90,DYTS,EMXP,EMXT,PRCP,SNOW,TAVG,TMAX,TMIN,Latitude,Longitude
0,USW00023129,2000-08-01,5.6,953.0,323.0,0.0,6.0,,0.0,97.0,0.0,,75.4,84.3,66.5,33.8117,-118.1464
1,USW00023129,2000-09-01,5.1,1175.0,221.0,0.0,5.0,,0.0,96.0,0.0,,72.4,82.5,62.3,33.8117,-118.1464
2,USW00023129,2000-10-01,4.7,1208.0,33.0,4.0,0.0,1.0,1.81,79.0,2.3,,64.7,71.9,57.5,33.8117,-118.1464
3,USW00023129,2000-11-01,3.6,1208.0,0.0,0.0,0.0,1.0,0.0,80.0,0.0,,56.9,68.1,45.8,33.8117,-118.1464
4,USW00023129,2000-12-01,3.1,1209.0,0.0,0.0,0.0,,0.0,80.0,0.0,,57.1,68.3,46.0,33.8117,-118.1464


In [98]:
#Create a feature called link
#holding weather station + date as a string so that we can connect the dataframes
# later on. This feature will link the weather dataset to the two fire datasets
weather['Link'] = ''
for index, row in weather.iterrows():
  st = str(weather.at[index,'STATION'])
  date = str(weather.at[index,'DATE'])
  weather.at[index,'Link'] = st + ' ' + date

In [99]:
weather.head()

Unnamed: 0,STATION,DATE,AWND,CDSD,CLDD,DP10,DX90,DYTS,EMXP,EMXT,PRCP,SNOW,TAVG,TMAX,TMIN,Latitude,Longitude,Link
0,USW00023129,2000-08-01,5.6,953.0,323.0,0.0,6.0,,0.0,97.0,0.0,,75.4,84.3,66.5,33.8117,-118.1464,USW00023129 2000-08-01 00:00:00
1,USW00023129,2000-09-01,5.1,1175.0,221.0,0.0,5.0,,0.0,96.0,0.0,,72.4,82.5,62.3,33.8117,-118.1464,USW00023129 2000-09-01 00:00:00
2,USW00023129,2000-10-01,4.7,1208.0,33.0,4.0,0.0,1.0,1.81,79.0,2.3,,64.7,71.9,57.5,33.8117,-118.1464,USW00023129 2000-10-01 00:00:00
3,USW00023129,2000-11-01,3.6,1208.0,0.0,0.0,0.0,1.0,0.0,80.0,0.0,,56.9,68.1,45.8,33.8117,-118.1464,USW00023129 2000-11-01 00:00:00
4,USW00023129,2000-12-01,3.1,1209.0,0.0,0.0,0.0,,0.0,80.0,0.0,,57.1,68.3,46.0,33.8117,-118.1464,USW00023129 2000-12-01 00:00:00


In [100]:
weather.shape

(19239, 18)

In [101]:
#Investigate the number of null values in each column of the dataframe
print(weather.isnull().sum())

STATION          0
DATE             0
AWND           781
CDSD          1251
CLDD           365
DP10           271
DX90           357
DYTS         15558
EMXP           271
EMXT           357
PRCP           271
SNOW         13440
TAVG           364
TMAX           357
TMIN           348
Latitude         0
Longitude        0
Link             0
dtype: int64


In [102]:
#After a closer look at the dataset, we find that the snow data is mostly null and not recorded at many mountain stations,
#so we will drop this feature for now and potentially look into other data sources for monthly snowfall.
#We will drop CDSD because it is very similar to CLDD (we want to avoid redundancy) and has more null values.
#We will also drop DYTS (days with thunderstorms) due to its very high null count (many stations likely don't record this).

In [103]:
weather.drop(['CDSD','DYTS','SNOW'], axis = 1, inplace = True)
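
To put the null counts in proportion, the sketch below converts a few of them into shares of the 19239 rows. The counts are copied from the `isnull().sum()` output above; the 0.5 cutoff is a hypothetical threshold for illustration, not the rule used in this notebook (CDSD, for instance, is dropped for redundancy rather than for its null share).

```python
import pandas as pd

# Null counts copied from the isnull().sum() output above
null_counts = pd.Series({'AWND': 781, 'CDSD': 1251, 'CLDD': 365,
                         'DYTS': 15558, 'SNOW': 13440, 'PRCP': 271})
n_rows = 19239  # row count from weather.shape

# Share of rows that are null in each column
null_share = (null_counts / n_rows).round(3)

# Hypothetical cutoff: flag columns that are more than half empty
mostly_null = null_share[null_share > 0.5].index.tolist()
print(null_share)
print(mostly_null)
```

On the real dataframe the same shares can be obtained with `weather.isnull().mean()`.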

In [104]:
wild1 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/wildfires1_with_station.csv')
wild2 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/wildfires2_with_station.csv')

Add the weather data to the wildfire datasets

In [105]:
wild1['Link'] = ''
for index, row in wild1.iterrows():
  d = str(wild1.at[index,'Year']) + '-' + str(wild1.at[index,'Discovery Month'])
  n = datetime.strptime(d, '%Y-%b')
  wild1.at[index,'Link'] = wild1.at[index,'NOAA Station'] + ' ' + str(n)

wild1.head()

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,Contained DOY,Latitude,Longitude,County,CountyIds,State,OWNER_DESCR,NOAA Station,Link
0,2005,FOUNTAIN,0.1,A,Miscellaneous,Plumas National Forest,1.0,Feb,33.0,Feb,33.0,40.036944,-121.005833,Plumas,32.0,CA,USFS,USW00023225,USW00023225 2005-02-01 00:00:00
1,2004,PIGEON,0.25,A,Lightning,Eldorado National Forest,1.0,May,133.0,May,133.0,38.933056,-120.404444,Placer,31.0,CA,USFS,USW00093230,USW00093230 2004-05-01 00:00:00
2,2004,SLACK,0.1,A,Debris Burning,Eldorado National Forest,1.0,Jun,152.0,Jun,152.0,38.984167,-120.735556,El Dorado,9.0,CA,STATE OR PRIVATE,USW00023225,USW00023225 2004-06-01 00:00:00
3,2004,DEER,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,185.0,38.559167,-119.913333,Alpine,2.0,CA,USFS,USW00093230,USW00093230 2004-06-01 00:00:00
4,2004,STEVENOT,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,185.0,38.559167,-119.933056,Alpine,2.0,CA,USFS,USW00093230,USW00093230 2004-06-01 00:00:00


In [106]:
wild2['Link'] = ''
for index, row in wild2.iterrows():
  d = str(wild2.at[index,'Year']) + '-' + str(wild2.at[index,'Discovery Month'])
  n = datetime.strptime(d, '%Y-%b')
  wild2.at[index,'Link'] = wild2.at[index,'NOAA Station'] + ' ' + str(n)

wild2.head()

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,Contained DOY,Latitude,Longitude,County,CountyIds,State,OWNER_DESCR,NOAA Station,Link
0,2016,Soberanes Fire,132127.0,G,,,83.0,Jul,,Oct,287.0,36.45994,-121.89938,Monterey,27,CA,,USW00023259,USW00023259 2016-07-01 00:00:00
1,2016,Erskine Fire,48019.0,G,,,18.0,Jun,,Jul,193.0,35.6115,-118.45628,Kern,15,CA,,USW00023155,USW00023155 2016-06-01 00:00:00
2,2016,Chimney Fire,46344.0,G,,,24.0,Aug,,Sep,250.0,35.70595,-120.98316,San Luis Obispo,40,CA,,USW00093209,USW00093209 2016-08-01 00:00:00
3,2016,Blue Cut Fire,36274.0,G,,,7.0,Aug,,Aug,236.0,34.30372,-117.49342,San Bernardino,36,CA,,USW00003102,USW00003102 2016-08-01 00:00:00
4,2016,Gap Fire,33867.0,G,,,1.0,Aug,,Aug,241.0,41.851,-123.118,Siskiyou,47,CA,,USW00024283,USW00024283 2016-08-01 00:00:00


In [107]:
wild1['AWND'] = ''
wild1['CLDD'] = ''
wild1['DP10'] = ''
wild1['DX90'] = ''
wild1['PRCP'] = ''
wild1['TAVG'] = ''
wild1['TMAX'] = ''
wild1['TMIN'] = ''

wild2['AWND'] = ''
wild2['CLDD'] = ''
wild2['DP10'] = ''
wild2['DX90'] = ''
wild2['PRCP'] = ''
wild2['TAVG'] = ''
wild2['TMAX'] = ''
wild2['TMIN'] = ''

In [108]:
#Link weather data to each fire in wild1 through the link feature (holds station name + date)
for index, row in wild1.iterrows():
  li = wild1.at[index,'Link']
  w = weather.loc[weather['Link'] == li].reset_index(drop = True)

  #Only copy values over when exactly one weather row matches the link
  if len(w) == 1:
    wild1.at[index,'AWND'] = w.at[0, 'AWND']
    wild1.at[index,'CLDD'] = w.at[0, 'CLDD']
    wild1.at[index,'DP10'] = w.at[0, 'DP10']
    wild1.at[index,'DX90'] = w.at[0, 'DX90']
    wild1.at[index,'PRCP'] = w.at[0, 'PRCP']
    wild1.at[index,'TAVG'] = w.at[0, 'TAVG']
    wild1.at[index,'TMAX'] = w.at[0, 'TMAX']
    wild1.at[index,'TMIN'] = w.at[0, 'TMIN']

In [109]:
#Link weather data to each fire in wild2 through the link feature (holds station name + date)
for index, row in wild2.iterrows():
  li = wild2.at[index,'Link']
  w = weather.loc[weather['Link'] == li].reset_index(drop = True)

  #Only copy values over when exactly one weather row matches the link
  if len(w) == 1:
    wild2.at[index,'AWND'] = w.at[0, 'AWND']
    wild2.at[index,'CLDD'] = w.at[0, 'CLDD']
    wild2.at[index,'DP10'] = w.at[0, 'DP10']
    wild2.at[index,'DX90'] = w.at[0, 'DX90']
    wild2.at[index,'PRCP'] = w.at[0, 'PRCP']
    wild2.at[index,'TAVG'] = w.at[0, 'TAVG']
    wild2.at[index,'TMAX'] = w.at[0, 'TMAX']
    wild2.at[index,'TMIN'] = w.at[0, 'TMIN']
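
The per-fire lookup above amounts to a left join on station and month. Below is a sketch of the same idea with `DataFrame.merge`, under the assumption that each station has at most one weather row per month. The weather values are copied from rows shown above; the `NO_MATCH` fire and all `*_demo` names are hypothetical.

```python
import pandas as pd

# Weather values copied from rows shown above
weather_demo = pd.DataFrame({
    'STATION': ['USW00023225', 'USW00093230'],
    'DATE': pd.to_datetime(['2005-02', '2004-05'], format='%Y-%m'),
    'AWND': [5.6, 6.9],
    'PRCP': [5.33, 0.81],
})
# The second fire is hypothetical and has no matching weather row
fires_demo = pd.DataFrame({
    'Name': ['FOUNTAIN', 'NO_MATCH'],
    'Year': [2005, 2010],
    'Discovery Month': ['Feb', 'Jul'],
    'NOAA Station': ['USW00023225', 'USW00023225'],
})

# Build the month key as a real datetime instead of a concatenated string
fires_demo['DATE'] = pd.to_datetime(
    fires_demo['Year'].astype(str) + '-' + fires_demo['Discovery Month'], format='%Y-%b')

# Left join keeps every fire; validate raises if a station-month appears twice in the weather data
merged = fires_demo.merge(
    weather_demo.rename(columns={'STATION': 'NOAA Station'}),
    how='left', on=['NOAA Station', 'DATE'], validate='many_to_one')
print(merged[['Name', 'AWND', 'PRCP']])
```

One difference from the loop: fires without a matching weather row end up with NaN in the weather columns rather than an empty string.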

In [110]:
wild1.head(5)

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,...,NOAA Station,Link,AWND,CLDD,DP10,DX90,PRCP,TAVG,TMAX,TMIN
0,2005,FOUNTAIN,0.1,A,Miscellaneous,Plumas National Forest,1.0,Feb,33.0,Feb,...,USW00023225,USW00023225 2005-02-01 00:00:00,5.6,0.0,12.0,0.0,5.33,38.9,43.9,33.9
1,2004,PIGEON,0.25,A,Lightning,Eldorado National Forest,1.0,May,133.0,May,...,USW00093230,USW00093230 2004-05-01 00:00:00,6.9,0.0,2.0,0.0,0.81,47.3,63.0,31.6
2,2004,SLACK,0.1,A,Debris Burning,Eldorado National Forest,1.0,Jun,152.0,Jun,...,USW00023225,USW00023225 2004-06-01 00:00:00,5.6,36.0,0.0,0.0,0.0,63.1,70.2,56.0
3,2004,DEER,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,...,USW00093230,USW00093230 2004-06-01 00:00:00,5.6,0.0,1.0,0.0,0.29,54.7,72.9,36.5
4,2004,STEVENOT,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,...,USW00093230,USW00093230 2004-06-01 00:00:00,5.6,0.0,1.0,0.0,0.29,54.7,72.9,36.5


In [111]:
wild2.head(5)

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,...,NOAA Station,Link,AWND,CLDD,DP10,DX90,PRCP,TAVG,TMAX,TMIN
0,2016,Soberanes Fire,132127.0,G,,,83.0,Jul,,Oct,...,USW00023259,USW00023259 2016-07-01 00:00:00,6.5,0.0,0.0,0.0,0.0,58.8,65.2,52.4
1,2016,Erskine Fire,48019.0,G,,,18.0,Jun,,Jul,...,USW00023155,USW00023155 2016-06-01 00:00:00,6.7,529.0,0.0,22.0,0.0,82.6,96.6,68.6
2,2016,Chimney Fire,46344.0,G,,,24.0,Aug,,Sep,...,USW00093209,USW00093209 2016-08-01 00:00:00,6.9,237.0,0.0,23.0,0.0,72.6,92.6,52.6
3,2016,Blue Cut Fire,36274.0,G,,,7.0,Aug,,Aug,...,USW00003102,USW00003102 2016-08-01 00:00:00,6.5,455.0,0.0,28.0,0.0,79.7,94.6,64.7
4,2016,Gap Fire,33867.0,G,,,1.0,Aug,,Aug,...,USW00024283,USW00024283 2016-08-01 00:00:00,4.5,0.0,0.0,0.0,0.02,56.4,62.9,49.9


Feature Engineering:


1.   Total precipitation in the six months preceding the fire
2.   Total precipitation during the rainy season preceding the fire (October-April)
3.   Total DX90 in the two months preceding the fire (number of days with temperature over 90, a proxy for heat waves)
4.   Total DP10 in the two months preceding the fire (number of days with non-negligible precipitation)


In [112]:
weather.head()

Unnamed: 0,STATION,DATE,AWND,CLDD,DP10,DX90,EMXP,EMXT,PRCP,TAVG,TMAX,TMIN,Latitude,Longitude,Link
0,USW00023129,2000-08-01,5.6,323.0,0.0,6.0,0.0,97.0,0.0,75.4,84.3,66.5,33.8117,-118.1464,USW00023129 2000-08-01 00:00:00
1,USW00023129,2000-09-01,5.1,221.0,0.0,5.0,0.0,96.0,0.0,72.4,82.5,62.3,33.8117,-118.1464,USW00023129 2000-09-01 00:00:00
2,USW00023129,2000-10-01,4.7,33.0,4.0,0.0,1.81,79.0,2.3,64.7,71.9,57.5,33.8117,-118.1464,USW00023129 2000-10-01 00:00:00
3,USW00023129,2000-11-01,3.6,0.0,0.0,0.0,0.0,80.0,0.0,56.9,68.1,45.8,33.8117,-118.1464,USW00023129 2000-11-01 00:00:00
4,USW00023129,2000-12-01,3.1,0.0,0.0,0.0,0.0,80.0,0.0,57.1,68.3,46.0,33.8117,-118.1464,USW00023129 2000-12-01 00:00:00


In [113]:
from datetime import date
from dateutil.relativedelta import relativedelta

In [114]:
#Format each DATE as 'YYYY Mon' (e.g. '2000 Aug') so it can be matched against month windows
weather['Year Month'] = weather['DATE'].dt.strftime('%Y %b')

In [115]:
weather.head()

Unnamed: 0,STATION,DATE,AWND,CLDD,DP10,DX90,EMXP,EMXT,PRCP,TAVG,TMAX,TMIN,Latitude,Longitude,Link,Year Month
0,USW00023129,2000-08-01,5.6,323.0,0.0,6.0,0.0,97.0,0.0,75.4,84.3,66.5,33.8117,-118.1464,USW00023129 2000-08-01 00:00:00,2000 Aug
1,USW00023129,2000-09-01,5.1,221.0,0.0,5.0,0.0,96.0,0.0,72.4,82.5,62.3,33.8117,-118.1464,USW00023129 2000-09-01 00:00:00,2000 Sep
2,USW00023129,2000-10-01,4.7,33.0,4.0,0.0,1.81,79.0,2.3,64.7,71.9,57.5,33.8117,-118.1464,USW00023129 2000-10-01 00:00:00,2000 Oct
3,USW00023129,2000-11-01,3.6,0.0,0.0,0.0,0.0,80.0,0.0,56.9,68.1,45.8,33.8117,-118.1464,USW00023129 2000-11-01 00:00:00,2000 Nov
4,USW00023129,2000-12-01,3.1,0.0,0.0,0.0,0.0,80.0,0.0,57.1,68.3,46.0,33.8117,-118.1464,USW00023129 2000-12-01 00:00:00,2000 Dec


Engineered Feature #1: Total precipitation in the six months preceding the start of the fire. We will denote this feature PRCP_6M. In the code below, I add PRCP_6M to each dataset.

In [116]:
wild1['PRCP_6M'] =''
for index, row in wild1.iterrows():
  f = str(wild1.at[index,'Year']) + ' ' + str(wild1.at[index,'Discovery Month'])
  fire_date = datetime.strptime(f, '%Y %b')
  max_month = fire_date + relativedelta(months =-1)
  six_months_before = fire_date + relativedelta(months=-6)
  year_month = pd.period_range(six_months_before, max_month, freq='M')
  year_month = list(year_month.strftime('%Y %b'))

  #Keep this station's rows whose 'Year Month' falls in the six-month window
  w = weather.loc[(weather['STATION'] == wild1.at[index,'NOAA Station']) & (weather['Year Month'].isin(year_month))]
  prcp_sum = 0
  for index2, rows in w.iterrows():
    prcp_sum += w.at[index2,'PRCP']
  wild1.at[index, 'PRCP_6M'] = prcp_sum

In [117]:
wild2['PRCP_6M'] =''
for index, row in wild2.iterrows():
  f = str(wild2.at[index,'Year']) + ' ' + str(wild2.at[index,'Discovery Month'])
  fire_date = datetime.strptime(f, '%Y %b')
  max_month = fire_date + relativedelta(months =-1)
  six_months_before = fire_date + relativedelta(months=-6)
  year_month = pd.period_range(six_months_before, max_month, freq='M')
  year_month = list(year_month.strftime('%Y %b'))

  #Keep this station's rows whose 'Year Month' falls in the six-month window
  w = weather.loc[(weather['STATION'] == wild2.at[index,'NOAA Station']) & (weather['Year Month'].isin(year_month))]
  prcp_sum = 0
  for index2, rows in w.iterrows():
    prcp_sum += w.at[index2,'PRCP']
  wild2.at[index, 'PRCP_6M'] = prcp_sum
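
The windowed sums in this part of the notebook all follow the same pattern, so they could be factored into one helper. The sketch below is one way to do that; the name `window_sum` and its parameters are hypothetical. It also makes one design choice that differs from the loops above: when a station has no rows at all in the window it returns NaN instead of 0, so missing data is not mistaken for a dry period.

```python
from datetime import datetime

import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta

def window_sum(weather, station, year, month_abbr, column, months_back):
    """Sum `column` at `station` over the `months_back` months before the discovery month."""
    fire_date = datetime.strptime(f'{year} {month_abbr}', '%Y %b')
    start = fire_date + relativedelta(months=-months_back)
    end = fire_date + relativedelta(months=-1)
    months = pd.period_range(start, end, freq='M').strftime('%Y %b')

    rows = weather[(weather['STATION'] == station) & (weather['Year Month'].isin(months))]
    # Assumption: no rows in the window means "unknown", not "zero"
    if rows.empty:
        return np.nan
    # skipna=False keeps the loop's behavior of propagating NaN values
    return rows[column].sum(skipna=False)

# PRCP values copied from the USW00023129 rows shown above
weather_demo = pd.DataFrame({
    'STATION': ['USW00023129'] * 5,
    'Year Month': ['2000 Aug', '2000 Sep', '2000 Oct', '2000 Nov', '2000 Dec'],
    'PRCP': [0.0, 0.0, 2.3, 0.0, 0.0],
})
known = window_sum(weather_demo, 'USW00023129', 2001, 'Jan', 'PRCP', 5)
unknown = window_sum(weather_demo, 'USW00099999', 2001, 'Jan', 'PRCP', 5)  # hypothetical station ID
print(known, unknown)
```

If only some months of the window are present for a station, the helper, like the loops above, sums just those months.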

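Engineered Feature #2: Total precipitation during the rainy season (October-April) before the fire. The window runs from October of the year before the fire through April of the fire's year, so for fires discovered between January and April it extends up to or past the discovery month. We will denote this feature PRCP_RS, and I add it to both datasets in the code below.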

In [118]:
wild1['PRCP_RS'] =''
for index, row in wild1.iterrows():
  f = str(wild1.at[index,'Year'])
  fire_date = datetime.strptime(f, '%Y')
  start_rainy = fire_date + relativedelta(months =-3)
  end_rainy = fire_date + relativedelta(months=+3)
  year_month = pd.period_range(start_rainy, end_rainy, freq='M')
  year_month = list(year_month.strftime('%Y %b'))

  #Keep this station's rows for the seven rainy-season months (October-April)
  w = weather.loc[(weather['STATION'] == wild1.at[index,'NOAA Station']) & (weather['Year Month'].isin(year_month))]
  prcp_rs = 0
  for index2, rows in w.iterrows():
    prcp_rs += w.at[index2,'PRCP']
  wild1.at[index, 'PRCP_RS'] = prcp_rs

In [119]:
wild2['PRCP_RS'] =''
for index, row in wild2.iterrows():
  f = str(wild2.at[index,'Year'])
  fire_date = datetime.strptime(f, '%Y')
  start_rainy = fire_date + relativedelta(months =-3)
  end_rainy = fire_date + relativedelta(months=+3)
  year_month = pd.period_range(start_rainy, end_rainy, freq='M')
  year_month = list(year_month.strftime('%Y %b'))

  #Keep this station's rows for the seven rainy-season months (October-April)
  w = weather.loc[(weather['STATION'] == wild2.at[index,'NOAA Station']) & (weather['Year Month'].isin(year_month))]
  prcp_rs = 0
  for index2, rows in w.iterrows():
    prcp_rs += w.at[index2,'PRCP']
  wild2.at[index, 'PRCP_RS'] = prcp_rs
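
To see which months the rainy-season window covers, the sketch below repeats the date arithmetic from the cells above for a single example year (2004 is chosen arbitrarily).

```python
from datetime import datetime

import pandas as pd
from dateutil.relativedelta import relativedelta

fire_year = 2004  # example year; any fire year works the same way
jan_first = datetime.strptime(str(fire_year), '%Y')  # parses to January 1 of that year
start_rainy = jan_first + relativedelta(months=-3)   # October of the previous year
end_rainy = jan_first + relativedelta(months=+3)     # April of the fire's year

months = list(pd.period_range(start_rainy, end_rainy, freq='M').strftime('%Y %b'))
print(months)
```

The list holds seven entries, from `'2003 Oct'` through `'2004 Apr'`, which is why the filter above needs to match seven months.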

Engineered Feature #3: Total DX90 in the two months preceding the fire (number of days with temperature over 90, a proxy for heat waves). We will denote this feature DX90_2M.

In [120]:
wild1['DX90_2M'] =''
for index, row in wild1.iterrows():
  f = str(wild1.at[index,'Year']) + ' ' + str(wild1.at[index,'Discovery Month'])
  fire_date = datetime.strptime(f, '%Y %b')
  max_month = fire_date + relativedelta(months =-1)
  two_months_before = fire_date + relativedelta(months=-2)
  year_month = pd.period_range(two_months_before, max_month, freq='M')
  year_month = list(year_month.strftime('%Y %b'))

  #Keep this station's rows whose 'Year Month' falls in the two-month window
  w = weather.loc[(weather['STATION'] == wild1.at[index,'NOAA Station']) & (weather['Year Month'].isin(year_month))]
  dx90_sum = 0
  for index2, rows in w.iterrows():
    dx90_sum += w.at[index2,'DX90']
  wild1.at[index, 'DX90_2M'] = dx90_sum

In [121]:
wild2['DX90_2M'] =''
for index, row in wild2.iterrows():
  f = str(wild2.at[index,'Year']) + ' ' + str(wild2.at[index,'Discovery Month'])
  fire_date = datetime.strptime(f, '%Y %b')
  max_month = fire_date + relativedelta(months =-1)
  two_months_before = fire_date + relativedelta(months=-2)
  year_month = pd.period_range(two_months_before, max_month, freq='M')
  year_month = list(year_month.strftime('%Y %b'))

  #Keep this station's rows whose 'Year Month' falls in the two-month window
  w = weather.loc[(weather['STATION'] == wild2.at[index,'NOAA Station']) & (weather['Year Month'].isin(year_month))]
  dx90_sum = 0
  for index2, rows in w.iterrows():
    dx90_sum += w.at[index2,'DX90']
  wild2.at[index, 'DX90_2M'] = dx90_sum

Engineered Feature #4: Total DP10 in the two months preceding the fire (number of days with non-negligible precipitation). We will denote this feature DP10_2M.


In [122]:
wild1['DP10_2M'] =''
for index, row in wild1.iterrows():
  f = str(wild1.at[index,'Year']) + ' ' + str(wild1.at[index,'Discovery Month'])
  fire_date = datetime.strptime(f, '%Y %b')
  max_month = fire_date + relativedelta(months =-1)
  two_months_before = fire_date + relativedelta(months=-2)
  year_month = pd.period_range(two_months_before, max_month, freq='M')
  year_month = list(year_month.strftime('%Y %b'))

  #Keep this station's rows whose 'Year Month' falls in the two-month window
  w = weather.loc[(weather['STATION'] == wild1.at[index,'NOAA Station']) & (weather['Year Month'].isin(year_month))]
  dp10_sum = 0
  for index2, rows in w.iterrows():
    dp10_sum += w.at[index2,'DP10']
  wild1.at[index, 'DP10_2M'] = dp10_sum

In [123]:
wild2['DP10_2M'] =''
for index, row in wild2.iterrows():
  f = str(wild2.at[index,'Year']) + ' ' + str(wild2.at[index,'Discovery Month'])
  fire_date = datetime.strptime(f, '%Y %b')
  max_month = fire_date + relativedelta(months =-1)
  two_months_before = fire_date + relativedelta(months=-2)
  year_month = pd.period_range(two_months_before, max_month, freq='M')
  year_month = list(year_month.strftime('%Y %b'))

  #Keep this station's rows whose 'Year Month' falls in the two-month window
  w = weather.loc[(weather['STATION'] == wild2.at[index,'NOAA Station']) & (weather['Year Month'].isin(year_month))]
  dp10_sum = 0
  for index2, rows in w.iterrows():
    dp10_sum += w.at[index2,'DP10']
  wild2.at[index, 'DP10_2M'] = dp10_sum

In [124]:
wild1.head(10)

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,...,DP10,DX90,PRCP,TAVG,TMAX,TMIN,PRCP_6M,PRCP_RS,DX90_2M,DP10_2M
0,2005,FOUNTAIN,0.1,A,Miscellaneous,Plumas National Forest,1.0,Feb,33.0,Feb,...,12.0,0.0,5.33,38.9,43.9,33.9,27.89,49.06,0.0,19.0
1,2004,PIGEON,0.25,A,Lightning,Eldorado National Forest,1.0,May,133.0,May,...,2.0,0.0,0.81,47.3,63.0,31.6,14.37,14.76,0.0,3.0
2,2004,SLACK,0.1,A,Debris Burning,Eldorado National Forest,1.0,Jun,152.0,Jun,...,0.0,0.0,0.0,63.1,70.2,56.0,36.71,40.37,0.0,11.0
3,2004,DEER,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,...,1.0,0.0,0.29,54.7,72.9,36.5,13.63,14.76,0.0,3.0
4,2004,STEVENOT,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,...,1.0,0.0,0.29,54.7,72.9,36.5,13.63,14.76,0.0,3.0
5,2004,HIDDEN,0.1,A,Lightning,Eldorado National Forest,1.0,Jul,182.0,Jul,...,0.0,0.0,0.0,61.1,79.8,42.3,8.31,14.76,0.0,3.0
6,2004,FORK,0.1,A,Lightning,Eldorado National Forest,1.0,Jul,183.0,Jul,...,0.0,0.0,0.0,61.1,79.8,42.3,8.31,14.76,0.0,3.0
7,2005,SLATE,0.8,B,Debris Burning,Shasta-Trinity National Forest,1.0,Mar,67.0,Mar,...,9.0,0.0,4.99,55.5,67.0,44.0,25.91,32.72,0.0,12.0
8,2005,SHASTA,1.0,B,Debris Burning,Shasta-Trinity National Forest,1.0,Mar,74.0,Mar,...,9.0,0.0,4.99,55.5,67.0,44.0,25.91,32.72,0.0,12.0
9,2004,TANGLEFOOT,0.1,A,Lightning,Eldorado National Forest,1.0,Jul,183.0,Jul,...,0.0,0.0,0.0,61.1,79.8,42.3,8.31,14.76,0.0,3.0


In [125]:
wild2.head(10)

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,...,DP10,DX90,PRCP,TAVG,TMAX,TMIN,PRCP_6M,PRCP_RS,DX90_2M,DP10_2M
0,2016,Soberanes Fire,132127.0,G,,,83.0,Jul,,Oct,...,0.0,0.0,0.0,58.8,65.2,52.4,14.11,21.42,0.0,1.0
1,2016,Erskine Fire,48019.0,G,,,18.0,Jun,,Jul,...,0.0,22.0,0.0,82.6,96.6,68.6,4.68,4.88,15.0,4.0
2,2016,Chimney Fire,46344.0,G,,,24.0,Aug,,Sep,...,0.0,23.0,0.0,72.6,92.6,52.6,2.52,8.09,43.0,0.0
3,2016,Blue Cut Fire,36274.0,G,,,7.0,Aug,,Aug,...,0.0,28.0,0.0,79.7,94.6,64.7,3.41,6.45,43.0,0.0
4,2016,Gap Fire,33867.0,G,,,1.0,Aug,,Aug,...,0.0,0.0,0.02,56.4,62.9,49.9,18.03,54.17,0.0,2.0
5,2016,Rey Fire,32606.0,G,,,28.0,Aug,,Sep,...,2.0,9.0,0.33,66.4,87.6,45.2,6.08,10.28,26.0,3.0
6,2016,Cedar Fire,29322.0,G,,,46.0,Aug,,Oct,...,0.0,31.0,0.0,85.3,99.2,71.4,2.15,4.88,52.0,0.0
7,2016,Canyon Fire,12518.0,G,,,10.0,Sep,,Sep,...,0.0,3.0,0.0,64.1,76.2,52.0,3.84,8.76,0.0,0.0
8,2016,Pilot Fire,8110.0,G,,,9.0,Aug,,Aug,...,0.0,28.0,0.0,79.7,94.6,64.7,3.41,6.45,43.0,0.0
9,2016,Border Fire,7609.0,G,,,11.0,Jun,,Jun,...,0.0,18.0,0.0,71.2,91.5,50.9,7.63,9.31,1.0,5.0


In [126]:
wild1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61572 entries, 0 to 61571
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Year                        61572 non-null  int64  
 1   Name                        57233 non-null  object 
 2   AcresBurned                 61572 non-null  float64
 3   Fire Size Rank              61572 non-null  object 
 4   Cause                       61572 non-null  object 
 5   SOURCE_REPORTING_UNIT_NAME  61572 non-null  object 
 6   DaysBurn                    61572 non-null  float64
 7   Discovery Month             61572 non-null  object 
 8   Discovered DOY              61572 non-null  float64
 9   Contained Month             61572 non-null  object 
 10  Contained DOY               61572 non-null  float64
 11  Latitude                    61572 non-null  float64
 12  Longitude                   61572 non-null  float64
 13  County                      385

In [130]:
wild2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1059 entries, 0 to 1058
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Year                        1059 non-null   int64  
 1   Name                        1059 non-null   object 
 2   AcresBurned                 1059 non-null   float64
 3   Fire Size Rank              1059 non-null   object 
 4   Cause                       0 non-null      float64
 5   SOURCE_REPORTING_UNIT_NAME  0 non-null      float64
 6   DaysBurn                    1059 non-null   float64
 7   Discovery Month             1059 non-null   object 
 8   Discovered DOY              0 non-null      float64
 9   Contained Month             1059 non-null   object 
 10  Contained DOY               1059 non-null   float64
 11  Latitude                    1059 non-null   float64
 12  Longitude                   1059 non-null   float64
 13  County                      1059 

Save both datasets to CSV files. Each fire now has NOAA weather data in addition to the four engineered features I created. The next step is to add snow data for fires that occurred in areas that receive snow during the winter months, along with the elevation at which each fire occurred. After this, we will be able to start dropping null values and begin training classification models.

In [131]:
wild1.to_csv('CA_wildfires_with_weather.csv')
wild2.to_csv('CA_wildfires_recent_with_weather.csv')