<a href="https://colab.research.google.com/github/mnocerino23/Wildfire-Forecaster/blob/main/NOAA_weatherdata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I requested monthly weather data from California's various weather stations for 2001-2019. In this notebook, I engineer additional weather features, map each station to its coordinates, and then find the closest station to each fire that occurred, based on those coordinates.

In [40]:
#Read in the csv file from my google drive
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
weather = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/NOAA_California.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Weather Features:


1.   Station - Weather Station in California
2.   Date - Year and month of the report
3.   AWND - Average Monthly Wind Speed
4.   CDSD - Cooling Degree Days Season to Date
5.   CLDD - Cooling Degree Days
6.   DP10 - Days with over 0.10 inches of precipitation
7.   DX90 - Days with temperature over 90
8.   DYTS - Number of days with thunderstorms
9.   EMXP - Extreme max precipitation
10.  EMXT - Extreme max temperature
11.  PRCP - Monthly Precipitation
12.  SNOW - Total Monthly Snowfall
13.  TAVG - Average Temperature
14.  TMAX - Max Temperature
15.  TMIN - Min Temperature






In [41]:
weather.shape

(19239, 15)

In [42]:
weather.head(8)

Unnamed: 0,STATION,DATE,AWND,CDSD,CLDD,DP10,DX90,DYTS,EMXP,EMXT,PRCP,SNOW,TAVG,TMAX,TMIN
0,USW00023129,2000-08,5.6,953.0,323.0,0.0,6.0,,0.0,97.0,0.0,,75.4,84.3,66.5
1,USW00023129,2000-09,5.1,1175.0,221.0,0.0,5.0,,0.0,96.0,0.0,,72.4,82.5,62.3
2,USW00023129,2000-10,4.7,1208.0,33.0,4.0,0.0,1.0,1.81,79.0,2.3,,64.7,71.9,57.5
3,USW00023129,2000-11,3.6,1208.0,0.0,0.0,0.0,1.0,0.0,80.0,0.0,,56.9,68.1,45.8
4,USW00023129,2000-12,3.1,1209.0,0.0,0.0,0.0,,0.0,80.0,0.0,,57.1,68.3,46.0
5,USW00023129,2001-01,3.6,0.0,0.0,5.0,0.0,1.0,0.71,82.0,2.11,,52.4,63.1,41.7
6,USW00023129,2001-02,5.1,1.0,1.0,8.0,0.0,2.0,1.93,87.0,5.79,,52.5,61.5,43.4
7,USW00023129,2001-03,5.1,7.0,6.0,1.0,0.0,1.0,0.25,85.0,0.26,,59.0,66.3,51.8


Read in a CSV file listing California's NOAA weather stations and their coordinates, then split the combined coordinates column into separate latitude and longitude columns.

In [43]:
stations = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/NOAA_Stations.csv')

In [44]:
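#Rename the combined 'Coordinates' column to 'Latitude' and reuse 'Unnamed: 3' as 'Longitude'
stations.rename({'Unnamed: 3': 'Longitude', 'Coordinates':'Latitude'}, axis = 1, inplace = True)
for index, row in stations.iterrows():
  #Each entry is a 'lat lon' string: split on the space and store the two parts separately
  s = stations.at[index,'Latitude'].split(' ')
  stations.at[index,'Longitude'] = s[1]
  stations.at[index,'Latitude'] = s[0]
stations.head(10)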

Unnamed: 0.1,Unnamed: 0,Station,Latitude,Longitude
0,0,USW00023129,33.8117,-118.1464
1,1,USW00093111,34.1167,-119.1167
2,2,USW00093112,32.7,-117.2
3,3,USW00093115,32.5667,-117.1167
4,4,USW00093116,33.25,-119.45
5,5,USW00003167,33.9228,-118.3342
6,6,USW00023293,37.3592,-121.9239
7,7,USW00023174,33.9381,-118.3889
8,8,USW00023130,34.2097,-118.4892
9,9,USW00093241,38.3775,-121.9575
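
The same split can also be done without a row-by-row loop. This is a minimal sketch, assuming the raw column holds strings of the form "lat lon" separated by a single space (as the loop above implies). The two-row `demo` frame is a made-up stand-in that reuses two stations from the table above.

```python
import pandas as pd

# Hypothetical stand-in for the stations file; the raw layout is an assumption
demo = pd.DataFrame({
    'Station': ['USW00023129', 'USW00093111'],
    'Coordinates': ['33.8117 -118.1464', '34.1167 -119.1167'],
})

# Split each "lat lon" string into two columns and convert both to floats
demo[['Latitude', 'Longitude']] = (
    demo['Coordinates'].str.split(' ', expand=True).astype(float)
)
demo = demo.drop(columns='Coordinates')
print(demo)
```

Converting to floats here would also remove the need for the later `float(...)` calls when computing distances.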


In [45]:
#Dictionaries mapping the station to its latitude and longitude
lat = dict(zip(stations['Station'],stations['Latitude']))
long = dict(zip(stations['Station'],stations['Longitude']))

In [46]:
weather['Latitude'] = ''
weather['Longitude'] = ''
for index, row in weather.iterrows():
  weather.at[index, 'Latitude'] = lat[weather.at[index,'STATION']] 
  weather.at[index, 'Longitude'] = long[weather.at[index,'STATION']] 

In [47]:
weather.head()

Unnamed: 0,STATION,DATE,AWND,CDSD,CLDD,DP10,DX90,DYTS,EMXP,EMXT,PRCP,SNOW,TAVG,TMAX,TMIN,Latitude,Longitude
0,USW00023129,2000-08,5.6,953.0,323.0,0.0,6.0,,0.0,97.0,0.0,,75.4,84.3,66.5,33.8117,-118.1464
1,USW00023129,2000-09,5.1,1175.0,221.0,0.0,5.0,,0.0,96.0,0.0,,72.4,82.5,62.3,33.8117,-118.1464
2,USW00023129,2000-10,4.7,1208.0,33.0,4.0,0.0,1.0,1.81,79.0,2.3,,64.7,71.9,57.5,33.8117,-118.1464
3,USW00023129,2000-11,3.6,1208.0,0.0,0.0,0.0,1.0,0.0,80.0,0.0,,56.9,68.1,45.8,33.8117,-118.1464
4,USW00023129,2000-12,3.1,1209.0,0.0,0.0,0.0,,0.0,80.0,0.0,,57.1,68.3,46.0,33.8117,-118.1464
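
The station-to-coordinate lookup can also be written with `Series.map`, which looks up every station ID in a dictionary in one call. This is a sketch on made-up miniature data; `weather_demo`, `lat_demo`, and `long_demo` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Made-up rows that only mimic the STATION/DATE layout of the weather table
weather_demo = pd.DataFrame({
    'STATION': ['USW00023129', 'USW00023129', 'USW00093111'],
    'DATE': ['2000-08', '2000-09', '2000-08'],
})
lat_demo = {'USW00023129': 33.8117, 'USW00093111': 34.1167}
long_demo = {'USW00023129': -118.1464, 'USW00093111': -119.1167}

# map() replaces each station ID with the matching dictionary value
weather_demo['Latitude'] = weather_demo['STATION'].map(lat_demo)
weather_demo['Longitude'] = weather_demo['STATION'].map(long_demo)
print(weather_demo)
```

One difference worth knowing: a station missing from the dictionary becomes NaN with `map`, whereas indexing the dictionary directly raises a `KeyError`.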


Use the haversine package to compute the great-circle distance between two (latitude, longitude) pairs, which it returns in kilometers by default. I use this distance to find the closest weather station to each fire.

In [48]:
!pip install haversine
import haversine as hs
#each parameter is a tuple with lat, long
h = hs.haversine((33.8117, -118.1464), (33.3000,-117.35))
print(h)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
93.18494458973653
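
For reference, the haversine formula itself is short enough to write by hand. The sketch below is my own plain-Python version, not the package's source; the function name `haversine_km` and the Earth-radius constant are assumptions, so the last digits may differ slightly from the package's result.

```python
import math

# Mean Earth radius in km (assumed value; the package may use a slightly different constant)
EARTH_RADIUS_KM = 6371.0088

def haversine_km(point1, point2):
    """Great-circle distance in km between two (lat, lon) tuples given in degrees."""
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

d = haversine_km((33.8117, -118.1464), (33.3000, -117.35))
print(d)
```

Run on the same two points as the cell above, this should land very close to the distance the package printed.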


In [49]:
wildfires1 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/CA_Wildfires_2001_2015.csv')

In the code below, I find the closest NOAA weather station to every fire and add it as a feature in both the 2001-2015 dataset and the smaller 2016-2019 dataset.

In [50]:
wildfires1['NOAA Station'] = ''
for index1, fire in wildfires1.iterrows():
  #Start at infinity so the first station checked always becomes the current minimum
  min_dist = float('inf')
  wildfire_coordinates = (float(fire['Latitude']), float(fire['Longitude']))

  for index2, stn in stations.iterrows():
    #Station coordinates are stored as strings, so convert them before measuring
    station_coordinates = (float(stn['Latitude']), float(stn['Longitude']))

    dist = hs.haversine(station_coordinates, wildfire_coordinates)
    if dist < min_dist:
      min_dist = dist
      min_station = stn['Station']
  #Record the ID of the nearest station for this fire
  wildfires1.at[index1, 'NOAA Station'] = min_station
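
The nested loop compares every fire with every station one pair at a time, which is slow on a large dataset. A possible alternative is to compute the whole fire-by-station distance matrix at once with NumPy broadcasting. This is a sketch, not the method used in this notebook: `nearest_station`, the demo frames, and the Earth-radius constant are my own assumptions, and the distances come from a hand-written haversine formula rather than the `haversine` package.

```python
import numpy as np
import pandas as pd

def nearest_station(fires, stations, radius_km=6371.0088):
    """Return, for each fire row, the ID of the station with the smallest haversine distance."""
    # Column vectors for fires, row vectors for stations, so arithmetic broadcasts to a matrix
    f_lat = np.radians(fires['Latitude'].astype(float).to_numpy())[:, None]
    f_lon = np.radians(fires['Longitude'].astype(float).to_numpy())[:, None]
    s_lat = np.radians(stations['Latitude'].astype(float).to_numpy())[None, :]
    s_lon = np.radians(stations['Longitude'].astype(float).to_numpy())[None, :]

    a = (np.sin((s_lat - f_lat) / 2) ** 2
         + np.cos(f_lat) * np.cos(s_lat) * np.sin((s_lon - f_lon) / 2) ** 2)
    dist = 2 * radius_km * np.arcsin(np.sqrt(a))  # shape: (n_fires, n_stations)

    # argmin picks the first station on ties, like the strict "<" in a loop
    closest = dist.argmin(axis=1)
    return stations['Station'].to_numpy()[closest]

# Three real station rows from the table above; the two fire locations are made up
stations_demo = pd.DataFrame({
    'Station': ['USW00023129', 'USW00023293', 'USW00093241'],
    'Latitude': ['33.8117', '37.3592', '38.3775'],
    'Longitude': ['-118.1464', '-121.9239', '-121.9575'],
})
fires_demo = pd.DataFrame({
    'Latitude': [34.0, 38.5],
    'Longitude': [-118.0, -122.0],
})
fires_demo['NOAA Station'] = nearest_station(fires_demo, stations_demo)
print(fires_demo)
```

Wrapping the logic in a function also means the same code can serve both wildfire datasets. The trade-off is memory: the distance matrix holds one float per fire-station pair.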

The second dataset had a few occurrences of invalid coordinates, so I write a function that checks whether a fire's coordinates are valid and then drop the rows that fail the check.

In [67]:
wildfires2 = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/Wildfires/CA_Wildfires_recent.csv')

In [68]:
def are_valid_coordinates(latitude, longitude):
  if latitude < -90 or latitude > 90:
    return False
  elif longitude < -180 or longitude > 180:
    return False
  else:
    return True

In [69]:
#Collect the indices of rows with invalid coordinates first, then drop them in one call
bad_indices = [index for index, row in wildfires2.iterrows()
               if not are_valid_coordinates(row['Latitude'], row['Longitude'])]
wildfires2.drop(bad_indices, axis = 0, inplace = True)
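
The validity check can also be expressed as a boolean mask over the whole frame. This is a minimal sketch on made-up rows; `coords_demo` is a hypothetical frame, not the real dataset.

```python
import pandas as pd

# Made-up fires: B has an impossible latitude, C an impossible longitude, D a missing latitude
coords_demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Latitude': [36.46, 5487.0, 35.61, float('nan')],
    'Longitude': [-121.90, -118.45, -1200.0, -120.0],
})

# between() is inclusive on both ends by default and evaluates to False for NaN
valid = (coords_demo['Latitude'].between(-90, 90)
         & coords_demo['Longitude'].between(-180, 180))
clean = coords_demo[valid].reset_index(drop=True)
print(clean)
```

Note one behavioral difference: this mask also drops rows with missing coordinates, whereas `are_valid_coordinates` returns True for NaN because every comparison with NaN is False.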

In [70]:
#Drop a specific entry in the second dataset that was causing issues due to incorrect coordinates. Entry occurred at index 99
#wildfires2.drop([97, 391, 603, 781], axis = 0, inplace=True)
#print(wildfires2.iloc[780:790])
wildfires2['NOAA Station'] = ''
for index1, fire in wildfires2.iterrows():
  #Start at infinity so the first station checked always becomes the current minimum
  min_dist = float('inf')
  wildfire_coordinates = (float(fire['Latitude']), float(fire['Longitude']))

  for index2, stn in stations.iterrows():
    #Station coordinates are stored as strings, so convert them before measuring
    station_coordinates = (float(stn['Latitude']), float(stn['Longitude']))

    dist = hs.haversine(station_coordinates, wildfire_coordinates)
    if dist < min_dist:
      min_dist = dist
      min_station = stn['Station']
  #Record the ID of the nearest station for this fire
  wildfires2.at[index1, 'NOAA Station'] = min_station

In [71]:
wildfires1.head(5)

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,Contained DOY,Latitude,Longitude,County,CountyIds,State,OWNER_DESCR,NOAA Station
0,2005,FOUNTAIN,0.1,A,Miscellaneous,Plumas National Forest,1.0,Feb,33.0,Feb,33.0,40.036944,-121.005833,Plumas,32.0,CA,USFS,USW00023225
1,2004,PIGEON,0.25,A,Lightning,Eldorado National Forest,1.0,May,133.0,May,133.0,38.933056,-120.404444,Placer,31.0,CA,USFS,USW00093230
2,2004,SLACK,0.1,A,Debris Burning,Eldorado National Forest,1.0,Jun,152.0,Jun,152.0,38.984167,-120.735556,El Dorado,9.0,CA,STATE OR PRIVATE,USW00023225
3,2004,DEER,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,185.0,38.559167,-119.913333,Alpine,2.0,CA,USFS,USW00093230
4,2004,STEVENOT,0.1,A,Lightning,Eldorado National Forest,5.0,Jun,180.0,Jul,185.0,38.559167,-119.933056,Alpine,2.0,CA,USFS,USW00093230


In [72]:
wildfires2.head(5)

Unnamed: 0,Year,Name,AcresBurned,Fire Size Rank,Cause,SOURCE_REPORTING_UNIT_NAME,DaysBurn,Discovery Month,Discovered DOY,Contained Month,Contained DOY,Latitude,Longitude,County,CountyIds,State,OWNER_DESCR,NOAA Station
0,2016,Soberanes Fire,132127.0,G,,,83.0,Jul,,Oct,287.0,36.45994,-121.89938,Monterey,27,CA,,USW00023259
1,2016,Erskine Fire,48019.0,G,,,18.0,Jun,,Jul,193.0,35.6115,-118.45628,Kern,15,CA,,USW00023155
2,2016,Chimney Fire,46344.0,G,,,24.0,Aug,,Sep,250.0,35.70595,-120.98316,San Luis Obispo,40,CA,,USW00093209
3,2016,Blue Cut Fire,36274.0,G,,,7.0,Aug,,Aug,236.0,34.30372,-117.49342,San Bernardino,36,CA,,USW00003102
4,2016,Gap Fire,33867.0,G,,,1.0,Aug,,Aug,241.0,41.851,-123.118,Siskiyou,47,CA,,USW00024283


In [76]:
#Save each dataset to a new CSV so that we don't have to rerun the expensive calculation to find the closest station to every fire
wildfires1.to_csv('wildfires1_with_station.csv', index = False)
wildfires2.to_csv('wildfires2_with_station.csv', index = False)
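
A later session can then reuse the saved files instead of repeating the search. One way to sketch that pattern is below. `load_or_compute`, `expensive`, and the temporary file path are hypothetical and exist only for this example.

```python
import os
import tempfile
import pandas as pd

def load_or_compute(path, compute):
    """Read the cached CSV at `path` if it exists; otherwise call `compute()` and save the result."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = compute()
    df.to_csv(path, index=False)
    return df

calls = []

def expensive():
    # Stand-in for the nearest-station search; counts how often it actually runs
    calls.append(1)
    return pd.DataFrame({'Name': ['FOUNTAIN'], 'NOAA Station': ['USW00023225']})

cache_path = os.path.join(tempfile.mkdtemp(), 'wildfires_demo_with_station.csv')
first = load_or_compute(cache_path, expensive)   # computes and writes the CSV
second = load_or_compute(cache_path, expensive)  # reads the CSV, skips the computation
print(len(calls))
```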