## Processing and Cleaning Fire Data

The fire data includes a feature called ```UNIT_ID```, which is similar to county but not identical, because two counties are occasionally grouped into a single unit.  In addition, the abbreviations used in ```UNIT_ID``` differ from the standard three-letter abbreviations used elsewhere to identify California counties, so a dictionary mapping these abbreviations to county names had to be constructed manually.

In [55]:
import pandas as pd
import numpy as np
import ast
from IPython.display import clear_output

In [66]:
fire = pd.read_csv('./data/raw_fires.csv')

In [67]:
county_dict = {'NEU' : 'Yuba', 'BTU': 'Butte', 'CZU': 'San Mateo - Santa Cruz', 'MEU': 'Mendocino', 'HUU': 'Humboldt', 'TGU': 'Tehama - Glenn', 'ORC': 'Orange County', 'SKU': 'Siskiyou', 'TCU': 'Tuolumne - Calaveras',
       'KRN': 'Kern', 'CND': 'Kern', 'MMU': 'Madera - Mariposa', 'SHF': 'Trinity', 'SHU': 'Shasta - Trinity', 'SBC': 'Santa Barbara', 'BDU': 'San Bernardino', 'SLU': 'San Luis Obispo', 'TUU': 'Tulare', 
       'AEU': 'Amador - El Dorado', 'FKU': 'Fresno - Kings', 'BEU': 'Monterey - San Benito', 'VNC': 'Ventura', 'SCU': 'Santa Clara', 'MVU': 'San Diego', 'RRU': 'Riverside', 'LAC': 'Los Angeles', 'ANF': 'Los Angeles',
       'LNU': 'Sonoma Lake - Napa', 'MRN': 'Marin', 'MCP': 'San Diego', 'INF': 'Inyo', 'MDF': 'Modoc', 'TNF': 'Yuba', 'PNF': 'Plumas', 'LPF': 'Santa Barbara', 'KNF': 'Siskiyou',
       'SQF': 'Fresno', 'ENF': 'El Dorado', 'SNF': 'Mariposa', 'HIA': 'Humboldt', 'MNF': 'Lake', 'STF': 'Tuolumne', 'BDF': 'San Bernardino', 'CNF': 'San Diego', 'MNP': 'San Bernardino',
       'LNF': 'Lassen', 'CDD': 'Riverside', 'LMU': 'Lassen - Modoc', 'SRF': 'Del Norte', 'HTF': 'Humboldt', 'YNP': 'Mariposa', 'BNP': 'Siskiyou', 'KNP': 'Tulare', 'CNP': 'Ventura',
       'RNP': 'Marin', 'CCD': 'Los Angeles', 'NOD': 'Shasta', 'BBD': 'Kern', 'TMU': 'Placer', 'TOI': 'Mono', 'SNU': 'Sonoma', 'DVP': 'Inyo',
       'AFV': 'Santa Barbara', 'CRB': 'San Luis Obispo', 'SMP': 'Los Angeles', 'LDF': 'Los Angeles', 'FNF': 'Klamath', 'WED': 'Siskiyou', 'KRR': 'Kern',
       'SWR': 'Sacramento', 'LUR': 'Merced', 'BRR': 'Kern', 'HPR': 'Ventura', 'PLR': 'Tulare', 'SOR': 'Imperial', 'TNR': 'San Diego', 'SJR': 'San Joaquin', 'CLR': 'Modoc',
       'LKR': 'Klamath', 'RWP': 'Humboldt', 'LNP': 'Lassen', 'JTP': 'Riverside', 'GNP': 'Marin', 'PIP': 'San Benito', 'VLJ': 'Solano', 'RRS': 'Siskiyou'}

In [68]:
fire['UNIT_ID'] = fire['UNIT_ID'].map(county_dict)
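
Because `Series.map` returns `NaN` for any value that is not a key in the dictionary, it can be useful to list the codes that have no mapping before overwriting the column. The following is a minimal sketch on a toy frame; `toy_fire`, `unit_map`, and the code `'XXX'` are illustrative stand-ins, not part of the real data.

```python
import pandas as pd

# Toy stand-ins for the real data: 'XXX' is a made-up code with no mapping,
# and None represents a record whose UNIT_ID was already missing.
toy_fire = pd.DataFrame({'UNIT_ID': ['BTU', 'KRN', 'XXX', None]})
unit_map = {'BTU': 'Butte', 'KRN': 'Kern'}

mapped = toy_fire['UNIT_ID'].map(unit_map)

# Codes that were present in the data but absent from the dictionary.
lost = mapped.isna() & toy_fire['UNIT_ID'].notna()
unmapped = sorted(toy_fire.loc[lost, 'UNIT_ID'].unique())
print(unmapped)
```

Any code that shows up in `unmapped` would silently become a missing county after the mapping step.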

Since the weather data we obtained goes back only to July 1, 2008, we drop fire records from before this date.

In [69]:
fire['ALARM_DATE'] = pd.to_datetime(fire.ALARM_DATE, errors = 'coerce')
fire_recent = fire.loc[fire.ALARM_DATE > '2008-07-01', :]
fire_recent.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4304 entries, 0 to 21317
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   OBJECTID      4304 non-null   int64              
 1   YEAR_         4304 non-null   float64            
 2   STATE         4301 non-null   object             
 3   AGENCY        4299 non-null   object             
 4   UNIT_ID       4279 non-null   object             
 5   FIRE_NAME     4287 non-null   object             
 6   INC_NUM       4078 non-null   object             
 7   ALARM_DATE    4304 non-null   datetime64[ns, UTC]
 8   CONT_DATE     4249 non-null   object             
 9   CAUSE         4273 non-null   float64            
 10  COMMENTS      2052 non-null   object             
 11  REPORT_AC     3659 non-null   float64            
 12  GIS_ACRES     4297 non-null   float64            
 13  C_METHOD      4293 non-null   float64            
 14  OBJECTI

From this data we only need the size of the fire, when it occurred, its cause, and where it is located, so everything else is dropped.

Where ```UNIT_ID``` contains two counties, we arbitrarily select the second so that there are no merge issues when we join this data to the weather data.
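
The cells that perform these two steps are not shown in this export. One way they could be done is sketched below on a toy frame; the kept column names follow the output further down, while the sample rows and the `' - '` splitting rule are assumptions for illustration.

```python
import pandas as pd

# Toy rows standing in for fire_recent (values are made up).
toy = pd.DataFrame({
    'UNIT_ID': ['Butte', 'Tehama - Glenn', 'Madera - Mariposa'],
    'FIRE_NAME': ['A', 'B', 'C'],
    'ALARM_DATE': pd.to_datetime(['2020-06-18', '2019-07-01', '2018-08-10']),
    'CAUSE': [11.0, 2.0, 14.0],
    'GIS_ACRES': [109.6, 685.6, 27.3],
    'COMMENTS': [None, 'x', None],
})

# Keep only location, name, date, cause, and size.
keep = ['UNIT_ID', 'FIRE_NAME', 'ALARM_DATE', 'CAUSE', 'GIS_ACRES']
toy = toy[keep].copy()

# Take the text after the last ' - ' separator; single-county names are unchanged.
toy['UNIT_ID'] = toy['UNIT_ID'].str.split(' - ').str[-1]
print(toy['UNIT_ID'].tolist())
```

For units that list exactly two counties, taking the last piece is the same as taking the second.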

In [73]:
fire_recent.head()

|   | UNIT_ID | FIRE_NAME | ALARM_DATE | CAUSE | GIS_ACRES |
|---|---|---|---|---|---|
| 0 | Yuba | NELSON | 2020-06-18 00:00:00+00:00 | 11.0 | 109.6025 |
| 1 | Yuba | AMORUSO | 2020-06-01 00:00:00+00:00 | 2.0 | 685.58502 |
| 2 | Yuba | ATHENS | 2020-08-10 00:00:00+00:00 | 14.0 | 27.30048 |
| 3 | Yuba | FLEMING | 2020-03-31 00:00:00+00:00 | 9.0 | 12.93155 |
| 4 | Yuba | MELANESE | 2020-04-14 00:00:00+00:00 | 18.0 | 10.31596 |


In [None]:
fire_recent.to_csv('./data/fire_data.csv', index = False)

## Processing and Cleaning Weather Data

Import the California county data in order to add county information to the weather data.

In [75]:
weather = pd.read_csv('./data/clean_daily_weather.csv')
counties_long = pd.read_csv('./data/CA_Counties_Location.csv')
# Chain .rename onto the selection so a new frame is returned; renaming the
# slice in place is what triggers pandas' SettingWithCopyWarning.
counties = counties_long[['NAMELSAD', 'INTPTLAT', 'INTPTLON']].rename(
    columns = {'NAMELSAD': 'name', 'INTPTLAT': 'lat', 'INTPTLON': 'lon'})
counties.head()

|   | name | lat | lon |
|---|---|---|---|
| 0 | Sierra County | 39.576925 | -120.521993 |
| 1 | Sacramento County | 38.450011 | -121.340441 |
| 2 | Santa Barbara County | 34.537057 | -120.039973 |
| 3 | Calaveras County | 38.1839 | -120.561442 |
| 4 | Ventura County | 34.358742 | -119.133143 |


Merge the county data into the weather data on latitude and longitude, after rounding the county coordinates to two decimal places.

In [None]:
counties = counties.round(2)
w_c = pd.merge(weather, counties, left_on = ['lat', 'long'], right_on = ['lat', 'lon'])
w_c.drop(columns = ['lon'], inplace = True)
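
Joining on floating-point coordinates only works when both sides hold exactly the same rounded values, so it can help to check which weather rows found no county. The sketch below uses toy frames (`toy_weather`, `toy_counties`, and the third, deliberately unmatched row are made up) and a left join with `indicator=True`, which is a swapped-in variant of the inner join above rather than the method used in this notebook.

```python
import pandas as pd

# Toy weather rows keyed by coordinates already rounded to two decimals.
toy_weather = pd.DataFrame({
    'lat': [39.58, 38.45, 10.00],
    'long': [-120.52, -121.34, 10.00],
    'maxtempF': [80, 90, 70],
})

# Toy county centroids, rounded the same way as in the cell above.
toy_counties = pd.DataFrame({
    'name': ['Sierra County', 'Sacramento County'],
    'lat': [39.576925, 38.450011],
    'lon': [-120.521993, -121.340441],
}).round(2)

# A left join keeps every weather row; '_merge' records whether a county matched.
merged = pd.merge(toy_weather, toy_counties, how='left',
                  left_on=['lat', 'long'], right_on=['lat', 'lon'],
                  indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['lat', 'long']])
```

An empty `unmatched` frame would indicate that every weather location was assigned a county.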

At this point our weather data has a row for every day since July 1, 2008 in every county in California.  This is too granular for our model, so for every combination of county and month we apply a county and date mask to the dataframe, one day at a time, and average the daily values to get a monthly average for each weather statistic.  Each monthly row is then appended to a new data frame named ```out```.  

In [None]:
## THIS CELL TAKES 22 HOURS TO RUN ##

feature_list = ['maxtempF', 'mintempF', 'avgtempF', 'totalSnow_cm', 'sunHour', 'precip', 'humidity', 'windspeed', 'lat', 'long']
county_list = list(w_c['name'].unique())
years = list(range(2008, 2021))
months = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
out = w_c.loc[0:1].copy()
out['name'] = 'drop'

new_row = []


for year in years:
    for i, month in enumerate(months):
        if month > '06' or year > 2008:
            for county in county_list:
                new_row = [str(year) + '-' + month]
                for feature in feature_list:
                    counter = 0
                    feature_collector = 0
                    for day in range(1, days[i]+1):
                        if len(str(day)) == 1:
                            day = '0' + str(day)
                        else:
                            day = str(day)
                        date_mask = str(year) + '-' + month + '-' + str(day)
                        counter += 1
                        feature_collector += float(w_c[(w_c['name'] == county) & (w_c['date'] == date_mask)][feature])
                    new_row.append(feature_collector / counter)
                
                new_row.append(county)    

                columns = ['date', 'maxtempF', 'mintempF', 'avgtempF', 'totalSnow_cm', 'sunHour', 'precip', 'humidity', 'windspeed', 'lat', 'long', 'name']

                # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
                out = pd.concat([out, pd.DataFrame([new_row], columns = columns)], ignore_index = True)

                clear_output()
                
                print(f'{date_mask} of {county} complete')
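
Much of the running time of the loop above likely comes from re-filtering the whole frame once per county, day, and feature. The same kind of monthly average can usually be computed in a single pass with `groupby`. The following is a minimal sketch on a toy frame, assuming `date` is a `'YYYY-MM-DD'` string as in the masks above; the rows and values are made up, and this is an alternative technique rather than the code that produced the results here.

```python
import pandas as pd

# Toy daily rows standing in for w_c (values are made up).
toy = pd.DataFrame({
    'date': ['2008-07-01', '2008-07-02', '2008-08-01', '2008-07-01'],
    'name': ['Kern County', 'Kern County', 'Kern County', 'Inyo County'],
    'maxtempF': [100.0, 90.0, 95.0, 80.0],
    'precip': [0.0, 0.2, 0.0, 0.1],
})

# Truncate each date to 'YYYY-MM', then average every numeric column
# within each (county, month) group.
monthly = (
    toy.assign(date=toy['date'].str[:7])
       .groupby(['name', 'date'], as_index=False)
       .mean(numeric_only=True)
)
print(monthly)
```

One difference to keep in mind: `groupby` averages over whichever days are present, so February 29 is included in leap years, whereas the loop uses a fixed 28-day February.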

Precipitation, humidity, and wind speed are stored differently in the WWO data than the temperature data.  The next cell collects these daily values into a new data frame and then uses the ```.resample``` method to convert them to monthly averages.

In [4]:
precip = []
humid = []
wind = []
for item in df_out['hourly']:
    first_entry = ast.literal_eval(item)[0]  # parse the string once, keep its first entry
    precip.append(first_entry['precipInches'])
    humid.append(first_entry['humidity'])
    wind.append(first_entry['windspeedMiles'])

lat = out['Lat and Long'].map(lambda x: float(x[4:9]))
long = out['Lat and Long'].map(lambda x: float(x[18:26]))
date = out['date']

fixer = pd.DataFrame(list(zip(date, precip, humid, wind, lat, long)), columns = ['date', 'precip', 'humid', 'wind', 'lat', 'long'])

fixer['date'] = pd.to_datetime(fixer.date)

fixer['humid'] = fixer['humid'].astype(float)
fixer['wind'] = fixer['wind'].astype(float)
fixer['precip'] = fixer['precip'].astype(float)
fixer = pd.merge(fixer, counties, left_on = ['lat', 'long'], right_on = ['lat', 'lon'])

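# county_dfs is built in a cell not shown here; it is assumed to hold one
# date-indexed frame per county.
monthly_frames = []
for iters, df in enumerate(county_dfs, start = 1):
    name = df['name'].iloc[0]  # positional lookup; [0] is ambiguous on a date index
    temp_df = df[['humid', 'wind', 'precip']].resample('M').mean()
    temp_df['name'] = name
    temp_df.reset_index(inplace = True)
    monthly_frames.append(temp_df)
    clear_output()
    print(iters)
out = pd.concat(monthly_frames)  # concatenate once instead of inside the loop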
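
To make the parsing step at the top of this cell concrete, the sketch below runs it on a single value. The string `raw_hourly` is a hypothetical example of what one `hourly` cell might look like (a stringified list of dicts with numeric values stored as strings); only the three field names are taken from the code above.

```python
import ast

# Hypothetical contents of one `hourly` cell; the real cells may hold more
# entries and more fields.
raw_hourly = (
    "[{'precipInches': '0.0', 'humidity': '35', 'windspeedMiles': '7'}, "
    "{'precipInches': '0.1', 'humidity': '40', 'windspeedMiles': '9'}]"
)

# literal_eval safely turns the string back into a Python list of dicts;
# as in the cell above, only the first entry is used.
first_entry = ast.literal_eval(raw_hourly)[0]

row = {
    'precip': float(first_entry['precipInches']),
    'humid': float(first_entry['humidity']),
    'wind': float(first_entry['windspeedMiles']),
}
print(row)
```

Note that taking index `0` keeps only the first entry of each list; any later entries are ignored.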

Finally, the wind, humidity, and precipitation data are merged with the rest of the weather data and saved as a .csv file.

In [None]:
out['date'] = out['date'].astype(str)
out['date'] = out['date'].map(lambda x: x[0:7])

weather = pd.merge(weather, out, on = ['date', 'name'])

weather.rename(columns = {'name': 'county'}, inplace = True)

weather = weather[['date', 'county', 'maxtempF', 'mintempF', 'avgtempF', 'totalSnow_cm', 'humid', 'wind', 'precip', 'sunHour', 'lat', 'long']]

weather.to_csv('./data/weather_data.csv', index = False)
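
The final merge relies on both frames having exactly one row per county and month, keyed by a `'YYYY-MM'` string. A small sketch of that alignment on toy frames follows; `temps`, `other`, and their values are made up, and the `validate` argument is an optional safeguard that is not used in the cell above.

```python
import pandas as pd

# Toy monthly temperature rows, already keyed by 'YYYY-MM'.
temps = pd.DataFrame({
    'date': ['2008-07', '2008-08'],
    'name': ['Kern County', 'Kern County'],
    'avgtempF': [88.0, 86.5],
})

# Toy resampled rows, keyed by month-end timestamps.
other = pd.DataFrame({
    'date': pd.to_datetime(['2008-07-31', '2008-08-31']),
    'name': ['Kern County', 'Kern County'],
    'humid': [30.0, 32.0],
})

# Truncate the timestamps to 'YYYY-MM' so the keys match.
other['date'] = other['date'].astype(str).str[:7]

# validate='one_to_one' raises an error if either side has duplicate keys.
combined = pd.merge(temps, other, on=['date', 'name'], validate='one_to_one')
print(combined)
```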