# Time between Last Fire and Debris flow

- calculate the time between events to include as a feature
    - could be a good parameter for the model and for users in the final data product
- also want to calculate the NUMBER of fires at each site
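A possible way to count fires per site is sketched below. It assumes a site identifier column, here called `site_id`, which is hypothetical and not confirmed to exist in the dataset. The toy values are made up for illustration.

```python
import pandas as pd

# toy records; `site_id` is a hypothetical column, values are illustrative only
toy_sites = pd.DataFrame({
    "site_id": ["A", "A", "A", "B"],
    "fire_id": ["bck", "bck", "xyz", "bck"],
})

# number of distinct fires recorded at each site, broadcast back to every row
toy_sites["n_fires"] = toy_sites.groupby("site_id")["fire_id"].transform("nunique")
print(toy_sites)
```

Using `transform` keeps one value per original row, so the count can be attached directly as a feature column.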

In [1]:
import os
import pandas as pd
import geopandas as gpd
import datetime as dt

# import requests, json, folium

In [2]:
file_path = "../../data/data_v06_landslide.parquet"
df = gpd.read_parquet(file_path)
print('raw data -',df.shape)

df.head()

raw data - (1550, 51)


Unnamed: 0,fire_name,year,fire_id,fire_segid,database,state,response,stormdate,gaugedist_m,stormstart,...,Metamorphic,Sedimentary,Unconsolidated,domrt,age_min,age_max,index_right,LNDS_RISKV,LNDS_RISKS,LNDS_RISKR
0,Buckweed,2007,bck,bck_1035,Training,CA,0,2008-01-22,1998.67,2008-01-21 16:27:00,...,1.0,0.0,0.0,Metamorphic,56.0,72.1,205,380675.353544,96.305814,Relatively High
1,Buckweed,2007,bck,bck_1090,Training,CA,0,2008-01-22,2368.93,2008-01-21 16:27:00,...,1.0,0.0,0.0,Metamorphic,56.0,72.1,205,380675.353544,96.305814,Relatively High
2,Buckweed,2007,bck,bck_1570,Training,CA,0,2008-01-22,3956.74,2008-01-21 16:27:00,...,0.973247,0.026753,0.0,Metamorphic,56.0,72.1,205,380675.353544,96.305814,Relatively High
3,Buckweed,2007,bck,bck_235,Training,CA,0,2008-01-22,1734.72,2008-01-21 15:47:00,...,1.0,0.0,0.0,Metamorphic,56.0,72.1,205,380675.353544,96.305814,Relatively High
4,Buckweed,2007,bck,bck_363,Training,CA,0,2008-01-22,1801.04,2008-01-21 15:47:00,...,1.0,0.0,0.0,Metamorphic,56.0,72.1,205,380675.353544,96.305814,Relatively High


In [4]:
cols = [
    'year', # year of wildfire
    'stormdate', # date of storm that produced debris-flow response
]

print("nan_count:")
print(df[cols].isna().sum())

df[cols]

nan_count:
year         0
stormdate    0
dtype: int64


Unnamed: 0,year,stormdate
0,2007,2008-01-22
1,2007,2008-01-22
2,2007,2008-01-22
3,2007,2008-01-22
4,2007,2008-01-22
...,...,...
1545,2011,2011-09-07
1546,2011,2011-07-11
1547,2011,2011-07-26
1548,2011,2011-08-15


## encoding logic for fire interval

- there are a significant number of records where the fire of record occurred *after* the debris flow; we should encode these as 1s
- the fire data is not granular, so we only have the year of the fire even though we have the date of the storm. many fires occurred in the same year as the debris flow
    - because we are simply subtracting the years, this will cause the time difference to be zero
    - to handle this, offset the years of the fire by 1
- the remaining records will show a nonzero time difference
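One possible reading of the encoding notes above is sketched here on toy data. It assumes the interval is measured as storm year minus fire year, that the offset of 1 is applied to every record, and that fire-after-storm records are floored to 1. These are assumptions for illustration, and the column name `fire_interval_enc` is hypothetical.

```python
import pandas as pd

# toy records (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "year": [2007, 2008, 2009, 2003],
    "stormdate": ["2008-01-22", "2008-07-11", "2008-08-15", "2008-01-22"],
})

storm_year = pd.to_datetime(toy["stormdate"]).dt.year

# years elapsed from fire to storm; negative when the fire year is later
elapsed = storm_year - toy["year"]

# offset by 1 so same-year events become 1 instead of 0,
# then floor at 1 so fire-after-storm records are also encoded as 1
toy["fire_interval_enc"] = (elapsed + 1).clip(lower=1)
print(toy)
```

Note that this sign convention is the opposite of a plain `year - storm year` subtraction, which gives negative values when the fire precedes the storm.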

In [5]:
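# fire year minus storm year: 0 means same calendar year,
# negative means the fire year precedes the storm year
df['fire_interval'] = df['year'] - pd.to_datetime(df['stormdate']).dt.year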

In [7]:
df['fire_interval'].value_counts(dropna=False).sort_index()

-10     30
-5     221
-4     379
-2       3
-1     314
 0     603
Name: fire_interval, dtype: int64

In [9]:
# encode nans if present
df['fire_interval'] = df['fire_interval'].fillna(9999)

better would be DAYS between fire and DF (debris flow). we currently do not have a more granular date range for the wildfires
- proceeding with the broader, year-level feature to test model results
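If a fire start date became available, a day-level interval could be computed roughly as follows. The column `fire_start_date` is hypothetical (the current data only has `year`), and the dates below are made-up toy values.

```python
import pandas as pd

# toy records; `fire_start_date` is a hypothetical column not present in the data
toy_days = pd.DataFrame({
    "fire_start_date": ["2007-10-21", "2011-06-01"],
    "stormdate": ["2008-01-22", "2011-07-11"],
})

fire_start = pd.to_datetime(toy_days["fire_start_date"])
storm = pd.to_datetime(toy_days["stormdate"])

# whole days from fire start to storm; negative if the storm came first
toy_days["fire_interval_days"] = (storm - fire_start).dt.days
print(toy_days)
```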

In [12]:
df.shape

(1550, 52)

In [13]:
# write out the file with fire interval in parquet format
df.to_parquet("../../data/data_v07_fire_interval.parquet")