## Feature Consolidation and Split

The purpose of this notebook is to consolidate a set of features and perform a seeded train-test split, so that models are trained and compared on the same data. It also creates a JSON file for storing model hyperparameters.

- [Consolidation](#consol)
- [Train-Test Split](#split)
    - *Open question: apply the standard scaler here or in the modeling notebook?*
- [Model Parameters JSON](#json)

In [1]:
# assign seed now for consistency any time it needs to be used
seed=27

In [2]:
import pandas as pd
import geopandas as gpd
#import numpy as np
from sklearn.model_selection import train_test_split
#from sklearn.preprocessing import StandardScaler
import pickle



In [3]:
file_path = "../../data/data_v08_fire_interval.parquet"
df = gpd.read_parquet(file_path)
print('raw data: ',df.shape)
df = df.dropna()
print('drop na: ',df.shape)
df

raw data:  (1379, 49)
drop na:  (1078, 49)


| | fire_name | year | fire_id | fire_segid | database | state | response | stormdate | gaugedist_m | stormstart | ... | Igneous | Metamorphic | Sedimentary | Unconsolidated | domrt | index_right | LNDS_RISKV | LNDS_RISKS | LNDS_RISKR | fire_interval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Buckweed | 2007 | bck | bck_1035 | Training | CA | 0 | 22 | 1998.670000 | 2008-01-21 16:27:00 | ... | 0.0 | 1.000000 | 0.000000 | 0.0 | Metamorphic | 205 | 380675.353544 | 96.305814 | Relatively High | 1.0 |
| 1 | Buckweed | 2007 | bck | bck_1090 | Training | CA | 0 | 22 | 2368.930000 | 2008-01-21 16:27:00 | ... | 0.0 | 1.000000 | 0.000000 | 0.0 | Metamorphic | 205 | 380675.353544 | 96.305814 | Relatively High | 1.0 |
| 2 | Buckweed | 2007 | bck | bck_1570 | Training | CA | 0 | 22 | 3956.740000 | 2008-01-21 16:27:00 | ... | 0.0 | 0.973247 | 0.026753 | 0.0 | Metamorphic | 205 | 380675.353544 | 96.305814 | Relatively High | 1.0 |
| 3 | Buckweed | 2007 | bck | bck_235 | Training | CA | 0 | 22 | 1734.720000 | 2008-01-21 15:47:00 | ... | 0.0 | 1.000000 | 0.000000 | 0.0 | Metamorphic | 205 | 380675.353544 | 96.305814 | Relatively High | 1.0 |
| 4 | Buckweed | 2007 | bck | bck_363 | Training | CA | 0 | 22 | 1801.040000 | 2008-01-21 15:47:00 | ... | 0.0 | 1.000000 | 0.000000 | 0.0 | Metamorphic | 205 | 380675.353544 | 96.305814 | Relatively High | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1530 | Wallow | 2011 | wlw | wlw_45357 | Test | AZ | 0 | 11 | 67.548691 | 2011-07-11 14:45:00 | ... | 1.0 | 0.000000 | 0.000000 | 0.0 | Igneous | 102 | 245876.108811 | 94.282043 | Relatively Moderate | 0.0 |
| 1534 | Wallow | 2011 | wlw | wlw_45553 | Test | AZ | 0 | 11 | 143.436916 | 2011-07-11 14:45:00 | ... | 1.0 | 0.000000 | 0.000000 | 0.0 | Igneous | 102 | 245876.108811 | 94.282043 | Relatively Moderate | 0.0 |
| 1538 | Wallow | 2011 | wlw | wlw_46217 | Test | AZ | 1 | 11 | 955.156247 | 2011-07-11 14:45:00 | ... | 1.0 | 0.000000 | 0.000000 | 0.0 | Igneous | 102 | 245876.108811 | 94.282043 | Relatively Moderate | 0.0 |
| 1542 | Wallow | 2011 | wlw | wlw_47409 | Test | AZ | 1 | 11 | 2706.250000 | 2011-07-11 14:45:00 | ... | 1.0 | 0.000000 | 0.000000 | 0.0 | Igneous | 102 | 245876.108811 | 94.282043 | Relatively Moderate | 0.0 |


## Feature Consolidation <a id="consol"></a>

Sedimentary and unconsolidated rocks have similar debris-flow occurrence rates, so they are combined into a single feature.

Define additional features:
- `SedUn`: fraction of the watershed covered by sedimentary and unconsolidated rocks.
- `SuscFrac`: fraction of the watershed covered by susceptible vegetation types (everything except grassland, `GR`).

In [4]:
df["SedUn"] = df["Sedimentary"] + df["Unconsolidated"]
df["SuscFrac"] = df["GS"] + df["SH"] + df["TL"] + df["TU"]
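
As a quick sanity check on the two derived features, the sketch below repeats the same sums on a toy frame and confirms that the results stay within [0, 1]. The column names follow the notebook, but the values are made up, and the helper `is_fraction` is hypothetical. It also assumes each input column is already a cover fraction of the watershed.

```python
import pandas as pd

# Toy rows: column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "Sedimentary":    [0.00, 0.25, 0.50],
    "Unconsolidated": [0.00, 0.25, 0.50],
    "GR": [0.50, 0.00, 1.00],
    "GS": [0.25, 0.25, 0.00],
    "SH": [0.25, 0.25, 0.00],
    "TL": [0.00, 0.25, 0.00],
    "TU": [0.00, 0.25, 0.00],
})

# Same definitions as in the cell above.
toy["SedUn"] = toy["Sedimentary"] + toy["Unconsolidated"]
toy["SuscFrac"] = toy[["GS", "SH", "TL", "TU"]].sum(axis=1)

def is_fraction(series, tol=1e-9):
    """Hypothetical helper: True if every value lies in [0, 1] within tol."""
    return bool(series.between(-tol, 1 + tol).all())

print(is_fraction(toy["SedUn"]), is_fraction(toy["SuscFrac"]))
```

Running the same check on `df["SedUn"]` and `df["SuscFrac"]` would flag rows where the input fractions do not behave as expected.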

In [5]:
# df.loc[0,:]

In [6]:
use_cols = [
    'SiteID', # will ultimately be needed for joining    
    'fire_id',
    'fire_name',
    'fire_segid',
    'year',
    'state',
    #'database', # original staley train/test
    'response',
    'stormdate',
    'gaugedist_m',
    'lat',
    'lon',
    'geom',
    #'geometry', # this gets lost along the way; go back and retain it so we're not joining later
]

In [7]:
feature_cols = [
    'peak_i15_mmh',
    # 'peak_i30_mmh', 
    # 'peak_i60_mmh', 
    'contributingarea_km2', 
    'prophm23',
    'dnbr1000', 
    'kf', 
    #'acc015_mm', 
    # 'acc030_mm', 
    # 'acc060_mm', 
    'Fine fuel load', 
    'SAV', 
    'Packing ratio', 
    'Extinction moisture content',
    'LNDS_RISKS',
    'fire_interval',
    'SedUn',
    'SuscFrac',
]

In [8]:
use_cols + feature_cols

['SiteID',
 'fire_id',
 'fire_name',
 'fire_segid',
 'year',
 'state',
 'response',
 'stormdate',
 'gaugedist_m',
 'lat',
 'lon',
 'geom',
 'peak_i15_mmh',
 'contributingarea_km2',
 'prophm23',
 'dnbr1000',
 'kf',
 'Fine fuel load',
 'SAV',
 'Packing ratio',
 'Extinction moisture content',
 'LNDS_RISKS',
 'fire_interval',
 'SedUn',
 'SuscFrac']

In [9]:
df = df[use_cols + feature_cols]
df.to_parquet('../../data/data_v09_feature_consolidation.parquet')


## Train-Test Split <a id="split"></a>

Data rows are split by `SiteID` so that observations made at the same site (during different storms) are never assigned to both the training and the test set.
Calling the `unique()` method before splitting is therefore essential: if the raw `SiteID` column were split instead, a site with several rows could end up on both sides of the split.

In [10]:
train_sites, test_sites = train_test_split(df['SiteID'].unique(), test_size=0.20, shuffle=True, random_state=seed)

In [11]:
# no overlap
set(train_sites).intersection(set(test_sites))

set()
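
To illustrate the site-level split on something small, here is a minimal sketch using a toy frame with repeated `SiteID` values. The data is invented; only the `train_test_split` call and the `isin` masks mirror the notebook. Because sites contribute different numbers of rows, the row-level test fraction is only approximately equal to `test_size`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: six sites, some observed during several storms (values invented).
toy = pd.DataFrame({
    "SiteID":   ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"],
    "response": [0, 1, 0, 0, 0, 1, 1, 0, 1, 0],
})

# Split the unique site IDs, not the rows.
train_sites, test_sites = train_test_split(
    toy["SiteID"].unique(), test_size=0.20, shuffle=True, random_state=27
)

train_mask = toy["SiteID"].isin(train_sites)
test_mask = toy["SiteID"].isin(test_sites)

# Every row falls in exactly one set, and no site appears in both.
assert not (train_mask & test_mask).any()
assert (train_mask | test_mask).all()
print("row-level test fraction:", test_mask.mean())
```

scikit-learn's `GroupShuffleSplit` is an alternative that performs a comparable group-aware split directly on the rows; it is not what this notebook uses.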

In [12]:
# how to retain the non-feature/response cols
# could join back in later? think through this

train_mask = df['SiteID'].isin(train_sites)
test_mask = df['SiteID'].isin(test_sites)

In [13]:
X_train = df.loc[train_mask, feature_cols]
X_test = df.loc[test_mask, feature_cols]

y_train = df.loc[train_mask, 'response']
y_test = df.loc[test_mask, 'response']

In [14]:
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("\n")
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)

X_train:  (863, 13)
X_test:  (215, 13)


y_train:  (863,)
y_test:  (215,)


In [15]:
# use a context manager so the file handle is closed after writing
with open("../../data/train_test_data.pkl", "wb") as f:
    pickle.dump([X_train, X_test, y_train, y_test], f)
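
A modeling notebook would presumably read the split back in the same order it was written. The sketch below shows the round trip on tiny stand-in objects and a temporary file; the stand-in values and the temporary path are assumptions for illustration, not the project's data.

```python
import os
import pickle
import tempfile

import pandas as pd

# Stand-ins for the real split (values invented).
X_train = pd.DataFrame({"peak_i15_mmh": [10.0, 20.0]})
X_test = pd.DataFrame({"peak_i15_mmh": [15.0]})
y_train = pd.Series([0, 1], name="response")
y_test = pd.Series([1], name="response")

# Temporary location so the example does not touch ../../data.
path = os.path.join(tempfile.mkdtemp(), "train_test_data.pkl")

with open(path, "wb") as f:
    pickle.dump([X_train, X_test, y_train, y_test], f)

# Unpack in the same order the list was written.
with open(path, "rb") as f:
    X_train_in, X_test_in, y_train_in, y_test_in = pickle.load(f)

print(X_train_in.equals(X_train), y_test_in.equals(y_test))
```

Since pickle files can execute arbitrary code when loaded, they should only be read from trusted sources.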

## Model Parameters <a id="json"></a>

Create a JSON file that the model tuning steps will add to. Check whether it already exists first so that existing parameters are not overwritten.

In [26]:
import os
import json

In [27]:
file_path = "../../model/model_parameters.json"
if os.path.isfile(file_path):
    print("parameter json already exists")
else:
    model_params = {}  # create empty dict
    with open(file_path, "w") as json_file:
        json.dump(model_params, json_file)
    print("created empty json file")  # report only after the write succeeds

created empty json file
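
One way the tuning notebooks could add to this file is to read the existing dictionary, update a single model's entry, and write everything back. The sketch below is only an assumption about that workflow: the helper `update_model_params`, the model names, and the parameter values are all hypothetical, and a temporary file stands in for `model_parameters.json`.

```python
import json
import os
import tempfile

def update_model_params(path, model_name, params):
    """Hypothetical helper: merge one model's parameters into the JSON file."""
    if os.path.isfile(path):
        with open(path) as f:
            all_params = json.load(f)
    else:
        all_params = {}
    all_params[model_name] = params  # replaces only this model's entry
    with open(path, "w") as f:
        json.dump(all_params, f, indent=2)
    return all_params

# Temporary stand-in for ../../model/model_parameters.json
path = os.path.join(tempfile.mkdtemp(), "model_parameters.json")
with open(path, "w") as f:
    json.dump({}, f)

# Model names and values are invented for illustration.
update_model_params(path, "random_forest", {"n_estimators": 200, "max_depth": 5})
result = update_model_params(path, "logistic_regression", {"C": 0.1})
print(sorted(result))
```

Keying the file by model name keeps entries from different tuning steps from overwriting one another.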
