# 1. Feature Preparation

- Define additional features `SedUn` and `SuscFrac`
- Carry out train-test split
- Fit `StandardScaler` to the training data and save it.

In [1]:
import pandas as pd
pd.set_option("display.max_colwidth", 50)
import geopandas as gpd



In [2]:
import sklearn
import numpy as np

In [3]:
pqfile="../data_preparation/staley16_observations_catchment_fuelpars_rocktype_randn_v3.parquet"
modelDataI = gpd.read_parquet(pqfile)

In [4]:
modelDataI.columns

Index(['fire_name', 'year', 'fire_id', 'fire_segid', 'database', 'state',
       'response', 'stormdate', 'gaugedist_m', 'stormstart', 'stormend',
       'stormdur_h', 'stormaccum_mm', 'stormavgi_mmh', 'peak_i15_mmh',
       'peak_i30_mmh', 'peak_i60_mmh', 'contributingarea_km2', 'prophm23',
       'dnbr1000', 'kf', 'acc015_mm', 'acc030_mm', 'acc060_mm', 'geom', 'lon',
       'lat', 'SiteID', 'NB', 'GR', 'GS', 'SH', 'TU', 'TL', 'dom',
       'Fine fuel load', 'SAV', 'Packing ratio', 'Extinction moisture content',
       'Igneous', 'Metamorphic', 'Sedimentary', 'Unconsolidated', 'domrt'],
      dtype='object')

Sedimentary and unconsolidated rocks show similar debris-flow occurrence, so the two classes are combined into a single feature.

Define additional features:
- `SedUn`: fraction of watershed covered by sedimentary and unconsolidated rocks.
- `SuscFrac`: fraction of watershed covered by susceptible vegetation types (the sum of `GS`, `SH`, `TL` and `TU`, i.e. everything except grassland, `GR`, and `NB`).

In [5]:
modelDataI["SedUn"]=modelDataI["Sedimentary"] + modelDataI["Unconsolidated"]
modelDataI["SuscFrac"]=modelDataI["GS"] + modelDataI["SH"] + modelDataI["TL"] + modelDataI["TU"]
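
As a quick sanity check of the two derived features, here is a minimal sketch on a toy table. The values are made up for illustration (they are not from the Staley dataset), and the sketch assumes the six vegetation fractions of each catchment sum to 1, which the source does not state.

```python
import pandas as pd

# Toy catchments with made-up cover fractions (illustrative only).
toy = pd.DataFrame({
    "Sedimentary": [0.50, 0.00],
    "Unconsolidated": [0.25, 0.00],
    "NB": [0.00, 0.25],
    "GR": [0.25, 0.25],
    "GS": [0.25, 0.00],
    "SH": [0.25, 0.50],
    "TL": [0.125, 0.00],
    "TU": [0.125, 0.00],
})

# Same definitions as in the cell above.
toy["SedUn"] = toy["Sedimentary"] + toy["Unconsolidated"]
toy["SuscFrac"] = toy[["GS", "SH", "TL", "TU"]].sum(axis=1)

# Under the sum-to-1 assumption, SuscFrac equals 1 - GR - NB.
print(toy[["SedUn", "SuscFrac"]])
```

If the fractions do sum to 1 in the real data, `SuscFrac` and `1 - GR - NB` should agree up to rounding, which is an easy check to run on `modelDataI`.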

In [16]:
#Columns to use for training

usecols=["stormdur_h",
         "stormaccum_mm",
         "peak_i15_mmh",
         "contributingarea_km2",
         "prophm23",
         "dnbr1000",
         "kf",
         "SedUn",
         "SuscFrac",
         "Fine fuel load",
         "response",
         "SiteID"]

In [17]:
cdata=modelDataI[usecols].copy()

In [18]:
len(cdata)

1550

In [19]:
cdata.dropna(inplace=True)
print(len(cdata))

1241


In [20]:
cdata.describe()

| | stormdur_h | stormaccum_mm | peak_i15_mmh | contributingarea_km2 | prophm23 | dnbr1000 | kf | SedUn | SuscFrac | Fine fuel load | response | SiteID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1241.0 | 1241.0 | 1241.0 | 1241.0 | 1241.0 | 1241.0 | 1241.0 | 1241.0 | 1241.0 | 1241.0 | 1241.0 | 1241.0 |
| mean | 20.680375 | 40.64519 | 22.243663 | 1.07553 | 0.475066 | 0.336579 | 0.232977 | 0.219603 | 0.707096 | 2.139664 | 0.254633 | 388.271555 |
| std | 17.975764 | 40.26644 | 19.618746 | 1.533322 | 0.276599 | 0.189419 | 0.474465 | 0.394165 | 0.227012 | 0.902508 | 0.435831 | 198.313326 |
| min | 0.0 | 1.33161 | 1.474149 | 0.0201 | 0.0 | 0.007158 | 0.0 | 0.0 | 0.0 | 0.963445 | 0.0 | 0.0 |
| 25% | 3.574931 | 10.521874 | 10.095109 | 0.117824 | 0.224443 | 0.189733 | 0.15 | 0.0 | 0.540533 | 1.595921 | 0.0 | 213.0 |
| 50% | 16.886956 | 26.191091 | 16.498849 | 0.468936 | 0.51899 | 0.311063 | 0.237299 | 0.0 | 0.74562 | 1.869197 | 0.0 | 403.0 |
| 75% | 30.321443 | 58.139272 | 26.376207 | 1.375373 | 0.692219 | 0.456611 | 0.244227 | 0.146225 | 0.907119 | 2.190905 | 1.0 | 561.0 |
| max | 67.570797 | 238.504743 | 122.774776 | 7.888105 | 0.989526 | 0.997439 | 11.360418 | 1.0 | 1.0 | 6.568486 | 1.0 | 715.0 |


In [21]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Data rows are split by `SiteID`, so that observations made at the same site (during different storms) are never assigned to both the training and the test set.
Calling `unique()` on the `SiteID` column is therefore essential: it makes `train_test_split` partition sites rather than individual rows. Without it, the same site could end up in both sets.
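
The site-wise split described above can be sketched on toy data as follows. The frame `toy` and the names `train_ids`, `test_ids`, `train_rows` and `test_rows` are hypothetical, used only for illustration; the `train_test_split` arguments mirror the ones used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 hypothetical sites with 3 storm observations each.
toy = pd.DataFrame({
    "SiteID": [site for site in range(10) for _ in range(3)],
    "peak_i15_mmh": range(30),
})

# Split the unique site IDs, not the rows.
train_ids, test_ids = train_test_split(
    toy["SiteID"].unique(), test_size=0.20, shuffle=True, random_state=2
)

# Select all rows belonging to each group of sites.
train_rows = toy[toy["SiteID"].isin(train_ids)]
test_rows = toy[toy["SiteID"].isin(test_ids)]

# No site contributes rows to both sets, and no row is lost.
shared_sites = set(train_rows["SiteID"]) & set(test_rows["SiteID"])
print(shared_sites, len(train_rows), len(test_rows))
```

Note that `test_size=0.20` here is the fraction of sites, not of rows, so the row-level proportion can differ when sites have unequal numbers of observations.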

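In [33]:
import pickle
# Loads a previously pickled [trainsites, testsites] pair.
# Note: the next cell overwrites both variables with a fresh split.
with open("/tmp/testsites.pkl", "rb") as fh:
    trainsites, testsites = pickle.load(fh)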

In [42]:
# Split the unique site IDs (not the rows) into training and test sites.
trainsites, testsites = train_test_split(cdata["SiteID"].unique(), test_size=0.20, shuffle=True, random_state=2)

# Boolean row masks: True where the row's site belongs to the given set.
trainmask = cdata["SiteID"].isin(trainsites)
testmask = cdata["SiteID"].isin(testsites)
trainX = cdata[trainmask].drop(columns=["response", "SiteID"])
trainY = cdata[trainmask]["response"]

testX = cdata[testmask].drop(columns=["response", "SiteID"])
testY = cdata[testmask]["response"]

Double-check that the train and test sites do not overlap in `SiteID` (the intersection should be empty):

In [43]:
set(trainsites) & set(testsites)

set()

Save the split to disk. All ML models in subsequent notebooks will use this same train-test split.

In [44]:
import pickle
# Use a context manager so the file is flushed and closed after writing.
with open("staley16+addtl_feats_split.pkl", "wb") as fh:
    pickle.dump([trainX, trainY, testX, testY], fh)
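
A later notebook would presumably read the split back by unpacking the pickled list in the same order. The following is a minimal round-trip sketch with toy stand-ins; the file name `toy_split.pkl` and the variable names ending in `2` are hypothetical, and a temporary directory is used so nothing is written next to the real data.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-ins for [trainX, trainY, testX, testY] (made-up values).
toy_split = [
    pd.DataFrame({"kf": [0.15, 0.24]}),
    pd.Series([0, 1], name="response"),
    pd.DataFrame({"kf": [0.20]}),
    pd.Series([1], name="response"),
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_split.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(toy_split, fh)
    # Unpack in the same order in which the list was written.
    with open(path, "rb") as fh:
        trainX2, trainY2, testX2, testY2 = pickle.load(fh)

print(trainX2.equals(toy_split[0]), testY2.equals(toy_split[3]))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.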

Fit the scaler on the training features and save it to disk. This is to make sure that all models use the same feature scaling.

In [45]:
ssc = StandardScaler()
# Fit on the training features only; the transformed array is not needed here.
ssc.fit(trainX)
with open("feature_scaler.pkl", "wb") as fh:
    pickle.dump(ssc, fh)
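
For reference, here is a minimal sketch of how a scaler fitted on the training features would typically be applied to both sets downstream. The toy frames and the names `toy_train`, `toy_test`, `scaler`, `train_scaled` and `test_scaled` are hypothetical; in a later notebook the scaler would instead be unpickled from `feature_scaler.pkl`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features drawn at random (illustrative only).
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({
    "stormdur_h": rng.uniform(0, 60, 50),
    "kf": rng.uniform(0, 0.5, 50),
})
toy_test = pd.DataFrame({
    "stormdur_h": rng.uniform(0, 60, 10),
    "kf": rng.uniform(0, 0.5, 10),
})

# Fit on the training features only, then reuse the same scaler for both sets.
scaler = StandardScaler().fit(toy_train)
train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)

# Training columns end up with zero mean and unit variance;
# test columns generally do not, because they reuse the training statistics.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.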