# 4. Add random noise to storm data

Precipitation observations in the dataset were collected from rain gages up to 4 km away from the watersheds ([Staley *et al.*, 2016](https://pubs.er.usgs.gov/publication/ofr20161106)),
and only 123 unique peak 15-minute intensity (i15) values are present among the 1550 records.
The observations may therefore not be independent and identically distributed.
Random forests in particular tend to overfit to small-scale dependencies of debris flow likelihood on precipitation data during training.
Nearby watersheds sharing the exact same i15 value may exhibit a similar debris flow response, so the ML algorithm can effectively learn to assign a site to a group of similar sites based on the shared i15 value.
This notebook adds random noise to the storm features (peak intensities, duration, and accumulations) before the test-train split.
For each value, the noise is drawn uniformly between -10% and +10% of the recorded value.
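
As a rough illustration of why the noise helps, the sketch below applies multiplicative noise of up to ±10% to a toy array with repeated values and counts the unique values before and after. The array contents, the seed, and the use of `np.random.default_rng` are assumptions made for this example; the notebook itself uses the legacy `np.random` interface further down.

```python
import numpy as np

# Toy peak-i15 values (mm/h) with repeated entries and one missing value.
# These numbers are illustrative and not taken from the dataset.
i15 = np.array([11.2, 11.2, 11.2, 8.0, 8.0, np.nan])

rng = np.random.default_rng(27)  # seeded generator; the seed value is arbitrary

# Scale each value by a factor drawn uniformly from [0.9, 1.1).
factors = rng.uniform(0.9, 1.1, size=i15.shape)
noisy = i15 * factors  # NaN * factor stays NaN

valid = ~np.isnan(i15)
n_unique_before = len(np.unique(i15[valid]))
n_unique_after = len(np.unique(noisy[valid]))
max_rel_change = np.max(np.abs(noisy[valid] / i15[valid] - 1.0))

print("unique values before/after:", n_unique_before, n_unique_after)
print("largest relative change:", max_rel_change)
```

After the perturbation, sites that shared a gage reading no longer carry an identical value, so a tree split can no longer isolate them by that exact number.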

In [1]:
import geopandas as gpd
inpfile="../../data/data_v04_rocktype.parquet"
modelDataI = gpd.read_parquet(inpfile)
len(modelDataI)

1550

In [2]:
modelDataI.groupby(["database","response"])["database"].count()

database  response
Test      0           478
          1           133
Training  0           738
          1           201
Name: database, dtype: int64

In [3]:
modelDataI.columns

Index(['fire_name', 'year', 'fire_id', 'fire_segid', 'database', 'state',
       'response', 'stormdate', 'gaugedist_m', 'stormstart', 'stormend',
       'stormdur_h', 'stormaccum_mm', 'stormavgi_mmh', 'peak_i15_mmh',
       'peak_i30_mmh', 'peak_i60_mmh', 'contributingarea_km2', 'prophm23',
       'dnbr1000', 'kf', 'acc015_mm', 'acc030_mm', 'acc060_mm', 'geom', 'lon',
       'lat', 'SiteID', 'NB', 'GR', 'GS', 'SH', 'TU', 'TL', 'dom',
       'Fine fuel load', 'SAV', 'Packing ratio', 'Extinction moisture content',
       'Igneous', 'Metamorphic', 'Sedimentary', 'Unconsolidated', 'domrt'],
      dtype='object')

In [4]:
# Select every column whose name marks it as a storm feature
stormfeats = [
    feat for feat in modelDataI.columns
    if any(key in feat for key in ("peak", "acc", "dur"))
]
stormfeats

['stormdur_h',
 'stormaccum_mm',
 'peak_i15_mmh',
 'peak_i30_mmh',
 'peak_i60_mmh',
 'acc015_mm',
 'acc030_mm',
 'acc060_mm']

Number of unique combinations of storm feature values (rows with a missing value in any column are dropped first):

In [5]:
len(modelDataI.dropna()[stormfeats].drop_duplicates())

205

Number of unique peak 15-minute intensity values:

In [6]:
len(modelDataI["peak_i15_mmh"].dropna().unique())

123

In [7]:
modelDataI["peak_i15_mmh"].unique()

array([  3.2     ,   1.6     ,   9.14    ,   7.11    ,  53.848   ,
         4.      ,        nan,  13.5     ,  35.56    ,   6.1     ,
        34.54    ,  27.43    ,   8.13    ,  10.16    ,  23.37    ,
         5.08    ,  13.21    ,  36.58    ,  15.24    ,  33.53    ,
        19.3     ,   3.05    ,  14.22    ,  46.736   ,  39.624   ,
        10.4     ,   9.6     ,  20.8     ,  22.4     ,  35.      ,
        48.      ,  37.      ,  14.      ,  36.      ,  55.      ,
        92.      ,  24.      ,  43.      ,  80.      ,  88.392   ,
        16.256   ,  21.34    ,  22.35    ,  25.4     ,  18.29    ,
        43.688   ,  20.32    ,  55.88    ,  11.18    ,  42.67    ,
        46.74    ,  54.86    ,  28.448   ,  49.78    ,  32.512   ,
        12.192   ,  79.      ,  70.      ,  52.      ,  19.      ,
        28.      ,   9.143999,   4.06    ,  24.384   ,  12.19    ,
        26.42    ,  19.2     ,  56.      ,  13.6     ,  51.816   ,
        81.28    ,  30.48    ,  27.432   , 100.      ,  12.   

In [8]:
len(modelDataI["stormdur_h"].unique())

222

In [9]:
len(modelDataI["stormaccum_mm"].unique())

217
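
The i15 count above drops missing values before counting, whereas the duration and accumulation counts call `unique()` directly. Since `unique()` keeps NaN as a value of its own, those two counts may include one entry for missing data. A minimal sketch on a toy series (values are illustrative only) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy series with two distinct values and one missing entry.
s = pd.Series([1.5, 1.5, 2.0, np.nan])

count_with_nan = len(s.unique())         # NaN is counted as its own value
count_dropna = len(s.dropna().unique())  # pattern used for peak_i15_mmh
count_nunique = s.nunique()              # excludes NaN by default

print(count_with_nan, count_dropna, count_nunique)  # 3 2 2
```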

Number of observations sharing each peak 15-minute intensity value (ten most frequent values):

In [10]:
modelDataI.groupby("peak_i15_mmh")["peak_i15_mmh"].count().sort_values(ascending=False).head(n=10)

peak_i15_mmh
11.20    62
8.00     45
16.00    45
18.40    40
16.80    37
5.60     34
9.60     33
8.80     32
19.30    31
3.05     31
Name: peak_i15_mmh, dtype: int64
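
The same tally can also be obtained with `Series.value_counts()`, which sorts by frequency and skips NaN by default. The sketch below compares both approaches on a toy frame; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the peak_i15_mmh column.
df = pd.DataFrame({"peak_i15_mmh": [11.2, 11.2, 11.2, 8.0, 8.0, 16.0, np.nan]})

# Pattern used in the notebook: group, count, sort descending.
by_group = (
    df.groupby("peak_i15_mmh")["peak_i15_mmh"]
    .count()
    .sort_values(ascending=False)
)

# Shorter equivalent: value_counts() already sorts by frequency.
by_value_counts = df["peak_i15_mmh"].value_counts()

top_value = by_value_counts.index[0]
top_count = int(by_value_counts.iloc[0])
print(top_value, top_count)  # 11.2 3
```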

In [13]:
import numpy as np
np.random.seed(seed=27) # for reproducibility

In [14]:
def randomize(x, frac=0.1):
    """Perturb x by uniform noise within +/- frac of its value; NaN passes through."""
    if np.isnan(x):
        return x
    return np.random.uniform(x * (1 - frac), x * (1 + frac))

In [15]:
randomize(1)

0.9851442821037792

In [16]:
for feat in stormfeats:
    modelDataI[feat] = modelDataI[feat].map(randomize)
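
For larger tables, the element-wise `map` could be replaced by a vectorized operation. The sketch below is one possible variant under stated assumptions: `toy`, `cols`, and the `default_rng` generator are hypothetical stand-ins, not part of the notebook. It multiplies the selected columns by an array of uniform factors, and missing values remain missing because NaN times any factor is NaN.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for modelDataI; values are illustrative only.
toy = pd.DataFrame({
    "peak_i15_mmh": [11.2, 11.2, np.nan, 8.0],
    "stormdur_h": [2.0, 2.0, 5.5, np.nan],
})
cols = ["peak_i15_mmh", "stormdur_h"]

rng = np.random.default_rng(27)  # the seed value is arbitrary

# One independent factor in [0.9, 1.1) per cell of the selected columns.
factors = rng.uniform(0.9, 1.1, size=toy[cols].shape)

noisy = toy.copy()
noisy[cols] = toy[cols] * factors

print(noisy)
```

This variant also draws factors for the NaN cells, so it consumes the random stream differently from the `map` approach and will not reproduce the notebook's numbers even with the same seed.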

In [17]:
modelDataI["peak_i15_mmh"].head()

0    3.419289
1    3.148600
2    3.045470
3    1.462495
4    1.709037
Name: peak_i15_mmh, dtype: float64

In [19]:
# Write the perturbed data next to the input file under a new version name
outfile = inpfile.replace("v04_rocktype.parquet", "v05_rocktype_randn.parquet")
modelDataI.to_parquet(outfile)