# FEMA Disaster Cost Forecasting
#### Capstone 2 - Pre-processing and Training Data Development
Michael Garber


* NOTE: Run the notebooks below before this one; they create the data that this notebook loads.
    * On GitHub
        * > [FEMA-DataWrangling.ipynb on GitHub](https://github.com/mdgarber/FEMADisasterCostForecasting/blob/ef70129c4bf06a38b13c61e1254fdb6a3105486b/femadisastercostforecasting/notebooks/FEMA-DataWrangling.ipynb)
        * > [FEMA-EDA.ipynb on GitHub](https://github.com/mdgarber/FEMADisasterCostForecasting/blob/ef70129c4bf06a38b13c61e1254fdb6a3105486b/femadisastercostforecasting/notebooks/FEMA-EDA.ipynb)
    * OR local path
        * > /FEMADisasterCostForecasting/femadisastercostforecasting/notebooks/FEMA-DataWrangling.ipynb
        * > /FEMADisasterCostForecasting/femadisastercostforecasting/notebooks/FEMA-EDA.ipynb



#### Pre-processing and Training Data Development High-Level Steps
1. Creating dummy features
2. Scale standardization
3. Split data into training and testing subsets

Goal: Create a cleaned development dataset you can use to complete the
modeling step of your project.

## Step 0 - Import libraries & load & clean data

In [4]:
# import libraries
import pandas as pd
import numpy as np
from IPython.display import display
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

In [5]:
# load data
femaDataCleanV2 = pd.read_csv('../data/interim/femaDataCleanV2.csv')

In [6]:
# Clean data

# Create new V3 df & Set index to disasterNumber
femaDataCleanV3 = femaDataCleanV2.set_index('disasterNumber')

# Drop the leftover CSV index column 'Unnamed: 0'
femaDataCleanV3.drop(['Unnamed: 0'], axis=1, inplace=True)

## Step 1 - Creating Dummy Features

In [8]:
# view all columns
print(femaDataCleanV3.columns)

Index(['declarationDate', 'disasterName', 'incidentBeginDate',
       'incidentEndDate', 'declarationType', 'stateCode', 'stateName',
       'incidentType', 'entryDate', 'updateDate', 'closeoutDate', 'region',
       'ihProgramDeclared', 'iaProgramDeclared', 'paProgramDeclared',
       'hmProgramDeclared', 'designatedIncidentTypes',
       'declarationRequestDate', 'id_x', 'hash_x', 'lastRefresh_x',
       'totalNumberIaApproved', 'totalAmountIhpApproved',
       'totalAmountHaApproved', 'totalAmountOnaApproved',
       'totalObligatedAmountPa', 'totalObligatedAmountCatAb',
       'totalObligatedAmountCatC2g', 'paLoadDate', 'iaLoadDate',
       'totalObligatedAmountHmgp', 'hash_y', 'lastRefresh_y', 'id_y',
       'totalDisasterCost'],
      dtype='object')


__Chosen categorical variables__
- incidentType
- stateCode
- region
- ihProgramDeclared (already a boolean)
- iaProgramDeclared (already a boolean)
- paProgramDeclared (already a boolean)
- hmProgramDeclared (already a boolean)
- designatedIncidentTypes (multi-value field; not dummy-encoded in this notebook)

In [10]:
#check for NULLs
femaDataCleanV3[['incidentType', 'stateCode', 'region','ihProgramDeclared', 'iaProgramDeclared', 'paProgramDeclared', 'hmProgramDeclared']].isnull().sum()

incidentType           0
stateCode              0
region                 0
ihProgramDeclared    251
iaProgramDeclared    251
paProgramDeclared    251
hmProgramDeclared    251
dtype: int64

__Fields with NULLs found__
- ihProgramDeclared
- iaProgramDeclared
- paProgramDeclared
- hmProgramDeclared

In [12]:
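# Handle NULLs for categorical features: fill the missing program flags with 0
nullCols = {'ihProgramDeclared': 0, 'iaProgramDeclared': 0, 'paProgramDeclared': 0, 'hmProgramDeclared': 0}
femaDataCleanV3 = femaDataCleanV3.fillna(value=nullCols)
# Cast to int so the flags are numeric 0/1 and are not dropped by the dtype filter in Step 3
femaDataCleanV3[list(nullCols)] = femaDataCleanV3[list(nullCols)].astype(int)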

In [13]:
#check for NULLs after fillna
femaDataCleanV3[['ihProgramDeclared', 'iaProgramDeclared', 'paProgramDeclared', 'hmProgramDeclared']].isnull().sum()

ihProgramDeclared    0
iaProgramDeclared    0
paProgramDeclared    0
hmProgramDeclared    0
dtype: int64
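As a minimal sketch of the fill step above: passing a dict to `fillna` fills only the listed columns and leaves NULLs elsewhere untouched. The toy values below are made up for illustration; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the notebook.
toy = pd.DataFrame({
    'ihProgramDeclared': [1.0, np.nan, 0.0],
    'paProgramDeclared': [np.nan, 1.0, 1.0],
    'disasterName': ['FLOOD', None, 'FIRE'],
})

# A dict maps each column to its own fill value; columns not listed are left alone.
fill_values = {'ihProgramDeclared': 0, 'paProgramDeclared': 0}
filled = toy.fillna(value=fill_values)

# The two flag columns have no NULLs left; disasterName still has one.
print(filled.isnull().sum())
```

Filling with 0 treats a missing flag as "program not declared", which is an assumption about the data and worth confirming against the source.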

In [14]:
# Create Dummies for incidentType, stateCode, region
femaDataCleanV3 = pd.get_dummies(femaDataCleanV3, columns=['incidentType', 'stateCode', 'region'])
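To illustrate what the call above does, here is a small sketch on a toy frame with made-up rows. Each listed column is replaced by one indicator column per distinct value, named `<column>_<value>`. When `columns=` is given, numeric columns such as a region number are encoded too.

```python
import pandas as pd

# Toy rows (hypothetical values) using three column names from the notebook.
toy = pd.DataFrame({
    'incidentType': ['Flood', 'Fire', 'Flood'],
    'region': [6, 9, 9],
    'totalDisasterCost': [1.0, 2.0, 3.0],
})

# Columns named in `columns=` are replaced by indicator columns; others pass through.
encoded = pd.get_dummies(toy, columns=['incidentType', 'region'])

print(sorted(encoded.columns))
```

Because every distinct value gets its own column, a high-cardinality field such as `stateCode` adds many columns to the frame.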

In [15]:
# designatedIncidentTypes is a multi-value field, so pd.get_dummies does not fit it directly.
# It is left unencoded in this notebook.

In [16]:
# designatedIncidentTypes Key
'''
0: Not applicable
1: Explosion
2: Straight-Line Winds
3: Tidal Wave
4: Tropical Storm
5: Winter Storm
8: Tropical Depression
A: Tsunami
B: Biological
C: Coastal Storm
D: Drought
E: Earthquake
F: Flood
G: Freezing
H: Hurricane
I: Terrorist
J: Typhoon
K: Dam/Levee Break
L: Chemical
M: Mud/Landslide
N: Nuclear
O: Severe Ice Storm
P: Fishing Losses
Q: Crop Losses
R: Fire
S: Snowstorm
T: Tornado
U: Civil Unrest
V: Volcanic Eruption
W: Severe Storm
X: Toxic Substances
Y: Human Cause
Z: Other
'''

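If `designatedIncidentTypes` is encoded later, one possible approach is a multi-hot encoding with `Series.str.get_dummies`. The sketch below assumes the field holds comma-separated codes from the key above (for example `H,W`); that format is an assumption, not something this notebook verifies. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the comma-separated format is an assumption about the field.
toy = pd.Series(['H,W', 'F', np.nan, 'W, F'], name='designatedIncidentTypes')

# Remove stray spaces so 'W, F' and 'W,F' yield the same codes.
cleaned = toy.str.replace(' ', '', regex=False)

# One 0/1 column per code; a missing value becomes a row of zeros.
multi_hot = cleaned.str.get_dummies(sep=',').add_prefix('designatedIncidentTypes_')

print(multi_hot)
```

The result could be joined back to the main frame with `DataFrame.join`, since it keeps the original index.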

## Step 2 - Scale Standardization

__Fields to Standard Scale__
- totalAmountIhpApproved
- totalAmountHaApproved
- totalAmountOnaApproved
- totalObligatedAmountPa
- totalObligatedAmountCatAb
- totalObligatedAmountCatC2g
- totalObligatedAmountHmgp

In [19]:
# Standard Scale the cost ($) fields
scaler = StandardScaler()

costName = [
    'totalAmountIhpApproved', 'totalAmountHaApproved',
    'totalAmountOnaApproved', 'totalObligatedAmountPa', 'totalObligatedAmountCatAb',
    'totalObligatedAmountCatC2g', 'totalObligatedAmountHmgp'  # exclude label "totalDisasterCost"
]

# Refit the scaler on each cost column and store the result in a new "<name>_StdScaled" column
for cost in costName:
    scaledName = cost + '_StdScaled'
    femaDataCleanV3[scaledName] = scaler.fit_transform(femaDataCleanV3[[cost]]).ravel()
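For reference, `StandardScaler` computes a z-score per column: subtract the column mean, then divide by the population standard deviation (`ddof=0`). A minimal check on made-up cost values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical cost values in a single column, shaped (n_samples, 1).
costs = np.array([[100.0], [200.0], [600.0]])

scaled = StandardScaler().fit_transform(costs)

# Manual z-score; np.std uses ddof=0 by default, matching StandardScaler.
manual = (costs - costs.mean(axis=0)) / costs.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and unit variance, which puts dollar amounts of very different magnitudes on a comparable scale.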

## Step 3 - Split data into training and testing subsets

In [21]:
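# assign X & y (drop the label from X so the target does not leak into the features)
X = femaDataCleanV3.select_dtypes(exclude='object').drop(columns=['totalDisasterCost'])
y = femaDataCleanV3['totalDisasterCost']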

# perform train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
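With no `test_size` or `train_size` given, `train_test_split` holds out 25% of the rows for testing. The sketch below shows that default on a toy frame with made-up values. It also shows an alternative ordering to the one used in this notebook: fitting the scaler on the training rows only, so test-set statistics do not influence the scaling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values); column names are borrowed from the notebook.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'totalObligatedAmountPa': rng.uniform(1e4, 1e6, size=8),
    'totalDisasterCost': rng.uniform(1e5, 1e7, size=8),
})

X = toy.drop(columns=['totalDisasterCost'])
y = toy['totalDisasterCost']

# Default split: 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit on the training rows only, then apply the same transform to both subsets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

The test subset is then scaled with the training mean and standard deviation, so its scaled values will not have exactly zero mean.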