# FEMA Disaster Cost Forecasting
#### Capstone 2 - Data Wrangling
Michael Garber




#### Data Wrangling High-Level Steps
1. Data Collection
2. Data Organization
3. Data Definition
4. Data Cleaning


#### Data Collection


Top level FEMA data sets
- [https://www.fema.gov/about/openfema/data-sets](https://www.fema.gov/about/openfema/data-sets)

OpenFEMA Dataset: FEMA Web Disaster Declarations - v1
- info [https://www.fema.gov/openfema-data-page/fema-web-disaster-declarations-v1](https://www.fema.gov/openfema-data-page/fema-web-disaster-declarations-v1) \
- data [https://www.fema.gov/api/open/v1/FemaWebDisasterDeclarations.csv](https://www.fema.gov/api/open/v1/FemaWebDisasterDeclarations.csv)

OpenFEMA Dataset: FEMA Web Disaster Summaries - v1
- info [https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1](https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1) \
- data [https://www.fema.gov/api/open/v1/FemaWebDisasterSummaries.csv](https://www.fema.gov/api/open/v1/FemaWebDisasterSummaries.csv)

ClimRR
- info [https://climrr.anl.gov/](https://climrr.anl.gov/) \
*just a reference - may be used for climate prediction data if needed*

In [4]:
#Import packages
import pandas as pd
import requests
import os

In [5]:
#Data Download locations
disasterInfoUrl = 'https://www.fema.gov/api/open/v1/FemaWebDisasterDeclarations.csv'
disasterCostUrl = 'https://www.fema.gov/api/open/v1/FemaWebDisasterSummaries.csv'
rawDataDir = '../data/raw/'
femaInfoPath = rawDataDir + 'FemaWebDisasterDeclarations.csv'
femaCostPath = rawDataDir + 'FemaWebDisasterSummaries.csv'

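#Create the raw data folder if it does not exist yet (avoids FileNotFoundError on write)
os.makedirs(rawDataDir, exist_ok=True)

#Download Disaster Info data locally
r = requests.get(disasterInfoUrl)
r.raise_for_status()  #stop on an HTTP error instead of saving an error page
with open(femaInfoPath, 'wb') as f:
    f.write(r.content)

#Download Disaster Cost data locally
r = requests.get(disasterCostUrl)
r.raise_for_status()  #stop on an HTTP error instead of saving an error page
with open(femaCostPath, 'wb') as f:
    f.write(r.content)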

In [None]:
#Loading to pandas dataframe

#Load FEMA disaster info data
femaInfo = pd.read_csv(femaInfoPath)

#Load FEMA disaster cost (federal financial assistance) data
femaCosts = pd.read_csv(femaCostPath)

In [None]:
#check loaded data shape
print("femaInfo rows, cols: " + str(femaInfo.shape))
print("femaCosts rows, cols: " + str(femaCosts.shape))

In [None]:
#check the head - fema info
femaInfo.head()

In [None]:
#check the head - fema costs
femaCosts.head()

In [None]:
#Pre-join check

#check that join fields are unique before join - femainfo.disasterNumber
print("disaster number is all unique (femaInfo)? " + str(len(femaInfo) == femaInfo['disasterNumber'].nunique()))

#check that join fields are unique before join - femacosts.disasterNumber
print("disaster number is all unique (femaCosts)? " + str(len(femaCosts) == femaCosts['disasterNumber'].nunique()))
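
If either uniqueness check prints False, the next question is which disaster numbers repeat. The following is a minimal sketch on toy data; the helper name `duplicateKeyRows` and the sample values are hypothetical and not part of the FEMA data.

```python
import pandas as pd

def duplicateKeyRows(df, key):
    """Return every row whose key value appears more than once, sorted by key."""
    mask = df[key].duplicated(keep=False)
    return df.loc[mask].sort_values(key)

# Toy data for illustration only (not real FEMA records)
toy = pd.DataFrame({
    'disasterNumber': [1001, 1002, 1002, 1003],
    'stateCode': ['TX', 'FL', 'FL', 'CA'],
})

dupes = duplicateKeyRows(toy, 'disasterNumber')
print(dupes)
```

In the notebook this could be called as `duplicateKeyRows(femaInfo, 'disasterNumber')`; an empty result means the key is unique.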


In [None]:
#Data Joining - femaInfo & femaCosts 
print("Rows in femaInfo: " + str(len(femaInfo)))
print("Rows in femaCosts: " + str(len(femaCosts)))
print("*Note: There are more rows in the disaster [info] dataset than in the disaster [costs] dataset;" + '\n' + "this suggests some disasters do not have cost info. Perhaps no requests to FEMA were made.")

In [None]:
#Join disaster info with disaster costs via the 'disasterNumber' column
femaMasterData = pd.merge(femaInfo, femaCosts, how='left', on='disasterNumber')

#check joined data set
print("(Rows, columns) in femaMasterData" + '\n' + str(femaMasterData.shape))
femaMasterData.head()
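
One way to confirm how many disasters have no cost record is to let the merge itself report it. This is a sketch on toy frames, assuming the same `disasterNumber` key; `infoToy` and `costToy` are made-up stand-ins. The `validate` and `indicator` arguments are standard `pd.merge` options.

```python
import pandas as pd

# Toy frames for illustration only (not real FEMA records)
infoToy = pd.DataFrame({
    'disasterNumber': [1, 2, 3],
    'incidentType': ['Flood', 'Fire', 'Hurricane'],
})
costToy = pd.DataFrame({
    'disasterNumber': [1, 3],
    'totalObligatedAmountPa': [100.0, 250.0],
})

# validate raises MergeError if either side has repeated keys;
# indicator adds a '_merge' column marking where each row came from
joined = pd.merge(infoToy, costToy, how='left', on='disasterNumber',
                  validate='one_to_one', indicator=True)

# rows marked 'left_only' are disasters with no matching cost row
unmatched = (joined['_merge'] == 'left_only').sum()
print("Disasters without cost rows: " + str(unmatched))
```

With a left join the row count should equal the row count of the left frame, so `len(joined) == len(infoToy)` is a cheap sanity check.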

#### Data Organization
The project file structure is based on the cookiecutter data science template. \
[https://drivendata.github.io/cookiecutter-data-science/](https://drivendata.github.io/cookiecutter-data-science/)

Folder structure tree (GitHub) \
[https://github.com/mdgarber/FEMADisasterCostForecasting/blob/acc1f9a68773c3fa7b87325f7fb814c049f03306/femadisastercostforecasting/README.md](https://github.com/mdgarber/FEMADisasterCostForecasting/blob/acc1f9a68773c3fa7b87325f7fb814c049f03306/femadisastercostforecasting/README.md)

#### Data Definition

Column names
- Data types
- Description of the columns
- Counts and percentages of unique values
- Ranges of values
- Summary statistics
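
The items above can be collected into a single table per column. The following is a rough sketch, not the method used in this notebook; the helper name `summarizeColumns` and the toy frame are hypothetical.

```python
import pandas as pd

def summarizeColumns(df):
    """Build one row per column: dtype, missing count, unique count/percent, min, max."""
    numeric = df.select_dtypes(include='number')
    summary = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(),
        'pctUnique': (df.nunique() / len(df) * 100).round(1),
    })
    # min/max are only computed for numeric columns; other columns stay NaN
    summary['min'] = numeric.min()
    summary['max'] = numeric.max()
    return summary

# Toy data for illustration only (not real FEMA records)
toy = pd.DataFrame({
    'stateCode': ['TX', 'FL', 'TX', 'CA'],
    'amount': [10.0, None, 30.0, 30.0],
})

summary = summarizeColumns(toy)
print(summary)
```

Note that `nunique()` ignores missing values by default, so the unique counts exclude NaN.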

In [None]:
#Check data types, non-null counts, and index range
femaMasterData.info(verbose=True)

##### Description of the columns
femaInfo
- [https://www.fema.gov/openfema-data-page/fema-web-disaster-declarations-v1](https://www.fema.gov/openfema-data-page/fema-web-disaster-declarations-v1)

femaCosts
- [https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1](https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1)

In [None]:
#Value counts for important categorical-like fields
for col in ['declarationType', 'stateCode', 'stateName', 'incidentType', 'region']:
    print('\n======== ' + col + ' ========')
    print(femaMasterData[col].value_counts())


In [None]:
#Calc Summary statistics - 1
femaMasterData.describe()

#### Data Cleaning

- The joined data set has 4865 rows, and every column that appears to be needed for analysis is fully populated except for some of the cost fields.
- Missing values in the "totalNumber...", "totalAmount...", and "totalObligated..." fields will be set to 0, since most of these fields represent money (in USD) spent or authorized for spending.
- Duplicate disasterNumber values were checked in the pre-join steps.
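
Before filling, it can help to see how many values are actually missing in each field. This is a minimal sketch on toy data; the helper name `missingByColumn` and the sample values are hypothetical.

```python
import pandas as pd

def missingByColumn(df):
    """Return count and percent of missing values for columns that have any."""
    counts = df.isna().sum()
    counts = counts[counts > 0]
    return pd.DataFrame({
        'missing': counts,
        'pctMissing': (counts / len(df) * 100).round(1),
    }).sort_values('missing', ascending=False)

# Toy data for illustration only (not real FEMA records)
toy = pd.DataFrame({
    'disasterNumber': [1, 2, 3, 4],
    'totalAmountHaApproved': [500.0, None, None, 120.0],
    'totalObligatedAmountPa': [None, 80.0, 90.0, 100.0],
})

report = missingByColumn(toy)
print(report)
```

Running `missingByColumn(femaMasterData)` before and after the fill step would show the "total..." fields dropping out of the report.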

In [None]:
#missing or NA values (cost values should not be null...setting to zero)
#assign the filled columns back to the frame; calling fillna(inplace=True) on a
#single selected column is chained assignment, which newer pandas versions warn
#about and may not apply to the original frame
totalCols = [
    'totalNumberIaApproved',
    'totalAmountHaApproved',
    'totalAmountIhpApproved',
    'totalAmountOnaApproved',
    'totalObligatedAmountPa',
    'totalObligatedAmountCatAb',
    'totalObligatedAmountCatC2g',
    'totalObligatedAmountHmgp',
]
femaMasterData[totalCols] = femaMasterData[totalCols].fillna(0)

#drop columns
#...will drop after EDA so that I confirm which fields aren't useful

In [None]:
#check that the "total..." fields are no longer NULL/NaN
femaMasterData.info()

In [None]:
#Export cleaned df to file
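
The export step is left empty above. One possible way to write it is sketched here under stated assumptions: the helper name `exportFrame`, the output folder, and the file name are all hypothetical, and the demo writes to a temporary directory rather than the project tree.

```python
import os
import tempfile
import pandas as pd

def exportFrame(df, outDir, fileName):
    """Write df to outDir/fileName as CSV, creating outDir if needed; return the path."""
    os.makedirs(outDir, exist_ok=True)
    outPath = os.path.join(outDir, fileName)
    df.to_csv(outPath, index=False)  # index=False keeps the row index out of the file
    return outPath

# Toy data for illustration only (not real FEMA records)
toy = pd.DataFrame({
    'disasterNumber': [1, 2],
    'totalObligatedAmountPa': [100.0, 0.0],
})

# Demo: write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = exportFrame(toy, os.path.join(tmp, 'interim'), 'femaMasterData_clean.csv')
    roundTrip = pd.read_csv(path)

print(roundTrip.shape)
```

In this project the call might look like `exportFrame(femaMasterData, '../data/interim/', 'femaMasterData_clean.csv')`; the `interim` folder follows the cookiecutter layout, but both the folder choice and the file name are assumptions.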
