# FEMA Disaster Cost Forecasting
#### Capstone 2 - Data Wrangling
Michael Garber




#### Data Wrangling High-Level Steps
1. Data Collection
2. Data Organization
3. Data Definition
4. Data Cleaning


#### Data Collection


Top-level FEMA data sets
- [https://www.fema.gov/about/openfema/data-sets](https://www.fema.gov/about/openfema/data-sets)

OpenFEMA Dataset: FEMA Web Disaster Declarations - v1
- info [https://www.fema.gov/openfema-data-page/fema-web-disaster-declarations-v1](https://www.fema.gov/openfema-data-page/fema-web-disaster-declarations-v1) \
- data [https://www.fema.gov/api/open/v1/FemaWebDisasterDeclarations.csv](https://www.fema.gov/api/open/v1/FemaWebDisasterDeclarations.csv)

OpenFEMA Dataset: FEMA Web Disaster Summaries - v1
- info [https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1](https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1) \
- data [https://www.fema.gov/api/open/v1/FemaWebDisasterSummaries.csv](https://www.fema.gov/api/open/v1/FemaWebDisasterSummaries.csv)

ClimRR
- info [https://climrr.anl.gov/](https://climrr.anl.gov/) \
*just a reference - may be used for climate prediction data if needed*

In [1]:
#Import packages
import pandas as pd
import requests
import os

In [2]:
#Data Download locations
disasterInfoUrl = 'https://www.fema.gov/api/open/v1/FemaWebDisasterDeclarations.csv'
disasterCostUrl = 'https://www.fema.gov/api/open/v1/FemaWebDisasterSummaries.csv'
rawDataDir = '../data/raw/'
femaInfoPath = rawDataDir + 'FemaWebDisasterDeclarations.csv'
femaCostPath = rawDataDir + 'FemaWebDisasterSummaries.csv'

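#Make sure the raw data directory exists before writing to it
os.makedirs(rawDataDir, exist_ok=True)

#Download Disaster Info data locally (raise on HTTP errors instead of saving an error page)
r = requests.get(disasterInfoUrl)
r.raise_for_status()
with open(femaInfoPath, 'wb') as f:
    f.write(r.content)

#Download Disaster Cost data locally
r = requests.get(disasterCostUrl)
r.raise_for_status()
with open(femaCostPath, 'wb') as f:
    f.write(r.content)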

In [3]:
#Loading to pandas dataframe

#Load FEMA disaster info data
femaInfo = pd.read_csv(femaInfoPath)

#Load FEMA disaster cost (federal financial assistance) data
femaCosts = pd.read_csv(femaCostPath)

In [4]:
#check loaded data shape
print("femaInfo rows, cols: " + str(femaInfo.shape))
print("femaCosts rows, cols: " + str(femaCosts.shape))

femaInfo rows, cols: (4865, 20)
femaCosts rows, cols: (3621, 14)


In [5]:
#check the head - fema info
femaInfo.head()

Unnamed: 0,disasterNumber,declarationDate,disasterName,incidentBeginDate,incidentEndDate,declarationType,stateCode,stateName,incidentType,entryDate,updateDate,closeoutDate,region,ihProgramDeclared,iaProgramDeclared,paProgramDeclared,hmProgramDeclared,id,hash,lastRefresh
0,4734,2023-08-31T00:00:00.000Z,HURRICANE IDALIA,2023-08-27T00:00:00.000Z,2023-09-04T00:00:00.000Z,Major Disaster,FL,Florida,Hurricane,2023-08-31T00:00:00.000Z,2023-10-02T00:00:00.000Z,,4,1.0,0.0,1.0,1.0,7039b9e8-8e40-411a-b3bd-ecb27b37d535,b2327cb14c124443d7e00b898be990718576195f,2023-10-02T22:21:25.390Z
1,4738,2023-09-07T00:00:00.000Z,HURRICANE IDALIA,2023-08-30T00:00:00.000Z,2023-08-30T00:00:00.000Z,Major Disaster,GA,Georgia,Hurricane,2023-09-07T00:00:00.000Z,2023-10-02T00:00:00.000Z,,4,1.0,0.0,1.0,1.0,6f1316f1-788f-4763-9bc5-3c5d47e65f55,3b10fdf0c825cd0bc81e6f83955c85d9b32057e5,2023-10-02T22:41:25.957Z
2,4744,2023-10-06T00:00:00.000Z,SEVERE STORMS AND FLOODING,2023-08-03T00:00:00.000Z,2023-08-05T00:00:00.000Z,Major Disaster,VT,Vermont,Flood,2023-10-06T00:00:00.000Z,2023-10-06T00:00:00.000Z,,1,0.0,0.0,1.0,1.0,f8a460e1-772c-4efe-91c3-bd7eed5df61e,314ca84418ff5016ef5c0fc7a1e32a6106b7e18b,2023-10-06T16:41:28.824Z
3,4745,2023-10-11T00:00:00.000Z,FLOODING,2023-06-01T00:00:00.000Z,2023-06-08T00:00:00.000Z,Major Disaster,MT,Montana,Flood,2023-10-11T00:00:00.000Z,2023-10-11T00:00:00.000Z,,8,0.0,0.0,1.0,1.0,889564e8-549a-4123-88ba-a1d9b7b0261b,96fe8a62931b20e70e8a1779545c9467c66d29f8,2023-10-11T22:02:12.546Z
4,3404,2018-09-12T00:00:00.000Z,TROPICAL STORM OLIVIA,2018-09-09T00:00:00.000Z,2018-09-13T00:00:00.000Z,Emergency,HI,Hawaii,Hurricane,2018-09-12T00:00:00.000Z,2023-10-12T00:00:00.000Z,2023-10-11T00:00:00.000Z,9,0.0,0.0,1.0,0.0,5aa4296a-7d77-430f-a513-b7a433f6d305,87732b7d7d5a424fad84c04f5edf0bff4e860f50,2023-10-12T11:21:25.048Z


In [6]:
#check the head - fema costs
femaCosts.head()

Unnamed: 0,disasterNumber,totalNumberIaApproved,totalAmountIhpApproved,totalAmountHaApproved,totalAmountOnaApproved,totalObligatedAmountPa,totalObligatedAmountCatAb,totalObligatedAmountCatC2g,paLoadDate,iaLoadDate,totalObligatedAmountHmgp,hash,lastRefresh,id
0,3601,,,,,,,,,,,3de68baba960e69da445cf822d3dd859081fb34a,2023-10-09T23:02:26.341Z,faafecca-0f76-4fb8-8ffd-b6f46f3b712c
1,3602,,,,,,,,,,,58566c446fce5cabbd0c3412a6bb3daa4ada1993,2023-10-09T23:02:26.341Z,b74f0dc2-fab5-42b9-acf7-c94df14d85ad
2,1267,,,,,,,,,,2167204.0,1ba476aa5b95b344e79f9ed5c0b1442849ccb5e0,2023-03-18T13:22:12.883Z,0f5ed8dd-d8e5-4328-9155-372206b47182
3,1270,,,,,,,,,,319783.0,bb3c5cb8faf9c3e5606bd3f86ff09da517907068,2023-03-18T13:22:12.883Z,df214994-59d2-4075-b52c-894f7f4b358e
4,1290,,,,,,,,,,200782.0,fbe5b8caa213cd388e1d7773b121c3c74559a5fb,2023-03-18T13:22:12.883Z,ffca81a5-0830-4f1c-b088-0ff4d5bb8e8a


In [7]:
#Pre-join check

#check that the join field is unique before the join - femaInfo.disasterNumber
print("disaster number is all unique (femaInfo)? " + str(len(femaInfo) == femaInfo['disasterNumber'].nunique()))

#check that the join field is unique before the join - femaCosts.disasterNumber
print("disaster number is all unique (femaCosts)? " + str(len(femaCosts) == femaCosts['disasterNumber'].nunique()))


disaster number is all unique (femaInfo)? True
disaster number is all unique (femaCosts)? True
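
The uniqueness check above can also be enforced at merge time. The following is a minimal sketch, assuming tiny hypothetical stand-ins (`info`, `costs`) for `femaInfo` and `femaCosts` with made-up values; it uses `Series.is_unique` for the check and the `validate` argument of `pd.merge`, which raises `pandas.errors.MergeError` if the key is duplicated on either side.

```python
import pandas as pd

# Hypothetical miniature stand-ins for femaInfo and femaCosts (values are made up)
info = pd.DataFrame({"disasterNumber": [101, 102, 103],
                     "stateCode": ["FL", "GA", "HI"]})
costs = pd.DataFrame({"disasterNumber": [101, 102],
                      "totalObligatedAmountPa": [1000.0, 2500.0]})

# is_unique is True only when no key value repeats
info_keys_unique = info["disasterNumber"].is_unique
costs_keys_unique = costs["disasterNumber"].is_unique
print("info keys unique:", info_keys_unique)
print("costs keys unique:", costs_keys_unique)

# validate="one_to_one" makes the merge itself fail if either side has duplicate keys
merged = pd.merge(info, costs, how="left", on="disasterNumber", validate="one_to_one")

# Show the failure mode by deliberately duplicating one cost row
dup_costs = pd.concat([costs, costs.iloc[[0]]], ignore_index=True)
try:
    pd.merge(info, dup_costs, how="left", on="disasterNumber", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print("duplicate key detected:", duplicate_detected)
```

With `validate` in place, a later refresh of the source files that introduces duplicate disaster numbers would fail loudly instead of silently multiplying rows in the join.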


In [8]:
#Data Joining - femaInfo & femaCosts
print("Rows in femaInfo: " + str(len(femaInfo)))
print("Rows in femaCosts: " + str(len(femaCosts)))
print("*Note: There are more rows in the disaster [info] dataset than the disaster [costs]," + '\n' + "this suggests some disasters do not have cost info. Perhaps no requests were made to FEMA.")

Rows in femaInfo: 4865
Rows in femaCosts: 3621
*Note: There are more rows in the disaster [info] dataset than the disaster [costs],
this suggests some disasters do not have cost info. Perhaps no requests were made to FEMA.


In [9]:
#Join disaster info with disaster costs via the 'disasterNumber' column
femaMasterData = pd.merge(femaInfo, femaCosts, how='left', on='disasterNumber')

#check joined data set
print("(Rows, columns) in femaMasterData" + '\n' + str(femaMasterData.shape))
femaMasterData.head()

(Rows, columns) in femaMasterData
(4865, 33)


Unnamed: 0,disasterNumber,declarationDate,disasterName,incidentBeginDate,incidentEndDate,declarationType,stateCode,stateName,incidentType,entryDate,...,totalAmountOnaApproved,totalObligatedAmountPa,totalObligatedAmountCatAb,totalObligatedAmountCatC2g,paLoadDate,iaLoadDate,totalObligatedAmountHmgp,hash_y,lastRefresh_y,id_y
0,4734,2023-08-31T00:00:00.000Z,HURRICANE IDALIA,2023-08-27T00:00:00.000Z,2023-09-04T00:00:00.000Z,Major Disaster,FL,Florida,Hurricane,2023-08-31T00:00:00.000Z,...,26522249.47,298020800.0,278159500.0,118465.0,2024-04-21T00:00:00.000Z,2024-04-21T00:00:00.000Z,3210378.19,a8295fd115c373be2a8a3fecfbdc7ea7d765506b,2024-04-21T05:23:24.089Z,c89d55b6-497c-4d62-9d44-f3642270ba16
1,4738,2023-09-07T00:00:00.000Z,HURRICANE IDALIA,2023-08-30T00:00:00.000Z,2023-08-30T00:00:00.000Z,Major Disaster,GA,Georgia,Hurricane,2023-09-07T00:00:00.000Z,...,586041.75,28085050.0,17143520.0,9603108.46,2024-04-21T00:00:00.000Z,2024-04-21T00:00:00.000Z,0.0,20722d7dfc52cf540367410736a56a0ee3d51147,2024-04-21T05:23:24.089Z,c0dc9a9e-2492-4ac5-8fd0-60e921d44410
2,4744,2023-10-06T00:00:00.000Z,SEVERE STORMS AND FLOODING,2023-08-03T00:00:00.000Z,2023-08-05T00:00:00.000Z,Major Disaster,VT,Vermont,Flood,2023-10-06T00:00:00.000Z,...,,119083.5,18867.75,16300.72,2024-04-21T00:00:00.000Z,,0.0,2ed7d6399909cb4bce8008e2e61baee221d2384c,2024-04-21T03:43:05.656Z,a10a4a1a-50bc-4b4f-8a60-7bb8ce5e83d7
3,4745,2023-10-11T00:00:00.000Z,FLOODING,2023-06-01T00:00:00.000Z,2023-06-08T00:00:00.000Z,Major Disaster,MT,Montana,Flood,2023-10-11T00:00:00.000Z,...,,2340303.0,139534.4,1954442.82,2024-04-21T00:00:00.000Z,,0.0,9e734c367377607787e241e41c07b3c9aa7b89bd,2024-04-21T03:43:05.656Z,2f90c006-f204-4084-920e-1bf2d063b3aa
4,3404,2018-09-12T00:00:00.000Z,TROPICAL STORM OLIVIA,2018-09-09T00:00:00.000Z,2018-09-13T00:00:00.000Z,Emergency,HI,Hawaii,Hurricane,2018-09-12T00:00:00.000Z,...,,,,,,,,,,
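
Because this is a left join, disasters without a cost record keep their info columns and get NaN in every cost column, as in the last row above. One way to count those rows is the `indicator` argument of `pd.merge`. The sketch below assumes small hypothetical frames (`info`, `costs`) with made-up values rather than the real FEMA data.

```python
import pandas as pd

# Hypothetical miniature stand-ins for femaInfo and femaCosts (values are made up)
info = pd.DataFrame({
    "disasterNumber": [1, 2, 3, 4],
    "declarationType": ["Major Disaster", "Emergency", "Major Disaster", "Fire Management"],
})
costs = pd.DataFrame({
    "disasterNumber": [1, 3],
    "totalObligatedAmountPa": [100.0, 250.0],
})

# indicator=True adds a "_merge" column: "both" for matched rows, "left_only" for unmatched
checked = pd.merge(info, costs, how="left", on="disasterNumber", indicator=True)

match_counts = checked["_merge"].value_counts()
n_without_costs = int((checked["_merge"] == "left_only").sum())

# Which declaration types are missing cost records?
missing_by_type = checked.loc[checked["_merge"] == "left_only", "declarationType"].value_counts()

print(match_counts)
print("disasters without a cost record:", n_without_costs)
print(missing_by_type)
```

Breaking the unmatched rows down by a field such as `declarationType` may help show whether the missing cost records cluster in particular kinds of declarations.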


#### Data Organization
Project file structure based on the cookiecutter data science template. \
[https://drivendata.github.io/cookiecutter-data-science/](https://drivendata.github.io/cookiecutter-data-science/)

Folder structure tree (GitHub) \
[https://github.com/mdgarber/FEMADisasterCostForecasting/blob/acc1f9a68773c3fa7b87325f7fb814c049f03306/femadisastercostforecasting/README.md](https://github.com/mdgarber/FEMADisasterCostForecasting/blob/acc1f9a68773c3fa7b87325f7fb814c049f03306/femadisastercostforecasting/README.md)

#### Data Definition

- Column names
- Data types
- Description of the columns
- Counts and percentages of unique values
- Ranges of values
- Summary statistics
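
Several of the items above can be gathered into a single per-column table. This is a minimal sketch, not the notebook's own method: `profile_columns` is a hypothetical helper, and `sample` is a made-up frame standing in for `femaMasterData`.

```python
import pandas as pd

def profile_columns(df):
    """Return one row per column: dtype, non-null count, unique count, percent unique."""
    n_rows = len(df)
    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })
    profile["pct_unique"] = (100 * profile["n_unique"] / n_rows).round(1)
    return profile

# Made-up sample standing in for femaMasterData
sample = pd.DataFrame({
    "disasterNumber": [1, 2, 3, 4],
    "stateCode": ["CA", "CA", "TX", None],
    "cost": [10.0, None, 30.0, 20.0],
})

profile = profile_columns(sample)
print(profile)

# Ranges and summary statistics for the numeric columns
numeric_ranges = sample.describe().loc[["min", "max", "mean"]]
print(numeric_ranges)
```

On the real data this would be called as `profile_columns(femaMasterData)`; a column with `pct_unique` near 100 (such as an id or hash) is unlikely to be useful as a categorical feature.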

In [10]:
#Check data types, non-null counts, and index range
femaMasterData.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4865 entries, 0 to 4864
Data columns (total 33 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   disasterNumber              4865 non-null   int64  
 1   declarationDate             4865 non-null   object 
 2   disasterName                4865 non-null   object 
 3   incidentBeginDate           4865 non-null   object 
 4   incidentEndDate             4599 non-null   object 
 5   declarationType             4865 non-null   object 
 6   stateCode                   4865 non-null   object 
 7   stateName                   4865 non-null   object 
 8   incidentType                4865 non-null   object 
 9   entryDate                   4865 non-null   object 
 10  updateDate                  4865 non-null   object 
 11  closeoutDate                3910 non-null   object 
 12  region                      4865 non-null   int64  
 13  ihProgramDeclared           4614 
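
The `info()` output lists the date fields (`declarationDate`, `incidentBeginDate`, `incidentEndDate`, and so on) as `object`, meaning they were loaded as strings. The sketch below shows one way they could be converted with `pd.to_datetime`. It assumes the ISO 8601 format seen in the `head()` output; the `sample` frame and the `incidentDays` column are illustrative, and one end date is deliberately left missing to mirror the nulls in `incidentEndDate`.

```python
import pandas as pd

# Illustrative rows in the same timestamp format as the head() output;
# the second incidentEndDate is left missing on purpose
sample = pd.DataFrame({
    "declarationDate": ["2023-08-31T00:00:00.000Z", "2018-09-12T00:00:00.000Z"],
    "incidentBeginDate": ["2023-08-27T00:00:00.000Z", "2018-09-09T00:00:00.000Z"],
    "incidentEndDate": ["2023-09-04T00:00:00.000Z", None],
})

date_cols = ["declarationDate", "incidentBeginDate", "incidentEndDate"]
for col in date_cols:
    # utc=True keeps the trailing "Z" meaning; errors="coerce" turns unparseable values into NaT
    sample[col] = pd.to_datetime(sample[col], utc=True, errors="coerce")

# Hypothetical derived column: incident duration in whole days (NaN when the end date is missing)
sample["incidentDays"] = (sample["incidentEndDate"] - sample["incidentBeginDate"]).dt.days

print(sample.dtypes)
print(sample[["incidentBeginDate", "incidentEndDate", "incidentDays"]])
```

Once the columns are real datetimes, ranges of values (earliest and latest declaration, for example) can be read directly from `min()` and `max()`.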

##### Description of the columns
femaInfo
- [https://www.fema.gov/openfema-data-page/fema-web-disaster-declarations-v1](https://www.fema.gov/openfema-data-page/fema-web-disaster-declarations-v1)

femaCosts
- [https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1](https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1)

In [11]:
#Value counts for important categorical-like fields
print('\n======== declarationType ========')
print(femaMasterData['declarationType'].value_counts())

print('\n======== stateCode ========')
print(femaMasterData['stateCode'].value_counts())

print('\n======== stateName ========')
print(femaMasterData['stateName'].value_counts())

print('\n======== incidentType ========')
print(femaMasterData['incidentType'].value_counts())

print('\n======== region ========')
print(femaMasterData['region'].value_counts())

femaMasterData['declarationType'].value_counts()



declarationType
Major Disaster      2771
Fire Management     1027
Emergency            604
Fire Suppression     463
Name: count, dtype: int64

stateCode
CA    375
TX    374
OK    227
WA    202
FL    177
OR    146
NM    113
NY    112
AZ    110
LA    106
CO    102
AL    101
NV    101
MT    101
MS     93
SD     91
TN     90
KY     88
KS     84
AK     83
AR     82
WV     78
MN     78
MO     78
NE     77
NC     74
GA     74
IA     74
VA     73
ND     70
ME     70
HI     67
IL     67
PA     63
NH     61
OH     59
VT     58
NJ     58
MA     57
WI     54
ID     54
IN     52
UT     52
PR     48
MI     44
SC     41
WY     40
CT     40
MD     37
RI     31
VI     31
FM     26
MP     26
DE     25
DC     23
GU     22
AS     17
MH      7
PW      1
Name: count, dtype: int64

stateName
California                        375
Texas                             374
Oklahoma                          227
Washington                        202
Florida                           177
Oregon                       

declarationType
Major Disaster      2771
Fire Management     1027
Emergency            604
Fire Suppression     463
Name: count, dtype: int64
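
The counts above can be paired with percentages by passing `normalize=True` to `value_counts`. A minimal sketch on a made-up `declarationType` series (not the real distribution):

```python
import pandas as pd

# Made-up values standing in for femaMasterData['declarationType']
declaration_types = pd.Series(
    ["Major Disaster", "Major Disaster", "Fire Management", "Emergency"],
    name="declarationType",
)

counts = declaration_types.value_counts()
# normalize=True returns fractions that sum to 1; scale to percent
shares = declaration_types.value_counts(normalize=True).mul(100).round(1)

# Align counts and percentages on the category label
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

The same pattern applies to `stateCode`, `incidentType`, and `region`.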

#### Data Cleaning

- All 4865 rows contain the values that appear to be needed for analysis, except for some of the cost fields
- Missing values in the "totalAmount..." and "totalObligated..." fields will be set to 0, since most of these fields represent money (in USD) spent or authorized for spending
- Duplicate disasterNumber values were checked for in the pre-join steps

In [12]:
#missing or NA values (cost values should not be null...setting to zero)
#assign the result back instead of calling fillna(inplace=True) on a single column,
#which pandas treats as chained assignment and may not update the DataFrame
costCols = [
    'totalNumberIaApproved',
    'totalAmountHaApproved',
    'totalAmountIhpApproved',
    'totalAmountOnaApproved',
    'totalObligatedAmountPa',
    'totalObligatedAmountCatAb',
    'totalObligatedAmountCatC2g',
    'totalObligatedAmountHmgp',
]
femaMasterData[costCols] = femaMasterData[costCols].fillna(0)

#drop columns
#...will drop after EDA so that I confirm which fields aren't useful
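
Filling with 0 makes a disaster that has no cost record indistinguishable from one whose recorded cost is zero. If that distinction turns out to matter during EDA, one option is to record it in a flag before filling. The sketch below assumes a made-up `sample` frame, and `hasCostRecord` is a hypothetical column name that does not exist in the FEMA data.

```python
import pandas as pd

nan = float("nan")

# Made-up sample standing in for femaMasterData
sample = pd.DataFrame({
    "disasterNumber": [1, 2, 3],
    "totalObligatedAmountPa": [1500.0, nan, 0.0],
    "totalObligatedAmountHmgp": [nan, nan, 200.0],
})

# Select the cost columns by their shared prefixes
cost_cols = [c for c in sample.columns if c.startswith(("totalAmount", "totalObligated"))]

# Hypothetical flag: True when at least one cost field was present before filling
sample["hasCostRecord"] = sample[cost_cols].notna().any(axis=1)

# Fill the remaining gaps with zero, assigning the result back to the frame
sample[cost_cols] = sample[cost_cols].fillna(0)

print(sample)
```

Here disaster 2 ends up with zero in both cost fields but `hasCostRecord` set to False, so it can still be separated from disaster 3, which has a genuine recorded value of 0.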

In [13]:
#check that the "total..." fields are no longer NULL/NaN
femaMasterData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4865 entries, 0 to 4864
Data columns (total 33 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   disasterNumber              4865 non-null   int64  
 1   declarationDate             4865 non-null   object 
 2   disasterName                4865 non-null   object 
 3   incidentBeginDate           4865 non-null   object 
 4   incidentEndDate             4599 non-null   object 
 5   declarationType             4865 non-null   object 
 6   stateCode                   4865 non-null   object 
 7   stateName                   4865 non-null   object 
 8   incidentType                4865 non-null   object 
 9   entryDate                   4865 non-null   object 
 10  updateDate                  4865 non-null   object 
 11  closeoutDate                3910 non-null   object 
 12  region                      4865 non-null   int64  
 13  ihProgramDeclared           4614 