# Computing in Context: Public Policy
## Project 2: Data Summarization and Data Fusion

In this project, you will demonstrate your proficiency with what we have learned about Data Summarization and Data Fusion.

The project is due Monday, November 22. To submit the project, upload the completed notebook to the COMSW1002 Courseworks following the convention UNI_proj2.ipynb.

This is an individual assignment.

You will be working with the [NOAA Storm Events Database](https://www.ncdc.noaa.gov/stormevents/). Specifically, to start with, you will be working with storm events from 2010 to 2020.

### Motivation

You are a data analyst for the [National Flood Insurance Program](https://www.fema.gov/flood-insurance). You are helping to design a campaign to encourage people to mitigate flood risk and purchase flood insurance. Part of the campaign will be increasing awareness of weather events that can contribute to flooding. Your current task is to analyze trends in property damage caused by floods.

---

To help you in this project, I have provided a string holding the base URL of the relevant datasets and a dictionary that maps each year to the name of the file that contains that year's storm events dataset.

In [1]:
import pandas as pd 

url = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

#Observation: On November 11 some of the original links were broken, so I replaced them
filepaths = {
    2010: "StormEvents_details-ftp_v1.0_d2010_c20210803.csv.gz",
    2011: "StormEvents_details-ftp_v1.0_d2011_c20210803.csv.gz",
    2012: "StormEvents_details-ftp_v1.0_d2012_c20210803.csv.gz",
    2013: "StormEvents_details-ftp_v1.0_d2013_c20211120.csv.gz",
    2014: "StormEvents_details-ftp_v1.0_d2014_c20210830.csv.gz",
    2015: "StormEvents_details-ftp_v1.0_d2015_c20211120.csv.gz",
    2016: "StormEvents_details-ftp_v1.0_d2016_c20210803.csv.gz",
    2017: "StormEvents_details-ftp_v1.0_d2017_c20211120.csv.gz",
    2018: "StormEvents_details-ftp_v1.0_d2018_c20211120.csv.gz",
    2019: "StormEvents_details-ftp_v1.0_d2019_c20210803.csv.gz",
    2020: "StormEvents_details-ftp_v1.0_d2020_c20211120.csv.gz"
}


---

### Part 1 (20 points)

Starting only with storm events in 2010, calculate the total and median amount of property damage caused by **flood events** in each state. We will consider a storm event a flood event if the `EVENT_TYPE` contains "Flood".

I am providing you with a version of the `clean_damage_property()` function from the first project.

In [2]:
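def clean_damage_property(col):
    """Convert damage strings such as "10.00K" or "2.5M" to dollar amounts."""

    # Everything except the last character is the numeric part
    numbers = pd.to_numeric(col.str.slice(start = 0, stop = -1))

    # The last character is the magnitude suffix (K, M or B)
    last_characters = col.str.slice(start = -1)

    # Suffixes outside the mapping (and missing values) become NaN
    factors = last_characters.map({"K": 1000, "M": 1000000, "B": 1000000000})

    rv = numbers * factors

    return rv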
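
To make the behavior of `clean_damage_property()` concrete, here is a minimal sketch that runs it on a small hand-made Series. The sample values are invented for illustration and are not taken from the NOAA files; the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def clean_damage_property(col):
    # Same logic as the function provided above
    numbers = pd.to_numeric(col.str.slice(start=0, stop=-1))
    factors = col.str.slice(start=-1).map({"K": 1000, "M": 1000000, "B": 1000000000})
    return numbers * factors

# Invented sample values (assumption: the real column uses the same "number + suffix" format)
sample = pd.Series(["10.00K", "2.50M", "0.00K", None])
cleaned = clean_damage_property(sample)
print(cleaned)
```

The missing entry stays missing (`NaN`) after cleaning. Because `sum()` and `median()` skip `NaN` by default, events without a reported damage amount do not contribute to the totals computed later.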

In [3]:
'''Reading and preparing the file'''

# Building the full link for 2010
source = url + filepaths[2010]

# Reading the file
storm_2010 = pd.read_csv(source)

# Keeping only the flood events; .copy() creates an independent DataFrame,
# so adding a column below does not trigger SettingWithCopyWarning
storm_2010_flood = storm_2010[storm_2010["EVENT_TYPE"].str.contains('Flood')].copy()

# Creating a new column with the cleaned property damage amounts
storm_2010_flood["DAMAGE_PROPERTY_CLEAN"] = clean_damage_property(storm_2010_flood["DAMAGE_PROPERTY"])


In [4]:
'''Calculating the total and median amount of property damage caused by flood events in each state'''
storm_2010_flood.groupby("STATE")["DAMAGE_PROPERTY_CLEAN"].aggregate(["sum","median"])

STATE,sum,median
ALABAMA,3209000.0,0.0
ALASKA,6160000.0,0.0
AMERICAN SAMOA,0.0,0.0
ARIZONA,19825100.0,0.0
ARKANSAS,19151500.0,1000.0
CALIFORNIA,181149500.0,4000.0
COLORADO,560000.0,5000.0
CONNECTICUT,10990000.0,0.0
DELAWARE,655000.0,0.0
DISTRICT OF COLUMBIA,0.0,0.0
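
The Part 1 logic can be checked by hand on a tiny table. The sketch below uses invented rows (not NOAA data) to show that `str.contains("Flood")` also keeps event types such as "Flash Flood" and "Coastal Flood", and that one `groupby` call can return both the sum and the median.

```python
import pandas as pd

# Invented toy data, small enough to verify by hand
events = pd.DataFrame({
    "STATE": ["TEXAS", "TEXAS", "TEXAS", "OHIO", "OHIO"],
    "EVENT_TYPE": ["Flash Flood", "Flood", "Hail", "Coastal Flood", "Tornado"],
    "DAMAGE_PROPERTY_CLEAN": [1000.0, 5000.0, 700.0, 2000.0, 9000.0],
})

# na=False treats a missing EVENT_TYPE as "not a flood" instead of propagating NaN
is_flood = events["EVENT_TYPE"].str.contains("Flood", na=False)
floods = events[is_flood].copy()

# One row per state, one column per statistic
summary = floods.groupby("STATE")["DAMAGE_PROPERTY_CLEAN"].agg(["sum", "median"])
print(summary)
```

In this toy table, TEXAS has two flood events (1000 and 5000), so its total is 6000 and its median is 3000. The hail and tornado rows are excluded before aggregation.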


---

### Part 2 (40 points)

Now for all storm events between 2010 and 2020, calculate the total amount of property damage caused by **flood events** in each state in each year. Your output should be a **long** series or dataframe.

In [5]:
'''Creating a complete file with data from 2010 to 2020'''

yearly_frames = []

for year, path in filepaths.items():
    source = url+path
    
    # For each year, I'm reading the file
    yearly_events = pd.read_csv(source)
    
    # Getting only flood events (.copy() avoids SettingWithCopyWarning)
    yearly_flood_events = yearly_events[yearly_events["EVENT_TYPE"].str.contains('Flood')].copy()
    
    # Creating a new column with the cleaned property damage amounts
    yearly_flood_events["DAMAGE_PROPERTY_CLEAN"] = clean_damage_property(yearly_flood_events["DAMAGE_PROPERTY"])
    
    # Collecting the yearly files (DataFrame.append was removed in pandas 2.0)
    yearly_frames.append(yearly_flood_events)

# Stacking all yearly files in long form into the final file
all_flood_events_clean = pd.concat(yearly_frames)


In [12]:
'''Calculating the total amount of property damage caused by flood events in EACH state in EACH year'''
all_flood_events_clean.groupby(["STATE","YEAR"])[["DAMAGE_PROPERTY_CLEAN"]].sum()

STATE,YEAR,DAMAGE_PROPERTY_CLEAN
ALABAMA,2010,3209000.0
ALABAMA,2011,602000.0
ALABAMA,2012,358000.0
ALABAMA,2013,3881000.0
ALABAMA,2014,34618460.0
...,...,...
WYOMING,2016,1200000.0
WYOMING,2017,2251000.0
WYOMING,2018,10000.0
WYOMING,2019,245000.0
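
As a small illustration of the long format, the sketch below stacks two invented yearly tables with `pd.concat` and aggregates them by state and year. The data and the names `yearly`, `frames`, and `long_events` are assumptions made for this example only.

```python
import pandas as pd

# Invented yearly tables standing in for the cleaned flood events of each year
yearly = {
    2010: pd.DataFrame({"STATE": ["OHIO", "TEXAS"], "DAMAGE_PROPERTY_CLEAN": [100.0, 200.0]}),
    2011: pd.DataFrame({"STATE": ["OHIO", "OHIO"], "DAMAGE_PROPERTY_CLEAN": [50.0, 25.0]}),
}

# Tag each table with its year, then stack them vertically (long form)
frames = [df.assign(YEAR=year) for year, df in yearly.items()]
long_events = pd.concat(frames, ignore_index=True)

# One value per (STATE, YEAR) pair, and one value per YEAR
by_state_year = long_events.groupby(["STATE", "YEAR"])["DAMAGE_PROPERTY_CLEAN"].sum()
by_year = long_events.groupby("YEAR")["DAMAGE_PROPERTY_CLEAN"].sum()
print(by_state_year)
print(by_year)
```

A state-year pair with no events (TEXAS in 2011 here) simply has no row in the long result, which is why the long table can have fewer rows than states times years.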


Calculate the total amount of property damage caused by floods across all states in each year.

In [11]:
'''Calculating the total amount of property damage caused by flood events across ALL states in EACH year'''
all_flood_events_clean.groupby("YEAR")[["DAMAGE_PROPERTY_CLEAN"]].sum()

YEAR,DAMAGE_PROPERTY_CLEAN
2010,3935464000.0
2011,8019957000.0
2012,21547760000.0
2013,2196687000.0
2014,2629546000.0
2015,2540421000.0
2016,10712390000.0
2017,64656590000.0
2018,1243669000.0
2019,2478318000.0


---

### Part 3 (40 points)

Now for all storm events between 2010 and 2020, calculate the total amount of property damage caused by **flood events** in each state in each year. Your output should be a **wide** dataframe and have the following columns:

- `STATE` State
- `DAMAGE_PROPERTY_CLEAN_2010` Property damage from flood events in 2010
- `DAMAGE_PROPERTY_CLEAN_2011` Property damage from flood events in 2011
- `DAMAGE_PROPERTY_CLEAN_2012` Property damage from flood events in 2012
- `DAMAGE_PROPERTY_CLEAN_2013` Property damage from flood events in 2013
- `DAMAGE_PROPERTY_CLEAN_2014` Property damage from flood events in 2014
- `DAMAGE_PROPERTY_CLEAN_2015` Property damage from flood events in 2015
- `DAMAGE_PROPERTY_CLEAN_2016` Property damage from flood events in 2016
- `DAMAGE_PROPERTY_CLEAN_2017` Property damage from flood events in 2017
- `DAMAGE_PROPERTY_CLEAN_2018` Property damage from flood events in 2018
- `DAMAGE_PROPERTY_CLEAN_2019` Property damage from flood events in 2019
- `DAMAGE_PROPERTY_CLEAN_2020` Property damage from flood events in 2020

In [7]:
'''Creating a complete wide file with data from 2010 to 2020'''

wide_flood_events = pd.DataFrame(columns=["STATE"])

for year, path in filepaths.items():
    source = url+path
    
    # For each year, I'm reading the file
    yearly_events = pd.read_csv(source)
    
    # Getting only flood events (.copy() avoids SettingWithCopyWarning)
    yearly_flood_events = yearly_events[yearly_events["EVENT_TYPE"].str.contains('Flood')].copy()
    
    # Creating a new column with the cleaned property damage amounts for each year
    col_name="DAMAGE_PROPERTY_CLEAN_"+str(year)
    yearly_flood_events[col_name] = clean_damage_property(yearly_flood_events["DAMAGE_PROPERTY"])
    
    # Grouping the yearly flood events and summing the clean property data 
    yearly_grouped = yearly_flood_events.groupby("STATE")[[col_name]].sum()
    
    # Merging each yearly total into the wide file; both sides have one row per state, hence validate="1:1"
    wide_flood_events = pd.merge(left=wide_flood_events, right=yearly_grouped, left_on="STATE", right_on="STATE", validate = "1:1", how="outer")


In [8]:
# Checking if the columns are correct
wide_flood_events.columns

Index(['STATE', 'DAMAGE_PROPERTY_CLEAN_2010', 'DAMAGE_PROPERTY_CLEAN_2011',
       'DAMAGE_PROPERTY_CLEAN_2012', 'DAMAGE_PROPERTY_CLEAN_2013',
       'DAMAGE_PROPERTY_CLEAN_2014', 'DAMAGE_PROPERTY_CLEAN_2015',
       'DAMAGE_PROPERTY_CLEAN_2016', 'DAMAGE_PROPERTY_CLEAN_2017',
       'DAMAGE_PROPERTY_CLEAN_2018', 'DAMAGE_PROPERTY_CLEAN_2019',
       'DAMAGE_PROPERTY_CLEAN_2020'],
      dtype='object')

In [9]:
# Inspecting whether the data frame was correctly created
wide_flood_events.head()

,STATE,DAMAGE_PROPERTY_CLEAN_2010,DAMAGE_PROPERTY_CLEAN_2011,DAMAGE_PROPERTY_CLEAN_2012,DAMAGE_PROPERTY_CLEAN_2013,DAMAGE_PROPERTY_CLEAN_2014,DAMAGE_PROPERTY_CLEAN_2015,DAMAGE_PROPERTY_CLEAN_2016,DAMAGE_PROPERTY_CLEAN_2017,DAMAGE_PROPERTY_CLEAN_2018,DAMAGE_PROPERTY_CLEAN_2019,DAMAGE_PROPERTY_CLEAN_2020
0,ALABAMA,3209000.0,602000.0,358000.0,3881000.0,34618460.0,9472000.0,200000.0,3275000.0,2215000.0,210000.0,1280000.0
1,ALASKA,6160000.0,25535000.0,23430000.0,28076000.0,686000.0,17290300.0,158000.0,6190000.0,3400000.0,7900000.0,9574000.0
2,AMERICAN SAMOA,0.0,0.0,,,0.0,8000000.0,0.0,7000.0,100000.0,15000.0,671000.0
3,ARIZONA,19825100.0,3948000.0,18100000.0,3166000.0,14515000.0,1185000.0,1165000.0,7996500.0,10446000.0,2162000.0,334000.0
4,ARKANSAS,19151500.0,185908000.0,2242000.0,24728000.0,15060000.0,8893000.0,4712000.0,26719000.0,1610000.0,11670000.0,423000.0
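
The wide table above is built by merging one yearly column at a time. An alternative, sketched below on invented totals, is to start from the long state-year totals and reshape them with `DataFrame.pivot`. This is not the approach used in this notebook; it is shown only to illustrate how the long and wide formats relate.

```python
import pandas as pd

# Invented long-form totals: one row per (STATE, YEAR) pair that has events
long_totals = pd.DataFrame({
    "STATE": ["OHIO", "OHIO", "TEXAS"],
    "YEAR": [2010, 2011, 2010],
    "DAMAGE_PROPERTY_CLEAN": [100.0, 75.0, 200.0],
})

# Spread the years across columns: one row per state, one column per year
wide = long_totals.pivot(index="STATE", columns="YEAR", values="DAMAGE_PROPERTY_CLEAN")

# Rename 2010 -> DAMAGE_PROPERTY_CLEAN_2010 etc., and turn STATE back into a column
wide = wide.add_prefix("DAMAGE_PROPERTY_CLEAN_").reset_index()
wide.columns.name = None
print(wide)
```

State-year pairs missing from the long table (TEXAS in 2011 here) show up as `NaN` in the wide table, just like the empty cells for AMERICAN SAMOA in the output above.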


Which states had more property damage from floods in 2020 than in 2010?

In [10]:
wide_flood_events[wide_flood_events["DAMAGE_PROPERTY_CLEAN_2010"] < wide_flood_events["DAMAGE_PROPERTY_CLEAN_2020"]][["STATE","DAMAGE_PROPERTY_CLEAN_2010","DAMAGE_PROPERTY_CLEAN_2020"]]

,STATE,DAMAGE_PROPERTY_CLEAN_2010,DAMAGE_PROPERTY_CLEAN_2020
1,ALASKA,6160000.0,9574000.0
2,AMERICAN SAMOA,0.0,671000.0
9,DISTRICT OF COLUMBIA,0.0,100000.0
10,FLORIDA,48000.0,47841000.0
11,GEORGIA,1268000.0,1985000.0
12,HAWAII,0.0,30600000.0
15,INDIANA,227500.0,630000.0
21,MARYLAND,30000.0,395000.0
23,MICHIGAN,5030000.0,258467000.0
33,NEW YORK,5736000.0,34668000.0


__Answer:__ Seventeen values of `STATE` (the column also includes the District of Columbia and U.S. territories) had more property damage from floods in 2020 than in 2010: ALASKA, AMERICAN SAMOA, DISTRICT OF COLUMBIA, FLORIDA, GEORGIA, HAWAII, INDIANA, MARYLAND, MICHIGAN, NEW YORK, NORTH CAROLINA, OHIO, OREGON, PUERTO RICO, SOUTH CAROLINA, VIRGINIA and WASHINGTON.
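
One detail behind this comparison is how missing values are handled. The sketch below, on an invented three-row table with placeholder state names, shows that a comparison involving `NaN` evaluates to `False`, so a state with no recorded flood damage in one of the two years is left out unless the gap is filled first. Whether filling with 0 is appropriate is a modeling assumption, not something the dataset decides.

```python
import pandas as pd

# Invented wide table; state "C" has no recorded flood damage for 2010
wide = pd.DataFrame({
    "STATE": ["A", "B", "C"],
    "DAMAGE_PROPERTY_CLEAN_2010": [100.0, 500.0, float("nan")],
    "DAMAGE_PROPERTY_CLEAN_2020": [300.0, 200.0, 50.0],
})

# Comparisons with NaN are False, so "C" is not selected here
increased = wide["DAMAGE_PROPERTY_CLEAN_2020"] > wide["DAMAGE_PROPERTY_CLEAN_2010"]
states = wide.loc[increased, "STATE"].tolist()

# Assumption: treating a missing total as zero damage changes the answer
filled = wide.fillna({"DAMAGE_PROPERTY_CLEAN_2010": 0.0, "DAMAGE_PROPERTY_CLEAN_2020": 0.0})
increased_filled = filled["DAMAGE_PROPERTY_CLEAN_2020"] > filled["DAMAGE_PROPERTY_CLEAN_2010"]
states_filled = filled.loc[increased_filled, "STATE"].tolist()

print(states, len(states))
print(states_filled, len(states_filled))
```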