# COGS 108 - Data Checkpoint

# Names

- Victoria Thai
- Hannah Yick
- Jane Dinh
- Natasha Supangkat
- Gabriel Ramiro

<a id='research_question'></a>
# Research Question

What relationships exist between the presence of Superfund sites in a county and that county’s socioeconomic and demographic characteristics? Does the demographic makeup of a county influence the amount of time between Superfund designation and the completion of the remediation process?

# Dataset(s)

We will combine our datasets using the common variable of geographic location (county). By analyzing Superfund activity and demographic breakdown together, we will be able to better visualize and understand trends and, potentially, a relationship between the two.

1) **Superfund/National Priorities List (NPL) Sites**
- Number of observations: 1327 (as of April 26, 2021)
- Features: Region, state, site name, site ID, EPA ID, address, city, zip, county, federal facility indicator (whether or not the site is a federal site), latitude, longitude, listing date.
  - We will focus on the site name, county, and listing date.
- Summary: This dataset provides a comprehensive overview of the main characteristics of current Superfund sites on the National Priorities List. Geographical information (county) can help us better understand the spread of the data by location, and will allow us to combine with demographic data to explore trends. Additionally, we will use the listing date to determine the source of funding (based on the date of policy changes) as well as to measure how long the site has been active.
- Source: https://semspub.epa.gov/work/HQ/201371.pdf 

2) **2019 American Community Survey 5 Year Estimate by County**
- Number of observations: 3220 
- Features: Total population, race, per capita income, median household income, poverty status by race.
- Summary: This dataset provides demographic and socioeconomic data on each county’s population. This will help us relate population characteristics to the locations of NPL sites.
- Source: https://www.socialexplorer.com/explore-maps
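
The county-level combination described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the NPL rows are taken from the EPA file, but `acs_demo`, its column names (`State`, `County`, `Total Population`), its population values, and the helper `county_key` are hypothetical placeholders, not the real ACS export. Joining on state and county together matters because county names such as Middlesex repeat across states.

```python
import pandas as pd

# A few NPL rows (state, site name, county) as they appear in the EPA file.
npl_demo = pd.DataFrame({
    "State": ["CT", "CT", "WA", "WA"],
    "Site Name": ["DURHAM MEADOWS", "GALLUP'S QUARRY",
                  "QUEEN CITY FARMS", "QUENDALL TERMINALS"],
    "County": ["MIDDLESEX", "WINDHAM", "KING", "KING"],
})

# Hypothetical ACS-style frame: column names and values are placeholders,
# NOT real ACS estimates.
acs_demo = pd.DataFrame({
    "State": ["CT", "CT", "WA"],
    "County": ["Middlesex County", "Tolland County", "King County"],
    "Total Population": [1000, 2000, 3000],
})

def county_key(names: pd.Series) -> pd.Series:
    """Normalize county names: upper-case and drop a trailing 'COUNTY'."""
    return (names.str.upper()
                 .str.replace(r"\s+COUNTY$", "", regex=True)
                 .str.strip())

# Count NPL sites per (state, county).
site_counts = (
    npl_demo.assign(county_key=county_key(npl_demo["County"]))
            .groupby(["State", "county_key"])
            .size()
            .reset_index(name="n_sites")
)

# Left-merge onto the demographic table so counties without sites are kept.
merged = (
    acs_demo.assign(county_key=county_key(acs_demo["County"]))
            .merge(site_counts, on=["State", "county_key"], how="left")
)
merged["n_sites"] = merged["n_sites"].fillna(0).astype(int)
```

A left merge from the demographic table keeps counties with no NPL sites (with `n_sites` set to 0), which would allow comparing counties with and without sites.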



# Setup

In [1]:
#Imports 
import pandas as pd
import numpy as np

#Graphing
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

#Statistics
import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

#Webscraping
import requests 
import bs4
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

# Read in the data and store it within a data frame
npl_df = pd.read_csv('All current Final NPL Sites (FOIA 4).csv')

# Data Cleaning


In [2]:
npl_df

Unnamed: 0,Region,State,Site Name,Site ID,EPA ID,Address,City,Zip,County,FF Ind,Latitude,Longitude,NPL Status Date
0,1,CT,BARKHAMSTED-NEW HARTFORD LANDFILL,100255,CTD980732333,ROUTE 44,BARKHAMSTED,6063,LITCHFIELD,N,41.893947,-72.989337,10/4/1989
1,1,CT,BEACON HEIGHTS LANDFILL,100180,CTD072122062,BLACKBERRY HILL ROAD,BEACON FALLS,6403,NEW HAVEN,N,41.431950,-73.035281,9/8/1983
2,1,CT,DURHAM MEADOWS,100108,CTD001452093,MAIN ST,DURHAM,6422,MIDDLESEX,N,41.481110,-72.681388,10/4/1989
3,1,CT,GALLUP'S QUARRY,100201,CTD108960972,ROUTE 12,PLAINFIELD,6374,WINDHAM,N,41.665281,-71.924161,10/4/1989
4,1,CT,KELLOGG-DEERING WELL FIELD,100252,CTD980670814,NORWALK WATER DEPARTMENT,NORWALK,6856,FAIRFIELD,N,41.130550,-73.431950,9/21/1984
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1322,10,WA,QUEEN CITY FARMS,1000835,WAD980511745,S 1/2 SEC 28-MAPLE VALLEY QUAD,MAPLE VALLEY,98038,KING,N,47.450000,-122.041700,9/21/1984
1323,10,WA,QUENDALL TERMINALS,1000875,WAD980639215,4503 LK WASHINGTON BLVD N,RENTON,98055,KING,N,47.533333,-122.200000,4/19/2006
1324,10,WA,SEATTLE MUNICIPAL LANDFILL (KENT HIGHLANDS),1000889,WAD980639462,NE OF MILITARY RD AND KENT DES MOINES RD,KENT,98031,KING,N,47.391669,-122.279200,8/30/1990
1325,10,WA,"WESTERN PROCESSING CO., INC.",1000662,WAD009487513,7215 S 196TH ST,KENT,98031,KING,N,47.425000,-122.241700,9/8/1983


We removed the following columns from the dataset because they are irrelevant to our research.  

In [3]:
npl_df = npl_df.drop(["Region", "Site ID", "EPA ID", "Address", "Zip", "FF Ind", "Latitude", "Longitude"], axis=1)
npl_df

Unnamed: 0,State,Site Name,City,County,NPL Status Date
0,CT,BARKHAMSTED-NEW HARTFORD LANDFILL,BARKHAMSTED,LITCHFIELD,10/4/1989
1,CT,BEACON HEIGHTS LANDFILL,BEACON FALLS,NEW HAVEN,9/8/1983
2,CT,DURHAM MEADOWS,DURHAM,MIDDLESEX,10/4/1989
3,CT,GALLUP'S QUARRY,PLAINFIELD,WINDHAM,10/4/1989
4,CT,KELLOGG-DEERING WELL FIELD,NORWALK,FAIRFIELD,9/21/1984
...,...,...,...,...,...
1322,WA,QUEEN CITY FARMS,MAPLE VALLEY,KING,9/21/1984
1323,WA,QUENDALL TERMINALS,RENTON,KING,4/19/2006
1324,WA,SEATTLE MUNICIPAL LANDFILL (KENT HIGHLANDS),KENT,KING,8/30/1990
1325,WA,"WESTERN PROCESSING CO., INC.",KENT,KING,9/8/1983
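
The dataset summary mentions using the listing date to measure how long a site has been active. A minimal sketch of that step is shown here, assuming the `NPL Status Date` strings follow the month/day/year pattern visible in the output and using April 26, 2021 (the date the observation count was taken) as an assumed reference date. The frame `dates_demo` and the column name `Years Listed` are illustrative only.

```python
import pandas as pd

# Two rows from the NPL table, used only to illustrate the conversion.
dates_demo = pd.DataFrame({
    "Site Name": ["BEACON HEIGHTS LANDFILL", "QUENDALL TERMINALS"],
    "NPL Status Date": ["9/8/1983", "4/19/2006"],
})

# Parse the month/day/year strings into datetime values.
dates_demo["NPL Status Date"] = pd.to_datetime(
    dates_demo["NPL Status Date"], format="%m/%d/%Y"
)

# Assumed reference date: when the dataset was retrieved.
as_of = pd.Timestamp("2021-04-26")

# Approximate time on the list, in years.
dates_demo["Years Listed"] = (
    (as_of - dates_demo["NPL Status Date"]).dt.days / 365.25
)
```

Converting the column to a datetime type also makes it straightforward to compare listing dates against the dates of policy changes later on.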
