# Capstone Project - EPA Superfund sites, environmental justice

#### Problem statement


I have a dataset of socioeconomic and demographic data on communities, with a variable that designates whether each area is near a Superfund site. The task is to identify the demographic and socioeconomic factors most associated with communities that live near a Superfund site. If those communities turn out to be more "marginal", the findings should be brought to policymakers, the EPA, and other significant influencers.

#### Background

Thousands of contaminated sites exist nationally due to hazardous waste being dumped, left out in the open, or otherwise improperly managed. These sites include manufacturing facilities, processing plants, landfills and mining sites. In response, Congress established the Comprehensive Environmental Response, Compensation and Liability Act (CERCLA) in 1980. CERCLA is informally called Superfund. It allows EPA to clean up contaminated sites. It also forces the parties responsible for the contamination to either perform cleanups or reimburse the government for EPA-led cleanup work. When there is no viable responsible party, Superfund gives EPA the funds and authority to clean up contaminated sites and to address abandoned hazardous waste sites. These abandoned sites are thought to pose a significant threat to human health and the environment and, as a result, may qualify for placement on the USEPA’s Superfund list to receive federal cleanup funds.
 
Superfund’s goals are to:
 
- Protect human health and the environment by cleaning up polluted sites;
- Make responsible parties pay for cleanup work;
- Involve communities in the Superfund process; and
- Return Superfund sites to productive use.

Superfund sites are listed on the National Priorities List (NPL). Every site on the NPL is still active: it has had either limited remediation or none at all, so it still presents a significant threat to public safety. Once remediation is satisfactory, a site is taken off the list. The dataset doesn't seem to have a column indicating whether a site is still active.

From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228303/: 

The geographic distribution of Superfund sites has always been a controversial issue because research has shown that hazardous waste sites are differentially located in predominately Non-White and low-income communities. An environmental justice (EJ) analysis conducted by Maranville et al., examined whether the presence of a Superfund site affected surrounding communities in the state of Illinois in order to inform future siting decisions and improve current sites [3]. Geographic Information Systems (GIS) was used to create one, two, and five mile buffer zones around Superfund sites to capture the sociodemographic composition of host communities [3]. The study found that percent Non-White was significantly higher than the percent of White populations within a one mile radius surrounding the Superfund sites [3]. Furthermore, over 50% (24/43) of the sites included in the analysis had a higher percentage of Non-White populations residing near the environmental hazards [3]. The aforementioned results suggest that race/ethnicity may be the principal driver of environmental inequity.

The objectivity of the Superfund program has been questioned due to the disproportionate number of Non-White and low-income populations that may not be benefiting from cleanup efforts [2]. While there are certain criteria that determine whether a site is placed on the NPL, such as the severity of the hazard, or if the site presents less of a hazard thus making the cleanup process less arduous; there are additional racial and socioeconomic determinants that may influence the fate of a site [3]. A 2007 study by O’Neil [2] examined the relationship between environmental remediation and EJ by evaluating the impact of Executive Order 12898 [4,5] on the Superfund listing and cleanup process.

O’Neil found that a one percent increase in Non-White populations was associated with a 0.2% decrease in the probability of a Superfund listing [2]. The results of the study suggest that for sites discovered after the 1994 Executive Order 12898, there was a lower chance of a Superfund listing for poor communities and disadvantaged communities of color [2]. Despite the EJ Executive Order, equity in the Superfund listing process worsened after 1994 [2]. In addition, it appears that the USEPA has failed to properly implement Executive Order 12898 in regards to the Superfund program [2] particularly in EJ communities known to have a high concentration of hazardous waste sites.

Environmental effects
Some of the common contaminants found at Superfund sites are asbestos, dioxin, and mercury, all of which may pose a significant threat to ecological health. Asbestos is a naturally occurring fibrous silicate mineral that has been mined for its invaluable properties and used in many commercial products that include insulation, brake linings, and roofing shingles [6]. Moreover, asbestos may enter the air and water from the weathering of natural deposits and the decomposition of manufactured products (e.g., brake pads) [6]. Small fibers may remain suspended in the air for an extended period of time before settling which may increase the duration of exposure [7].

Dioxin refers to a group of toxic chemical compounds that share certain chemical structures and biological characteristics [8]. Dioxins may be very toxic to certain animals as well as humans, particularly during their early stages of development when the body is less capable of metabolizing the aforementioned compound. While, mercury is another naturally occurring chemical in the environment, additional sources include coal and oil combustion as well as emissions from incineration and landfills. These emissions may contaminate soil and water, which can lead to deleterious effects on various animal species such as loons, eagles, otters, mink, and kingfishers [9].

Health effects
Despite the limitations in exposure science to link Superfund site contaminants with long-term health effects, there have been studies to show the detriment of volatile organic compounds (VOCs) that were released in drinking water among Superfund host communities [10]. The adverse health effects attributable to VOC exposure included the following: (1) birth defects, (2) diabetes, (3) urinary tract disorders, (4) eczema and skin conditions, (5) anemia, (6) speech and hearing difficulties in young children, and (7) stroke [10]. Moreover, a study that evaluated local health problems and exposure to heavy metals at the Tar Creek Superfund site in Ottawa County, Oklahoma found an increase in mortality incidence for heart disease among adults as well as increased blood lead levels (>10 μg/dl) in over 50% of the children which exceeded normal intake standards [11]. Another study by Williamson et al. found that people who live near multiple Superfund sites were more likely to have immunoglobulin test results that are lower or higher than the reference range when compared to populations further away from these sites or other environmental hazards [12]. The major implications of having abnormal immunoglobulin levels is that it decreases immune function, which then impairs the body’s ability to protect against disease [12]. As a result, populations living in close proximity to Superfund sites may be more susceptible to chronic and infectious diseases as well as those related to chemical exposures.

Property values
The proximity of Superfund sites to neighboring communities, whether commercial or residential may have a drastic effect on property values [13]. Properties located close to these sites may depreciate due to unwanted land uses [14]. Unfortunately, there is little that a homeowner can do to reduce their exposure to nearby waste sites since it is the responsibility of the company to ensure that harmful chemicals are not released into the community. If these hazardous chemicals were dispersed into the environment, they could pose a serious health threat to the community and surrounding property. Specifically, the area may be deemed unlivable due to irreversible contamination of soil or pollution of surface waters and drinking water resources [13].

While research has shown that hazardous waste sites are located in predominately Non-White and low-income communities, there is a paucity of research describing the profile of populations hosting those sites, particularly in the state of South Carolina (SC). The purpose of this study was to evaluate the spatial distribution of Superfund sites in SC across areas with varying racial/ethnic and socioeconomic composition.

#### Objectives

- Go through the roughly 300 features and check for "duplicates", i.e. features that explain the same thing. Feeding every feature into the model would most likely cause overfitting and would require too much computing power.
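
One way to look for such "duplicate" features is to flag pairs of numeric columns that are almost perfectly correlated. The sketch below is only an illustration on toy data: the helper name `highly_correlated_pairs`, the 0.95 threshold, and the toy column names are all assumptions, not something taken from the dataset or from the cleaning done later in this notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.95):
    """Return (col_a, col_b, |r|) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs >= threshold].sort_values(ascending=False)
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Toy data (hypothetical): 'pop_acs' closely tracks 'pop_cen', 'income' is unrelated.
rng = np.random.default_rng(0)
pop_cen = rng.integers(500, 8000, size=200).astype(float)
toy = pd.DataFrame({
    "pop_cen": pop_cen,
    "pop_acs": pop_cen + rng.normal(0, 50, size=200),
    "income": rng.normal(50000, 15000, size=200),
})

redundant = highly_correlated_pairs(toy, threshold=0.95)
print(redundant)
```

A high correlation only suggests redundancy; deciding which column of a pair to keep is still a judgment call.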

#### Proposed models and methods

- Classification modeling will be used here:
    - Logistic Regression
    - K-Nearest Neighbors
    - Support Vector Machines
    - Neural Network Classifier
    - Decision Tree Classifier
    - Random Forests Classifier
    - Extra Trees Classifier
- Methods
    - Feature Engineering
    - Data Cleaning/Munging
    - Exploratory Data Analysis
    - Model Building
    - Hyperparameter Tuning
    - Classification!
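
As a rough sketch of how two of the proposed classifiers could be compared, the block below cross-validates a logistic regression and a random forest with scikit-learn. Everything here is an assumption for illustration: the data is synthetic (generated with `make_classification`, about 5% positives), and the use of `class_weight="balanced"`, ROC AUC scoring, and 5 stratified folds are choices of this sketch rather than decisions made in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tract data (NOT the real dataset): ~5% positives.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

models = {
    # Scaling matters for logistic regression, so it goes in a pipeline.
    "logreg": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

# Stratified folds keep the rare positive class present in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}
print(scores)
```

Plain accuracy would be a misleading score when positives are rare, which is why this sketch uses ROC AUC instead.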

#### Risks & assumptions of the data

While this data is linked to demographic and socioeconomic data based on either the block group (or census tract), the impacts of a particular site's pollution may extend beyond these geographic regions.

#### Goals & success criteria

## Data 

- priorities_list_full.json: the NPL containing all geographic, site information, text descriptions, and Census Bureau data from the relevant block groups.
- pdb_tract.csv: the planning database aggregated on the tract level with an additional indicator (has_superfund) noting whether or not the tract contains the address of a Superfund site.
- pdb_block_group.csv: the planning database aggregated on the block group level with an additional indicator (has_superfund) noting whether or not the block group contains the address of a Superfund site.

- https://www.census.gov/research/data/planning_database/2015/docs/PDB_Block_Group_2015-07-28a.pdf
- https://www.census.gov/research/data/planning_database/2015/docs/PDB_Tract_2015-07-28a.pdf

The data does not include:
- When a site was designated as a Superfund site
- Any pollution-related data

### Data Cleaning/Munging

Judging by the look of this dataset, a lot of columns will definitely need to be dropped.

In [154]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [155]:
tract = pd.read_csv('./federal-superfunds/pdb_tract.csv')

In [156]:
tract.head() #checking out the head of the data frame

Unnamed: 0,FIPS_Tract,State,State_name,County,County_name,Tract,Flag,Num_BGs_in_Tract,LAND_AREA,AIAN_LAND,...,pct_TEA_Update_Leave_CEN_2010,pct_Census_Mail_Returns_CEN_2010,pct_Vacant_CEN_2010,pct_Deletes_CEN_2010,pct_Census_UAA_CEN_2010,pct_Mailback_Count_CEN_2010,pct_FRST_FRMS_CEN_2010,pct_RPLCMNT_FRMS_CEN_2010,pct_BILQ_Mailout_count_CEN_2010,has_superfund
0,1001020100,1,Alabama,1,Autauga County,20100,,2.0,3.788,0.0,...,0.0,68.25,1.92,0.0,16.39,81.69,61.84,6.4,0.0,0
1,1001020200,1,Alabama,1,Autauga County,20200,,2.0,1.29,0.0,...,0.0,68.82,2.28,0.0,13.07,84.65,60.79,8.03,0.0,0
2,1001020300,1,Alabama,1,Autauga County,20300,,2.0,2.065,0.0,...,0.0,72.95,1.67,0.0,6.53,91.79,72.95,0.0,0.0,0
3,1001020400,1,Alabama,1,Autauga County,20400,,4.0,2.464,0.0,...,0.0,77.64,1.46,0.0,5.51,93.03,77.64,0.0,0.0,0
4,1001020500,1,Alabama,1,Autauga County,20500,,3.0,4.401,0.0,...,0.0,70.97,2.16,0.0,5.96,91.87,70.97,0.0,0.0,0


## Target Variable

In [157]:
tract.has_superfund.value_counts() / tract.shape[0] #proportion of census tracts with and without a Superfund site

0    0.983207
1    0.016793
Name: has_superfund, dtype: float64
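
With under 2% of tracts in the positive class, a model that always predicts "no Superfund site" would already be about 98% accurate. The sketch below makes that baseline explicit and computes "balanced" class weights using the common formula n_samples / (n_classes * class_count). The toy series (17 positives out of 1000) is a made-up stand-in that only roughly mirrors the ratio above.

```python
import pandas as pd

# Toy target (hypothetical): 17 positives in 1000 tracts, close to the ratio above.
y = pd.Series([1] * 17 + [0] * 983, name="has_superfund")

class_share = y.value_counts(normalize=True)

# Accuracy of always predicting the majority class.
majority_baseline = class_share.max()

# "Balanced" weights: n_samples / (n_classes * class_count).
counts = y.value_counts()
class_weights = (len(y) / (len(counts) * counts)).to_dict()

print(majority_baseline)
print(class_weights)
```

Any classifier built later should be judged against this baseline rather than against 50%.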

## Feature elimination

- Removed features involving spoken language and unnecessary labels (block groups, flag)
- Dropped any ACS feature that is also captured by the 2010 Census (the `CEN_2010` columns)
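
The second rule can be partly automated by matching column names: an `_ACS_09_13` column is a candidate for dropping when the same stem exists with a `_CEN_2010` suffix. This is a minimal sketch under that naming assumption; the helper `acs_columns_with_census_twin` is hypothetical, and exact matching will miss pairs whose stems are spelled differently (for example `Rel_Family_HHD_ACS_09_13` versus `Rel_Family_HHDS_CEN_2010`), so the hand-written lists below are still needed.

```python
def acs_columns_with_census_twin(columns, acs_suffix="_ACS_09_13", cen_suffix="_CEN_2010"):
    """Return ACS columns whose stem also appears with the Census suffix."""
    available = set(columns)
    twins = []
    for col in columns:
        if col.endswith(acs_suffix):
            stem = col[: -len(acs_suffix)]
            if stem + cen_suffix in available:
                twins.append(col)
    return twins

# A few column names from the planning database, used here as a small example.
example_cols = [
    "Tot_Population_CEN_2010", "Tot_Population_ACS_09_13",
    "Males_CEN_2010", "Males_ACS_09_13",
    "Med_HHD_Inc_ACS_09_13",  # no Census twin, so it is kept
    "has_superfund",
]

to_drop = acs_columns_with_census_twin(example_cols)
print(to_drop)  # ['Tot_Population_ACS_09_13', 'Males_ACS_09_13']
```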

In [158]:
tract = tract.drop(['State', 'County', 'Flag', 'FIPS_Tract', 'Num_BGs_in_Tract', 'LAND_AREA',
                   'AIAN_LAND'], axis=1) #dropping identifier and land-area columns as they aren't needed here

In [159]:
tract = tract[tract.columns.drop(list(tract.filter(regex='Age5p')))] #don't need spoken languages features

In [160]:
tract = tract[tract.columns.drop(list(tract.filter(regex='ACSMOE')))] #dropping the American Community
#Survey margin of error

In [161]:
tract = tract.drop(['Tot_GQ_CEN_2010','Inst_GQ_CEN_2010','Non_Inst_GQ_CEN_2010',
                   'Tot_Population_ACS_09_13','Males_ACS_09_13', 'Females_ACS_09_13', 'Pop_under_5_ACS_09_13',
                   'Pop_5_17_ACS_09_13','Pop_18_24_ACS_09_13','Pop_25_44_ACS_09_13',
                   'Pop_45_64_ACS_09_13', 'Pop_65plus_ACS_09_13', 'Hispanic_ACS_09_13', 
                    'NH_White_alone_ACS_09_13', 'Pop_25yrs_Over_ACS_09_13',],axis=1) 
#Dropping any American Community Survey feature and leaving the Census Bureau feature. 

In [162]:
tract = tract[tract.columns.drop(list(tract.filter(regex='ENG_VW')))] #dropping the ENG_VW features
#(speaking English "very well", broken down by language)


In [163]:
tract = tract.drop(['Rel_Family_HHD_ACS_09_13', 'MrdCple_Fmly_HHD_ACS_09_13',
                   'Not_MrdCple_HHD_ACS_09_13'],axis=1) #dropping household-composition counts;
                                #they don't seem to be good predictors of census tracts that have a Superfund site

In [164]:
tract = tract.drop(['Female_No_HB_CEN_2010', 'Female_No_HB_ACS_09_13',
                   'NonFamily_HHD_CEN_2010', 'NonFamily_HHD_ACS_09_13',
                   'Rel_Family_HHDS_CEN_2010','MrdCple_Fmly_HHD_CEN_2010',
                   'Not_MrdCple_HHD_CEN_2010', 'Sngl_Prns_HHD_CEN_2010',
                   'Sngl_Prns_HHD_ACS_09_13', 'HHD_PPL_Und_18_CEN_2010',
                   'HHD_PPL_Und_18_ACS_09_13', 'Tot_Prns_in_HHD_CEN_2010',
                   'Tot_Prns_in_HHD_ACS_09_13', 'Rel_Child_Under_6_CEN_2010',
                   'Rel_Child_Under_6_ACS_09_13'],axis=1) #reasoning is similar to the previous cell and we will
                                                            #drop these features

In [165]:
tract = tract.drop(['Tot_Housing_Units_CEN_2010', 'Tot_Housing_Units_ACS_09_13',
                    'Tot_Occp_Units_CEN_2010', 'Tot_Occp_Units_ACS_09_13',
                   'Tot_Vacant_Units_CEN_2010', 'Tot_Vacant_Units_ACS_09_13',
                   'Renter_Occp_HU_CEN_2010', 'Renter_Occp_HU_ACS_09_13',
                   'Owner_Occp_HU_CEN_2010', 'Owner_Occp_HU_ACS_09_13',
                   'Single_Unit_ACS_09_13','MLT_U2_9_STRC_ACS_09_13',
                   'MLT_U10p_ACS_09_13', 'Mobile_Homes_ACS_09_13',
                   'Crowd_Occp_U_ACS_09_13','Occp_U_NO_PH_SRVC_ACS_09_13',
                   'No_Plumb_ACS_09_13','Recent_Built_HU_ACS_09_13',
                   'MailBack_Area_Count_CEN_2010','TEA_Mail_Out_Mail_Back_CEN_2010',
                   'TEA_Update_Leave_CEN_2010', 'Census_Mail_Returns_CEN_2010',
                   'Mail_Return_Rate_CEN_2010', 'Low_Response_Score', 
                   'Vacants_CEN_2010', 'Deletes_CEN_2010', 'Census_UAA_CEN_2010',
                   'Valid_Mailback_Count_CEN_2010','FRST_FRMS_CEN_2010', 'RPLCMNT_FRMS_CEN_2010',
                   'BILQ_Mailout_count_CEN_2010', 'BILQ_Frms_CEN_2010'],axis=1) 

## Some percentage features are interesting to look at, so I'll copy them into a new data frame, drop all the percentage features, and then concatenate the saved ones back onto the original

In [166]:
pct = tract[['pct_Not_HS_Grad_ACS_09_13', 'pct_Born_foreign_ACS_09_13', 
          'pct_Born_US_ACS_09_13', 'pct_PUB_ASST_INC_ACS_09_13',
            'pct_TwoPHealth_Ins_ACS_09_13', 'pct_One_Health_Ins_ACS_09_13',
          'pct_Prs_Blw_Pov_Lev_ACS_09_13', 'pct_College_ACS_09_13',
            'pct_Not_HS_Grad_ACS_09_13', 'pct_Males_CEN_2010', 'pct_Females_CEN_2010', 
          'pct_Pop_Under_5_CEN_2010', 'pct_Pop_5_17_CEN_2010',
            'pct_Pop_18_24_CEN_2010', 'pct_Pop_25_44_CEN_2010', 'pct_Pop_45_64_CEN_2010', 
          'pct_Pop_65plus_CEN_2010', 'pct_No_Health_Ins_ACS_09_13',
            'pct_Hispanic_CEN_2010', 'pct_NH_White_alone_CEN_2010', 'pct_NH_Blk_alone_CEN_2010',
             'pct_NH_Asian_alone_CEN_2010', 'pct_NH_AIAN_alone_CEN_2010' ,'pct_NH_NHOPI_alone_CEN_2010',
            'pct_NH_SOR_alone_CEN_2010']]

In [167]:
tract = tract[tract.columns.drop(list(tract.filter(regex='pct')))] #dropping every pct column; the ones worth keeping are saved in pct

In [168]:
tract.head()

Unnamed: 0,State_name,County_name,Tract,URBANIZED_AREA_POP_CEN_2010,URBAN_CLUSTER_POP_CEN_2010,RURAL_POP_CEN_2010,Tot_Population_CEN_2010,Males_CEN_2010,Females_CEN_2010,Pop_under_5_CEN_2010,...,PUB_ASST_INC_ACS_09_13,Med_HHD_Inc_ACS_09_13,Aggregate_HH_INC_ACS_09_13,Med_House_value_ACS_09_13,Aggr_House_Value_ACS_09_13,avg_Tot_Prns_in_HHD_CEN_2010,avg_Tot_Prns_in_HHD_ACS_09_13,avg_Agg_HH_INC_ACS_09_13,avg_Agg_House_Value_ACS_09_13,has_superfund
0,Alabama,Autauga County,20100,1594.0,0.0,318.0,1912.0,947.0,965.0,118.0,...,0.0,"$63,030","$40,417,700","$124,800","$80,532,000",2.759,2.88474,"$65,613","$130,734",0
1,Alabama,Autauga County,20200,2170.0,0.0,0.0,2170.0,1013.0,1157.0,127.0,...,6.0,"$44,019","$43,269,600","$129,200","$91,182,000",2.677,2.638655,"$51,944","$109,462",0
2,Alabama,Autauga County,20300,3373.0,0.0,0.0,3373.0,1573.0,1800.0,243.0,...,40.0,"$43,201","$64,824,500","$113,800","$123,226,000",2.6855,2.621856,"$56,222","$106,874",0
3,Alabama,Autauga County,20400,4386.0,0.0,0.0,4386.0,2083.0,2303.0,234.0,...,17.0,"$54,730","$114,414,700","$130,500","$208,002,200",2.547,2.516,"$65,380","$118,858",0
4,Alabama,Autauga County,20500,10762.0,0.0,4.0,10766.0,5127.0,5639.0,729.0,...,196.0,"$65,132","$300,798,700","$177,000","$472,779,100",2.5931,2.59826,"$72,709","$114,281",0


In [169]:
pct.head()

Unnamed: 0,pct_Not_HS_Grad_ACS_09_13,pct_Born_foreign_ACS_09_13,pct_Born_US_ACS_09_13,pct_PUB_ASST_INC_ACS_09_13,pct_TwoPHealth_Ins_ACS_09_13,pct_One_Health_Ins_ACS_09_13,pct_Prs_Blw_Pov_Lev_ACS_09_13,pct_College_ACS_09_13,pct_Not_HS_Grad_ACS_09_13.1,pct_Males_CEN_2010,...,pct_Pop_45_64_CEN_2010,pct_Pop_65plus_CEN_2010,pct_No_Health_Ins_ACS_09_13,pct_Hispanic_CEN_2010,pct_NH_White_alone_CEN_2010,pct_NH_Blk_alone_CEN_2010,pct_NH_Asian_alone_CEN_2010,pct_NH_AIAN_alone_CEN_2010,pct_NH_NHOPI_alone_CEN_2010,pct_NH_SOR_alone_CEN_2010
0,19.277108,0.276549,99.723451,0.0,16.371681,70.85177,10.523354,26.678141,19.277108,49.53,...,27.62,11.56,11.061947,2.3,83.73,11.35,0.73,0.68,0.0,0.05
1,23.149394,3.057325,96.942675,0.720288,21.061571,57.876858,15.377616,20.659489,23.149394,46.68,...,24.06,9.86,11.847134,3.46,38.89,55.94,0.23,0.23,0.0,0.14
2,11.432571,3.990841,96.009159,3.469211,22.473013,64.115146,13.897181,17.405506,11.432571,46.64,...,24.67,13.02,12.136081,2.58,75.24,19.18,0.5,0.27,0.15,0.21
3,10.25557,2.566432,97.433568,0.971429,26.186691,58.664547,2.952532,24.115334,10.25557,47.49,...,25.56,20.61,14.217579,1.94,91.88,4.35,0.41,0.25,0.07,0.02
4,4.369356,2.792369,97.207631,4.737733,14.874205,73.956317,7.803468,39.20769,4.369356,47.62,...,22.51,10.46,7.354161,3.3,78.38,13.17,2.74,0.41,0.06,0.11


In [170]:
tract = pd.concat([tract, pct], axis=1)
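
Note that the `pct` selection above lists `'pct_Not_HS_Grad_ACS_09_13'` twice, so the concatenated frame carries two columns with that name. A minimal sketch of how such repeats could be removed, using toy frames with hypothetical column names rather than the real data:

```python
import pandas as pd

left = pd.DataFrame({"has_superfund": [0, 1]})

# Selecting the same label twice yields two identically named columns.
right = pd.DataFrame({"pct_a": [1.0, 2.0], "pct_b": [3.0, 4.0]})[["pct_a", "pct_b", "pct_a"]]

combined = pd.concat([left, right], axis=1)

# Index.duplicated() marks every repeat after the first occurrence.
deduped = combined.loc[:, ~combined.columns.duplicated()]

print(list(combined.columns))
print(list(deduped.columns))
```

Applying the same mask to `tract` would reduce its column count by one for each repeated name.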

In [171]:
(tract.isnull().sum()/tract.shape[0]).sort_values(ascending=False).head(30)

avg_Agg_House_Value_ACS_09_13    0.039178
Aggr_House_Value_ACS_09_13       0.039178
Med_House_value_ACS_09_13        0.024385
pct_Not_HS_Grad_ACS_09_13        0.021643
pct_Not_HS_Grad_ACS_09_13        0.021643
Med_HHD_Inc_ACS_09_13            0.013780
Aggregate_HH_INC_ACS_09_13       0.013780
avg_Agg_HH_INC_ACS_09_13         0.013780
Not_HS_Grad_ACS_09_13            0.013037
US_Cit_Nat_ACS_09_13             0.013037
NON_US_Cit_ACS_09_13             0.013037
avg_Tot_Prns_in_HHD_ACS_09_13    0.012118
pct_PUB_ASST_INC_ACS_09_13       0.012118
pct_Prs_Blw_Pov_Lev_ACS_09_13    0.011375
pct_College_ACS_09_13            0.009659
pct_Born_US_ACS_09_13            0.009484
pct_No_Health_Ins_ACS_09_13      0.009484
pct_Born_foreign_ACS_09_13       0.009484
pct_One_Health_Ins_ACS_09_13     0.009484
pct_TwoPHealth_Ins_ACS_09_13     0.009484
avg_Tot_Prns_in_HHD_CEN_2010     0.009470
pct_NH_NHOPI_alone_CEN_2010      0.008038
pct_NH_SOR_alone_CEN_2010        0.008038
pct_Males_CEN_2010               0

In [172]:
tract.columns #these are the features that I want. Let's go deeper!

Index(['State_name', 'County_name', 'Tract', 'URBANIZED_AREA_POP_CEN_2010',
       'URBAN_CLUSTER_POP_CEN_2010', 'RURAL_POP_CEN_2010',
       'Tot_Population_CEN_2010', 'Males_CEN_2010', 'Females_CEN_2010',
       'Pop_under_5_CEN_2010', 'Pop_5_17_CEN_2010', 'Pop_18_24_CEN_2010',
       'Pop_25_44_CEN_2010', 'Pop_45_64_CEN_2010', 'Pop_65plus_CEN_2010',
       'Hispanic_CEN_2010', 'NH_White_alone_CEN_2010', 'NH_Blk_alone_CEN_2010',
       'NH_Blk_alone_ACS_09_13', 'NH_AIAN_alone_CEN_2010',
       'NH_AIAN_alone_ACS_09_13', 'NH_Asian_alone_CEN_2010',
       'NH_Asian_alone_ACS_09_13', 'NH_NHOPI_alone_CEN_2010',
       'NH_NHOPI_alone_ACS_09_13', 'NH_SOR_alone_CEN_2010',
       'NH_SOR_alone_ACS_09_13', 'Pop_5yrs_Over_ACS_09_13',
       'Othr_Lang_ACS_09_13', 'Not_HS_Grad_ACS_09_13', 'College_ACS_09_13',
       'Pov_Univ_ACS_09_13', 'Prs_Blw_Pov_Lev_ACS_09_13',
       'One_Health_Ins_ACS_09_13', 'Two_Plus_Health_Ins_ACS_09_13',
       'No_Health_Ins_ACS_09_13', 'Civ_labor_16plus_ACS_09_13

In [173]:
tract = tract.drop(['URBANIZED_AREA_POP_CEN_2010',
       'URBAN_CLUSTER_POP_CEN_2010', 'RURAL_POP_CEN_2010',
       'Tot_Population_CEN_2010', 'Males_CEN_2010', 'Females_CEN_2010',
       'Pop_under_5_CEN_2010', 'Pop_5_17_CEN_2010', 'Pop_18_24_CEN_2010',
       'Pop_25_44_CEN_2010', 'Pop_45_64_CEN_2010', 'Pop_65plus_CEN_2010',
       'US_Cit_Nat_ACS_09_13',
       'NON_US_Cit_ACS_09_13', 'HHD_Moved_in_ACS_09_13', 'Othr_Lang_ACS_09_13','Civ_labor_16plus_ACS_09_13',
       'Civ_emp_16plus_ACS_09_13', 'Civ_unemp_16plus_ACS_09_13', 'Pop_1yr_Over_ACS_09_13', 'Diff_HU_1yr_Ago_ACS_09_13'
                   ],axis=1)

In [174]:
tract = tract.drop(['Pop_5yrs_Over_ACS_09_13',
       'Not_HS_Grad_ACS_09_13', 'College_ACS_09_13', 'Pov_Univ_ACS_09_13',
       'Prs_Blw_Pov_Lev_ACS_09_13', 'One_Health_Ins_ACS_09_13',
       'Two_Plus_Health_Ins_ACS_09_13', 'No_Health_Ins_ACS_09_13'],axis=1)

In [175]:
tract.shape

(74021, 64)

In [176]:
tract.columns

Index(['State_name', 'County_name', 'Tract', 'Hispanic_CEN_2010',
       'NH_White_alone_CEN_2010', 'NH_Blk_alone_CEN_2010',
       'NH_Blk_alone_ACS_09_13', 'NH_AIAN_alone_CEN_2010',
       'NH_AIAN_alone_ACS_09_13', 'NH_Asian_alone_CEN_2010',
       'NH_Asian_alone_ACS_09_13', 'NH_NHOPI_alone_CEN_2010',
       'NH_NHOPI_alone_ACS_09_13', 'NH_SOR_alone_CEN_2010',
       'NH_SOR_alone_ACS_09_13', 'Civ_labor_16_24_ACS_09_13',
       'Civ_emp_16_24_ACS_09_13', 'Civ_unemp_16_24_ACS_09_13',
       'Civ_labor_25_44_ACS_09_13', 'Civ_emp_25_44_ACS_09_13',
       'Civ_unemp_25_44_ACS_09_13', 'Civ_labor_45_64_ACS_09_13',
       'Civ_emp_45_64_ACS_09_13', 'Civ_unemp_45_64_ACS_09_13',
       'Civ_labor_65plus_ACS_09_13', 'Civ_emp_65plus_ACS_09_13',
       'Civ_unemp_65plus_ACS_09_13', 'Born_US_ACS_09_13',
       'Born_foreign_ACS_09_13', 'PUB_ASST_INC_ACS_09_13',
       'Med_HHD_Inc_ACS_09_13', 'Aggregate_HH_INC_ACS_09_13',
       'Med_House_value_ACS_09_13', 'Aggr_House_Value_ACS_09_13',
       

In [177]:
tract.to_csv('Feature Elimination 1.csv', index=False)

# Let's clean the data and then let's explore this data!
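
One cleaning step the tables above hint at: the income and house-value columns are stored as strings such as `"$63,030"`, so they would need converting before any numeric analysis. The sketch below shows one possible way to do that; the helper name `dollars_to_float` is an assumption, and the sample values are copied from the `tract.head()` output with a missing value added for illustration.

```python
import pandas as pd

def dollars_to_float(series):
    """Strip '$' and ',' from currency strings and convert to numbers."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")

sample = pd.Series(["$63,030", "$44,019", None], name="Med_HHD_Inc_ACS_09_13")
converted = dollars_to_float(sample)
print(converted)
```

Missing entries stay missing after conversion, so the null counts shown earlier would be unaffected.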