# Project 2 Data sources:

## Observed dependent variable:
* Drug-induced mortality per county for 2016
* reach: opioid-induced mortality per county for 2016
* double-reach: opioid-induced mortality per county over 5 years, 2011 - 2016

## Independent variables/regressors:

### Medical:
* Age of electronic PDMP per state
* Number of methadone clinics per county
* Prescribing map -> rate of opioid prescriptions per 100 US residents

### Legislative:
* Number of bills per state that contain the word 'opioid'

### Economic:
* Poverty
* Percent unemployment
* Median salary

### Demographic:
* Median age
* Percentage minority
* Education - percentage high school grad or higher

Websites for data sources:
* https://wonder.cdc.gov/
* https://www.bls.gov/lau/data.htm
* https://censusreporter.org/

### Economic and Demographic data

Let's start with the easiest scraping first, which is going to be economic and demographic data.

The more thorough approach would be to use the US Census API to get the 6 variables I specified, but as an intermediate step I'm going to scrape censusreporter.org first, since those pages are much easier to work with.
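For reference, the Census API route might look roughly like the sketch below. It only builds a query URL and reshapes a response, so nothing is fetched here. The endpoint path, the variable codes `B01002_001E` (median age) and `B19013_001E` (median household income), and the helper names `build_acs_query` and `rows_to_records` are assumptions for illustration and should be checked against the Census API documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint for the ACS 2016 5-year data set.
ACS_ENDPOINT = "https://api.census.gov/data/2016/acs/acs5"


def build_acs_query(variables, state_fips):
    """Build a URL asking for `variables` for every county in one state."""
    params = {
        "get": ",".join(["NAME"] + list(variables)),
        "for": "county:*",
        "in": "state:{}".format(state_fips),
    }
    # Keep ',', ':' and '*' unescaped so the URL stays readable.
    return "{}?{}".format(ACS_ENDPOINT, urlencode(params, safe=",:*"))


def rows_to_records(rows):
    """Turn a header-plus-rows JSON payload into a list of dicts.

    Assumes the first row is the header and that 'state' and 'county'
    columns are present, so the 5-digit county code can be rebuilt.
    """
    header, *data = rows
    records = []
    for row in data:
        record = dict(zip(header, row))
        record["INCITS"] = record["state"] + record["county"]
        records.append(record)
    return records
```

The resulting URL could be passed to `requests.get(...).json()` and the parsed payload handed to `rows_to_records`. A key may be needed for heavier use.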

First, get a list of all of the counties along with their states and INCITS codes from this page: https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county_equivalents

In [1]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re

from bs4 import BeautifulSoup

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county_equivalents"
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, "html5lib")

In [3]:
table = soup.select_one('table[class*="wikitable"]')
rows = [row for row in table.find_all("tr")]

In [4]:
# Remove the first <tr> element, which is the header row for the wikitable.
rows = rows[1:]

# Collect one record per table row, then build the data frame in a single step.
records = []

for row in rows:
    elements = [element.text.strip() for element in row.find_all("td")]
    records.append({'INCITS': elements[0], 'county_name': elements[1], 'state': elements[2]})

county_df = pd.DataFrame(records, columns=['INCITS', 'county_name', 'state'])

In [5]:
county_df.head()

Unnamed: 0,INCITS,county_name,state
0,01001,Autauga County,Alabama
1,01003,Baldwin County,Alabama
2,01005,Barbour County,Alabama
3,01007,Bibb County,Alabama
4,01009,Blount County,Alabama


In [6]:
county_df['state'].unique()

array(['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'District of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaiʻi',
       'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky',
       'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska',
       'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Northern Mariana Islands',
       'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'U.S. Minor Outlying Islands', 'Utah', 'Vermont',
       'Virgin Islands (U.S.)', 'Virginia', 'Washington', 'West Virginia',
       'Wisconsin', 'Wyoming'], dtype=object)

In [7]:
# There are 'states' included that are not actually states. Let's make a list of them.
not_states = ['American Samoa', 'Guam', 'Northern Mariana Islands', 'Puerto Rico', 'U.S. Minor Outlying Islands', 'Virgin Islands (U.S.)']

In [8]:
# Now, let's remove all of the entries for these 'not states' from the data frame.
county_df = county_df[~county_df.state.isin(not_states)]

In [9]:
# Confirm that there are only states and the District of Columbia left.

county_df['state'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaiʻi', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

Now, let's use censusreporter.org to get the census data we are interested in. This website provides county profiles based on the ACS (American Community Survey) 2016 5-year data, queried from the US Census.

In [10]:
census_reporter_query = "https://censusreporter.org/profiles/05000US{}"

mini_county_df = county_df.iloc[0:3]

for incit in mini_county_df['INCITS']:
    response = requests.get(census_reporter_query.format(incit))
    page = response.text
    soup = BeautifulSoup(page, "html5lib")
    county_name = soup.select_one("h1")
    print(county_name.text)

Autauga County, AL
Baldwin County, AL
Barbour County, AL
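The profile URL only resolves when the county code keeps its leading zero (`01005`, not `1005`), which is easy to lose if the codes are ever round-tripped through a CSV file or cast to integers. A minimal sketch of a guard for that, assuming a hypothetical helper named `profile_url` (not part of the notebook above):

```python
CENSUS_REPORTER_PROFILE = "https://censusreporter.org/profiles/05000US{}"


def profile_url(county_code):
    """Return the profile URL for a county, restoring any lost leading zeros."""
    code = str(county_code).strip().zfill(5)
    if len(code) != 5 or not code.isdigit():
        raise ValueError("expected a 5-digit county code, got {!r}".format(county_code))
    return CENSUS_REPORTER_PROFILE.format(code)
```

With this in place, the loop body could call `requests.get(profile_url(incit))` instead of formatting the URL directly.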


In [56]:
# Let's pull the relevant 6 parameters for Barbour County (the last page fetched in the loop above).
# For economic parameters, I want % poverty, % unemployment, and median salary.
# For demographics, I want median age, percentage minority, and education (percentage high school grad).
# We can go ahead and print all of the stats within any <a> element whose class contains 'stat'.

whitespace = r'\s+'
non_numeric = r'[a-zA-Z$,%]+'

stats = soup.select('a[class*="stat"]')
for stat in stats:
    val = stat.select_one('span[class="value"]').text
    val = re.sub(whitespace, "", val)  # There's a lot of weird whitespace in the class=value element, so strip all of it out.

    name = stat.select_one('span[class="name"]').text.strip()

    print("{} is {}".format(name, val))

    val2 = val.split("±")[0]
    print(re.sub(non_numeric, "", val2))  # Note that a missing value ('N/A') is reduced to '/' here.

# The code above gets % poverty (index 3) and median household income (index 2), which covers 2/3 economic parameters,
# and median age (index 0) and education (index 11), which covers 2/3 demographic parameters.
# To get percent unemployment by county, I will have to use the Bureau of Labor Statistics instead.

Median age is 38.7±0.6
38.7
Per capita income is $17,249±$822
17249
Median household income is $33,956±$2,655
33956
Persons below poverty line is 26.4%±2.7%(6,235±636)
26.4
Mean travel time to work is 23.7minutes±1.7(205,890±17,933)
23.7
Number of households is 9,122±286
9122
Persons per household is 2.6±0.1(23,682±244)
2.6
Women 15-50 who gave birth during past year is 5%±1.8%(252±92)
5
Number of housing units is 11,802±101
11802
Median value of owner-occupied housing units is $90,300±$7,258
90300
Moved since previous year is 12.4%±1.4%(3,262±373.4)
12.4
High school grad or higher is 73.8%±3.1%(13,563±573.2)
73.8
Bachelor's degree or higher is 12.9%±1.5%(2,366±270.8)
12.9
Persons with language other than English spoken at home is N/A
/
Foreign-born population is 2.9%±0.3%(761±82)
2.9
Population with veteran status is 8.4%±0.8%(1,751±173)
8.4
Total veterans is 1,751
1751
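The output above shows one rough edge in the cleanup: `N/A` is reduced to `/`, which would later fail a numeric conversion. A minimal sketch of a stricter cleaner, assuming a hypothetical helper named `clean_stat_value` that returns `None` when no number is present:

```python
import re
from typing import Optional

# First number in the string, allowing thousands separators and a decimal part.
_NUMBER = re.compile(r'-?\d[\d,]*(?:\.\d+)?')


def clean_stat_value(raw: str) -> Optional[float]:
    """Extract the main value from a stat string, dropping units and the margin of error."""
    text = re.sub(r'\s+', '', raw)   # remove all whitespace
    main = text.split('±')[0]        # keep only the part before the margin of error
    match = _NUMBER.search(main)
    if match is None:                # e.g. 'N/A'
        return None
    return float(match.group().replace(',', ''))
```

Returning a float (or `None`) rather than a string means the values can go into the data frame without a separate type conversion step.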


In [80]:
census_reporter_dict = ({"Median age": "median_age", 
                         "Median household income": "median_hh_income", 
                         "Persons below poverty line": "poverty_percent",
                         "High school grad or higher": "hs_percent"
                        })

census_reporter_dict.values()

dict_values(['median_age', 'median_hh_income', 'poverty_percent', 'hs_percent'])

In [81]:
cols = ['INCITS']
cols.extend(list(census_reporter_dict.values()))
print(cols)
census_reporter_df = pd.DataFrame(columns = cols)
census_reporter_df

['INCITS', 'median_age', 'median_hh_income', 'poverty_percent', 'hs_percent']


Unnamed: 0,INCITS,median_age,median_hh_income,poverty_percent,hs_percent


In [87]:
stats = soup.select('a[class*="stat"]')
new_dict = {}
new_dict['INCITS'] = incit
for stat in stats:
    val = stat.select_one('span[class="value"]').text
    val = re.sub(r'\s+', "", val) # There's a lot of weird whitespace in the class=value element, so strip all of it out.
    
    val = val.split("±")[0] # A few of these stats have +/- ranges on them; we only want the main value, so split and get the first element.
    val = re.sub(r'[a-zA-Z$,%]+', "", val) # Strip out all the letters, dollar signs, commas, and percentage signs as well.
    
    name = stat.select_one('span[class="name"]').text.strip()
    
    if name in census_reporter_dict:
        new_dict[census_reporter_dict[name]] = val

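print(new_dict)
# Add the new row and keep the result (pd.concat returns a new frame rather than modifying one in place).
census_reporter_df = pd.concat([census_reporter_df, pd.DataFrame([new_dict])], ignore_index=True)
census_reporter_df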

{'INCITS': '01005', 'median_age': '38.7', 'median_hh_income': '33956', 'poverty_percent': '26.4', 'hs_percent': '73.8'}


Unnamed: 0,INCITS,median_age,median_hh_income,poverty_percent,hs_percent
0,01005,38.7,33956,26.4,73.8
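Extending this from one county to all of them mostly means repeating the fetch-and-parse step in a loop. A minimal sketch of that loop, assuming a hypothetical `fetch_row` callable that takes a county code and returns a dict like `new_dict` above. The pause between requests and the failure list are my own additions, not something the site documents as required.

```python
import time


def collect_rows(codes, fetch_row, delay_seconds=1.0):
    """Call fetch_row for each county code, pausing between calls.

    Returns (rows, failed): the dicts that were fetched successfully and a
    list of (code, error) pairs for the counties that raised an exception.
    """
    rows, failed = [], []
    for code in codes:
        try:
            rows.append(fetch_row(code))
        except Exception as exc:  # keep going if one county page is broken
            failed.append((code, repr(exc)))
        time.sleep(delay_seconds)
    return rows, failed
```

The collected rows can then be turned into a data frame in one step with `pd.DataFrame(rows, columns=cols)`, which avoids growing the frame one row at a time.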
