# Data sources:

## Observed dependent variable:
* Drug-induced mortality per county for 2016
* Stretch goal: opioid-induced mortality per county for 2016
* Second stretch goal: opioid-induced mortality per county over 5 years, 2011-2016

## Independent variables/regressors:

### Medical:
* Age of each state's electronic prescription drug monitoring program (PDMP)
* Number of methadone clinics per county
* Prescribing map: rate of opioid prescriptions per 100 US residents

### Legislative:
* Number of bills per state that contain the word 'opioid'

### Economic:
* Poverty
* Percent unemployment
* Median salary

### Demographic:
* Median age
* Percentage minority
* Education - percentage high school grad or higher

Let's start with the easiest scraping: the economic and demographic data.

The more thorough approach would be to use the US Census API to pull the six economic and demographic variables listed above, but as an intermediate step I'm going to scrape Census Reporter (censusreporter.org) first, since those pages are much easier to work with.
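As a rough sketch of the Census API route mentioned above (not what this notebook actually does), the helpers below build the query parameters for a county-level request and reshape the JSON reply. The endpoint (2016 ACS 5-year estimates), the variable code `B01002_001E` (which I believe is median age), and the helper names are all assumptions to check against the Census API documentation.

```python
# Assumed endpoint: 2016 ACS 5-year estimates. Verify before relying on it.
ACS_URL = "https://api.census.gov/data/2016/acs/acs5"


def build_acs_params(variables):
    """Query parameters requesting `variables` for every county in every state."""
    return {
        "get": ",".join(["NAME", *variables]),
        "for": "county:*",
        "in": "state:*",
    }


def acs_rows_to_records(rows):
    """Convert the API's list-of-lists reply (header row first) into dicts."""
    header, *data = rows
    records = [dict(zip(header, values)) for values in data]
    for record in records:
        # 2-digit state code + 3-digit county code = 5-digit county identifier.
        record["INCIT"] = record["state"] + record["county"]
    return records


# Usage (needs network access; B01002_001E is assumed to be median age):
# reply = requests.get(ACS_URL, params=build_acs_params(["B01002_001E"]))
# reply.raise_for_status()
# acs_df = pd.DataFrame(acs_rows_to_records(reply.json()))
```

Returning plain dicts keeps the parsing step independent of pandas, and the combined `INCIT` key is meant to line up with the county codes scraped from Wikipedia.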

First, get a list of all counties, along with their states and INCITS codes, from this page: https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county_equivalents

In [34]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from bs4 import BeautifulSoup

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county_equivalents"
response = requests.get(url)
response.raise_for_status()  # stop here if the request failed
page = response.text
soup = BeautifulSoup(page, "html5lib")

In [39]:
with open("test.txt", "w") as file:
    file.write(str(soup))

In [140]:
# Grab the first wikitable on the page and collect its rows.
table = soup.select_one('table[class*="wikitable"]')
rows = table.find_all("tr")

In [141]:
rows = rows[1:]  # skip the header row

# Collect one dict per table row, then build the DataFrame in a single step.
# (DataFrame.append was removed in pandas 2.0 and was slow inside a loop.)
records = []
for row in rows:
    elements = [element.text.strip() for element in row.find_all("td")]
    records.append({'INCIT': elements[0], 'county_name': elements[1], 'state': elements[2]})

county_df = pd.DataFrame(records, columns=['INCIT', 'county_name', 'state'])

In [142]:
county_df.head()

| | INCIT | county_name | state |
|---|---|---|---|
| 0 | 1001 | Autauga County | Alabama |
| 1 | 1003 | Baldwin County | Alabama |
| 2 | 1005 | Barbour County | Alabama |
| 3 | 1007 | Bibb County | Alabama |
| 4 | 1009 | Blount County | Alabama |
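
The codes as displayed above have no leading zeros (`1001` rather than `01001`). INCITS county codes are five digits: two for the state and three for the county. If the codes need to match another dataset later, something like the following would restore the padding. `pad_incits` is a hypothetical helper, not part of the notebook.

```python
def pad_incits(code):
    """Return a county code as a 5-character, zero-padded string."""
    return str(code).strip().zfill(5)


# Usage on the DataFrame built above:
# county_df['INCIT'] = county_df['INCIT'].map(pad_incits)
```

Keeping the codes as strings rather than integers avoids losing the leading zero again on a later save and reload.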


In [143]:
county_df['state'].unique()

array(['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'District of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaiʻi',
       'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky',
       'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska',
       'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Northern Mariana Islands',
       'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'U.S. Minor Outlying Islands', 'Utah', 'Vermont',
       'Virgin Islands (U.S.)', 'Virginia', 'Washington', 'West Virginia',
       'Wisconsin', 'Wyoming'], dtype=object)

In [144]:
not_states = ['American Samoa', 'Guam', 'Northern Mariana Islands', 'Puerto Rico', 'U.S. Minor Outlying Islands', 'Virgin Islands (U.S.)']

In [145]:
b = county_df[county_df.state != 'American Samoa']

In [146]:
b['state'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Guam', 'Hawaiʻi', 'Idaho', 'Illinois',
       'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
       'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Northern Mariana Islands',
       'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'U.S. Minor Outlying Islands', 'Utah', 'Vermont',
       'Virgin Islands (U.S.)', 'Virginia', 'Washington', 'West Virginia',
       'Wisconsin', 'Wyoming'], dtype=object)

In [147]:
b = county_df[~county_df.state.isin(not_states)]
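
An alternative to maintaining the `not_states` list by hand, sketched here under the assumption that the `INCIT` column holds standard county codes: the first two digits are the state code, and the territories in this table use state codes of 60 and above (American Samoa 60, Guam 66, Northern Mariana Islands 69, Puerto Rico 72, U.S. Minor Outlying Islands 74, U.S. Virgin Islands 78), while the 50 states and DC all fall at 56 or below. `is_state_or_dc` is a hypothetical helper.

```python
def is_state_or_dc(code):
    """True if a county code belongs to one of the 50 states or DC."""
    # Pad first so that a code stored as 1001 is read as state 01, not state 10.
    state_code = int(str(code).strip().zfill(5)[:2])
    return 1 <= state_code <= 56


# Usage, equivalent in intent to the isin() filter above:
# b = county_df[county_df['INCIT'].map(is_state_or_dc)]
```

This swaps name matching for code matching, so it does not depend on how the page spells a territory's name.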

In [148]:
b.head()

| | INCIT | county_name | state |
|---|---|---|---|
| 0 | 1001 | Autauga County | Alabama |
| 1 | 1003 | Baldwin County | Alabama |
| 2 | 1005 | Barbour County | Alabama |
| 3 | 1007 | Bibb County | Alabama |
| 4 | 1009 | Blount County | Alabama |


In [151]:
b.shape

(3142, 3)
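
The filtered table has 3,142 rows and three columns, covering the 50 states plus the District of Columbia. A small check like the one below could record those numbers so that a change in the scraped page is noticed on a rerun. The function name and its defaults are illustrative assumptions taken from the output above.

```python
import pandas as pd


def check_county_table(df, expected_rows=3142, expected_states=51):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    n_states = df['state'].nunique()
    if n_states != expected_states:
        problems.append(f"expected {expected_states} states, found {n_states}")
    if df['INCIT'].duplicated().any():
        problems.append("duplicate INCIT codes")
    return problems


# Usage on the filtered table:
# assert not check_county_table(b)
```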