# Overview

This is a scratch notebook where I did much of my data exploration: finding column data types, creating subsets of the data, combining in the Texas Department of State Health Services overdose death data, etc.

## Washington Post Data

Source: [Washington Post DEA Database](https://www.washingtonpost.com/graphics/2019/investigations/dea-pain-pill-database/)

In [1]:
# Imports

# Data import and manipulation
import pandas as pd
# Math
import numpy as np
# Let's go ahead and seed the notebook, for reproducibility
np.random.seed(113)

import utils

In [2]:
# After a brief exploration, these are the datatypes per column of the
# Washington Post dataset. Making them explicit here for speed.
# Note: this still takes forever to load
dtypes = {
    "REPORTER_DEA_NO" : "object",
    "REPORTER_BUS_ACT" : "object",
    "REPORTER_NAME" : "object",
    "REPORTER_ADDL_CO_INFO" : "object",
    "REPORTER_ADDRESS1" : "object",
    "REPORTER_ADDRESS2" : "object",
    "REPORTER_CITY" : "object",
    "REPORTER_STATE" : "object",
    "REPORTER_ZIP" : "int64",
    "REPORTER_COUNTY" : "object",
    "BUYER_DEA_NO" : "object",
    "BUYER_BUS_ACT" : "object",
    "BUYER_NAME" : "object",
    "BUYER_ADDL_CO_INFO" : "object",
    "BUYER_ADDRESS1" : "object",
    "BUYER_ADDRESS2" : "object",
    "BUYER_CITY" : "object",
    "BUYER_STATE" : "object",
    "BUYER_ZIP" : "int64",
    "BUYER_COUNTY" : "object",
    "TRANSACTION_CODE" : "object",
    "DRUG_CODE" : "int64",
    "NDC_NO" : "object",
    "DRUG_NAME" : "object",
    "QUANTITY" : "float64",
    "UNIT" : "float64",
    "ACTION_INDICATOR" : "object",
    "ORDER_FORM_NO" : "object",
    "CORRECTION_NO" :  "float64",
    "STRENGTH" : "float64",
    "TRANSACTION_DATE" : "int64",
    "CALC_BASE_WT_IN_GM" : "float64",
    "DOSAGE_UNIT" : "float64",
    "TRANSACTION_ID" : "int64",
    "Product_Name" : "object",
    "Ingredient_Name" : "object",
    "Measure" : "object",
    "MME_Conversion_Factor" : "float64",
    "Combined_Labeler_Name" : "object",
#     "Revised_Company_Name" : "object", # was in original 2019 data
    "Reporter_family" : "object",
    "dos_str" : "float64",
    "MME" : "float64"
}
wp_data = pd.read_csv("../data/arcos-tx-statewide-itemized_downloadedjune9.csv", dtype=dtypes)
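
Since the full file is slow to load, one option is to read only the columns needed downstream and process the file in chunks. The sketch below assumes the same column names as the `dtypes` mapping above, and uses a tiny in-memory CSV with made-up rows in place of the real file so that it runs on its own. `sum_dosage_by_county` is a hypothetical helper, not something defined in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for the real ARCOS extract (values are made up)
sample_csv = io.StringIO(
    "BUYER_COUNTY,TRANSACTION_DATE,DOSAGE_UNIT,DRUG_NAME\n"
    "HARRIS,4042014,2000.0,OXYCODONE\n"
    "HARRIS,10032014,1000.0,HYDROCODONE\n"
    "TRAVIS,1012006,500.0,OXYCODONE\n"
)

def sum_dosage_by_county(source, chunksize=2):
    """Hypothetical helper: total DOSAGE_UNIT per county, read in chunks."""
    totals = {}
    reader = pd.read_csv(
        source,
        usecols=["BUYER_COUNTY", "DOSAGE_UNIT"],  # skip unused columns
        dtype={"BUYER_COUNTY": "object", "DOSAGE_UNIT": "float64"},
        chunksize=chunksize,  # yields DataFrames of at most this many rows
    )
    for chunk in reader:
        per_county = chunk.groupby("BUYER_COUNTY")["DOSAGE_UNIT"].sum()
        for county, total in per_county.items():
            totals[county] = totals.get(county, 0.0) + total
    return totals

county_totals = sum_dosage_by_county(sample_csv)
print(county_totals)
```

On the real file a much larger `chunksize` would make sense; the value of 2 here only exists to force more than one chunk on three rows.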

#### Checking for most common values, for nulls, etc:

In [3]:
wp_data.info(null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15270974 entries, 0 to 15270973
Data columns (total 42 columns):
REPORTER_DEA_NO          15270974 non-null object
REPORTER_BUS_ACT         15270974 non-null object
REPORTER_NAME            15270974 non-null object
REPORTER_ADDL_CO_INFO    1103159 non-null object
REPORTER_ADDRESS1        15270974 non-null object
REPORTER_ADDRESS2        2366808 non-null object
REPORTER_CITY            15270974 non-null object
REPORTER_STATE           15270974 non-null object
REPORTER_ZIP             15270974 non-null int64
REPORTER_COUNTY          15270974 non-null object
BUYER_DEA_NO             15270974 non-null object
BUYER_BUS_ACT            15270974 non-null object
BUYER_NAME               15270974 non-null object
BUYER_ADDL_CO_INFO       5955702 non-null object
BUYER_ADDRESS1           15270974 non-null object
BUYER_ADDRESS2           2101166 non-null object
BUYER_CITY               15270974 non-null object
BUYER_STATE              15270974 non-nu

In [4]:
# Let's try to parse through the time stamp on these transactions
wp_data["TRANSACTION_DATE"].head(10)

0     4042014
1    10032014
2    12052014
3     5232014
4     5022014
5    11052014
6     9182014
7     9152014
8     4292014
9     3272014
Name: TRANSACTION_DATE, dtype: int64

In [5]:
wp_data["TRANSACTION_DATE"].sort_values().head()

9503888    1012006
4907864    1012006
4907467    1012006
4907269    1012006
4907194    1012006
Name: TRANSACTION_DATE, dtype: int64

In [6]:
# Can see that we need to fill in leading zeros for months with 1 digit,
# so each date has 8 digits
# First need to turn that column into strings
wp_data["TRANSACTION_DATE"] = wp_data["TRANSACTION_DATE"].astype('str')
wp_data["TRANSACTION_DATE"] = wp_data["TRANSACTION_DATE"].str.zfill(8)

In [7]:
# Much better
wp_data["TRANSACTION_DATE"].head()

0    04042014
1    10032014
2    12052014
3    05232014
4    05022014
Name: TRANSACTION_DATE, dtype: object

In [8]:
# Now turning into a datetime object
wp_data["TRANSACTION_DATE"] = pd.to_datetime(wp_data["TRANSACTION_DATE"],
                                             format='%m%d%Y')

In [9]:
# Success
wp_data["TRANSACTION_DATE"].head(10)

0   2014-04-04
1   2014-10-03
2   2014-12-05
3   2014-05-23
4   2014-05-02
5   2014-11-05
6   2014-09-18
7   2014-09-15
8   2014-04-29
9   2014-03-27
Name: TRANSACTION_DATE, dtype: datetime64[ns]
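
The padding step is safe because an integer column can only lose leading zeros, so a seven-digit value must be missing the zero in front of a single-digit month. A minimal standard-library sketch of the same conversion, where `parse_arcos_date` is a hypothetical name used only for illustration:

```python
from datetime import datetime

def parse_arcos_date(raw):
    """Convert an MMDDYYYY integer (leading zero dropped) to a datetime."""
    padded = str(raw).zfill(8)  # restore the leading zero, e.g. 4042014 -> "04042014"
    return datetime.strptime(padded, "%m%d%Y")

print(parse_arcos_date(4042014).date())   # 2014-04-04
print(parse_arcos_date(10032014).date())  # 2014-10-03
print(parse_arcos_date(1012006).date())   # 2006-01-01
```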

In [10]:
# Checking how far back our data now goes
transaction_year = wp_data['TRANSACTION_DATE'].dt.year
set(transaction_year)

{2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014}

In [11]:
# WP reported 5,432,109,643 pills supplied to TX between 2006 and 2012
# Let's find the updated total (this file runs through 2014) using the
# DOSAGE_UNIT column
total_dosage = int(wp_data["DOSAGE_UNIT"].sum())
print(f"{total_dosage:,}")

7,089,413,965
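
The total above covers 2006 through 2014, while the Washington Post figure quoted in the comment covers 2006 to 2012, so the two numbers are not directly comparable. A like-for-like total could be computed by filtering on year first. The sketch uses a toy frame with made-up values rather than `wp_data`:

```python
import pandas as pd

toy = pd.DataFrame({
    "TRANSACTION_DATE": pd.to_datetime(
        ["2006-01-01", "2012-12-31", "2013-01-01", "2014-04-04"]),
    "DOSAGE_UNIT": [100.0, 200.0, 400.0, 800.0],
})

# Keep only transactions dated 2006 through 2012, inclusive
in_range = toy["TRANSACTION_DATE"].dt.year.between(2006, 2012)
total_2006_2012 = int(toy.loc[in_range, "DOSAGE_UNIT"].sum())
print(f"{total_2006_2012:,}")
```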


## Opioid-Related Death Data

Source: [Texas Department of State Health Services](http://healthdata.dshs.texas.gov/dashboard/drugs-and-alcohol/substance-related-deaths)

I used their "Data Table Builder" tool, filtering on:

- Year: 2006-2014
- Type of Substance: All Opioids
- Circumstance: All
- Occurrence/Residence: Place of Occurrence
- Geographic Category: County
- County Name: All
- Demographic Category: Total

In [86]:
dshs = pd.read_csv("../data/tx_dshs_2006_2014_all-counties.csv")

In [87]:
dshs.columns

Index(['Year', 'County Name', 'Demographic Category', 'Demographic Group',
       'Substance', 'Circumstance1', 'Geo Cat', 'Resid Place',
       '"Map count" calculation', 'County Code', 'Demographic Group (copy)',
       'Health Service', 'Map "Count" Calculation', 'Map "Rates" Calculation',
       'Rate Per 100000 Res', 'Substance (copy)', 'Sup Rate Per 100000 Res',
       'Area Pop', 'Freq', 'Number of Records', 'Public Health', 'Sup Freq',
       'Suppressed Rate Per 100000 Res'],
      dtype='object')

In [88]:
dshs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4572 entries, 0 to 4571
Data columns (total 23 columns):
Year                              4572 non-null int64
County Name                       4572 non-null object
Demographic Category              4572 non-null object
Demographic Group                 4572 non-null object
Substance                         4572 non-null object
Circumstance1                     4572 non-null object
Geo Cat                           4572 non-null object
Resid Place                       4572 non-null object
"Map count" calculation           4572 non-null object
County Code                       4572 non-null int64
Demographic Group (copy)          4572 non-null object
Health Service                    4572 non-null object
Map "Count" Calculation           4572 non-null object
Map "Rates" Calculation           4572 non-null object
Rate Per 100000 Res               0 non-null float64
Substance (copy)                  4572 non-null object
Sup Rate Per 1000

In [89]:
utils.describe_df(dshs)

Dataset Shape: (4572, 23)


Unnamed: 0,Name,dtypes,Missing,Uniques,First Value,Second Value,Last Value
0,Year,int64,0,9,2006,2006,2014
1,County Name,object,0,254,Cass,Harrison,Zavala
2,Demographic Category,object,0,1,Total,Total,Total
3,Demographic Group,object,0,1,County Total,County Total,County Total
4,Substance,object,0,1,All Opioids,All Opioids,All Opioids
5,Circumstance1,object,0,2,Accidental Poisoning,Accidental Poisoning,Accidental Poisoning
6,Geo Cat,object,0,1,County,County,County
7,Resid Place,object,0,1,Place of Occurrence,Place of Occurrence,Place of Occurrence
8,"""Map count"" calculation",object,0,100,Suppressed,Suppressed,0
9,County Code,int64,0,254,34,102,254


In [90]:
# Removing columns with all nans or all the same value
to_drop = []
for col in dshs.columns.to_list():
    if len(dshs[col].unique()) <= 1:
        to_drop.append(col)
print(to_drop)

['Demographic Category', 'Demographic Group', 'Substance', 'Geo Cat', 'Resid Place', 'Demographic Group (copy)', 'Map "Rates" Calculation', 'Rate Per 100000 Res', 'Substance (copy)', 'Sup Rate Per 100000 Res', 'Area Pop', 'Number of Records', 'Suppressed Rate Per 100000 Res']
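
The same check can be written without a loop using `DataFrame.nunique`. Passing `dropna=False` counts NaN as a value, which mirrors how `unique()` behaves in the loop above. The frame below is toy data with a few illustrative columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Year": [2006, 2007, 2008],
    "Substance": ["All Opioids"] * 3,   # constant column
    "Rate": [np.nan, np.nan, np.nan],   # all-NaN column
    "Freq": [0, 3, 12],
})

# Columns with at most one distinct value (NaN counted as a value)
n_distinct = toy.nunique(dropna=False)
constant_cols = n_distinct[n_distinct <= 1].index.tolist()
print(constant_cols)
```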


In [91]:
# Adding one of the map count columns, since they're the same
print((dshs['"Map count" calculation'] == dshs['Map "Count" Calculation']).sum())
to_drop.append('"Map count" calculation')

4572


In [92]:
dshs = dshs.drop(columns = to_drop)

In [93]:
dshs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4572 entries, 0 to 4571
Data columns (total 9 columns):
Year                       4572 non-null int64
County Name                4572 non-null object
Circumstance1              4572 non-null object
County Code                4572 non-null int64
Health Service             4572 non-null object
Map "Count" Calculation    4572 non-null object
Freq                       4572 non-null int64
Public Health              4572 non-null int64
Sup Freq                   2962 non-null float64
dtypes: float64(1), int64(4), object(4)
memory usage: 321.6+ KB


In [79]:
# Checking we have all 254 counties in TX
len(dshs['County Name'].unique())

254

In [95]:
# Renaming this weird column
dshs = dshs.rename(columns = {'Map "Count" Calculation': "Count"})

In [106]:
# Grabbing the total number of deaths represented
not_suppressed = dshs.loc[dshs['Count'] != 'Suppressed']
print(f"Total: {not_suppressed['Count'].astype(int).sum():,}")
print(f"Average: {not_suppressed['Count'].astype(int).mean():.2f}")

Total: 15,366
Average: 5.19
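
Filtering out the "Suppressed" rows means the total and average only describe the published counts; whatever the suppressed cells hide is left out. An alternative that keeps every row and makes the suppression explicit is `pd.to_numeric` with `errors="coerce"`. The values below are made up:

```python
import pandas as pd

counts = pd.Series(["Suppressed", "0", "12", "Suppressed", "3"], name="Count")

numeric = pd.to_numeric(counts, errors="coerce")  # "Suppressed" becomes NaN
n_suppressed = int(numeric.isna().sum())
known_total = int(numeric.sum())   # NaN is skipped by default
known_mean = numeric.mean()        # mean over the non-suppressed rows only

print(n_suppressed, known_total, known_mean)
```

Reporting the number of suppressed rows next to the total gives a sense of how much is missing.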


#### Population Data

Source: [Texas State Library and Archives Commission](https://www.tsl.texas.gov/ref/abouttx/population.html), which links to the Census data to download.

Note that, for each year, I am using the July county population estimates, even in years when a census was conducted. This is for consistency: since I need July estimates for the other years, I'd rather use them every year than switch to the April 2010 census count and then back to the July 2011 estimates (for example).

In [38]:
# Loading in the data for the 2000-2010 population estimates
# Defining column names and skipping the opening rows and footer rows
# that come with the Excel file
pop_2000_2010 = pd.read_excel("../data/2000-2010_Population_Estimates_TX.xls",
                              names=["COUNTY", "APR_2000", "JUL_2000", 
                                     "JUL_2001", "JUL_2002", "JUL_2003", 
                                     "JUL_2004", "JUL_2005", "JUL_2006", 
                                     "JUL_2007", "JUL_2008", "JUL_2009", 
                                     "APR_2010", "JUL_2010"],
                              skiprows=[0, 1, 2, 3], skipfooter=8)

In [39]:
pop_2000_2010.head()

Unnamed: 0,COUNTY,APR_2000,JUL_2000,JUL_2001,JUL_2002,JUL_2003,JUL_2004,JUL_2005,JUL_2006,JUL_2007,JUL_2008,JUL_2009,APR_2010,JUL_2010
0,.Anderson County,55114,55062,54263,54740,56068,56245,56873,57386,57870,57963,58410,58458,58452
1,.Andrews County,13002,12949,12856,13022,12976,13006,13016,13195,13513,14099,14601,14786,14833
2,.Angelina County,80123,80270,80273,80803,81510,82070,82553,83810,84518,84961,86029,86771,86953
3,.Aransas County,22457,22452,22287,22616,22843,23067,23561,23395,23172,23225,23291,23158,23151
4,.Archer County,8904,8966,8849,8942,9013,9078,9068,9063,9026,9104,9023,9054,9060


In [40]:
pop_2000_2010.tail(5)

Unnamed: 0,COUNTY,APR_2000,JUL_2000,JUL_2001,JUL_2002,JUL_2003,JUL_2004,JUL_2005,JUL_2006,JUL_2007,JUL_2008,JUL_2009,APR_2010,JUL_2010
249,.Wood County,36729,36811,37288,37633,38915,39600,39917,41099,41414,41722,41870,41964,42019
250,.Yoakum County,7325,7274,7299,7212,7235,7362,7404,7404,7588,7765,7908,7879,7865
251,.Young County,17872,17846,17659,17665,17892,17921,17801,18227,18135,18104,18466,18550,18559
252,.Zapata County,12081,12088,12266,12514,12584,12727,13043,13069,13388,13640,13876,14018,14070
253,.Zavala County,11645,11636,11596,11616,11513,11512,11565,11642,11657,11725,11544,11677,11724


In [41]:
pop_2000_2010.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 0 to 253
Data columns (total 14 columns):
COUNTY      254 non-null object
APR_2000    254 non-null int64
JUL_2000    254 non-null int64
JUL_2001    254 non-null int64
JUL_2002    254 non-null int64
JUL_2003    254 non-null int64
JUL_2004    254 non-null int64
JUL_2005    254 non-null int64
JUL_2006    254 non-null int64
JUL_2007    254 non-null int64
JUL_2008    254 non-null int64
JUL_2009    254 non-null int64
APR_2010    254 non-null int64
JUL_2010    254 non-null int64
dtypes: int64(13), object(1)
memory usage: 27.9+ KB


In [42]:
# All of the counties have a period at the beginning
# We want them in the format "ANDERSON" not ".Anderson County"

# Removing the dot
pop_2000_2010["COUNTY"] = pop_2000_2010["COUNTY"].str.strip('.')

# Removing " County"
pop_2000_2010["COUNTY"] = pop_2000_2010["COUNTY"].str.split(' County').str[0]

# Changing all to uppercase
pop_2000_2010["COUNTY"] = pop_2000_2010["COUNTY"].str.upper()

In [43]:
# Dropping data from before my dataset, because I won't need it
# Also dropping 2010, since I'll use the updated and hopefully more accurate
# 2010 estimates from the more recent database
pop_2006_2009 = pop_2000_2010[[
    "COUNTY", "JUL_2006", "JUL_2007", "JUL_2008", "JUL_2009"]]

In [44]:
# Much better
pop_2006_2009.head()

Unnamed: 0,COUNTY,JUL_2006,JUL_2007,JUL_2008,JUL_2009
0,ANDERSON,57386,57870,57963,58410
1,ANDREWS,13195,13513,14099,14601
2,ANGELINA,83810,84518,84961,86029
3,ARANSAS,23395,23172,23225,23291
4,ARCHER,9063,9026,9104,9023


In [46]:
# Now loading the data for 2010-2018 population estimates
pop_2010_2018 = pd.read_csv("../data/2010-2018_Population_Estimates_TX.csv",
                            header=1,
                            names=["EXT_ID", "ID", "COUNTY", "APR_2010_CEN",
                                   "APR_2010_BASE", "JUL_2010", "JUL_2011",
                                   "JUL_2012", "JUL_2013", "JUL_2014",
                                   "JUL_2015", "JUL_2016", "JUL_2017", 
                                   "JUL_2018"])

In [47]:
pop_2010_2018.head()

Unnamed: 0,EXT_ID,ID,COUNTY,APR_2010_CEN,APR_2010_BASE,JUL_2010,JUL_2011,JUL_2012,JUL_2013,JUL_2014,JUL_2015,JUL_2016,JUL_2017,JUL_2018
0,0500000US48001,48001,"Anderson County, Texas",58458,58459,58497,58394,58065,57977,57849,57646,57550,58212,58057
1,0500000US48003,48003,"Andrews County, Texas",14786,14786,14849,15388,16113,16788,17445,18083,17805,17631,18128
2,0500000US48005,48005,"Angelina County, Texas",86771,86771,86905,87295,87520,87333,87599,87874,87759,87711,87092
3,0500000US48007,48007,"Aransas County, Texas",23158,23158,23182,23214,23457,23890,24570,24815,25191,25447,23792
4,0500000US48009,48009,"Archer County, Texas",9054,9055,9112,8834,8809,8795,8837,8758,8780,8786,8786


In [48]:
# Again, need to get the counties to be just the uppercase name

# Removing " County"
pop_2010_2018["COUNTY"] = pop_2010_2018["COUNTY"].str.split(' County').str[0]

# Changing all to uppercase
pop_2010_2018["COUNTY"] = pop_2010_2018["COUNTY"].str.upper()

pop_2010_2018.head()

Unnamed: 0,EXT_ID,ID,COUNTY,APR_2010_CEN,APR_2010_BASE,JUL_2010,JUL_2011,JUL_2012,JUL_2013,JUL_2014,JUL_2015,JUL_2016,JUL_2017,JUL_2018
0,0500000US48001,48001,ANDERSON,58458,58459,58497,58394,58065,57977,57849,57646,57550,58212,58057
1,0500000US48003,48003,ANDREWS,14786,14786,14849,15388,16113,16788,17445,18083,17805,17631,18128
2,0500000US48005,48005,ANGELINA,86771,86771,86905,87295,87520,87333,87599,87874,87759,87711,87092
3,0500000US48007,48007,ARANSAS,23158,23158,23182,23214,23457,23890,24570,24815,25191,25447,23792
4,0500000US48009,48009,ARCHER,9054,9055,9112,8834,8809,8795,8837,8758,8780,8786,8786
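
Both population files get the same cleanup, so it could live in one small function to keep the two sources from drifting apart. A minimal sketch in plain Python, where `normalize_county` is a hypothetical helper name:

```python
def normalize_county(raw):
    """Return the bare, uppercase county name, e.g. 'ANDERSON'."""
    name = raw.lstrip(".")            # the 2000-2010 file prefixes names with "."
    name = name.split(" County")[0]   # drops " County" and anything after it
    return name.strip().upper()

print(normalize_county(".Anderson County"))        # ANDERSON
print(normalize_county("Anderson County, Texas"))  # ANDERSON
```

It could then be applied with something like `pop_2000_2010["COUNTY"].map(normalize_county)`.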


In [49]:
pop_2010_2012 = pop_2010_2018[["COUNTY", "JUL_2010", "JUL_2011", "JUL_2012"]]

In [50]:
pop_2010_2012.head()

Unnamed: 0,COUNTY,JUL_2010,JUL_2011,JUL_2012
0,ANDERSON,58497,58394,58065
1,ANDREWS,14849,15388,16113
2,ANGELINA,86905,87295,87520
3,ARANSAS,23182,23214,23457
4,ARCHER,9112,8834,8809


In [51]:
# And now, a dataset of all the relevant population data!
pop_data = pop_2006_2009.merge(pop_2010_2012, on="COUNTY")
# Renaming columns for ease of use, since now they're all July estimates
pop_data.rename(columns={"JUL_2006": 2006, "JUL_2007": 2007,
                         "JUL_2008": 2008, "JUL_2009": 2009,
                         "JUL_2010": 2010, "JUL_2011": 2011,
                         "JUL_2012": 2012}, inplace=True)
pop_data.head()

Unnamed: 0,COUNTY,2006,2007,2008,2009,2010,2011,2012
0,ANDERSON,57386,57870,57963,58410,58497,58394,58065
1,ANDREWS,13195,13513,14099,14601,14849,15388,16113
2,ANGELINA,83810,84518,84961,86029,86905,87295,87520
3,ARANSAS,23395,23172,23225,23291,23182,23214,23457
4,ARCHER,9063,9026,9104,9023,9112,8834,8809
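
`merge` defaults to an inner join, so a county spelled differently in the two sources would silently drop out of `pop_data`. One way to check for that is an outer join with `indicator=True`, plus `validate` to confirm each county appears once per side. The frames below are toy data; the DEWITT figures are made up:

```python
import pandas as pd

left = pd.DataFrame({"COUNTY": ["ANDERSON", "DEWITT"],
                     "JUL_2009": [58410, 20000]})
right = pd.DataFrame({"COUNTY": ["ANDERSON", "DE WITT"],
                      "JUL_2010": [58497, 20100]})

checked = left.merge(right, on="COUNTY", how="outer",
                     validate="one_to_one", indicator=True)

# Rows that were found on only one side of the join
unmatched = checked.loc[checked["_merge"] != "both", "COUNTY"].tolist()
print(sorted(unmatched))
```

An empty `unmatched` list would confirm that every county name lines up between the two files.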


#### Combining to arrive at an opioid deaths per capita figure, then deaths per 100k of population

In [31]:
dshs_for_csv.head()

Unnamed: 0,County Name,Number of Deaths,Type of Death1,Year (copy),Latitude (generated),Longitude (generated)
0,Zavala,0.0,All Deaths (Natural and Injury) where Opioids ...,2006,28.866,-99.761
1,Zapata,0.0,All Deaths (Natural and Injury) where Opioids ...,2006,26.971,-99.203
2,Young,2.5,All Deaths (Natural and Injury) where Opioids ...,2006,33.175,-98.687
3,Yoakum,0.0,All Deaths (Natural and Injury) where Opioids ...,2006,33.173,-102.829
4,Wood,2.5,All Deaths (Natural and Injury) where Opioids ...,2006,32.783,-95.407


In [32]:
# Some quick cleaning
dshs_clean = dshs_for_csv.copy()
# Only keeping the columns we want
dshs_clean.drop(columns=["Type of Death1", "Latitude (generated)", 
                         "Longitude (generated)"], inplace=True)
# Renaming the columns
dshs_clean.rename(columns={"County Name": "COUNTY",
                           "Number of Deaths": "NUM_DEATHS", 
                           "Year (copy)": "YEAR"}, inplace=True)
# Making all the county names uppercase
dshs_clean["COUNTY"] = dshs_clean["COUNTY"].str.upper()
# Making sure the values in the Number of Deaths column are floats
dshs_clean["NUM_DEATHS"] = dshs_clean["NUM_DEATHS"].astype("float")

dshs_clean.head()

Unnamed: 0,COUNTY,NUM_DEATHS,YEAR
0,ZAVALA,0.0,2006
1,ZAPATA,0.0,2006
2,YOUNG,2.5,2006
3,YOAKUM,0.0,2006
4,WOOD,2.5,2006


In [33]:
# Pivoting the table to make the columns each year
dshs_pivot = pd.pivot_table(dshs_clean, index="COUNTY",
                          columns="YEAR",
                          values="NUM_DEATHS")

In [34]:
# Removing a weird index name, leftover from the pivot
dshs_pivot.rename_axis(None, axis=1, inplace=True)

In [35]:
# Hooray, now it looks just like our population data table
dshs_pivot.head()

COUNTY,2006,2007,2008,2009,2010,2011,2012
ANDERSON,2.5,2.5,2.5,2.5,2.5,2.5,2.5
ANDREWS,0.0,0.0,0.0,2.5,0.0,0.0,0.0
ANGELINA,2.5,2.5,2.5,0.0,2.5,2.5,2.5
ARANSAS,2.5,2.5,2.5,2.5,2.5,2.5,0.0
ARCHER,0.0,0.0,2.5,0.0,2.5,2.5,0.0
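
One thing to keep in mind with `pd.pivot_table` is that its default `aggfunc` is the mean, so if a county and year pair ever appeared more than once, the values would be averaged without any warning. A quick duplicate check before pivoting guards against that. Toy data, made-up values:

```python
import pandas as pd

long = pd.DataFrame({
    "COUNTY": ["YOUNG", "YOUNG", "WOOD"],
    "YEAR": [2006, 2006, 2006],      # YOUNG/2006 appears twice
    "NUM_DEATHS": [2.0, 4.0, 1.0],
})

# Count repeated (COUNTY, YEAR) pairs before pivoting
n_dupes = int(long.duplicated(subset=["COUNTY", "YEAR"]).sum())

by_mean = pd.pivot_table(long, index="COUNTY", columns="YEAR",
                         values="NUM_DEATHS")              # default: mean
by_sum = pd.pivot_table(long, index="COUNTY", columns="YEAR",
                        values="NUM_DEATHS", aggfunc="sum")

print(n_dupes, by_mean.loc["YOUNG", 2006], by_sum.loc["YOUNG", 2006])
```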


In [59]:
pop_data.set_index("COUNTY", inplace=True)
pop_data.head()

COUNTY,2006,2007,2008,2009,2010,2011,2012
ANDERSON,57386,57870,57963,58410,58497,58394,58065
ANDREWS,13195,13513,14099,14601,14849,15388,16113
ANGELINA,83810,84518,84961,86029,86905,87295,87520
ARANSAS,23395,23172,23225,23291,23182,23214,23457
ARCHER,9063,9026,9104,9023,9112,8834,8809


In [48]:
pop_data.loc[pop_data.index == "TERRELL"]

COUNTY,2006,2007,2008,2009,2010,2011,2012
TERRELL,921,886,884,930,1011,950,921


In [49]:
dshs_pivot.loc[dshs_pivot.index == "TERRELL"]

COUNTY,2006,2007,2008,2009,2010,2011,2012
TERRELL,0.0,2.5,0.0,0.0,0.0,0.0,0.0


In [60]:
deaths_percapita = dshs_pivot / pop_data

In [51]:
deaths_percapita.describe()

Unnamed: 0,2006,2007,2008,2009,2010,2011,2012
count,254.0,254.0,254.0,254.0,254.0,254.0,254.0
mean,3.9e-05,5e-05,3.9e-05,3.3e-05,3.5e-05,3.5e-05,3e-05
std,8.4e-05,0.000197,7.3e-05,7.4e-05,7.9e-05,6.3e-05,9.4e-05
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.9e-05,4.9e-05,5.2e-05,4.3e-05,4e-05,5e-05,3.1e-05
max,0.000849,0.002822,0.000476,0.000752,0.000615,0.000414,0.00128


In [61]:
pop_100k = pop_data / 100000

In [62]:
deaths_per100k = dshs_pivot / pop_100k

In [54]:
deaths_per100k.describe()

Unnamed: 0,2006,2007,2008,2009,2010,2011,2012
count,254.0,254.0,254.0,254.0,254.0,254.0,254.0
mean,3.884246,4.954376,3.877724,3.318701,3.506903,3.528426,2.972931
std,8.368138,19.666148,7.340291,7.375099,7.930177,6.333992,9.354662
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.916587,4.854982,5.164023,4.266988,4.046553,4.984242,3.130589
max,84.860828,282.167043,47.600914,75.165364,61.546036,41.377027,128.008193


In [55]:
deaths_per100k[2007].sort_values(ascending=False).head()

COUNTY
TERRELL     282.167043
REAL         75.574365
HUDSPETH     72.212594
SUTTON       57.950858
GOLIAD       35.004201
Name: 2007, dtype: float64
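
The Terrell value follows directly from the two lookups shown earlier: 2.5 deaths recorded for 2007 against a July 2007 population estimate of 886. With so few residents, even a small count turns into a very large rate, which is worth keeping in mind when ranking counties. The arithmetic, with `rate_per_100k` as a hypothetical helper name:

```python
def rate_per_100k(deaths, population):
    """Deaths per 100,000 residents."""
    return deaths / population * 100_000

terrell_2007 = rate_per_100k(2.5, 886)
print(round(terrell_2007, 6))  # 282.167043
```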

In [12]:
wp_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15270974 entries, 0 to 15270973
Data columns (total 42 columns):
REPORTER_DEA_NO          object
REPORTER_BUS_ACT         object
REPORTER_NAME            object
REPORTER_ADDL_CO_INFO    object
REPORTER_ADDRESS1        object
REPORTER_ADDRESS2        object
REPORTER_CITY            object
REPORTER_STATE           object
REPORTER_ZIP             int64
REPORTER_COUNTY          object
BUYER_DEA_NO             object
BUYER_BUS_ACT            object
BUYER_NAME               object
BUYER_ADDL_CO_INFO       object
BUYER_ADDRESS1           object
BUYER_ADDRESS2           object
BUYER_CITY               object
BUYER_STATE              object
BUYER_ZIP                int64
BUYER_COUNTY             object
TRANSACTION_CODE         object
DRUG_CODE                int64
NDC_NO                   object
DRUG_NAME                object
QUANTITY                 float64
UNIT                     float64
ACTION_INDICATOR         object
ORDER_FORM_NO         

In [13]:
pills = wp_data[["BUYER_COUNTY", "TRANSACTION_DATE", "DOSAGE_UNIT"]].copy()

In [14]:
pills.head()

Unnamed: 0,BUYER_COUNTY,TRANSACTION_DATE,DOSAGE_UNIT
0,HARRIS,2014-04-04,2000.0
1,HARRIS,2014-10-03,2000.0
2,HARRIS,2014-12-05,1000.0
3,HARRIS,2014-05-23,1500.0
4,HARRIS,2014-05-02,2000.0


In [15]:
pills["YEAR"] = pills["TRANSACTION_DATE"].dt.year

In [16]:
pills.drop(columns="TRANSACTION_DATE", inplace=True)
pills.rename(columns={"BUYER_COUNTY": "COUNTY"}, inplace=True)

In [17]:
pills.head()

Unnamed: 0,COUNTY,DOSAGE_UNIT,YEAR
0,HARRIS,2000.0,2014
1,HARRIS,2000.0,2014
2,HARRIS,1000.0,2014
3,HARRIS,1500.0,2014
4,HARRIS,2000.0,2014


In [22]:
set(pills['COUNTY'])

{'ANDERSON',
 'ANDREWS',
 'ANGELINA',
 'ARANSAS',
 'ARCHER',
 'ARMSTRONG',
 'ATASCOSA',
 'AUSTIN',
 'BAILEY',
 'BANDERA',
 'BASTROP',
 'BAYLOR',
 'BEE',
 'BELL',
 'BEXAR',
 'BLANCO',
 'BOSQUE',
 'BOWIE',
 'BRAZORIA',
 'BRAZOS',
 'BREWSTER',
 'BROOKS',
 'BROWN',
 'BURLESON',
 'BURNET',
 'CALDWELL',
 'CALHOUN',
 'CALLAHAN',
 'CAMERON',
 'CAMP',
 'CARSON',
 'CASS',
 'CASTRO',
 'CHAMBERS',
 'CHEROKEE',
 'CHILDRESS',
 'CLAY',
 'COCHRAN',
 'COLEMAN',
 'COLLIN',
 'COLORADO',
 'COMAL',
 'COMANCHE',
 'CONCHO',
 'COOKE',
 'CORYELL',
 'CRANE',
 'CROCKETT',
 'CROSBY',
 'CULBERSON',
 'DALLAM',
 'DALLAS',
 'DAWSON',
 'DE WITT',
 'DEAF SMITH',
 'DELTA',
 'DENTON',
 'DICKENS',
 'DIMMIT',
 'DONLEY',
 'DUVAL',
 'EASTLAND',
 'ECTOR',
 'EDWARDS',
 'EL PASO',
 'ELLIS',
 'ERATH',
 'FALLS',
 'FANNIN',
 'FAYETTE',
 'FISHER',
 'FLOYD',
 'FOARD',
 'FORT BEND',
 'FRANKLIN',
 'FREESTONE',
 'FRIO',
 'GAINES',
 'GALVESTON',
 'GARZA',
 'GILLESPIE',
 'GOLIAD',
 'GONZALES',
 'GRAY',
 'GRAYSON',
 'GREGG',
 'GRIMES',
 '

In [23]:
# Alas, one specific county has a space in this dataframe, while all the 
# other datasets I have spell it without a space - let's fix that
pills["COUNTY"] = pills["COUNTY"].replace("DE WITT", "DEWITT")

In [24]:
# Pivoting the table to make the columns each year
pills_pivot = pd.pivot_table(pills, index="COUNTY", columns="YEAR",
                             values="DOSAGE_UNIT", aggfunc="sum")

In [26]:
pills_pivot

YEAR,2006,2007,2008,2009,2010,2011,2012,2013,2014
COUNTY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
ANDERSON,2209130.0,2148570.0,2296470.0,2348990.0,2445130.0,2740100.0,2672540.0,2565550.0,2444420.0
ANDREWS,246600.0,274080.0,320200.0,331510.0,367330.0,415720.0,481510.0,567050.0,521480.0
ANGELINA,3070975.0,3472800.0,4016760.0,4286080.0,4319439.0,4789710.0,4693270.0,4451230.0,4117660.0
ARANSAS,734500.0,948950.0,989600.0,1014920.0,1110790.0,1206540.0,1294570.0,1322050.0,1270980.0
ARCHER,,200.0,,,,,100.0,400.0,100.0
...,...,...,...,...,...,...,...,...,...
WOOD,2132070.0,2221790.0,2283790.0,2336510.0,2299032.0,2526840.0,2526920.0,2521410.0,2502620.0
YOAKUM,228600.0,217080.0,215250.0,263300.0,262390.0,299850.0,359190.0,392640.0,409970.0
YOUNG,1313930.0,1546490.0,1602100.0,1592630.0,1608900.0,1804590.0,1772090.0,1649920.0,1591130.0
ZAPATA,57700.0,69770.0,101130.0,101410.0,121410.0,162200.0,179160.0,170050.0,150400.0


In [None]:
pills_pivot = pills_pivot.reindex_like(pop_data)

In [27]:
# Filling nulls, when no pills were sent to that county
pills_pivot.fillna(0, inplace=True)
# Removing a weird index name, leftover from the pivot
pills_pivot.rename_axis(None, axis=1, inplace=True)

In [28]:
pills_pivot.head()

COUNTY,2006,2007,2008,2009,2010,2011,2012,2013,2014
ANDERSON,2209130.0,2148570.0,2296470.0,2348990.0,2445130.0,2740100.0,2672540.0,2565550.0,2444420.0
ANDREWS,246600.0,274080.0,320200.0,331510.0,367330.0,415720.0,481510.0,567050.0,521480.0
ANGELINA,3070975.0,3472800.0,4016760.0,4286080.0,4319439.0,4789710.0,4693270.0,4451230.0,4117660.0
ARANSAS,734500.0,948950.0,989600.0,1014920.0,1110790.0,1206540.0,1294570.0,1322050.0,1270980.0
ARCHER,0.0,200.0,0.0,0.0,0.0,0.0,100.0,400.0,100.0
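
A county with no shipments in any year would be missing from the pivot entirely, not just NaN in some years. `reindex_like` in the unexecuted cell above would add such rows, but it also conforms the columns to `pop_data`, which would drop any years that `pop_data` lacks. Reindexing on the row labels only avoids that. The frames below are toy data with made-up pill counts:

```python
import pandas as pd

pills = pd.DataFrame({2012: [100.0, None], 2013: [300.0, 50.0]},
                     index=["ANDERSON", "ARCHER"])
population = pd.DataFrame({2012: [58065, 8809, 921]},
                          index=["ANDERSON", "ARCHER", "TERRELL"])

# Conform rows only: TERRELL is added, the 2013 column is kept
pills_all = pills.reindex(population.index).fillna(0)
print(pills_all)
```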


In [29]:
# pills_pivot.to_csv("../data/tx_pills_per_year_by_county.csv")

In [73]:
pills_percapita = pills_pivot / pop_data

In [74]:
pills_percapita.describe()

Unnamed: 0,2006,2007,2008,2009,2010,2011,2012
count,254.0,254.0,254.0,254.0,254.0,254.0,254.0
mean,21.064651,24.087389,25.850747,27.449645,29.093404,32.659495,33.197048
std,14.82111,16.798077,17.151721,18.069433,18.793225,20.035172,20.283625
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,11.408081,13.224395,14.64114,16.07243,16.933764,19.514753,20.747307
50%,19.323922,21.784128,24.089366,26.064345,27.147936,32.777558,33.328704
75%,29.753164,33.5909,36.777378,38.822188,40.756245,45.414605,46.297715
max,103.736457,119.518327,88.494255,86.246615,88.626591,98.326704,96.952074


In [75]:
pills_percapita.head()

COUNTY,2006,2007,2008,2009,2010,2011,2012
ANDERSON,38.495975,37.127527,39.619585,40.215545,41.799238,46.924342,46.026694
ANDREWS,18.688897,20.282691,22.710831,22.704609,24.737693,27.015857,29.883324
ANGELINA,36.642107,41.089472,47.277692,49.821339,49.702998,54.868091,53.625114
ARANSAS,31.395597,40.952443,42.609257,43.57563,47.916056,51.97467,55.189069
ARCHER,0.0,0.022158,0.0,0.0,0.0,0.0,0.011352


In [70]:
pills_percapita.shape

(254, 7)
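
Dividing one DataFrame by another aligns on both row and column labels, so the county order in the two frames does not need to match, and labels present in only one frame come out as NaN. A toy illustration, using a few values from the tables above:

```python
import pandas as pd

pills = pd.DataFrame({2012: [2672540.0, 481510.0],
                      2013: [2565550.0, 567050.0]},
                     index=["ANDERSON", "ANDREWS"])
population = pd.DataFrame({2012: [16113, 58065]},
                          index=["ANDREWS", "ANDERSON"])  # different row order

# Rows are matched by county name; 2013 has no population, so it becomes NaN
per_capita = pills / population
print(per_capita)
```

Columns that end up entirely NaN can be removed afterwards with `per_capita.dropna(axis=1, how="all")`.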

## Visualizing

In [76]:
# Need county identification numbers, which are Federal Information Processing
# Standard (FIPS) codes - luckily, these were part of one of the population CSVs
county_id = dict(zip(pop_2010_2018["COUNTY"], pop_2010_2018["ID"]))

In [77]:
# Creating a new column for the County IDs, matched by county name
# rather than by row position
pills_percapita["COUNTY_ID"] = pills_percapita.index.map(county_id)

In [78]:
pills_percapita.head()

COUNTY,2006,2007,2008,2009,2010,2011,2012,COUNTY_ID
ANDERSON,38.495975,37.127527,39.619585,40.215545,41.799238,46.924342,46.026694,48001
ANDREWS,18.688897,20.282691,22.710831,22.704609,24.737693,27.015857,29.883324,48003
ANGELINA,36.642107,41.089472,47.277692,49.821339,49.702998,54.868091,53.625114,48005
ARANSAS,31.395597,40.952443,42.609257,43.57563,47.916056,51.97467,55.189069,48007
ARCHER,0.0,0.022158,0.0,0.0,0.0,0.0,0.011352,48009


In [79]:
pills_percapita.describe()

Unnamed: 0,2006,2007,2008,2009,2010,2011,2012,COUNTY_ID
count,254.0,254.0,254.0,254.0,254.0,254.0,254.0,254.0
mean,21.064651,24.087389,25.850747,27.449645,29.093404,32.659495,33.197048,48254.0
std,14.82111,16.798077,17.151721,18.069433,18.793225,20.035172,20.283625,146.93536
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,48001.0
25%,11.408081,13.224395,14.64114,16.07243,16.933764,19.514753,20.747307,48127.5
50%,19.323922,21.784128,24.089366,26.064345,27.147936,32.777558,33.328704,48254.0
75%,29.753164,33.5909,36.777378,38.822188,40.756245,45.414605,46.297715,48380.5
max,103.736457,119.518327,88.494255,86.246615,88.626591,98.326704,96.952074,48507.0


In [80]:
pills_percapita.reset_index(inplace=True)

In [81]:
pills_percapita.head()

Unnamed: 0,COUNTY,2006,2007,2008,2009,2010,2011,2012,COUNTY_ID
0,ANDERSON,38.495975,37.127527,39.619585,40.215545,41.799238,46.924342,46.026694,48001
1,ANDREWS,18.688897,20.282691,22.710831,22.704609,24.737693,27.015857,29.883324,48003
2,ANGELINA,36.642107,41.089472,47.277692,49.821339,49.702998,54.868091,53.625114,48005
3,ARANSAS,31.395597,40.952443,42.609257,43.57563,47.916056,51.97467,55.189069,48007
4,ARCHER,0.0,0.022158,0.0,0.0,0.0,0.0,0.011352,48009


## Creating a Better Dataframe to Export

In [30]:
pills_pivot.reset_index(inplace=True)

In [31]:
pills_pivot.head()

Unnamed: 0,COUNTY,2006,2007,2008,2009,2010,2011,2012,2013,2014
0,ANDERSON,2209130.0,2148570.0,2296470.0,2348990.0,2445130.0,2740100.0,2672540.0,2565550.0,2444420.0
1,ANDREWS,246600.0,274080.0,320200.0,331510.0,367330.0,415720.0,481510.0,567050.0,521480.0
2,ANGELINA,3070975.0,3472800.0,4016760.0,4286080.0,4319439.0,4789710.0,4693270.0,4451230.0,4117660.0
3,ARANSAS,734500.0,948950.0,989600.0,1014920.0,1110790.0,1206540.0,1294570.0,1322050.0,1270980.0
4,ARCHER,0.0,200.0,0.0,0.0,0.0,0.0,100.0,400.0,100.0


In [32]:
pills_total_melted = pd.melt(pills_pivot, id_vars=["COUNTY"],
                           var_name="YEAR", value_name="TOTAL_PILLS")

In [33]:
pills_total_melted.head()

Unnamed: 0,COUNTY,YEAR,TOTAL_PILLS
0,ANDERSON,2006,2209130.0
1,ANDREWS,2006,246600.0
2,ANGELINA,2006,3070975.0
3,ARANSAS,2006,734500.0
4,ARCHER,2006,0.0


In [34]:
pills_total_melted.to_csv("../data/tx_total_pills_melted.csv")
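
`to_csv` writes the row index as an unnamed first column by default, which shows up as `Unnamed: 0` when the file is read back. For a melted frame with a plain 0..n index, passing `index=False` keeps the file cleaner. A round trip through an in-memory buffer shows the difference:

```python
import io
import pandas as pd

melted = pd.DataFrame({"COUNTY": ["ANDERSON", "ANDREWS"],
                       "YEAR": [2006, 2006],
                       "TOTAL_PILLS": [2209130.0, 246600.0]})

with_index = io.StringIO()
melted.to_csv(with_index)                  # default: index=True
without_index = io.StringIO()
melted.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)
print(cols_without)
```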

In [88]:
pop_data.head()

COUNTY,2006,2007,2008,2009,2010,2011,2012
ANDERSON,57386,57870,57963,58410,58497,58394,58065
ANDREWS,13195,13513,14099,14601,14849,15388,16113
ANGELINA,83810,84518,84961,86029,86905,87295,87520
ARANSAS,23395,23172,23225,23291,23182,23214,23457
ARCHER,9063,9026,9104,9023,9112,8834,8809


In [89]:
pop_data.reset_index(inplace=True)
pop_melted = pd.melt(pop_data, id_vars=["COUNTY"],
                     var_name="YEAR", value_name="TOTAL_POPULATION")

In [90]:
pop_melted.head()

Unnamed: 0,COUNTY,YEAR,TOTAL_POPULATION
0,ANDERSON,2006,57386
1,ANDREWS,2006,13195
2,ANGELINA,2006,83810
3,ARANSAS,2006,23395
4,ARCHER,2006,9063


In [91]:
pills_percapita.head()

Unnamed: 0,COUNTY,2006,2007,2008,2009,2010,2011,2012,COUNTY_ID
0,ANDERSON,38.495975,37.127527,39.619585,40.215545,41.799238,46.924342,46.026694,48001
1,ANDREWS,18.688897,20.282691,22.710831,22.704609,24.737693,27.015857,29.883324,48003
2,ANGELINA,36.642107,41.089472,47.277692,49.821339,49.702998,54.868091,53.625114,48005
3,ARANSAS,31.395597,40.952443,42.609257,43.57563,47.916056,51.97467,55.189069,48007
4,ARCHER,0.0,0.022158,0.0,0.0,0.0,0.0,0.011352,48009


In [92]:
pills_pc_melted = pd.melt(pills_percapita, id_vars=["COUNTY", "COUNTY_ID"],
                          var_name="YEAR", value_name="PILLS_PC")

In [93]:
pills_pc_melted.head()

Unnamed: 0,COUNTY,COUNTY_ID,YEAR,PILLS_PC
0,ANDERSON,48001,2006,38.495975
1,ANDREWS,48003,2006,18.688897
2,ANGELINA,48005,2006,36.642107
3,ARANSAS,48007,2006,31.395597
4,ARCHER,48009,2006,0.0


In [94]:
dshs_pivot.head()

Unnamed: 0_level_0,2006,2007,2008,2009,2010,2011,2012
COUNTY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ANDERSON,2.5,2.5,2.5,2.5,2.5,2.5,2.5
ANDREWS,0.0,0.0,0.0,2.5,0.0,0.0,0.0
ANGELINA,2.5,2.5,2.5,0.0,2.5,2.5,2.5
ARANSAS,2.5,2.5,2.5,2.5,2.5,2.5,0.0
ARCHER,0.0,0.0,2.5,0.0,2.5,2.5,0.0


In [95]:
dshs_pivot.reset_index(inplace=True)
deaths_melted = pd.melt(dshs_pivot, id_vars=["COUNTY"],
                        var_name="YEAR", value_name="TOTAL_DEATHS")

In [96]:
deaths_melted.head()

Unnamed: 0,COUNTY,YEAR,TOTAL_DEATHS
0,ANDERSON,2006,2.5
1,ANDREWS,2006,0.0
2,ANGELINA,2006,2.5
3,ARANSAS,2006,2.5
4,ARCHER,2006,0.0


In [97]:
deaths_percapita.reset_index(inplace=True)

In [98]:
deaths_percapita.head()

Unnamed: 0,COUNTY,2006,2007,2008,2009,2010,2011,2012
0,ANDERSON,4.4e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05
1,ANDREWS,0.0,0.0,0.0,0.000171,0.0,0.0,0.0
2,ANGELINA,3e-05,3e-05,2.9e-05,0.0,2.9e-05,2.9e-05,2.9e-05
3,ARANSAS,0.000107,0.000108,0.000108,0.000107,0.000108,0.000108,0.0
4,ARCHER,0.0,0.0,0.000275,0.0,0.000274,0.000283,0.0


In [99]:
deaths_pc_melted = pd.melt(deaths_percapita, id_vars=["COUNTY"],
                           var_name="YEAR", value_name="DEATHS_PC")

In [100]:
deaths_pc_melted.head()

Unnamed: 0,COUNTY,YEAR,DEATHS_PC
0,ANDERSON,2006,4.4e-05
1,ANDREWS,2006,0.0
2,ANGELINA,2006,3e-05
3,ARANSAS,2006,0.000107
4,ARCHER,2006,0.0


In [101]:
deaths_per100k.head()

Unnamed: 0_level_0,2006,2007,2008,2009,2010,2011,2012
COUNTY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ANDERSON,4.356463,4.320028,4.313096,4.280089,4.273723,4.281262,4.30552
ANDREWS,0.0,0.0,0.0,17.122115,0.0,0.0,0.0
ANGELINA,2.982938,2.95795,2.942527,0.0,2.876704,2.863852,2.85649
ARANSAS,10.686044,10.788883,10.764263,10.73376,10.784229,10.769363,0.0
ARCHER,0.0,0.0,27.460457,0.0,27.436348,28.299751,0.0
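
The per-100k values above appear consistent with deaths divided by population, times 100,000. The cell that actually computed them is not shown in this part of the notebook, so the following is only a sketch of that arithmetic on two counties, with toy frames built from the outputs above:

```python
import pandas as pd

# Toy frames indexed by county, using the 2006 values shown above.
deaths = pd.DataFrame({"2006": [2.5, 0.0]}, index=["ANDERSON", "ANDREWS"])
pop = pd.DataFrame({"2006": [57386, 13195]}, index=["ANDERSON", "ANDREWS"])

# Element-wise division aligns on both index (county) and columns (year).
per_capita = deaths.div(pop)
per_100k = per_capita * 100_000
print(per_100k.round(6))
```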


In [102]:
deaths_per100k.reset_index(inplace=True)
deaths_p100k_melted = pd.melt(deaths_per100k, id_vars=["COUNTY"],
                              var_name="YEAR", value_name="DEATHS_PER_100K")

In [103]:
deaths_p100k_melted.head()

Unnamed: 0,COUNTY,YEAR,DEATHS_PER_100K
0,ANDERSON,2006,4.356463
1,ANDREWS,2006,0.0
2,ANGELINA,2006,2.982938
3,ARANSAS,2006,10.686044
4,ARCHER,2006,0.0


In [104]:
pop_melted.shape

(1778, 3)

In [105]:
pills_total_melted.shape

(1778, 3)

In [106]:
pills_pc_melted.shape

(1778, 4)

In [107]:
deaths_melted.shape

(1778, 3)

In [108]:
deaths_pc_melted.shape

(1778, 3)

In [109]:
deaths_p100k_melted.shape

(1778, 3)
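
Matching shapes are necessary but not sufficient for stitching these frames together column by column: the rows also have to be in the same order. A possible stricter check compares the key columns row for row. The helper name `keys_aligned` is hypothetical, and the toy frames stand in for the melted tables:

```python
import pandas as pd

def keys_aligned(frames, keys=("COUNTY", "YEAR")):
    """Return True if every frame has identical key columns, row for row."""
    cols = list(keys)
    # Cast to str so that YEAR stored as int and as string compare equal.
    first = frames[0][cols].astype(str).reset_index(drop=True)
    return all(
        first.equals(f[cols].astype(str).reset_index(drop=True))
        for f in frames[1:]
    )

a = pd.DataFrame({"COUNTY": ["ANDERSON", "ANDREWS"],
                  "YEAR": ["2006", "2006"], "X": [1, 2]})
b = pd.DataFrame({"COUNTY": ["ANDERSON", "ANDREWS"],
                  "YEAR": [2006, 2006], "Y": [3, 4]})
c = b.iloc[::-1]  # same rows as b, in reversed order

print(keys_aligned([a, b]))
print(keys_aligned([a, c]))
```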

In [110]:
county_data = pills_pc_melted[["COUNTY", "COUNTY_ID", "YEAR"]].copy()
county_data.head()

Unnamed: 0,COUNTY,COUNTY_ID,YEAR
0,ANDERSON,48001,2006
1,ANDREWS,48003,2006
2,ANGELINA,48005,2006
3,ARANSAS,48007,2006
4,ARCHER,48009,2006


In [111]:
# Column-by-column assignment aligns on the row index, so this assumes every
# melted frame has the same 0..N-1 index and the same (YEAR, COUNTY) row order.
county_data["TOTAL_POPULATION"] = pop_melted["TOTAL_POPULATION"]
county_data["TOTAL_PILLS"] = pills_total_melted["TOTAL_PILLS"]
county_data["PILLS_PER_CAPITA"] = pills_pc_melted["PILLS_PC"]
county_data["TOTAL_OVERDOSE_DEATHS"] = deaths_melted["TOTAL_DEATHS"]
county_data["DEATHS_PER_CAPITA"] = deaths_pc_melted["DEATHS_PC"]
county_data["DEATHS_PER_100K_PEOPLE"] = deaths_p100k_melted["DEATHS_PER_100K"]

In [112]:
county_data.head()

Unnamed: 0,COUNTY,COUNTY_ID,YEAR,TOTAL_POPULATION,TOTAL_PILLS,PILLS_PER_CAPITA,TOTAL_OVERDOSE_DEATHS,DEATHS_PER_CAPITA,DEATHS_PER_100K_PEOPLE
0,ANDERSON,48001,2006,57386,2209130.0,38.495975,2.5,4.4e-05,4.356463
1,ANDREWS,48003,2006,13195,246600.0,18.688897,0.0,0.0,0.0
2,ANGELINA,48005,2006,83810,3070975.0,36.642107,2.5,3e-05,2.982938
3,ARANSAS,48007,2006,23395,734500.0,31.395597,2.5,0.000107,10.686044
4,ARCHER,48009,2006,9063,0.0,0.0,0.0,0.0,0.0
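
Assigning columns directly works here because every melted frame shares the same row order. An alternative that does not depend on row order is to merge on the `COUNTY` and `YEAR` keys. This is a sketch on toy frames, not what the notebook ran:

```python
import pandas as pd

base = pd.DataFrame({
    "COUNTY": ["ANDERSON", "ANDREWS"],
    "COUNTY_ID": [48001, 48003],
    "YEAR": ["2006", "2006"],
})
# Deliberately shuffled to show that row order no longer matters.
pop = pd.DataFrame({
    "COUNTY": ["ANDREWS", "ANDERSON"],
    "YEAR": ["2006", "2006"],
    "TOTAL_POPULATION": [13195, 57386],
})

# validate="one_to_one" raises if a (COUNTY, YEAR) pair is duplicated
# on either side, which would otherwise silently multiply rows.
merged = base.merge(pop, on=["COUNTY", "YEAR"],
                    how="left", validate="one_to_one")
print(merged)
```

A left merge keeps the row order of `base`, so the result can replace the positional version without reordering anything downstream.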


## MORE DATA - Opioid Prescription Rates

Source: the [CDC's U.S. Opioid Prescribing Rate Maps](https://www.cdc.gov/drugoverdose/maps/rxrate-maps.html), which report retail opioid prescriptions dispensed per 100 persons per year.

My copy of the data, pasted from the source above, is available as a [Google sheet](https://docs.google.com/spreadsheets/d/1fJWN3LYSLfiX_vkp4ONo0-dSyhKYmGvoYbmBOrp3PUk/edit?usp=sharing).

In [174]:
prescribe_rate = pd.read_csv("../data/TexasCountyOpioidPrescribingRates(per100people)-From CDC.csv")

In [175]:
prescribe_rate.rename(columns={"County": "COUNTY", 
                               "Prescribing Rate": "PRESCRIBE_RATE", 
                               "Year": "YEAR"}, inplace=True)
prescribe_rate["COUNTY"] = prescribe_rate["COUNTY"].str.upper()

In [176]:
prescribe_rate.head()

Unnamed: 0,COUNTY,PRESCRIBE_RATE,YEAR
0,ANDERSON,105.9,2012
1,ANDREWS,53.2,2012
2,ANGELINA,123.0,2012
3,ARANSAS,124.9,2012
4,ARCHER,,2012


In [177]:
prescribe_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1778 entries, 0 to 1777
Data columns (total 3 columns):
COUNTY            1778 non-null object
PRESCRIBE_RATE    1309 non-null float64
YEAR              1778 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 41.8+ KB
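
The `info()` output shows 1309 non-null prescribing rates out of 1778 rows, so 469 county-years have no rate. One way to see where the gaps fall is to count missing values per year. The toy frame below is illustrative only:

```python
import numpy as np
import pandas as pd

rates = pd.DataFrame({
    "COUNTY": ["ANDERSON", "ARCHER", "ANDERSON", "ARCHER"],
    "PRESCRIBE_RATE": [105.9, np.nan, 95.9, np.nan],
    "YEAR": [2012, 2012, 2006, 2006],
})

# Total number of missing rates.
n_missing = rates["PRESCRIBE_RATE"].isna().sum()

# Missing rates broken down by year: group the boolean mask by YEAR and sum.
missing_by_year = rates["PRESCRIBE_RATE"].isna().groupby(rates["YEAR"]).sum()
print(n_missing)
print(missing_by_year)
```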


In [179]:
prescribe_rate.sort_values(by=["YEAR", "COUNTY"], inplace=True)
prescribe_rate.head()

Unnamed: 0,COUNTY,PRESCRIBE_RATE,YEAR
1524,ANDERSON,95.9,2006
1525,ANDREWS,56.9,2006
1526,ANGELINA,109.5,2006
1527,ARANSAS,84.9,2006
1528,ARCHER,,2006


In [182]:
prescribe_rate.reset_index(drop=True, inplace=True)

In [183]:
prescribe_rate.head()

Unnamed: 0,COUNTY,PRESCRIBE_RATE,YEAR
0,ANDERSON,95.9,2006
1,ANDREWS,56.9,2006
2,ANGELINA,109.5,2006
3,ARANSAS,84.9,2006
4,ARCHER,,2006


## And now, Unemployment Rates

Source: [Bureau of Labor Statistics](https://data.bls.gov/lausmap/showMap.jsp)

My copy of the data, pasted from the source above, is available as a [Google sheet](https://docs.google.com/spreadsheets/d/1BVljj8YRMTuZMQSyuwyd2LP2Y71-q-ryJSmkHgIgUbc/edit?usp=sharing). Each column holds the July figure for that year.

In [156]:
bls_data = pd.read_csv("../data/TexasCountyUnemploymentData-FromBLS.csv")

In [157]:
bls_data.rename(columns={"County": "COUNTY", "July\n2006": "2006",
                         "July\n2007": "2007", "July\n2008": "2008",
                         "July\n2009": "2009", "July\n2010": "2010",
                         "July\n2011": "2011", "July\n2012": "2012"}, 
                inplace=True)
bls_data["COUNTY"] = bls_data["COUNTY"].str.upper()

In [161]:
bls_data["COUNTY"] = bls_data["COUNTY"].str.split(' COUNTY').str[0]

In [162]:
bls_data.head()

Unnamed: 0,COUNTY,2006,2007,2008,2009,2010,2011,2012
0,ANDERSON,6.2,5.2,6.0,9.8,8.3,7.9,7.0
1,ANDREWS,3.8,3.5,3.6,8.9,6.2,5.5,4.5
2,ANGELINA,4.9,5.0,4.9,9.6,8.6,8.5,7.6
3,ARANSAS,4.8,4.2,4.6,7.6,9.3,9.2,7.5
4,ARCHER,3.6,3.5,3.8,6.8,7.1,6.7,6.1


In [163]:
bls_data_melted = pd.melt(bls_data, id_vars=["COUNTY"],
                          var_name="YEAR", value_name="UNEMPLOYMENT")

In [164]:
bls_data_melted.head()

Unnamed: 0,COUNTY,YEAR,UNEMPLOYMENT
0,ANDERSON,2006,6.2
1,ANDREWS,2006,3.8
2,ANGELINA,2006,4.9
3,ARANSAS,2006,4.8
4,ARCHER,2006,3.6


In [184]:
county_data.head()

Unnamed: 0,COUNTY,COUNTY_ID,YEAR,TOTAL_POPULATION,TOTAL_PILLS,PILLS_PER_CAPITA,TOTAL_OVERDOSE_DEATHS,DEATHS_PER_CAPITA,DEATHS_PER_100K_PEOPLE
0,ANDERSON,48001,2006,57386,2209130.0,38.495975,2.5,4.4e-05,4.356463
1,ANDREWS,48003,2006,13195,246600.0,18.688897,0.0,0.0,0.0
2,ANGELINA,48005,2006,83810,3070975.0,36.642107,2.5,3e-05,2.982938
3,ARANSAS,48007,2006,23395,734500.0,31.395597,2.5,0.000107,10.686044
4,ARCHER,48009,2006,9063,0.0,0.0,0.0,0.0,0.0


In [185]:
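# Positional assignment again: this assumes prescribe_rate (sorted by YEAR, then
# COUNTY, with a reset index) and bls_data_melted share county_data's row order.
county_data["PRESCRIPTION_RATE"] = prescribe_rate["PRESCRIBE_RATE"]
county_data["UNEMPLOYMENT"] = bls_data_melted["UNEMPLOYMENT"]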

In [187]:
county_data.head()

Unnamed: 0,COUNTY,COUNTY_ID,YEAR,TOTAL_POPULATION,TOTAL_PILLS,PILLS_PER_CAPITA,TOTAL_OVERDOSE_DEATHS,DEATHS_PER_CAPITA,DEATHS_PER_100K_PEOPLE,PRESCRIPTION_RATE,UNEMPLOYMENT
0,ANDERSON,48001,2006,57386,2209130.0,38.495975,2.5,4.4e-05,4.356463,95.9,6.2
1,ANDREWS,48003,2006,13195,246600.0,18.688897,0.0,0.0,0.0,56.9,3.8
2,ANGELINA,48005,2006,83810,3070975.0,36.642107,2.5,3e-05,2.982938,109.5,4.9
3,ARANSAS,48007,2006,23395,734500.0,31.395597,2.5,0.000107,10.686044,84.9,4.8
4,ARCHER,48009,2006,9063,0.0,0.0,0.0,0.0,0.0,,3.6
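
One thing to watch if these columns are ever joined by key instead: `YEAR` is `int64` in `prescribe_rate` (see the `info()` output above) but comes from renamed column labels in `bls_data_melted`, so it is a string there. A hedged sketch of harmonizing the key before a keyed join, on toy frames:

```python
import pandas as pd

county = pd.DataFrame({"COUNTY": ["ANDERSON", "ARCHER"],
                       "YEAR": ["2006", "2006"]})
rates = pd.DataFrame({"COUNTY": ["ARCHER", "ANDERSON"],
                      "YEAR": [2006, 2006],
                      "PRESCRIBE_RATE": [float("nan"), 95.9]})
bls = pd.DataFrame({"COUNTY": ["ANDERSON", "ARCHER"],
                    "YEAR": ["2006", "2006"],
                    "UNEMPLOYMENT": [6.2, 3.6]})

# Cast YEAR to int everywhere so the join keys have matching dtypes.
for frame in (county, rates, bls):
    frame["YEAR"] = frame["YEAR"].astype(int)

combined = (county
            .merge(rates, on=["COUNTY", "YEAR"], how="left")
            .merge(bls, on=["COUNTY", "YEAR"], how="left"))
print(combined)
```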


In [190]:
# Writing to CSV
# county_data.to_csv(r"../data/TX_County_By_Year.csv", index=False)

In [35]:
# Sanity check
county_test = pd.read_csv("../data/TX_County_By_Year.csv")

In [36]:
county_test.head()

Unnamed: 0,COUNTY,COUNTY_ID,YEAR,TOTAL_POPULATION,TOTAL_PILLS,PILLS_PER_CAPITA,TOTAL_OVERDOSE_DEATHS,DEATHS_PER_CAPITA,DEATHS_PER_100K_PEOPLE,PRESCRIPTION_RATE,UNEMPLOYMENT
0,ANDERSON,48001,2006,57386,2209130.0,38.495975,2.5,4.4e-05,4.356463,95.9,6.2
1,ANDREWS,48003,2006,13195,246600.0,18.688897,0.0,0.0,0.0,56.9,3.8
2,ANGELINA,48005,2006,83810,3070975.0,36.642107,2.5,3e-05,2.982938,109.5,4.9
3,ARANSAS,48007,2006,23395,734500.0,31.395597,2.5,0.000107,10.686044,84.9,4.8
4,ARCHER,48009,2006,9063,0.0,0.0,0.0,0.0,0.0,,3.6
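
Comparing `head()` by eye works, but the round trip can also be checked programmatically. A minimal sketch using `pd.testing.assert_frame_equal`, with an in-memory buffer standing in for the CSV file and a toy frame standing in for `county_data`:

```python
import io

import pandas as pd

original = pd.DataFrame({
    "COUNTY": ["ANDERSON", "ARCHER"],
    "YEAR": [2006, 2006],
    "PRESCRIPTION_RATE": [95.9, float("nan")],
})

# Write without the index, then read the same text back.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, columns, or dtypes differ;
# NaNs in matching positions are treated as equal.
pd.testing.assert_frame_equal(original, reloaded)
```

If `YEAR` is stored as a string before export, it will come back as an integer, so on the real data `check_dtype=False` may be needed.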
