# Opioid Crisis - Analysis

I want to take a second look at the data from the Opioid Crisis datasheet.

- (link here: https://www.mathmodels.org/Problems/2019/MCM-C/index.html)

Motivation: I worked on this problem for two days in 2019 and am still interested in it.

Data we will work with:
* Drug identification counts in years 2010-2016
* Socio-economic factors collected for five states (Ohio, Kentucky, West Virginia, Virginia, Pennsylvania)

In [None]:
import numpy as np
import pandas as pd

# drug use data.
df_nflis = pd.read_excel('2018_MCMProblemC_DATA/MCM_NFLIS_Data.xlsx', sheet_name="Data")

# socio-economic data.
df10 = pd.read_csv('2018_MCMProblemC_DATA/ACS_10_5YR_DP02/ACS_10_5YR_DP02_with_ann.csv')
df11 = pd.read_csv('2018_MCMProblemC_DATA/ACS_11_5YR_DP02/ACS_11_5YR_DP02_with_ann.csv')
df12 = pd.read_csv('2018_MCMProblemC_DATA/ACS_12_5YR_DP02/ACS_12_5YR_DP02_with_ann.csv')
df13 = pd.read_csv('2018_MCMProblemC_DATA/ACS_13_5YR_DP02/ACS_13_5YR_DP02_with_ann.csv')
df14 = pd.read_csv('2018_MCMProblemC_DATA/ACS_14_5YR_DP02/ACS_14_5YR_DP02_with_ann.csv')
df15 = pd.read_csv('2018_MCMProblemC_DATA/ACS_15_5YR_DP02/ACS_15_5YR_DP02_with_ann.csv')
df16 = pd.read_csv('2018_MCMProblemC_DATA/ACS_16_5YR_DP02/ACS_16_5YR_DP02_with_ann.csv')

# indexing data.
df10_meta = pd.read_csv('2018_MCMProblemC_DATA/ACS_10_5YR_DP02/ACS_10_5YR_DP02_metadata.csv')
df11_meta = pd.read_csv('2018_MCMProblemC_DATA/ACS_11_5YR_DP02/ACS_11_5YR_DP02_metadata.csv')
df12_meta = pd.read_csv('2018_MCMProblemC_DATA/ACS_12_5YR_DP02/ACS_12_5YR_DP02_metadata.csv')
df13_meta = pd.read_csv('2018_MCMProblemC_DATA/ACS_13_5YR_DP02/ACS_13_5YR_DP02_metadata.csv')
df14_meta = pd.read_csv('2018_MCMProblemC_DATA/ACS_14_5YR_DP02/ACS_14_5YR_DP02_metadata.csv')
df15_meta = pd.read_csv('2018_MCMProblemC_DATA/ACS_15_5YR_DP02/ACS_15_5YR_DP02_metadata.csv')
df16_meta = pd.read_csv('2018_MCMProblemC_DATA/ACS_16_5YR_DP02/ACS_16_5YR_DP02_metadata.csv')

Parts of the data are not available.

## Preprocessing

General plan: iterate over the socio-economic data and append the relevant drug use data to each row.

Feature extraction part:
* Include geography (specifically `GEO.display-label`).
* Exclude margin of error.
* Exclude columns with `(X)`.
* Exclude non-universal data.

In [None]:
df10

The function `feature_extract` selects the features that satisfy the above conditions (except universality, which is handled in the next section).

In [None]:
from opioid_crisis_lib import feature_extract
df10[feature_extract(df10, df10_meta)]
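To make the selection rules concrete, here is a minimal sketch of a column filter. It is not the actual `feature_extract` from `opioid_crisis_lib`. The helper `select_estimate_columns`, the column codes `EST_A`, `MOE_A`, `EST_B`, and the toy values are all hypothetical, and it assumes the metadata holds a column code in its first column and a description in its second.

```python
import pandas as pd

def select_estimate_columns(df, meta):
    """Hypothetical stand-in for feature_extract: keep the geography label
    and the estimate columns; drop margins of error and '(X)' columns."""
    # Assumed metadata layout: first column = code, second = description.
    descriptions = dict(zip(meta.iloc[:, 0], meta.iloc[:, 1]))
    keep = []
    for col in df.columns:
        desc = str(descriptions.get(col, ""))
        if col == "GEO.display-label":
            keep.append(col)
        elif desc.startswith("Estimate;"):
            # Skip columns in which any entry is the placeholder '(X)'.
            if not (df[col].astype(str) == "(X)").any():
                keep.append(col)
    return keep

# Toy data with made-up codes and values, only to exercise the filter.
toy = pd.DataFrame({
    "GEO.display-label": ["Adair County, Kentucky", "Adams County, Ohio"],
    "EST_A": [7000, 11000],
    "MOE_A": [250, 300],
    "EST_B": ["(X)", "(X)"],
})
toy_meta = pd.DataFrame({
    "code": ["GEO.display-label", "EST_A", "MOE_A", "EST_B"],
    "description": [
        "Geography",
        "Estimate; ANCESTRY - Total population",
        "Margin of Error; ANCESTRY - Total population",
        "Estimate; TOY - Not applicable",
    ],
})

selected = select_estimate_columns(toy, toy_meta)
print(toy[selected])
```

Returning a list of column names rather than a filtered frame mirrors how `feature_extract` is used above, as an indexer into `df10`.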

### Filtering Data with Universal Property

As we see, only the "important" data remain. We have removed the margin-of-error columns as well as the inadmissible data.

In [None]:
from opioid_crisis_lib import feature_index

ddf = [df10, df11, df12, df13, df14, df15, df16]
ddf_metadata = [df10_meta,
                df11_meta,
                df12_meta,
                df13_meta,
                df14_meta,
                df15_meta,
                df16_meta]

f_index = feature_index(ddf, ddf_metadata)

sorted(f_index)

In [None]:
f_index['Estimate; ANCESTRY - Total population']

We obtain a map from descriptors to corresponding labels. This allows us to go back and forth between description and label.
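As a rough illustration of such a map, the sketch below keeps only the descriptors that appear in every year and records the column label each one has per year. Reading "universal" as "present in all years" is my assumption, and `universal_descriptors` is a hypothetical helper, not the real `feature_index`. The column codes are made up.

```python
import pandas as pd

def universal_descriptors(meta_by_year):
    """Map each descriptor found in every year to its per-year column code.

    Assumes each metadata frame has the code in its first column and the
    description in its second column.
    """
    per_year = {
        year: dict(zip(meta.iloc[:, 1], meta.iloc[:, 0]))
        for year, meta in meta_by_year.items()
    }
    # Keep only descriptors that occur in all years.
    common = set.intersection(*(set(m) for m in per_year.values()))
    return {
        desc: {year: per_year[year][desc] for year in per_year}
        for desc in sorted(common)
    }

# Toy metadata: one descriptor is shared, one exists only in 2010.
toy_meta_2010 = pd.DataFrame({
    "code": ["EST_A", "EST_B"],
    "description": ["Estimate; ANCESTRY - Total population",
                    "Estimate; TOY - Only in 2010"],
})
toy_meta_2011 = pd.DataFrame({
    "code": ["EST_A2"],
    "description": ["Estimate; ANCESTRY - Total population"],
})

toy_index = universal_descriptors({"2010": toy_meta_2010,
                                   "2011": toy_meta_2011})
print(toy_index)
```

Keying on the description rather than the code guards against a code changing meaning between survey years, at the cost of missing descriptors whose wording changed.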

The above data corresponds to the socio-economic data of a particular county in a specified year. When we go over the drug use data, the `YYYY`, `State` and `COUNTY` fields should suffice to retrieve the matching socio-economic data.

In [None]:
from opioid_crisis_lib import state_and_county
state_and_county("Adair County, Kentucky")
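For reference, a label of this form can be split with plain string handling. The sketch below is a hypothetical `split_geo_label`, not the library's `state_and_county`. The lowercase `(state, county)` return format is an assumption chosen to match the arguments passed to `locate` further down.

```python
# USPS abbreviations for the five states in the dataset.
STATE_ABBREV = {
    "Ohio": "oh",
    "Kentucky": "ky",
    "West Virginia": "wv",
    "Virginia": "va",
    "Pennsylvania": "pa",
}

def split_geo_label(label):
    """Split 'Adair County, Kentucky' into ('ky', 'adair').

    Assumes the label has the form '<name> County, <state>'. Labels without
    the ' County' suffix keep their full name.
    """
    county_part, state_name = label.rsplit(", ", 1)
    suffix = " County"
    if county_part.endswith(suffix):
        county_part = county_part[: -len(suffix)]
    return STATE_ABBREV[state_name], county_part.lower()

print(split_geo_label("Adair County, Kentucky"))
```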

### Retrieving Geographic Data

In addition, for each county there should be a method to retrieve numerical geographic data:

In [None]:
df_geo = pd.read_csv('2021_Gaz_counties_national.txt', sep="\t")
from opioid_crisis_lib import locate
locate("ky", "adair", df_geo)

In [None]:
np.array(df_geo[["USPS", "NAME"]])
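A lookup of this kind can be sketched as a boolean mask over the `USPS` and `NAME` columns. The helper `find_coordinates` is hypothetical and is not the library's `locate`. The coordinate column names `INTPTLAT` and `INTPTLONG` are assumed from the Gazetteer layout, and the toy coordinates are made up. Header names are stripped first as a precaution against stray whitespace.

```python
import pandas as pd

def find_coordinates(state, county, gazetteer):
    """Return (latitude, longitude) for a county, or None if not found."""
    # Strip whitespace from header names before selecting columns.
    gaz = gazetteer.rename(columns=str.strip)
    mask = (
        (gaz["USPS"].str.lower() == state.lower())
        & (gaz["NAME"].str.lower() == f"{county.lower()} county")
    )
    rows = gaz.loc[mask, ["INTPTLAT", "INTPTLONG"]]
    if rows.empty:
        return None
    lat, lon = rows.iloc[0]
    return float(lat), float(lon)

# Toy gazetteer with made-up coordinates; note the padded last header.
toy_geo = pd.DataFrame({
    "USPS": ["KY", "OH"],
    "NAME": ["Adair County", "Adams County"],
    "INTPTLAT": [37.0, 38.0],
    "INTPTLONG   ": [-85.0, -83.0],
})

print(find_coordinates("ky", "adair", toy_geo))
```

Returning `None` for a missing county makes gaps explicit, so the caller can decide whether to skip the row or fail.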

### Processing Drug Data

In [None]:
df_nflis[["YYYY", "State", "COUNTY", "SubstanceName", "DrugReports"]]

Of importance is the type of drugs reported.

In [None]:
df_nflis["SubstanceName"]

The following gives a survey of distinct drug types (indexed).

In [None]:
substanceNames = sorted(set(df_nflis["SubstanceName"]))
substanceNames

In [None]:
print(len(substanceNames))

Convert this to a dictionary so we can map each substance name to a particular index.

In [None]:
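# Use enumerate rather than a hard-coded count, so this works for any number of substances.
substanceNamesDict = {name: i for i, name in enumerate(substanceNames)}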

Given the `DrugReports` and `SubstanceName` columns, we can construct a vector that indicates the extent of each drug's use in a county.

In [None]:
from opioid_crisis_lib import drug_matrix
drug_matrix(df_nflis, substanceNamesDict)

A single county may occur in multiple rows, so we combine those rows into one drug vector per county and year.

In [None]:
from opioid_crisis_lib import drug_vector
drug_vector("2010", "oh", "adams", df_nflis, substanceNamesDict, identify=True)
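One way to collapse repeated rows into a single vector per county and year is a pivot table that sums `DrugReports` per substance. This is a sketch on toy data with made-up counts, not the implementation of `drug_matrix` or `drug_vector`. Summing duplicate entries is my assumption about how repeats should be combined.

```python
import pandas as pd

# Toy data with the same column names as df_nflis; counts are made up.
toy_nflis = pd.DataFrame({
    "YYYY": [2010, 2010, 2010, 2011],
    "State": ["OH", "OH", "OH", "OH"],
    "COUNTY": ["ADAMS", "ADAMS", "ADAMS", "ADAMS"],
    "SubstanceName": ["Heroin", "Heroin", "Oxycodone", "Heroin"],
    "DrugReports": [3, 2, 4, 1],
})

# Fixed column order, so every vector uses the same substance indices.
substances = sorted(set(toy_nflis["SubstanceName"]))

vectors = (
    toy_nflis
    .pivot_table(index=["YYYY", "State", "COUNTY"],
                 columns="SubstanceName",
                 values="DrugReports",
                 aggfunc="sum",
                 fill_value=0)
    .reindex(columns=substances, fill_value=0)
)

# One row per (year, state, county); one column per substance.
print(vectors.loc[(2010, "OH", "ADAMS")].tolist())
```

The `reindex` call keeps the column order aligned with the sorted substance list, which is the same ordering `substanceNamesDict` encodes.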

### Compiling Data

The plan:
1. Determine the overall dimension of the sample matrix. (doable)
2. Iterate through the rows of the socio-economic dataframes for the years 2010-2016.
    1. For each row, gather the socio-economic data and the fields that identify the state, county and year.
    1. Retrieve the geographic location of the county and append it.
    1. Retrieve the drug vector of the county and append it.
3. Write all appended data into one numpy array.
4. (Optional) Move the independent columns (geographic location, socio-economic data) to the beginning and the drug vector to the end.

Make sure the data is recoverable and that each row has a unique identifier indicating its year, state and county.
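The assembly loop can be sketched on toy inputs as follows. The lookups `socio`, `coords` and `drugs` are hypothetical dictionaries standing in for the real dataframes, all numbers are made up, and filling a missing drug vector with zeros is an assumption that may not suit the real data.

```python
import numpy as np

# Toy lookups keyed by (year, state, county) or (state, county).
socio = {
    ("2010", "oh", "adams"): [11000.0],
    ("2010", "ky", "adair"): [7000.0],
}
coords = {
    ("oh", "adams"): (38.0, -83.0),
    ("ky", "adair"): (37.0, -85.0),
}
drugs = {
    ("2010", "oh", "adams"): [5.0, 4.0],
}
n_substances = 2

row_ids, rows = [], []
for key in sorted(socio):
    year, state, county = key
    # Assumption: a county without drug reports gets an all-zero vector.
    drug_vec = drugs.get(key, [0.0] * n_substances)
    # Independent columns first (location, socio-economic), drug vector last.
    rows.append([*coords[(state, county)], *socio[key], *drug_vec])
    row_ids.append(key)

samples = np.array(rows)
print(row_ids)
print(samples)
```

Keeping `row_ids` in a separate list leaves the array purely numeric while each row can still be traced back to its year, state and county.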

Index dataframes with their corresponding year.

In [None]:
ddf_yyyy = {
    "2010": df10,
    "2011": df11,
    "2012": df12,
    "2013": df13,
    "2014": df14,
    "2015": df15,
    "2016": df16,
}

ddf_metadata_yyyy = {
    "2010": df10_meta,
    "2011": df11_meta,
    "2012": df12_meta,
    "2013": df13_meta,
    "2014": df14_meta,
    "2015": df15_meta,
    "2016": df16_meta,
}