# Stage I - Data Understanding and Linking

## What is the U.S. Opioid Epidemic?

- In the late 1990s, pharmaceutical companies reassured the medical community that patients would not become addicted to opioid pain relievers and healthcare providers began to prescribe them at greater rates.

- Increased prescription of opioid medications led to widespread misuse of both prescription and non-prescription opioids before it became clear that these medications could indeed be highly addictive.

![opioid](https://www.hhs.gov/opioids/sites/default/files/inline-images/opioids-infographic.png)

Source: https://www.hhs.gov/opioids/about-the-epidemic/index.html


We are going to study the patterns that exist between opioid-related deaths and the socio-economic, demographic, geographic, and equity-related variables available for the US population. Our goal is to extract these patterns and understand them further using data science techniques.

In order to achieve that, the project is separated into the following stages:

- Stage I - Data and Project Understanding,
- Stage II - Data Modeling,
- Stage III - Distributions and Hypothesis Testing,
- Stage IV - Dashboard




### Opioid overdose dataset linking

In this stage we utilize multiple publicly available datasets and link them together for analytics. Our goal here is to help identify patterns which contribute to drug overdose deaths. Within this notebook we will explore:

1. **Creating Index** - Developing an index key for linking datasets
2. **Join** - Using join to merge data based on index key.
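
The two steps above can be sketched on a few toy rows. This is a minimal illustration, not the notebook's method: it assumes the county FIPS code is the shared key, and the helper `with_fips_key` and the key column name `fips` are hypothetical. The sample values are copied from the first rows of the datasets shown later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the three datasets (a few rows only).
deaths = pd.DataFrame({"County Code": [1001, 1003], "Deaths": [69, 424]})
rankings = pd.DataFrame({
    "5-digit FIPS Code": [1001, 1003, 1005],
    "Name": ["Autauga County", "Baldwin County", "Barbour County"],
})
dispensing = pd.DataFrame({
    "FIPS": [1001, 1003, 1005],
    "Opiod_Dispensing_Rate": [101.3, 67.6, 27.2],
})


def with_fips_key(df, column):
    """Hypothetical helper: add a zero-padded 5-character string key named 'fips'."""
    out = df.copy()
    out["fips"] = out[column].astype(str).str.zfill(5)
    return out


# Step 1: create the same index key in every dataset.
deaths_k = with_fips_key(deaths, "County Code")
rankings_k = with_fips_key(rankings, "5-digit FIPS Code")
dispensing_k = with_fips_key(dispensing, "FIPS")

# Step 2: join on the key; an inner join keeps only counties present in all three.
linked = (
    deaths_k.merge(rankings_k, on="fips", how="inner")
            .merge(dispensing_k, on="fips", how="inner")
)
print(linked[["fips", "Name", "Deaths", "Opiod_Dispensing_Rate"]])
```

Storing the key as a zero-padded string keeps codes such as `01001` from losing their leading zero when files are read as integers.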

The notebook is viewable via any browser.

#### Software used (open source):

- `python` - https://www.python.org/download/releases/3.0/
- `pandas` - https://pandas.pydata.org/
- `plotly` - https://plotly.com/

### Datasets


#### 1. Drug Overdose Dataset

The overdose death/cause dataset was obtained from CDC Wonder (https://wonder.cdc.gov/ucd-icd10.html). It comes from the Underlying Cause of Death database, which contains mortality and population counts for all U.S. counties. Data are based on death certificates for U.S. residents. Each death certificate identifies a single underlying cause of death and demographic data.

- From this data we obtained the Drug/Alcohol Induced causes data for 2019 across all counties in the US.
- https://wonder.cdc.gov/wonder/help/ucd.html#Drug/Alcohol%20Induced%20Causes
- File: `././../../../data/stage_1/Underlying Cause of Death-County-2019.txt`

#### 2. County Health Rankings

The County Health Rankings provide a snapshot of a community’s health and a starting point for investigating and discussing ways to improve health. The annual Rankings measure vital health factors, including high school graduation rates, obesity, smoking, unemployment, access to healthy foods, the quality of air and water, income inequality, and teen births in nearly every county in America. The dataset provides a snapshot of how health is influenced by where we live, learn, work and play.

- From this data we obtained the measures data for 2019 across all counties in the US.
- https://www.countyhealthrankings.org/
- Data Dictionary - https://www.countyhealthrankings.org/sites/default/files/DataDictionary_2019.pdf
- File: `./../../../data/stage_1/County_Health_Ranking.csv`

#### 3. County Opioid Dispensing Rates

The third dataset in this notebook is the Opioid Dispensing Rate dataset. It gives the geographic distribution of retail opioid prescriptions dispensed per 100 persons per year from 2006–2019. Rates are classified by the Jenks natural breaks classification method into four groups using the 14-year range of data to determine the class breaks.

- We utilize County Opioid Dispensing Rates for 2019.
- https://www.cdc.gov/drugoverdose/maps/rxcounty2019.html
- File: `././../../../data/stage_1/2019-Opioid_Rate.csv`

## Tasks:

#### Task 1: (10 pts)
- **T1.1** Initialize a Github Repository for your project. (10 pts)
    - Add a description (readme.MD) to your project. See here on how to setup: https://bulldogjob.com/news/449-how-to-write-a-good-readme-for-your-github-project

#### Task 2: (50 pts)
- Team:
    - **T2.1** Entire team looks at the datasets and understands the type of variables present in each of the data. (10 pts)
        - **Deliverable** 
            - Create a report of what the project is about, why this is an important area of work (in your own words), and how data science can help.
            - Outline how the datasets can be merged together and the common variables. 
        
- Member: 
    - **M2.1** Study prior research in the area. (20 pts)
        - Read https://link.springer.com/article/10.1007/s40265-017-0846-6
        - Select one other paper to study in relation to the area. Use https://scholar.google.com/ to search for papers which are related to the goal of this project.
        - **Deliverable**
            - Prepare a 1 page summary of what was discovered in these two papers, including significant outcomes, i.e. which variables/determinants are linked to the opioid epidemic.
    - **M2.2** Each student member of the team selects 10 variables they think are important from the available datasets. (20 pts)
        - **Deliverable**
            - Prepare a data dictionary (data and datatype - variable dictionary. https://analystanswers.com/what-is-a-data-dictionary-a-simple-thorough-overview/) of the selected variables
            - Include justification of why you think the variables are important. 

Upload the team and member reports to Canvas and your Github Repository.

#### Task 3: (50 pts)
- Team: (20 pts)
    
    - **T3.1** Create a team notebook to read in the Opioid Mortality data using `pandas` and display the dataframe in a notebook.
    - **T3.2** Normalize the mortality data by population, i.e. number of deaths per 100,000 population. 
    - **T3.3** Identify issues with the data
        - Merge issues, missing values, inconsistent values, etc.
        - Describe solutions to fix it. 
    
- Member: (30 pts)
    - **M3.1** Merge all three datasets to create a super dataframe. (10 pts)
        - Display the super dataframe. Its shape should be (2527, 542).
        - Export it to a csv format.
    - **M3.2** Identify counties and states (top 10) with the highest opioid mortality rates. (20 pts)
        - Use mean and median for counties within states to compare (for the state level).
        - Describe your intuition on why the rates are high in these states and counties.

**Deliverable**
Each member creates separate notebooks for member tasks. Upload all notebooks to Github Repository. 


### Importing pandas package

In [1]:
import pandas as pd

### Reading csv file using pandas

* Reading txt file using read_csv() and adding tab as separator 

In [2]:
cause_of_death = pd.read_csv("../../../data/stage_1/Underlying Cause of Death-County-2019.txt", sep="\t")
cause_of_death

Unnamed: 0,Notes,County,County Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths,Population,Crude Rate
0,,"Autauga County, AL",1001,Drug poisonings (overdose) Unintentional (X40-...,D1,69,1087149,6.3
1,,"Autauga County, AL",1001,Drug poisonings (overdose) Suicide (X60-X64),D2,14,1087149,Unreliable
2,,"Baldwin County, AL",1003,Drug poisonings (overdose) Unintentional (X40-...,D1,424,3758097,11.3
3,,"Baldwin County, AL",1003,Drug poisonings (overdose) Suicide (X60-X64),D2,71,3758097,1.9
4,,"Baldwin County, AL",1003,Drug poisonings (overdose) Undetermined (Y10-Y14),D4,19,3758097,Unreliable
...,...,...,...,...,...,...,...,...
5536,,"Sweetwater County, WY",56037,Drug poisonings (overdose) Suicide (X60-X64),D2,15,873221,Unreliable
5537,,"Teton County, WY",56039,Drug poisonings (overdose) Unintentional (X40-...,D1,13,440125,Unreliable
5538,,"Uinta County, WY",56041,Drug poisonings (overdose) Unintentional (X40-...,D1,51,426347,12.0
5539,,"Uinta County, WY",56041,Drug poisonings (overdose) Undetermined (Y10-Y14),D4,10,426347,Unreliable


In [3]:
health_ranking = pd.read_csv("../../../data/stage_1/County_Health_Ranking.csv")
health_ranking

Unnamed: 0,State FIPS Code,County FIPS Code,5-digit FIPS Code,State Abbreviation,Name,Release Year,County Ranked (Yes=1/No=0),Premature death raw value,Premature death numerator,Premature death denominator,...,Male population 18-44 raw value,Male population 45-64 raw value,Male population 65+ raw value,Total male population raw value,Female population 0-17 raw value,Female population 18-44 raw value,Female population 45-64 raw value,Female population 65+ raw value,Total female population raw value,Population growth raw value
0,1,0,1000,AL,Alabama,2019,,9917.232898,80440.0,13636816.0,...,,,,,,,,,,
1,1,1,1001,AL,Autauga County,2019,1.0,8824.057123,815.0,156132.0,...,,,,,,,,,,
2,1,3,1003,AL,Baldwin County,2019,1.0,7224.632160,2827.0,576496.0,...,,,,,,,,,,
3,1,5,1005,AL,Barbour County,2019,1.0,9586.165037,451.0,72222.0,...,,,,,,,,,,
4,1,7,1007,AL,Bibb County,2019,1.0,11783.543680,445.0,63653.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,56,37,56037,WY,Sweetwater County,2019,1.0,7497.439952,495.0,127427.0,...,,,,,,,,,,
3189,56,39,56039,WY,Teton County,2019,1.0,3786.128226,124.0,66351.0,...,,,,,,,,,,
3190,56,41,56041,WY,Uinta County,2019,1.0,7790.302043,262.0,59466.0,...,,,,,,,,,,
3191,56,43,56043,WY,Washakie County,2019,1.0,5504.650970,108.0,22335.0,...,,,,,,,,,,


* Reading the file using read_csv() with encoding set to 'unicode_escape' to avoid encoding issues

In [4]:
opioid_rate = pd.read_csv("../../../data/stage_1/2019-Opioid_Rate.csv",encoding='unicode_escape')
opioid_rate

Unnamed: 0,State,County,FIPS,Opiod_Dispensing_Rate
0,AL,Autauga County,1001,101.3
1,AL,Baldwin County,1003,67.6
2,AL,Barbour County,1005,27.2
3,AL,Bibb County,1007,21.0
4,AL,Blount County,1009,23.7
...,...,...,...,...
3090,WY,Sweetwater County,56037,70.0
3091,WY,Teton County,56039,54.6
3092,WY,Uinta County,56041,59.5
3093,WY,Washakie County,56043,46.7


* Displaying columns of all dataframes

In [5]:
cause_of_death.columns

Index(['Notes', 'County', 'County Code', 'Drug/Alcohol Induced Cause',
       'Drug/Alcohol Induced Cause Code', 'Deaths', 'Population',
       'Crude Rate'],
      dtype='object')

In [6]:
health_ranking.columns

Index(['State FIPS Code', 'County FIPS Code', '5-digit FIPS Code',
       'State Abbreviation', 'Name', 'Release Year',
       'County Ranked (Yes=1/No=0)', 'Premature death raw value',
       'Premature death numerator', 'Premature death denominator',
       ...
       'Male population 18-44 raw value', 'Male population 45-64 raw value',
       'Male population 65+ raw value', 'Total male population raw value',
       'Female population 0-17 raw value', 'Female population 18-44 raw value',
       'Female population 45-64 raw value', 'Female population 65+ raw value',
       'Total female population raw value', 'Population growth raw value'],
      dtype='object', length=534)

In [7]:
opioid_rate.columns

Index(['State', 'County', 'FIPS', 'Opiod_Dispensing_Rate'], dtype='object')

### Normalize Deaths data by population per 100,000

* Normalizing deaths data by population per 100000

In [8]:
cause_of_death["Norm_Deaths"] = (cause_of_death["Deaths"]/cause_of_death["Population"])*100000
cause_of_death

Unnamed: 0,Notes,County,County Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths,Population,Crude Rate,Norm_Deaths
0,,"Autauga County, AL",1001,Drug poisonings (overdose) Unintentional (X40-...,D1,69,1087149,6.3,6.346876
1,,"Autauga County, AL",1001,Drug poisonings (overdose) Suicide (X60-X64),D2,14,1087149,Unreliable,1.287772
2,,"Baldwin County, AL",1003,Drug poisonings (overdose) Unintentional (X40-...,D1,424,3758097,11.3,11.282306
3,,"Baldwin County, AL",1003,Drug poisonings (overdose) Suicide (X60-X64),D2,71,3758097,1.9,1.889254
4,,"Baldwin County, AL",1003,Drug poisonings (overdose) Undetermined (Y10-Y14),D4,19,3758097,Unreliable,0.505575
...,...,...,...,...,...,...,...,...,...
5536,,"Sweetwater County, WY",56037,Drug poisonings (overdose) Suicide (X60-X64),D2,15,873221,Unreliable,1.717778
5537,,"Teton County, WY",56039,Drug poisonings (overdose) Unintentional (X40-...,D1,13,440125,Unreliable,2.953706
5538,,"Uinta County, WY",56041,Drug poisonings (overdose) Unintentional (X40-...,D1,51,426347,12.0,11.962087
5539,,"Uinta County, WY",56041,Drug poisonings (overdose) Undetermined (Y10-Y14),D4,10,426347,Unreliable,2.345507
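
The table above holds one row per county and cause, so a county-level or state-level comparison needs an aggregation step first. The following is a rough sketch under stated assumptions rather than the notebook's own method: it sums deaths across causes per county, takes the state from the text after the last comma in `County`, and compares states by mean and median. The frame names `county_rate` and `state_summary` and the `State` column are hypothetical. The sample rows are copied from the output above.

```python
import pandas as pd

# A few rows copied from the output above (one row per county and cause).
sample = pd.DataFrame({
    "County": ["Autauga County, AL", "Autauga County, AL",
               "Baldwin County, AL", "Uinta County, WY"],
    "County Code": [1001, 1001, 1003, 56041],
    "Deaths": [69, 14, 424, 51],
    "Population": [1087149, 1087149, 3758097, 426347],
})

# Collapse causes: total deaths per county; population repeats on every row.
county_rate = (
    sample.groupby(["County Code", "County"], as_index=False)
          .agg(Deaths=("Deaths", "sum"), Population=("Population", "first"))
)
county_rate["Norm_Deaths"] = (
    county_rate["Deaths"] / county_rate["Population"] * 100_000
)

# Assumption: the state abbreviation is the text after the last ", ".
county_rate["State"] = county_rate["County"].str.rsplit(", ", n=1).str[-1]

# State-level comparison using the mean and median of county rates.
state_summary = county_rate.groupby("State")["Norm_Deaths"].agg(["mean", "median"])

# Counties with the highest rates (top 10, or fewer if the frame is smaller).
top_counties = county_rate.nlargest(10, "Norm_Deaths")
print(state_summary)
print(top_counties[["County", "Norm_Deaths"]])
```

Recomputing the rate from summed deaths avoids averaging per-cause rates, which would understate a county's total.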


### **T3.3** Identify issues with the data
- Merge issues, missing values, inconsistent values, etc.
- Describe solutions to fix it. 

#### To find null values/missing values in the data, we are using isnull() function on each column.
#### Then summing up data gives total count of missing values for each column

#### For inconsistent values, we used min() and max().

In [9]:
cause_of_death.isnull().sum()

Notes                              5541
County                                0
County Code                           0
Drug/Alcohol Induced Cause            0
Drug/Alcohol Induced Cause Code       0
Deaths                                0
Population                            0
Crude Rate                            0
Norm_Deaths                           0
dtype: int64

#### Missing values:
* Notes column

In [10]:
cause_of_death.min(), cause_of_death.max()

(Notes                                                        NaN
 County                                      Abbeville County, SC
 County Code                                                 1001
 Drug/Alcohol Induced Cause         All other drug-induced causes
 Drug/Alcohol Induced Cause Code                               D1
 Deaths                                                        10
 Population                                                 68113
 Crude Rate                                                   0.0
 Norm_Deaths                                             0.018188
 dtype: object,
 Notes                                                                            NaN
 County                                                             Zavala County, TX
 County Code                                                                    56043
 Drug/Alcohol Induced Cause         Drug poisonings (overdose) Unintentional (X40-...
 Drug/Alcohol Induced Cause Code              

In [11]:
cause_of_death["Crude Rate"].min(),cause_of_death["Crude Rate"].max()

('0.0', 'Unreliable')

#### Inconsistent values:
#### Crude Rate
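* The Crude Rate column mixes numeric values with the string "Unreliable" in some rows, so the whole column is stored as strings (which is why min() and max() return '0.0' and 'Unreliable'). The other columns have no inconsistent values, and only Notes has missing data.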
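
One possible way to handle the "Unreliable" entries, shown here as a sketch rather than the notebook's chosen fix, is to record which rows were flagged and then convert the column to numbers with `pd.to_numeric(..., errors="coerce")`, which turns non-numeric strings into `NaN`. The variable names `is_unreliable` and `crude_numeric` are illustrative, and the sample values are taken from the rows shown earlier.

```python
import pandas as pd

# String-typed column as read from the file (values from the rows shown earlier).
crude = pd.Series(["6.3", "Unreliable", "11.3", "1.9"], name="Crude Rate")

# Keep a flag so the information that a rate was unreliable is not lost.
is_unreliable = crude.eq("Unreliable")

# Convert to float; anything that is not a number becomes NaN.
crude_numeric = pd.to_numeric(crude, errors="coerce")

print(crude_numeric.dtype)
print(crude_numeric.min(), crude_numeric.max())
```

After the conversion, min() and max() compare numbers rather than strings.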

In [12]:
# You can run the commented code of dict() to see the full data
# dict(health_ranking.isnull().sum())
health_ranking.isnull().sum()

State FIPS Code                         0
County FIPS Code                        0
5-digit FIPS Code                       0
State Abbreviation                      0
Name                                    0
                                     ... 
Female population 18-44 raw value    3120
Female population 45-64 raw value    3120
Female population 65+ raw value      3120
Total female population raw value    3120
Population growth raw value          3120
Length: 534, dtype: int64

In [13]:
# You can run the commented code of dict() to see the full data
# dict(health_ranking.min())
health_ranking.min()

State FIPS Code                                     1
County FIPS Code                                    0
5-digit FIPS Code                                1000
State Abbreviation                                 AK
Name                                 Abbeville County
                                           ...       
Female population 18-44 raw value               483.0
Female population 45-64 raw value               581.0
Female population 65+ raw value                 341.0
Total female population raw value              2157.0
Population growth raw value                 -0.021724
Length: 534, dtype: object

In [14]:
# You can run the commented code of dict() to see the full data
# dict(health_ranking.max())
health_ranking.max()

State FIPS Code                                  56
County FIPS Code                                840
5-digit FIPS Code                             56045
State Abbreviation                               WY
Name                                 Ziebach County
                                          ...      
Female population 18-44 raw value          969103.0
Female population 45-64 raw value          789101.0
Female population 65+ raw value            519498.0
Total female population raw value         2904358.0
Population growth raw value                0.069832
Length: 534, dtype: object

#### Missing values:
* The health ranking dataset has many missing values, and this can have a significant effect on the conclusions that can be drawn from the data.
#### Inconsistent values:
* A few columns have inconsistent data, e.g. 'Population growth raw value' and 'Ratio of population to primary care providers other than physicians' have negative values.

##### Missing data can bias estimates of parameters such as mean() and median(), which in turn may complicate the analysis. Each of these distortions may threaten the validity of the study and can lead to invalid conclusions.

In [15]:
opioid_rate.isnull().sum()

State                    0
County                   0
FIPS                     0
Opiod_Dispensing_Rate    0
dtype: int64

In [16]:
opioid_rate.min(),opioid_rate.max()

(State                                  AK
 County                   Abbeville County
 FIPS                                 1001
 Opiod_Dispensing_Rate                 0.0
 dtype: object,
 State                               WY
 County                   Zavala County
 FIPS                             56045
 Opiod_Dispensing_Rate            567.9
 dtype: object)

#### There are no missing values in opioid_rate dataframe. From the min() and max() values we can see that there are no inconsistent values.

### Describe solutions to fix it.
* If a whole column is empty or NULL, we can drop it.
* If a column has only a few missing values, they can be replaced by the mean or a rolling average, which avoids distorting the summary statistics used in the analysis.
* We can use fillna() to fill missing entries with any relevant value for the column.
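
These fixes can be sketched on a small made-up frame. The values below are hypothetical and only illustrate the calls. The sketch drops columns that are entirely null with `dropna(axis="columns", how="all")` and fills the remaining numeric gaps with each column's mean through `fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one empty column, one numeric column with a gap.
toy = pd.DataFrame({
    "Notes": [np.nan, np.nan, np.nan],
    "rate": [10.0, np.nan, 20.0],
    "name": ["a", "b", "c"],
})

# Drop columns in which every value is missing.
cleaned = toy.dropna(axis="columns", how="all")

# Fill missing numeric values with the column mean; fillna() matches the
# Series of means to columns by label, so non-numeric columns are untouched.
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned = cleaned.fillna(cleaned[numeric_cols].mean())

print(cleaned)
```

Mean imputation preserves the column mean but shrinks its variance, so it is best reserved for columns with only a few gaps.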