## TITLE

* description of what the notebook does

### The Data

* link to source for data and description from site

In [3]:
import warnings
warnings.simplefilter('ignore')
import matplotlib.pyplot as plt
import pandas as pd
import datetime

In [4]:
accidental_drug_deaths_df = pd.read_csv('../data/accidental_drug_deaths_CT.csv')

In [5]:
accidental_drug_deaths_df

Unnamed: 0,ID,Date,Date Type,Age,Sex,Race,Residence City,Residence County,Residence State,Death City,...,Morphine (Not Heroin),Hydromorphone,Xylazine,Other,Opiate NOS,Any Opioid,Manner of Death,DeathCityGeo,ResidenceCityGeo,InjuryCityGeo
0,12-0187,07/17/2012,DateofDeath,34.0,Female,White,MAHOPAC,PUTNAM,,DANBURY,...,,,,Duster,,,Accident,"DANBURY, CT\n(41.393666, -73.451539)",,"CT\n(41.575155, -72.738288)"
1,12-0258,10/01/2012,DateofDeath,51.0,Male,White,PORTLAND,MIDDLESEX,,PORTLAND,...,,,,,,,Accident,"PORTLAND, CT\n(41.581345, -72.634112)","PORTLAND, CT\n(41.581345, -72.634112)","CT\n(41.575155, -72.738288)"
2,13-0146,04/28/2013,DateofDeath,28.0,Male,White,,,,HARTFORD,...,,,,,,,Accident,"HARTFORD, CT\n(41.765775, -72.673356)","CT\n(41.575155, -72.738288)","CT\n(41.575155, -72.738288)"
3,14-0150,04/06/2014,DateofDeath,46.0,Male,White,WATERBURY,,,TORRINGTON,...,,,,,,,Accident,"TORRINGTON, CT\n(41.812186, -73.101552)","WATERBURY, CT\n(41.554261, -73.043069)","CT\n(41.575155, -72.738288)"
4,14-0183,04/27/2014,DateofDeath,52.0,Male,White,NEW LONDON,,,NEW LONDON,...,,,,,,,Accident,"NEW LONDON, CT\n(41.355167, -72.099561)","NEW LONDON, CT\n(41.355167, -72.099561)","CT\n(41.575155, -72.738288)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7674,14-0128,03/20/2014,DateofDeath,25.0,Male,White,MILFORD,,,WETHERSFIELD,...,,,,,,,Accident,"WETHERSFIELD, CT\n(41.712487, -72.663607)","MILFORD, CT\n(41.224276, -73.057564)","CT\n(41.575155, -72.738288)"
7675,20-1217,11/19/2020,DateofDeath,62.0,Female,White,STAMFORD,FAIRFIELD,CT,STAMFORD,...,,,,,,Y,Accident,"Stamford, CT\n(41.051924, -73.539475)","STAMFORD, CT\n(41.051924, -73.539475)","STAMFORD, CT\n(41.051924, -73.539475)"
7676,20-1138,10/31/2020,DateofDeath,50.0,Female,White,NEW BRITAIN,HARTFORD,CT,NEW BRITAIN,...,,,Y,,,Y,Accident,"New Britain, CT\n(41.667528, -72.783437)","NEW BRITAIN, CT\n(41.667528, -72.783437)","NEW BRITAIN, CT\n(41.667528, -72.783437)"
7677,16-0640,09/17/2016,DateofDeath,36.0,Male,White,SHELTON,FAIRFIELD,CT,SHELTON,...,,,,,,Y,Accident,"SHELTON, CT\n(41.316843, -73.092968)","SHELTON, CT\n(41.316843, -73.092968)","SHELTON, CT\n(41.316843, -73.092968)"


## Data Frame Examination

### What are the dimensions (number of rows and columns) of the data frame?

In [6]:
accidental_drug_deaths_df.shape

(7679, 42)

There are 7679 rows and 42 columns. 

### What does each row represent?

Each row represents one accidental drug death in Connecticut (CT), recorded with an ID and a date.

### What do the columns mean? 

* ID: an ID number for each person
* Date: the date of the event (see Date type)
* Date type: whether the date is the date of death or the date reported
* Age: age in years
* Sex: sex (male or female)
* Race: race
* Residence city: the city the person lived in
* Residence county: the county the person lived in
* Residence state: the state the person lived in (note that some of the states are not CT, so we should filter those out)
* Death city: the city the person died in
* Location: where the person died
* Description of injury: describes how the drug was taken (injected, huffed, etc.)
* Injury place: where the drug was used
* Injury city: the city where the drug was used
* Cause of death: the exact cause of death (string)
* Then there is a group of columns, one per drug. A 'Y' is placed in the column of each drug that was involved. We want to look at heroin, fentanyl, fentanyl analogue, oxycodone, oxymorphone, hydrocodone, methadone, morphine, and hydromorphone.
* Any opioid: this column will be extremely helpful. A 'Y' in it means the person overdosed on an opioid.
* Manner of death: states whether the death was an accident
* The last three columns (DeathCityGeo, ResidenceCityGeo, InjuryCityGeo) give geolocations, so we can use GeoJSON to create a map of where the opioid overdoses occurred.
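The geolocation columns store a place label and a coordinate pair in one string, as in `"DANBURY, CT\n(41.393666, -73.451539)"`. Before mapping, the coordinates have to be pulled out. The following is a minimal sketch of one way to do that, assuming every non-missing value follows the `label (latitude, longitude)` layout seen in the preview above; `parse_geo` and `GEO_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Pattern for values such as "DANBURY, CT\n(41.393666, -73.451539)":
# an optional place label followed by "(latitude, longitude)".
GEO_PATTERN = re.compile(
    r"^(?P<label>.*?)\s*\(\s*(?P<lat>-?\d+(?:\.\d+)?)\s*,"
    r"\s*(?P<lon>-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.DOTALL,
)


def parse_geo(value):
    """Return (label, latitude, longitude), or None if the value cannot be parsed."""
    if not isinstance(value, str):
        return None  # missing cells arrive as NaN (a float), not as a string
    match = GEO_PATTERN.match(value)
    if match is None:
        return None
    return match["label"].strip(), float(match["lat"]), float(match["lon"])


print(parse_geo("DANBURY, CT\n(41.393666, -73.451539)"))
# ('DANBURY, CT', 41.393666, -73.451539)
```

On the real data frame this could be applied column-wise, for example with `accidental_drug_deaths_df['DeathCityGeo'].map(parse_geo)`. Note that many rows carry only the state-level placeholder `"CT\n(41.575155, -72.738288)"`, which parses fine but points at the centre of the state rather than a city.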

## Data Cleaning

### Rename Columns

First, let's look at the current names of the columns.

In [7]:
accidental_drug_deaths_df.columns

Index(['ID', 'Date', 'Date Type', 'Age', 'Sex', 'Race', 'Residence City',
       'Residence County', 'Residence State', 'Death City', 'Death County',
       'Location', 'Location if Other', 'Description of Injury',
       'Injury Place', 'Injury City', 'Injury County', 'Injury State',
       'Cause of Death', 'Other Significant Conditions ', 'Heroin', 'Cocaine',
       'Fentanyl', 'Fentanyl Analogue', 'Oxycodone', 'Oxymorphone', 'Ethanol',
       'Hydrocodone', 'Benzodiazepine', 'Methadone', 'Amphet', 'Tramad',
       'Morphine (Not Heroin)', 'Hydromorphone', 'Xylazine', 'Other',
       'Opiate NOS', 'Any Opioid', 'Manner of Death', 'DeathCityGeo',
       'ResidenceCityGeo', 'InjuryCityGeo'],
      dtype='object')

One thing I notice is that many of the column names contain spaces (and 'Other Significant Conditions ' even has a trailing space). Let's rename these so that there are no spaces.

In [8]:
colname_map = {'Date Type': 'Date_Type',
               'Residence City': 'Residence_City', 
               'Residence County': 'Residence_County', 
               'Residence State':'Residence_State',
               'Death City':'Death_City', 
               'Death County':'Death_County', 
               'Location if Other' : 'Location_if_Other', 
               'Description of Injury' : 'Description_of_Injury', 
               'Injury Place' : 'Injury_Place', 
               'Injury City' : 'Injury_City', 
               'Injury County' : 'Injury_County', 
               'Injury State' : 'Injury_State', 
               'Cause of Death' : 'Cause_of_Death', 
               'Other Significant Conditions ': 'Other_Significant_Conditions', 
               'Fentanyl Analogue':'Fentanyl_Analogue', 
               'Morphine (Not Heroin)':'Morphine_Not_Heroin', 
               'Opiate NOS':'Opiate_NOS', 
               'Any Opioid':'Any_Opioid', 
               'Manner of Death':'Manner_of_Death'
              }

accidental_drug_deaths_df = accidental_drug_deaths_df.rename(columns=colname_map)

In [9]:
accidental_drug_deaths_df.columns

Index(['ID', 'Date', 'Date_Type', 'Age', 'Sex', 'Race', 'Residence_City',
       'Residence_County', 'Residence_State', 'Death_City', 'Death_County',
       'Location', 'Location_if_Other', 'Description_of_Injury',
       'Injury_Place', 'Injury_City', 'Injury_County', 'Injury_State',
       'Cause_of_Death', 'Other_Significant_Conditions', 'Heroin', 'Cocaine',
       'Fentanyl', 'Fentanyl_Analogue', 'Oxycodone', 'Oxymorphone', 'Ethanol',
       'Hydrocodone', 'Benzodiazepine', 'Methadone', 'Amphet', 'Tramad',
       'Morphine_Not_Heroin', 'Hydromorphone', 'Xylazine', 'Other',
       'Opiate_NOS', 'Any_Opioid', 'Manner_of_Death', 'DeathCityGeo',
       'ResidenceCityGeo', 'InjuryCityGeo'],
      dtype='object')
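The explicit mapping above works, but the same result can be derived with a rule instead of a hand-written dictionary. The sketch below is one possible approach; `to_snake` is a hypothetical helper, and it assumes the only characters to clean up are surrounding whitespace, inner spaces, and parentheses.

```python
import re


def to_snake(name: str) -> str:
    """Trim the name, drop parentheses, and replace runs of whitespace with '_'."""
    cleaned = re.sub(r"[()]", "", name.strip())
    return re.sub(r"\s+", "_", cleaned)


# A few of the original column names, copied from the output above.
for original in ["Date Type", "Other Significant Conditions ", "Morphine (Not Heroin)", "ID"]:
    print(repr(original), "->", repr(to_snake(original)))
```

Because `DataFrame.rename` accepts a function for `columns`, the helper could be used as `accidental_drug_deaths_df.rename(columns=to_snake)`. The dictionary version has the advantage of being explicit about exactly which names change.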

### Remove unwanted rows or columns

This data set is large and has some columns that we are not going to use in our analysis. Let's remove those columns now.

In [10]:
accidental_drug_deaths_df = accidental_drug_deaths_df.drop(['Date_Type', 'Other_Significant_Conditions'], axis=1)
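The column notes earlier mention that some residence states are not CT. The sketch below shows one way such rows could be filtered. It runs on a tiny hand-made frame (`sample_df`, with IDs and states copied from rows shown in this notebook) rather than the full data. Keeping rows with a missing state is an assumption made here because many older records leave `Residence_State` empty; dropping them would be an equally defensible choice.

```python
import pandas as pd

# Three illustrative rows; the real frame has 7679.
sample_df = pd.DataFrame(
    {
        "ID": ["12-0187", "15-0239", "20-1217"],
        "Residence_State": [None, "FL", "CT"],
    }
)

is_ct = sample_df["Residence_State"] == "CT"
is_missing = sample_df["Residence_State"].isna()

# Keep CT residents and rows where the state was not recorded (assumption).
ct_or_unknown_df = sample_df[is_ct | is_missing]
print(ct_or_unknown_df)
```

Checking `accidental_drug_deaths_df['Residence_State'].value_counts(dropna=False)` first would show how many rows each choice affects.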

### Save the cleaned data frame

In [24]:
accidental_drug_deaths_df.to_csv('../data/accidental_drug_deaths_CT_clean.csv', index=False)

## Data Exploration

### Examine the distribution in the relevant columns

Let's start by looking at the range of dates in this data set. 

In [11]:
accidental_drug_deaths_df['Date'] = pd.to_datetime(accidental_drug_deaths_df['Date'])

In [12]:
accidental_drug_deaths_df['Date'].nunique()

2803

There are 2803 unique dates in the data set. 

In [11]:
accidental_drug_deaths_df.sort_values('Date')

Unnamed: 0,ID,Date,Date_Type,Age,Sex,Race,Residence_City,Residence_County,Residence_State,Death_City,...,Morphine_Not_Heroin,Hydromorphone,Xylazine,Other,Opiate_NOS,Any_Opioid,Manner_of_Death,DeathCityGeo,ResidenceCityGeo,InjuryCityGeo
231,12-0001,2012-01-01,DateofDeath,35.0,Male,White,HEBRON,TOLLAND,,HEBRON,...,,,,,,,Accident,"HEBRON, CT\n(41.658069, -72.366324)","HEBRON, CT\n(41.658069, -72.366324)","CT\n(41.575155, -72.738288)"
1839,12-0002,2012-01-03,DateofDeath,41.0,Male,White,BRISTOL,HARTFORD,,BRISTOL,...,,,,,,,Accident,"BRISTOL, CT\n(41.673037, -72.945791)","BRISTOL, CT\n(41.673037, -72.945791)","CT\n(41.575155, -72.738288)"
3415,12-0003,2012-01-04,DateofDeath,61.0,Male,Black,DANBURY,FAIRFIELD,,DANBURY,...,,,,,,,Accident,"DANBURY, CT\n(41.393666, -73.451539)","DANBURY, CT\n(41.393666, -73.451539)","CT\n(41.575155, -72.738288)"
4956,12-0004,2012-01-05,DateofDeath,51.0,Male,White,STRATFORD,FAIRFIELD,,BRIDGEPORT,...,,,,,,,Accident,"BRIDGEPORT, CT\n(41.179195, -73.189476)","STRATFORD, CT\n(41.200888, -73.131323)","CT\n(41.575155, -72.738288)"
6969,12-0005,2012-01-07,DateofDeath,45.0,Male,White,HARTFORD,HARTFORD,,HARTFORD,...,,,,,,,Accident,"HARTFORD, CT\n(41.765775, -72.673356)","HARTFORD, CT\n(41.765775, -72.673356)","CT\n(41.575155, -72.738288)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3418,20-1374,2020-12-31,DateofDeath,64.0,Male,White,NORWICH,NEW LONDON,CT,NORWICH,...,,,,,,Y,Accident,"Norwich, CT\n(41.524304, -72.075821)","NORWICH, CT\n(41.524304, -72.075821)","NORWICH, CT\n(41.524304, -72.075821)"
1586,20-1373,2020-12-31,DateofDeath,32.0,Male,White,NEW HARTFORD,LITCHFIELD,CT,NEW HARTFORD,...,,,Y,,,Y,Accident,"New Hartford, CT\n(41.879454, -72.976047)","NEW HARTFORD, CT\n(41.879454, -72.976047)","NEW HARTFORD, CT\n(41.879454, -72.976047)"
6068,20-1372,2020-12-31,DateofDeath,28.0,Male,White,LEDYARD,NEW LONDON,CT,LEDYARD,...,,,,,,Y,Accident,"Ledyard, CT\n(41.445618, -72.018188)","LEDYARD, CT\n(41.445618, -72.018188)","LEDYARD, CT\n(41.445618, -72.018188)"
125,15-0729,NaT,,28.0,Male,White,,,,,...,,,,2-A,,,,"CT\n(41.575155, -72.738288)","CT\n(41.575155, -72.738288)","STRATFORD, CT\n(41.200888, -73.131323)"


In [15]:
# get the range (min and max) of the Date
accidental_drug_deaths_df['Date'].agg(['min', 'max'])

min   2012-01-01
max   2020-12-31
Name: Date, dtype: datetime64[ns]

The dates range from January 1, 2012 to December 31, 2020. 
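The sorted output above ends with a row whose date is `NaT`, so at least one record has no usable date. `min` and `max` skip missing values, which is why the range still looks clean. A small sketch of how the missing dates could be counted, using a made-up three-value series (`sample_dates`) and assuming the raw strings use the `MM/DD/YYYY` layout seen in the preview:

```python
import pandas as pd

# Two dates copied from the preview plus one missing value.
sample_dates = pd.Series(["07/17/2012", "11/19/2020", None], name="Date")

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(sample_dates, format="%m/%d/%Y", errors="coerce")

n_missing = int(parsed.isna().sum())
print("missing dates:", n_missing)
print(parsed.agg(["min", "max"]))
```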

Next, let's look at the number of deaths that involved an opioid of some sort. 

In [18]:
opioid_filter = accidental_drug_deaths_df['Any_Opioid'] == 'Y'
opioid_filter

0       False
1       False
2       False
3       False
4       False
        ...  
7674    False
7675     True
7676     True
7677     True
7678     True
Name: Any_Opioid, Length: 7679, dtype: bool

In [19]:
accidental_drug_deaths_df[opioid_filter]

Unnamed: 0,ID,Date,Age,Sex,Race,Residence_City,Residence_County,Residence_State,Death_City,Death_County,...,Morphine_Not_Heroin,Hydromorphone,Xylazine,Other,Opiate_NOS,Any_Opioid,Manner_of_Death,DeathCityGeo,ResidenceCityGeo,InjuryCityGeo
6,15-0052,2015-02-01,52.0,Male,White,MIDDLETOWN,MIDDLESEX,CT,MIDDLETOWN,MIDDLESEX,...,,,,,,Y,Accident,"MIDDLETOWN, CT\n(41.544654, -72.651713)","MIDDLETOWN, CT\n(41.544654, -72.651713)","CT\n(41.575155, -72.738288)"
7,15-0239,2015-05-21,32.0,Male,White,ORLANDO,ORANGE,FL,TORRINGTON,LITCHFIELD,...,,,,,,Y,Accident,"TORRINGTON, CT\n(41.812186, -73.101552)",,"CT\n(41.575155, -72.738288)"
9,15-0365,2015-07-17,42.0,Male,White,CANTERBURY,WINDHAM,CT,CANTERBURY,WINDHAM,...,,,,,,Y,Accident,"CANTERBURY, CT\n(41.698351, -71.971118)","CANTERBURY, CT\n(41.698351, -71.971118)","CT\n(41.575155, -72.738288)"
10,15-0715,2015-12-23,44.0,Male,White,,,,STAFFORD SPRINGS,TOLLAND,...,,,,,,Y,Accident,"STAFFORD SPRINGS, CT\n(41.953931, -72.302901)","CT\n(41.575155, -72.738288)","CT\n(41.575155, -72.738288)"
11,16-0032,2016-01-17,26.0,Male,Black,BRISTOL,HARTFORD,CT,BRISTOL,,...,,,,,,Y,Accident,"BRISTOL, CT\n(41.673037, -72.945791)","BRISTOL, CT\n(41.673037, -72.945791)","BRISTOL, CT\n(41.673037, -72.945791)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7673,18-0648,2018-08-18,51.0,Female,Black,WATERBURY,NEW HAVEN,CT,HARTFORD,HARTFORD,...,,,,,,Y,Accident,"HARTFORD, CT\n(41.765775, -72.673356)","WATERBURY, CT\n(41.554261, -73.043069)","HARTFORD, CT\n(41.765775, -72.673356)"
7675,20-1217,2020-11-19,62.0,Female,White,STAMFORD,FAIRFIELD,CT,STAMFORD,FAIRFIELD,...,,,,,,Y,Accident,"Stamford, CT\n(41.051924, -73.539475)","STAMFORD, CT\n(41.051924, -73.539475)","STAMFORD, CT\n(41.051924, -73.539475)"
7676,20-1138,2020-10-31,50.0,Female,White,NEW BRITAIN,HARTFORD,CT,NEW BRITAIN,HARTFORD,...,,,Y,,,Y,Accident,"New Britain, CT\n(41.667528, -72.783437)","NEW BRITAIN, CT\n(41.667528, -72.783437)","NEW BRITAIN, CT\n(41.667528, -72.783437)"
7677,16-0640,2016-09-17,36.0,Male,White,SHELTON,FAIRFIELD,CT,SHELTON,,...,,,,,,Y,Accident,"SHELTON, CT\n(41.316843, -73.092968)","SHELTON, CT\n(41.316843, -73.092968)","SHELTON, CT\n(41.316843, -73.092968)"


The opioid-filtered data frame, which contains only the rows where Any_Opioid is 'Y', has 4860 rows, and the full data frame has 7679. This means that about 63% of the accidental drug deaths from January 1, 2012 to December 31, 2020 involved an opioid.
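The count and the share can be read directly off the boolean filter instead of the displayed frame, since `True` counts as 1. A minimal sketch on a made-up four-value column (`any_opioid`), followed by the same arithmetic with the row counts reported in this notebook:

```python
import pandas as pd

# Illustrative values only: 'Y' marks an opioid-involved death, None means blank.
any_opioid = pd.Series(["Y", None, "Y", None], name="Any_Opioid")
mask = any_opioid == "Y"

n_opioid = int(mask.sum())   # number of True values
share = float(mask.mean())   # fraction of True values
print(n_opioid, share)

# The same calculation with the notebook's reported counts.
notebook_share = 4860 / 7679
print(f"{notebook_share:.1%}")
```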

In [23]:
# top 30 cities of residence among opioid-involved deaths
accidental_drug_deaths_df[opioid_filter]['Residence_City'].value_counts().head(30)

HARTFORD         329
WATERBURY        299
BRIDGEPORT       235
NEW HAVEN        232
NEW BRITAIN      180
BRISTOL          134
MERIDEN          106
NORWICH           98
MANCHESTER        97
TORRINGTON        94
EAST HARTFORD     88
WEST HAVEN        80
STRATFORD         79
DANBURY           76
MIDDLETOWN        76
NEW LONDON        76
ENFIELD           67
HAMDEN            60
NORWALK           60
MILFORD           55
NAUGATUCK         53
EAST HAVEN        51
STAMFORD          50
SHELTON           46
WALLINGFORD       44
SOUTHINGTON       41
ANSONIA           41
GROTON            40
VERNON            39
BRANFORD          38
Name: Residence_City, dtype: int64

### Checking years in data

In [27]:
pd.to_datetime(accidental_drug_deaths_df['Date']).dt.year.value_counts()

2020.0    1374
2019.0    1200
2017.0    1038
2018.0    1018
2016.0     917
2015.0     727
2014.0     558
2013.0     490
2012.0     355
Name: Date, dtype: int64
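The years above print as floats (`2020.0`) because the missing dates force a float result, and they are ordered by count rather than by year. A sketch of one way to get integer years in chronological order, shown on a small made-up series (`dates`):

```python
import pandas as pd

# Three dates taken from rows shown earlier, plus one missing value.
dates = pd.to_datetime(pd.Series(["2012-01-01", "2020-12-31", "2020-11-19", None]))

# Dropping NaT first keeps the years as integers; sort_index orders by year.
deaths_per_year = dates.dropna().dt.year.value_counts().sort_index()
print(deaths_per_year)
```

Sorted this way, the counts shown above rise from 355 in 2012 to 1374 in 2020, with a small dip from 1038 in 2017 to 1018 in 2018.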

### What observations and questions do you have after exploring the data?

After exploring the data, we see that 4860 deaths from 2012-2020 involved opioids. In our project, we want to explore several questions further. Has the number of these deaths increased or decreased over time? Which specific opioid is causing the most deaths? Is there a strong correlation between opioid deaths and opioid prescriptions? Are certain locations in CT more prone to opioid deaths than others?
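As a starting point for the "which specific opioid" question, the per-drug columns can be counted in one pass. The sketch below uses a made-up three-row frame (`sample_df`) with only three of the drug columns, and it assumes involvement is always marked with exactly `'Y'`; the real columns should be checked for other markers first.

```python
import pandas as pd

# Illustrative subset of the drug columns; the values are invented for the example.
opioid_columns = ["Heroin", "Fentanyl", "Oxycodone"]
sample_df = pd.DataFrame(
    {
        "Heroin": ["Y", None, "Y"],
        "Fentanyl": ["Y", "Y", "Y"],
        "Oxycodone": [None, None, "Y"],
    }
)

# Compare every cell with 'Y', then sum the True values per column.
counts = (sample_df[opioid_columns] == "Y").sum().sort_values(ascending=False)
print(counts)
```

A single death can involve several drugs, so these per-drug counts can add up to more than the number of deaths.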