# Week 11 Assignment

Because I was unable to conduct our workshop this week, I'm keeping the assignment light as well.  Below you'll find just two steps for this week: one programming exercise and then a planning activity for your final project.

For clarification, the "final project" I've been referring to is your "final."  It is not a project in addition to a final exam.  They're one and the same.

Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file in `/data/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to just be numeric - the rule that you should use is to simply remove any records where the `Denominator` is `'Not Available'`
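
As a rough illustration of the two notes above, here is a minimal sketch on a tiny made-up table. The values and the `MM/DD/YYYY` date format are assumptions for illustration only; check how the dates actually appear in `complications_all.csv` before relying on a fixed format.

```python
import pandas as pd

# Toy rows that mimic the three columns discussed above (all values are made up).
toy = pd.DataFrame({
    "Start Date": ["04/01/2015", "07/01/2016", "04/01/2015"],
    "End Date": ["03/31/2018", "06/30/2018", "03/31/2018"],
    "Denominator": ["120", "Not Available", "45"],
})

# Drop rows whose Denominator is the literal string 'Not Available'.
# .copy() makes an independent frame so the assignments below are safe.
clean = toy[toy["Denominator"] != "Not Available"].copy()

# Convert the remaining text values to real numeric and datetime types.
# The explicit format string is an assumption about how the dates are written.
clean["Denominator"] = clean["Denominator"].astype(int)
clean["Start Date"] = pd.to_datetime(clean["Start Date"], format="%m/%d/%Y")
clean["End Date"] = pd.to_datetime(clean["End Date"], format="%m/%d/%Y")

print(clean.dtypes)
```

Filtering before converting matters here: calling `astype(int)` while a `'Not Available'` string is still present would raise a `ValueError`.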


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [1]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
data = pd.read_csv('complications_all.csv')
# Boolean mask selecting the Missouri rows (named is_mo to avoid shadowing the built-in filter)
is_mo = data['State'] == "MO"
mo_hospitals = data[is_mo]

In [2]:
# These assertions will help make sure that you're on the right track.
assert(mo_hospitals['State'].unique() == ['MO'])
assert(mo_hospitals.shape == (2133,18))

In [3]:
mo_hospitals.shape

(2133, 18)

In [4]:
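# Filter out denominators that are not numbers, i.e., those that say 'Not Available'
has_denominator = mo_hospitals['Denominator'] != 'Not Available'

In [5]:
# Copy the filtered rows so that adding columns later does not trigger a SettingWithCopyWarning
mo_hospitals = mo_hospitals[has_denominator].copy()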

In [6]:
#make data type integer for denominator field 
number = mo_hospitals['Denominator'].astype(int)

In [7]:
mo_hospitals['number']= number

In [8]:
# Convert the date columns to datetime, then aggregate each summary field by facility name
mo_hospitals['start_date'] = pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['end_date'] = pd.to_datetime(mo_hospitals['End Date'])
mo_denominator = mo_hospitals.groupby('Facility Name')['number'].sum()
mo_start = mo_hospitals.groupby('Facility Name')['start_date'].min()
mo_end = mo_hospitals.groupby('Facility Name')['end_date'].max()

In [9]:
# Concatenate the three series into a new data frame with one row per facility
mo_summary = pd.concat([mo_denominator, mo_start, mo_end], axis=1)


In [10]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

In [11]:
mo_summary['number'].sum() 

1766908

In [12]:
mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313

True
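
As an alternative to building three separate series and concatenating them, the same kind of summary can be produced in a single `groupby` call using pandas named aggregation. The sketch below runs on a tiny made-up frame (the hospital names and numbers are hypothetical, not taken from `complications_all.csv`), and it is one possible approach rather than the required one.

```python
import pandas as pd

# Toy data, already cleaned: made-up hospital names and values.
toy = pd.DataFrame({
    "Facility Name": ["HOSPITAL A", "HOSPITAL A", "HOSPITAL B"],
    "start_date": pd.to_datetime(["2015-04-01", "2016-07-01", "2015-07-01"]),
    "end_date": pd.to_datetime(["2018-03-31", "2018-06-30", "2017-06-30"]),
    "number": [100, 50, 30],
})

# Named aggregation: each keyword is an output column name,
# and each tuple is (input column, aggregation function).
toy_summary = toy.groupby("Facility Name").agg(
    start_date=("start_date", "min"),
    end_date=("end_date", "max"),
    number=("number", "sum"),
)

print(toy_summary)
```

The result is indexed by `Facility Name`, so lookups such as `toy_summary.loc['HOSPITAL A'].number` work the same way as with the concatenated version. The column order here is `start_date`, `end_date`, `number`, which differs from the `pd.concat` version but does not change the shape.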

---

### 47.2: Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.


First, I will use data from the KFF analysis of Centers for Disease Control and Prevention (CDC), National Center for Health Statistics data on Opioid Overdose Death Rates and All Drug Overdose Death Rates per 100,000 Population (Age-Adjusted). It is in raw CSV format.

Next, I will use data from this same site to assess the incidence of depression and anxiety per 100,000 population. I will download both of these KFF files and use them as local files.

I also plan to use data from the internet via the website https://worldpopulationreview.com/. I will be looking at a table on the website that ranks the "happiness" score of the 50 states.

Lastly, I will use data from a website about naloxone laws in each state. I will be reading text files from the internet.

#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.



I will be using CSV files from KFF and JSON from the website. The text information I will be using will be in HTML format.

#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.


The purpose of my project is to better understand the data surrounding the opioid epidemic, to identify which states have the highest overdose rates, and to try to understand why those rates are so high. I plan to compare the happiness data with the opioid overdose death data by state to see how overdose rates correlate with the happiest and unhappiest states. This, in addition to analyzing the states with the highest rates of depression and anxiety, will hopefully show the connection between poor mental health and wellbeing and opioid overdoses.

I also plan to combine the state data with information about naloxone laws and how they differ by state. I would like to see whether having naloxone access laws helps reduce the number of fatal overdoses. For example, 8 states have laws that allow anyone to get naloxone from a pharmacist without a prescription. I will analyze the data to see whether those 8 states show fewer fatal overdoses than states that require a prescription. If possible, I will show how those states changed over time after the law was added, but I am still looking for that data.

In a real-world setting, the purpose of this project is to highlight the communities still most in need of resources to battle this very sad epidemic. Next, it is useful to see how depression and anxiety, as well as happiness scores, are correlated with overdose fatalities. Lastly, the naloxone law comparisons by state aim to show whether more states might benefit from implementing these same laws, and to draw attention to the usefulness and effectiveness of naloxone distribution as a way to mitigate opioid misuse and fight back against this epidemic. This project is interesting because, despite efforts to reduce misuse, the problem has continued to worsen in recent years. This issue is deeply tied to the pharma industry and its financial gains. The more we know about the causes of this epidemic and the proposed solutions many states are trying, the better chance we have as a nation of saving more lives in the future. A goal of this project is to promote public awareness of the issue and its solutions as well. So many people, myself included, have a personal connection to this issue and wish to share it with those around them. That is another reason this project would be of interest to many.



---



## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Follow the instructions on the prompt below to either save and submit your work, or continue working.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

---

In [13]:
a=input('''
Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

''')

if a=='yes':
    !git add week11_assignment_2.ipynb
    !git commit -a -m "Submitting the week 11 programming assignment"
    !git push
else:
    print('''
    
OK. We can wait.
''')


Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

 yes


[main f886d5b] Submitting the week 11 programming assignment
 2 files changed, 402 insertions(+), 6 deletions(-)
 create mode 100644 week11/week11_assignment_2.ipynb
Counting objects: 5, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 5.18 KiB | 5.18 MiB/s, done.
Total 5 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To github.com:lillyeversman/hds5210-2022.git
   1a53223..f886d5b  main -> main
