# Week 10 Assignment

Because I was unable to conduct our workshop this week, I'm keeping the assignment light as well.  Below you'll find just two steps for this week: one programming exercise and then a planning activity for your final project.

For clarification, the "final project" I've been referring to is your "final."  It is not a project in addition to a final exam.  They're one and the same.

Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file in `/data/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric - the rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`
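
The two notes above can be sketched on a small made-up frame. This is only an illustration: the rows in `toy` are invented, and the `%m/%d/%Y` date format is an assumption about how the dates are written in the real file.

```python
import pandas as pd

# Toy rows that mimic the three columns named above; the values are made up.
toy = pd.DataFrame({
    "Start Date": ["04/01/2015", "07/01/2015", "04/01/2015"],
    "End Date": ["06/30/2018", "06/30/2018", "03/31/2017"],
    "Denominator": ["120", "Not Available", "35"],
})

# Drop the rows whose Denominator is the placeholder string, and take a copy
# so the assignments below modify an independent frame.
cleaned = toy[toy["Denominator"] != "Not Available"].copy()

# Convert the remaining strings to numbers and the date strings to datetimes.
# The format string is an assumption about the file's date layout.
cleaned["Denominator"] = pd.to_numeric(cleaned["Denominator"])
cleaned["Start Date"] = pd.to_datetime(cleaned["Start Date"], format="%m/%d/%Y")
cleaned["End Date"] = pd.to_datetime(cleaned["End Date"], format="%m/%d/%Y")
```

Filtering before calling `pd.to_numeric` matters here, because the conversion would otherwise raise an error on the `'Not Available'` strings.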


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.
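
One possible way to get that shape is a `groupby` with named aggregation. The sketch below runs on invented rows (`HOSPITAL A` and `HOSPITAL B` are placeholders, not real facilities) and assumes the dates and denominators have already been converted; it is one option, not the required approach.

```python
import pandas as pd

# Made-up, already-cleaned rows: two records for one hospital, one for another.
toy = pd.DataFrame({
    "Facility Name": ["HOSPITAL A", "HOSPITAL A", "HOSPITAL B"],
    "Start Date": pd.to_datetime(["2015-04-01", "2015-07-01", "2015-07-01"]),
    "End Date": pd.to_datetime(["2017-03-31", "2018-06-30", "2018-06-30"]),
    "Denominator": [120, 35, 10],
})

# Named aggregation: each keyword is the output column name, and each tuple
# is (source column, aggregation function).
toy_summary = toy.groupby("Facility Name").agg(
    start_date=("Start Date", "min"),
    end_date=("End Date", "max"),
    number=("Denominator", "sum"),
)
```

The result is indexed by facility name, so a lookup such as `toy_summary.loc["HOSPITAL A"]` returns that hospital's row.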


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [82]:
import pandas as pd
import numpy as np

mo_hospitals = pd.read_csv('/data/complications_all.csv', sep=',', skipinitialspace=True)  # load the file

# Keep only Missouri hospitals; .copy() avoids a SettingWithCopyWarning on the assignments below
mo_hospitals = mo_hospitals[mo_hospitals['State'] == 'MO'].copy()
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'], format='%m/%d/%Y')  # convert column to datetime
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'], format='%m/%d/%Y')  # convert column to datetime

In [81]:
# These assertions will help make sure that you're on the right track.
assert(mo_hospitals['State'].unique() == ['MO'])
assert(mo_hospitals.shape == (2133,18))

In [None]:
# Drop the rows whose Denominator is the string 'Not Available'
mo_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available'].copy()
# Alternative (not used): keep the rows but convert the string to NaN
# mo_hospitals['Denominator'] = mo_hospitals['Denominator'].replace('Not Available', np.nan)
mo_hospitals['Denominator'] = pd.to_numeric(mo_hospitals['Denominator'])  # convert column to numeric
mo_hospitals['Facility Name'] = mo_hospitals['Facility Name'].str.upper()  # upper-case the names so the lookups in the tests match

mo_hospitals.head()

In [74]:
data = []
facilities = mo_hospitals['Facility Name'].unique()

# Collect the summary values for each facility
for facility in facilities:
    df = mo_hospitals[mo_hospitals['Facility Name'] == facility]
    #ID = df['Facility ID'].unique()[0]
    start_date = df['Start Date'].min()  # earliest start date
    end_date = df['End Date'].max()  # latest end date
    number = df['Denominator'].sum()  # total denominator
    data.append([start_date, end_date, number])

In [75]:
mo_summary = pd.DataFrame(data,index=facilities,columns = ['start_date','end_date','number'])

In [77]:
mo_summary.head()

| | start_date | end_date | number |
|---|---|---|---|
| MERCY HOSPITAL JOPLIN | 2015-04-01 | 2018-06-30 | 29977 |
| COOPER COUNTY COMMUNITY HOSPITAL | 2015-07-01 | 2018-06-30 | 697 |
| SSM HEALTH ST JOSEPH HOSPITAL-ST CHARLES | 2015-04-01 | 2018-06-30 | 23443 |
| MOSAIC LIFE CARE AT ST JOSEPH | 2015-04-01 | 2018-06-30 | 54860 |
| BOTHWELL REGIONAL HEALTH CENTER | 2015-04-01 | 2018-06-30 | 14717 |


In [79]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))  
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2 Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

I will be using two distinct types of sources: Internet and Kaggle. 

#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

From the Internet, I will be web scraping information from the article/webpage "Best Workplaces for Millennials" and two job search engines (Indeed and Google Jobs). I still need to figure out how to get company ratings from these sources.

From Kaggle, I will be using the dataset "Data Scientist Jobs", which is formatted as a CSV file. The file contains data scraped from Glassdoor in August 2020 and already includes company ratings.

#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

Project Outline:
1. Find current job openings that match specific keywords of interest provided by a user (use the Kaggle dataset to help identify companies with the best ratings)
2. Explore today's data science job market using the information from part 1.
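
A rough sketch of how step 1 might look once the listings are in a data frame. Everything here is hypothetical: the column names `job_title`, `company`, and `rating`, the helper `match_jobs`, and the rows themselves are invented for illustration and do not come from the Kaggle file.

```python
import re
import pandas as pd

# Invented listings; the real data would come from the scraped sources and Kaggle.
jobs = pd.DataFrame({
    "job_title": ["Data Scientist", "Senior Data Analyst",
                  "Nurse Practitioner", "Machine Learning Engineer"],
    "company": ["Acme", "Globex", "Initech", "Umbrella"],
    "rating": [4.1, 3.6, 4.5, 3.9],
})

def match_jobs(listings, keywords):
    """Return listings whose title contains any keyword, best-rated first."""
    # Escape each keyword so it is matched literally, then OR them together.
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = listings["job_title"].str.contains(pattern, case=False, regex=True)
    return listings[mask].sort_values("rating", ascending=False)

matches = match_jobs(jobs, ["data", "machine learning"])
```

Sorting by rating is just one way to use the ratings; joining the listings against a separate table of company ratings would be another.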


Why is this interesting?
1. People spend a lot of time searching for jobs; the first part of the project will help them find jobs tailored to their interests. Being happily employed is vital to the economy, and data science positions are especially important because they drive companies to improve processes that ultimately (hopefully) have a positive impact on consumers, or on patients if it's a healthcare setting.

2. The second part of the project will show what to expect in today's data scientist job market. Is it still a popular job? How many openings are there currently? I will compare results with literature/articles about data science before COVID-19.


---

## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Save this notebook with Ctrl-S (or Cmd-S)
3. Skip down to the last command cell (the one starting with `%%bash`) and run that cell.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

In [None]:
assert False, "DO NOT REMOVE THIS LINE"

---

In [None]:
%%bash
git pull
git add week10_assignment_2.ipynb
git commit -a -m "Submitting the week 10 programming assignment"
git push


---

If the message above says something like _Submitting the week 10 programming assignment_ or _Everything up-to-date_, then your work was submitted correctly.