<a href="https://colab.research.google.com/github/pushpalatha2297/hds5210-2023/blob/main/week11/week11_assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file in `https://hds5210-data.s3.amazonaws.com/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric. The rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`
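
The two cleanup notes above can be sketched on a small made-up frame. The rows and the `MM/DD/YYYY` date format below are assumptions for illustration only and are not taken from `complications_all.csv`. Only the column names and the `'Not Available'` marker come from the assignment text.

```python
import pandas as pd

# Toy rows, invented for illustration (not real values from complications_all.csv)
toy = pd.DataFrame({
    'Start Date': ['04/01/2015', '07/01/2016', '04/01/2015'],
    'End Date': ['03/31/2018', '06/30/2018', '03/31/2018'],
    'Denominator': ['120', 'Not Available', '45'],
})

# Drop rows whose Denominator is the 'Not Available' marker
cleaned = toy[toy['Denominator'] != 'Not Available'].copy()

# Plain column assignment swaps the string columns for datetime and numeric ones
# (the date format string is an assumption about how the dates are written)
cleaned['Start Date'] = pd.to_datetime(cleaned['Start Date'], format='%m/%d/%Y')
cleaned['End Date'] = pd.to_datetime(cleaned['End Date'], format='%m/%d/%Y')
cleaned['Denominator'] = pd.to_numeric(cleaned['Denominator'])

print(cleaned.dtypes)
```

Filtering before converting means `pd.to_numeric` only ever sees digit strings, so no coercion to `NaN` is needed in this toy case.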


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.
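
One way to get the three summary columns with their final names in a single step is pandas named aggregation. The sketch below uses invented hospitals and numbers (`HOSPITAL A`, `HOSPITAL B`, and `toy_summary` are hypothetical) and assumes `Facility Name` is the column that identifies a hospital.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Facility Name': ['HOSPITAL A', 'HOSPITAL A', 'HOSPITAL B'],
    'Start Date': pd.to_datetime(['2015-04-01', '2016-07-01', '2015-07-01']),
    'End Date': pd.to_datetime(['2018-03-31', '2018-06-30', '2017-12-31']),
    'Denominator': [120, 30, 45],
})

# Named aggregation: output_column=(input_column, aggregation function)
toy_summary = toy.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),
    end_date=('End Date', 'max'),
    number=('Denominator', 'sum'),
)

print(toy_summary)
```

This produces one row per facility, indexed by facility name, and avoids a separate `rename` call. An `agg` with a dictionary followed by `rename` gives the same result.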


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [1]:
import pandas as pd

# Read the data
url = 'https://hds5210-data.s3.amazonaws.com/complications_all.csv'
data = pd.read_csv(url)


In [2]:
# Filter the data to include only Missouri hospitals
mo_hospitals = data[data['State'] == 'MO'].copy()  # Added .copy() to avoid the warning

# Convert 'Start Date' and 'End Date' to datetime
# (plain column assignment replaces the column, so the dtype actually becomes datetime64)
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'])

# Remove records where the Denominator is 'Not Available'
mo_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available']

# Convert 'Denominator' to numeric, coercing errors which will turn strings that can't be converted into NaN, then drop these
mo_hospitals['Denominator'] = pd.to_numeric(mo_hospitals['Denominator'], errors='coerce')
mo_hospitals.dropna(subset=['Denominator'], inplace=True)

# Aggregate the data by hospital
mo_summary = mo_hospitals.groupby('Facility Name').agg({
    'Start Date': 'min',
    'End Date': 'max',
    'Denominator': 'sum'
})

# Rename columns for clarity
mo_summary.rename(columns={
    'Start Date': 'start_date',
    'End Date': 'end_date',
    'Denominator': 'number'
}, inplace=True)



In [3]:
assert mo_summary['number'].sum() == 1766908
assert mo_summary['start_date'].min() == pd.Timestamp(2015,4,1)
assert mo_summary['end_date'].max() == pd.Timestamp(2018,6,30)
assert mo_summary.shape == (108, 3)
assert mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313
assert mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099

In [4]:
mo_summary['number'].sum()

1766908

In [5]:
mo_summary['start_date'].min()

Timestamp('2015-04-01 00:00:00')

In [6]:
mo_summary['end_date'].max()

Timestamp('2018-06-30 00:00:00')

In [7]:
mo_summary.shape == (108, 3)

True

In [8]:
mo_summary.loc['BARNES JEWISH HOSPITAL'].number

131313

In [9]:
mo_summary.loc['BOONE HOSPITAL CENTER'].number

63099

---

### 47.2 Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

I have chosen the following dataset for my research project:



**Heart Disease Mortality by State 2021**.

Accessed from: https://www.cdc.gov/nchs/pressroom/sosmap/heart_disease_mortality/heart_disease.htm

local file: https://drive.google.com/file/d/1ALDjXt69Hhp9JT7NnEo02hFE2i87auTr/view?usp=drive_link


#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

The Heart Disease Mortality dataset was available to download in CSV format.


#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

A project on heart disease mortality in 2021 could serve several crucial purposes in a real-world setting:


Firstly, knowing the frequency and distribution of heart disease mortality across demographic groups and geographic areas gives public health professionals and policymakers crucial information. Examining death rates can reveal trends that indicate how effective public health initiatives and policies have been. It could also highlight regions with high death rates that need more funding and targeted interventions.

Secondly, by identifying probable risk factors linked to higher mortality, the project may help physicians and medical researchers. This could lead to more individualized patient care and help guide future studies on therapies, prevention strategies, and resource allocation. By looking for trends in the data, healthcare practitioners might improve patient outcomes through early intervention and better care techniques.

Finally, this is an interesting product because heart disease remains one of the world's leading causes of death. Given the substantial health burden of cardiovascular disease, a study that examines mortality statistics in depth can be beneficial on many levels. It can improve clinical knowledge of patient outcomes and inform health policy and initiatives that may save lives. It is also an example of data-driven decision-making in one of the most critical areas of public health.






---



## Submit your work via GitHub as normal
