# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file at `https://hds5210-data.s3.amazonaws.com/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric. The rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`
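
The two cleanup notes above can be sketched on a tiny made-up frame. The rows below are invented for illustration, and the `MM/DD/YYYY` date format is an assumption; check how the dates actually appear in the real file before reusing the `format` argument.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from complications_all.csv
toy = pd.DataFrame({
    'Start Date': ['04/01/2015', '07/01/2016', '04/01/2015'],
    'End Date': ['03/31/2018', '06/30/2019', '03/31/2018'],
    'Denominator': ['120', 'Not Available', '45'],
})

# Drop the records whose Denominator is 'Not Available', then make it numeric
cleaned = toy[toy['Denominator'] != 'Not Available'].copy()
cleaned['Denominator'] = cleaned['Denominator'].astype(int)

# Convert the date strings to real datetime values (format is an assumption)
cleaned['Start Date'] = pd.to_datetime(cleaned['Start Date'], format='%m/%d/%Y')
cleaned['End Date'] = pd.to_datetime(cleaned['End Date'], format='%m/%d/%Y')

print(cleaned.dtypes)
```

Doing the cleanup before any aggregation matters: `min`, `max`, and `sum` on raw strings compare and concatenate text rather than working with dates and numbers.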


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.
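
One way to produce those three columns in a single step is pandas named aggregation. This is a minimal sketch on made-up, already-cleaned data; it assumes the hospital is identified by a `Facility Name` column, which you should confirm against the actual file.

```python
import pandas as pd

# Toy, already-cleaned rows invented for illustration
toy = pd.DataFrame({
    'Facility Name': ['HOSPITAL A', 'HOSPITAL A', 'HOSPITAL B'],
    'Start Date': pd.to_datetime(['2015-04-01', '2016-07-01', '2017-01-01']),
    'End Date': pd.to_datetime(['2018-03-31', '2019-06-30', '2019-12-31']),
    'Denominator': [120, 45, 300],
})

# Each keyword is an output column name: (source column, aggregation)
summary = toy.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),
    end_date=('End Date', 'max'),
    number=('Denominator', 'sum'),
)

print(summary)
```

Named aggregation yields flat column names directly, so there is no multi-level column index to rename afterwards.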


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [39]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')


In [40]:
# Do your work here and in as many cells as you need to create a variable called `mo_summary` that matches the requirements

In [41]:
import pandas as pd

# Read the data from the course S3 bucket and tidy the column names
data = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')
data.columns = data.columns.str.strip()

# Keep only Missouri hospitals; .copy() avoids SettingWithCopyWarning below
mo_hospitals = data[data['State'] == 'MO'].copy()

# Convert the date columns to datetimes BEFORE aggregating
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'], errors='coerce')
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'], errors='coerce')

# Remove records where Denominator is 'Not Available', then make it numeric
mo_valid = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available'].copy()
mo_valid['Denominator'] = mo_valid['Denominator'].astype(int)

# One row per hospital: earliest start, latest end, total denominator
mo_summary = mo_valid.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),
    end_date=('End Date', 'max'),
    number=('Denominator', 'sum'),
)

print(mo_summary)

In [42]:
# Inspect the raw data without overwriting mo_summary
# (all_hospitals was loaded in the first cell)
print(all_hospitals.index)

# Look up the row with index label 63099
row = all_hospitals.loc[63099]


RangeIndex(start=0, stop=91395, step=1)
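
When checking a single row by its index label, it helps to keep `DataFrame.get` and `DataFrame.loc` apart: `get` looks up a column by name, while `loc` looks up rows by label. A small sketch on an invented frame with a default `RangeIndex`:

```python
import pandas as pd

# Toy frame invented for illustration, with a default RangeIndex
toy = pd.DataFrame({'State': ['MO', 'IL', 'MO'], 'Denominator': [10, 20, 30]})

# .get() looks up a COLUMN by name and returns None if it is missing
missing = toy.get('1')       # there is no column named '1'
column = toy.get('State')    # the State column as a Series

# .loc[] looks up a ROW by its index label (an integer here, not a string)
row = toy.loc[1]
value = toy.loc[1, 'Denominator']

print(missing, value)
```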


---

### 47.2 Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

**Healthdata.gov** https://healthdata.gov/State/Proportion-of-Adults-Who-Are-Current-Smokers-LGHC-/9tfm-gbny

**Kaggle**: an online platform for data scientists and machine learning engineers
https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset



#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

**CSV**: CSV is a simple and widely used text-based format for representing tabular data. Each line in a CSV file represents a row.

**Excel**: Excel is a spreadsheet program developed by Microsoft, and XLSX is the default file format for Excel files. It supports multiple sheets within a workbook and includes formatting, formulas, and various data types.

**JSON**: JSON is a lightweight, text-based data interchange format. It uses key-value pairs and supports nested structures. JSON is easy for humans to read and write, and it's also easy for machines to parse and generate.
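
As a rough sketch of how two of these formats could be brought into pandas and combined, the example below reads small in-memory CSV and JSON samples. The column names and values are invented for illustration and are not taken from the datasets listed above. `pd.read_excel` plays the same role for `.xlsx` files but needs an engine such as `openpyxl` installed, so it is left out of the runnable part.

```python
import io
import pandas as pd

# Tiny in-memory samples invented for illustration
csv_text = "state,smokers_pct\nMO,19.4\nIL,15.5\n"
json_text = '[{"state": "MO", "diabetes_cases": 12}, {"state": "IL", "diabetes_cases": 7}]'

# Each reader returns an ordinary DataFrame regardless of the source format
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

# Once loaded, the frames can be joined on a shared key
merged = csv_df.merge(json_df, on='state')
print(merged)
```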



#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

In a real-world healthcare setting, the suggested project would be an extensive and useful tool since it would integrate data from a variety of sources, including disease databases, electronic health records, health insurance information, and patient health surveys. The capacity to create a comprehensive and unified view of patient health information is one of the main advantages. Healthcare professionals can obtain a more comprehensive picture of a patient's medical history, treatment options, and general state of health by combining data from multiple sources. This makes decisions about care more tailored and informed.

Furthermore, the project has the potential to enhance the operational efficiency of healthcare institutions. For example, integrating data from health insurance records helps minimize errors and streamline the billing and claims processes, reducing the administrative burden. The use of electronic health records can also improve communication between providers and patients, which in turn can raise overall patient satisfaction and service quality.

The project has the potential to improve patient outcomes, streamline healthcare procedures, and support the development of a more effective and patient-centered healthcare system. It is consistent with the larger industry trend toward interoperability and data-driven decision-making, which will ultimately enhance patient care and delivery of healthcare.





---



## Submit your work via GitHub as normal
