# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)
# L2-PA Clinical Data Aggregation (100 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Load data into Pandas objects
* Save data in Pandas objects to csv files
* Clean data
* Aggregate data and compute summary statistics
* Visualize data

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Write object-oriented Python code
* Use the `pandas` library

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)
* [Centers for Medicare & Medicaid Services IRF-PAI training manual](https://www.cms.gov/medicare/medicare-fee-for-service-payment/inpatientrehabfacpps/downloads/irfpai-manual-2012.pdf)

## Overview and Requirements
For this programming assignment, we are going to take a look at a real-world example of clinical data that needs cleaning and contains interesting insights once aggregated and visualized. For the purposes of working with data and practicing with Pandas and Matplotlib, we are going to work with this data in the following ways:
1. Load the data 
1. Clean the data
1. Aggregate the data
1. Compute summary statistics from the data
1. Plot the data

## Program Details
### Dataset
Download the [patient_data_to_clean.csv](https://raw.githubusercontent.com/gsprint23/aha/master/progassignments/files/patient_data_to_clean.csv) dataset. This dataset contains gender, marital status, and rehabilitation impairment category (RIC) information from 4,555 inpatient rehabilitation patients. The data has been de-identified and randomized. Here is a sample of the format of the data in the csv file:

|ID|Gender|Age|Marital Status|RIC|Admission Total FIM Score|Discharge Total FIM Score|
|-|-|-|-|-|-|-|
|0|M|80|Widowed|8|40|89|
|1|M|90|Divorced|1|65|75|
|2|M|53|Married|2|67|99|
|...|...|...|...|...|...|...|

And a description of each column:
* ID (integer): Index of the dataset. Counting numbers starting at 0.
* Gender (string): Gender of the patient, "M" for male and "F" for female.
* Age (integer): Age of the patient in years
* Marital Status (string): Description of the patient's marital status. No coding system enforced.
* RIC (integer): RIC of the patient assigned according to Appendix B in the [Centers for Medicare & Medicaid Services IRF-PAI training manual](https://www.cms.gov/medicare/medicare-fee-for-service-payment/inpatientrehabfacpps/downloads/irfpai-manual-2012.pdf).
* Admission Total FIM Score: The admission total Functional Independence Measure (FIM) score of the patient. 
    * The FIM is a clinical assessment used to measure patient functioning at inpatient rehabilitation hospitals. The FIM is measured at two distinct points in time: admission and discharge. 
    * The FIM measures the level of assistance required to perform 18 activities of daily living (ADL) tasks.
    * The tasks are categorized as either motor (13 tasks) or cognitive (5 tasks). Each task is scored on a 7-point ordinal scale to measure independence as determined by the amount of assistance required to perform each ADL task. 
    * For more information about the FIM, see Section III in the [Centers for Medicare & Medicaid Services IRF-PAI training manual](https://www.cms.gov/medicare/medicare-fee-for-service-payment/inpatientrehabfacpps/downloads/irfpai-manual-2012.pdf).
* Discharge Total FIM Score: The discharge total FIM score of the patient.

### Data Storage
Read the patient data into a `pandas` `DataFrame` object. The index column is 0, which is the location of the ID column. The header row is the first row in the file.
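
As a minimal sketch of this step, the snippet below reads a few rows in the sample format shown above. The in-memory `sample_csv` buffer is only for illustration. The real program would pass the input filename instead. The column names are assumed to match the sample table.

```python
import io

import pandas as pd

# A few rows in the same format as the sample table (illustrative only).
sample_csv = io.StringIO(
    "ID,Gender,Age,Marital Status,RIC,"
    "Admission Total FIM Score,Discharge Total FIM Score\n"
    "0,M,80,Widowed,8,40,89\n"
    "1,M,90,Divorced,1,65,75\n"
    "2,M,53,Married,2,67,99\n"
)

# header=0: the first row of the file holds the column names.
# index_col=0: the first column (ID) becomes the DataFrame index.
df = pd.read_csv(sample_csv, header=0, index_col=0)
print(df.head())
```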

### Clean the Data
Let's take a look at each column in the data and how the data needs to be cleaned:
* ID: No cleaning necessary
* Gender: Transform this data into a Boolean indicator, 1 for male and 0 for female. Rename the column from "Gender" to "Is Male" so it is clear what 1 and 0 represent in the column.
* Age: No cleaning necessary
* Marital Status: Update this data so it adheres to a strict coding system instead of free response. This column is quite messy compared to the other columns. If we only look at the first 8 rows of the dataset, the Marital Status column looks like it is well coded; however, for ID 8 there is a period after "Married." which doesn't match any of the previous "Married" entries. Upon further exploration, we see this column truly was free response for the clinicians to enter text. For example, take a look at IDs 33, 36, 38, 41, and 42! We are going to have to do some string matching to apply a uniform encoding for the Marital Status column. When cleaning this column, use a simple rule-based system to handle the various spellings and word choices that represent the following marital statuses:
    * Never married
    * Divorced
    * Married
    * Widowed
    * Separated
* RIC (integer): Decode the integer RIC label to the plain text string RIC label.
    1. "Stroke"
    1. "TBI" (Traumatic brain injury)
    1. "NTBI" (Non-traumatic brain injury)
    1. "TSCI" (Traumatic spinal cord injury)
    1. "NTSCI" (Non-traumatic spinal cord injury)
    1. "Neuro" (Neurologic conditions)
    1. "FracLE" (Fracture, lower extremity)
    1. "ReplLE" (Joint replacement, lower extremity)
    1. "Ortho" (Other orthopaedic)
    1. "AMPLE" (Amputation, lower extremity)
    1. "AMP-NLE" (Amputation, upper extremity or other)
    1. "OsteoA" (Osteoarthritis)
    1. "RheumA" (Rheumatoid arthritis)
    1. "Cardiac" (Cardiac disorders)
    1. "Pulmonary" (Pulmonary disorders)
    1. "Pain" (Pain syndromes)
    1. "MMT-NBSCI" (Major multiple trauma, non brain injury or spinal cord injury)
    1. "MMT-BSCI" (Major multiple trauma, brain injury or spinal cord injury)
    1. "GB" (Guillain-Barre Syndrome)
    1. "Misc" (Miscellaneous)
    1. "Burns"
* Admission Total FIM Score: No cleaning necessary
* Discharge Total FIM Score: No cleaning necessary
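
The Gender and RIC steps can be sketched as follows. `RIC_LABELS` and `clean_gender_and_ric` are hypothetical names, and the mapping assumes the integer codes 1 through 21 follow the order of the list above. The sketch also assumes the Gender column only contains "M" and "F"; any other value would be coded as 0.

```python
import pandas as pd

# Hypothetical lookup table: integer RIC code -> plain text label,
# assuming codes 1-21 follow the order of the list in this section.
RIC_LABELS = {
    1: "Stroke", 2: "TBI", 3: "NTBI", 4: "TSCI", 5: "NTSCI", 6: "Neuro",
    7: "FracLE", 8: "ReplLE", 9: "Ortho", 10: "AMPLE", 11: "AMP-NLE",
    12: "OsteoA", 13: "RheumA", 14: "Cardiac", 15: "Pulmonary", 16: "Pain",
    17: "MMT-NBSCI", 18: "MMT-BSCI", 19: "GB", 20: "Misc", 21: "Burns",
}


def clean_gender_and_ric(df):
    """Return a copy with Gender recoded as "Is Male" and RIC decoded."""
    cleaned = df.copy()
    # "M" -> 1, anything else -> 0
    cleaned["Gender"] = (cleaned["Gender"] == "M").astype(int)
    cleaned = cleaned.rename(columns={"Gender": "Is Male"})
    # Codes missing from the lookup table become NaN
    cleaned["RIC"] = cleaned["RIC"].map(RIC_LABELS)
    return cleaned
```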

Note: there are 6 Marital Status entries that we cannot classify as one of the above marital status labels:
1. "retired"
1. 227: "rried"
1. 1442: "no"
1. 1720: "Student"
1. 4073: "D X 1 YEAR AGO"
1. 4343: "Wife."

For these cases, overwrite the entry with a null value (`NaN`) to represent missing data.
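
One possible shape for the rule-based matching is sketched below. The keyword lists in `MARITAL_RULES` are assumptions chosen for illustration and would need tuning against the spellings that actually occur in the dataset. `code_marital_status` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical keyword rules, checked in order. "Never married" is tested
# before "Married" so that it is not caught by the "married" keyword.
MARITAL_RULES = [
    ("Never married", ("never", "single")),
    ("Divorced", ("divorc",)),
    ("Separated", ("separat",)),
    ("Widowed", ("widow",)),
    ("Married", ("married",)),
]


def code_marital_status(raw):
    """Map a free-response entry to a coded label, or NaN if no rule fits."""
    if not isinstance(raw, str):
        return np.nan
    text = raw.strip().lower()
    for label, keywords in MARITAL_RULES:
        if any(keyword in text for keyword in keywords):
            return label
    return np.nan


# Small demonstration on made-up entries plus two of the unclassifiable ones.
raw_statuses = pd.Series(["Married.", "NEVER MARRIED", "widowed", "Student", "rried"])
print(raw_statuses.apply(code_marital_status))
```

In the full program the same function could be applied to the whole column with `df["Marital Status"].apply(code_marital_status)`.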

### Visualization
For each RIC category, produce the following plots:
1. Age histogram
    * X axis label: "Age (years)"
    * Y axis label: "Frequency"
    * Title: "`<RIC>` Age (N=`<total>`): $\mu=$ `<2 decimal places>`, $\sigma=$ `<2 decimal places>`"
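    * [Bars](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.hist.html): Green, 30 bins, normalized (`normed=True` in older Matplotlib releases, `density=True` in current ones)
    * [Normal PDF](https://matplotlib.org/api/mlab_api.html#matplotlib.mlab.normpdf): Red, line width 3 (`matplotlib.mlab.normpdf` was removed in Matplotlib 3.1; `scipy.stats.norm.pdf` computes the same curve)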
    * Example: <img src="https://raw.githubusercontent.com/gsprint23/aha/master/progassignments/figures/ReplLE_age.png" width="400">
1. FIM scatter plot
    * X axis label: Admission FIM score
    * Y axis label: Discharge FIM score
    * Title: "`<RIC>` (N=`<total>`)"
    * Male [scatter points](http://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html): Blue, point markers ("."), size 100, label "Male (N=`<total>`)"
    * Female [scatter points](http://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html): Red, plus markers ("+"), size 100, label "Female (N=`<total>`)" 
    * Y = X [line](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot): Black, dashed line style ("--"), [x limits](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xlim.html) and [y limits](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_ylim.html) are [0, 140]
        * This is called a "no change" line, Y = X. This line represents when the discharge FIM score is the same as the admission FIM score. Patients above this line showed a FIM score improvement, patients below this line showed a regression.
    * [Legend](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend): lower right corner ("4")
    * Example: <img src="https://raw.githubusercontent.com/gsprint23/aha/master/progassignments/figures/ReplLE_fim.png" width="500">
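
The two plots might be produced along the following lines. This is a sketch under several assumptions: the function names and the `out_path` parameter are hypothetical, the data frame has already been cleaned (so it has "Is Male" and decoded "RIC" columns), the normal PDF is computed directly with NumPy rather than with a library helper, and the non-interactive `Agg` backend is selected so figures are only written to disk.

```python
import matplotlib

matplotlib.use("Agg")  # write figures to files without opening windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_age_histogram(ages, ric_label, out_path):
    """Save a normalized age histogram with a fitted normal PDF overlaid."""
    mu, sigma = ages.mean(), ages.std()
    fig, ax = plt.subplots()
    _, bins, _ = ax.hist(ages, bins=30, density=True, color="green")
    # Normal probability density evaluated at the bin edges
    pdf = np.exp(-((bins - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    ax.plot(bins, pdf, color="red", linewidth=3)
    ax.set_xlabel("Age (years)")
    ax.set_ylabel("Frequency")
    ax.set_title(
        rf"{ric_label} Age (N={len(ages)}): $\mu=$ {mu:.2f}, $\sigma=$ {sigma:.2f}"
    )
    fig.savefig(out_path)
    plt.close(fig)


def plot_fim_scatter(ric_df, ric_label, out_path):
    """Save an admission vs. discharge FIM scatter plot split by gender."""
    adm, dis = "Admission Total FIM Score", "Discharge Total FIM Score"
    males = ric_df[ric_df["Is Male"] == 1]
    females = ric_df[ric_df["Is Male"] == 0]
    fig, ax = plt.subplots()
    ax.scatter(males[adm], males[dis], color="blue", marker=".", s=100,
               label=f"Male (N={len(males)})")
    ax.scatter(females[adm], females[dis], color="red", marker="+", s=100,
               label=f"Female (N={len(females)})")
    # "No change" line, Y = X
    ax.plot([0, 140], [0, 140], color="black", linestyle="--")
    ax.set_xlim(0, 140)
    ax.set_ylim(0, 140)
    ax.set_xlabel("Admission FIM score")
    ax.set_ylabel("Discharge FIM score")
    ax.set_title(f"{ric_label} (N={len(ric_df)})")
    ax.legend(loc=4)  # 4 is the lower right corner
    fig.savefig(out_path)
    plt.close(fig)
```

Iterating with `for ric_label, ric_df in cleaned.groupby("RIC"):` is one way to call both functions once per RIC category. A category with a single patient has no defined standard deviation, so the PDF overlay would need a guard in that case.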

### Program Output
In addition to the plots, the textual output of the program includes the following two files:
1. Cleaned data: write the cleaned data frame out to a new file. This dataset is now cleaned and ready for use in the next step of our data analysis pipeline. Depending on what we want to do with the data, this could be continuing exploration by generating visualizations of the data, or perhaps scaling the features in preparation for machine learning.
1. Stats data: construct a Pandas `Series` with the following statistics about the cleaned data:
    1. `patients_total`: total number of patients
    1. `males_total`: total number of males
    1. `females_total`: total number of females
    1. `most_common_RIC`: RIC label for the most commonly occurring RIC
    1. `most_common_RIC_total`: total number of patients with the most commonly occurring RIC
    1. `stroke_age_avg`: average age for stroke patients
    1. `stroke_age_std`: standard deviation of age for stroke patients
    1. `stroke_age_male_avg`: average age for male stroke patients
    1. `stroke_age_male_std`: standard deviation of age for male stroke patients
    1. `stroke_age_female_avg`: average age for female stroke patients
    1. `stroke_age_female_std`: standard deviation of age for female stroke patients
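
A sketch of how the stats `Series` could be assembled is shown below. `compute_stats` is a hypothetical helper, and it assumes the cleaned column names "Is Male", "Age", and "RIC". Note that pandas computes the sample standard deviation (`ddof=1`) by default; the assignment text does not say which variant is expected.

```python
import pandas as pd


def compute_stats(cleaned):
    """Build the summary statistics Series from the cleaned DataFrame."""
    ric_counts = cleaned["RIC"].value_counts()
    stroke = cleaned[cleaned["RIC"] == "Stroke"]
    stroke_male = stroke[stroke["Is Male"] == 1]
    stroke_female = stroke[stroke["Is Male"] == 0]
    return pd.Series({
        "patients_total": len(cleaned),
        "males_total": int((cleaned["Is Male"] == 1).sum()),
        "females_total": int((cleaned["Is Male"] == 0).sum()),
        "most_common_RIC": ric_counts.idxmax(),
        "most_common_RIC_total": int(ric_counts.max()),
        "stroke_age_avg": stroke["Age"].mean(),
        "stroke_age_std": stroke["Age"].std(),
        "stroke_age_male_avg": stroke_male["Age"].mean(),
        "stroke_age_male_std": stroke_male["Age"].std(),
        "stroke_age_female_avg": stroke_female["Age"].mean(),
        "stroke_age_female_std": stroke_female["Age"].std(),
    })
```

Both outputs can then be written with `to_csv`, for example `cleaned.to_csv(output_data_filename)` and `compute_stats(cleaned).to_csv(output_stats_filename)`, where the two filename variables stand for the command line arguments.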

### Command Line Arguments
Your program should accept the following parameters as command line arguments (in this order):
1. Input filename: the filename of the input data specified as a relative path
1. Output data filename: the filename of the cleaned output data specified as a relative path
1. Output stats filename: the filename of the output stats data specified as a relative path
1. Output histograms directory name: the name (relative path) of the directory to save histogram plots
1. Output scatter plots directory name: the name (relative path) of the directory to save scatter plots

Example: `files\patient_data_to_clean.csv files\patient_data_cleaned.csv files\patient_data_stats.csv figures\patient_age_histograms figures\patient_fim_scatters`
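
One way to read these five positional arguments is with the standard library's `argparse` module, sketched below. The argument names are hypothetical; only their order comes from the list above. Reading `sys.argv` directly would work as well.

```python
import argparse


def parse_args(argv=None):
    """Parse the five positional arguments in the order listed above."""
    parser = argparse.ArgumentParser(
        description="Clean, summarize, and plot inpatient rehabilitation data."
    )
    parser.add_argument("input_filename", help="relative path of the input csv")
    parser.add_argument("output_data_filename", help="relative path for the cleaned data csv")
    parser.add_argument("output_stats_filename", help="relative path for the stats csv")
    parser.add_argument("histograms_dirname", help="directory for the age histograms")
    parser.add_argument("scatters_dirname", help="directory for the FIM scatter plots")
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)
```

Calling `parse_args()` with no arguments inside `main()` picks up the values from the command line. The output directories may not exist yet, so creating them with `os.makedirs(path, exist_ok=True)` before saving plots is a reasonable precaution.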

## Deliverables
For your assignment submission, turn in a .zip file with the following files:
1. main.py file with a `main()` function that drives your program
1. Two .csv files
    1. Cleaned patient data
    1. Cleaned patient data statistics
1. Two folders of plots
    1. Age histograms
    1. FIM scatter plots

## Grading Guidelines
This assignment is worth 100 points. Your assignment will be evaluated based on successful execution and adherence to the program requirements. We will grade according to the following criteria:
* 5 pts for correctly extracting program parameters from command line arguments
* 5 pts for correctly storing the patient data in a `DataFrame`
* 5 pts for correctly re-sampling
* 5 pts for correctly computing features
* 10 pts for correctly computing a weighted Normalized Euclidean distance
* 15 pts for correctly implementing the baseline comparison mode
* 15 pts for correctly implementing the sliding comparison mode
* 5 pts for correctly storing the change scores in a `Series`
* 5 pts for correctly writing the results to a csv file
* 10 pts for correctly creating age histograms
* 15 pts for correctly creating FIM scatter plots
* 5 pts for adherence to proper programming style and comments established for the class