# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)
## DA4 Cleaning and Visualization (100 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Clean data
* Use Matplotlib to visualize data
 
## Prerequisites
Before starting this programming assignment, participants should be able to:
* Utilize the Pandas library to
    * Read/write data from/to a CSV
    * Work with DataFrames and Series
    * Aggregate data and compute summary statistics

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)
* [Centers for Medicare & Medicaid Services IRF-PAI training manual](https://www.cms.gov/medicare/medicare-fee-for-service-payment/inpatientrehabfacpps/downloads/irfpai-manual-2012.pdf)

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this DA4 link to accept the assignment and create a private repository for your assignment in GitHub Classroom: https://classroom.github.com/a/NzXjwY2n

Your repo, for example, will be named GonzagaCPSC222/da4-yourusername (where yourusername is your Github username). I highly recommend committing/pushing regularly so your work is always backed up. We will grade your most recent commit, even if that commit is after the due date (your work will be marked late if this is the case).

## Programming (70 pts)
We are going to take a look at a real-world example of clinical data that needs cleaning and contains interesting insights once aggregated and visualized. For the purposes of working with data and practicing with Pandas and Matplotlib, we are going to work with this data in the following ways:
1. Load the data 
1. Clean the data
1. Aggregate the data and compute summary statistics
1. Visualize the data

### Load the Data
Download the patient_data_to_clean.csv from the DAs repo on Github: https://github.com/GonzagaCPSC222/DAs/blob/master/files. One way to download a file is to click "Raw" then right click on the page and click "Save As." Move this file into the same folder as your local DA4 Git repo. This dataset contains gender, marital status, and rehabilitation impairment category (RIC) information from 4,555 inpatient rehabilitation patients. The data has been de-identified and randomized. Here is a sample of the format of the data in the csv file:

|ID|Gender|Age|Marital Status|RIC|Admission Total FIM Score|Discharge Total FIM Score|
|-|-|-|-|-|-|-|
|0|M|80|Widowed|8|40|89|
|1|M|90|Divorced|1|65|75|
|2|M|53|Married|2|67|99|
|...|...|...|...|...|...|...|

And a description of each column:
* ID (integer): Index of the dataset. Counting numbers starting at 0.
* Gender (string): Gender of the patient, "M" for male and "F" for female.
* Age (integer): Age of the patient in years
* Marital Status (string): Description of the patient's marital status. No coding system enforced.
* RIC (integer): RIC of the patient assigned according to Appendix B in the [Centers for Medicare & Medicaid Services IRF-PAI training manual](https://www.cms.gov/medicare/medicare-fee-for-service-payment/inpatientrehabfacpps/downloads/irfpai-manual-2012.pdf).
* Admission Total FIM Score: The admission total Functional Independence Measure (FIM) score of the patient. 
    * The FIM is a clinical assessment used to measure patient functioning at inpatient rehabilitation hospitals. The FIM is measured at two distinct points in time: admission and discharge. 
    * The FIM measures the level of assistance required to perform 18 activities of daily living (ADL) tasks (e.g. eating, walking, problem-solving, etc.). 
    * The tasks are categorized as either motor (13 tasks) or cognitive (5 tasks). Each task is scored on a 7-point ordinal scale to measure independence as determined by the amount of assistance required to perform each ADL task. 
    * For more information about the FIM, see Section III in the [Centers for Medicare & Medicaid Services IRF-PAI training manual](https://www.cms.gov/medicare/medicare-fee-for-service-payment/inpatientrehabfacpps/downloads/irfpai-manual-2012.pdf).
* Discharge Total FIM Score: The discharge total FIM score of the patient.

Read the patient data into a Pandas `DataFrame` object. The index column is 0, which is the location of the ID column. The header row is the first row in the file.
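
As a minimal sketch of this step, the snippet below reads a small in-memory stand-in for the file (the three sample rows from the table above) rather than the real patient_data_to_clean.csv. The names `sample_csv` and `patients_df` are illustrative, not required by the assignment.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for patient_data_to_clean.csv,
# built from the three sample rows shown in the table above.
sample_csv = io.StringIO(
    "ID,Gender,Age,Marital Status,RIC,"
    "Admission Total FIM Score,Discharge Total FIM Score\n"
    "0,M,80,Widowed,8,40,89\n"
    "1,M,90,Divorced,1,65,75\n"
    "2,M,53,Married,2,67,99\n"
)

# header=0: the first row holds the column names.
# index_col=0: the ID column (position 0) becomes the DataFrame index.
patients_df = pd.read_csv(sample_csv, header=0, index_col=0)
print(patients_df.head())
```

With the real file, the first argument would be the filename instead of the `StringIO` object.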

### Clean the Data
Let's take a look at each column in the data and how the data needs to be cleaned:
* ID: No cleaning necessary
* Gender: No cleaning necessary
* Age: No cleaning necessary
* Marital Status: Update this data so it adheres to a strict coding system instead of free response. This column is quite messy compared to the other columns. If we only look at the first 8 rows of the dataset, the Marital Status column looks like it is well coded; however, we see for ID 8 there is a period after "Married." which doesn't match any of the previous "Married" entries. Upon further exploration, we see this column truly was free response for the clinicians to enter text. For example, take a look at IDs 33, 36, 38, 41, and 42! We are going to do some string matching to apply a uniform encoding for the Marital Status column. When cleaning this column, use a simple rule-based system to handle the various spellings and word choices that represent the following marital statuses:
    * Never married
    * Divorced
    * Married
    * Widowed
    * Separated
* RIC (integer): Decode the integer RIC label to the plain text string RIC label. Here is a dictionary storing the integer-string mappings: `ric_decoder = {1: "Stroke", 2: "TBI", 3: "NTBI", 4: "TSCI", 5: "NTSCI", 6: "Neuro", 7: "FracLE", 8: "ReplLE", 9: "Ortho", 10: "AMPLE", 11: "AMP-NLE", 12: "OsteoA", 13: "RheumA", 14: "Cardiac", 15: "Pulmonary", 16: "Pain", 17: "MMT-NBSCI", 18: "MMT-BSCI", 19: "GB", 20: "Misc", 21: "Burns"}`
    1. "Stroke"
    1. "TBI" (Traumatic brain injury)
    1. "NTBI" (Non-traumatic brain injury)
    1. "TSCI" (Traumatic spinal cord injury)
    1. "NTSCI" (Non-traumatic spinal cord injury)
    1. "Neuro" (Neurologic conditions)
    1. "FracLE" (Fracture, lower extremity)
    1. "ReplLE" (Joint replacement, lower extremity)
    1. "Ortho" (Other orthopaedic)
    1. "AMPLE" (Amputation, lower extremity)
    1. "AMP-NLE" (Amputation, upper extremity or other)
    1. "OsteoA" (Osteoarthritis)
    1. "RheumA" (Rheumatoid arthritis)
    1. "Cardiac" (Cardiac disorders)
    1. "Pulmonary" (Pulmonary disorders)
    1. "Pain" (Pain syndromes)
    1. "MMT-NBSCI" (Major multiple trauma, non brain injury or spinal cord injury)
    1. "MMT-BSCI" (Major multiple trauma, brain injury or spinal cord injury)
    1. "GB" (Guillain-Barre Syndrome)
    1. "Misc" (Miscellaneous)
    1. "Burns"
* Admission Total FIM Score: No cleaning necessary
* Discharge Total FIM Score: No cleaning necessary

Note: there are 6 Marital Status entries that we cannot classify as one of the five marital status labels listed above:
1. "D X 1 YEAR AGO"
1. "no"
1. "rried"
1. "Student"
1. "retired" 
1. "Wife."

For these cases, overwrite the entry with a null value (`NaN`) to represent missing data. Write the cleaned Pandas `DataFrame` out to a new file called patient_data_cleaned.csv (include this file in your Git repo). This dataset is now cleaned and ready for use in the next step of our data analysis pipeline.
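
One possible shape for the cleaning step is sketched below on a tiny made-up `DataFrame`. The keyword rules in `MARITAL_RULES` and the helper `clean_marital_status` are hypothetical; the real column will likely need more keywords than these. The `ric_decoder` here is an abbreviated copy of the full dictionary given above.

```python
import numpy as np
import pandas as pd

# Hypothetical keyword rules, checked in order. "never" must come before
# "married" because "Never married" contains both words.
MARITAL_RULES = [
    ("never", "Never married"),
    ("div", "Divorced"),
    ("wid", "Widowed"),
    ("sep", "Separated"),
    ("married", "Married"),
]


def clean_marital_status(raw):
    """Return the coded label for one free-text entry, or NaN if no rule matches."""
    if pd.isna(raw):
        return np.nan
    text = str(raw).strip().lower()
    for keyword, label in MARITAL_RULES:
        if keyword in text:
            return label
    return np.nan  # unclassifiable entries become missing data


# Abbreviated copy of the RIC mapping; the assignment uses the full dictionary.
ric_decoder = {1: "Stroke", 2: "TBI", 8: "ReplLE"}

# Made-up rows that mimic the kinds of entries described above.
df = pd.DataFrame(
    {
        "Marital Status": ["Widowed", "Married.", "never married", "rried", "Wife."],
        "RIC": [8, 1, 2, 8, 1],
    }
)
df.index.name = "ID"

df["Marital Status"] = df["Marital Status"].apply(clean_marital_status)
df["RIC"] = df["RIC"].map(ric_decoder)  # codes missing from the dict map to NaN

df.to_csv("patient_data_cleaned.csv")
print(df)
```

Ordered substring rules keep the logic easy to audit: each new spelling found in the data becomes one more `(keyword, label)` pair instead of another branch.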

### Aggregate the Data
Construct a Pandas `Series` with the following statistics about the cleaned data:
1. `patients_total`: total number of patients
1. `males_total`: total number of males
1. `females_total`: total number of females
1. `married_total`: total number of married patients
1. `most_common_RIC`: RIC label for the most commonly occurring RIC
1. `most_common_RIC_total`: total number of patients with the most commonly occurring RIC
1. `stroke_age_avg`: average age for stroke patients
1. `stroke_age_std`: standard deviation of age for stroke patients
1. `stroke_age_male_avg`: average age for male stroke patients
1. `stroke_age_male_std`: standard deviation of age for male stroke patients
1. `stroke_age_female_avg`: average age for female stroke patients
1. `stroke_age_female_std`: standard deviation of age for female stroke patients

Write this `Series` to a file called patients_stats.csv (include this file in your Git repo).
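
The sketch below shows one way to assemble a subset of these statistics, using a four-row made-up `DataFrame` in place of the cleaned data. The variable names (`cleaned_df`, `ric_counts`, `stroke_df`, `stats_ser`) are illustrative, and the remaining statistics follow the same pattern. Note that `Series.std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up stand-in for the cleaned patient data.
cleaned_df = pd.DataFrame(
    {
        "Gender": ["M", "M", "F", "F"],
        "Age": [80, 90, 70, 53],
        "Marital Status": ["Widowed", "Divorced", "Married", "Married"],
        "RIC": ["ReplLE", "Stroke", "Stroke", "TBI"],
    }
)

ric_counts = cleaned_df["RIC"].value_counts()
stroke_df = cleaned_df[cleaned_df["RIC"] == "Stroke"]
male_stroke_df = stroke_df[stroke_df["Gender"] == "M"]

# Building the Series from a dict keeps each statistic next to its label.
stats_ser = pd.Series(
    {
        "patients_total": len(cleaned_df),
        "males_total": (cleaned_df["Gender"] == "M").sum(),
        "females_total": (cleaned_df["Gender"] == "F").sum(),
        "married_total": (cleaned_df["Marital Status"] == "Married").sum(),
        "most_common_RIC": ric_counts.idxmax(),
        "most_common_RIC_total": ric_counts.max(),
        "stroke_age_avg": stroke_df["Age"].mean(),
        "stroke_age_std": stroke_df["Age"].std(),
        "stroke_age_male_avg": male_stroke_df["Age"].mean(),
    }
)

stats_ser.to_csv("patients_stats.csv", header=False)
print(stats_ser)
```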

### Visualize the Data
For each RIC category with sufficient data, produce the following plots using Matplotlib (note: save them as PNGs, don't `plt.show()` them; more info on the PNG file naming convention is below):
1. Age [histogram](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.hist.html)
    * X axis label: "Age (years)"
    * Y axis label: "Frequency"
    * Title: "`<RIC>` Age (N=`<total>`): Mean: `<2 decimal places>`, StdDev: `<2 decimal places>`"
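    * Bars: Green, 30 bins, normed (in current Matplotlib versions, pass `density=True`; the older `normed` argument has been removed)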
    * Example:   
<img src="https://github.com/GonzagaCPSC222/DAs/raw/master/figures/ReplLE_age.png" width="400">
1. FIM [scatter](http://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html) plot
    * X axis label: Admission FIM score
    * Y axis label: Discharge FIM score
    * Title: "`<RIC>` (N=`<total>`)"
    * Male scatter points: Blue, point markers ("."), size 100, label "Male (N=`<total>`)"
    * Female scatter points: Red, plus markers ("+"), size 100, label "Female (N=`<total>`)" 
    * Y = X [line](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot): Black, dashed line style ("--"), [x limits](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xlim.html) and [y limits](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_ylim.html) are [0, 140]
        * This is called a "no change" line, Y = X. This line represents when the discharge FIM score is the same as the admission FIM score. Patients above this line showed a FIM score improvement, patients below this line showed a regression.
    * [Legend](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend): lower right corner ("4")
    * Example:  
<img src="https://github.com/GonzagaCPSC222/DAs/raw/master/figures/ReplLE_fim.png" width="500">  

Name the plot PNGs using the following naming convention: `<RIC>_<column name>_<plot type>.png`. For example, ReplLE_age_histogram.png and ReplLE_fim_scatter.png. Include all plot PNGs in your Git repo.
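
As a rough sketch of both plots for a single RIC, the snippet below uses four made-up rows in place of the real data. The `Agg` backend and the variable names are assumptions made so the example runs without a display; `density=True` is used as the current Matplotlib equivalent of "normed".

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # render straight to files; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the patients of one RIC category.
ric_label = "ReplLE"
ric_df = pd.DataFrame(
    {
        "Gender": ["M", "F", "F", "M"],
        "Age": [80, 72, 65, 58],
        "Admission Total FIM Score": [40, 55, 60, 70],
        "Discharge Total FIM Score": [89, 80, 95, 68],
    }
)

# Age histogram
ages = ric_df["Age"]
plt.figure()
plt.hist(ages, bins=30, color="green", density=True)
plt.xlabel("Age (years)")
plt.ylabel("Frequency")
plt.title(
    f"{ric_label} Age (N={len(ages)}): "
    f"Mean: {ages.mean():.2f}, StdDev: {ages.std():.2f}"
)
hist_path = Path(f"{ric_label}_age_histogram.png")
plt.savefig(hist_path)
plt.close()

# FIM scatter plot with the Y = X "no change" line
plt.figure()
for code, name, color, marker in [("M", "Male", "blue", "."), ("F", "Female", "red", "+")]:
    sub_df = ric_df[ric_df["Gender"] == code]
    plt.scatter(
        sub_df["Admission Total FIM Score"],
        sub_df["Discharge Total FIM Score"],
        color=color, marker=marker, s=100, label=f"{name} (N={len(sub_df)})",
    )
plt.plot([0, 140], [0, 140], color="black", linestyle="--")
plt.xlim(0, 140)
plt.ylim(0, 140)
plt.xlabel("Admission FIM score")
plt.ylabel("Discharge FIM score")
plt.title(f"{ric_label} (N={len(ric_df)})")
plt.legend(loc=4)  # 4 is the integer code for "lower right"
scatter_path = Path(f"{ric_label}_fim_scatter.png")
plt.savefig(scatter_path)
plt.close()
```

Calling `plt.figure()` before each plot and `plt.close()` after saving keeps plots for different RIC categories from drawing on top of each other inside a loop.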

### Bonus (5 pts)
Add the following improvements to the age histogram plots for each RIC category:
* Use LaTeX to show the plot title as follows: "`<RIC>` Age (N=`<total>`): $\mu=$ `<2 decimal places>`, $\sigma=$ `<2 decimal places>`"
* Use SciPy's [Normal Probability Density Function (PDF)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) to show an estimated normal curve fit over the histogram: Red, line width 3

Example: 

<img src="https://github.com/GonzagaCPSC222/DAs/raw/master/figures/ReplLE_age_BONUS.png" width="400">
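
A minimal sketch of the bonus plot follows, assuming a handful of made-up ages and the non-interactive `Agg` backend. It evaluates `scipy.stats.norm.pdf` at the sample mean and sample standard deviation and draws the curve over the density-normalized histogram. Matplotlib's built-in mathtext renders the `$\mu$` and `$\sigma$` symbols, so no LaTeX installation is assumed.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # render straight to a file; no window is opened
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Made-up ages standing in for the patients of one RIC category.
ric_label = "ReplLE"
ages = np.array([58, 65, 72, 80, 69, 74])
mu = ages.mean()
sigma = ages.std(ddof=1)  # sample standard deviation, matching pandas' default

plt.figure()
# density=True puts the histogram on the same scale as the PDF.
plt.hist(ages, bins=30, color="green", density=True)

# Evaluate the normal PDF on a fine grid spanning the observed ages.
x_vals = np.linspace(ages.min(), ages.max(), 200)
pdf_vals = stats.norm.pdf(x_vals, loc=mu, scale=sigma)
plt.plot(x_vals, pdf_vals, color="red", linewidth=3)

plt.xlabel("Age (years)")
plt.ylabel("Frequency")
# Raw f-string so the backslashes in \mu and \sigma reach mathtext unchanged.
plt.title(rf"{ric_label} Age (N={len(ages)}): $\mu=${mu:.2f}, $\sigma=${sigma:.2f}")

bonus_path = Path(f"{ric_label}_age_histogram.png")
plt.savefig(bonus_path)
plt.close()
```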

## Project Part 4 (15 pts)
For the dataset you are most likely going to use for your project (note: you don't need to commit to this now, though you will need to commit to a project dataset in DA5), download a CSV file of your data and include it in your Git repo. In Excel or Python, clean the data to the point where you can load the data into a Pandas `DataFrame`. In a Python file named project_part4.py, document any issues or questions you have cleaning your data in comment blocks. Once you have loaded your CSV file into a `DataFrame`, produce at least one plot of the data by saving the plot to a PNG (include this plot in your Git repo). The plot should contain at least two columns of data and should be a different plot type than a histogram or line/scatterplot. Feel free to use a different plot type covered in class or try a new one by exploring the [Matplotlib examples](https://matplotlib.org/3.3.2/tutorials/introductory/sample_plots.html). 

## Data Ethics (15 pts)
Read Chapter 1 of [Weapons of Math Destruction](https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815) by Cathy O'Neil. This book is a NY Times bestseller, was longlisted for the National Book Award, and is frequently mentioned as one of the top non-technical data science/big data books that everyone should read (here are a few lists: [kdnuggets](https://www.kdnuggets.com/2019/12/non-technical-reading-list-data-science.html), [dataquest](https://www.dataquest.io/blog/data-science-books/), [builtin](https://builtin.com/data-science/data-science-books), etc.). You don't need to purchase this book, unless you want a hard copy or you want to read the whole thing. I'll [post the sections we will read to Piazza](https://piazza.com/class/ka4awqqd2opro?cid=15) so they are not publicly available on Github.

In a document called ethics (a text file, markdown file, or PDF is acceptable), provide your reflection on the following discussion points:
1. O'Neil uses baseball as an example of a domain that has been "long ruled by the gut". Do you think such domains, like baseball, benefit from predictive models? Are there some domains that you think models should not be applied to? Or if models are already applied to such domains, do you wish they had not been applied? Explain your reasoning with examples.
1. Based on your experience with social media (or the experiences of your friends/family/others), what models do you think are used in social media? Would you consider these models WMDs? Here are some questions from the chapter to help you form your answer: "...is the model opaque, or even invisible?", "Does the model work against the subject's interests?", "In short, is it unfair?", "Does it damage or destroy lives?", and "...can it scale?" Cite at least one source to support your claims.
1. What else struck you about this chapter?

This write-up should be written using full sentences and should be grammatically correct. Proofread your writing before you submit it!

## Submitting Assignments
1. Use Github classroom to submit your assignment via a Github repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this. You must commit your solution by the due date and time.
1. Your repo should contain only your .py file(s), your .csv file(s), your .png file(s) and your write-up file(s) (.txt, .md, or .pdf). Double check that this is the case by cloning (or downloading a zip) your submission repo and running your code from VS Code like we will when we grade your code. Note: there are several files your final Git repo should contain. Here is a checklist you can use to ensure your submission is complete:
    1. main.py and associated utility source files
    1. patient_data_to_clean.csv
    1. patient_data_cleaned.csv
    1. patients_stats.csv
    1. Histogram PNGs for each RIC with sufficient data (note: how much data is needed to create a plot is for you to determine)
    1. Scatter PNGs for each RIC with sufficient data (note: how much data is needed to create a plot is for you to determine)
    1. project_part4.py
    1. PNG for your own data
    1. ethics.txt (or .md or .pdf)

## Grading Guidelines
This assignment is worth 100 points + 5 points bonus. Your assignment will be evaluated based on a successful execution in VS Code (using the Anaconda Python Distribution v3.8) and adherence to the program requirements. We will grade according to the following criteria:
* 5 pts for decoding the RIC
* 10 pts for cleaning the marital status
* 5 pts for writing the cleaned `DataFrame` to a CSV file
* 15 pts for computing the correct stats, storing them in a `Series`, and writing the `Series` to a CSV file 
* 15 pts for correct age histogram plot PNGs for each RIC
* 15 pts for correct FIM scatterplot PNGs for each RIC
* 5 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC222/DAs/blob/master/Coding%20Standard.ipynb)
* 15 pts for including a CSV and a plot of your own data, along with code to load the data and plot it in project_part4.py
* 15 pts for quality, clarity, and creativity in the data ethics write-up