# Exercise - Assess Data Quality and Structure Visually

In this exercise, we will look at a synthetic phase two clinical trial dataset of 350 patients for a new oral insulin called **Auralin**, a proprietary capsule that can solve a stomach lining problem.

**Note:** Auralin and Novodra are not real insulin products. This clinical trial data was fabricated for the sake of this course. The issues that you'll detect when assessing this data (and later clean) are meant to simulate real-world data quality and tidiness issues that can **impact quality of care, patient registration, and revenue**. The datasets, `patients` and `treatments`, were constructed in consultation with real doctors to ensure plausibility.

In [None]:
#DO NOT MODIFY - imports and loading data
import pandas as pd

patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')

## 1. Identify data quality issues in the patients data

The `patients` dataframe contains the following variables:

- **patient_id**: the unique identifier for each patient
- **assigned_sex**: the assigned sex of each patient at birth (male or female)
- **given_name**: the given name (i.e. first name) of each patient
- **surname**: the surname (i.e. last name) of each patient
- **address**: the main address for each patient
- **city**: the corresponding city for the main address of each patient
- **state**: the corresponding state for the main address of each patient
- **zip_code**: the corresponding zip code for the main address of each patient
- **country**: the corresponding country for the main address of each patient (all United States for this clinical trial)
- **contact**: phone number and email information for each patient
- **birthdate**: the date of birth of each patient (month/day/year).
> The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is age >= 18 *(there is no maximum age because diabetes is a [growing problem](http://www.diabetes.co.uk/diabetes-and-the-elderly.html) among the elderly population)*
- **weight**: the weight of each patient in pounds (lbs)
- **height**: the height of each patient in inches (in)
- **bmi**: the Body Mass Index (BMI) of each patient. 
> BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m<sup>2</sup> where kg is a person's weight in kilograms and m<sup>2</sup> is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. *The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is 16 <= BMI <= 38.*
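
Because `weight` is recorded in pounds and `height` in inches, the BMI formula needs a unit conversion before it can be applied. The following is a minimal sketch of that arithmetic, not the method used to build the dataset. The function names and the example values (150 lbs, 65 in) are hypothetical and are not taken from `patients`.

```python
# Conversion factors (both exact by definition)
KG_PER_LB = 0.45359237
M_PER_INCH = 0.0254


def bmi_from_imperial(weight_lbs, height_in):
    """Return BMI in kg/m^2 from a weight in pounds and a height in inches."""
    if weight_lbs <= 0 or height_in <= 0:
        raise ValueError("weight and height must be positive")
    weight_kg = weight_lbs * KG_PER_LB
    height_m = height_in * M_PER_INCH
    return weight_kg / height_m ** 2


def meets_bmi_criterion(bmi, low=16.0, high=38.0):
    """Check the trial's stated inclusion range, 16 <= BMI <= 38."""
    return low <= bmi <= high


# Hypothetical patient: 150 lbs, 65 in (illustrative values only)
example_bmi = bmi_from_imperial(150, 65)
print(round(example_bmi, 2), meets_bmi_criterion(example_bmi))
```

A recomputed value like this can be compared against the recorded `bmi` when a height or weight looks implausible.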

### 1.1 Display dataset

Display the first 10 rows of the patients data, and visually inspect the data, using directed and non-directed visual assessment as you see necessary. 

Identify **four** instances of problematic data points corresponding to **accuracy** and **validity** in the patients dataset.

**Hint**: take a look at the `given_name`, `zip_code`, and `height` columns. 

In [None]:
#FILL IN - inspect the first 10 rows of the patients data



In [None]:
#FILL IN - More code if needed

### 1.2 Identify issues

FILL IN the identified issues. For each issue, denote the row and column where the issue takes place:

- Issue 1 (Accuracy): Row ..., column ...  The issue is...
- Issue 2 (Accuracy): Row ..., column ...  The issue is...
- Issue 3 (Validity): Row ..., column ...  The issue is...
- Issue 4 (Validity): Row ..., column ...  The issue is...

## 2. Identify data quality and structural issues in the treatments data

Let's take a look at some brief context around the `treatments` data:

350 patients participated in this clinical trial. None of the patients were using **Novodra** (a popular injectable insulin) or **Auralin** (the oral insulin being researched) as their primary source of insulin before.  All were experiencing elevated HbA1c levels.
> HbA1c stands for Hemoglobin A1c. The [HbA1c test](https://depts.washington.edu/uwcoe/healthtopics/diabetes.html) measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks:
- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

`treatments` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial
- **auralin**: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash `-`) *and* the ending median daily dose of insulin at the end of the 24 weeks of treatment measured over the 24th week of treatment (the number after the dash `-`). Both are measured in units (shortform 'u'), which is the [international unit](https://en.wikipedia.org/wiki/International_unit) of measurement and the standard measurement for insulin.
- **novodra**: same as above, except for patients that continued treatment with Novodra
- **hba1c_start**: the patient's HbA1c level at the beginning of the first week of treatment.
- **hba1c_end**: the patient's HbA1c level at the end of the last week of treatment.
- **hba1c_change**: the change in the patient's HbA1c level from the start of treatment to the end, i.e., `hba1c_start - hba1c_end`.
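
To make the column descriptions concrete, here is a small sketch of how one dose entry and one `hba1c_change` value could be read. It assumes a dose cell looks like `41u - 48u`; the exact spacing is an assumption, and the helper name `split_dose` and all numbers are hypothetical rather than taken from `treatments`.

```python
import re

# Assumed cell format: "<start>u - <end>u", with optional whitespace
DOSE_PATTERN = re.compile(r"^\s*(\d+)\s*u\s*-\s*(\d+)\s*u\s*$")


def split_dose(cell):
    """Split a 'start - end' dose string into (start_dose, end_dose).

    Returns None when the cell does not match the assumed format.
    """
    match = DOSE_PATTERN.match(cell)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(split_dose("41u - 48u"))  # (41, 48)
print(split_dose("-"))          # None

# hba1c_change is defined as hba1c_start - hba1c_end (illustrative values)
hba1c_start, hba1c_end = 7.63, 7.20
hba1c_change = round(hba1c_start - hba1c_end, 2)
print(hba1c_change)  # 0.43
```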

### 2.1 Display data

Display the first 10 rows of the `treatments` data, and visually inspect the data. Identify three data quality issues corresponding to **completeness** and **consistency** in the dataset.

Hints: 
- For the first completeness issue, recall that each treatment arm (Auralin and Novodra) had 175 patients. How many records should we have in the dataset, and how many rows does the treatments data currently contain?
- For the second completeness issue, take a look at the `hba1c_change` variable. Is there anything noticeable about the values for this column? What could we do to mitigate this during cleaning?
- For the consistency issue, take a look at the `given_name` and `surname` variables. How are these different from what we saw in the `patients` dataframe?
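
The completeness checks in the hints come down to two comparisons: expected versus actual row count, and missing values per column. Below is a minimal sketch on a made-up toy table, so it does not reveal anything about the real `treatments` data. The frame `toy` and the expected count of 4 are hypothetical.

```python
import pandas as pd

# Toy stand-in for a treatments-like table (all values are made up)
toy = pd.DataFrame({
    "given_name": ["ana", "ben", "cy"],
    "hba1c_start": [7.6, 7.9, 8.1],
    "hba1c_end": [7.2, 7.5, 7.8],
    "hba1c_change": [0.4, None, 0.3],
})

expected_rows = 4                 # hypothetical expected number of records
actual_rows = toy.shape[0]        # number of rows actually present
missing_rows = expected_rows - actual_rows

# Count missing (NaN) entries in each column
missing_per_column = toy.isna().sum()

print(f"rows: {actual_rows} of {expected_rows} expected, {missing_rows} missing")
print(missing_per_column)
```

The same two checks can be run on `treatments` once you have worked out the expected number of records from the trial description.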

In [None]:
#FILL IN - inspect the first 10 rows of the treatments data


In [None]:
#FILL IN - calculate the number of rows in your data

In [None]:
#FILL IN - More code if needed

### 2.2 Identify issues

FILL IN the identified issues. For each issue, denote the row and column where the issue takes place:

- Issue 1 (Completeness): Row ..., column ....  The issue is...
- Issue 2 (Completeness): Row ..., column ....  ...
- Issue 3 (Consistency): Row ..., column ....  ...

### 2.3 Data structural issue 

Identify one data structural issue in the dataset - take a look at the `auralin` and `novodra` columns. Why are these variables causing a messy dataset (what rule of tidiness do they violate?), and how should they be structured instead?

*FILL IN*

- Issue (Structural): ...