## Understand data assessment and data cleaning process
This (fake) clinical trial dataset has three tables: patients, treatments, and adverse_reactions.

In [1]:
import numpy as np
import pandas as pd

In [2]:
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')

### Visual Assessment

In [97]:
# Display the patients table
patients

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7


In this table, the `bmi` column is the Body Mass Index (BMI) of each patient.

BMI is a simple calculation using a person's height and weight. The formula is `BMI = kg/m^2`, where `kg` is the person's weight in kilograms and `m^2` is their height in metres squared. A BMI of 25.0 or more is considered overweight, while the healthy range is **18.5** to **24.9**.

The inclusion criterion for this clinical trial is 16 <= BMI <= 38.

Other columns worth noting:
* weight: the weight of each patient in pounds (lbs)
* height: the height of each patient in inches (in)
* birthdate: the date of birth of each patient (month/day/year). The inclusion criterion for this clinical trial is age >= 18 (there is no maximum age because diabetes is a growing problem among the elderly population)
* assigned_sex: the assigned sex of each patient at birth (male or female)
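
As a rough sketch of how these definitions fit together, the snippet below recomputes BMI from pounds and inches and checks the two inclusion criteria. The helper name `bmi_from_imperial`, the reference date, and the derived column names are illustrative assumptions rather than part of the original notebook; the two sample rows are copied from the table above.

```python
import pandas as pd

LB_TO_KG = 0.45359237   # one pound in kilograms
IN_TO_M = 0.0254        # one inch in metres

def bmi_from_imperial(weight_lbs, height_in):
    """BMI = kg / m^2, computed from pounds and inches (hypothetical helper)."""
    return (weight_lbs * LB_TO_KG) / (height_in * IN_TO_M) ** 2

# Two rows copied from the patients table shown above
sample = pd.DataFrame({
    'surname': ['Wellish', 'Neudorf'],
    'birthdate': ['7/10/1976', '2/18/1928'],
    'weight': [121.7, 192.3],   # lbs
    'height': [66, 27],         # in
    'bmi': [19.6, 26.1],
})

# Recompute BMI and compare it with the recorded value
sample['bmi_recomputed'] = bmi_from_imperial(sample.weight, sample.height).round(1)
sample['bmi_matches'] = (sample.bmi_recomputed - sample.bmi).abs() < 0.1

# Inclusion criterion: 16 <= BMI <= 38 (between() is inclusive by default)
sample['bmi_in_range'] = sample.bmi.between(16, 38)

# Inclusion criterion: age >= 18 at an assumed reference date
ref = pd.Timestamp('2020-01-01')
birth = pd.to_datetime(sample.birthdate, format='%m/%d/%Y')
# Subtract one year when the birthday has not yet occurred in the reference year
before_birthday = (birth.dt.month > ref.month) | (
    (birth.dt.month == ref.month) & (birth.dt.day > ref.day))
sample['age'] = ref.year - birth.dt.year - before_birthday.astype(int)
sample['adult'] = sample.age >= 18

print(sample[['surname', 'bmi', 'bmi_recomputed', 'bmi_matches', 'age', 'adult']])
```

For Tim Neudorf the recomputed BMI does not match the recorded one, which hints that either the weight or the height was entered incorrectly.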

In [98]:
treatments

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.20,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
...,...,...,...,...,...,...,...
275,albina,zetticci,45u - 51u,-,7.93,7.73,0.20
276,john,teichelmann,-,49u - 49u,7.90,7.58,
277,mathea,lillebø,23u - 36u,-,9.04,8.67,0.37
278,vallie,prince,31u - 38u,-,7.64,7.28,0.36


350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before. All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn't enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:

* 175 patients switched to Auralin for 24 weeks
* 175 patients continued using Novodra for 24 weeks

Some `treatments` columns:

* auralin: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) and the ending median daily dose of insulin at the end of the 24 weeks of treatment measured over the 24th week of treatment (the number after the dash). Both are measured in units (shortform 'u'), which is the international unit of measurement and the standard measurement for insulin.
* novodra: same as above, except for patients that continued treatment with Novodra
* hba1c_start: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The HbA1c test measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
* hba1c_end: the patient's HbA1c level at the end of the last week of treatment
* hba1c_change: the change in the patient's HbA1c level from the start of treatment to the end, i.e., hba1c_start - hba1c_end. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).
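
Because `hba1c_change` is defined as `hba1c_start - hba1c_end`, the recorded values can be checked against that definition. The following is a minimal sketch using three rows copied from the table above; the `missing` and `mismatch` column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rows copied from the treatments table shown above (blank cells are NaN)
sample = pd.DataFrame({
    'surname': ['jindrová', 'richardson', 'gormanston'],
    'hba1c_start': [7.63, 7.56, 7.97],
    'hba1c_end': [7.20, 7.09, 7.62],
    'hba1c_change': [np.nan, 0.97, 0.35],
})

# Expected change according to the column definition, rounded to 2 decimals
expected = (sample.hba1c_start - sample.hba1c_end).round(2)

# Flag cells that are missing, and cells that are present but disagree
sample['missing'] = sample.hba1c_change.isnull()
sample['mismatch'] = ~sample.missing & ~np.isclose(sample.hba1c_change, expected)

print(sample)
```

In this sample, the first row is missing its change value and the second row records 0.97 where the definition gives 0.47.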

In [99]:
adverse_reactions

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation
5,jasmine,sykes,hypoglycemia
6,louise,johnson,hypoglycemia
7,albinca,komavec,hypoglycemia
8,noe,aranda,hypoglycemia
9,sofia,hermansen,injection site discomfort


Additional information:

* Insulin resistance varies person to person, which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
* It is important to test drugs and medical products in the people they are meant to help. People of different age, race, sex, and ethnic group must be included in clinical trials. This diversity is reflected in the patients table.
* Ensuring column names are descriptive enough is an important step in acquainting yourself with the data. 'Descriptive enough' is subjective. Ideally you want short column names (so they are easier to type and read in code form) that are also fully descriptive. Length vs. descriptiveness is a tradeoff and a common debate (a similar debate exists for variable names). For example, the auralin and novodra column names are probably not descriptive enough.

### Programmatic Assessment
Programmatic assessment means using code to do anything other than looking through the data in its entirety. In pandas, this means using functions and methods to reveal the data's quality and tidiness.

pandas offers many functions and methods that are useful here. A lot of assessing is driven by the problems you want to solve: we can check the values in the columns and rows that we plan to use in our analysis. Non-directed programmatic assessment is often useful too. This means trying out programmatic assessments more or less at random, without any specific goal in mind.


In [100]:
# .info() (DataFrame only)
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [101]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [3]:
# .sample() (DataFrame and Series)
# Non-directed assessment using sample()
# Can give us a clue about what we might need to clean
patients.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
202,203,female,Jiřina,Šubrtová,4262 Heron Way,Portland,OR,97204.0,United States,JirinaSubrtova@rhyta.com503-820-7877,12/10/1987,138.4,61,26.1
485,486,male,Trifon,Izmailov,3697 Drainer Avenue,Fort Walton Beach,FL,32548.0,United States,TrifonIzmailov@fleckens.hu1 850 659 0417,2/15/1973,255.9,74,32.9
436,437,male,Sun,Ko,1962 Cabell Avenue,Washington,VA,20008.0,United States,703-547-0551SunKo@einrot.com,7/8/1969,154.4,72,20.9
74,75,female,Hanka,Gegič,192 Patton Lane,Tulsa,OK,74106.0,United States,918-975-7594HankaGegic@fleckens.hu,1/20/1926,103.2,61,19.5
410,411,male,Nathan,Cumpston,1203 Benson Park Drive,Wayne,OK,73095.0,United States,NathanCumpston@rhyta.com1 405 449 7960,10/6/1965,178.0,67,27.9


Some functions and methods that are useful for programmatic assessment:
- .info()
- .sample()
- .head()
- .tail()
- .describe()
- .value_counts()
- Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc)
- .duplicated()
- .isnull()
- sum()
- .sort_values()

The following are some examples of programmatic assessment.

In [103]:
patients.surname.value_counts()

Doe          6
Jakobsen     3
Taylor       3
Schiavone    2
Lâm          2
            ..
Quynh        1
Yudina       1
Ekwueme      1
Montagu      1
Ruais        1
Name: surname, Length: 466, dtype: int64

In [104]:
patients.address.value_counts()

123 Main Street             6
2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
350 Ross Street             1
                           ..
4649 Worley Avenue          1
1619 Melm Street            1
4646 Highland View Drive    1
4148 Callison Lane          1
2945 Ferguson Street        1
Name: address, Length: 483, dtype: int64

In [105]:
# Show rows whose address repeats the address of an earlier row
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
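
Note that `duplicated()` treats missing values as equal to each other and, by default, marks only the second and later occurrences. That is why rows with no address show up in the output above. The sketch below, on a small frame built from values seen in that output, shows one possible way to list only the repeated non-null addresses; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Addresses taken from the outputs above; NaN stands for a missing address
sample = pd.DataFrame({
    'surname': ['Doe', 'Doe', 'Quynh', 'Knudsen', 'Wellish'],
    'address': ['123 Main Street', '123 Main Street', np.nan, np.nan,
                '576 Brown Bear Drive'],
})

# keep=False marks every member of a duplicated group, not just the repeats;
# notnull() filters out rows that only "match" because the address is missing
repeated = sample.address.duplicated(keep=False) & sample.address.notnull()
dupes = sample[repeated]

print(dupes)
```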


In [106]:
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

In [107]:
# Check whether Zaitseva's weight was recorded in kg: convert it to lbs and recompute the BMI
weight_lbs = patients[patients.surname == 'Zaitseva'].weight * 2.20462
height_in = patients[patients.surname == 'Zaitseva'].height
bmi_check = 703 * weight_lbs / (height_in * height_in)
bmi_check

210    19.055827
dtype: float64

In [108]:
# The recomputed value matches the recorded bmi
patients[patients.surname == 'Zaitseva'].bmi

210    19.1
Name: bmi, dtype: float64

In [109]:
sum(treatments.auralin.isnull())

0

In [110]:
sum(treatments.novodra.isnull())

0

In [111]:
# Find column names that appear in more than one table
# Repeated columns suggest that tables may need to be combined to make the data tidy
all_columns = pd.Series(list(patients) + list(treatments) + list(adverse_reactions))
all_columns[all_columns.duplicated()]

14    given_name
15       surname
21    given_name
22       surname
dtype: object

### The data quality issues
Data quality dimensions can help guide your thinking while assessing. The four main data quality dimensions are:

- **Completeness**: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
- **Validity**: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
- **Accuracy**: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
- **Consistency**: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.
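
To make the four dimensions more concrete, here is a sketch with one possible check per dimension. The rows are copied from the `patients` outputs shown earlier; the plausible height range (48 to 90 in) and the BMI tolerance (0.5) are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Rows copied from the patients outputs shown earlier; NaN marks missing cells
sample = pd.DataFrame({
    'surname': ['Wellish', 'Phan', 'Neudorf', 'Quynh'],
    'state': ['California', 'NJ', 'AL', np.nan],
    'weight': [121.7, 220.9, 192.3, 237.8],
    'height': [66, 70, 27, 69],
    'bmi': [19.6, 31.7, 26.1, 35.1],
})

# Completeness: how many cells are missing per column?
missing_per_column = sample.isnull().sum()

# Validity: heights outside an assumed plausible adult range (in inches)
invalid_height = ~sample.height.between(48, 90)

# Accuracy: recorded BMI disagrees with BMI recomputed from weight and height,
# so at least one of the three values is probably wrong
bmi_recomputed = 703 * sample.weight / sample.height ** 2
inaccurate_bmi = (bmi_recomputed - sample.bmi).abs() > 0.5

# Consistency: full state names mixed with two-letter abbreviations
full_state_name = sample.state.str.len() > 2

print(missing_per_column['state'], invalid_height.tolist(),
      inaccurate_bmi.tolist(), full_state_name.tolist())
```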



#### To follow the best practice of data wrangling, the data issues are documented so that someone else can reproduce the result. We document them here, just above the Clean section

### The data tidiness issues

The data is tidy if:
1. Every column is a variable
2. Every row is an observation
3. Every cell is a single value.

The five most common problems with messy datasets:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.

*by Hadley Wickham*

The issues we found are documented here, just above the Clean section.


#### Quality Issues

##### `patients` table
- Zip code sometimes has four digits instead of five
- Tim Neudorf's height is 27 in instead of 72 in (the likely typo was found by comparing bmi and weight)
- State is sometimes a full name and sometimes an abbreviation
- Misspelled given name for patient_id 9 (Dsvid Gustafsson)
- Missing demographic information (address through contact columns)
- Erroneous datatypes (assigned_sex, state, zip_code, and birthdate columns)
- Multiple phone number formats
- Duplicated default John Doe data
- Multiple records for Jakobsen, Gersten, Taylor
- Zaitseva's weight is in kg instead of lbs (verified by converting the value to lbs and recalculating the bmi)


##### `treatments` table 
- Missing hba1c_change values
- The letter u in starting and ending doses for Auralin and Novodra
- Lowercase given names and surnames
- Missing records (280 instead of 350)
- Erroneous datatypes (auralin and novodra columns)
- Inaccurate HbA1c changes (4s mistaken as 9s)
- Nulls represented as dashes (-) in auralin and novodra columns

##### `adverse_reactions` table
- lowercase given names and surnames

#### Tidiness Issues
- The contact column in the `patients` table holds two variables: a phone number and an email
- Three variables (treatment, start dose, and end dose) are stored in two columns, auralin and novodra, in the `treatments` table; the column headers are values of what should be a `treatment` column, not variable names
- There should be only two observational units (patients and treatments) in this dataset rather than three (one per table). This means adverse reactions should be part of the `treatments` table
- Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables 
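
As an illustration of the auralin/novodra tidiness issue, the sketch below reshapes two rows copied from the `treatments` table so that each variable gets its own column. The new column names (`treatment`, `dose_start`, `dose_end`) are assumptions for this example, and this is only one of several ways to do the reshaping.

```python
import pandas as pd

# Rows copied from the treatments table shown earlier
sample = pd.DataFrame({
    'given_name': ['veronika', 'elliot'],
    'surname': ['jindrová', 'richardson'],
    'auralin': ['41u - 48u', '-'],
    'novodra': ['-', '40u - 45u'],
})

# Column headers become values of a new `treatment` column
tidy = pd.melt(sample, id_vars=['given_name', 'surname'],
               value_vars=['auralin', 'novodra'],
               var_name='treatment', value_name='dose')

# Drop the placeholder rows for the drug the patient did not take
tidy = tidy[tidy.dose != '-'].reset_index(drop=True)

# Split strings such as "41u - 48u" into numeric start and end doses
doses = tidy.dose.str.extract(r'(\d+)u\s*-\s*(\d+)u').astype(int)
tidy['dose_start'] = doses[0]
tidy['dose_end'] = doses[1]
tidy = tidy.drop(columns='dose')

print(tidy)
```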


#### Up to this point, we have done our first-pass assessment. Next, we clean the data, using the documented issues as our guidelines.

### Clean

We first address missing data, then tackle the tidiness issues, and finally clean up the quality issues. The very first thing to do before any cleaning occurs is to make a copy of each piece of data.
Programmatic data cleaning requires three steps: define, code, and test.
- Define: Convert each assessment into a how-to guide, like pseudocode, which also serves as documentation
- Code: Translate the pseudocode into code and run it
- Test: Verify, usually with code, that the result is valid
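
The three steps can be sketched on a small example. The snippet below applies them to the "zip code has four digits" issue, using zip_code values copied from the `patients` table plus one missing value; the frame name `sample_clean` is hypothetical and this is not necessarily how the issue is cleaned later on.

```python
import numpy as np
import pandas as pd

# zip_code values copied from the patients table shown earlier, plus one missing value
sample_clean = pd.DataFrame({'zip_code': [92390.0, 7095.0, 3852.0, np.nan]})

# Define: convert zip_code from float to string, drop the trailing '.0',
# pad with leading zeros to five characters, and keep missing values as NaN.

# Code
zip_str = sample_clean.zip_code.astype(str).str.split('.').str[0].str.zfill(5)
sample_clean['zip_code'] = zip_str.where(sample_clean.zip_code.notnull())

# Test: every non-missing zip code now has exactly five characters
assert sample_clean.zip_code.dropna().str.len().eq(5).all()
print(sample_clean)
```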

In this step we revisit the assessment and restate each issue before its cleaning process.

We deal with the most severe problem, the missing data, first; then we make the data tidy; finally we clean up the remaining data content problems.

In [4]:
# First of all, make a copy of data frames
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()

### Missing Data

#### `treatments`: Missing records (280 instead of 350)

##### Define

Import the cut treatments from the treatments_cut.csv file, and concatenate it with the treatments_clean DataFrame

##### Code

In [274]:
treatments_cut = pd.read_csv('treatments_cut.csv')

In [275]:
treatments_clean = pd.concat([treatments_clean, treatments_cut], ignore_index=True)

##### Test

In [115]:
treatments_clean.shape

(350, 7)
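
Checking the shape confirms the row count. A slightly stricter test might also check the index and look for repeated records. The sketch below shows the idea on two small frames built from rows of the `treatments` table shown earlier; `part_a` and `part_b` are stand-ins for illustration, not the actual CSV contents.

```python
import pandas as pd

# Two pieces built from rows shown in the treatments table above;
# each has its own index starting at 0, as separately loaded CSV files would
part_a = pd.DataFrame({'surname': ['jindrová', 'richardson'],
                       'hba1c_start': [7.63, 7.56]})
part_b = pd.DataFrame({'surname': ['zetticci', 'teichelmann'],
                       'hba1c_start': [7.93, 7.90]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping 0, 1, 0, 1
combined = pd.concat([part_a, part_b], ignore_index=True)

# Test: row counts add up, index labels are unique, no record appears twice
assert len(combined) == len(part_a) + len(part_b)
assert combined.index.is_unique
assert not combined.duplicated().any()
```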

#### `treatments`: Missing HbA1c changes and Inaccurate HbA1c changes (leading 4s mistaken as 9s)

##### Define

Recalculate every hba1c_change value by subtracting hba1c_end from hba1c_start

##### Code

In [276]:
# Replace all the cells with the difference of hba1c_start and hba1c_end 
treatments_clean.hba1c_change = treatments_clean.hba1c_start - treatments_clean.hba1c_end

##### Test

In [117]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
