# The Final Step: Cleaning data
- This is where the quality and tidiness issues identified in the assessing step are remedied.

- While data can be cleaned manually using spreadsheet programs and text editors, the best way to clean data is with code. This requires three steps:
    1. Define - define how you will clean the issue in words
    2. Code - convert your definitions into executable code
    3. Test - test your data to ensure your code was implemented correctly

- We'll take the assessments from the last lesson and define, code, and test cleaning operations for each.

## Manual vs Programmatic Cleaning
- **Manual Data Cleaning** includes:
    1. Retyping incorrect data.
    2. Copying and pasting columns and rows

- __Programmatic Data Cleaning__ uses code to:
    1. Automate cleaning tasks
    2. Minimize repetition
    3. Save time

## Data Cleaning Process
- Before any cleaning occurs, make a copy of each piece of data. All cleaning operations are then performed on this copy, so you can still view the original dirty and/or messy dataset later.
- Copying DataFrames in pandas is done using the `copy` method.
- Simply assigning a DataFrame to a new variable name does not create a copy: both names refer to the same object, so the original DataFrame stays vulnerable to modifications.
- [Why should I make a copy of a data frame in pandas](https://stackoverflow.com/questions/27673231/why-should-i-make-a-copy-of-a-data-frame-in-pandas)
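
The difference between assignment and `copy` can be seen in a minimal sketch. The DataFrame and its values below are made up for illustration and are not taken from the clinical trial files.

```python
import pandas as pd

# Hypothetical toy data, not part of the clinical trial dataset
original = pd.DataFrame({"height": [66, 27, 71]})

alias = original              # a second name for the same object
safe_copy = original.copy()   # an independent copy (deep by default)

# Editing the copy leaves the original untouched
safe_copy.loc[1, "height"] = 72
height_after_copy_edit = original.loc[1, "height"]

# Editing through the alias changes the original as well
alias.loc[0, "height"] = 99
height_after_alias_edit = original.loc[0, "height"]
```

After running this, `height_after_copy_edit` is still the dirty value 27, while `height_after_alias_edit` shows the modified value, which is why the cleaning cells later in this notebook start from `.copy()`.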

### The Cleaning Process
- Programmatic data cleaning is a separate step within data wrangling. It has three steps:
    1. Defining
    2. Coding
    3. Testing

- __Define__ - Write a data cleaning plan by converting your assessments into cleaning tasks, each written as a short how-to guide. This plan also serves as documentation so that your work can be reproduced.

- __Code__ - Translate these words into code and actually run it.

- __Test__ - Test your dataset often, using code, to make sure your cleaning code worked. This is like revisiting the assess step.
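
As a rough illustration of one define-code-test cycle, the sketch below uses a two-row stand-in for the `patients` table. The names `patients_demo` and `patients_demo_clean` are hypothetical, and using `assert` for the test step is one option among several (visual inspection, as used later in this notebook, is another).

```python
import pandas as pd

# Hypothetical stand-in for the patients table
patients_demo = pd.DataFrame({"surname": ["Neudorf", "Wellish"], "height": [27, 66]})
patients_demo_clean = patients_demo.copy()

# Define: replace the height of 27 in (digits swapped) with 72 in.

# Code
patients_demo_clean["height"] = patients_demo_clean["height"].replace(27, 72)

# Test: fail loudly if the issue is still present
assert (patients_demo_clean["height"] != 27).all()
# The original, uncleaned data is still available for comparison
assert patients_demo["height"].iloc[0] == 27
```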

### Cleaning Sequences
- There are multiple ways of sequencing your steps in the data cleaning process:
    1. Use the __Define__, __Code__ and __Test__ headers once each, with multiple definitions, cleaning operations and tests under the respective header
    2. Use multiple __Define__, __Code__, and __Test__ headers, one set for each data quality and tidiness issue. Effectively, you define, then code, then test each issue immediately

## Addressing Missing Data First
- When cleaning for __data quality__, it is usually best to deal with completeness issues first. For missing data, the options are:
    1. Concatenate
    2. Join
    3. Impute, if possible
- It is important to do this upfront so that subsequent data cleaning will not have to be repeated.
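
The three options can be sketched on toy data as follows. All frames and values here are hypothetical, and mean imputation is used only as an example of an imputation strategy, not as a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables
part_a = pd.DataFrame({"patient_id": [1, 2], "weight": [121.7, np.nan]})
part_b = pd.DataFrame({"patient_id": [3], "weight": [177.8]})
heights = pd.DataFrame({"patient_id": [1, 2, 3], "height": [66, 66, 71]})

# 1. Concatenate: append records that live in another file or table
combined = pd.concat([part_a, part_b], ignore_index=True)

# 2. Join: bring in columns stored elsewhere, matched on a key
combined = combined.merge(heights, on="patient_id", how="left")

# 3. Impute: fill what is still missing (the column mean, as an example)
combined["weight"] = combined["weight"].fillna(combined["weight"].mean())
```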

## Clinical Trial Dataset
- In the dataset, three completeness issues were identified:

1. __treatments table__
    - missing HbA1c changes
    - missing records

2. __patients table__
    - missing demographic information (address - contact columns)

- Nothing can be done about the missing demographic information because there is no way of accessing that info until those patients come back.

# Missing Data

### Gather

In [173]:
import pandas as pd
import numpy as np

In [174]:
patients = pd.read_csv('../2-assessing/datasets/patients.csv')
treatments = pd.read_csv('../2-assessing/datasets/treatments.csv')
adverse_reactions = pd.read_csv('../2-assessing/datasets/adverse_reactions.csv')

### Assess

In [175]:
patients

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7


In [176]:
treatments

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.20,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
...,...,...,...,...,...,...,...
275,albina,zetticci,45u - 51u,-,7.93,7.73,0.20
276,john,teichelmann,-,49u - 49u,7.90,7.58,
277,mathea,lillebø,23u - 36u,-,9.04,8.67,0.37
278,vallie,prince,31u - 38u,-,7.64,7.28,0.36


In [177]:
adverse_reactions

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation
5,jasmine,sykes,hypoglycemia
6,louise,johnson,hypoglycemia
7,albinca,komavec,hypoglycemia
8,noe,aranda,hypoglycemia
9,sofia,hermansen,injection site discomfort


In [178]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [179]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [180]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 944.0+ bytes


In [181]:
all_columns = pd.Series(list(patients) + list(treatments) + list(adverse_reactions))
all_columns[all_columns.duplicated()]

14    given_name
15       surname
21    given_name
22       surname
dtype: object

In [182]:
list(patients)

['patient_id',
 'assigned_sex',
 'given_name',
 'surname',
 'address',
 'city',
 'state',
 'zip_code',
 'country',
 'contact',
 'birthdate',
 'weight',
 'height',
 'bmi']

In [183]:
patients[patients.address.isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2
264,265,female,Wafiyyah,Asfour,,,,,,,11/3/1989,158.6,63,28.1
269,270,female,Flavia,Fiorentino,,,,,,,10/9/1937,175.2,61,33.1
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4


In [184]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [185]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [186]:
patients.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
403,404,male,Robert,Maslov,2356 Myra Street,Providence,RI,2908.0,United States,RobertMaslov@fleckens.hu401-535-2675,7/2/1932,219.1,65,36.5
419,420,female,Maret,Sultygov,2127 Elk City Road,Indianapolis,IN,46268.0,United States,317-956-6166MaretSultygov@teleworm.us,10/20/1969,126.1,63,22.3
305,306,female,Addolorata,Lombardi,550 Cliffside Drive,Binghamton,New York,13901.0,United States,AddolorataLombardi@jourrapide.com+1 (607) 348-...,10/19/1962,189.0,65,31.4
275,276,male,Eddie,Archer,2043 Jadewood Drive,Lombard,Illinois,60148.0,United States,EddieAArcher@gustr.com+1 (224) 305-6805,7/17/1982,158.6,69,23.4
218,219,female,Sabr,Amari,4929 Raver Croft Drive,La Follette,TN,37766.0,United States,423-563-2014SabrRumaithahAmari@fleckens.hu,11/19/1936,122.2,64,21.0


In [187]:
patients.surname.value_counts()

Doe            6
Jakobsen       3
Taylor         3
Ogochukwu      2
Tucker         2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: surname, Length: 466, dtype: int64

In [188]:
patients.address.value_counts()

123 Main Street             6
2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
3094 Oral Lake Road         1
                           ..
1066 Goosetown Drive        1
4291 Patton Lane            1
4643 Reeves Street          1
174 Lost Creek Road         1
3652 Boone Crockett Lane    1
Name: address, Length: 483, dtype: int64

In [189]:
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [190]:
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

In [191]:
weight_lbs = patients[patients.surname == 'Zaitseva'].weight * 2.20462
height_in = patients[patients.surname == 'Zaitseva'].height
bmi_check = 703 * weight_lbs / (height_in * height_in)
bmi_check

210    19.055827
dtype: float64

In [192]:
patients[patients.surname == 'Zaitseva'].bmi

210    19.1
Name: bmi, dtype: float64

In [193]:
sum(treatments.auralin.isnull())

0

In [194]:
sum(treatments.novodra.isnull())

0

#### Quality
##### `patients` table
- Zip code is a float not a string
- Zip code has four digits sometimes
- Tim Neudorf height is 27 in instead of 72 in
- Full state names sometimes, abbreviations other times
- Dsvid Gustafsson
- Missing demographic information (address - contact columns) ***(can't clean)***
- Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns)
- Multiple phone number formats
- Default John Doe data
- Multiple records for Jakobsen, Gersten, Taylor
- kgs instead of lbs for Zaitseva weight

##### `treatments` table
- Missing HbA1c changes
- The letter 'u' in starting and ending doses for Auralin and Novodra
- Lowercase given names and surnames
- Missing records (280 instead of 350)
- Erroneous datatypes (auralin and novodra columns)
- Inaccurate HbA1c changes (leading 4s mistaken as 9s)
- Nulls represented as dashes (-) in auralin and novodra columns

##### `adverse_reactions` table
- Lowercase given names and surnames

#### Tidiness
- Contact column in `patients` table should be split into phone number and email
- Three variables in two columns in `treatments` table (treatment, start dose and end dose)
- Adverse reaction should be part of the `treatments` table
- Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables

### Clean

In [195]:
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()

# Missing Data

### 1. `treatments`: Missing records (280 instead of 350)

##### Define
Import the cut treatments into a DataFrame and concatenate it with the original treatments DataFrame.

##### Code

In [196]:
treatments_cut = pd.read_csv('treatments_cut.csv')
treatments_clean = pd.concat([treatments_clean, treatments_cut], ignore_index=True)

##### Test

In [197]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [198]:
treatments_clean.tail()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
345,rovzan,kishiev,32u - 37u,-,7.75,7.41,0.34
346,jakob,jakobsen,-,28u - 26u,7.96,7.51,0.95
347,bernd,schneider,48u - 56u,-,7.74,7.44,0.3
348,berta,napolitani,-,42u - 44u,7.68,7.21,
349,armina,sauvé,36u - 46u,-,7.86,7.4,


### 2. `treatments`: Missing HbA1c changes and inaccurate HbA1c changes (leading 4s mistaken as 9s)

##### Define

Recalculate the `hba1c_change` column: `hba1c_start - hba1c_end`

##### Code

In [199]:
treatments_clean.hba1c_change = (treatments_clean.hba1c_start - treatments_clean.hba1c_end)

##### Test

In [200]:
treatments.query('hba1c_change == hba1c_change.max()')

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
32,laura,ehrlichmann,-,43u - 40u,7.95,7.46,0.99
138,giovana,rocha,-,23u - 21u,7.87,7.38,0.99
245,wu,sung,-,47u - 48u,7.61,7.12,0.99


In [201]:
# treatments_clean.iloc[[32, 138, 245]].hba1c_change
treatments_clean.iloc[[32, 138, 245]]

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
32,laura,ehrlichmann,-,43u - 40u,7.95,7.46,0.49
138,giovana,rocha,-,23u - 21u,7.87,7.38,0.49
245,wu,sung,-,47u - 48u,7.61,7.12,0.49


In [202]:
treatments_clean.hba1c_change.head()

0    0.43
1    0.47
2    0.43
3    0.35
4    0.32
Name: hba1c_change, dtype: float64

## Cleaning for Tidiness
#### Address Tidiness After Missing Data and Before Content Issues
- After addressing missing data, cleaning for tidiness is usually the next logical step.
- In his paper _Tidy Data_, statistician Hadley Wickham, the pioneer of tidy data, makes these key points:
    * Tidy datasets are easy to manipulate
    * Tidy datasets with data quality issues are almost always easier to clean than untidy datasets with the same issues

- That means it is generally easier to clean the tidiness issues (the structural issues) first, then clean the quality issues (the content issues).

### Clinical Trial Dataset
- In the oral insulin clinical trial dataset, we found four aspects of the dataset that were not tidy:

##### Tidiness
1. _contact column in `patients` table should be split into phone number and email_
2. _Three variables in two columns in `treatments` table (treatment, start dose and end dose)_
3. _Adverse reaction should be part of the `treatments` table_
4. _given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables_

# Tidiness

### 1. contact column in `patients` table should be split into phone number and email

##### Define

- Extract the phone number and email variables using regular expressions and pandas `str.extract` method
- Drop the contact column when done
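
To see how `str.extract` separates the two variables, the sketch below applies the same two patterns used in the code cell that follows to two contact strings copied from the `patients` output above. Passing `expand=False` is a choice made here so that each call returns a Series; the notebook cell uses `expand=True`.

```python
import pandas as pd

# Two contact values copied from the patients table shown earlier
contact = pd.Series([
    "951-719-9170ZoeWellish@superrito.com",
    "PamelaSHill@cuvox.de+1 (217) 569-3204",
])

# Raw strings keep the backslashes in the patterns intact
phone_pattern = r"((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})"
email_pattern = r"([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])"

# With a single capture group, expand=False returns a Series
phone = contact.str.extract(phone_pattern, expand=False)
email = contact.str.extract(email_pattern, expand=False)
```

The phone number and email come out correctly regardless of which one appears first in the contact string, because each pattern is searched for independently.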

##### Code

In [203]:
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [204]:
# Raw strings (r'...') avoid invalid-escape warnings for \+, \d, \s and \.
patients_clean['phone_number'] = patients_clean.contact.str.extract(r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=True)

# [a-zA-Z] at both ends of the pattern because emails in this dataset all start and end with letters
patients_clean['email'] = patients_clean.contact.str.extract(r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])', expand=True)

# Note: axis=1 denotes that we are referring to a column, not a row
patients_clean = patients_clean.drop('contact', axis=1)

##### Test

In [205]:
# confirm contact column is gone
list(patients_clean)

['patient_id',
 'assigned_sex',
 'given_name',
 'surname',
 'address',
 'city',
 'state',
 'zip_code',
 'country',
 'birthdate',
 'weight',
 'height',
 'bmi',
 'phone_number',
 'email']

In [206]:
patients_clean.phone_number.sample(25)

278                  NaN
310         913 322 9114
252         978-243-8596
387         561-826-5683
29     +1 (845) 858-7707
7           408 778 3236
386         408-215-6012
72          504-289-1386
176         508 857 0477
239         228-378-1355
311         601-389-7682
196         512-738-2609
233         504-546-5321
259         727-439-7150
496         209 762 2320
384    +1 (605) 440-5492
235         606-368-9825
171         815-533-7692
98     +1 (907) 328-4125
430         701-662-1983
48          312-719-7238
488         352-453-4601
486         254-681-4504
402         601-885-6550
71          860-515-0122
Name: phone_number, dtype: object

In [207]:
# Confirm that no emails start with an integer (regex didn't match for this)
patients_clean.email.sort_values().head()

404               AaliyahRice@dayrep.com
11          Abdul-NurMummarIsa@rhyta.com
332                AbelEfrem@fleckens.hu
258              AbelYonatan@teleworm.us
305    AddolorataLombardi@jourrapide.com
Name: email, dtype: object

### 2. Three variables in two columns in `treatments` table (treatment, start dose and end dose)

##### Define

- `melt` the auralin and novodra columns to a `treatment` and `dose` column (dose will still contain both start and end dose at this point)
- Then split the dose column on `'-'` to obtain `dose_start` and `dose_end` columns
- Drop the intermediate dose column
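
A minimal sketch of these three steps on a two-row toy frame (hypothetical, reduced to the columns that matter here). Stripping the whitespace left around the split pieces is an addition made in this sketch and is not part of the plan above.

```python
import pandas as pd

# Hypothetical two-row version of the treatments table
wide = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Melt: one row per (patient, treatment) pair
long = pd.melt(wide, id_vars=["given_name"], var_name="treatment", value_name="dose")

# Keep only the treatment each patient actually received
long = long[long["dose"] != "-"].copy()

# Split "41u - 48u" on the dash into two pieces and trim the leftover spaces
parts = long["dose"].str.split("-", n=1, expand=True)
long["dose_start"] = parts[0].str.strip()
long["dose_end"] = parts[1].str.strip()

# Drop the intermediate column
long = long.drop(columns="dose").reset_index(drop=True)
```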

##### Code

In [208]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [209]:
treatments_clean = pd.melt(treatments_clean, id_vars=['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'], var_name='treatment', value_name='dose')
treatments_clean.sample(8)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose
117,javier,moquin,8.0,7.59,0.41,auralin,-
375,benoît,bonami,9.82,9.4,0.42,novodra,44u - 43u
35,csaba,sági,7.88,7.48,0.4,auralin,-
439,stefanie,herman,7.72,7.3,0.42,novodra,-
399,jackson,addison,7.99,7.51,0.48,novodra,42u - 42u
1,elliot,richardson,7.56,7.09,0.47,auralin,-
106,sofia,karlsen,7.62,7.15,0.47,auralin,-
580,eric,ek,7.92,7.47,0.45,novodra,51u - 47u


In [210]:
# .copy() makes the filtered rows an independent DataFrame before new columns are added
treatments_clean = treatments_clean[treatments_clean.dose != '-'].copy()
# expand=True returns the two split pieces as separate columns
treatments_clean[['dose_start', 'dose_end']] = treatments_clean['dose'].str.split('-', n=1, expand=True)
treatments_clean = treatments_clean.drop('dose', axis=1)
treatments_clean.sample(10)


Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end
418,fatimah,kinfe,7.88,7.56,0.32,novodra,43u,42u
85,nilton,quintanilla,7.86,7.48,0.38,auralin,42u,49u
653,miłosław,wiśniewski,7.51,7.08,0.43,novodra,34u,33u
343,žarka,rap,7.54,7.15,0.39,auralin,35u,48u
607,mathilde,nørgaard,8.5,8.1,0.4,novodra,27u,28u
422,gabrielle,bidwill,7.76,7.37,0.39,novodra,44u,49u
327,regolo,nucci,7.53,7.02,0.51,auralin,51u,59u
222,veronica,bogolyubova,7.69,7.31,0.38,auralin,25u,35u
287,frydryk,adamski,7.75,7.27,0.48,auralin,63u,74u
295,dani,antoun,7.73,7.34,0.39,auralin,36u,44u


### 3. `adverse_reaction` should be part of the `treatments` table

##### Define

- Merge the adverse_reaction column to the `treatments` table joining on given name and surname
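
A minimal sketch of such a merge on toy rows. The frames `treatments_demo` and `reactions_demo` are hypothetical, and `how="left"` is an assumption, chosen because the sample output below keeps patients with no recorded adverse reaction.

```python
import pandas as pd

# Hypothetical toy versions of the two tables
treatments_demo = pd.DataFrame({
    "given_name": ["berta", "skye"],
    "surname": ["napolitani", "gormanston"],
    "treatment": ["novodra", "auralin"],
})
reactions_demo = pd.DataFrame({
    "given_name": ["berta"],
    "surname": ["napolitani"],
    "adverse_reaction": ["injection site discomfort"],
})

# how="left" keeps every treatment row; patients without a reaction get NaN
merged = pd.merge(treatments_demo, reactions_demo,
                  on=["given_name", "surname"], how="left")
```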

##### Code

In [212]:
treatments_clean.sample(10)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction
121,tomáš,navrátil,7.84,7.41,0.43,auralin,24u,36u,
271,valur,bjarkason,9.71,9.41,0.3,novodra,31u,36u,
8,enco,žibrik,7.78,7.34,0.44,auralin,55u,68u,
63,barbora,vesecká,7.9,7.46,0.44,auralin,29u,45u,
251,jesse,luoma,7.72,7.35,0.37,novodra,39u,37u,
215,noriyuki,sakai,7.58,7.16,0.42,novodra,32u,31u,
244,david,gustafsson,7.72,7.28,0.44,novodra,33u,34u,
40,furuta,osman,7.52,7.18,0.34,auralin,30u,41u,
90,samuel,blix,7.97,7.56,0.41,auralin,48u,55u,
349,berta,napolitani,7.68,7.21,0.47,novodra,42u,44u,injection site discomfort


### 4. Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables and lowercase given names and surnames

##### Define

- Adverse reactions table is no longer needed
- Isolate the patient ID and names in the `patients` table then convert these names to lowercase to join with the `treatments`
- Then drop the given name and surname columns in the treatments table (so this being lowercase isn't an issue anymore)

##### Code

In [213]:
# .copy() gives an independent DataFrame, so the assignments below don't raise SettingWithCopyWarning
id_names = patients_clean[['patient_id', 'given_name', 'surname']].copy()
id_names.given_name = id_names.given_name.str.lower()
id_names.surname = id_names.surname.str.lower()
treatments_clean = pd.merge(treatments_clean, id_names, on=['given_name', 'surname'])
treatments_clean = treatments_clean.drop(['given_name', 'surname'], axis=1)


##### Test

In [214]:
# confirm the merge was executed correctly
treatments_clean.head()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41u,48u,,225
1,7.97,7.62,0.35,auralin,33u,36u,,242
2,7.65,7.27,0.38,auralin,37u,42u,,345
3,7.89,7.55,0.34,auralin,31u,38u,,276
4,7.76,7.37,0.39,auralin,30u,36u,,15


In [215]:
# patient ID should be the only duplicated column
all_columns = pd.Series(list(patients_clean) + list(treatments_clean))
all_columns[all_columns.duplicated()]
# list(all_columns)

22    patient_id
dtype: object

## Cleaning for Quality

- Once the missing data and tidiness issues are cleaned, all that remains is cleaning the remaining data quality issues.

#### Quality
__patients table__ <br>

1. _zip code is a float not a string_
2. _zip code has four digits sometimes_
3. _Tim Neudorf height is 27 in instead of 72 in_
4. _full state names sometimes, abbreviations other times_
5. _Dsvid Gustafsson_
6. _Missing demographic information (address - contact columns)_
7. _Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns)_
8. _Multiple phone number formats_
9. _Default John Doe data_
10. _Multiple records for Jakobsen, Gersten, Taylor_
11. _kgs instead of lbs for Zaitseva weight_

__treatments table__ <br>

1. _missing HbA1c changes_
2. _the letter u in starting and ending doses for Auralin and Novodra_
3. _lowercase given names and surnames_
4. _missing records (280 instead of 350)_
5. _Erroneous datatypes (auralin and novodra columns)_
6. _Inaccurate HbA1c changes (4s mistaken as 9s)_
7. _Nulls represented as dashes (-) in auralin and novodra columns_

__adverse_reactions table__ <br>

1. _lowercase given names and surnames_

# Quality

### 1. Zip code is a float not a string and Zip code has four digits sometimes

##### Define

- Convert the zip code column's data type from float to string using `astype`, remove the trailing `.0` using string slicing, and pad four-digit zip codes with a leading zero
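
The chain can be traced step by step on three toy values (hypothetical, including one missing zip code). The intermediate values in the comments describe what the notebook's pandas version produces; newer pandas versions may keep the missing value as missing throughout, in which case the final `replace` is simply a no-op.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a five-digit zip, a four-digit zip, and a missing zip
zip_code = pd.Series([92390.0, 7095.0, np.nan])

as_text = zip_code.astype(str)             # e.g. '92390.0', '7095.0', 'nan'
trimmed = as_text.str[:-2]                 # e.g. '92390', '7095', 'n'
padded = trimmed.str.pad(5, fillchar="0")  # e.g. '92390', '07095', '0000n'

# A missing value that turned into the text '0000n' is converted back to NaN
cleaned = padded.replace("0000n", np.nan)
```

This is why the code below follows the conversion with a cell that replaces `'0000n'` with `np.nan`.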

In [216]:
patients_clean.sample(10)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
391,392,male,Daimy,Tromp,522 Lamberts Branch Road,Sunrise,FL,33323.0,United States,5/23/1990,198.0,75,24.7,786-970-4206,DaimyTromp@superrito.com
135,136,male,Willem-Jan,van der Lubbe,1717 Vineyard Drive,Cleveland,OH,44115.0,United States,7/9/1941,152.9,69,22.6,440-385-5011,Willem-JanvanderLubbe@gustr.com
62,63,female,Firenze,Fodor,1786 Gerald L. Bates Drive,Belmont,MA,2178.0,United States,4/1/1943,131.1,60,25.6,617-883-5967,FodorFirenze@dayrep.com
461,462,male,Cannan,Cabrera,2102 Geraldine Lane,New York,NY,10014.0,United States,10/12/1980,209.7,71,29.2,646-289-4177,CannanCabreraOrdonez@superrito.com
331,332,male,Leon,Scholz,3106 Evergreen Lane,Irvine,CA,92618.0,United States,9/14/1989,150.9,70,21.6,323 635 9919,LeonScholz@fleckens.hu
445,446,male,Maximus,Henzen,4334 Black Oak Hollow Road,San Jose,California,95113.0,United States,11/14/1924,180.8,72,24.5,408-792-9489,MaximusHenzen@einrot.com
234,235,female,Martina,Tománková,,,,,,4/7/1936,199.5,65,33.2,,
463,464,female,Bouke,Glaser,3006 Maple Court,Owensville,MO,65066.0,United States,10/21/1996,142.1,66,22.9,573 437 7334,BoukeGlaser@einrot.com
145,146,male,Sauli,Koivuniemi,1990 Spring Avenue,Eagleville,PA,19403.0,United States,7/24/1974,170.9,66,27.6,+1 (267) 679-4137,SauliKoivuniemi@einrot.com
13,14,female,Anenechi,Chidi,826 Broad Street,Birmingham,AL,35203.0,United States,3/7/1961,228.4,67,35.8,+1 (205) 417-8095,AnenechiChidi@armyspy.com


##### Code

In [217]:
patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')
patients_clean.sample(10)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
108,109,female,Marina,Glockner,475 Preston Street,Bushton,KS,67427,United States,6/18/1934,191.4,63,33.9,620 940 1131,MarinaGlockner@dayrep.com
481,482,male,Michael,Kristensen,1614 Heather Sees Way,Tulsa,OK,74116,United States,8/10/1930,154.7,65,25.7,918 706 2776,MichaelKristensen@gustr.com
23,24,male,Lovre,Galić,4941 Marion Drive,Winter Haven,Florida,33830,United States,5/26/1960,222.9,66,36.0,813 355 9476,LovreGalic@gustr.com
368,369,male,Corey,Nicholls,3427 Gerald L. Bates Drive,Boston,MA,2110,United States,4/25/1989,165.0,74,21.2,617 830 7216,CoreyNicholls@jourrapide.com
192,193,female,Jade,Parker,2957 Feathers Hooves Drive,Garden City,NY,11530,United States,6/19/1927,188.8,63,33.4,631-704-6487,JadeParker@superrito.com
270,271,female,Jowita,Wiśniewska,2168 Butternut Lane,Granite City,Illinois,62040,United States,11/8/1934,108.1,61,20.4,+1 (618) 512-3319,JowitaWisniewska@armyspy.com
40,41,male,Tješimir,Lukić,3636 Junior Avenue,Atlanta,GA,30303,United States,10/24/1941,147.8,70,21.2,404 547 4508,TjesimirLukic@jourrapide.com
275,276,male,Eddie,Archer,2043 Jadewood Drive,Lombard,Illinois,60148,United States,7/17/1982,158.6,69,23.4,+1 (224) 305-6805,EddieAArcher@gustr.com
165,166,male,Zlatko,Rukavina,592 Rafe Lane,Yazoo City,MS,39194,United States,12/12/1976,198.7,70,28.5,+1 (662) 716-9586,ZlatkoRukavina@cuvox.de
136,137,female,Victoria,Mikkelsen,2121 Liberty Avenue,Los Angeles,California,90017,United States,5/7/1925,179.3,63,31.8,714-507-4204,VictoriaTMikkelsen@armyspy.com


In [218]:
# Restore the NaN entries that were converted to '0000n' by the code above
patients_clean.zip_code = patients_clean.zip_code.replace('0000n', np.nan)

##### Test

In [219]:
patients_clean.zip_code.sample(5)

117    70112
176    01852
395    27055
333    46773
5      32114
Name: zip_code, dtype: object

### 2. Tim Neudorf height is 27 in instead of 72 in

##### Define
- Replace height for rows in the `patients` table that have a height of 27 in (there is only one) with 72 in

##### Code

In [220]:
patients_clean.height = patients_clean.height.replace(27, 72)

##### Test

In [221]:
# should be empty
patients_clean[patients_clean.height == 27]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email


In [222]:
# confirm the replacement worked
patients_clean[patients_clean.surname == 'Neudorf']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,2/18/1928,192.3,72,26.1,334-515-7487,TimNeudorf@cuvox.de


### 3. Full state names sometimes, other times abbreviations

##### Define

- Apply a function that converts full state name to state abbreviation for California, New York, Illinois, Florida and Nebraska
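
As an alternative sketch, the same mapping can be applied without a row-wise function by passing the dictionary to `Series.replace`. This is a swapped-in technique, not the approach used in the code cell below, and the `state` values here are a hypothetical sample.

```python
import numpy as np
import pandas as pd

state_abbrev = {
    "California": "CA",
    "New York": "NY",
    "Illinois": "IL",
    "Florida": "FL",
    "Nebraska": "NE",
}

# Hypothetical sample of the state column, including a missing value
state = pd.Series(["California", "NJ", "New York", np.nan])

# Values found in the dict are swapped; everything else (including NaN) is left alone
abbreviated = state.replace(state_abbrev)
```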

##### Code

In [223]:
# Mapping from full state name to abbreviation
state_abbrev = {
    "California" : "CA",
    "New York" : "NY",
    "Illinois": "IL",
    "Florida" : "FL",
    "Nebraska" : "NE"
}

# Function to apply
def abbreviate_state(patient):
    if patient.state in state_abbrev:
        return state_abbrev[patient.state]
    else:
        return patient.state

patients_clean.state = patients_clean.apply(abbreviate_state, axis=1)

In [224]:
patients_clean.state.value_counts()

CA    60
NY    47
TX    32
IL    24
FL    22
MA    22
PA    18
GA    15
OH    14
MI    13
OK    13
LA    13
NJ    12
VA    11
WI    10
MS    10
AL     9
TN     9
IN     9
MN     9
NC     8
KY     8
WA     8
MO     7
NE     6
KS     6
ID     6
NV     6
SC     5
IA     5
CT     5
RI     4
ND     4
AR     4
AZ     4
ME     4
CO     4
MD     3
DE     3
SD     3
WV     3
OR     3
VT     2
MT     2
DC     2
AK     1
NM     1
NH     1
WY     1
Name: state, dtype: int64
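
The row-wise `apply` above works, but the same mapping can also be applied to the column directly. The following is a minimal sketch on toy data, assuming that `Series.replace` with a dict is acceptable here; the `states` Series and the shortened mapping are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for patients_clean.state (values are illustrative)
states = pd.Series(["California", "NY", "Illinois", "TX", None])

# Shortened version of the full-name-to-abbreviation mapping
state_abbrev = {"California": "CA", "Illinois": "IL"}

# replace() swaps only the values found in the dict;
# abbreviations and missing values are left untouched
states_abbreviated = states.replace(state_abbrev)

print(states_abbreviated)
```

Because no Python function is called per row, this form is usually shorter and faster than `apply(..., axis=1)` on large tables.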

### 4. Dsvid Gustafsson

##### Define

- Replace given_name for rows in the `patients` table that have a given name of "Dsvid" with "David"

##### Code

In [225]:
patients_clean.given_name = patients_clean.given_name.replace("Dsvid", "David")

In [226]:
patients_clean[patients_clean.surname == "Gustafsson"]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
8,9,male,David,Gustafsson,1790 Nutter Street,Kansas City,MO,64105,United States,3/6/1937,163.9,66,26.5,816-265-9578,DavidGustafsson@armyspy.com


### 5. Erroneous datatypes (assigned sex, state, zip_code and birthdate columns), erroneous datatypes (auralin and novodra columns), and the letter 'u' in starting and ending doses for Auralin and Novodra

##### Define

- Convert assigned sex and state to categorical data types
- Zip code data type was already addressed above
- Convert birthdate to datetime datatype
- Strip the letter u in the dose_start and dose_end columns and convert those columns to the integer data type

##### Code

In [228]:
treatments_clean.head(20)

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41u,48u,,225
1,7.97,7.62,0.35,auralin,33u,36u,,242
2,7.65,7.27,0.38,auralin,37u,42u,,345
3,7.89,7.55,0.34,auralin,31u,38u,,276
4,7.76,7.37,0.39,auralin,30u,36u,,15
5,7.7,7.19,0.51,auralin,29u,36u,hypoglycemia,70
6,7.7,7.19,0.51,auralin,29u,36u,hypoglycemia,70
7,9.54,9.14,0.4,auralin,29u,38u,,18
8,7.74,7.3,0.44,auralin,27u,37u,,424
9,7.78,7.34,0.44,auralin,55u,68u,,292


In [230]:
# To category
patients_clean.assigned_sex = patients_clean.assigned_sex.astype('category')
patients_clean.state = patients_clean.state.astype('category')

# To datetime
patients_clean.birthdate = pd.to_datetime(patients_clean.birthdate)

# Strip u and to integer
# treatments_clean.dose_start = treatments_clean.dose_start.str.strip('u').astype(int)
# treatments_clean.dose_end = treatments_clean.dose_end.str.strip('u').astype(int)
treatments_clean.dose_end = treatments_clean.dose_end.str[:-1]

In [231]:
treatments_clean.head(10)

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41u,48,,225
1,7.97,7.62,0.35,auralin,33u,36,,242
2,7.65,7.27,0.38,auralin,37u,42,,345
3,7.89,7.55,0.34,auralin,31u,38,,276
4,7.76,7.37,0.39,auralin,30u,36,,15
5,7.7,7.19,0.51,auralin,29u,36,hypoglycemia,70
6,7.7,7.19,0.51,auralin,29u,36,hypoglycemia,70
7,9.54,9.14,0.4,auralin,29u,38,,18
8,7.74,7.3,0.44,auralin,27u,37,,424
9,7.78,7.34,0.44,auralin,55u,68,,292


In [232]:
treatments_clean.dose_start = treatments_clean.dose_start.str.strip('u')

In [233]:
treatments_clean.head(10)

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
0,7.63,7.2,0.43,auralin,41,48,,225
1,7.97,7.62,0.35,auralin,33,36,,242
2,7.65,7.27,0.38,auralin,37,42,,345
3,7.89,7.55,0.34,auralin,31,38,,276
4,7.76,7.37,0.39,auralin,30,36,,15
5,7.7,7.19,0.51,auralin,29,36,hypoglycemia,70
6,7.7,7.19,0.51,auralin,29,36,hypoglycemia,70
7,9.54,9.14,0.4,auralin,29,38,,18
8,7.74,7.3,0.44,auralin,27,37,,424
9,7.78,7.34,0.44,auralin,55,68,,292


In [235]:
# to integer
treatments_clean.dose_start = treatments_clean.dose_start.astype(int)
treatments_clean.dose_end = treatments_clean.dose_end.astype(int)

##### Test

In [None]:
patients_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    503 non-null    int64         
 1   assigned_sex  503 non-null    category      
 2   given_name    503 non-null    object        
 3   surname       503 non-null    object        
 4   address       491 non-null    object        
 5   city          491 non-null    object        
 6   state         491 non-null    category      
 7   zip_code      491 non-null    object        
 8   country       491 non-null    object        
 9   birthdate     503 non-null    datetime64[ns]
 10  weight        503 non-null    float64       
 11  height        503 non-null    int64         
 12  bmi           503 non-null    float64       
 13  phone_number  491 non-null    object        
 14  email         491 non-null    object        
dtypes: category(2), datetime64[ns](1), float

In [236]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 349 entries, 0 to 348
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   hba1c_start       349 non-null    float64
 1   hba1c_end         349 non-null    float64
 2   hba1c_change      349 non-null    float64
 3   treatment         349 non-null    object 
 4   dose_start        349 non-null    int64  
 5   dose_end          349 non-null    int64  
 6   adverse_reaction  35 non-null     object 
 7   patient_id        349 non-null    int64  
dtypes: float64(3), int64(3), object(2)
memory usage: 24.5+ KB
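
The strip-then-convert steps above can be combined into one pass per column. This is a small sketch on a toy frame (the `doses` DataFrame is hypothetical and only mirrors the `dose_start` and `dose_end` columns); it uses `str.rstrip` so that only a trailing `u` is removed.

```python
import pandas as pd

# Toy stand-in for the two dose columns of treatments_clean
doses = pd.DataFrame({"dose_start": ["41u", "33u"], "dose_end": ["48u", "36u"]})

for column in ["dose_start", "dose_end"]:
    # Remove the trailing unit marker, then convert the remaining digits to integers
    doses[column] = doses[column].str.rstrip("u").astype(int)

print(doses.dtypes)
```

Note that `astype(int)` raises an error if any value is missing or still contains non-digit characters, which doubles as a quick check that the stripping worked.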


### 6. Multiple phone number formats

##### Define
- Strip all " ", "-", "(", ")" and "+" characters and store each number without any formatting. Pad the phone number with a leading 1 if it is only 10 digits long (we want the country code)

##### Code

In [237]:
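# regex=True is required for the pattern to be treated as a regular expression in recent pandas versions
patients_clean.phone_number = patients_clean.phone_number.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')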


In [238]:
patients_clean.phone_number.head()

0    19517199170
1    12175693204
2    14023636804
3    17326368246
4    13345157487
Name: phone_number, dtype: object
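
To see what the two chained string methods do, here is a minimal sketch on three made-up phone numbers in the formats found in the table. The `phones` Series is illustrative; `side="left"` is the default for `str.pad` and is spelled out only for clarity.

```python
import pandas as pd

# Illustrative numbers in three of the formats seen in the patients table
phones = pd.Series(["951-719-9170", "+1 (217) 569-3204", "402 363 6804"])

# Step 1: drop every run of non-digit characters
digits_only = phones.str.replace(r"\D+", "", regex=True)

# Step 2: left-pad to 11 characters with '1', which only affects 10-digit numbers
with_country_code = digits_only.str.pad(11, side="left", fillchar="1")

print(with_country_code)
```

Numbers that already carry the country code are 11 digits long after step 1, so the padding leaves them unchanged.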

### 7. Default John Doe data

##### Define

- Remove the non-recoverable John Doe records from the patients table

##### Code

In [239]:
patients_clean = patients_clean[patients_clean.surname != 'Doe']

In [240]:
# should be no Doe records
patients_clean.surname.value_counts()

Jakobsen       3
Taylor         3
Aranda         2
Tucker         2
Souza          2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: surname, Length: 465, dtype: int64

In [242]:
# should be no 123 Main Street records
patients_clean.address.value_counts()

2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
576 Brown Bear Drive        1
2272 Williams Avenue        1
                           ..
1066 Goosetown Drive        1
4291 Patton Lane            1
4643 Reeves Street          1
174 Lost Creek Road         1
3652 Boone Crockett Lane    1
Name: address, Length: 482, dtype: int64
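
Scanning `value_counts()` output by eye is easy to get wrong when the output is truncated. As a hedged alternative, the same two checks can be written as assertions. The sketch below uses a toy frame (`patients` and `patients_no_doe` are hypothetical names, not the notebook's variables).

```python
import pandas as pd

# Toy stand-in for the patients table with two placeholder records
patients = pd.DataFrame({
    "surname": ["Doe", "Neudorf", "Doe"],
    "address": ["123 Main Street", "1428 Turkey Pen Lane", "123 Main Street"],
})

# Keep every row whose surname is not the placeholder 'Doe'
patients_no_doe = patients[patients.surname != "Doe"]

# Programmatic tests: both fail loudly if a placeholder record survives
assert not (patients_no_doe.surname == "Doe").any()
assert not (patients_no_doe.address == "123 Main Street").any()
```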

### 8. Multiple Records for Jakobsen, Gersten, Taylor

##### Define

- Remove the Jake Jakobsen, Pat Gersten and Sandy Taylor rows from the `patients` table. These are nicknames, which also happen not to appear in the `treatments` table (removing the wrong name would create a consistency issue between the `patients` and `treatments` tables)
- These are all the second occurrence of the duplicate
- They are also the only occurrences of duplicated non-null addresses.

##### Code

In [243]:
# tilde means not: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
patients_clean = patients_clean[~((patients_clean.address.duplicated()) & patients_clean.address.notnull())]

##### Test

In [244]:
patients_clean[patients_clean.surname == "Jakobsen"]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,1985-08-01,155.8,67,24.4,18458587707,JakobCJakobsen@einrot.com
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,77020,United States,1962-11-25,185.2,67,29.0,19792030438,KarenJakobsen@jourrapide.com


In [245]:
patients_clean[patients_clean.surname == 'Gersten']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,68324,United States,1954-05-03,138.2,71,19.3,14028484923,PatrickGersten@rhyta.com


In [246]:
patients_clean[patients_clean.surname == 'Taylor']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,1960-10-23,206.1,64,35.4,13044382648,SandraCTaylor@dayrep.com
426,427,male,Rogelio,Taylor,4064 Marigold Lane,Miami,FL,33179,United States,1992-09-02,186.6,69,27.6,13054346299,RogelioJTaylor@teleworm.us
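
The `notnull()` part of the mask matters because `duplicated()` treats repeated missing values as duplicates of each other. A minimal sketch on toy data (the `people` frame and its names are illustrative) shows the effect:

```python
import numpy as np
import pandas as pd

# Two rows share an address; two other rows have no address at all
people = pd.DataFrame({
    "given_name": ["Jakob", "Jake", "Ann", "Bob"],
    "address": ["648 Old Dear Lane", "648 Old Dear Lane", np.nan, np.nan],
})

# True only for a repeated address that is not missing
is_repeat = people.address.duplicated() & people.address.notnull()

# The tilde negates the mask, keeping everything that is not a repeat
deduplicated = people[~is_repeat]

print(deduplicated)
```

Without the `notnull()` condition, the second row with a missing address would be dropped as well, even though nothing indicates it is a duplicate patient.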


### 9. kgs instead of lbs for Zaitseva weight

##### Define
Use [advanced indexing](https://stackoverflow.com/a/44913631) to isolate the row where the surname is Zaitseva and convert the entry in its weight field from kg to lbs.

##### Code

In [247]:
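# Zaitseva's weight (48.8) was recorded in kg and is the smallest value in the column
weight_kg = patients_clean.weight.min()
mask = patients_clean.surname == 'Zaitseva'
column_name = 'weight'
# Convert kg to lbs (1 kg is about 2.20462 lbs)
patients_clean.loc[mask, column_name] = weight_kg * 2.20462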

##### Test

In [248]:
# 48.8 shouldn't be the lowest anymore
patients_clean.weight.sort_values()


459    102.1
335    102.7
74     103.2
317    106.0
171    106.5
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 494, dtype: float64
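
The code above relies on the kg value being the minimum of the column. A variant that does not depend on that assumption reads the value from the masked row itself. This is only a sketch on a two-row toy frame; `patients` and `KG_TO_LB` are hypothetical names.

```python
import pandas as pd

# Toy stand-in: one weight recorded in kg, one in lbs
patients = pd.DataFrame({
    "surname": ["Zaitseva", "Neudorf"],
    "weight": [48.8, 192.3],
})

KG_TO_LB = 2.20462  # approximate pounds per kilogram

# Select the affected row and convert its own weight in place
mask = patients.surname == "Zaitseva"
patients.loc[mask, "weight"] = patients.loc[mask, "weight"] * KG_TO_LB

print(patients)
```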

# Flashforward

## Is Auralin Effective?
- After assessing and cleaning the clinical trial dataset, we are ready to determine whether the proposed new oral insulin, Auralin, is effective compared to the injectable insulin, Novodra

### Key Metrics
1. Adverse Reactions
2. Pre-trial Post-trial Mean Insulin Dose Change.
3. HbA1c Change
4. Confidence Interval - range of values that a parameter is likely to fall in with a specific probability.

#### Adverse_reactions
- For Auralin to pass this Phase II clinical trial it must be deemed safe, and the adverse reaction data for it are encouraging
- These adverse reactions were previously in a standalone table, but we joined them to the treatments table to allow for this analysis. Between the two drugs, Auralin and Novodra, the counts of each adverse reaction are quite similar. One exception is throat irritation for Auralin, the oral insulin, which is expected because the pill is taken orally and passes the throat before it reaches the stomach. Another is injection site discomfort for Novodra, the injectable insulin, which is a common, known adverse reaction to injections because of the needles. This is one of the reasons why we want an oral insulin in the first place.

#### Confidence Interval
- A statistical term that refers to the range of values that a parameter will fall in with a specific probability
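
As a rough illustration of the idea, the sketch below computes an approximate 95% confidence interval for a mean using the normal approximation. The `confidence_interval` helper is hypothetical, and the input is just the `hba1c_change` values from the ten treatment rows displayed earlier, which is far too small a sample for a real analysis (for samples this small a t-based interval would normally be preferred).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# hba1c_change values from the ten rows shown earlier (illustration only)
changes = [0.43, 0.35, 0.38, 0.34, 0.39, 0.51, 0.51, 0.40, 0.44, 0.44]


def confidence_interval(values, level=0.95):
    """Return (low, high) for the mean, using the normal approximation."""
    # z-score that leaves (1 - level) / 2 probability in each tail
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = mean(values)
    # Standard error of the mean from the sample standard deviation
    half_width = z * stdev(values) / sqrt(len(values))
    return center - half_width, center + half_width


low, high = confidence_interval(changes)
print(f"mean = {mean(changes):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

A wider interval signals more uncertainty about the true mean; raising `level` or shrinking the sample both widen it.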


#### You can Iterate

#### When is Iteration Necessary?
- The concept of iterating isn't that applicable for clinical trials given the rigor involved in their planning. But, there are other situations that require iteration:
    1. Your statistical power calculations are wrong, and you need to recruit more patients to make your study statistically significant. You'd also have to revisit gathering in this scenario.
    2. You are missing a key piece of patient information, like patient blood type, because you discover new research that relates insulin resistance to blood type. You'd also have to revisit gathering in this scenario.
    3. You spotted another data quality issue. Revisiting assessing to add these assessments to your notes is fine.