# Data Wrangling Template

## Gather

In [149]:
import pandas as pd

In [2]:
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')

## Assess
These are the programmatic assessment methods in pandas that you will probably use most often:

* .head (DataFrame and Series)
* .tail (DataFrame and Series)
* .sample (DataFrame and Series)
* .info (DataFrame only)
* .describe (DataFrame and Series)
* .value_counts (Series only)
* Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc)

Try them out below and keep their results in mind. Some will come in handy later in the lesson.

Check out the [pandas API reference](https://pandas.pydata.org/pandas-docs/stable/api.html) for detailed usage information.
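
As a quick illustration of the methods listed above, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its values are hypothetical and only loosely mimic the `patients` table; the method calls themselves are standard pandas.

```python
import pandas as pd

# Hypothetical toy data, loosely shaped like the patients table
toy = pd.DataFrame({
    "state": ["CA", "California", "NY", "CA", None],
    "height": [66, 71, 27, 70, 64],
})

first_rows = toy.head(2)                      # first two rows
state_counts = toy["state"].value_counts()    # frequency of each state label (NaN excluded)
summary = toy["height"].describe()            # count, mean, std, min, quartiles, max
short = toy.loc[toy["height"] < 48]           # boolean indexing with .loc
missing_state = toy[toy["state"].isnull()]    # bracket notation with a boolean mask
```

Note how `value_counts` already hints at a consistency problem (`CA` vs. `California`) and `describe` exposes an implausible minimum height.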

In [52]:
patients.head(n=10)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
5,6,male,Rafael,Costa,1140 Willis Avenue,Daytona Beach,Florida,32114.0,United States,386-334-5237RafaelCardosoCosta@gustr.com,8/31/1931,183.9,70,26.4
6,7,female,Mary,Adams,3145 Sheila Lane,Burbank,NV,84728.0,United States,775-533-5933MaryBAdams@einrot.com,11/19/1969,146.3,65,24.3
7,8,female,Xiuxiu,Chang,2687 Black Oak Hollow Road,Morgan Hill,CA,95037.0,United States,XiuxiuChang@einrot.com1 408 778 3236,8/13/1958,158.0,60,30.9
8,9,male,Dsvid,Gustafsson,1790 Nutter Street,Kansas City,MO,64105.0,United States,816-265-9578DavidGustafsson@armyspy.com,3/6/1937,163.9,66,26.5
9,10,female,Sophie,Cabrera,3303 Anmoore Road,New York,New York,10011.0,United States,SophieCabreraIbarra@teleworm.us1 718 795 9124,12/3/1930,194.7,64,33.4


`patients` columns:
- **patient_id**: the unique identifier for each patient in the [Master Patient Index](https://en.wikipedia.org/wiki/Enterprise_master_patient_index) (i.e. patient database) of the pharmaceutical company that is producing Auralin
- **assigned_sex**: the assigned sex of each patient at birth (male or female)
- **given_name**: the given name (i.e. first name) of each patient
- **surname**: the surname (i.e. last name) of each patient
- **address**: the main address for each patient
- **city**: the corresponding city for the main address of each patient
- **state**: the corresponding state for the main address of each patient
- **zip_code**: the corresponding zip code for the main address of each patient
- **country**: the corresponding country for the main address of each patient (all United States for this clinical trial)
- **contact**: phone number and email information for each patient
- **birthdate**: the date of birth of each patient (month/day/year). The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is age >= 18 *(there is no maximum age because diabetes is a [growing problem](http://www.diabetes.co.uk/diabetes-and-the-elderly.html) among the elderly population)*
- **weight**: the weight of each patient in pounds (lbs)
- **height**: the height of each patient in inches (in)
- **bmi**: the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m<sup>2</sup> where kg is a person's weight in kilograms and m<sup>2</sup> is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. *The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is 16 <= BMI <= 38.*
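
Because `weight` and `height` are stored in pounds and inches while the BMI formula is metric, a conversion is needed. The sketch below uses the common imperial shortcut (factor 703) alongside the metric definition. The function names are hypothetical helpers, not part of the dataset or of pandas.

```python
KG_PER_LB = 0.45359237   # kilograms per pound
M_PER_IN = 0.0254        # metres per inch


def bmi_from_metric(weight_kg, height_m):
    """BMI = kg / m^2, as defined in the column description."""
    return weight_kg / height_m ** 2


def bmi_from_imperial(weight_lb, height_in):
    """Same quantity from pounds and inches, using the usual 703 factor."""
    return 703 * weight_lb / height_in ** 2


def meets_bmi_criterion(bmi, low=16, high=38):
    """Inclusion criterion for the trial: 16 <= BMI <= 38."""
    return low <= bmi <= high


# Row 0 of the patients table: 121.7 lbs, 66 in, recorded bmi 19.6
zoe_imperial = bmi_from_imperial(121.7, 66)
zoe_metric = bmi_from_metric(121.7 * KG_PER_LB, 66 * M_PER_IN)
```

Both routes agree with the recorded value to one decimal place, which is what makes BMI usable as a cross-check on `weight` and `height`.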

In [15]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [60]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
patient_id      503 non-null int64
assigned_sex    503 non-null object
given_name      503 non-null object
surname         503 non-null object
address         491 non-null object
city            491 non-null object
state           491 non-null object
zip_code        491 non-null float64
country         491 non-null object
contact         491 non-null object
birthdate       503 non-null object
weight          503 non-null float64
height          503 non-null int64
bmi             503 non-null float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [59]:
patients.state.value_counts()

California    36
TX            32
New York      25
CA            24
NY            22
MA            22
PA            18
GA            15
OH            14
Illinois      14
LA            13
Florida       13
OK            13
MI            13
NJ            12
VA            11
WI            10
MS            10
IL            10
IN             9
AL             9
FL             9
MN             9
TN             9
NC             8
WA             8
KY             8
MO             7
KS             6
ID             6
NV             6
SC             5
IA             5
CT             5
AR             4
CO             4
ME             4
RI             4
AZ             4
Nebraska       4
ND             4
OR             3
MD             3
WV             3
SD             3
DE             3
DC             2
MT             2
VT             2
NE             2
NH             1
NM             1
WY             1
AK             1
Name: state, dtype: int64

In [62]:
patients[patients.city=="New York"]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
9,10,female,Sophie,Cabrera,3303 Anmoore Road,New York,New York,10011.0,United States,SophieCabreraIbarra@teleworm.us1 718 795 9124,12/3/1930,194.7,64,33.4
35,36,female,Kamila,Pecinová,3558 Longview Avenue,New York,New York,10004.0,United States,718-501-0503KamilaPecinova@dayrep.com,12/23/1985,198.9,62,36.4
84,85,female,Nương,Vũ,465 Southern Street,New York,NY,10001.0,United States,VuCamNuong@fleckens.hu516-720-5094,2/1/1981,138.2,63,24.5
129,130,female,Rebecca,Jephcott,989 Wayback Lane,New York,NY,10004.0,United States,631-370-7406RebeccaJephcott@armyspy.com,8/1/1966,203.3,65,33.8
142,143,male,Finley,Chandler,2754 Westwood Avenue,New York,New York,10001.0,United States,516-740-5280FinleyChandler@dayrep.com,10/25/1936,150.9,70,21.6
152,153,male,Christopher,Woodward,3450 Southern Street,New York,NY,10004.0,United States,ChristopherWoodward@jourrapide.com+1 (516) 630...,9/4/1984,212.2,66,34.2
188,189,male,Søren,Sørensen,2397 Bell Street,New York,NY,10011.0,United States,SrenSrensen@superrito.com1 212 201 3108,12/31/1942,157.1,67,24.6
213,214,female,Onyemaechi,Onwughara,685 Duncan Avenue,New York,NY,10013.0,United States,917-622-9142OnyemaechiOnwughara@einrot.com,3/8/1989,131.1,69,19.4
215,216,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [44]:
# BMI in imperial units is 703 * lbs / in**2; flag rows where the recorded bmi deviates by more than 0.05
patients[abs((703*patients.weight/(patients.height**2))-patients.bmi)>.05]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
210,211,female,Camilla,Zaitseva,4689 Briarhill Lane,Wooster,OH,44691.0,United States,330-202-2145CamillaZaitseva@superrito.com,11/26/1938,48.8,63,19.1


In [63]:
patients.surname.value_counts()

Doe            6
Jakobsen       3
Taylor         3
Parker         2
Aranda         2
Cindrić        2
Kowalczyk      2
Lâm            2
Gersten        2
Batukayev      2
Correia        2
Tucker         2
Souza          2
Liễu           2
Hueber         2
Lương          2
Dratchev       2
Lund           2
Johnson        2
Berg           2
Woźniak        2
Bùi            2
Tạ             2
Silva          2
Schiavone      2
Kadyrov        2
Cabrera        2
Grímsdóttir    2
Ogochukwu      2
Nilsen         2
              ..
Bjarkason      1
Petersen       1
Lorenzo        1
Wagner         1
Bonami         1
Adonay         1
Grubišić       1
Chung          1
Hsu            1
Györfy         1
Frederiksen    1
Sandgreen      1
Arsanukayev    1
Nebeolisa      1
Majewski       1
Grant          1
Quintanilla    1
Filatov        1
Sokołowska     1
Mattila        1
Osman          1
Terrazas       1
Mathiesen      1
Ferrari        1
Horvat         1
Štěpánek       1
Tománková      1
Lynge         

In [68]:
patients.address.value_counts()

123 Main Street             6
2778 North Avenue           2
648 Old Dear Lane           2
2476 Fulton Street          2
577 Chipmunk Lane           1
3216 Lodgeville Road        1
1614 Heather Sees Way       1
3209 Crowfield Road         1
796 Eagle Street            1
3450 Southern Street        1
3115 May Street             1
3543 Cherry Ridge Drive     1
1845 Saint Marys Avenue     1
4033 White Avenue           1
1012 Lords Way              1
108 Griffin Street          1
1904 Granville Lane         1
3634 Lyon Avenue            1
4839 North Avenue           1
149 Marion Drive            1
1428 Turkey Pen Lane        1
2690 Pin Oak Drive          1
475 Preston Street          1
1953 Rhapsody Street        1
4104 Kennedy Court          1
1846 Joseph Street          1
2421 Coal Road              1
4795 Better Street          1
1179 Patton Lane            1
2640 Sweetwood Drive        1
                           ..
3613 Lodgeville Road        1
1812 Poplar Street          1
4220 Simps

In [73]:
patients[patients.address.duplicated(keep=False)]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962.0,United States,304-438-2648SandraCTaylor@dayrep.com,10/23/1960,206.1,64,35.4
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
215,216,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
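
The output above also contains rows whose address is missing, because `duplicated` treats missing values as equal to each other. A minimal sketch of excluding them, on a hypothetical `addresses` Series:

```python
import pandas as pd

# Hypothetical toy column: two real duplicates and two missing values
addresses = pd.Series([
    "123 Main Street", None, "648 Old Dear Lane", None, "123 Main Street",
])

# Naive check: missing addresses are flagged as duplicates of one another
naive_mask = addresses.duplicated(keep=False)

# Stricter check: only non-missing addresses that occur more than once
real_mask = addresses.notnull() & addresses.duplicated(keep=False)
```

With `keep=False` every member of a duplicate group is marked, which is convenient for eyeballing pairs such as the Jakobsen records.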


In [49]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [45]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [54]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
given_name      280 non-null object
surname         280 non-null object
auralin         280 non-null object
novodra         280 non-null object
hba1c_start     280 non-null float64
hba1c_end       280 non-null float64
hba1c_change    171 non-null float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [74]:
sum(treatments.auralin.isnull())

0

350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before. All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn't enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:
- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

`treatments` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial
- **auralin**: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) *and* the ending median daily dose of insulin at the end of the 24 weeks of treatment measured over the 24th week of treatment (the number after the dash). Both are measured in units (shortform 'u'), which is the [international unit](https://en.wikipedia.org/wiki/International_unit) of measurement and the standard measurement for insulin.
- **novodra**: same as above, except for patients that continued treatment with Novodra
- **hba1c_start**: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The [HbA1c test](https://depts.washington.edu/uwcoe/healthtopics/diabetes.html) measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
- **hba1c_end**: the patient's HbA1c level at the end of the last week of treatment
- **hba1c_change**: the change in the patient's HbA1c level from the start of treatment to the end, i.e., `hba1c_start` - `hba1c_end`. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).
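
To make the noninferiority rule concrete, here is a rough sketch of the upper confidence bound for "Novodra minus Auralin". It assumes a normal approximation (z = 1.96) with unpooled variances, which is an assumption of this sketch and not necessarily the method used in the trial. The function name is hypothetical, and the inputs are only a handful of `hba1c_start` - `hba1c_end` values taken from rows displayed in this notebook, far too few for any real conclusion.

```python
import math
import statistics


def noninferiority_upper_bound(control_changes, test_changes, z=1.96):
    """Upper bound of the ~95% CI for mean(control) - mean(test).

    Assumption: normal approximation with unpooled sample variances.
    """
    diff = statistics.mean(control_changes) - statistics.mean(test_changes)
    std_err = math.sqrt(
        statistics.variance(control_changes) / len(control_changes)
        + statistics.variance(test_changes) / len(test_changes)
    )
    return diff + z * std_err


# Illustrative values only (a few displayed rows, not the full trial)
novodra_changes = [0.47, 0.43, 0.32]
auralin_changes = [0.43, 0.35, 0.38, 0.34, 0.39]

upper = noninferiority_upper_bound(novodra_changes, auralin_changes)
is_noninferior = upper < 0.4   # the threshold stated in the column description
```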

In [51]:
adverse_reactions.head()

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation


In [53]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
given_name          34 non-null object
surname             34 non-null object
adverse_reaction    34 non-null object
dtypes: object(3)
memory usage: 896.0+ bytes


In [61]:
adverse_reactions.adverse_reaction.value_counts()

hypoglycemia                 19
injection site discomfort     6
headache                      3
cough                         2
nausea                        2
throat irritation             2
Name: adverse_reaction, dtype: int64

`adverse_reactions` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes both patients treated with Auralin and Novodra)
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes both patients treated with Auralin and Novodra)
- **adverse_reaction**: the adverse reaction reported by the patient

Additional useful information:
- [Insulin resistance varies person to person](http://www.tudiabetes.org/forum/t/how-much-insulin-is-too-much-on-a-daily-basis/9804/5), which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
- It is important to test drugs and medical products in the people they are meant to help. People of different age, race, sex, and ethnic group must be included in clinical trials. This [diversity](https://www.clinicalleader.com/doc/an-fda-perspective-on-patient-diversity-in-clinical-trials-0001) is reflected in the `patients` table.
- Ensuring column names are descriptive enough is an important step in acquainting yourself with the data. 'Descriptive enough' is subjective. Ideally you want short column names (so they are easier to type and read in code form) but also fully descriptive. Length vs. descriptiveness is a tradeoff and common debate (a [similar debate](https://softwareengineering.stackexchange.com/questions/176582/is-there-an-excuse-for-short-variable-names) exists for variable names). The *auralin* and *novodra* column names are probably not descriptive enough, but you'll address that later so don't worry about that for now.

## Data Quality Dimensions:
Data quality dimensions help guide your thought process while assessing and also cleaning. The four main data quality dimensions are:

- Completeness: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
- Validity: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
- Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
- Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

### Quality issues

#### `patients` table:
- zip code is a float, not a string
- zip code sometimes has four digits instead of five (the leading zero is lost when the value is read as a number)
- Tim Neudorf's height is 27 in instead of 72 in
- Camilla Zaitseva's weight is 48.8, which is in kg instead of lbs (based on her BMI it should be about 108 lbs)
- some states are full names, others are abbreviations
- the contact column holds both phone number and email
- patient_id 9 has the given name Dsvid instead of David
- missing demographic information (address through country, and contact)
- state should be category instead of object type
- assigned sex should be category instead of object type
- birthdate should be datetime, not object
- multiple phone number formats
- there are duplicate placeholder John Doe records
- Jakob/Jake Jakobsen, Patrick/Pat Gersten, and Sandra/Sandy Taylor each appear twice; one record of each pair needs to be removed (take care to remove the right one)
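
Two of the issues above (zip codes stored as floats, and mixed state names) are simple enough to sketch. This notebook does not clean them, so the snippet below is only an illustration: `zip_to_string` and `STATE_ABBREV` are hypothetical names, and the mapping covers just the five full state names that appear in the `value_counts` output.

```python
import pandas as pd


def zip_to_string(zip_code):
    """Turn 7095.0 into '07095'; leave missing values untouched."""
    if pd.isna(zip_code):
        return zip_code
    return f"{int(zip_code):05d}"


# Full names seen in patients.state.value_counts()
STATE_ABBREV = {
    "California": "CA",
    "New York": "NY",
    "Illinois": "IL",
    "Florida": "FL",
    "Nebraska": "NE",
}

zips = pd.Series([92390.0, 7095.0, float("nan")])
zip_strings = zips.map(zip_to_string)

states = pd.Series(["California", "CA", "NE", "Nebraska"])
states_clean = states.replace(STATE_ABBREV)
```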

#### `treatments` table:
- missing HbA1c changes
- only has 280 rows, out of 350 participants
- 'u' is appended to the start and end doses (auralin and novodra doses should be integers, not objects)
- given_name and surname are lowercase
- hba1c_change should equal hba1c_start - hba1c_end, but some values do not
- missing values in auralin and novodra are stored as the string '-', so they are not detected as null

#### `adverse_reactions` table:
- given_name and surname are lowercase


 

### Tidiness issues

#### `patients` table:
- two variables in the contact column, split into phone and email

#### `treatments` table:
- auralin and novodra columns need to be three columns: start dose, end dose, and treatment type (which is in the column headers)

#### `adverse_reactions` table:
- can be combined into treatments table


## Clean

1. Fix completeness issues
2. Fix tidiness issues

In [204]:
patients_clean=patients.copy()
treatments_clean=treatments.copy()
adverse_reactions_clean=adverse_reactions.copy()

#### Define

- `treatments`: only has 280 rows, out of 350 participants

Append the missing treatment records from `treatments_cut.csv` to the `treatments_clean` table.

#### Code

In [205]:
treatments_cut=pd.read_csv('treatments_cut.csv')
treatments_cut.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
given_name      70 non-null object
surname         70 non-null object
auralin         70 non-null object
novodra         70 non-null object
hba1c_start     70 non-null float64
hba1c_end       70 non-null float64
hba1c_change    42 non-null float64
dtypes: float64(3), object(4)
memory usage: 3.9+ KB


In [206]:
treatments_clean=pd.concat([treatments_clean,treatments_cut],ignore_index=True)

#### Test

In [207]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 7 columns):
given_name      350 non-null object
surname         350 non-null object
auralin         350 non-null object
novodra         350 non-null object
hba1c_start     350 non-null float64
hba1c_end       350 non-null float64
hba1c_change    213 non-null float64
dtypes: float64(3), object(4)
memory usage: 19.2+ KB
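
Beyond the row count, it can be worth confirming that the appended records do not overlap with the existing ones. A minimal sketch on hypothetical two- and one-row frames (`part_a` and `part_b` stand in for the two treatment files):

```python
import pandas as pd

# Hypothetical stand-ins for treatments.csv and treatments_cut.csv
part_a = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
})
part_b = pd.DataFrame({"given_name": ["skye"], "surname": ["gormanston"]})

# ignore_index=True rebuilds a clean 0..n-1 index after stacking
combined = pd.concat([part_a, part_b], ignore_index=True)

# Count patients that appear more than once after the append
overlap = combined.duplicated(subset=["given_name", "surname"]).sum()
```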


#### Define

- `treatments`: missing/inaccurate HbA1c_change values

Fill in the `hba1c_change` column by subtracting `hba1c_end` from `hba1c_start`, replacing all existing values.

#### Code

In [208]:
treatments_clean.hba1c_change=treatments_clean.hba1c_start-treatments_clean.hba1c_end


#### Test

In [209]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
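
To see which recorded values were missing or inaccurate before they were overwritten, the recorded column can be compared with the recomputed one. The sketch below rebuilds the first five rows of the original `treatments` table by hand; the variable names are hypothetical.

```python
import pandas as pd

# First five rows of the original treatments table, as displayed earlier
toy = pd.DataFrame({
    "hba1c_start": [7.63, 7.56, 7.68, 7.97, 7.78],
    "hba1c_end": [7.20, 7.09, 7.25, 7.62, 7.46],
    "hba1c_change": [None, 0.97, None, 0.35, 0.32],
})

# Round to two decimals to hide floating-point noise such as 0.4299999...
recomputed = (toy["hba1c_start"] - toy["hba1c_end"]).round(2)

missing = toy["hba1c_change"].isnull()
# NaN compares as False, so missing rows are not counted as mismatches
mismatched = (toy["hba1c_change"] - recomputed).abs() > 0.005
```

In this sample the recorded 0.97 disagrees with the recomputed 0.47, which is the kind of inaccuracy the Define step refers to.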


#### Define


- The contact column in the `patients` table contains two variables: phone number and email

Use regular expressions to split this column into two columns, `phonenumber` and `email`.

Help sources: regular expressions with pandas' [`str.extract` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html). [regex tutorial](https://regexone.com/). [various phone number regex patterns](https://stackoverflow.com/questions/16699007/regular-expression-to-match-standard-10-digit-phone-number). [email address regex pattern](http://emailregex.com/).

#### Code

In [210]:
patients_clean.contact.head()

0          951-719-9170ZoeWellish@superrito.com
1         PamelaSHill@cuvox.de+1 (217) 569-3204
2              402-363-6804JaeMDebord@gustr.com
3    PhanBaLiem@jourrapide.com+1 (732) 636-8246
4               334-515-7487TimNeudorf@cuvox.de
Name: contact, dtype: object

In [211]:
# Raw string so that \d, \s and \( reach the regex engine unchanged
patients_clean['phonenumber']=patients_clean.contact.str.extract(r'(\d?\s*\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\d?\s*\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4}|\d{10})',expand=True)
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi,phonenumber
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6,951-719-9170
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2,1 (217) 569-3204
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8,402-363-6804
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7,1 (732) 636-8246
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1,334-515-7487


In [212]:
# Raw string again; \D (non-digit) keeps the adjacent phone digits out of the email
patients_clean['email']=patients_clean.contact.str.extract(r"([\D.+-]+@[\D]+\.[a-z]+)",expand=True)
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi,phonenumber,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2,1 (217) 569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7,1 (732) 636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1,334-515-7487,TimNeudorf@cuvox.de


#### Test

In [213]:
patients_clean[patients_clean.phonenumber.isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi,phonenumber,email
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2,,
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1,,
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4,,
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2,,
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4,,
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6,,
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2,,
264,265,female,Wafiyyah,Asfour,,,,,,,11/3/1989,158.6,63,28.1,,
269,270,female,Flavia,Fiorentino,,,,,,,10/9/1937,175.2,61,33.1,,
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4,,


In [214]:
patients_clean[['contact','phonenumber','email']].sample(10)

Unnamed: 0,contact,phonenumber,email
221,SagiCsaba@armyspy.com+1 (267) 932-9852,1 (267) 932-9852,SagiCsaba@armyspy.com
362,757-624-1525LubosPecha@rhyta.com,757-624-1525,LubosPecha@rhyta.com
35,718-501-0503KamilaPecinova@dayrep.com,718-501-0503,KamilaPecinova@dayrep.com
318,BenoitBonami@gustr.com1 718 954 8136,1 718 954 8136,BenoitBonami@gustr.com
136,714-507-4204VictoriaTMikkelsen@armyspy.com,714-507-4204,VictoriaTMikkelsen@armyspy.com
14,AsiaWozniak@rhyta.com918-712-3469,918-712-3469,AsiaWozniak@rhyta.com
474,EsperanzaLabrosse@armyspy.com678-263-3564,678-263-3564,EsperanzaLabrosse@armyspy.com
389,918-459-9811YegorUspensky@fleckens.hu,918-459-9811,YegorUspensky@fleckens.hu
418,MahmudKadyrov@gustr.com1 701 745 2700,1 701 745 2700,MahmudKadyrov@gustr.com
248,CecilieNilsen@superrito.com308-496-7837,308-496-7837,CecilieNilsen@superrito.com
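
The same split can be reproduced with the standard library, which makes it easy to test the patterns on individual strings. The phone alternatives below mirror the notebook's pattern; the email pattern is a simplified variant and assumes that the address starts at the first letter and that the domain contains no digits. `split_contact` is a hypothetical helper name.

```python
import re

# Same phone alternatives as the notebook's pattern, written as raw strings
PHONE_RE = re.compile(
    r"\d?\s*\d{3}[-.\s]\d{3}[-.\s]\d{4}"       # 951-719-9170, 1 408 778 3236
    r"|\d?\s*\(\d{3}\)\s*\d{3}[-.\s]\d{4}"     # 1 (217) 569-3204
    r"|\d{3}[-.\s]\d{4}"                       # 7-digit local number
    r"|\d{10}"                                 # 1234567890
)

# Assumption: local part starts with a letter, domain labels hold no digits
EMAIL_RE = re.compile(r"[A-Za-z][A-Za-z0-9._+-]*@[A-Za-z-]+\.[A-Za-z]+")


def split_contact(contact):
    """Return (phone, email) found in a fused contact string, or None for each miss."""
    phone = PHONE_RE.search(contact)
    email = EMAIL_RE.search(contact)
    return (
        phone.group(0) if phone else None,
        email.group(0) if email else None,
    )


pamela = split_contact("PamelaSHill@cuvox.de+1 (217) 569-3204")
```

Like the notebook's pattern, this drops the leading `+` of the country code, so phone formats still need to be standardised afterwards.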


In [215]:
patients_clean = patients_clean.drop('contact', axis=1)

#### Define

- `treatments`: auralin and novodra columns need to be three columns: start dose, end dose, and treatment type (which is in the column headers)

Use `pd.melt` to move the treatment names out of the column headers, then `str.split()` to separate the start and end doses.

Help sources: pandas [melt function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) and [`str.split()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html). [`melt` tutorial](https://deparkes.co.uk/2016/10/28/reshape-pandas-data-with-melt/).

#### Code

In [216]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [217]:
treatments_clean=pd.melt(treatments_clean, id_vars=['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'],
                           var_name='treatment', value_name='dose')
treatments_clean.head()

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u - 48u
1,elliot,richardson,7.56,7.09,0.47,auralin,-
2,yukitaka,takenaka,7.68,7.25,0.43,auralin,-
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u
4,alissa,montez,7.78,7.46,0.32,auralin,-


In [218]:
treatments_clean=treatments_clean[treatments_clean.dose!="-"]
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 698
Data columns (total 7 columns):
given_name      350 non-null object
surname         350 non-null object
hba1c_start     350 non-null float64
hba1c_end       350 non-null float64
hba1c_change    350 non-null float64
treatment       350 non-null object
dose            350 non-null object
dtypes: float64(3), object(4)
memory usage: 21.9+ KB


In [219]:
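# Copy the filtered frame so the column assignment below does not trigger SettingWithCopyWarning
treatments_clean = treatments_clean.copy()
# Split 'NNu - NNu' on the ' - ' separator into start and end dose columns
treatments_clean[['dose_start', 'dose_end']] = treatments_clean.dose.str.split(' - ', n=1, expand=True)
treatments_clean.head()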

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose,dose_start,dose_end
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u - 48u,41u,48u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u,33u,36u
6,sophia,haugen,7.65,7.27,0.38,auralin,37u - 42u,37u,42u
7,eddie,archer,7.89,7.55,0.34,auralin,31u - 38u,31u,38u
9,asia,woźniak,7.76,7.37,0.39,auralin,30u - 36u,30u,36u


In [220]:
treatments_clean = treatments_clean.drop('dose', axis=1)

#### Test

In [225]:
treatments_clean.sample(10)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end
652,beatrycze,woźniak,7.54,7.17,0.37,novodra,26u,27u
639,angela,lavrentyev,7.61,7.14,0.47,novodra,28u,24u
407,klementyna,sokołowska,7.98,7.53,0.45,novodra,42u,41u
447,niels,lange,7.58,7.21,0.37,novodra,43u,38u
478,david,gustafsson,7.72,7.28,0.44,novodra,33u,34u
159,yunadi,barsukov,9.47,9.05,0.42,auralin,48u,58u
209,kári,hervinsson,8.09,7.66,0.43,auralin,37u,43u
425,mackenzie,mckay,9.87,9.48,0.39,novodra,44u,43u
507,asuna,morita,7.58,7.25,0.33,novodra,35u,39u
658,una,traustadóttir,8.0,7.5,0.5,novodra,35u,34u
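
The doses above are still strings ending in `u`. As a possible next step, the sketch below repeats the melt and split on a two-row toy frame and then converts the doses to integers. The `wide` and `tidy` names are hypothetical, and the conversion is a suggestion, not something this notebook has done yet.

```python
import pandas as pd

# Two rows in the original wide layout
wide = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Wide to long: one row per (patient, treatment) pair
tidy = pd.melt(wide, id_vars=["given_name", "surname"],
               var_name="treatment", value_name="dose")

# '-' marks the treatment a patient did not receive
tidy = tidy[tidy["dose"] != "-"].reset_index(drop=True)

# Split on the ' - ' separator, drop the trailing 'u', convert to int
parts = tidy["dose"].str.split(" - ", n=1, expand=True)
tidy["dose_start"] = parts[0].str.rstrip("u").astype(int)
tidy["dose_end"] = parts[1].str.rstrip("u").astype(int)
tidy = tidy.drop(columns="dose")
```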


#### Define

- `adverse_reactions` table: can be combined into treatments table

Left-join `adverse_reactions` onto `treatments` on `given_name` and `surname` using the `merge` function, so that patients without an adverse reaction are kept.

In [226]:
adverse_reactions_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
given_name          34 non-null object
surname             34 non-null object
adverse_reaction    34 non-null object
dtypes: object(3)
memory usage: 896.0+ bytes


#### Code

In [230]:
treatments_clean=treatments_clean.merge(adverse_reactions_clean,how='left',on=['given_name','surname'])

#### Test

In [234]:
treatments_clean[treatments_clean.adverse_reaction.notnull()].sample(10)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction
348,jakob,jakobsen,7.96,7.51,0.45,novodra,28u,26u,hypoglycemia
227,abel,yonatan,7.88,7.5,0.38,novodra,38u,39u,cough
347,lixue,hsueh,9.21,8.8,0.41,novodra,22u,23u,injection site discomfort
264,tegan,johnson,7.79,7.43,0.36,novodra,34u,34u,headache
225,abdul-nur,isa,7.98,7.53,0.45,novodra,54u,50u,hypoglycemia
93,merci,leroux,8.98,8.64,0.34,auralin,27u,33u,hypoglycemia
13,clinton,miller,7.79,7.4,0.39,auralin,42u,51u,throat irritation
175,elliot,richardson,7.56,7.09,0.47,novodra,40u,45u,hypoglycemia
202,albinca,komavec,7.89,7.46,0.43,novodra,41u,39u,hypoglycemia
33,louise,johnson,7.63,7.32,0.31,auralin,32u,42u,hypoglycemia


In [241]:
treatments_clean.sample(10,random_state=10)

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction
43,stefanie,herman,7.72,7.3,0.42,auralin,44u,55u,
306,csilla,herczegh,7.71,7.27,0.44,novodra,43u,46u,
138,inunnguaq,heilmann,7.85,7.45,0.4,auralin,57u,67u,
275,lena,baer,7.7,7.4,0.3,novodra,41u,38u,hypoglycemia
65,nora,nyborg,7.83,7.48,0.35,auralin,55u,59u,
6,roxanne,andreyeva,9.54,9.14,0.4,auralin,29u,38u,
262,kong,lei,7.58,7.15,0.43,novodra,32u,30u,
172,rovzan,kishiev,7.75,7.41,0.34,auralin,32u,37u,
342,bjørnar,nilsen,7.99,7.7,0.29,novodra,36u,33u,
218,else,andersen,7.98,7.6,0.38,novodra,36u,38u,
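
A left merge on names silently duplicates rows if a name pair occurs more than once on either side. The sketch below shows two optional `merge` parameters that guard against this, `validate` and `indicator`, on hypothetical toy frames built from three rows shown above.

```python
import pandas as pd

# Toy stand-ins for treatments_clean and adverse_reactions_clean
treat = pd.DataFrame({
    "given_name": ["lena", "kong", "else"],
    "surname": ["baer", "lei", "andersen"],
    "hba1c_change": [0.30, 0.43, 0.38],
})
adverse = pd.DataFrame({
    "given_name": ["lena"],
    "surname": ["baer"],
    "adverse_reaction": ["hypoglycemia"],
})

merged = treat.merge(
    adverse,
    how="left",
    on=["given_name", "surname"],
    validate="one_to_one",   # raises MergeError if a name pair repeats on either side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

matched = (merged["_merge"] == "both").sum()
```

If the number of matched rows is lower than the number of rows in the adverse reactions table, some names failed to join, for example because of spelling or casing differences.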


#### Define

#### Code

#### Test
