# Assessing Data
- Assessing data is the second step in the data wrangling process. When assessing, you inspect your dataset for two things:
    * __Data quality issues__ - problems with content, such as missing, duplicate or incorrect data. Data with quality issues is called dirty data
    * __Lack of tidiness__ - structural problems that slow you down when cleaning, analyzing, visualizing or modelling your data later

- You can search for these issues in two ways: visually, by scrolling through the data, and programmatically, using code
- When you detect an issue, document it to make cleaning easier

- When assessing, you are like a detective at work, inspecting your dataset for data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues)

- Assessing is the precursor of cleaning. You can't clean something that you don't know exists

## Dataset: Oral Insulin Phase II Clinical Trial Data
### Learning About Our Dataset
#### Diabetes
- The increasing prevalence of diabetes in the 21st century is a problem. Patients have symptoms like:
    * unusual thirst
    * frequent urination
    * extreme fatigue

- Diabetes can also lead to more serious complications like stroke, blindness, loss of limbs, kidney failure and even heart attack

### Discovery Of Insulin
- Insulin was discovered in the 1920s by Frederick Banting
- Most of the food we eat is turned into glucose, or sugar, for our bodies to use for energy. The pancreas, an organ near the stomach, makes a hormone called insulin to help glucose get into the cells of our bodies. When you have diabetes, the body either doesn't make enough insulin or can't use its own insulin as well as it should, which causes sugar to build up in the blood

- With the discovery of insulin, pharmaceutical companies began producing it on a large scale. Although it doesn't cure diabetes, it's one of the biggest discoveries in medicine

### Challenges with insulin
- The default administration is by needle, multiple times a day. Insulin pumps are a more recent invention: insulin-delivering devices that are semi-permanently connected to a diabetic's body

### The Future: Oral Insulin?
- Wouldn't it be great if diabetics could take insulin orally? This is an active area of research but historically the roadblock is getting insulin through the stomach's thick lining

## Our Dataset: Auralin and Novodra Trials
- We will be looking at the phase 2 clinical trial data of 350 patients for a new, innovative oral insulin called Auralin, a proprietary capsule that can solve this stomach lining problem

- Phase 2 trials are intended to:
    * Test the efficacy and the dose response of a drug
    * Identify adverse reactions

- In this trial, half of the patients (175) are being treated with Auralin and the other 175 with a popular injectable insulin called Novodra
- By comparing key metrics between these two drugs, we can determine if Auralin is effective
- The most important metric is the HbA1c level, and specifically the HbA1c change. HbA1c is a property of the blood that measures how well your blood sugar levels have been controlled over the past few months, with higher levels being bad. If Auralin, the new oral insulin, can reduce HbA1c levels from some standard pretrial baseline by a similar amount to the injectable insulin Novodra (say both decrease HbA1c levels from `7.9%` to `7.4%`, a `0.5%` drop, or even if Auralin only achieves a `0.4%` drop), we've got ourselves a major medical breakthrough and a dramatic quality of life improvement for diabetics all over the world

## Why do we need data cleaning?
- Healthcare data is notorious for its errors and disorganization
- For example, human errors during patient registration mean we can have:
    * duplicate data
    * missing data
    * inaccurate data

- We'll take the first step in fixing these issues by assessing this dataset's quality and tidiness, and then clean all of these issues using Python and pandas

__DISCLAIMER: This Data Isn't "Real"__

## Unclean Data: Dirty vs Messy
- __Dirty data__ which has issues with its content is often called __low quality__ data and can include things like inaccurate data, corrupted data and duplicate data
- __Messy data__ has issues with its structure. It is often referred to as __untidy__

- Tidy data means each variable forms a column, each observation forms a row and each type of observational unit forms a table. Any other arrangement is messy
- We will assess both dirty and messy data. The goal is to start distinguishing between the two

## Assessment: Types vs Steps
### Types of Assessment
- There are two types or styles of assessing your data: visual and programmatic
- __Visual assessment__ is just opening the data and looking through it in its entirety
- __Programmatic assessment__ uses code to view specific parts of the data, like using functions or methods to summarize the data

### Steps to Assessing Data
- Regardless of the type of assessment, assessing data can be broken into two steps:
    * detecting an issue
    * documenting that issue

- When documenting an issue, you don't have to specify how to fix it; that is part of the cleaning step in the data wrangling framework

## Documentation
- Why should we first document unclean issues we observe rather than just write what we need to do to fix the issues?
- When your data's issues get complicated, writing how to fix each can get confusing, lengthy and time-consuming. It can get overwhelming trying to think of how to clean something complicated immediately after documenting it

- If you are separating assessing and cleaning steps of the data wrangling process, writing only observations as a first step is good practice
- If you choose to assess an issue then immediately clean that issue, you can skip the observation and go straight to defining how to clean it

## Visual Assessment
- Involves looking at your data in its entirety in whatever program you like
- Visual assessment can be directed or non-directed
- **Directed** means looking through each table within a Jupyter notebook or a spreadsheet program like Google Sheets or MS Excel. Sometimes pandas is the only option because some datasets are so large that spreadsheet programs crash when trying to open them

- **Non-directed** is when you look at different pieces of tables, such as scrolling aimlessly and stumbling upon issues, then dialling in on something once you have more of a clue about what issue you've spotted. At that point you can use pinpointed assessment, whether visual or programmatic

- Visual assessment is great for getting acquainted with the dataset and understanding what it is all about, like acquiring a mental picture of it. Also, a lot of the time assessing is driven by the problem you want to solve, like checking the values in the columns and rows you plan on using in your analysis

## Visual Assessment: Acquaint Yourself with the Data

```python
import pandas as pd
from IPython.display import display
```

```python
patients = pd.read_csv('datasets/patients.csv')
treatments = pd.read_csv('datasets/treatments.csv')
adverse_reactions = pd.read_csv('datasets/adverse_reactions.csv')
```

### Assess

```python
# display the patients table
display(patients)
```

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7


patients | columns
-------- | -------
patient_id | the unique identifier for each patient in the Master Patient Index (i.e. patient database) of the pharmaceutical company that is producing Auralin
assigned_sex | the assigned sex of each patient at birth (male or female)
given_name | the given name (i.e. first name) of each patient
surname | the surname (i.e. last name) of each patient
address | the main address for each patient
city | the corresponding city for the main address of each patient
state | the corresponding state for the main address of each patient
zip_code | the corresponding zip code for the main address of each patient
country | the corresponding country for the main address of each patient (all United States for this clinical trial)
contact | phone number and email information for each patient
birthdate | the date of birth of each patient (month/day/year). The inclusion criteria for this clinical trial is age >= 18 (there is no maximum age because diabetes is a growing problem among the elderly population)
weight | the weight of each patient in pounds (lbs)
height | the height of each patient in inches (in)
bmi | the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m² where kg is a person's weight in kilograms and m² is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. The inclusion criteria for this clinical trial is 16 <= BMI <= 38.
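
Because the table stores weight in pounds and height in inches, the BMI formula needs a unit conversion first. The following is a minimal sketch of that calculation; the function names are made up for illustration, and the inclusion range is read as 16 to 38 inclusive.

```python
LB_TO_KG = 0.45359237   # one pound in kilograms
IN_TO_M = 0.0254        # one inch in metres


def bmi_from_imperial(weight_lb: float, height_in: float) -> float:
    """Return BMI (kg / m^2) from a weight in pounds and a height in inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2


def meets_bmi_criteria(bmi: float, low: float = 16.0, high: float = 38.0) -> bool:
    """Check the trial's BMI inclusion range, assumed to be low <= BMI <= high."""
    return low <= bmi <= high


# Zoe Wellish from the patients table: 121.7 lbs, 66 in, recorded BMI 19.6
zoe_bmi = bmi_from_imperial(121.7, 66)
print(round(zoe_bmi, 1), meets_bmi_criteria(zoe_bmi))
```

The recomputed value rounds to the BMI recorded for this patient, which suggests the `bmi` column was derived from `weight` and `height` in this way.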

```python
# display the treatments table
treatments
```

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.20,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
...,...,...,...,...,...,...,...
275,albina,zetticci,45u - 51u,-,7.93,7.73,0.20
276,john,teichelmann,-,49u - 49u,7.90,7.58,
277,mathea,lillebø,23u - 36u,-,9.04,8.67,0.37
278,vallie,prince,31u - 38u,-,7.64,7.28,0.36


350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before.  All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn’t enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:
- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

`treatments` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial
- **auralin**: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) *and* the ending median daily dose of insulin at the end of the 24 weeks of treatment measured over the 24th week of treatment (the number after the dash). Both are measured in units (shortform 'u'), which is the [international unit](https://en.wikipedia.org/wiki/International_unit) of measurement and the standard measurement for insulin.
- **novodra**: same as above, except for patients that continued treatment with Novodra
- **hba1c_start**: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The [HbA1c test](https://depts.washington.edu/uwcoe/healthtopics/diabetes.html) measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
- **hba1c_end**: the patient's HbA1c level at the end of the last week of treatment
- **hba1c_change**: the change in the patient's HbA1c level from the start of treatment to the end, i.e., `hba1c_start` - `hba1c_end`. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).

```python
# display the adverse_reactions table
adverse_reactions.style
```

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation
5,jasmine,sykes,hypoglycemia
6,louise,johnson,hypoglycemia
7,albinca,komavec,hypoglycemia
8,noe,aranda,hypoglycemia
9,sofia,hermansen,injection site discomfort


`adverse_reactions` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with Auralin and patients treated with Novodra)
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with Auralin and patients treated with Novodra)
- **adverse_reaction**: the adverse reaction reported by the patient

Additional useful information:
- [Insulin resistance varies person to person](http://www.tudiabetes.org/forum/t/how-much-insulin-is-too-much-on-a-daily-basis/9804/5), which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
- It is important to test drugs and medical products in the people they are meant to help. People of different age, race, sex, and ethnic group must be included in clinical trials. This [diversity](https://www.clinicalleader.com/doc/an-fda-perspective-on-patient-diversity-in-clinical-trials-0001) is reflected in the `patients` table.
- Ensuring column names are descriptive enough is an important step in acquainting yourself with the data. 'Descriptive enough' is subjective. Ideally you want short column names (so they are easier to type and read in code form) but also fully descriptive. Length vs. descriptiveness is a tradeoff and common debate (a [similar debate](https://softwareengineering.stackexchange.com/questions/176582/is-there-an-excuse-for-short-variable-names) exists for variable names). The *auralin* and *novodra* column names are probably not descriptive enough, but you'll address that later so don't worry about that for now.

## Data Quality Using Visual Assessment
- Finding a few data quality issues through visual assessment
- The issues will be documented in the cell below and grouped according to quality or tidiness

1. __hba1c change__
- Assessment is always guided by what you need to analyze. For this clinical trial data, our key metric is the change in `hba1c`
- It's important that the data in this column is clean, but some entries are empty (NaN, i.e. missing values)

    _treatments table missing hba1c_changes_
    
2. __zip codes__
- The entries in the `zip_code` column in the `patients` table have decimals. Some zip codes have five digits before the decimal and others have four. This data was probably imported from a spreadsheet that typed the zip codes as numbers, adding the decimal and lopping off the leading zero
- The two issues are the data type and the inconsistency of the data

- [Is it a good idea to use an integer column for storing US zip codes in a database?](https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas)

3. __Height__
- In the `patients` table, the recorded `height` for Tim Neudorf is 27 inches, or two feet and three inches, which is inconsistent with his weight and body mass index
- This is important in terms of the clinical trial because of the reporting of average metrics, such as average height and weight.

4. __States__
- In the `patients` table, the `state` column uses both full state names and abbreviations to represent the states
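
Although these issues were spotted by eye, each one can be confirmed with a few lines of pandas. The sketch below uses three rows copied from the `patients` table above; the thresholds (five zip digits, two-letter state codes, a BMI difference of more than 1) are assumptions chosen for illustration.

```python
import pandas as pd

# Rows copied from the patients table displayed above (relevant columns only).
sample = pd.DataFrame({
    "given_name": ["Zoe", "Liêm", "Tim"],
    "state": ["California", "NJ", "AL"],
    "zip_code": [92390.0, 7095.0, 36303.0],
    "weight": [121.7, 220.9, 192.3],   # lbs
    "height": [66, 70, 27],            # inches
    "bmi": [19.6, 31.7, 26.1],
})

# Zip codes: count the digits left of the decimal point; US zip codes need five.
# (The full table has missing zip codes, which would need dropping first.)
zip_digits = sample["zip_code"].astype(int).astype(str).str.len()
short_zip = sample.loc[zip_digits < 5, "given_name"].tolist()

# States: two-letter entries are abbreviations, anything longer is a full name.
full_state = sample.loc[sample["state"].str.len() > 2, "given_name"].tolist()

# Height: recompute BMI from weight and height, then compare with the recorded BMI.
recomputed = (sample["weight"] * 0.45359237) / (sample["height"] * 0.0254) ** 2
bad_height = sample.loc[(recomputed - sample["bmi"]).abs() > 1, "given_name"].tolist()

print(short_zip, full_state, bad_height)
```

With a height of 72 inches instead of 27, Tim Neudorf's recomputed BMI rounds to his recorded 26.1, which supports reading the value as transposed digits.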

## Quality
- _treatments table - missing hba1c changes_
- _patients table - zip code is a float not a string, and zip code has four digits sometimes_
- _patients table - Tim Neudorf height is 27 instead of 72_
- _patients table - full state names sometimes, abbreviations other times_

## Tidiness

## Assessing vs Exploring
- __Data wrangling__ is about:
    - __Gathering__ the right data
    - __Assessing__ the data's quality and structure
    - __Modifying__ the data to make it clean

- Assessing and modifying will not make your analysis, visualizations or models better. They just make them work

- __Exploratory Data Analysis__ is about:
    - **Exploring** the data with simple visualizations that summarize the data's main characteristics
    - **Augmenting** the data, for example by removing outliers and feature engineering

#### Assessing
- Assessing is everything we identified above. It also includes identifying structural issues that make analysis difficult
- The discovery of these data quality issues ensures that the analysis can be executed, which for this clinical trial data includes calculating average patient metrics (e.g. age, weight, height and BMI) and calculating the confidence interval for the difference in HbA1c change means between Novodra and Auralin patients

#### Exploring
- In the context of this dataset, **exploring** might include using summary statistics like `count` on the state column or `mean` on the weight column to see if patients from certain states or of certain weights are more likely to have diabetes, which we can use to exclude certain patients from the analysis and make it less biased

- In the context of a clinical trial, **exploring** is less likely to happen given that clinical trials are expensive and include a lot of pre-planning. So exploring this dataset would likely happen before the clinical trial was conducted

## Quality: Visual Assessment 2
### More Data Quality Issues
1. The `given_name` for the patient with the `patient_id` of 9 (the name Dsvid doesn't seem right)
2. `u` next to the start dose and end dose in the `auralin` and `novodra` columns (will we be able to do anything with those values if a `u` is next to each?)
3. Lowercase names in the `treatments` and `adverse_reactions` tables
4. 280 names in the treatments table (350 records should be provided.)

- The letter `u` attached to the end means that pandas will not be able to interpret the values as floats or integers, which is required for calculations (the start and end doses should actually be in separate columns, but we'll handle that later). This could have arisen because the data was transferred from paper to computer using something like optical character recognition

- The `given_name` and `surname` in the `treatments` and `adverse_reactions` tables are all lowercase, but the names in the `patients` table start with uppercase. This will be an issue when we later join these tables

- There are 280 rows, with the last index being 279, but each treatment arm had 175 patients (175 for the Auralin arm and 175 for the Novodra arm), so there should be 350. We are missing some data. Finding where that data lives is a separate issue (we'll deal with that later)

- [Optical Character Recognition](https://pdf.abbyy.com/learning-center/what-is-ocr/)
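
To see why the lowercase names matter for joining, here is a small sketch with one name that appears in both the `patients` and `adverse_reactions` tables shown in these notes. It assumes the two rows refer to the same person, and it lowercases a copy only to demonstrate the mismatch, not as the cleaning step itself.

```python
import pandas as pd

# One patient whose name appears in both tables shown in this section.
patients_sample = pd.DataFrame(
    {"patient_id": [270], "given_name": ["Flavia"], "surname": ["Fiorentino"]}
)
reactions_sample = pd.DataFrame(
    {"given_name": ["flavia"], "surname": ["fiorentino"], "adverse_reaction": ["cough"]}
)

# Joining on the raw names finds nothing because "Flavia" != "flavia".
raw_join = patients_sample.merge(reactions_sample, on=["given_name", "surname"])

# Comparing lowercased keys (on a copy) recovers the match.
lowered = patients_sample.assign(
    given_name=patients_sample["given_name"].str.lower(),
    surname=patients_sample["surname"].str.lower(),
)
normalized_join = lowered.merge(reactions_sample, on=["given_name", "surname"])

print(len(raw_join), len(normalized_join))
```

An inner join on the raw names silently drops every row, so the inconsistency would not raise an error; it would just produce an empty or incomplete result.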

## Quality
#### patients table
- _zip code is a float not a string_
- _zip code has four digits sometimes_
- _Tim Neudorf height is 27 instead of 72_
- _full state names sometimes, abbreviations other times_
- _typo in given name: Dsvid Gustafsson_

#### treatments table
- _treatments table missing hba1c changes_
- _the letter u in starting and ending doses for auralin and novodra_
- _lowercase given names and surnames_
- _missing records (280 instead of 350)_

#### adverse_reactions table
- _lowercase given names and surnames_


## Tidiness

## Data Quality Dimensions
- Every dirty dataset is dirty in its own unique way. Trying to list every quality issue is therefore futile, but we can categorize them
- Categories of data quality are called data quality dimensions. The four main data quality dimensions are:
    1. __Completeness__ - Do we have all the records that we should? Do we have missing records or not? Are there specific rows, columns or cells missing?
    2. __Validity__ - We have the records but they are not valid, i.e. they don't conform to a defined schema
    - A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables)

    3. __Accuracy__ - Inaccurate data is wrong data that is valid. It adheres to a defined schema but it is still incorrect. For example, a patient's weight that is 5 lbs too heavy because the scale was faulty

    4. __Consistency__ - Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e. a standard format in columns that represent the same data across tables and/or within tables, is desired

    - These are listed in decreasing order of severity, meaning the one listed first (completeness) is the most important

#### More information
- [How to improve data quality](https://www.informit.com/articles/article.aspx?p=399325&seqNum=3)
- [The Seven Dimensions of Data Quality](https://www.youtube.com/watch?v=dPsx8_Fcr-U)

## Identifying issues
- The typo in David is an __accuracy__ issue. There's nothing illegal about having the name `Dsvid`, so it's not invalid, it's just inaccurate
- The letter `u` in the dosage info is a __validity__ issue. `23u` is not a valid dose; the valid dose is `23` and the unit of measurement is `u`
- The lowercase given names and surnames in the `treatments` and `adverse_reactions` tables are a __consistency__ issue. It's not necessarily a big deal that the given names and surnames are lowercase; it's just that in the `patients` table the names are capitalized, so we'd run into issues when joining these tables based on name
- Missing records in the `treatments` table (280 instead of 350) is a straightforward __completeness__ issue
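
The validity problem with the dose values can be demonstrated directly. This sketch takes three strings as they appear in the `auralin` column and tries to parse the starting dose as a number; stripping the `u` here is only to show the contrast, not a proposed cleaning step.

```python
import pandas as pd

# Dose strings as they appear in the auralin column of the treatments table.
doses = pd.Series(["41u - 48u", "33u - 36u", "45u - 51u"])

# Take the starting dose (the part before the dash) exactly as it is stored.
start_raw = doses.str.split(" - ").str[0]

# errors="coerce" turns anything that is not a valid number into NaN.
as_stored = pd.to_numeric(start_raw, errors="coerce")

# The same values without the trailing unit parse cleanly.
without_unit = pd.to_numeric(start_raw.str.rstrip("u"), errors="coerce")

print(as_stored.isna().all(), without_unit.tolist())
```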

## Programmatic Assessment
### Using Code to Assess Data
- Programmatic assessment uses functions and methods to reveal something about your data's quality and tidiness
- For example, in pandas we can call the `.info()` method to print a concise summary of the DataFrame

### Programmatic assessment is driven by the problem you want to solve
- Looking at the summary of the `treatments` DataFrame returned by `.info()`, we can see that there are only 171 `hba1c_change` entries while there are 280 entries for the other columns. That indicates that we are missing some data

```python
treatments.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   given_name    280 non-null    object
 1   surname       280 non-null    object
 2   auralin       280 non-null    object
 3   novodra       280 non-null    object
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB
```
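
The gap between 280 and 171 non-null values means 109 missing `hba1c_change` entries. A more direct way to count and locate them is `.isnull()`; the sketch below applies it to the first five rows of the `treatments` table as displayed earlier.

```python
import numpy as np
import pandas as pd

# The first five rows of the treatments table as displayed earlier.
treatments_sample = pd.DataFrame({
    "surname": ["jindrová", "richardson", "takenaka", "gormanston", "montez"],
    "hba1c_start": [7.63, 7.56, 7.68, 7.97, 7.78],
    "hba1c_end": [7.20, 7.09, 7.25, 7.62, 7.46],
    "hba1c_change": [np.nan, 0.97, np.nan, 0.35, 0.32],
})

# Count missing values per column; only hba1c_change is non-zero here.
missing_per_column = treatments_sample.isnull().sum()

# List the rows where the key metric is missing.
missing_rows = treatments_sample.loc[
    treatments_sample["hba1c_change"].isnull(), "surname"
]

print(missing_per_column["hba1c_change"], missing_rows.tolist())
```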


## Non-Directed Programmatic Assessment Can Also Be Useful
- Non-directed programmatic assessment means randomly typing in programmatic assessments without any directed goal in mind. The `.sample()` method in pandas displays a random sample of entries
    
    ```python
    df.sample()  # returns one entry
    df.sample(5) # returns 5 entries
    ```

```python
patients.sample()
```

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,77020.0,United States,KarenJakobsen@jourrapide.com1 979 203 0438,11/25/1962,185.2,67,29.0


### Assess
These are the programmatic assessment methods in pandas that you will probably use most often:

* .head (DataFrame and Series)
* .tail (DataFrame and Series)
* .sample (DataFrame and Series)
* .info (DataFrame only)
* .describe (DataFrame and Series)
* .value_counts (Series only)
* Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc)

Try them out below and keep their results in mind. Some will come in handy later in the lesson.

Check out the [pandas API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) for detailed usage information.

```python
adverse_reactions.describe()
```

Unnamed: 0,given_name,surname,adverse_reaction
count,34,34,34
unique,34,33,6
top,berta,johnson,hypoglycemia
freq,1,2,19


```python
adverse_reactions['adverse_reaction'].value_counts()
```

```
hypoglycemia                 19
injection site discomfort     6
headache                      3
cough                         2
throat irritation             2
nausea                        2
Name: adverse_reaction, dtype: int64
```

```python
# count the patients whose city is New York
print(len(patients.loc[patients.city == "New York"]))
```

```
18
```


```python
# select the records in the patients table for patients from the state of New York,
# which is recorded both as "NY" and as "New York"
patients.loc[(patients.state == "NY") | (patients['state'] == "New York")].head()
```

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
9,10,female,Sophie,Cabrera,3303 Anmoore Road,New York,New York,10011.0,United States,SophieCabreraIbarra@teleworm.us1 718 795 9124,12/3/1930,194.7,64,33.4
22,23,male,Manchu,Su,1092 Deans Lane,Pleasantville,NY,10570.0,United States,914-745-6108ManchuSu@einrot.com,1/19/1936,130.7,65,21.7
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
35,36,female,Kamila,Pecinová,3558 Longview Avenue,New York,New York,10004.0,United States,718-501-0503KamilaPecinova@dayrep.com,12/23/1985,198.9,62,36.4


```python
# selecting the rows labelled 2 to 5 (.loc includes both endpoints)
display(treatments.loc[2: 5])
```

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
5,jasmine,sykes,-,42u - 44u,7.56,7.18,0.38


```python
# selecting rows at positions 1 to 4 and columns at positions 2 to 4 (.iloc excludes the end)
display(patients.iloc[1: 5, 2: 5])
```

Unnamed: 0,given_name,surname,address
1,Pamela,Hill,2370 University Hill Road
2,Jae,Debord,1493 Poling Farm Road
3,Liêm,Phan,2335 Webster Street
4,Tim,Neudorf,1428 Turkey Pen Lane


```python
# selecting the rows at index positions 0, 2, 4 and 7
display(adverse_reactions.iloc[[0, 2, 4, 7]])
```

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
2,joseph,day,hypoglycemia
4,manouck,wubbels,throat irritation
7,albinca,komavec,hypoglycemia


## Using Programmatic Assessment to Find Quality Issues
- `.info()` prints a concise summary of the DataFrame, including the number of entries, number of columns, data type of each column and memory usage for the entire DataFrame
- We can see a completeness issue in the `patients` table: we have 503 entries for most columns but 491 for the address-related columns
- We can check for missing data with `.isnull()`, which returns a boolean mask marking each missing cell. Combined with `.any(axis=1)` and `.loc`, it selects the rows with missing data

```python
patients.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   patient_id    503 non-null    int64
 1   assigned_sex  503 non-null    object
 2   given_name    503 non-null    object
 3   surname       503 non-null    object
 4   address       491 non-null    object
 5   city          491 non-null    object
 6   state         491 non-null    object
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object
 9   contact       491 non-null    object
 10  birthdate     503 non-null    object
 11  weight        503 non-null    float64
 12  height        503 non-null    int64
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB
```


```python
treatments.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   given_name    280 non-null    object
 1   surname       280 non-null    object
 2   auralin       280 non-null    object
 3   novodra       280 non-null    object
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB
```


```python
adverse_reactions.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 944.0+ bytes
```


```python
# rows with at least one missing value
patients.loc[patients.isnull().any(axis=1)]
```

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2
264,265,female,Wafiyyah,Asfour,,,,,,,11/3/1989,158.6,63,28.1
269,270,female,Flavia,Fiorentino,,,,,,,10/9/1937,175.2,61,33.1
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4


```python
# rows where the address is missing
patients[patients['address'].isnull()]
```

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2
264,265,female,Wafiyyah,Asfour,,,,,,,11/3/1989,158.6,63,28.1
269,270,female,Flavia,Fiorentino,,,,,,,10/9/1937,175.2,61,33.1
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4
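
The two selections above rely on what `.isnull()` returns at each step. The following sketch breaks that down on two patients from the tables above, one complete record and one without address details; only a few columns are kept to stay readable.

```python
import numpy as np
import pandas as pd

# Two patients from the tables above: one complete record, one without an address.
sample = pd.DataFrame({
    "given_name": ["Zoe", "Lalita"],
    "address": ["576 Brown Bear Drive", np.nan],
    "city": ["Rancho California", np.nan],
    "weight": [121.7, 143.4],
})

mask = sample.isnull()               # DataFrame of True/False, same shape as sample
rows_with_gaps = mask.any(axis=1)    # Series: True if any cell in the row is missing
missing_counts = mask.sum()          # Series: number of missing cells per column

print(sample.loc[rows_with_gaps, "given_name"].tolist())
print(missing_counts.to_dict())
```

Summing the mask per column gives the same information as comparing the non-null counts in `.info()`, but as numbers you can filter or sort.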


### Another Quality Issue: Data Types
- Different programs represent data in different ways. pandas uses the **object** data type to represent strings. This is different from the categorical data type, which is **category** in pandas. The category data type has a limited number of possible values
- In the `patients` table, `assigned_sex` and `state` are `object` types, but they should be `category` because there is a limited range of values for each.
- Similarly, `zip_code` is a `float` but it needs to be a string, so it should be an `object` type, and `birthdate` should be a `datetime` type

- There are also issues in the `treatments` table: the `auralin` and `novodra` columns should eventually be integers
- Data types are important to fix because pandas offers special calculations and summaries for categorical, numerical and datetime data. If columns are miscategorized you can't use them. For example, if `birthdate` is kept as `object` instead of `datetime`, we can't take advantage of pandas' time series and datetime functionality, of which there is plenty. A simple case would be calculating age from a birth date

- All of these erroneous data types are __validity issues__: they don't conform to the defined schema of what this table should be
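The dtype mismatches described above can be illustrated on a tiny stand-in table. This is a minimal sketch, not the course's cleaning code: `demo` is a hypothetical three-row frame that borrows a few values from the outputs in these notes, and zero-padding ZIP codes to five digits is an assumption about the intended fix.

```python
import pandas as pd

# Hypothetical three-row stand-in for the `patients` table
demo = pd.DataFrame({
    "assigned_sex": ["female", "female", "female"],
    "zip_code": [33610.0, 2142.0, None],
    "birthdate": ["3/27/1982", "3/31/1926", "8/14/1950"],
})

# A limited set of values: store as category instead of object
demo["assigned_sex"] = demo["assigned_sex"].astype("category")

# Parse month/day/year strings into real datetimes
demo["birthdate"] = pd.to_datetime(demo["birthdate"], format="%m/%d/%Y")

# ZIP codes are labels, not numbers: turn floats into zero-padded strings
# (assumption: US ZIP codes are five digits) and keep missing values missing
demo["zip_code"] = demo["zip_code"].map(
    lambda z: f"{int(z):05d}" if pd.notna(z) else None
)

# Datetime functionality is now available, e.g. extracting the birth year
birth_years = demo["birthdate"].dt.year
print(demo.dtypes)
print(birth_years.tolist())
```

Once `birthdate` is a `datetime` column, the `.dt` accessor exposes date parts directly, which is the kind of functionality that is unavailable while the column is still `object`.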

#### pandas Describe method
- Generates descriptive statistics for the numerical columns in your dataframe, including count, mean, standard deviation, minimum value, 25th, 50th and 75th percentiles, and maximum value
- The stats aren't useful for `patient_id` and `zip_code` but they are useful for `weight`, `height` and `bmi`
- Recall that `hba1c_change` determines the effectiveness of our insulin. In terms of controlling blood sugar, a change of `0.4` is considered a success. So a change above `0.9` is really big and somewhat implausible, especially given that the 75th percentile is already at `0.92` and the maximum is `0.99`
- The gap between the 25th and 50th percentiles is only `0.04`, while it is more than `0.5` between the 50th and 75th
- If we scroll up to the visual assessment of the `treatments` table, we'll see that the `hba1c_change` for the `0.97` entry is calculated incorrectly. It should be `0.47`. This is an __accuracy__ issue

- This data inaccuracy is important to clean because `hba1c_change` is the key metric for our clinical trial. The deemed success or failure of this oral insulin hinges on this variable
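One way to hunt for such inaccuracies programmatically is to recompute the change from the start and end values and compare it with the recorded value. The sketch below assumes `hba1c_change` is meant to equal `hba1c_start - hba1c_end`; the three rows in `demo` are hypothetical values chosen for illustration, not rows copied from the dataset.

```python
import pandas as pd

# Hypothetical rows; the second one mimics a 4 recorded as a 9
demo = pd.DataFrame({
    "hba1c_start": [7.63, 7.56, 7.65],
    "hba1c_end": [7.20, 7.09, 7.27],
    "hba1c_change": [0.43, 0.97, 0.38],
})

# Recompute the change and round to the precision used in the table
recomputed = (demo["hba1c_start"] - demo["hba1c_end"]).round(2)

# Flag rows where the recorded change disagrees with the recomputed one
# (a small tolerance avoids false alarms from floating-point noise)
mismatch = demo[(recomputed - demo["hba1c_change"]).abs() > 0.005]
print(mismatch)
```

Only the row with the recorded `0.97` is flagged, since its recomputed change is `0.47`.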

#### pandas Sample method
- `.sample` returns a random sample of rows from a dataframe
- If we inspect the `contact` column we see there are multiple representations for phone numbers. This is a **consistency** issue introduced at data entry
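To go beyond eyeballing a random sample, the phone formats can be tallied with a few regular expressions. This is a rough sketch: the four strings are `contact` values that appear in the outputs in these notes, while the `PHONE_FORMATS` labels and patterns and the `phone_format` helper are hypothetical and cover only the formats seen so far.

```python
import re
import pandas as pd

# Contact values taken from the outputs shown in these notes
contacts = pd.Series([
    "727-439-7150SaraDMiles@gustr.com",
    "AnnieJAllen@superrito.com1 508 921 6327",
    "JakobCJakobsen@einrot.com+1 (845) 858-7707",
    "johndoe@email.com1234567890",
])

# Hypothetical labels and patterns, ordered from most to least specific
PHONE_FORMATS = {
    "+1 (ddd) ddd-dddd": r"\+1 \(\d{3}\) \d{3}-\d{4}",
    "1 ddd ddd dddd": r"1 \d{3} \d{3} \d{4}",
    "ddd-ddd-dddd": r"\d{3}-\d{3}-\d{4}",
    "dddddddddd": r"\d{10}",
}

def phone_format(contact):
    """Return the label of the first phone pattern found in a contact string."""
    for label, pattern in PHONE_FORMATS.items():
        if re.search(pattern, contact):
            return label
    return "unrecognized"

# On the real table, drop missing contacts first (re.search fails on NaN)
format_counts = contacts.map(phone_format).value_counts()
print(format_counts)
```

Each of the four sample strings lands in a different bucket, which is the inconsistency noted above.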

In [12]:
patients.dtypes

patient_id        int64
assigned_sex     object
given_name       object
surname          object
address          object
city             object
state            object
zip_code        float64
country          object
contact          object
birthdate        object
weight          float64
height            int64
bmi             float64
dtype: object

In [14]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [17]:
patients.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
259,260,female,Sara,Miles,4223 Chestnut Street,Tampa,FL,33610.0,United States,727-439-7150SaraDMiles@gustr.com,3/27/1982,166.5,61,31.5
120,121,female,Nicoline,Østergaard,2836 Boring Lane,San Francisco,CA,94108.0,United States,415-676-8818NicolinePstergaard@superrito.com,12/14/1926,124.5,63,22.1
344,345,female,Sophia,Haugen,4178 Despard Street,Atlanta,GA,30303.0,United States,404-713-3641SophiaHaugen@dayrep.com,6/4/1939,181.1,63,32.1
294,295,female,Annie,Allen,3634 Lyon Avenue,Cambridge,MA,2142.0,United States,AnnieJAllen@superrito.com1 508 921 6327,3/31/1926,159.7,60,31.2
179,180,male,Dominik,Grunewald,3574 Park Boulevard,Marshalltown,IA,50158.0,United States,641-753-5678DominikGrunewald@cuvox.de,11/20/1935,143.0,71,19.9


### Programmatic Assessment 2

In [26]:
patients.surname.value_counts()

Doe            6
Jakobsen       3
Taylor         3
Ogochukwu      2
Tucker         2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: surname, Length: 466, dtype: int64

In [27]:
patients.address.value_counts()

123 Main Street             6
2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
3094 Oral Lake Road         1
                           ..
1066 Goosetown Drive        1
4291 Patton Lane            1
4643 Reeves Street          1
174 Lost Creek Road         1
3652 Boone Crockett Lane    1
Name: address, Length: 483, dtype: int64

In [31]:
# revealing duplicates in the address column
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [32]:
# sort the patients table by the weight column, heaviest first
patients.sort_values(by=['weight'], ascending=False)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
485,486,male,Trifon,Izmailov,3697 Drainer Avenue,Fort Walton Beach,FL,32548.0,United States,TrifonIzmailov@fleckens.hu1 850 659 0417,2/15/1973,255.9,74,32.9
118,119,male,Adib,Ghanem,3457 Bridge Avenue,Delcambre,LA,70528.0,United States,337-685-4885AdibMutazzGhanem@fleckens.hu,12/31/1967,254.5,72,34.5
283,284,male,Nwachukwu,Nebeolisa,2873 John Calvin Drive,Chicago,IL,60605.0,United States,NwachukwuNebeolisa@cuvox.de+1 (708) 845-2053,3/10/1986,245.5,68,37.3
144,145,male,Mile,Stanić,4640 Windy Ridge Road,Fort Wayne,IN,46804.0,United States,260-591-5755MileStanic@dayrep.com,10/31/1961,244.9,71,34.2
61,62,male,Alan,Milne,707 Gateway Avenue,Bakersfield,California,93301.0,United States,AlanMilne@dayrep.com1 661 779 6795,4/29/1962,244.9,69,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
317,318,female,Nancy,Parker,4605 Hall Street,Las Vegas,NV,89110.0,United States,NancyJParker@gustr.com+1 (702) 438-5138,3/21/1945,106.0,63,18.8
74,75,female,Hanka,Gegič,192 Patton Lane,Tulsa,OK,74106.0,United States,918-975-7594HankaGegic@fleckens.hu,1/20/1926,103.2,61,19.5
335,336,female,Lixue,Hsueh,1540 Overlook Drive,Crawfordsville,IN,47933.0,United States,765-359-0147LixueHsueh@dayrep.com,3/29/1925,102.7,59,20.7
459,460,female,Idalia,Moore,4380 Grim Avenue,San Diego,California,92073.0,United States,619-710-6286IdaliaEMoore@cuvox.de,10/26/1993,102.1,61,19.3


In [33]:
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

In [34]:
treatments.auralin.isnull().sum()

0

In [35]:
sum(treatments.novodra.isnull())

0

1. __value_counts on surname and address__
- `value_counts` returns the count of each unique value in a column. There are 6 surnames of `Doe` as well as 6 addresses of `123 Main Street`
- These may be duplicates, which we can check using `.duplicated`. We find that there are several John Does who live at 123 Main Street, New York, NY, ZIP code 12345, with the email `johndoe@email.com`. This is a validity issue because this data doesn't conform to the defined schema of one record per patient.

- The `.duplicated` method reveals another data quality issue. One offending record is Jake Jakobsen's: its address is a genuine duplicate, unlike the records that are flagged only because their address is **NaN**. A genuinely duplicated address means that two records in the table share the same address. This is a **validity** issue. We can find similar issues with "Pat Gersten" and "Sandy Taylor".

- 'Elisabeth Knudsen' also appears in the `.duplicated` output, but this isn't a duplication issue. She is flagged only because her demographic information is filled with NaN entries, and those NaN entries match other patients' records with missing address, city, state, etc.
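The difference between genuine duplicates and rows flagged only because of NaN can be made explicit in code. A minimal sketch on a hypothetical five-row frame `demo` (names and addresses borrowed from the outputs above):

```python
import pandas as pd

# Hypothetical stand-in for the `patients` table
demo = pd.DataFrame({
    "surname": ["Jakobsen", "Jakobsen", "Eldarkhanov", "Knudsen", "Miles"],
    "address": ["648 Old Dear Lane", "648 Old Dear Lane", None, None,
                "4223 Chestnut Street"],
})

# Default behaviour: every repeat after the first is flagged,
# and missing values count as repeats of each other
flagged = demo[demo["address"].duplicated()]

# Keep only genuine duplicates: non-missing addresses that occur more than once
# (keep=False marks every copy, including the first occurrence)
real_dupes = demo[demo["address"].notna()
                  & demo["address"].duplicated(keep=False)]

print(flagged)
print(real_dupes)
```

With `keep=False` both Jakobsen rows are shown side by side, which makes it easier to decide which record to keep.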

2. __`.sort_values()` on the `weight` column__
- The minimum value of `weight` in the `patients` table is 48.8, which looks very low for a weight in pounds. We can check this by looking at the `height` and `bmi` entries for this patient. We see that 48.8 is actually in kilograms instead of pounds

- `2.20462` is the conversion factor from kilograms to pounds, and the BMI formula is 703 times weight in pounds divided by height in inches squared
- Converting 48.8 kilograms to pounds yields a BMI of about 19.1, which matches the BMI recorded in the dataset. So the 48.8 is actually right, it's just in the wrong units
- This is a **consistency** issue. It is important to clean because we need to report average metrics like weight, height and more for each treatment arm of the clinical trial

In [8]:
weight_lbs = patients[patients.surname == 'Zaitseva'].weight * 2.20462
weight_lbs

210    107.585456
Name: weight, dtype: float64

In [7]:
weight_lbs = patients[patients.surname == 'Zaitseva'].weight * 2.20462
height_in = patients[patients.surname == 'Zaitseva'].height
bmi_check = 703 * weight_lbs / (height_in * height_in)
bmi_check
patients[patients.surname == 'Zaitseva'].bmi

210    19.1
Name: bmi, dtype: float64
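The same check can be written as a small standalone calculation. Note the assumption: the patient's height is not shown in the output above, so the 63 inches used here is an illustrative value only; the weight (48.8) and recorded BMI (19.1) come from the outputs above.

```python
# Values from the outputs above
weight_recorded = 48.8   # stored in the `weight` column
bmi_recorded = 19.1      # stored in the `bmi` column

# ASSUMPTION: height is not shown in these notes; 63 inches is illustrative
height_in = 63

KG_TO_LBS = 2.20462

def bmi_from_lbs(weight_lbs, height_inches):
    """BMI from imperial units: 703 * weight (lb) / height (in) squared."""
    return 703 * weight_lbs / height_inches ** 2

# Hypothesis 1: the recorded weight is already in pounds
bmi_if_lbs = round(bmi_from_lbs(weight_recorded, height_in), 1)

# Hypothesis 2: the recorded weight is in kilograms
bmi_if_kg = round(bmi_from_lbs(weight_recorded * KG_TO_LBS, height_in), 1)

print(bmi_if_lbs, bmi_if_kg, bmi_recorded)
```

Under the assumed height, only the kilogram hypothesis reproduces the recorded BMI, which supports reading 48.8 as kilograms.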

3. __sum of `isnull` on the auralin and novodra__
- The output suggests that there are zero null entries in both the `auralin` and `novodra` columns, but if we scroll up to our visual assessment of the `treatments` table, the entries with dashes should be null

- This reveals a common error in datasets: missing values represented as something else, like a dash, a slash, N.A or none. These dashes aren't picked up as null values, which can be problematic when doing calculations on the data. Misrepresented missing values are a __validity__ issue
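A quick way to confirm this is to count the placeholder directly. The sketch below uses a hypothetical three-row stand-in for the `treatments` table and assumes the placeholder is exactly a single `-` and that doses look like `41u - 48u`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `treatments` table
demo = pd.DataFrame({
    "auralin": ["41u - 48u", "-", "-"],
    "novodra": ["-", "33u - 36u", "37u - 42u"],
})

# isnull() sees ordinary strings, so it reports no missing values
nulls_before = demo["auralin"].isnull().sum()

# Counting the placeholder reveals the hidden missing values
dashes = (demo["auralin"] == "-").sum()

# Replacing the placeholder with NaN makes pandas aware of them
nulls_after = demo["auralin"].replace("-", np.nan).isnull().sum()

print(nulls_before, dashes, nulls_after)
```

When loading from a file, the `na_values` parameter of `pd.read_csv` can declare such placeholders as missing at read time.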

## Tidiness: Visual Assessment
### Tidy Data
- Data can be low quality (dirty) and it can be untidy (messy). With quality we focused on content issues; with tidiness we focus on structural issues

- There are three requirements for tidiness:
    * Each variable forms a column
    * Each observation forms a row
    * Each type of observational unit forms a table

- Having two variables in one column, as is the case with `contact` in the `patients` table, violates the "each variable forms a column" requirement. There's a phone number and an email address, which should be split into two columns. Document the issue
- The `treatments` table is untidy too. The `auralin` and `novodra` columns are the offenders: every entry in each column holds dosage information
- They violate only the first rule of tidiness, that each variable forms a column. There are three variables here: treatment (auralin or novodra), start dose and end dose. Since there are three variables there should be three columns, but there are currently two and both contain two variables. The `auralin` column contains the start and end dose for patients treated with `auralin`, and the `novodra` column contains the start and end dose for patients treated with `novodra`

### Where is the missing data?
- If these two columns both contain two variables, start and end dose each, where is that missing third variable?
- The third variable treatment is hidden in the column headers. Column headers in this case are values, not variable names. Instead of the `auralin` and `novodra` columns there should be three columns.

    * `treatment` which contains values Auralin or Novodra
    * `start_dose`
    * `end_dose`

- For example, Veronika Jindrova was treated with Auralin insulin, with a starting dose of 41 units and an ending dose of 48 units. Her data should be: Auralin, 41, 48.
- [Tidying messy datasets](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
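The target structure described above can be sketched with `melt`. This belongs to the cleaning step rather than assessment, and it is only an illustration: `demo` is a hypothetical two-row frame, and the `41u - 48u` dose format, the dash placeholder and the second patient's doses are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the untidy `treatments` table
demo = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Move the column headers (values of the hidden variable) into a column
tidy = demo.melt(id_vars=["given_name", "surname"],
                 var_name="treatment", value_name="dose")

# Drop the rows for the treatment a patient did not receive
tidy = tidy[tidy["dose"] != "-"].reset_index(drop=True)

# Split the combined dose into its two variables
tidy[["start_dose", "end_dose"]] = tidy["dose"].str.split(" - ", expand=True)
tidy = tidy.drop(columns="dose")

# Strip the trailing unit letter and store the doses as integers
for col in ["start_dose", "end_dose"]:
    tidy[col] = tidy[col].str.rstrip("u").astype(int)

print(tidy)
```

The result has one row per patient and one column each for `treatment`, `start_dose` and `end_dose`.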

## Tidiness: Programmatic Assessment
- Programmatic assessment can be handy for the third tidiness requirement, __each type of observational unit forms a table__
- This requirement focuses on the columns of a dataset across all of its tables, if there is more than one. Programmatic assessment can be especially useful if your dataset has lots of columns and/or lots of tables
- Tools such as the `.info()` method give you a quick glance at all the column names across all tables, along with their data types

In [9]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [10]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [11]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 944.0+ bytes


## How many tables should this dataset have for it to be tidy?
- __Take a look at the column headers and think about which headers belong in which table__
- Two tables are needed: `patients` and `treatments`
1. `patients` should have the same columns as the current `patients` table
2. The `adverse_reaction` column of the `adverse_reactions` table should be included in the `treatments` table. The adverse reaction to a treatment belongs to the same type of observational unit as the rest of the `treatments` data
3. The other columns in the `adverse_reactions` table are already present in both the `treatments` and `patients` tables, so they can be eliminated

In [13]:
# finding duplicate column names in the three tables using pandas
all_columns = pd.Series(list(patients) + list(treatments) + list(adverse_reactions))
all_columns[all_columns.duplicated()]

14    given_name
15       surname
21    given_name
22       surname
dtype: object

- The final dataset should be two tables, with `patient_id` as the only column shared between them. It's best practice to use an ID as the primary identifier across tables because an ID won't change, whereas a name could (e.g. a legal name change after marriage). Keeping given name and surname in one table only means that you only have to update those columns in one place
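Folding `adverse_reactions` into `treatments` can be sketched with a left merge. This is an illustration under assumptions: both frames are hypothetical two-row stand-ins, and the join uses the name columns because those are the columns the two tables currently share.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables
treatments_demo = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "hba1c_change": [0.43, 0.47],
})
adverse_demo = pd.DataFrame({
    "given_name": ["elliot"],
    "surname": ["richardson"],
    "adverse_reaction": ["hypoglycemia"],
})

# A left merge keeps every treatment row; patients without a recorded
# reaction get NaN in the new `adverse_reaction` column
merged = treatments_demo.merge(
    adverse_demo, on=["given_name", "surname"], how="left"
)
print(merged)
```

After the merge, the separate `adverse_reactions` table is no longer needed, which leaves the two tables described above.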

### Sources of dirty data
- There are lots of sources of dirty data. Basically, anytime humans are involved, there's going to be dirty data. There are lots of ways in which we touch data we work with.

1. We're going to have user entry errors.
2. In some situations, we won't have any data coding standards, or where we do have standards they'll be poorly applied, causing problems in the resulting data
3. We might have to integrate data where different schemas have been used for the same type of item.
4. We'll have legacy data systems, where data was coded when disk and memory constraints were much more restrictive than they are now. Over time systems evolve. Needs change, and data changes.
5. Some of our data won't have the unique identifiers it should.
6. Other data will be lost in transformation from one format to another.
7. And then, of course, there's always programmer error.
8. And finally, data might have been corrupted in transmission or storage by cosmic rays or other physical phenomena. So hey, one that's not our fault.


## Quality
#### patients table
- _zip code is a float not a string_
- _zip code has four digits sometimes_
- _Tim Neudorf height is 27 instead of 72_
- _full state names sometimes, abbreviations other times_
- _Dsvid Gustafsson (given name misspelled, presumably David)_
- _Missing demographic information [address - contact columns]_
- _Erroneous data types (assigned sex, state, zip_code, birthdate)_
- _Multiple phone number formats_
- _Default John Doe data_
- _Multiple records for Jakobsen, Gersten, Taylor_

#### treatments table
- _treatments table missing hba1c changes_
- _the letter u in starting and ending doses for auralin and novodra_
- _lowercase given names and surnames_
- _missing records (280 instead of 350)_
- _Erroneous data types (auralin and novodra)_
- _Inaccurate hba1c_changes (4s mistaken as 9s)_
- _Nulls represented as dashes (-) in auralin and novodra columns_

#### adverse_reactions table
- _lowercase given names and surnames_


## Tidiness
- _contact column in `patients` table should be split into phone number and email_
- _three variables in two columns in `treatments` table (treatment, start_dose, end_dose)_