# Assessing Data
- Assessing data is the second step in the data wrangling process. When assessing, you're inspecting your dataset for two things:
    * __Data quality issues__ - Data with quality issues has problems with its content, such as missing, duplicate or incorrect data. This is called dirty data
    * __Lack of tidiness__ - Data with structural issues that slow you down when cleaning, analyzing, visualizing or modelling it later. This is called messy data

- You can search for these issues in two ways: visually, by scrolling through the data, and programmatically, using code
- When you detect an issue, document it to make cleaning easier

- When assessing, you are like a detective at work, inspecting your dataset for data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues)

- Assessing is the precursor of cleaning. You can't clean something that you don't know exists

## Dataset: Oral Insulin Phase II Clinical Trial Data
### Learning About Our Dataset
#### Diabetes
- The increasing prevalence of diabetes in the 21st century is a problem. Patients have symptoms like:
    * unusual thirst
    * frequent urination
    * extreme fatigue

- Diabetes can also lead to more serious complications like stroke, blindness, loss of limbs, kidney failure and even heart attack

### Discovery Of Insulin
- Insulin was discovered in the 1920s by Frederick Banting
- Most of the food we eat is turned into glucose, or sugar, for our bodies to use for energy. The pancreas, an organ near the stomach, makes a hormone called insulin to help glucose get into the cells of our bodies. When you have diabetes, the body either doesn't make enough insulin or can't use its own insulin as well as it should, which causes sugar to build up in the body

- With the discovery of insulin, pharmaceutical companies began producing it on a large scale. Although it doesn't cure diabetes, it's one of the biggest discoveries in medicine

### Challenges with insulin
- The default administration is by needle, multiple times a day. Insulin pumps are a recent invention: insulin-delivering devices that are semi-permanently connected to a diabetic's body

### The Future: Oral Insulin?
- Wouldn't it be great if diabetics could take insulin orally? This is an active area of research but historically the roadblock is getting insulin through the stomach's thick lining

## Our Dataset: Auralin and Novodra Trials
- We will be looking at the phase 2 clinical trial data of 350 patients for a new, innovative oral insulin called Auralin, a proprietary capsule that can solve this stomach lining problem

- Phase 2 trials are intended to:
    * Test the efficacy and the dose response of a drug
    * Identify adverse reactions

- In this trial, half of the patients (175) are being treated with Auralin and the other 175 with a popular injectable insulin called Novodra
- By comparing key metrics between these two drugs, we can determine if Auralin is effective
- The most important metric is the HbA1c level, and specifically the HbA1c change. HbA1c is a property of the blood that measures how well your blood sugar levels have been controlled over the past few months; higher levels are worse. If Auralin, the new oral insulin, can reduce HbA1c from a standard pre-trial baseline by about as much as the injectable insulin Novodra (say both lower HbA1c from `7.9%` to `7.4%`, a `0.5%` drop, or Auralin manages even a `0.4%` drop), we've got ourselves a major medical breakthrough and a dramatic quality-of-life improvement for diabetics all over the world

## Why do we need data cleaning?
- Healthcare data is notorious for its errors and disorganization
- For example, human errors during patient registration mean we can have:
    * duplicate data
    * missing data
    * inaccurate data

- We'll take the first step in fixing these issues by assessing this dataset's quality and tidiness, and then we'll clean all of these issues using Python and pandas

__DISCLAIMER: This Data Isn't "Real"__

## Unclean Data: Dirty vs Messy
- __Dirty data__ which has issues with its content is often called __low quality__ data and can include things like inaccurate data, corrupted data and duplicate data
- __Messy data__ has issues with its structure. It is often referred to as __untidy__

- Tidy data means each variable forms a column, each observation forms a row and each type of observational unit forms a table. Any other arrangement is messy
- We will assess both dirty and messy data. The goal is to start distinguishing between the two
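
To make the tidy data rule concrete, here is a minimal sketch. It assumes a toy frame modelled on the first rows of the `treatments` table shown later in these notes; the column names `treatment` and `dose` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Toy data modelled on the treatments table: the drug name (a variable)
# is spread across two column headers, which makes the table messy.
messy = pd.DataFrame({
    "given_name": ["veronika", "elliot", "skye"],
    "auralin": ["41u - 48u", "-", "33u - 36u"],
    "novodra": ["-", "40u - 45u", "-"],
})

# Gather the two drug columns into one 'treatment' column and one 'dose' column.
tidy = messy.melt(id_vars="given_name", var_name="treatment", value_name="dose")

# Keep only the drug each patient actually received ('-' means not taken).
tidy = tidy[tidy["dose"] != "-"].reset_index(drop=True)
print(tidy)
```

In the reshaped frame each row is one observation (a patient on a drug) and each column is one variable, which is the arrangement the definition above describes.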

## Assessment: Types vs Steps
### Types of Assessment
- There are two types or styles of assessing your data: visual and programmatic
- __Visual assessment__ is just opening the data and looking through it in its entirety
- __Programmatic assessment__ uses code to view specific parts of the data, like using functions or methods to summarize the data

### Steps to Assessing Data
- Regardless of the type of assessment, assessing data can be broken into two steps:
    * detecting an issue
    * documenting that issue

- When documenting an issue you don't have to specify how to fix it, which is part of the cleaning step in the data wrangling framework

## Documentation
- Why should we first document unclean issues we observe rather than just write what we need to do to fix the issues?
- When your data's issues get complicated, writing how to fix each can get confusing, lengthy and time-consuming. It can get overwhelming trying to think of how to clean something complicated immediately after documenting it

- If you are separating assessing and cleaning steps of the data wrangling process, writing only observations as a first step is good practice
- If you choose to assess an issue then immediately clean that issue, you can skip the observation and go straight to defining how to clean it

## Visual Assessment
- Involves looking at your data in its entirety in whatever program you like
- Visual assessment can be directed or non-directed
- **Directed** means looking through each table in full within a Jupyter notebook or a spreadsheet program like Google Sheets or MS Excel. Sometimes pandas is the only option because some datasets are so large that spreadsheet programs crash when trying to open them

- **Non-directed** is when you look at different pieces of the tables, scrolling aimlessly and stumbling upon issues, then dialing in on something once you have more of a clue about the issue you've spotted. At that point you can use pinpointed assessment, whether visual or programmatic

- Visual assessment is great for getting acquainted with the dataset and understanding what it is all about, like acquiring a mental picture of it. A lot of the time, assessing is also driven by the problem you want to solve, for example checking the values in the columns and rows you plan on using in your analysis

## Visual Assessment: Acquaint Yourself with the Data

In [15]:
import pandas as pd
from IPython.display import display

In [4]:
patients = pd.read_csv('datasets/patients.csv')
treatments = pd.read_csv('datasets/treatments.csv')
adverse_reactions = pd.read_csv('datasets/adverse_reactions.csv')

### Assess

In [17]:
# display the patients table
display(patients)

| | patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 |
| 1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | Illinois | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 |
| 2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | Nebraska | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 |
| 3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 |
| 4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 498 | 499 | male | Mustafa | Lindström | 2530 Victoria Court | Milton Mills | ME | 3852.0 | United States | 207-477-0579MustafaLindstrom@jourrapide.com | 4/10/1959 | 181.1 | 72 | 24.6 |
| 499 | 500 | male | Ruman | Bisliev | 494 Clarksburg Park Road | Sedona | AZ | 86341.0 | United States | 928-284-4492RumanBisliev@gustr.com | 3/26/1948 | 239.6 | 70 | 34.4 |
| 500 | 501 | female | Jinke | de Keizer | 649 Nutter Street | Overland Park | MO | 64110.0 | United States | 816-223-6007JinkedeKeizer@teleworm.us | 1/13/1971 | 171.2 | 67 | 26.8 |
| 501 | 502 | female | Chidalu | Onyekaozulu | 3652 Boone Crockett Lane | Seattle | WA | 98109.0 | United States | ChidaluOnyekaozulu@jourrapide.com1 360 443 2060 | 2/13/1952 | 176.9 | 67 | 27.7 |


patients | columns
-------- | -------
patient_id | the unique identifier for each patient in the Master Patient Index (i.e. patient database) of the pharmaceutical company that is producing Auralin
assigned_sex | the assigned sex of each patient at birth (male or female)
given_name | the given name (i.e. first name) of each patient
surname | the surname (i.e. last name) of each patient
address | the main address for each patient
city | the corresponding city for the main address of each patient
state | the corresponding state for the main address of each patient
zip_code | the corresponding zip code for the main address of each patient
country | the corresponding country for the main address of each patient (all United States for this clinical trial)
contact | phone number and email information for each patient
birthdate | the date of birth of each patient (month/day/year). The inclusion criterion for this clinical trial is age >= 18 (there is no maximum age because diabetes is a growing problem among the elderly population)
weight | the weight of each patient in pounds (lbs)
height | the height of each patient in inches (in)
bmi | the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m² where kg is a person's weight in kilograms and m² is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. The inclusion criterion for this clinical trial is 16 <= BMI <= 38.
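
Because `weight` is stored in pounds and `height` in inches, the BMI formula needs a unit conversion before it can be compared with the `bmi` column. The following is a minimal sketch, not part of the original notebook: it rebuilds the first five displayed rows by hand, and the 0.5 tolerance is an assumed allowance for rounding.

```python
import pandas as pd

LB_TO_KG = 0.45359237  # kilograms per pound
IN_TO_M = 0.0254       # metres per inch

# The first five rows of the patients table as displayed above.
sample = pd.DataFrame({
    "given_name": ["Zoe", "Pamela", "Jae", "Liêm", "Tim"],
    "weight": [121.7, 118.8, 177.8, 220.9, 192.3],  # pounds
    "height": [66, 66, 71, 70, 27],                 # inches
    "bmi": [19.6, 19.2, 24.8, 31.7, 26.1],
})

# BMI = weight in kilograms divided by height in metres squared.
weight_kg = sample["weight"] * LB_TO_KG
height_m = sample["height"] * IN_TO_M
sample["bmi_recomputed"] = (weight_kg / height_m ** 2).round(1)

# Rows where recorded and recomputed BMI disagree by more than the assumed tolerance.
suspect = sample[(sample["bmi"] - sample["bmi_recomputed"]).abs() > 0.5]
print(suspect[["given_name", "height", "bmi", "bmi_recomputed"]])
```

With these five rows, only Tim Neudorf's row is flagged: a height of 27 inches implies a BMI far above the recorded 26.1, while the other four rows reproduce their recorded values.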

In [24]:
# display the treatments table
treatments

| | given_name | surname | auralin | novodra | hba1c_start | hba1c_end | hba1c_change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | veronika | jindrová | 41u - 48u | - | 7.63 | 7.20 | NaN |
| 1 | elliot | richardson | - | 40u - 45u | 7.56 | 7.09 | 0.97 |
| 2 | yukitaka | takenaka | - | 39u - 36u | 7.68 | 7.25 | NaN |
| 3 | skye | gormanston | 33u - 36u | - | 7.97 | 7.62 | 0.35 |
| 4 | alissa | montez | - | 33u - 29u | 7.78 | 7.46 | 0.32 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 275 | albina | zetticci | 45u - 51u | - | 7.93 | 7.73 | 0.20 |
| 276 | john | teichelmann | - | 49u - 49u | 7.90 | 7.58 | NaN |
| 277 | mathea | lillebø | 23u - 36u | - | 9.04 | 8.67 | 0.37 |
| 278 | vallie | prince | 31u - 38u | - | 7.64 | 7.28 | 0.36 |


350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before.  All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn’t enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:
- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

`treatments` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial
- **auralin**: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) *and* the ending median daily dose of insulin at the end of the 24 weeks of treatment measured over the 24th week of treatment (the number after the dash). Both are measured in units (shortform 'u'), which is the [international unit](https://en.wikipedia.org/wiki/International_unit) of measurement and the standard measurement for insulin.
- **novodra**: same as above, except for patients that continued treatment with Novodra
- **hba1c_start**: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The [HbA1c test](https://depts.washington.edu/uwcoe/healthtopics/diabetes.html) measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
- **hba1c_end**: the patient's HbA1c level at the end of the last week of treatment
- **hba1c_change**: the change in the patient's HbA1c level from the start of treatment to the end, i.e., `hba1c_start` - `hba1c_end`. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).
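
Since `hba1c_change` is defined as `hba1c_start` - `hba1c_end`, the recorded values can be checked against that definition. This is a minimal sketch using only the first five displayed rows of `treatments`, rebuilt by hand; the 0.005 tolerance is an assumption to absorb floating-point noise.

```python
import numpy as np
import pandas as pd

# First five rows of the treatments table as displayed above (NaN = missing).
sample = pd.DataFrame({
    "given_name": ["veronika", "elliot", "yukitaka", "skye", "alissa"],
    "hba1c_start": [7.63, 7.56, 7.68, 7.97, 7.78],
    "hba1c_end": [7.20, 7.09, 7.25, 7.62, 7.46],
    "hba1c_change": [np.nan, 0.97, np.nan, 0.35, 0.32],
})

# Recompute the change from its definition: start minus end.
sample["recomputed"] = (sample["hba1c_start"] - sample["hba1c_end"]).round(2)

# Flag rows whose recorded change disagrees with the recomputed one.
# Comparisons against NaN are False, so missing values are not flagged here.
mismatch = (sample["recomputed"] - sample["hba1c_change"]).abs() > 0.005
print(sample.loc[mismatch, ["given_name", "hba1c_change", "recomputed"]])
```

In this small sample the row for `elliot` is flagged (recorded 0.97, while 7.56 - 7.09 gives 0.47). That is only an observation from the displayed rows, but it is the kind of finding worth documenting during assessment.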

In [19]:
# Display the adverse_reactions table
adverse_reactions.style

| | given_name | surname | adverse_reaction |
| --- | --- | --- | --- |
| 0 | berta | napolitani | injection site discomfort |
| 1 | lena | baer | hypoglycemia |
| 2 | joseph | day | hypoglycemia |
| 3 | flavia | fiorentino | cough |
| 4 | manouck | wubbels | throat irritation |
| 5 | jasmine | sykes | hypoglycemia |
| 6 | louise | johnson | hypoglycemia |
| 7 | albinca | komavec | hypoglycemia |
| 8 | noe | aranda | hypoglycemia |
| 9 | sofia | hermansen | injection site discomfort |


`adverse_reactions` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with either Auralin or Novodra)
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with either Auralin or Novodra)
- **adverse_reaction**: the adverse reaction reported by the patient

Additional useful information:
- [Insulin resistance varies person to person](http://www.tudiabetes.org/forum/t/how-much-insulin-is-too-much-on-a-daily-basis/9804/5), which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
- It is important to test drugs and medical products in the people they are meant to help. People of different age, race, sex, and ethnic group must be included in clinical trials. This [diversity](https://www.clinicalleader.com/doc/an-fda-perspective-on-patient-diversity-in-clinical-trials-0001) is reflected in the `patients` table.
- Ensuring column names are descriptive enough is an important step in acquainting yourself with the data. 'Descriptive enough' is subjective. Ideally you want short column names (so they are easier to type and read in code form) but also fully descriptive. Length vs. descriptiveness is a tradeoff and common debate (a [similar debate](https://softwareengineering.stackexchange.com/questions/176582/is-there-an-excuse-for-short-variable-names) exists for variable names). The *auralin* and *novodra* column names are probably not descriptive enough, but you'll address that later so don't worry about that for now.

## Data Quality Using Visual Assessment
- Finding a few data quality issues through visual assessment
- The issues will be documented in the cell below and grouped according to quality or tidiness

1. __hba1c change__
- Assessment is always guided by what you need to analyze. For this clinical trial data, our key metric is the change in `hba1c`
- It's important that the data in this column is clean, but some entries are empty (NaN, i.e. missing values)

    _treatments table missing hba1c_changes_
    
2. __zip codes__
- The entries in the `zip_code` column in the `patients` table have decimals. Some zip codes have five digits before the decimal and others have four. This data was probably loaded from a spreadsheet that typed the zip codes as numbers, adding the decimal and lopping off the leading zero
- The two issues are the data type and the inconsistency of the data

- [Is it a good idea to use an integer column for storing US zip codes in a database?](https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas)

3. __Height__
- In the `patients` table, the recorded `height` for Tim Neudorf is 27 inches (two feet three inches), which is inconsistent with his weight and body mass index
- This is important in terms of the clinical trial because of the reporting of average metrics, such as average height and weight.

4. __States__
- In the `patients` table, the `state` column uses both full state names and abbreviations to represent the states
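
The zip code observation in item 2 can be reproduced on a tiny example. The sketch below is an assumption-laden illustration rather than the notebook's own loading code: the three-row CSV is hypothetical (patient 900 and its missing zip code are invented to trigger the float conversion), and it only shows how the default type inference of `pd.read_csv` differs from reading the column as text.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the middle zip code is missing on purpose.
csv_text = "patient_id,zip_code\n4,07095\n900,\n1,92390\n"

# Default parsing infers a numeric column: the missing value forces float64
# and the leading zero of 07095 is dropped.
as_number = pd.read_csv(io.StringIO(csv_text))
print(as_number["zip_code"].tolist())

# Asking for strings keeps the zip codes exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
print(as_text["zip_code"].tolist())
```

Reading identifiers such as zip codes as strings avoids both symptoms noted above, the trailing `.0` and the four-digit codes.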

## Quality
- _treatments table - missing hba1c changes_
- _patients table - zip code is a float, not a string_
- _patients table - zip code has four digits sometimes_
- _patients table - Tim Neudorf height is 27 instead of 72_
- _patients table - full state names sometimes, abbreviations other times_

## Tidiness

## Assessing vs Exploring
- __Data wrangling__ is about:
    - __Gathering__ the right data
    - __Assessing__ the data's quality and structure
    - __Modifying__ the data to make it clean

- Assessment and modification will not make your analysis, visualizations or models better. They just make them work

- __Exploratory Data Analysis__ is about:
    - **Exploring** the data with simple visualizations that summarize the data's main characteristics
    - **Augmenting** the data, for example by removing outliers and feature engineering

#### Assessing
- Assessing is everything we identified above. It also includes identifying structural issues that make analysis difficult
- Discovering these data quality issues ensures that the analysis can be executed, which for this clinical trial data includes calculating average patient metrics (e.g. age, weight, height and BMI) and calculating the confidence interval for the difference in HbA1c change means between Novodra and Auralin patients

#### Exploring
- In the context of this dataset, **exploring** might include using summary statistics like `count` on the `state` column or `mean` on the `weight` column to see if patients from certain states or of certain weights are more likely to have diabetes, which we could use to exclude certain patients from the analysis and make it less biased

- In the context of a clinical trial, **exploring** is less likely to happen, given that clinical trials are expensive and involve a lot of pre-planning. So exploring this dataset would likely happen before the clinical trial was conducted

## Quality: Visual Assessment 2
### More Data Quality Issues
1. The `given_name` for the patient with the `patient_id` of 9 (the name Dsvid doesn't seem right)
2. `u` next to the start dose and end dose in the `auralin` and `novodra` columns (will we be able to do anything with those values if a `u` is next to each?)
3. Lowercase names in the `treatments` and `adverse_reactions` tables
4. 280 names in the `treatments` table (350 records should be provided)

- The letter `u` attached to each dose means that pandas will not be able to interpret the values as floats or integers, which is required for calculations (the start and end doses should actually be in separate columns, but we'll handle that later). This could have arisen because the data was transferred from paper to computer using something like optical character recognition

- The `given_name` and `surname` values in the `treatments` and `adverse_reactions` tables are all lowercase, but the names in the `patients` table start with an uppercase letter. This will be an issue when we later join these tables

- There are 280 rows (the last index is 279), but each treatment arm had 175 patients (175 for Auralin and 175 for Novodra), so 350 records are expected. We are missing some data; finding where that data lives is a separate issue (we'll deal with that later)

- [Optical character recognition](https://pdf.abbyy.com/learning-center/what-is-ocr/)
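
The effect of the `u` suffix can be checked in code. This is a minimal sketch using four dose strings copied from the `auralin` column displayed earlier; it is an illustration of the problem, not a cleaning step.

```python
import pandas as pd

# Dose strings copied from the auralin column ('-' means the drug was not taken).
auralin = pd.Series(["41u - 48u", "-", "-", "33u - 36u"], name="auralin")

# The column is stored as text, so numeric conversion fails for every entry;
# errors="coerce" turns each failure into NaN instead of raising an error.
as_numbers = pd.to_numeric(auralin, errors="coerce")
print(as_numbers.isna().all())

# Count how many entries carry a number followed by the unit suffix.
has_unit = auralin.str.contains(r"\d+u", regex=True)
print(has_unit.sum())
```

Every entry fails numeric conversion, which is why the doses cannot be used in calculations until the unit is separated from the number.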

## Quality
#### patients table
- _zip code is a float not a string_
- _zip code has four digits sometimes_
- _Tim Neudorf height is 27 instead of 72_
- _full state names sometimes, abbreviations other times_
- _Dsvid Gustafsson_

#### treatments table
- _treatments table missing hba1c changes_
- _the letter u in starting and ending doses for auralin and novodra_
- _lowercase given names and surnames_
- _missing records (280 instead of 350)_

#### adverse_reactions table
- _lowercase given names and surnames_


## Tidiness

## Data Quality Dimensions
- Every dirty dataset is dirty in its own unique way. Trying to list every quality issue is therefore futile, but we can categorize them
- Categories of data quality are called data quality dimensions. The four main data quality dimensions are:
    1. __Completeness__ - Do we have all the records that we should? Do we have missing records or not? Are there specific rows, columns or cells missing?
    2. __Validity__ - we have the records but they are not valid, i.e. they don't conform to a defined schema
    - A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables)

    3. __Accuracy__ - inaccurate data is wrong data that is valid. It adheres to a defined schema but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty

    4. __Consistency__ - inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e. a standard format in columns that represent the same data across tables and/or within tables, is desired

    - These are listed in decreasing order of severity, meaning the one listed first (completeness) is the most important
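
Two of these dimensions lend themselves to simple programmatic checks. The sketch below is illustrative only: it uses values copied from the first five displayed rows of `patients`, and it assumes the BMI inclusion criterion means a range of 16 to 38, treated here as a schema rule.

```python
import pandas as pd

# Values copied from the first five rows of the patients table shown earlier.
sample = pd.DataFrame({
    "given_name": ["Zoe", "Pamela", "Jae", "Liêm", "Tim"],
    "state": ["California", "Illinois", "Nebraska", "NJ", "AL"],
    "bmi": [19.6, 19.2, 24.8, 31.7, 26.1],
})

# Validity: does every BMI satisfy the assumed schema rule 16 <= BMI <= 38?
bmi_is_valid = sample["bmi"].between(16, 38)
print(bmi_is_valid.all())

# Consistency: count full state names versus two-letter abbreviations.
state_format = sample["state"].str.len().eq(2).map(
    {True: "abbreviation", False: "full name"}
)
print(state_format.value_counts())
```

A value can pass a validity rule like this and still be wrong, which is exactly the distinction between validity and accuracy drawn above.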

#### More information
- [How to improve data quality](https://www.informit.com/articles/article.aspx?p=399325&seqNum=3)
- [The Seven Dimensions of Data Quality](https://www.youtube.com/watch?v=dPsx8_Fcr-U)

## Identifying issues
- The typo in David is an accuracy issue. There's nothing illegal about having the name `Dsvid`, so it's not invalid, it's just inaccurate
- The letter `u` in the dosage info is a validity issue. `23u` is not a valid dose; the valid dose is `23` and the unit of measurement is `u`
- The lowercase given names and surnames in the `treatments` and `adverse_reactions` tables are a **consistency** issue. It's not necessarily a big deal that the given names and surnames are lowercase; it's just that in the `patients` table the names are capitalized, so we'd run into issues when joining these tables based on name
- Missing records in the treatments table (280 instead of 350) is a straightforward __completeness issue__
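
The join problem caused by inconsistent capitalization can be demonstrated on toy frames. In this sketch the names are borrowed from the `patients` table, but pairing them with treatment rows is purely hypothetical, as is the `in_trial` column.

```python
import pandas as pd

# Toy frames: capitalized names on one side, lowercase names on the other.
patients_sample = pd.DataFrame({
    "given_name": ["Zoe", "Tim"],
    "surname": ["Wellish", "Neudorf"],
})
treatments_sample = pd.DataFrame({
    "given_name": ["zoe", "tim"],
    "surname": ["wellish", "neudorf"],
    "in_trial": [True, True],  # hypothetical marker column
})

# Joining on the raw names finds no matches because 'Zoe' != 'zoe'.
naive = patients_sample.merge(treatments_sample, on=["given_name", "surname"])

# Comparing on lowercased copies of the names makes the rows line up.
keys = patients_sample.assign(
    given_name=patients_sample["given_name"].str.lower(),
    surname=patients_sample["surname"].str.lower(),
)
matched = keys.merge(treatments_sample, on=["given_name", "surname"])
print(len(naive), len(matched))
```

The naive merge returns no rows, while the normalized merge matches both patients, which is why the capitalization mismatch is worth documenting before any join.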

## Programmatic Assessment
### Using Code to Assess Data
- Programmatic assessment uses functions and methods to reveal something about your data's quality and tidiness
- e.g. in pandas we can call the `.info()` method to print a concise summary of the DataFrame

### Programmatic assessment is driven by the problem you want to solve
- Looking at the summary of the `treatments` DataFrame returned by `.info()`, we can see that there are only 171 `hba1c_change` entries while there are 280 entries for the other columns. That indicates that we are missing some data

In [9]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB
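
The gap that `.info()` reveals (280 - 171 = 109 missing `hba1c_change` values in the full table) can also be counted directly. A minimal sketch, using only the five `hba1c_change` values from the first displayed rows of `treatments`:

```python
import numpy as np
import pandas as pd

# hba1c_change values from the first five displayed rows of treatments.
hba1c_change = pd.Series([np.nan, 0.97, np.nan, 0.35, 0.32], name="hba1c_change")

n_missing = int(hba1c_change.isna().sum())  # entries that are NaN
n_present = int(hba1c_change.count())       # non-null entries, as .info() reports
print(n_missing, n_present)
```

On the full DataFrame, the same `.isna().sum()` call on the column would give the number of missing values without having to subtract the counts by hand.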


## Non-Directed Programmatic Assessment Can Also Be Useful
- Non-directed programmatic assessment means trying out programmatic assessments at random, without a specific goal in mind. The `.sample()` method in pandas displays a random sample of entries
    
    ```python
    df.sample()  # returns one entry
    df.sample(5) # returns 5 entries
    ```

In [10]:
patients.sample()

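| | patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 432 | 433 | female | Karen | Jakobsen | 1690 Fannie Street | Houston | TX | 77020.0 | United States | KarenJakobsen@jourrapide.com1 979 203 0438 | 11/25/1962 | 185.2 | 67 | 29.0 |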


### Assess
These are the programmatic assessment methods in pandas that you will probably use most often:

* .head (DataFrame and Series)
* .tail (DataFrame and Series)
* .sample (DataFrame and Series)
* .info (DataFrame; also available on Series in recent pandas versions)
* .describe (DataFrame and Series)
* .value_counts (Series; also available on DataFrame in recent pandas versions)
* Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc)

Try them out below and keep their results in mind. Some will come in handy later in the lesson.

Check out the [pandas API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) for detailed usage information.

In [11]:
adverse_reactions.describe()

| | given_name | surname | adverse_reaction |
| --- | --- | --- | --- |
| count | 34 | 34 | 34 |
| unique | 34 | 33 | 6 |
| top | berta | johnson | hypoglycemia |
| freq | 1 | 2 | 19 |


In [14]:
adverse_reactions['adverse_reaction'].value_counts()

hypoglycemia                 19
injection site discomfort     6
headache                      3
cough                         2
throat irritation             2
nausea                        2
Name: adverse_reaction, dtype: int64

In [46]:
# patients.style
print(len(patients.loc[patients.city == "New York"]))

18


In [42]:
# selecting the records in the patients table for patients whose state is New York (written as "NY" or "New York")
patients.loc[(patients.state == "NY") | (patients['state'] == "New York")].head()

| | patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9 | 10 | female | Sophie | Cabrera | 3303 Anmoore Road | New York | New York | 10011.0 | United States | SophieCabreraIbarra@teleworm.us1 718 795 9124 | 12/3/1930 | 194.7 | 64 | 33.4 |
| 22 | 23 | male | Manchu | Su | 1092 Deans Lane | Pleasantville | NY | 10570.0 | United States | 914-745-6108ManchuSu@einrot.com | 1/19/1936 | 130.7 | 65 | 21.7 |
| 24 | 25 | male | Jakob | Jakobsen | 648 Old Dear Lane | Port Jervis | New York | 12771.0 | United States | JakobCJakobsen@einrot.com+1 (845) 858-7707 | 8/1/1985 | 155.8 | 67 | 24.4 |
| 29 | 30 | male | Jake | Jakobsen | 648 Old Dear Lane | Port Jervis | New York | 12771.0 | United States | JakobCJakobsen@einrot.com+1 (845) 858-7707 | 8/1/1985 | 155.8 | 67 | 24.4 |
| 35 | 36 | female | Kamila | Pecinová | 3558 Longview Avenue | New York | New York | 10004.0 | United States | 718-501-0503KamilaPecinova@dayrep.com | 12/23/1985 | 198.9 | 62 | 36.4 |


In [35]:
# selecting range of rows from 2 to 5
display(treatments.loc[2: 5])

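| | given_name | surname | auralin | novodra | hba1c_start | hba1c_end | hba1c_change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | yukitaka | takenaka | - | 39u - 36u | 7.68 | 7.25 | NaN |
| 3 | skye | gormanston | 33u - 36u | - | 7.97 | 7.62 | 0.35 |
| 4 | alissa | montez | - | 33u - 29u | 7.78 | 7.46 | 0.32 |
| 5 | jasmine | sykes | - | 42u - 44u | 7.56 | 7.18 | 0.38 |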


In [33]:
# selecting rows from 1 to 4 and columns from 2 to 4
display(patients.iloc[1: 5, 2: 5])

| | given_name | surname | address |
| --- | --- | --- | --- |
| 1 | Pamela | Hill | 2370 University Hill Road |
| 2 | Jae | Debord | 1493 Poling Farm Road |
| 3 | Liêm | Phan | 2335 Webster Street |
| 4 | Tim | Neudorf | 1428 Turkey Pen Lane |


In [34]:
# selecting the rows at positions 0, 2, 4 and 7
display(adverse_reactions.iloc[[0, 2, 4, 7]])

| | given_name | surname | adverse_reaction |
| --- | --- | --- | --- |
| 0 | berta | napolitani | injection site discomfort |
| 2 | joseph | day | hypoglycemia |
| 4 | manouck | wubbels | throat irritation |
| 7 | albinca | komavec | hypoglycemia |
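
The slicing cells above mix `.loc` and `.iloc`, which treat the end of a slice differently. A minimal sketch on a stand-in frame (the `value` column is hypothetical) shows the difference:

```python
import pandas as pd

# A stand-in frame with a default integer index, like the tables above.
df = pd.DataFrame({"value": list("abcdefgh")})

by_label = df.loc[2:5]      # label-based: both endpoints are included
by_position = df.iloc[2:5]  # position-based: the end point is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

This is why `treatments.loc[2: 5]` returns four rows (2 through 5), while `patients.iloc[1: 5, 2: 5]` returns rows 1 through 4 and the columns at positions 2 through 4.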
