Dirty data
- data that has issues with its content, including missing data, invalid data, inaccurate data or
inconsistent data

Messy data
- data that has issues with its structure (columns, rows or table)
- for tidy data: each variable should form a column; each observation should form a row; each type of observational unit should form a table
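
As a rough illustration of the tidy rules above, here is a sketch on a made-up table. The `patient`, `treatment_a` and `treatment_b` columns are hypothetical and do not come from patients.csv. The treatment name is hidden in the column headers, so `melt` is used to give each observation its own row.

```python
import pandas as pd

# Hypothetical messy table: the treatment name is stored in the column headers
messy = pd.DataFrame({
    'patient': ['Ann', 'Bob'],
    'treatment_a': [12, 7],
    'treatment_b': [15, 9],
})

# melt turns the two treatment columns into (treatment, result) pairs,
# so each row holds exactly one observation
tidy = messy.melt(id_vars='patient', var_name='treatment', value_name='result')
print(tidy)
```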

Data Quality Dimensions

1) Completeness
- do we have all the records?

2) Validity
- we have the records, but they're not valid
- records do not conform to a defined schema (e.g. you can't have a negative height)
- e.g. a primary key must be unique, so two records can't share the same key value

3) Accuracy
- adheres to defined schema but still incorrect
- e.g. an overestimated weight, or a height recorded as 27 inches (a valid number, but implausible)

4) Consistency
- valid and accurate but multiple ways of referring to the same thing
- e.g. inconsistent representation of state 'CA' or 'California'
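
The four dimensions above can each be probed with a one-line pandas check. The sketch below runs on a tiny made-up DataFrame (all values are hypothetical), and the 48-inch cut-off used for the accuracy check is an arbitrary threshold picked for illustration, not a rule from these notes.

```python
import pandas as pd

# Hypothetical records, loosely modelled on a patient table
sample = pd.DataFrame({
    'patient_id': [1, 2, 2, 4],
    'state': ['CA', 'California', 'NY', None],
    'height': [70, -5, 27, 66],   # inches
})

# Completeness: count missing values per column
missing = sample.isnull().sum()

# Validity: heights must be positive and the key must be unique
invalid_height = sample[sample.height <= 0]
duplicate_ids = sample[sample.patient_id.duplicated(keep=False)]

# Accuracy: 27 is a valid number but an implausible height;
# the 48-inch cut-off is an assumption made for this sketch
suspicious_height = sample[(sample.height > 0) & (sample.height < 48)]

# Consistency: the same state written two ways shows up as separate values
state_counts = sample.state.value_counts()
print(state_counts)
```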

In [3]:
# import pandas library
import pandas as pd

In [5]:
# loading patients.csv (read as tab-separated), using the first column as the index
df = pd.read_csv('patients.csv', sep='\t', index_col=0)

In [7]:
# getting basic information on the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 58.9+ KB
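
One thing the info() output hints at: zip_code is stored as float64, and rows further down show values such as 7095.0 and 3852.0, where a leading zero would normally be expected. A minimal sketch of one possible fix, on a made-up Series rather than the real column, is to convert each value to a zero-padded five-character string and leave missing values as they are.

```python
import pandas as pd

# Hypothetical zip codes as they look after being read as floats
zips = pd.Series([15212.0, 7095.0, None])

# Convert to 5-character strings, keeping missing values missing
zip_str = zips.map(lambda z: None if pd.isnull(z) else str(int(z)).zfill(5))
print(zip_str)
```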


In [8]:
# sampling data
df.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
428,429,male,Marko,Kos,1128 Jacobs Street,Pittsburgh,PA,15212.0,United States,412-319-0903MarkoKos@einrot.com,10/21/1982,227.7,69,33.6
384,385,male,Even,Knutsen,4851 Andy Street,Custer,SD,57730.0,United States,EvenKnutsen@rhyta.com+1 (605) 440-5492,10/26/1972,180.2,74,23.1
109,110,male,Stephen,Mayberry,3063 School House Road,Hattiesburg,MS,39402.0,United States,601-699-4153StephenFMayberry@jourrapide.com,9/1/1934,166.1,72,22.5
163,164,female,Hawra',Tuma,2972 Hillview Street,Winnsboro,SC,29180.0,United States,803-712-1180HawraSultanahTuma@superrito.com,9/29/1992,134.2,69,19.8
277,278,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [9]:
# checking the distribution of assigned_sex
df.assigned_sex.value_counts()

male      253
female    250
Name: assigned_sex, dtype: int64

In [10]:
# describing numerical columns 
df.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [11]:
# describing non-numerical columns
df.describe(include='object')

Unnamed: 0,assigned_sex,given_name,surname,address,city,state,country,contact,birthdate
count,503,503,503,491,491,491,491,491,503
unique,2,470,466,483,349,54,1,483,493
top,male,John,Doe,123 Main Street,New York,California,United States,johndoe@email.com1234567890,1/1/1975
freq,253,9,6,6,18,36,491,6,6
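
The state column has 54 unique values and its most frequent value is the full name 'California', while other rows use two-letter codes such as 'PA' or 'SD'. This is the consistency issue described earlier. A possible clean-up, sketched on a made-up Series, is to map full names to abbreviations with `replace`. The mapping below is deliberately partial and covers only full names visible in the outputs in these notes.

```python
import pandas as pd

# Hypothetical mix of abbreviations and full state names
states = pd.Series(['CA', 'California', 'Nebraska', 'NY', 'New York', 'FL', 'Florida'])

# Partial mapping for illustration; a real clean-up would need every state
full_to_abbrev = {
    'California': 'CA',
    'Nebraska': 'NE',
    'Florida': 'FL',
    'New York': 'NY',
}

# Values not found in the mapping are left unchanged
cleaned = states.replace(full_to_abbrev)
print(cleaned.value_counts())
```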


In [12]:
# getting all male records
df[df.assigned_sex == 'male']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
5,6,male,Rafael,Costa,1140 Willis Avenue,Daytona Beach,Florida,32114.0,United States,386-334-5237RafaelCardosoCosta@gustr.com,8/31/1931,183.9,70,26.4
8,9,male,Dsvid,Gustafsson,1790 Nutter Street,Kansas City,MO,64105.0,United States,816-265-9578DavidGustafsson@armyspy.com,3/6/1937,163.9,66,26.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
496,497,male,Alexander,Hueber,3868 Freed Drive,Stockton,California,95204.0,United States,AlexanderHueber@jourrapide.com1 209 762 2320,9/12/1942,194.0,72,26.3
497,498,male,Masataka,Murakami,1179 Patton Lane,Tulsa,OK,74116.0,United States,MasatakaMurakami@einrot.com+1 (918) 984-9171,8/19/1937,155.1,72,21.0
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
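
The row for patient_id 5 above has height 27 but a bmi of 26.1, which looks like an accuracy issue. One way to probe it is to recompute the BMI. The sketch below assumes weight is in pounds and height in inches (the notes mention inches for height; pounds is an assumption) and uses the standard imperial formula BMI = 703 * weight / height^2. Swapping the digits to 72 reproduces the recorded bmi, which suggests, but does not prove, a transposition error.

```python
def bmi_imperial(weight_lb, height_in):
    """BMI from pounds and inches: 703 * weight / height squared."""
    return 703 * weight_lb / height_in ** 2

# Values taken from the patient_id 5 row shown above
weight = 192.3

as_recorded = bmi_imperial(weight, 27)   # height exactly as stored
digits_swapped = bmi_imperial(weight, 72)  # height with digits transposed

print(round(as_recorded, 1), round(digits_swapped, 1))
```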


In [13]:
# getting records whose birthdate repeats an earlier row (the first occurrence is not flagged)
df[df.birthdate.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
277,278,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
282,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,25962.0,United States,304-438-2648SandraCTaylor@dayrep.com,10/23/1960,206.1,64,35.4
480,481,male,Nasser,Mansour,547 Weekley Street,San Antonio,TX,78212.0,United States,NasserMazinMansour@fleckens.hu1 210 326 5509,3/25/1938,183.5,66,29.6
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3
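
Note that duplicated() defaults to keep='first', so the output above omits the first record of each repeated birthdate. To see every copy, keep=False can be passed instead. A small sketch on made-up data (the names and dates below are hypothetical):

```python
import pandas as pd

# Hypothetical records with one birthdate appearing three times
people = pd.DataFrame({
    'given_name': ['John', 'Ann', 'John', 'John'],
    'birthdate': ['1/1/1975', '8/1/1985', '1/1/1975', '1/1/1975'],
})

# Default keep='first': only the second and later occurrences are flagged
later_only = people[people.birthdate.duplicated()]

# keep=False: every row that shares its birthdate with another row is flagged
all_copies = people[people.birthdate.duplicated(keep=False)]

print(len(later_only), len(all_copies))
```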


In [15]:
# checking how many records have empty addresses
df.address.isnull().sum()

12
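
The info() output earlier shows the same non-null count (491) for address, city, state, zip_code, country and contact, so the 12 missing addresses may belong to the same 12 rows that lack the other fields. A sketch of how that could be checked, using a small made-up DataFrame in place of the real one:

```python
import pandas as pd

# Hypothetical stand-in for the patient table
sample = pd.DataFrame({
    'address': ['1 A St', None, '3 C St'],
    'city': ['York', None, 'Tulsa'],
    'contact': ['a@x.com', None, 'c@x.com'],
})

cols = ['address', 'city', 'contact']

# Rows where at least one of the columns is missing
missing_any = sample[cols].isnull().any(axis=1)

# Rows where all of the columns are missing
missing_all = sample[cols].isnull().all(axis=1)

# If the two masks match, the gaps always occur together in the same rows
same_rows = missing_any.equals(missing_all)
print(same_rows)
```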

In [16]:
# getting all column names
df.columns

Index(['patient_id', 'assigned_sex', 'given_name', 'surname', 'address',
       'city', 'state', 'zip_code', 'country', 'contact', 'birthdate',
       'weight', 'height', 'bmi'],
      dtype='object')

In [17]:
# same column names as df.columns, but returned as a Series instead of an Index
pd.Series(list(df))

0       patient_id
1     assigned_sex
2       given_name
3          surname
4          address
5             city
6            state
7         zip_code
8          country
9          contact
10       birthdate
11          weight
12          height
13             bmi
dtype: object