
# Lab Two: Data Cleaning
Laura Brin

Data cleaning is one of the first steps in any data analysis or machine learning project. In this lab, you will go over some common data issues and apply suitable fixes.

##### Loading libraries needed and the data

In [1]:
import pandas as pd
import numpy as np



### Explaining the data we will be using for this and the next few labs

In this example, we have four datasets. Two of them come from two hypothetical clinics, "clinic1" and "clinic2", which diagnose patients with a novel device that takes many measurements. The final goal is to determine whether a patient has a particular disease.

Measurements taken from patients in the two clinics are presented in dataframes `df_1` and `df_2`. We also have inspection logs, recorded in `df_3` and `df_4`, for the devices used in "clinic1" and "clinic2", in which many variables from each device are measured.

Two of these variables, `'R1'` and `'R3'`, are believed to affect the readings taken from the patients. The `df_1` and `df_2` datasets are labelled with an actual diagnosis of whether the patient had the disease or not, and the goal is to predict the presence of the disease based on the measurements taken from the patients. Since the device variables measured during inspection affect the measurements taken from patients in the clinics, they should also be considered. Here are the data frames:

#### Note: Cells marked '[A]' are activities for you to complete.

In [2]:
# loading the 'df_1_lab_2.csv' csv data
df = pd.read_csv("/content/df_1_lab_2.csv")

# This makes it so we are able to see 100 rows when displaying the data
pd.set_option('display.max_rows', 100)

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Examination Date,Name,Gender,Age,Mode,Q,M1,DD,Diagnosis
0,0,2011-06-10,Cody Watson,male,72,d,_,5.058993,1.481877,negative
1,1,2011-02-20,Jonathan Duke,male,9,e,104.08763319576146,6.531724,2.266884,negative
2,2,2011-06-27,Charlene Houseworth,female,59,f,102.90431521853222,6.273313,0.396333,positive
3,3,2010-09-22,Gregory Curci,Male,23,g,152.65165533060298,7.333626,0.557534,positive
4,4,2011-06-10,Cody Watson,male,72,d,_,5.058993,1.481877,negative


## Looking at issues with the data

##### Unique values of the `'Gender'` feature.

In [4]:
df['Gender'].unique()

array(['male', 'female', 'Male', 'fmeale'], dtype=object)

##### Unique values of the `'Mode'` feature.

In [5]:
df['Mode'].unique()

array(['d', 'e', 'f', 'g', '_', 'F', 'h', 'a'], dtype=object)
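`unique()` lists the distinct labels but not how often each one occurs. As a minimal sketch on a made-up Series (the values are illustrative and not read from the lab files), `value_counts()` can help tell a rare typo from a genuine category:

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of labels seen in the 'Gender' column
gender = pd.Series(["male", "female", "Male", "fmeale", "male"])

# Count every distinct label, including missing values if there are any
counts = gender.value_counts(dropna=False)
print(counts)
```

Labels that appear only once or twice next to a near-identical frequent label are good candidates for cleaning.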

##### Observing the data types of each column.

In [6]:
df.dtypes

Unnamed: 0            int64
Examination Date     object
Name                 object
Gender               object
Age                   int64
Mode                 object
Q                    object
M1                  float64
DD                  float64
Diagnosis            object
dtype: object

# Lab Activity One: Guided Data Cleaning

##### It is generally a good idea to make a copy of the master dataset for cleaning so you can always go back if ever needed.

In [7]:
df_clean = df.copy()

##### [A] Drop the `'Unnamed: 0'` column

In [8]:
df_clean=df_clean.drop('Unnamed: 0', axis=1)

##### [A] Fix the data in the `'Gender'` column. 
> Hint: make it so there are 2 unique entries, `'male'` and `'female'`.

In [9]:
# Restrict the replacement to the 'Gender' column so other columns are untouched
df_clean["Gender"] = df_clean["Gender"].replace({"fmeale": "female", "Male": "male"})
df_clean.head(10)

Unnamed: 0,Examination Date,Name,Gender,Age,Mode,Q,M1,DD,Diagnosis
0,2011-06-10,Cody Watson,male,72,d,_,5.058993,1.481877,negative
1,2011-02-20,Jonathan Duke,male,9,e,104.08763319576146,6.531724,2.266884,negative
2,2011-06-27,Charlene Houseworth,female,59,f,102.90431521853222,6.273313,0.396333,positive
3,2010-09-22,Gregory Curci,male,23,g,152.65165533060298,7.333626,0.557534,positive
4,2011-06-10,Cody Watson,male,72,d,_,5.058993,1.481877,negative
5,2011-07-21,Linda Sawicki,female,17,g,67.23905430892725,3.642516,3.765706,positive
6,2011-05-06,Ruth Morgan,female,19,_,54.05030860320375,7.193318,-0.173915,_
7,2011-07-05,Shane Acosta,male,5,f,47.04644557435022,4.808863,6.446874,negative
8,2010-10-31,Tania Fuoco,female,41,f,_,7.637614,-0.655884,negative
9,2011-05-01,Arla Czachorowski,female,36,f,109.64572405866217,7.00612,2.405082,negative
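An alternative way to reach two unique entries is to normalise the case first and then map the remaining misspellings. This is only a sketch on a toy dataframe; the `fixes` dictionary and the `toy` frame are assumptions made for illustration:

```python
import pandas as pd

# Toy data with the four labels reported by df['Gender'].unique()
toy = pd.DataFrame({"Gender": ["male", "female", "Male", "fmeale"]})

# Hypothetical lookup of known misspellings and their canonical label
fixes = {"fmeale": "female"}

# Lower-casing handles 'Male'; the dictionary handles the typo
toy["Gender"] = toy["Gender"].str.lower().replace(fixes)
print(toy["Gender"].unique())
```

Lower-casing first keeps the lookup short, since each misspelling only needs to be listed in one case.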


##### [A] Fix the data in the `'Mode'` column. 
> Hint: Check for lower and upper case entries

In [10]:
# .str.lower() returns a new Series, so assign it back to the column
df_clean["Mode"] = df_clean["Mode"].str.lower()
df_clean.head(10)

Unnamed: 0,Examination Date,Name,Gender,Age,Mode,Q,M1,DD,Diagnosis
0,2011-06-10,Cody Watson,male,72,d,_,5.058993,1.481877,negative
1,2011-02-20,Jonathan Duke,male,9,e,104.08763319576146,6.531724,2.266884,negative
2,2011-06-27,Charlene Houseworth,female,59,f,102.90431521853222,6.273313,0.396333,positive
3,2010-09-22,Gregory Curci,male,23,g,152.65165533060298,7.333626,0.557534,positive
4,2011-06-10,Cody Watson,male,72,d,_,5.058993,1.481877,negative
5,2011-07-21,Linda Sawicki,female,17,g,67.23905430892725,3.642516,3.765706,positive
6,2011-05-06,Ruth Morgan,female,19,_,54.05030860320375,7.193318,-0.173915,_
7,2011-07-05,Shane Acosta,male,5,f,47.04644557435022,4.808863,6.446874,negative
8,2010-10-31,Tania Fuoco,female,41,f,_,7.637614,-0.655884,negative
9,2011-05-01,Arla Czachorowski,female,36,f,109.64572405866217,7.00612,2.405082,negative
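String methods under `.str` return a new Series rather than modifying the column, so the result only takes effect once it is assigned back. A small sketch on a made-up Series (the values are illustrative):

```python
import pandas as pd

# Hypothetical sample of 'Mode' labels, including an upper-case entry
modes = pd.Series(["d", "F", "f", "_"])

modes.str.lower()             # result is discarded; modes is unchanged
unchanged = modes.tolist()

modes = modes.str.lower()     # assigning back keeps the lower-cased values
lowered = modes.tolist()

print(unchanged, lowered)
```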


### Setting The Correct Data Type

##### [A] Change the `'Q'` column to a numeric type (`float64`) instead of `object`.

In [11]:
df_clean["Q"]=pd.to_numeric(df_clean["Q"],errors="coerce")
df_clean.head(10)

Unnamed: 0,Examination Date,Name,Gender,Age,Mode,Q,M1,DD,Diagnosis
0,2011-06-10,Cody Watson,male,72,d,,5.058993,1.481877,negative
1,2011-02-20,Jonathan Duke,male,9,e,104.087633,6.531724,2.266884,negative
2,2011-06-27,Charlene Houseworth,female,59,f,102.904315,6.273313,0.396333,positive
3,2010-09-22,Gregory Curci,male,23,g,152.651655,7.333626,0.557534,positive
4,2011-06-10,Cody Watson,male,72,d,,5.058993,1.481877,negative
5,2011-07-21,Linda Sawicki,female,17,g,67.239054,3.642516,3.765706,positive
6,2011-05-06,Ruth Morgan,female,19,_,54.050309,7.193318,-0.173915,_
7,2011-07-05,Shane Acosta,male,5,f,47.046446,4.808863,6.446874,negative
8,2010-10-31,Tania Fuoco,female,41,f,,7.637614,-0.655884,negative
9,2011-05-01,Arla Czachorowski,female,36,f,109.645724,7.00612,2.405082,negative
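To see what `errors="coerce"` does, here is a minimal sketch on a made-up Series of strings. Entries that cannot be parsed as numbers, such as the `'_'` placeholder, become `NaN`:

```python
import pandas as pd

# Hypothetical sample of 'Q' values stored as strings
q = pd.Series(["_", "104.08763319576146", "102.90431521853222"])

# Unparseable entries are turned into NaN instead of raising an error
q_num = pd.to_numeric(q, errors="coerce")

# Count how many entries ended up missing after the conversion
n_missing = int(q_num.isna().sum())
print(q_num.dtype, n_missing)
```

Counting the missing values right after the conversion is a quick check on how many entries were placeholders.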


##### [A] Convert `'Examination Date'` column to `datetime` type.

In [12]:
df_clean["Examination Date"]=pd.to_datetime(df_clean["Examination Date"],errors="coerce")
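The same `errors="coerce"` idea applies to dates. In this sketch the sample values are assumptions (the string `"not a date"` is invented to show the behaviour); entries that cannot be parsed become `NaT`, the datetime equivalent of `NaN`:

```python
import pandas as pd

# Hypothetical sample: two valid dates and one invalid entry
dates = pd.Series(["2011-06-10", "2011-02-20", "not a date"])

# Invalid entries become NaT rather than raising an error
parsed = pd.to_datetime(dates, errors="coerce")

# Once parsed, date parts are available through the .dt accessor
years = parsed.dt.year
print(parsed)
```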

##### Using the `gender_type` dtype defined below, we set the `'Gender'` column as categorical data.

In [13]:
gender_type = pd.CategoricalDtype(categories=["female", "male"])

df_clean["Gender"] = df_clean["Gender"].astype(gender_type)

##### [A] Replicate how we changed the gender column to categorical but this time for the `'Mode'` column.

In [14]:
# Labels not listed in the categories (such as the '_' placeholder) become NaN
mode_type = pd.CategoricalDtype(categories=["d", "e", "f", "g", "h", "a"])
df_clean["Mode"] = df_clean["Mode"].astype(mode_type)

##### [A] Again, change the `'Diagnosis'` column to categorical, replicating the example above.

In [15]:
diagnosis_type=pd.CategoricalDtype(categories=["positive","negative"])
df_clean["Diagnosis"] = df_clean["Diagnosis"].astype(diagnosis_type)
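One consequence of casting to a fixed set of categories is that any label outside that set is treated as missing. A minimal sketch on a made-up Series, assuming the `'_'` placeholder seen in the data stands for an unknown diagnosis:

```python
import pandas as pd

# Hypothetical sample of 'Diagnosis' labels, including the '_' placeholder
diagnosis = pd.Series(["negative", "positive", "_"])

diagnosis_type = pd.CategoricalDtype(categories=["positive", "negative"])
as_cat = diagnosis.astype(diagnosis_type)

# '_' is not a declared category, so it shows up as a missing value
n_missing = int(as_cat.isna().sum())
print(as_cat)
```

It is worth checking `unique()` before the cast, because a misspelled label would be turned into a missing value in the same way.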

### Duplicate Entries
- Duplicate entries in a dataset are undesirable: they are redundant information and can cause overfitting.

##### [A] Check for duplicate entries and delete them

In [16]:
duplicate_rows=df_clean.duplicated()
display(duplicate_rows)

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36     True
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
50    False
51    False
52    False
53    False
54    False
55    False
56    False
57    False
58    False
59    False
60    False
dtype: bool

In [17]:
df_clean=df_clean[~duplicate_rows]

##### [A] Once you delete a row from your dataset, the index of that row is also deleted. Reset the index of the dataset so it is in proper order.
> Hint: use the `dataframe.reset_index` function and set the drop parameter to `True`.

In [18]:
df_clean=df_clean.reset_index(drop=True)
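The two steps above (masking out duplicates, then resetting the index) can also be written with `drop_duplicates()`. This sketch uses a small made-up dataframe to show that both routes give the same result:

```python
import pandas as pd

# Toy data in which the third row repeats the first
toy = pd.DataFrame({
    "Name": ["Cody Watson", "Jonathan Duke", "Cody Watson"],
    "Age": [72, 9, 72],
})

# Route 1: boolean mask from duplicated(), then reset the index
mask = toy.duplicated()
deduped = toy[~mask].reset_index(drop=True)

# Route 2: drop_duplicates() keeps the first occurrence by default
same = toy.drop_duplicates().reset_index(drop=True)

print(deduped)
```

Without `drop=True`, `reset_index` would keep the old index as an extra column.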

##### [A] Check the data types of your dataframe and print the shape of the dataframe.

In [19]:
df_clean.dtypes

Examination Date    datetime64[ns]
Name                        object
Gender                    category
Age                          int64
Mode                      category
Q                          float64
M1                         float64
DD                         float64
Diagnosis                 category
dtype: object

In [20]:
print(df_clean.shape)

(59, 9)


# Lab Activity Two: Clean a Dataset Yourself

In this activity, you will clean the dataset yourself using the examples from the activity above. The `df_2_lab_2.csv` file is loaded and displayed for you.

In [42]:
df_2 = pd.read_csv('df_2_lab_2.csv')
df_2.head(10)

Unnamed: 0.1,Unnamed: 0,Inspection Date,R1,R2,R3,R4,Device Site
0,0,2010-09-04,0.07960088091996784,0.007344,0.542921,7.4e-05,clinic1
1,1,2010-09-27,_,0.000204,9.613118,4.2e-05,clinic1
2,2,2010-10-06,_,0.000928,4.298943,4.1e-05,clinic1
3,3,2010-10-21,0.8776172211437634,0.002254,9.240019,0.000182,clinic1
4,4,2010-11-26,_,0.008547,6.659528,2.2e-05,clinic1
5,5,2010-11-29,0.7518246219719119,0.006172,7.023103,1.6e-05,clinic1
6,6,2011-01-04,0.9279763002997659,0.000833,5.473454,0.000124,clinic2
7,7,2011-01-28,0.9051137625634268,0.008793,7.223218,0.000209,clinic2
8,8,2011-02-03,0.09188005264255929,0.001529,2.745545,2.4e-05,clinic1
9,9,2011-02-04,_,0.007507,4.472094,3.5e-05,clinic2


In [43]:
print(df_2.shape)

(22, 7)


##### [A] Figure out the issues with this dataset and apply the cleaning methods

In [44]:
df_clean2=df_2.copy()

In [45]:
df_clean2.dtypes

Unnamed: 0           int64
Inspection Date     object
R1                  object
R2                 float64
R3                 float64
R4                 float64
Device Site         object
dtype: object

Cleaning needed: convert `'Inspection Date'` to datetime, convert `'R1'` to float (with `NaN` for missing entries), check the unique values of `'Device Site'` and set it to a categorical type, and check for duplicates.

In [46]:
df_clean2=df_clean2.drop("Unnamed: 0", axis=1)

In [None]:
df_clean2["Device Site"].unique()

In [48]:
df_clean2["Inspection Date"]=pd.to_datetime(df_clean2["Inspection Date"], errors="coerce")

In [49]:
site_categories=pd.CategoricalDtype(categories=["clinic1","clinic2"])
df_clean2["Device Site"]=df_clean2["Device Site"].astype(site_categories)

In [50]:
df_clean2["R1"]=pd.to_numeric(df_clean2["R1"], errors="coerce")

In [51]:
df_clean2.head(10)

Unnamed: 0,Inspection Date,R1,R2,R3,R4,Device Site
0,2010-09-04,0.079601,0.007344,0.542921,7.4e-05,clinic1
1,2010-09-27,,0.000204,9.613118,4.2e-05,clinic1
2,2010-10-06,,0.000928,4.298943,4.1e-05,clinic1
3,2010-10-21,0.877617,0.002254,9.240019,0.000182,clinic1
4,2010-11-26,,0.008547,6.659528,2.2e-05,clinic1
5,2010-11-29,0.751825,0.006172,7.023103,1.6e-05,clinic1
6,2011-01-04,0.927976,0.000833,5.473454,0.000124,clinic2
7,2011-01-28,0.905114,0.008793,7.223218,0.000209,clinic2
8,2011-02-03,0.09188,0.001529,2.745545,2.4e-05,clinic1
9,2011-02-04,,0.007507,4.472094,3.5e-05,clinic2


In [52]:
print(df_clean2.dtypes)

Inspection Date    datetime64[ns]
R1                        float64
R2                        float64
R3                        float64
R4                        float64
Device Site              category
dtype: object


In [53]:
duplicates=df_clean2.duplicated()
display(duplicates)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
dtype: bool

In [54]:
print(df_clean2.shape)

(22, 6)
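The cleaning steps used in this activity can be collected into one function so they are easy to rerun. This is a sketch under stated assumptions: `clean_inspection_log` is a hypothetical helper name, the toy rows are made up, and the column names and categories are the ones shown above.

```python
import pandas as pd

def clean_inspection_log(raw):
    """Hypothetical helper that applies the cleaning steps from this activity."""
    out = raw.copy()  # leave the original dataframe untouched
    out["Inspection Date"] = pd.to_datetime(out["Inspection Date"], errors="coerce")
    out["R1"] = pd.to_numeric(out["R1"], errors="coerce")  # '_' becomes NaN
    site_type = pd.CategoricalDtype(categories=["clinic1", "clinic2"])
    out["Device Site"] = out["Device Site"].astype(site_type)
    # Remove exact duplicate rows and renumber the index
    return out.drop_duplicates().reset_index(drop=True)

# Toy rows (not from the lab file); the last two are duplicates of each other
toy = pd.DataFrame({
    "Inspection Date": ["2010-09-04", "2010-09-27", "2010-09-27"],
    "R1": ["0.0796", "_", "_"],
    "Device Site": ["clinic1", "clinic1", "clinic1"],
})

cleaned = clean_inspection_log(toy)
print(cleaned.dtypes)
```

Working on a copy inside the function follows the same idea as `df_2.copy()` above: the raw data stays available if a step needs to be revisited.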
