<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

# Data Cleaning Exercise

In this exercise, you will be given a synthetic dataset that we have deliberately designed to be "messy". It has duplicates, missing values and mixed-up types. Your task is to use the skills you learned in the `data_cleaning.ipynb` demonstration to clean up this data.  
  
If you get stuck at any point, look back at that demonstration notebook to remind yourself how to analyse the data.  
  
**There are no wrong answers**. This notebook is not assessed in any way, and there is no single correct way to do data cleaning. This notebook is intended to help you practise data cleaning on realistic, messy data.

In [14]:
# First, we import the libraries that we are going to use 

import numpy as np 
import pandas as pd 

In [15]:
# this cell reads in our dataset

df = pd.read_csv("dataset/synthetic_bad_patient_data.csv")

df

Unnamed: 0,patient_id,age,gender,blood_pressure,cholesterol,height_cm,weight_kg,smoking_status,exercise_freq,alcohol_use,diagnosis_code,visit_count,last_visit,medication_count,insurance_type,glucose_level,bmi,doctor_notes,random_column_1,random_column_2
0,1377,30.0,Male,110.628157,180.0,,80,never,1.0,high,D5,10,,0.0,private,96.596998,32.0,Refer,,
1,1649,30.0,Other,107.742218,,170.0,80,former,4.0,none,D3,2,2023-06-15,,private,87.638960,28.0,Monitor,,ok
2,1096,30.0,Female,119.324598,220.0,175.0,80,current,1.0,high,D2,5,unknown,0.0,private,97.295194,22.0,Refer,bad_data,
3,1314,35.0,Female,145.216385,180.0,160.0,85,never,1.0,moderate,D3,15,unknown,0.0,,71.612233,,Refer,bad_data,
4,1198,45.0,Male,121.106552,240.0,,missing,never,,moderate,D5,7,2023-06-15,,none,88.632031,30.0,Monitor,bad_data,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,1542,25.0,Male,112.524564,180.0,180.0,80,former,,none,D5,15,,3.0,private,85.530523,,Monitor,,ok
496,1656,,Male,114.013322,240.0,175.0,,former,1.0,moderate,D5,15,2023-01-01,3.0,,92.853994,25.0,Stable,bad_data,ok
497,1439,45.0,Female,146.893839,200.0,170.0,missing,never,0.0,moderate,D5,7,,3.0,,104.512547,30.0,Follow-up,bad_data,
498,1721,,Male,110.664327,180.0,175.0,missing,former,4.0,none,D2,18,,2.0,private,92.608232,,,,


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### Data Exploration 
  
The most important first step is to **explore** the data. We need to find out *what* is wrong with it.  
  
Try to answer the following questions:
- How many patients are there?
- What are the data types of each column?
- How many missing values are there in each column?
- Which columns have so many missing values that we should just drop them?

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

Hints: You may find these functions useful:
- `df.isna().sum()` will tell you how many NaN values are in each column
- `print(df.dtypes)` will tell you the data type of each column. Remember that an "object" column usually holds text or a mix of types, and this is what we want to fix.
  
You can use the next few Python cells to do this analysis.
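
If you are unsure where to start, here is a minimal sketch of the exploration steps on a small toy table. The `toy` DataFrame and its values are made up for illustration (only the column names are borrowed from the dataset), so the numbers here say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only, NOT the exercise dataset
toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "age": [30.0, np.nan, np.nan, 45.0],
    "weight_kg": ["80", "missing", "missing", "85"],
    "random_column_1": [np.nan, np.nan, np.nan, "bad_data"],
})

# How many rows, and how many distinct patients?
n_rows = len(toy)
n_unique_patients = toy["patient_id"].nunique()

# Count and fraction of NaN values in each column
missing_per_column = toy.isna().sum()
missing_fraction = toy.isna().mean()

# Data type of each column
print(toy.dtypes)
print(missing_per_column)
```

Notice that the text `"missing"` in `weight_kg` is not counted by `isna()`, because it is an ordinary string rather than a NaN. Placeholder strings like this are worth looking for separately, for example with `toy["weight_kg"].unique()`.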

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### 1) De-duplication
  
A useful first step is to drop any duplicate rows. Do you remember how to do this?  
  
Do it in the cell below:
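
As a reminder, one possible approach is sketched here on a made-up `toy` table (not the exercise data). It counts fully identical rows with `duplicated()` and then removes them with `drop_duplicates()`. Whether two rows that share a `patient_id` but differ elsewhere should count as duplicates is a judgement call that this sketch does not make for you.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "age": [30, 41, 41, 45],
})

# Number of rows that exactly repeat an earlier row
n_duplicates = toy.duplicated().sum()

# Keep the first copy of each row and tidy up the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```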

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### 2) Dropping any columns
  
Are there any columns that we should consider dropping? Why?   
  
Use the cells below to drop any columns you feel you need to:
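
One way to make this decision less arbitrary is to look at the fraction of missing values per column. The sketch below uses a made-up `toy` table and a cut-off of 50%. The `threshold` value is an assumption chosen for illustration, not a rule, and you may prefer a different one. It also drops `Unnamed: 0`, which in this dataset looks like a saved copy of the row index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "age": [30.0, 35.0, np.nan, 45.0],
    "random_column_1": [np.nan, np.nan, np.nan, "bad_data"],
})

threshold = 0.5  # assumed cut-off: drop columns with more than 50% missing

# Fraction of NaN values per column, then the names above the cut-off
missing_fraction = toy.isna().mean()
mostly_missing = missing_fraction[missing_fraction > threshold].index.tolist()

# Drop the sparse columns plus the leftover index column
trimmed = toy.drop(columns=mostly_missing + ["Unnamed: 0"])
print(mostly_missing, list(trimmed.columns))
```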

### Cleaning Up Column Types
  
Convert each column to an appropriate type so that no "object" columns remain!
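
A minimal sketch of three common conversions, again on a made-up `toy` table. It assumes that placeholder text such as `"missing"` or `"unknown"` should become a proper missing value, which is what `errors="coerce"` does. Turning `gender` into a `category` is one option among several for text columns.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "weight_kg": ["80", "missing", "85"],
    "last_visit": ["2023-06-15", "unknown", "2023-01-01"],
    "gender": ["Male", "Other", "Female"],
})

# Text that cannot be parsed as a number becomes NaN
toy["weight_kg"] = pd.to_numeric(toy["weight_kg"], errors="coerce")

# Text that cannot be parsed as a date becomes NaT
toy["last_visit"] = pd.to_datetime(toy["last_visit"], errors="coerce")

# A small fixed set of labels can be stored as a categorical column
toy["gender"] = toy["gender"].astype("category")

print(toy.dtypes)
```

Keep in mind that coercing creates new missing values, so it is worth re-running `isna().sum()` afterwards.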

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### 3) Missing Value Imputation
  
Do you still have any missing values? How could you check?  
  
If you still have some, you could consider using "mean value imputation", "median value imputation" or some other variant of your choice. It is totally up to you! Make sure you document what you did and why, and try to explain your choices.
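
As one illustration, the sketch below fills a numeric column with its median and a text column with its most common value. The `toy` table is made up, and the choice of median and mode is only an assumption for this example. Other choices are just as valid if you can justify them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "age": [30.0, np.nan, 45.0, 35.0],
    "smoking_status": ["never", "former", None, "never"],
})

# Median imputation for a numeric column (less sensitive to outliers than the mean)
age_median = toy["age"].median()
toy["age"] = toy["age"].fillna(age_median)

# Most-frequent-value imputation for a text column
most_common = toy["smoking_status"].mode()[0]
toy["smoking_status"] = toy["smoking_status"].fillna(most_common)

print(toy)
```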

In [None]:
df

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### 4) Check your work! 
  
Now that you have cleaned the data, you should check your work! Use a similar approach to the one you used in the Data Exploration section to confirm that you have dealt with all the problems. There should now be no missing values, every column should have a single type (no "object") and there should be no duplicated rows.  
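
These three checks could be written as a handful of one-liners. The sketch below runs them on a made-up, already clean table called `cleaned`. In your notebook you would run the same checks on your own `df`.

```python
import pandas as pd

# Hypothetical, already-clean toy data for illustration only
cleaned = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [30.0, 35.0, 45.0],
    "weight_kg": [80.0, 82.5, 85.0],
})

# 1) No missing values anywhere in the table
no_missing = bool(cleaned.isna().sum().sum() == 0)

# 2) No rows that exactly repeat another row
no_duplicates = not cleaned.duplicated().any()

# 3) No columns left with the generic "object" type
object_columns = cleaned.select_dtypes(include="object").columns.tolist()

print(no_missing, no_duplicates, object_columns)
```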
  
This dataset would now be ready for you to do some analysis on it!