<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

# Data Cleaning and Pre-processing

Raw data is often messy. Before analysis or modeling, it's important to clean and prepare the data so that it is accurate, consistent, and structured in a way that can be used effectively.

**Data cleaning** refers to identifying and correcting errors or inconsistencies in the data. This may include:
- Handling missing values
- Removing duplicate records
- Standardizing formats (e.g., date or category labels)

**Data pre-processing** involves transforming raw data into a format suitable for analysis. This can include:
- Converting data types (e.g., strings to numbers)
- Encoding categorical variables
- Normalizing or scaling numerical values
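
As a rough illustration of the three pre-processing steps listed above, the sketch below uses a small made-up dataframe. The column names and values are assumptions chosen for the example, and min-max scaling is only one of several possible scaling choices.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
patients = pd.DataFrame({
    "age": ["34", "45", "23", "67"],        # numbers stored as strings
    "diagnosis": ["A", "B", "A", "C"],      # categorical labels
    "blood_pressure": [120.0, 135.0, 110.0, 150.0],
})

# 1) Convert data types: strings to integers
patients["age"] = patients["age"].astype(int)

# 2) Encode the categorical column as one indicator column per category
encoded = pd.get_dummies(patients, columns=["diagnosis"])

# 3) Min-max scale blood pressure so that it lies in the range [0, 1]
bp = encoded["blood_pressure"]
encoded["blood_pressure_scaled"] = (bp - bp.min()) / (bp.max() - bp.min())

print(encoded)
```

Each step produces a new or updated column, so the original values remain available for checking.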

Proper cleaning and pre-processing are essential steps to ensure reliable and reproducible results in any data analysis project.


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## The Data Science Workflow

The typical workflow in a data science project follows these steps:

1. **Data Collection**
   - Gather raw data from databases, surveys, files, sensors, or other sources.

2. **Data Cleaning** 
   - Remove or fix missing values, duplicates, and inconsistent entries.

3. **Data Pre-processing**
   - Convert data types, encode categorical variables, scale numerical values, and prepare the dataset for modeling.

4. **Model Building**
   - Train a machine learning or statistical model using the prepared data.

5. **Model Evaluation**
   - Assess how well the model performs using appropriate evaluation metrics.

6. **Downstream Analysis**
   - Use the model results for predictions, reporting, visualization, or informing decisions.


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

In this notebook, you will work through some data cleaning exercises. This is a critical stage of the data science pipeline.

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## 1) Duplicates  
  
Datasets frequently contain duplicate records. These might be caused by a computer or administrative error, or by two people entering the same case into an electronic health record (EHR) system. If you do not remove these duplicates, they can bias your results.  

Luckily, `pandas` has ready-made functions for dealing with duplicates.  
  
If you have a dataframe called `df`, then `df.drop_duplicates()` returns a copy of it with the duplicate rows removed.  
  
We demonstrate this with some fake data in the cell below.

In [1]:
# Import pandas library
import pandas as pd

# Create a small DataFrame with duplicate rows
data = {
    'patient_id': [101, 102, 103, 104, 105, 101, 106, 107, 102, 108],
    'age': [34, 45, 23, 67, 54, 34, 29, 40, 45, 60],
    'diagnosis': ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'C', 'B', 'A']
}

df = pd.DataFrame(data)

df

Unnamed: 0,patient_id,age,diagnosis
0,101,34,A
1,102,45,B
2,103,23,A
3,104,67,C
4,105,54,B
5,101,34,A
6,106,29,D
7,107,40,C
8,102,45,B
9,108,60,A


In [2]:
df_no_duplicates = df.drop_duplicates()

df_no_duplicates

Unnamed: 0,patient_id,age,diagnosis
0,101,34,A
1,102,45,B
2,103,23,A
3,104,67,C
4,105,54,B
6,106,29,D
7,107,40,C
9,108,60,A


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

**Important** 
  
`df.drop_duplicates()` only drops rows in which every column has the same value as in an earlier row, and by default it keeps the first occurrence. Make sure that this is what you want! If two rows share a patient ID but have a different diagnosis, *they are not duplicates* under this rule, and you need to think of some other way to handle them.
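
One way to find such conflicting records is sketched below. It uses made-up data, and it assumes that `patient_id` is the column that should be unique. The `subset` and `keep` arguments of `duplicated()` restrict the comparison to that column and flag every row involved.

```python
import pandas as pd

# Hypothetical records: patient 102 appears twice with different diagnoses
records = pd.DataFrame({
    "patient_id": [101, 102, 103, 102, 101],
    "age": [34, 45, 23, 45, 34],
    "diagnosis": ["A", "B", "A", "C", "A"],
})

# Exact duplicates (every column identical) are removed first
deduped = records.drop_duplicates()

# Rows that still share a patient_id are conflicts that need a manual decision
conflicts = deduped[deduped.duplicated(subset=["patient_id"], keep=False)]
print(conflicts)
```

The conflicting rows are only listed here, not resolved. Which record to keep is a judgment that depends on the data source.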

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## 2) Mixed Data Types
  
For ML models to work well, they need to be able to accurately tell the "type" of the data you give them: is it a string? An integer? A decimal? A picture?  
  
You may find situations where a column has mixed types. This needs to be fixed so that every value in the column has the same type. `pandas` handles this for us too!  
  
If you have a dataframe called `df` with a column `age` that should contain integers, you can use `df["age"] = df["age"].astype(int)`.  
The types you can use include:
* `str` for string (words/characters)
* `int` for integer 
* `float` for decimals

In [3]:
data = {
    'patient_id': [101, 102, 103, 104],
    'age': ['34', 45, '50', 60],  # mixed types: some strings, some integers
    'lab_result': ['4.5', 5.0, '6.2', 7.1]  # mixed types: strings and floats
}

df = pd.DataFrame(data)

df

Unnamed: 0,patient_id,age,lab_result
0,101,34,4.5
1,102,45,5.0
2,103,50,6.2
3,104,60,7.1


In [4]:
print(df.dtypes)

patient_id     int64
age           object
lab_result    object
dtype: object


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

You can see that the `age` column contains a mix of strings and integers, and the `lab_result` column contains a mix of strings and floats. Because of this mix, `pandas` stores both columns with the generic `object` dtype. This will confuse any model, so we need to correct it.

In [5]:
# correct the age column 
df["age"] = df["age"].astype(int)

# correct the lab_result column
df["lab_result"] = df["lab_result"].astype(float)

In [6]:
print(df.dtypes)

patient_id      int64
age             int64
lab_result    float64
dtype: object


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

Now you can see that we have fixed the problem! Some situations are more complex, with a wider mix of types or values that are harder to convert. We have not handled these here, but it is worth being aware of them.  
  
For example, what if numbers are recorded as words?

| patient_id | age     |
|------------|---------|
| 101        | 34      |
| 102        | forty   |
| 103        | 28      |
| 104        | fifty   |
| 105        | 41      |
| 106        | sixty   |
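
A minimal sketch of one way to handle the table above is shown here. The `word_to_number` lookup is a hypothetical helper that only covers the three words in this example, so a real dataset would need a fuller mapping. Any value that still cannot be parsed is turned into `NaN` by `errors="coerce"` instead of raising an error.

```python
import pandas as pd

ages = pd.Series(["34", "forty", "28", "fifty", "41", "sixty"])

# Hypothetical lookup table covering only the words in this example
word_to_number = {"forty": 40, "fifty": 50, "sixty": 60}

# Swap known words for numbers and leave every other value unchanged
ages_mapped = ages.map(lambda value: word_to_number.get(value, value))

# Convert to numbers; anything still unparseable becomes NaN
ages_numeric = pd.to_numeric(ages_mapped, errors="coerce")
print(ages_numeric)
```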
  
Another key example is dosages, where different entries might use different units that need to be converted!

| patient_id | dose   |
|------------|--------|
| 101        | 10mg   |
| 102        | 0.1g   |
| 103        | 100mg  |
| 104        | 20ug   |
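
The dosage table could be converted to a single unit along the following lines. This is a sketch under two assumptions: every entry is a number followed directly by a unit, and `ug` means micrograms. The `to_mg` conversion table is introduced here for illustration only.

```python
import pandas as pd

doses = pd.Series(["10mg", "0.1g", "100mg", "20ug"])

# Split each entry into a numeric part (column 0) and a unit part (column 1)
parts = doses.str.extract(r"^\s*([0-9.]+)\s*([a-zA-Z]+)\s*$")
values = parts[0].astype(float)
units = parts[1].str.lower()

# Conversion factors to milligrams (assumes "ug" means micrograms)
to_mg = {"g": 1000.0, "mg": 1.0, "ug": 0.001}

# Units missing from the table map to NaN, which flags them for review
dose_mg = values * units.map(to_mg)
print(dose_mg)
```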


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

## 3) Missing Data
  
Missing data points, where no value was recorded, are a prevalent problem in medical data. This can happen for various reasons: 
* the measurement wasn't taken for that patient
* the value wasn't recorded by mistake
* a processing error occurred
  
Handling missing data in a way that is consistent and fair is important in order to minimise any possible bias. There are many ways to handle missing data and no single method is perfect.

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### Missing data in Python
  
In Python and `pandas`, missing values can appear in several forms: 
* For numbers, a placeholder such as `0` or `np.inf` (infinity) might have been entered, depending on the context 
* For strings, a missing value might be left as an empty string `""` or a blank string like `" "` 
  
The most common marker is `NaN`, which stands for "Not a Number". If you see this value in a dataframe, it almost certainly corresponds to a missing value.
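
Placeholders such as blank strings or infinity are not counted as missing by `isna()`, so it can help to convert them to `NaN` first. The sketch below uses made-up data and assumes that infinity and blank strings are the only placeholders present. A `0` is deliberately left alone, because it is often a genuine measurement.

```python
import numpy as np
import pandas as pd

# Hypothetical data with disguised missing values
df_raw = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "weight_kg": [70.5, np.inf, 82.0, np.nan],
    "smoker": ["yes", " ", "no", ""],
})

df_clean = df_raw.copy()

# Treat infinite values as missing
df_clean["weight_kg"] = df_clean["weight_kg"].replace([np.inf, -np.inf], np.nan)

# Strip whitespace, then treat empty strings as missing
df_clean["smoker"] = df_clean["smoker"].str.strip().replace("", np.nan)

print(df_clean.isna().sum())
```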

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### Types of Missing Data

Understanding the type of missing data helps determine how to handle it appropriately. There are three main categories:

- **Missing Completely at Random (MCAR):**  
  The missingness is unrelated to any values in the dataset, observed or unobserved. In other words, the data is missing in a completely unpredictable way.  
  *Example:* A lab machine randomly fails to record a result due to a technical glitch.

- **Missing at Random (MAR):**  
  The missingness is related to observed data, but not to the value that is missing itself.  
  *Example:* Older patients are less likely to report income on a survey. The missingness depends on age (which is observed), but not on the income value itself.

- **Missing Not at Random (MNAR):**  
  The missingness is related to the value that is missing.  
  *Example:* Patients with higher levels of depression are more likely to skip a mental health questionnaire. The reason for missingness is tied to the unobserved value (depression severity).

Identifying the type of missingness is important because it influences which methods are valid for imputing or analyzing the data.
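
A simple check that can hint at whether missingness depends on an observed variable is sketched below, using made-up survey data that mirrors the MAR example above. A clear difference between the two groups suggests the data is not MCAR. This check cannot reveal MNAR, because the missing values themselves are never observed.

```python
import numpy as np
import pandas as pd

# Hypothetical survey in which older patients report income less often
survey = pd.DataFrame({
    "age": [25, 31, 38, 44, 52, 61, 67, 73],
    "income": [28000, 35000, 41000, 39000, np.nan, 45000, np.nan, np.nan],
})

# Compare the observed variable (age) between rows with and without income
age_when_missing = survey.loc[survey["income"].isna(), "age"].mean()
age_when_present = survey.loc[survey["income"].notna(), "age"].mean()

print(f"Mean age when income is missing: {age_when_missing:.1f}")
print(f"Mean age when income is present: {age_when_present:.1f}")
```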


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### Ways to handle missing data 

#### 1) Drop missing values
The simplest option is to delete missing values, but this has to be done consistently. Most machine learning models require every patient to have a value for every feature, so we cannot drop a feature for one patient and keep it for the others. If we decide to delete, or "drop", a feature because it is missing for some patients, then we need to drop it for all patients.  
  
It is worth considering how this will affect our results if we do this. Removing a feature means removing information that could determine our outcome. Ideally, we'd like as much information input to the model as we can, so we'd prefer to not delete any features. However, if a particular feature is missing not at random, or is missing in a very large fraction of the patients, it is probably not reliable for a model and should be dropped.  
  
This can be done in pandas using the syntax shown in the following cell.

In [7]:
import numpy as np 

# Create a small dataset with some missing values
data = {
    'patient_id': [1, 2, 3, 4, 5],
    'age': [25, 30, np.nan, 45, np.nan],
    'blood_pressure': [120, 130, 125, 135, 128]
}

df = pd.DataFrame(data)

df


Unnamed: 0,patient_id,age,blood_pressure
0,1,25.0,120
1,2,30.0,130
2,3,,125
3,4,45.0,135
4,5,,128


In [8]:
# This cell shows how we can check how many values are missing from each column
print("\nMissing values per column:")
print(df.isna().sum())


Missing values per column:
patient_id        0
age               2
blood_pressure    0
dtype: int64


In [9]:
# This cell shows how we can delete a particular column 
df_dropped = df.drop(columns=['age'])

print("\nDataFrame after dropping the 'age' column:")
df_dropped


DataFrame after dropping the 'age' column:


Unnamed: 0,patient_id,blood_pressure
0,1,120
1,2,130
2,3,125
3,4,135
4,5,128
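
Rather than naming the column by hand, the decision can be based on the fraction of missing values in each column. The sketch below is one possible approach. The `cholesterol` column and the 0.5 cut-off are assumptions made for the example, not recommended values.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "age": [25, 30, np.nan, 45, np.nan],
    "blood_pressure": [120, 130, 125, 135, 128],
    "cholesterol": [np.nan, np.nan, np.nan, 5.2, np.nan],  # hypothetical column
})

# Fraction of missing values in each column
missing_fraction = df_example.isna().mean()

# Hypothetical cut-off: drop columns with more than half their values missing
threshold = 0.5
columns_to_drop = missing_fraction[missing_fraction > threshold].index
df_reduced = df_example.drop(columns=columns_to_drop)

print(df_reduced.columns.tolist())
```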


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

We can also delete rows instead of columns, normally by referring to the index of each row. This is much less common, however, because it can introduce serious bias if the deleted patients are missing a value for a particular reason.

In [10]:
df_dropped_row = df.drop(index=[2, 4])
df_dropped_row

Unnamed: 0,patient_id,age,blood_pressure
0,1,25.0,120
1,2,30.0,130
3,4,45.0,135
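
When the row indices are not known in advance, `dropna()` can select the rows instead. The sketch below rebuilds the same small dataset and drops every row in which `age` is missing. The same caution about bias applies.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "age": [25, 30, np.nan, 45, np.nan],
    "blood_pressure": [120, 130, 125, 135, 128],
})

# Drop every row that has a missing value in the 'age' column
df_rows_dropped = df_example.dropna(subset=["age"])
print(df_rows_dropped)
```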


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

#### 2) Imputation 
The general approach to handling missing values—when they occur at random and are not overly frequent—is to fill in the missing entries with a *sensible* estimate. This process is called **imputation**.

The choice of value for imputation depends on the context. Common strategies include:

- **Mean or Median Imputation**  
  Missing values can be replaced with the average (mean) or median of the observed values for that feature.  
  For example, if a patient's age is missing, we might impute it using the mean or median age of all other patients. Median is often preferred when the data is skewed.

- **Minimum or Maximum Value Imputation**  
  In some cases, we might fill missing values with the minimum or maximum value of the feature. This is less common and may be used when we believe the data is **not missing at random** and we want to make conservative assumptions (e.g., imputing the worst-case scenario).

Other imputation methods exist, including:
- Using a **constant** value (e.g., 0 or "unknown")
- **Forward/backward fill** in time series data
- More advanced techniques like **regression imputation** or **multiple imputation**

Imputation always introduces some uncertainty, so it's important to document the method used and consider its impact on downstream analyses.
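
Constant imputation and forward fill can be sketched as follows. The data is made up, and it assumes the rows are already sorted by time, which forward fill requires in order to be meaningful.

```python
import numpy as np
import pandas as pd

# Hypothetical daily observations for one patient, sorted by day
visits = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "heart_rate": [72.0, np.nan, np.nan, 80.0, np.nan],
    "ward": ["A", None, "B", None, "B"],
})

# Constant imputation: label missing categories explicitly
visits["ward"] = visits["ward"].fillna("unknown")

# Forward fill: carry the last observed measurement forward in time
visits["heart_rate_ffill"] = visits["heart_rate"].ffill()

print(visits)
```

Writing the filled values to a new column keeps the original measurements available for comparison.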

<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

This next cell shows how we can do mean value imputation. To use this safely, you need to be confident that enough values in the feature are *not* missing for the mean to be a useful quantity. 
  
In situations where a lot of data is missing, the mean becomes unreliable, so it is necessary to drop the feature instead.

In [11]:
# Create a dataset with missing values in 'age' and 'blood_pressure'
data = {
    'patient_id': [101, 102, 103, 104, 105, 106, 107, 108],
    'age': [25, 32, np.nan, 45, 29, np.nan, 38, 41],
    'blood_pressure': [120, 125, 130, np.nan, 118, 122, 127, np.nan]
}

df = pd.DataFrame(data)

df

Unnamed: 0,patient_id,age,blood_pressure
0,101,25.0,120.0
1,102,32.0,125.0
2,103,,130.0
3,104,45.0,
4,105,29.0,118.0
5,106,,122.0
6,107,38.0,127.0
7,108,41.0,


We can see that `df` is missing values in both `blood_pressure` and `age`. We are going to use mean imputation to fill in the age and median imputation to fill in the blood pressure. 

In [12]:
df_2 = df.copy()

# Mean imputation for 'age'
age_mean = df_2['age'].mean()
df_2['age'] = df_2['age'].fillna(age_mean)

# Median imputation for 'blood_pressure'
bp_median = df_2['blood_pressure'].median()
df_2['blood_pressure'] = df_2['blood_pressure'].fillna(bp_median)

df_2

Unnamed: 0,patient_id,age,blood_pressure
0,101,25.0,120.0
1,102,32.0,125.0
2,103,35.0,130.0
3,104,45.0,123.5
4,105,29.0,118.0
5,106,35.0,122.0
6,107,38.0,127.0
7,108,41.0,123.5


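The code for minimum value imputation is similar. Note that this cell modifies `df` directly rather than a copy, and that `blood_pressure` is left with its missing values.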

In [13]:
age_min = df['age'].min()
df['age'] = df['age'].fillna(age_min)
df

Unnamed: 0,patient_id,age,blood_pressure
0,101,25.0,120.0
1,102,32.0,125.0
2,103,25.0,130.0
3,104,45.0,
4,105,29.0,118.0
5,106,25.0,122.0
6,107,38.0,127.0
7,108,41.0,


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### Risks of Imputation

Imputation is a common strategy for handling missing data, but it comes with important risks and limitations that must be considered:

- **Introduction of Bias:**  
  Imputing missing values—especially with simple techniques like mean or median—can distort the true distribution of the data. For example, if the missing values are not missing at random, filling them with the average can systematically bias the results.

- **Underestimation of Variability:**  
  Replacing missing values with a constant (like the mean) reduces variability in the data. This can make statistical analyses appear more certain than they truly are, and can affect confidence intervals or p-values.

- **Distortion of Relationships:**  
  Imputation can weaken or distort relationships between variables, especially if the imputed values do not reflect the actual structure or correlations in the data.

- **False Sense of Completeness:**  
  After imputation, a dataset may look complete, but the uncertainty introduced by missing values remains. It's important not to interpret imputed values as if they were directly observed.

**Best Practice:**  
Always evaluate the impact of imputation on your results. Consider:
- Comparing results before and after imputation
- Documenting which values were imputed
- Using advanced imputation methods (e.g., multiple imputation) when appropriate
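
Two of these practices can be sketched in a few lines: flagging which values were imputed, and comparing a summary statistic before and after. The example uses made-up ages and mean imputation, and the `age_was_imputed` column name is an assumption.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"age": [25, 32, np.nan, 45, 29, np.nan, 38, 41]})

# Record which values are about to be imputed before filling them
ages["age_was_imputed"] = ages["age"].isna()

# Standard deviation of the observed values only (NaN is skipped)
std_before = ages["age"].std()

# Mean imputation, then the standard deviation of the filled column
ages["age"] = ages["age"].fillna(ages["age"].mean())
std_after = ages["age"].std()

print(f"std before: {std_before:.2f}, std after: {std_after:.2f}")
```

The standard deviation shrinks after mean imputation, which is the underestimation of variability described above.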


<div style="border: 2px solid #ddd; background-color: #f9f9f9; padding: 15px; margin: 10px 0; border-radius: 8px; color: #333;">

### Exercise  

Now, in order to practice what you have learned about data cleaning, please go to the data cleaning exercise notebook.