# Lesson 6: Data Cleaning

This is a lesson on data cleaning techniques, a crucial step in any Data Science and Machine Learning project. We will go through the most common issues in raw data and how to handle them effectively using Python, based on examples from the textbook.

**Main Content:**
1.  **Basic Data Cleaning:** Handling useless columns and duplicate data.
2.  **Handling Outliers:** Identifying and removing abnormal values.
3.  **Handling Missing Data:** Strategies from simple to complex for dealing with missing values.
4.  **Guide to Using Pipelines:** Encapsulating the workflow for automation and avoiding data leakage.

## 1. Setup and Import Libraries

In [125]:
# Import necessary libraries
import pandas as pd
import numpy as np
import io

## 2. Basic Data Cleaning (Chapter 5)

We will use a simplified version of the `oil-spill` dataset for illustration.

In [126]:
# Create sample oil-spill data
csv_data = '''f_1,f_2,f_3,f_4,f_5
1,25.4,3.8,0,10
2,22.3,4.1,0,12
3,26.1,3.7,0,10
4,24.8,3.9,0,11
2,22.3,4.1,0,12''' # Duplicate row

df_oil = pd.read_csv(io.StringIO(csv_data))
print("Initial oil-spill data:")
print(df_oil)

Initial oil-spill data:
   f_1   f_2  f_3  f_4  f_5
0    1  25.4  3.8    0   10
1    2  22.3  4.1    0   12
2    3  26.1  3.7    0   10
3    4  24.8  3.9    0   11
4    2  22.3  4.1    0   12


### 2.1. Remove Zero-Variance Columns

**Mathematical Explanation:**
The variance of a variable `X` measures the spread of its values around the mean (μ). The formula is:
$$ Var(X) = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} $$
If the variance is 0, it means that every value $x_i$ is equal to the mean μ. In other words, all values in that column are identical, and the column provides no information to distinguish between samples.
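
As a quick check of the formula, here is a minimal sketch with NumPy. The two arrays mirror the `f_4` and `f_5` columns of the sample data, and `population_variance` is a helper name introduced only for this illustration. Note that `np.var` divides by `N` by default, as in the formula, whereas pandas' `.var()` divides by `N - 1`.

```python
import numpy as np

f_4 = np.array([0, 0, 0, 0, 0])       # constant column
f_5 = np.array([10, 12, 10, 11, 12])  # varying column

def population_variance(x):
    """Var(X) = sum((x_i - mu)^2) / N, as in the formula above."""
    mu = x.mean()
    return ((x - mu) ** 2).sum() / len(x)

var_f4 = population_variance(f_4)
var_f5 = population_variance(f_5)
print(var_f4, var_f5)
```

The constant column has a variance of exactly 0, which is the case `VarianceThreshold(threshold=0)` is meant to catch.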

--- 
### Details on `sklearn.feature_selection.VarianceThreshold`
This is a transformer class in Scikit-learn used as a basic feature selection step.

**Usage:**
1. Initialize a `VarianceThreshold` object with the desired `threshold`.
2. Use the `.fit_transform()` method on the data to both learn (calculate variance from the data) and transform (remove columns that don't meet the threshold).

**Important Parameters:**
- `threshold` (float, default=0.0): The variance threshold. Any feature with a variance less than or equal to this threshold will be removed.

In [127]:
# Import the required class
from sklearn.feature_selection import VarianceThreshold

In [128]:
print("Initial data:")
print(df_oil)

# Use VarianceThreshold to remove columns with zero variance
transformer = VarianceThreshold(threshold=0)
# Note: VarianceThreshold only works on numerical data
data_transformed = transformer.fit_transform(df_oil)

# Get the names of the retained columns
retained_cols = transformer.get_feature_names_out(input_features=df_oil.columns)

# Create a new DataFrame
data_cleaned = pd.DataFrame(data_transformed, columns=retained_cols)

print("\nData after removing zero-variance column (f_4):")
print(data_cleaned)

Initial data:
   f_1   f_2  f_3  f_4  f_5
0    1  25.4  3.8    0   10
1    2  22.3  4.1    0   12
2    3  26.1  3.7    0   10
3    4  24.8  3.9    0   11
4    2  22.3  4.1    0   12

Data after removing zero-variance column (f_4):
   f_1   f_2  f_3   f_5
0  1.0  25.4  3.8  10.0
1  2.0  22.3  4.1  12.0
2  3.0  26.1  3.7  10.0
3  4.0  24.8  3.9  11.0
4  2.0  22.3  4.1  12.0


In [129]:
df_oil.nunique()==1

f_1    False
f_2    False
f_3    False
f_4     True
f_5    False
dtype: bool

In [130]:
df_oil.drop(columns=df_oil.columns[df_oil.nunique()==1])

   f_1   f_2  f_3  f_5
0    1  25.4  3.8   10
1    2  22.3  4.1   12
2    3  26.1  3.7   10
3    4  24.8  3.9   11
4    2  22.3  4.1   12


### 2.2. Remove Duplicate Rows

**Logical Explanation:**
A duplicate row is a row where all column values are identical to another existing row. Keeping these duplicates can skew analysis and lead to falsely optimistic model evaluations, as the model might be trained and tested on the same data.

In [131]:
print("Initial data:")
print(df_oil)

# Check for duplicate rows
print(f"\nNumber of duplicate rows: {df_oil.duplicated().sum()}")

# Remove duplicate rows
data_no_dup = df_oil.drop_duplicates()
print("\nData after removing duplicate rows:")
print(data_no_dup)

Initial data:
   f_1   f_2  f_3  f_4  f_5
0    1  25.4  3.8    0   10
1    2  22.3  4.1    0   12
2    3  26.1  3.7    0   10
3    4  24.8  3.9    0   11
4    2  22.3  4.1    0   12

Number of duplicate rows: 1

Data after removing duplicate rows:
   f_1   f_2  f_3  f_4  f_5
0    1  25.4  3.8    0   10
1    2  22.3  4.1    0   12
2    3  26.1  3.7    0   10
3    4  24.8  3.9    0   11


### 2.3. Lab #1: Practice with full data

In [132]:
url_lab1 = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
df_lab1 = pd.read_csv(url_lab1, header=None)
df_lab1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,1,2558,1506.09,456.63,90,6395000.0,40.88,7.89,29780.0,0.19,...,2850.0,1000.0,763.16,135.46,3.73,0,33243.19,65.74,7.95,1
1,2,22325,79.11,841.03,180,55812500.0,51.11,1.21,61900.0,0.02,...,5750.0,11500.0,9593.48,1648.8,0.6,0,51572.04,65.73,6.26,0
2,3,115,1449.85,608.43,88,287500.0,40.42,7.34,3340.0,0.18,...,1400.0,250.0,150.0,45.13,9.33,1,31692.84,65.81,7.84,1
3,4,1201,1562.53,295.65,66,3002500.0,42.4,7.97,18030.0,0.19,...,6041.52,761.58,453.21,144.97,13.33,1,37696.21,65.67,8.07,1
4,5,312,950.27,440.86,37,780000.0,41.43,7.03,3350.0,0.17,...,1320.04,710.63,512.54,109.16,2.58,0,29038.17,65.66,7.35,0
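
One possible way to approach this lab is sketched below as a small helper. `basic_clean` is a hypothetical name, not part of the lesson's code; it simply chains the two steps from sections 2.1 and 2.2. It is demonstrated on a toy frame so that it runs without network access.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop single-valued columns, then drop fully duplicated rows."""
    single_valued = df.columns[df.nunique() == 1]
    cleaned = df.drop(columns=single_valued)
    return cleaned.drop_duplicates().reset_index(drop=True)

toy = pd.DataFrame({
    "a": [1, 2, 2, 3],
    "b": [0, 0, 0, 0],          # zero variance
    "c": [5.0, 6.0, 6.0, 7.0],
})
result = basic_clean(toy)
print(result.shape)
```

Under these assumptions, `basic_clean(df_lab1)` would apply the same two steps to the full oil-spill data. Dropping columns first is a design choice: rows that differ only in a removed column become duplicates and are then removed as well.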


## 3. Identify and Remove Outliers (Chapter 6)

We will use the `housing` dataset for illustration. This dataset contains information about house prices, and extremely large or small values could be outliers.

In [133]:
# Create sample housing data
csv_housing = '''CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.9,80,20,0,0.6,12,90,2,5,666,20,350,30,500''' # Row that may contain outliers

df_housing = pd.read_csv(io.StringIO(csv_housing))
print("Initial housing data:")
df_housing

Initial housing data:


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
5,0.9,80,20.0,0,0.6,12.0,90.0,2.0,5,666,20.0,350.0,30.0,500.0


### 3.1. Standard Deviation Method

**Mathematical Explanation:**
This method assumes the data has a Gaussian (bell-shaped) distribution. We define a boundary based on the mean (μ) and standard deviation (σ).
- **Upper bound:** $ \mu + 3 \times \sigma $
- **Lower bound:** $ \mu - 3 \times \sigma $
Any data point falling outside this range is considered an outlier. The number 3 is common but can be adjusted.

In [134]:
# Consider the MEDV column (house price)
data_col = df_housing['MEDV']

# Calculate the limits
mean, std = data_col.mean(), data_col.std()
cut_off = std * 2 # Using 2 std to make outliers more visible in this small dataset
lower, upper = mean - cut_off, mean + cut_off

# Identify outliers
outliers = df_housing[(data_col < lower) | (data_col > upper)]
print(f"Found {len(outliers)} outliers.")
print(outliers[['RM', 'MEDV']])

# Remove outliers
data_cleaned_outlier = df_housing[(data_col >= lower) & (data_col <= upper)]
print(f"\nOriginal data size: {len(df_housing)}")
print(f"Data size after cleaning: {len(data_cleaned_outlier)}")

Found 1 outliers.
     RM   MEDV
5  12.0  500.0

Original data size: 6
Data size after cleaning: 5


### 3.2. Interquartile Range (IQR) Method

**Mathematical Explanation:**
This method does not require the data to have a Gaussian distribution. It is based on quartiles:
- **Q1:** The first quartile (25% of data is below it).
- **Q3:** The third quartile (75% of data is below it).
- **IQR (Interquartile Range):** $ Q3 - Q1 $

The boundaries are defined as:
- **Upper bound:** $ Q3 + 1.5 \times IQR $
- **Lower bound:** $ Q1 - 1.5 \times IQR $
Points falling outside this range are considered outliers.

In [135]:
# Reusing the MEDV column
data_col = df_housing['MEDV']

# Calculate Q1, Q3, and IQR
Q1 = data_col.quantile(0.25)
Q3 = data_col.quantile(0.75)
IQR = Q3 - Q1

# Calculate the limits
lower_iqr, upper_iqr = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Identify outliers
outliers_iqr = df_housing[(data_col < lower_iqr) | (data_col > upper_iqr)]
print(f"Found {len(outliers_iqr)} outliers using IQR.")
print(outliers_iqr[['RM', 'MEDV']])

Found 1 outliers using IQR.
     RM   MEDV
5  12.0  500.0


In [136]:
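# Per-column lower bound: mean - 2 * std
lower = df_housing.mean() - 2 * df_housing.std()
# Flag rows where at least one of the 14 columns falls below its lower bound
(df_housing >= lower).sum(axis=1) < 14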

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

In [137]:
column_names = [
    "CRIM",    # Per-capita crime rate by town
    "ZN",      # Proportion of residential land zoned for lots over 25,000 sq. ft.
    "INDUS",   # Proportion of non-retail business acres
    "CHAS",    # Charles River dummy variable (1 if near the river, 0 otherwise)
    "NOX",     # Nitric oxide concentration (parts per 10 million)
    "RM",      # Average number of rooms per dwelling
    "AGE",     # % of units built before 1940
    "DIS",     # Weighted distance to 5 employment centers
    "RAD",     # Index of accessibility to highways
    "TAX",     # Property-tax rate
    "PTRATIO", # Pupil-teacher ratio
    "B",       # 1000(Bk - 0.63)^2, where Bk is the % of Black residents
    "LSTAT",   # % of population with lower socioeconomic status
    "MEDV"     # Median home value (thousands of USD)
]
url_lab2 = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
df_lab2 = pd.read_csv(url_lab2, header=None, names=column_names)
df_lab2.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [138]:
X = df_lab2.drop(columns=['MEDV'])
y = df_lab2['MEDV']

In [139]:
X.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')

In [140]:
from sklearn.model_selection import train_test_split
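# Local course module (source not shown here); the cells below expect it to
# provide OutlierIQRRemove, OutlierStdRemove, and ModelComparer
from data_cleaning_module import *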

In [141]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
remover_iqr = OutlierIQRRemove(columns=['TAX','PTRATIO', 'B', 'LSTAT']) # Only the transformer needs to be defined

# 2. Initialize and run the comparison
comparer_iqr = ModelComparer(model=model, preprocessor=remover_iqr)
results_df = comparer_iqr.compare(X, y)

print(results_df)

           Model Score  Train Samples  Test Samples
Dataset                                            
Raw           0.668759            404           102
Processed     0.716205            335            87


In [142]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
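remover_std = OutlierStdRemove(columns=['TAX','PTRATIO', 'B', 'LSTAT']) # Only the transformer needs to be defined

# 2. Initialize and run the comparison
# Note: the output recorded below matches the IQR table exactly, which suggests
# it was produced with remover_iqr; re-run this cell for the std-based result.
comparer_std = ModelComparer(model=model, preprocessor=remover_std)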
results_df = comparer_std.compare(X, y)

print(results_df)

           Model Score  Train Samples  Test Samples
Dataset                                            
Raw           0.668759            404           102
Processed     0.716205            335            87
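
`OutlierIQRRemove`, `OutlierStdRemove`, and `ModelComparer` come from the local `data_cleaning_module`, whose source is not shown in this lesson. Purely as an illustration of the idea, and not as the module's actual implementation, the sketch below learns IQR bounds on a training frame and then uses them to filter rows. The function names `fit_iqr_bounds` and `filter_by_bounds` and the toy data are hypothetical.

```python
import pandas as pd

def fit_iqr_bounds(df, columns, k=1.5):
    """Learn per-column (lower, upper) IQR bounds from the given data."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def filter_by_bounds(df, y, lower, upper):
    """Keep rows whose values lie within the bounds in every selected column."""
    cols = lower.index
    mask = ((df[cols] >= lower) & (df[cols] <= upper)).all(axis=1)
    return df[mask], y[mask]

X_toy = pd.DataFrame({"TAX": [200, 210, 220, 230, 240, 900],
                      "RM": [6.0, 6.1, 6.2, 6.3, 6.4, 6.5]})
y_toy = pd.Series([20, 21, 22, 23, 24, 50])

lower_b, upper_b = fit_iqr_bounds(X_toy, ["TAX"])
X_kept, y_kept = filter_by_bounds(X_toy, y_toy, lower_b, upper_b)
print(len(X_toy), "->", len(X_kept))
```

Learning the bounds on the training split and reusing them on the test split would keep test-set statistics out of the preprocessing, which is the same concern Section 5 addresses with pipelines.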


### 3.3. Automatic Outlier Detection (Local Outlier Factor)

**Logical Explanation:**
Local Outlier Factor (LOF) does not look at the entire dataset but compares the density of a data point to the density of its "neighbors".
- If a point has a significantly lower density than its neighbors, it is considered to be in a "sparse" region and is likely an outlier.
- The algorithm assigns a score to each sample. The higher the LOF score, the more likely it is an outlier. Scikit-learn stores the negated score in `negative_outlier_factor_` (so there, lower values indicate outliers) and labels outliers as -1.

--- 
### Details on `sklearn.neighbors.LocalOutlierFactor`
This is an unsupervised learning model for anomaly detection.

**Usage:**
1. Initialize a `LocalOutlierFactor` object.
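2. Use the `.fit_predict()` method on the data. This method will return an array: `1` for normal points (inliers) and `-1` for outliers.
3. To score new, unseen data instead, create the object with `novelty=True`, call `.fit()` on the training data, and then `.predict()` on the new data. With the default `novelty=False`, `.predict()` is not available, and with `novelty=True`, `.fit_predict()` is not available.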



**Important Parameters:**
- `n_neighbors` (int, default=20): The number of neighbors used to calculate the local density. This is the most important parameter to tune.
- `contamination` (`'auto'` or float, default='auto'): The expected proportion of outliers in the dataset (e.g., 0.1 for 10%; a float must lie in the range (0, 0.5]). This parameter affects the model's decision threshold. The default `'auto'` determines the threshold as in the original algorithm's publication.
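
To label unseen data with LOF, scikit-learn requires `novelty=True`. A minimal sketch on synthetic data follows; the arrays `X_train_demo` and `X_new` are made up for illustration and are not part of the lesson's datasets.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=0.0, scale=1.0, size=(100, 2))  # one dense cluster
X_new = np.array([[0.1, -0.2],    # close to the cluster
                  [8.0, 8.0]])    # far from the cluster

# novelty=True: fit on the training data, then predict on new data
lof_novelty = LocalOutlierFactor(novelty=True)
lof_novelty.fit(X_train_demo)
labels = lof_novelty.predict(X_new)   # 1 = inlier, -1 = outlier
print(labels)
```

In this mode the model should only be used to predict on data it was not fitted on; for labelling the training data itself, the default `novelty=False` with `.fit_predict()` is the intended route.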

In [143]:
# Import the required class
from sklearn.neighbors import LocalOutlierFactor

In [144]:
# Use LOF on the entire housing dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(df_housing)

# Filter out the outliers (LOF labels outliers as -1)
mask = yhat != -1
print(f"Number of outliers found: {sum(yhat == -1)}")
print(f"Data size after removing outliers: {df_housing[mask].shape}")

Number of outliers found: 0
Data size after removing outliers: (6, 14)




In [145]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)


In [146]:
X_train

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
182,0.09103,0.0,2.46,0,0.4880,7.155,92.2,2.7006,3,193.0,17.8,394.12,4.82
155,3.53501,0.0,19.58,1,0.8710,6.152,82.6,1.7455,5,403.0,14.7,88.01,15.02
280,0.03578,20.0,3.33,0,0.4429,7.820,64.5,4.6947,5,216.0,14.9,387.31,3.76
126,0.38735,0.0,25.65,0,0.5810,5.613,95.6,1.7572,2,188.0,19.1,359.29,27.26
329,0.06724,0.0,3.24,0,0.4600,6.333,17.2,5.2146,4,430.0,16.9,375.21,7.34
...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,0.17120,0.0,8.56,0,0.5200,5.836,91.9,2.2110,5,384.0,20.9,395.67,18.66
270,0.29916,20.0,6.96,0,0.4640,5.856,42.1,4.4290,3,223.0,18.6,388.65,13.00
348,0.01501,80.0,2.01,0,0.4350,6.635,29.7,8.3440,4,280.0,17.0,390.94,5.99
435,11.16040,0.0,18.10,0,0.7400,6.629,94.6,2.1247,24,666.0,20.2,109.85,23.27


In [147]:
X_train_clean = X_train.fillna(value=0)
X_train_clean.shape

(379, 13)

In [148]:
# Use LOF on the training split of the full housing data
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train_clean)

# Filter out the outliers (LOF labels outliers as -1)
mask = yhat != -1
print(f"Number of outliers found: {sum(yhat == -1)}")
print(f"Data size after removing outliers: {X_train_clean[mask].shape}")

Number of outliers found: 33
Data size after removing outliers: (346, 13)
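
When rows are removed from the features, the same rows have to be removed from the target. A minimal sketch on synthetic data is shown below; `X_demo` and `y_demo` are placeholder names, and in this lesson the same indexing would be applied to `X_train_clean` and `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
X_demo.iloc[0] = [25.0, 25.0, 25.0]          # one obvious outlier
y_demo = pd.Series(rng.normal(size=50))

inlier_mask = LocalOutlierFactor().fit_predict(X_demo) != -1

# Index features and target with the same boolean mask
X_inliers, y_inliers = X_demo[inlier_mask], y_demo[inlier_mask]
print(X_inliers.shape, y_inliers.shape)
```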


## 4. Handling Missing Data (Chapters 7-10)

We will use a simplified version of the `horse-colic` dataset for illustration.

In [149]:
# Create sample horse-colic data
csv_horse = '''hospital_number,rectal_temp,pulse,respiratory_rate,pain,outcome
530101,38.5,66,28,3,2
534817,39.2,88,20,?,1
530334,38.3,40,?,3,1
529048,39.1,164,84,4,2
526254,?,72,?,2,1'''

# Read data, considering '?' as a missing value
df_horse = pd.read_csv(io.StringIO(csv_horse), na_values='?')
print("Initial horse-colic data:")
print(df_horse)

Initial horse-colic data:
   hospital_number  rectal_temp  pulse  respiratory_rate  pain  outcome
0           530101         38.5     66              28.0   3.0        2
1           534817         39.2     88              20.0   NaN        1
2           530334         38.3     40               NaN   3.0        1
3           529048         39.1    164              84.0   4.0        2
4           526254          NaN     72               NaN   2.0        1


### 4.1. Marking and Removing Missing Data (Chapter 7)

**Logical Explanation:**
This is the simplest approach. If a row contains at least one missing value, we remove the entire row. The advantage is speed and simplicity, but the disadvantage is the potential loss of a large amount of valuable data if missing values are scattered.

In [150]:
print("Initial data:")
print(df_horse)

# Check the number of missing values
print("\nNumber of missing values per column:")
print(df_horse.isnull().sum())

# Remove rows with missing values
data_dropped = df_horse.dropna()
print("\nData after dropping missing rows:")
print(data_dropped)

Initial data:
   hospital_number  rectal_temp  pulse  respiratory_rate  pain  outcome
0           530101         38.5     66              28.0   3.0        2
1           534817         39.2     88              20.0   NaN        1
2           530334         38.3     40               NaN   3.0        1
3           529048         39.1    164              84.0   4.0        2
4           526254          NaN     72               NaN   2.0        1

Number of missing values per column:
hospital_number     0
rectal_temp         1
pulse               0
respiratory_rate    2
pain                1
outcome             0
dtype: int64

Data after dropping missing rows:
   hospital_number  rectal_temp  pulse  respiratory_rate  pain  outcome
0           530101         38.5     66              28.0   3.0        2
3           529048         39.1    164              84.0   4.0        2


### 4.2. Statistical Imputation (Chapter 8)

**Mathematical Explanation:**
We replace a missing value (NaN) in a column with a statistical value calculated from the remaining valid values in that column.
- **Mean:** $ \bar{x} = \frac{\sum x_i}{n} $. Suitable for numerical data without outliers.
- **Median:** The middle value of the column after sorting. Suitable for numerical data with outliers.
- **Most Frequent (Mode):** The value that appears most often. Suitable for categorical data.

--- 
### Details on `sklearn.impute.SimpleImputer`
This is a transformer class for handling missing values. It provides basic imputation strategies.

**Usage:**
1. Initialize a `SimpleImputer` object, specifying the imputation strategy via the `strategy` parameter.
2. Use `.fit_transform()` on the data.

**Important Parameters:**
- `strategy` (string, default='mean'): The imputation strategy. Possible values are `'mean'`, `'median'`, `'most_frequent'`, or `'constant'`. 
- `fill_value` (string or number, default=None): When `strategy='constant'`, this parameter is used to specify the value to be imputed.
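
The lesson's cells use `'median'` and `'mean'`. For completeness, here is a small sketch of the other two strategies on a toy categorical column; the data and the `'unknown'` placeholder are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One categorical column with a missing entry (object dtype)
colors = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)

mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(colors)

const_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
filled_const = const_imputer.fit_transform(colors)

print(filled_mode.ravel(), filled_const.ravel())
```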

In [151]:
# Import the required class
from sklearn.impute import SimpleImputer

In [152]:
# Use SimpleImputer with the median strategy
imputer = SimpleImputer(strategy='median')
data_imputed_median = imputer.fit_transform(df_horse)

print("Data after imputing with the median:")
print(pd.DataFrame(data_imputed_median, columns=df_horse.columns))

Data after imputing with the median:
   hospital_number  rectal_temp  pulse  respiratory_rate  pain  outcome
0         530101.0         38.5   66.0              28.0   3.0      2.0
1         534817.0         39.2   88.0              20.0   3.0      1.0
2         530334.0         38.3   40.0              28.0   3.0      1.0
3         529048.0         39.1  164.0              84.0   4.0      2.0
4         526254.0         38.8   72.0              28.0   2.0      1.0


In [153]:
# Assume that we are working on the training set
df_lab2.head()
imputer_lab2 = SimpleImputer(strategy='mean')
imputer_lab2.fit(df_lab2)

SimpleImputer()


In [154]:
df_lab2_imputed = imputer_lab2.transform(df_lab2)
df_lab2_imputed

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 3.9690e+02, 4.9800e+00,
        2.4000e+01],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 3.9690e+02, 9.1400e+00,
        2.1600e+01],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 3.9283e+02, 4.0300e+00,
        3.4700e+01],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 3.9690e+02, 5.6400e+00,
        2.3900e+01],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 3.9345e+02, 6.4800e+00,
        2.2000e+01],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 3.9690e+02, 7.8800e+00,
        1.1900e+01]])

### 4.3. KNN Imputation (Chapter 9)

**Logical Explanation:**
K-Nearest Neighbors (KNN) Imputation works on the idea that a data point can be predicted by the data points closest to it.
1. To impute a missing value, the algorithm finds the `k` nearest rows (neighbors) to the row with the missing value. "Nearest" is measured by distance (e.g., Euclidean distance) based on the columns that have values.
2. The missing value is then estimated by taking the average (or weighted average) of the values from those `k` neighbors in the same column.
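
The two steps above can be reproduced by hand and compared with `KNNImputer`. The sketch below assumes uniform weights and uses a small made-up array; it relies on `nan_euclidean_distances`, the NaN-aware distance that `KNNImputer` uses by default.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X_knn = np.array([[1.0, 2.0, np.nan],
                  [3.0, 4.0, 3.0],
                  [np.nan, 6.0, 5.0],
                  [8.0, 8.0, 7.0]])
k = 2

# Manual version of steps 1-2
dist = nan_euclidean_distances(X_knn)        # skips missing coordinates
X_manual = X_knn.copy()
for i, j in zip(*np.where(np.isnan(X_knn))):
    donors = np.where(~np.isnan(X_knn[:, j]))[0]        # rows that have column j
    nearest = donors[np.argsort(dist[i, donors])[:k]]   # k closest donors
    X_manual[i, j] = X_knn[nearest, j].mean()           # uniform average

X_sklearn = KNNImputer(n_neighbors=k).fit_transform(X_knn)
print(X_manual)
```

The manual loop is only meant to make the mechanics visible; it ignores edge cases (such as rows with no shared observed columns) that the library handles.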

--- 
### Details on `sklearn.impute.KNNImputer`
This transformer class imputes missing values using the k-Nearest Neighbors method.

**Usage:**
1. Initialize `KNNImputer` with the desired number of neighbors `n_neighbors`.
2. Use `.fit_transform()` to learn from the data and perform the imputation.



**Important Parameters:**
- `n_neighbors` (int, default=5): The number of neighbors to use for imputation.
- `weights` (string, default='uniform'): The weight function used in prediction. `'uniform'` means all neighbors are weighted equally. `'distance'` means that closer neighbors will have a greater influence.

In [155]:
# Import the required class
from sklearn.impute import KNNImputer

In [156]:
df_horse

   hospital_number  rectal_temp  pulse  respiratory_rate  pain  outcome
0           530101         38.5     66              28.0   3.0        2
1           534817         39.2     88              20.0   NaN        1
2           530334         38.3     40               NaN   3.0        1
3           529048         39.1    164              84.0   4.0        2
4           526254          NaN     72               NaN   2.0        1


In [157]:
# Use KNNImputer
knn_imputer = KNNImputer(n_neighbors=3, weights='uniform')
data_imputed_knn = knn_imputer.fit_transform(df_horse)

print("Data after KNN imputation:")
print(pd.DataFrame(data_imputed_knn, columns=df_horse.columns))

Data after KNN imputation:
   hospital_number  rectal_temp  pulse  respiratory_rate      pain  outcome
0         530101.0    38.500000   66.0              28.0  3.000000      2.0
1         534817.0    39.200000   88.0              20.0  3.333333      1.0
2         530334.0    38.300000   40.0              44.0  3.000000      1.0
3         529048.0    39.100000  164.0              84.0  4.000000      2.0
4         526254.0    38.633333   72.0              44.0  2.000000      1.0


### 4.4. Iterative Imputation (Chapter 10)

**Logical Explanation:**
This is a more sophisticated method that treats imputation as a machine learning problem.
1. For a column `C` with missing values, the algorithm treats column `C` as the target variable (`y`) and all other columns as input features (`X`).
2. It builds a regression model (e.g., Linear Regression) on the rows without missing data in column `C` to learn the relationship `y = f(X)`.
3. This model is then used to predict and fill in the missing values in column `C`.
4. This process is repeated for all columns with missing values, and the entire cycle is repeated multiple times (iteratively). In each iteration, the imputed values from the previous step are used to improve the next predictions until the results converge.
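
Steps 1 to 3 can be sketched for a single column as follows. This is a simplified illustration, not what `IterativeImputer` does internally: it uses plain `LinearRegression`, performs a single round, and works on a made-up frame in which only one column has missing values (so no initial fill of the other columns is needed).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df_iter = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                        "b": [2.0, 4.0, np.nan, 8.0, 10.0]})  # b = 2 * a

target = "b"                                  # step 1: column C is the target
features = df_iter.columns.drop(target)
observed = df_iter[target].notna()

# Step 2: fit y = f(X) on the rows where the target column is observed
reg = LinearRegression().fit(df_iter.loc[observed, features],
                             df_iter.loc[observed, target])

# Step 3: predict the missing entries and write them back
df_filled = df_iter.copy()
df_filled.loc[~observed, target] = reg.predict(df_iter.loc[~observed, features])
print(df_filled)
```

Step 4 would repeat this for every column with missing values, over several rounds, each time using the latest imputed values as inputs.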

--- 
### Details on `sklearn.impute.IterativeImputer`
This is a multivariate transformer class that imputes missing values by modeling each feature as a function of other features.

**Usage:**
1. Initialize `IterativeImputer`. The internal regression model can be customized via the `estimator` parameter.
2. Use `.fit_transform()` to perform the iterative imputation process.

**Important Parameters:**
- `estimator` (object, default=BayesianRidge()): The regression model used to predict missing values. Other models like `RandomForestRegressor` can be used.
- `max_iter` (int, default=10): The maximum number of imputation rounds.
- `random_state` (int): To ensure reproducible results.

In [158]:
# Import the required classes
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [159]:
# Use IterativeImputer
iter_imputer = IterativeImputer(max_iter=10, random_state=0)
data_imputed_iter = iter_imputer.fit_transform(df_horse)

print("Data after iterative imputation:")
print(pd.DataFrame(data_imputed_iter, columns=df_horse.columns))

Data after iterative imputation:
   hospital_number  rectal_temp  pulse  respiratory_rate      pain  outcome
0         530101.0     38.50000   66.0         28.000000  3.000000      2.0
1         534817.0     39.20000   88.0         20.000000  4.576976      1.0
2         530334.0     38.30000   40.0         13.338398  3.000000      1.0
3         529048.0     39.10000  164.0         84.000000  4.000000      2.0
4         526254.0     38.10825   72.0         47.142432  2.000000      1.0


### 4.5. Lab #3: Practice with full data

In [160]:
url_lab3 = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv"

col_names = [
    "surgery", "age", "hospital_number", "rectal_temp", "pulse",
    "respiratory_rate", "temp_extremities", "peripheral_pulse",
    "mucous_membrane", "capillary_refill", "pain", "peristalsis",
    "abdominal_distension", "nasogastric_tube", "nasogastric_reflux",
    "nasogastric_reflux_ph", "rectal_exam_feces", "abdomen",
    "packed_cell_volume", "total_protein", "abdomocentesis_appearance",
    "abdomocentesis_total_protein", "outcome", "surgical_lesion",
    "lesion_1", "lesion_2", "lesion_3", "lesion_4"
]

df_lab3 = pd.read_csv(url_lab3, header=None, names=col_names, na_values="?")

df_lab3.head()

Unnamed: 0,surgery,age,hospital_number,rectal_temp,pulse,respiratory_rate,temp_extremities,peripheral_pulse,mucous_membrane,capillary_refill,...,packed_cell_volume,total_protein,abdomocentesis_appearance,abdomocentesis_total_protein,outcome,surgical_lesion,lesion_1,lesion_2,lesion_3,lesion_4
0,2.0,1,530101,38.5,66.0,28.0,3.0,3.0,,2.0,...,45.0,8.4,,,2.0,2,11300,0,0,2
1,1.0,1,534817,39.2,88.0,20.0,,,4.0,1.0,...,50.0,85.0,2.0,2.0,3.0,2,2208,0,0,2
2,2.0,1,530334,38.3,40.0,24.0,1.0,1.0,3.0,1.0,...,33.0,6.7,,,1.0,2,0,0,0,1
3,1.0,9,5290409,39.1,164.0,84.0,4.0,1.0,6.0,2.0,...,48.0,7.2,3.0,5.3,2.0,1,2208,0,0,1
4,2.0,1,530255,37.3,104.0,35.0,,,6.0,2.0,...,74.0,7.4,,,2.0,2,4300,0,0,2


## 5. Automating the Workflow with Pipelines (Best Practice)

In practice, data cleaning steps (like imputation) and model training should not be separate. Scikit-learn's `Pipeline` is a powerful tool for combining multiple processing and modeling steps into a single workflow.

### 5.1. Why are Pipelines Important?

**Logical Explanation:**
Imagine you are imputing missing data with the mean. The wrong way to do it is to calculate the mean on the *entire* dataset, fill in the missing values, and *then* split into training and test sets. By doing this, information from the test set (specifically, its values contributing to the mean) has "leaked" into the training process. The resulting test score tends to be optimistically biased, so it may overstate how well the model will perform on real-world data.

**A `Pipeline` solves this by:**
1.  **Encapsulating the process:** It combines steps like `SimpleImputer` and `RandomForestClassifier` into a single object.
2.  **Preventing data leakage:** When you call `pipeline.fit(X_train, y_train)`, it will only "learn" (fit) the preprocessing steps (e.g., calculate the mean) on `X_train`. When you call `pipeline.predict(X_test)`, it will apply the learned transformation to `X_test` without re-learning. This accurately simulates the real-world workflow and ensures the integrity of the model evaluation.
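
The difference between the leaky and the correct approach can be seen directly in the statistic the imputer learns. A minimal sketch with made-up numbers (the arrays are illustrative only):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train_demo = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test_demo = np.array([[100.0], [np.nan]])

# Leaky: the mean is computed on train and test together
leaky = SimpleImputer(strategy="mean").fit(np.vstack([X_train_demo, X_test_demo]))

# Correct: the mean is computed on the training split only
clean = SimpleImputer(strategy="mean").fit(X_train_demo)

print(leaky.statistics_, clean.statistics_)
print(clean.transform(X_test_demo))
```

Here the leaky imputer learns a mean of 26.5 because the test value 100 is included, while the correctly fitted imputer learns 2.0 and fills the missing test entry with that training mean.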

### 5.2. Practical Example with a Pipeline

In this example, we will build a complete workflow: impute missing data and then train a classification model. We will explore the components used in more detail.

--- 
### Details on Components within the Pipeline

**1. `sklearn.model_selection.train_test_split`**
This is a utility function to split data into training and testing sets, ensuring that model evaluation is objective.
- **Usage:** `train_test_split(X, y, test_size=0.3, random_state=42)`
- **Important Parameters:**
  - `test_size` (float, default=0.25): The proportion of the dataset to allocate to the test set.
  - `random_state` (int): Ensures that the data split is the same every time the code is run, making results reproducible.
  - `stratify` (array-like, default=None): If provided (usually `y`), the data is split in a stratified fashion, preserving the same proportion of classes in both the train and test sets. Very useful for imbalanced classification problems.

**2. `sklearn.ensemble.RandomForestClassifier`**
A powerful classification model from the "ensemble" family of algorithms. It builds multiple decision trees and combines their results to make a final prediction, which increases stability and accuracy.
- **Usage:** Initialize a `RandomForestClassifier` object, then use `.fit()` to train and `.predict()` to make predictions.
- **Important Parameters:**
  - `n_estimators` (int, default=100): The number of decision trees in the forest.
  - `max_depth` (int, default=None): The maximum depth of each tree. Helps control model complexity and combat overfitting.
  - `criterion` (string, default='gini'): The function to measure the quality of a split. Can be `'gini'`, `'entropy'`, or (in recent scikit-learn versions) `'log_loss'`.

**3. `sklearn.pipeline.Pipeline`**
The main class for encapsulating processing steps and a model into a single object.
- **Usage:** Initialize `Pipeline` by passing a list of tuples. Each tuple contains an identifier name (string) and an estimator object (e.g., `('imputer', SimpleImputer())`). The Pipeline exposes `.fit()`, plus `.predict()` or `.transform()` depending on what its final step provides.
- **Important Parameters:**
  - `steps` (list): The list of processing steps and the model. The final step only needs to implement `.fit()`. All preceding steps must be transformers (implement `.fit()` and `.transform()`).

**4. `sklearn.metrics.accuracy_score`**
A function to measure model performance by calculating the percentage of correct predictions.
- **Usage:** `accuracy_score(y_true, y_pred)`
- **Important Parameters:**
  - `y_true`: The ground truth labels.
  - `y_pred`: The labels predicted by the model.

In [161]:
# Import all necessary classes for this example
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # Already imported, but good practice to have it here
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [162]:
# 1. Prepare data from horse-colic
# Drop rows where the target variable 'outcome' is missing
df_lab3_clean = df_lab3.dropna(subset=['outcome'])
X = df_lab3_clean.drop('outcome', axis=1)
y = df_lab3_clean['outcome']

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Create the Pipeline
# This Pipeline will consist of 2 steps:
# 'imputer': Impute missing values with the median.
# 'model': Train a Random Forest model.
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', RandomForestClassifier(random_state=42))
])

# 4. Train the entire Pipeline on the training set
# Scikit-learn will automatically:
# - Call imputer.fit_transform(X_train)
# - Then use the result to train model.fit(X_train_transformed, y_train)
pipeline.fit(X_train, y_train)
print(f"Pipeline has been trained.")

# 5. Evaluate the Pipeline on the test set
# Scikit-learn will automatically:
# - Call imputer.transform(X_test) (NOT re-fitting)
# - Then use the result to predict model.predict(X_test_transformed)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy on the test set: {accuracy:.4f}")

Pipeline has been trained.
Accuracy on the test set: 0.7333
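
Because the pipeline is a single estimator, it can also be passed to cross-validation, in which case the imputer is re-fitted on the training part of every fold. The sketch below uses synthetic data so that it runs on its own; the arrays, the 10% missing rate, and `n_estimators=50` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(120, 4))
y_cv = (X_cv[:, 0] + X_cv[:, 1] > 0).astype(int)
X_cv[rng.random(X_cv.shape) < 0.1] = np.nan     # knock out roughly 10% of the values

cv_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# The imputer is re-fitted on the training part of every fold
scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5)
print(scores.mean())
```

With the lesson's data, the equivalent call would be `cross_val_score(pipeline, X, y, cv=5)`.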



# Summary

We have covered the core data cleaning techniques:

- **Basic Cleaning:** Removing zero-variance features and duplicate samples.
- **Handling Outliers:** Using statistical methods and machine learning models to identify and remove abnormal data points.
- **Handling Missing Data:** From simple removal to sophisticated imputation techniques like KNN and Iterative Imputation.
- **Using `Pipeline`:** This is the strongly recommended method to combine preprocessing and modeling steps, ensuring a robust, reproducible, and data-leakage-free workflow.

Data cleaning is a foundational step that ensures the data fed into the model is reliable and of high quality, thereby improving the performance of the machine learning model.

### KNNImputer

In [163]:
# 3. Create the Pipeline
# This Pipeline will consist of 2 steps:
# 'imputer': Impute missing values with KNNImputer (default settings).
# 'model': Train a Random Forest model.
pipeline = Pipeline([
    ('imputer', KNNImputer()),
    ('model', RandomForestClassifier(random_state=42))
])

# 4. Train the entire Pipeline on the training set
# Scikit-learn will automatically:
# - Call imputer.fit_transform(X_train)
# - Then use the result to train model.fit(X_train_transformed, y_train)
pipeline.fit(X_train, y_train)
print(f"Pipeline has been trained.")

# 5. Evaluate the Pipeline on the test set
# Scikit-learn will automatically:
# - Call imputer.transform(X_test) (NOT re-fitting)
# - Then use the result to predict model.predict(X_test_transformed)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy on the test set: {accuracy:.4f}")

Pipeline has been trained.
Accuracy on the test set: 0.7444


In [164]:
# 1. Prepare data from horse-colic
# Drop rows where the target variable 'outcome' is missing
df_lab3_clean = df_lab3.dropna(subset=['outcome'])
X = df_lab3_clean.drop('outcome', axis=1)
y = df_lab3_clean['outcome']

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Create the Pipeline
# This Pipeline will consist of 2 steps:
# 'imputer': Impute missing values with IterativeImputer (default settings).
# 'model': Train a Random Forest model.
pipeline = Pipeline([
    ('imputer', IterativeImputer()),
    ('model', RandomForestClassifier(random_state=42))
])

# 4. Train the entire Pipeline on the training set
# Scikit-learn will automatically:
# - Call imputer.fit_transform(X_train)
# - Then use the result to train model.fit(X_train_transformed, y_train)
pipeline.fit(X_train, y_train)
print(f"Pipeline has been trained.")

# 5. Evaluate the Pipeline on the test set
# Scikit-learn will automatically:
# - Call imputer.transform(X_test) (NOT re-fitting)
# - Then use the result to predict model.predict(X_test_transformed)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy on the test set: {accuracy:.4f}")

Pipeline has been trained.
Accuracy on the test set: 0.7333
