# **Module 1: Machine Learning**
## **Introduction**

## Handling Missing Data

Handling missing data is a critical step in the machine learning pipeline because the presence of missing values can significantly impact the performance of models. Here are some common strategies for handling missing data:

### 1. Removing Missing Data
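If only a small share of the values is missing, you can remove the rows or columns that contain them.

**Pros:** Simple, and every remaining row is fully observed, so all downstream analyses are based on the same set of data.

**Cons:** Reduces the dataset size and may introduce bias if the values are not missing at random.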

In [None]:
import pandas as pd
import numpy as np

data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['cat', 'dog', 'cat', np.nan, 'dog']
}
df = pd.DataFrame(data)
df

In [None]:
print('data frame before: ')
print(df)

print('\ndata frame after: ')
df.dropna()  # Returns a new DataFrame without the rows that contain missing values; df itself is unchanged
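
The cell above drops rows only. The following is a minimal sketch of the other options the text mentions, using the `axis`, `thresh`, and `subset` parameters of pandas' `dropna` on the same sample frame. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['cat', 'dog', 'cat', np.nan, 'dog'],
})

# Share of missing values per column, to judge whether dropping is reasonable
missing_share = df.isna().mean()
print(missing_share)

# Drop every column that contains at least one missing value
# (here all three columns have gaps, so no columns survive)
df_no_nan_cols = df.dropna(axis=1)

# Keep only rows that have at least 2 non-missing values
df_thresh = df.dropna(thresh=2)

# Drop a row only when column 'A' is missing
df_subset = df.dropna(subset=['A'])

print(df_no_nan_cols.shape)
print(df_thresh)
print(df_subset)
```

On this toy frame, dropping columns removes everything, which illustrates why it helps to check the share of missing values before choosing between rows and columns.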

### 2. Imputation
**Mean/Median/Mode Imputation:**  
Replace missing values with the mean (numerical), median (numerical), or mode (categorical) of the column.  
**Pros:** Simple and quick.  
**Cons:** Can distort the distribution and reduce variance.

In [None]:
df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
df  # DataFrame before applying SimpleImputer with the 'mean' strategy

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
# fit_transform learns each column's mean and fills the gaps column by column
df[['A', 'B']] = imputer.fit_transform(df[['A', 'B']])
df  # DataFrame after imputing with the 'mean' strategy
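
The cell above covers the mean only. As a hedged sketch of the other two strategies named in the text, the example below uses `SimpleImputer` with `strategy='median'` on a numeric column and `strategy='most_frequent'` on a categorical one. The sample values, including the outlier `100`, are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: 'A' has an outlier, 'C' is categorical
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 100],
    'C': ['cat', 'dog', 'cat', np.nan, 'cat'],
})

# Median imputation: less sensitive to the outlier than the mean (26.75 here)
median_imputer = SimpleImputer(strategy='median')
df['A'] = median_imputer.fit_transform(df[['A']]).ravel()

# Mode imputation: fills with the most frequent category
mode_imputer = SimpleImputer(strategy='most_frequent')
df['C'] = mode_imputer.fit_transform(df[['C']]).ravel()

print(df)
```

The median of the observed values `1, 2, 4, 100` is `3.0`, so the gap in `A` is filled with a typical value rather than one pulled upward by the outlier.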

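**Forward/Backward Fill:**  
Replace each missing value with the previous (forward fill) or next (backward fill) observed value in ordered data such as a time series.  
**Pros:** Suitable for time series data, where neighboring observations tend to be similar.  
**Cons:** Not suitable for non-sequential data, because row order carries no meaning there.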


In [None]:
df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
print(df)
print('\ndata frame after: ')
df_ffill = df.ffill()
print(df_ffill)
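
The cell above shows forward fill only. The sketch below compares forward fill, backward fill, and a limited forward fill on a small date-indexed series. The series, its name, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
ts = pd.Series(
    [20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
    name='temperature',
)

filled_forward = ts.ffill()         # carry the last observed value forward
filled_backward = ts.bfill()        # pull the next observed value backward
filled_limited = ts.ffill(limit=1)  # fill at most one consecutive gap

print(pd.DataFrame({
    'original': ts,
    'ffill': filled_forward,
    'bfill': filled_backward,
    'ffill_limit_1': filled_limited,
}))
```

Backward fill leaves the last value missing because nothing follows it, and `limit=1` leaves the second day of the two-day gap unfilled, which can be preferable to propagating a stale value over long gaps.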


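**K-Nearest Neighbors (KNN) Imputation:**  
Fill each missing value with the average of that feature over the k most similar rows, where similarity is measured on the features that both rows have observed.  
**Pros:** Uses relationships between features, so it can be more accurate than a single column-wide statistic.  
**Cons:** Computationally expensive, especially with large datasets.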

In [None]:
from sklearn.impute import KNNImputer

# KNNImputer works on numeric features only, so the categorical column 'C' is left out
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
}
df = pd.DataFrame(data)
print('data frame before KNN: ')
print(df)

print('\ndata frame after KNN: ')
knn_imputer = KNNImputer(n_neighbors=2)
df[['A', 'B']] = knn_imputer.fit_transform(df[['A', 'B']])
df
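
To make the neighbor averaging concrete, the sketch below checks one imputed cell by hand. It assumes `KNNImputer`'s default settings (uniform weights and the NaN-aware Euclidean distance) and uses the same two numeric columns as above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [5, np.nan, np.nan, 8, 10]})
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df),
    columns=df.columns,
)

# Row 1 has A=2 and B missing. The rows with B observed have A=1 (B=5),
# A=4 (B=8) and A=5 (B=10). The two nearest by A are A=1 and A=4,
# so the imputed B is the average of their B values.
expected_b_row1 = (5 + 8) / 2

print(imputed.loc[1, 'B'], expected_b_row1)
```

Changing `n_neighbors` changes which rows contribute to the average, so it is worth treating it as a tunable parameter rather than a fixed choice.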

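**Iterative Imputer:**  
Modeled on MICE (Multivariate Imputation by Chained Equations): it models each feature with missing values as a function of the other features and repeats this in rounds, up to `max_iter` times. In scikit-learn the estimator is experimental and must be enabled explicitly before it can be imported.  
**Pros:** Uses all features jointly, which can give better estimates than single-column statistics.  
**Cons:** Complex and computationally intensive.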

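In [None]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer

# Rebuild the frame: the KNN cell above already filled every NaN in df
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [5, np.nan, np.nan, 8, 10]})
iter_imputer = IterativeImputer(max_iter=10, random_state=0)
df[['A', 'B']] = iter_imputer.fit_transform(df[['A', 'B']])
df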

The program below demonstrates different techniques for handling missing data using the pandas library and scikit-learn's imputation methods.

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Sample data
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['cat', 'dog', 'cat', np.nan, 'dog']
}
df = pd.DataFrame(data)
num_cols = ['A', 'B']  # the scikit-learn imputers below are applied to numeric columns only

# Each technique writes to its own copy so that it starts from the raw data;
# chaining them on one frame would leave nothing for the later steps to impute.

# Removing missing data
df_dropna = df.dropna()

# Mean Imputation
df_mean = df.copy()
df_mean[num_cols] = SimpleImputer(strategy='mean').fit_transform(df[num_cols])

# KNN Imputation
df_knn = df.copy()
df_knn[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

# Iterative Imputation
df_iter = df.copy()
df_iter[num_cols] = IterativeImputer(max_iter=10, random_state=0).fit_transform(df[num_cols])

# Forward Fill
df_ffill = df.ffill()

# Create Missing Indicator: record which values were missing BEFORE imputing them
df_flag = df.copy()
df_flag['A_missing'] = df['A'].isnull().astype(int)
df_flag['A'] = SimpleImputer(strategy='mean').fit_transform(df[['A']]).ravel()
df_flag
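
The missing-indicator step can also be delegated to scikit-learn. The sketch below assumes `SimpleImputer`'s `add_indicator=True` option, which appends one 0/1 column for each feature that had missing values during `fit`. The `_missing` column names are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [5, np.nan, np.nan, 8, 10]})

# Impute with the mean and append indicator columns in one step
imputer = SimpleImputer(strategy='mean', add_indicator=True)
values = imputer.fit_transform(df)

# indicator_.features_ holds the positions of the columns that had missing values
flagged = df.columns[imputer.indicator_.features_]
columns = list(df.columns) + [f'{name}_missing' for name in flagged]

result = pd.DataFrame(values, columns=columns)
print(result)
```

Keeping the indicator lets a downstream model learn from the fact that a value was missing, which the imputed number alone no longer shows.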

