## Handling Missing Values in Large-Scale ML Pipelines:

**Task 1**: Impute with Mean or Median
- Step 1: Load a dataset with missing values (e.g., the Boston Housing dataset, with missing values introduced artificially since it has none by default).
- Step 2: Identify columns with missing values.
- Step 3: Impute missing values using the mean or median of the respective columns.

In [None]:
# write your code from here
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer

# Step 1: Load dataset with missing values
# Boston Housing dataset doesn't have missing values by default, 
# so let's artificially introduce some for demonstration.
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame

# Artificially introduce missing values randomly for demo
np.random.seed(42)
missing_mask = np.random.rand(*df.shape) < 0.1  # 10% missing values
df = df.mask(missing_mask)

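# Step 2: Identify columns with missing values
missing_cols = df.columns[df.isnull().any()]
print(f"Columns with missing values:\n{missing_cols.tolist()}")

# Mean and median are only meaningful for numeric data, so restrict imputation
# to numeric columns. Some columns (e.g., CHAS, RAD) may load as categorical
# from OpenML; those are better handled with 'most_frequent' (see Task 2).
numeric_missing_cols = df[missing_cols].select_dtypes(include="number").columns

# Step 3: Impute missing values using mean or median
# Create two imputers for demonstration
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')

# Impute using mean
df_mean_imputed = df.copy()
df_mean_imputed[numeric_missing_cols] = mean_imputer.fit_transform(df_mean_imputed[numeric_missing_cols])

# Impute using median
df_median_imputed = df.copy()
df_median_imputed[numeric_missing_cols] = median_imputer.fit_transform(df_median_imputed[numeric_missing_cols])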

print("\nSample data after mean imputation:")
print(df_mean_imputed.head())

print("\nSample data after median imputation:")
print(df_median_imputed.head())


**Task 2**: Impute with the Most Frequent Value
- Step 1: Use the Titanic dataset and identify columns with missing values.
- Step 2: Impute categorical columns using the most frequent value.

In [None]:
# write your code from here
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_openml

# Step 1: Load Titanic dataset and identify columns with missing values
titanic = fetch_openml(name="titanic", version=1, as_frame=True)
df = titanic.frame

print("Columns with missing values:")
print(df.columns[df.isnull().any()].tolist())

# Step 2: Impute categorical columns using the most frequent value
# Select categorical columns (dtype 'category' or 'object')
categorical_cols = df.select_dtypes(include=['category', 'object']).columns

# Columns among these that have missing values
cat_missing_cols = [col for col in categorical_cols if df[col].isnull().any()]

print("\nCategorical columns with missing values:")
print(cat_missing_cols)

# Initialize imputer with 'most_frequent' strategy
most_frequent_imputer = SimpleImputer(strategy='most_frequent')

# Impute missing values in categorical columns
df_imputed = df.copy()
df_imputed[cat_missing_cols] = most_frequent_imputer.fit_transform(df_imputed[cat_missing_cols])

print("\nSample data after imputation:")
print(df_imputed[cat_missing_cols].head())


**Task 3**: Advanced Imputation - k-Nearest Neighbors
- Step 1: Implement KNN imputation using the KNNImputer from sklearn.
- Step 2: Compare KNN imputation with simpler methods (mean or median) to see whether it fills missing values more accurately.
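
One way to approach these steps is sketched below. It is a minimal example under stated assumptions, not a reference solution: the bundled diabetes dataset stands in for real data, about 10% of entries are hidden at random so the true values are known, and `n_neighbors=5`, `weights="distance"`, and the RMSE comparison are illustrative choices. Whether KNN beats mean imputation depends on the data, so the sketch prints both errors rather than assuming a winner.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import KNNImputer, SimpleImputer

# Assumption: the bundled diabetes dataset (already scaled) stands in for real data.
X_true = load_diabetes().data

# Hide roughly 10% of the entries at random so the true values are known.
rng = np.random.default_rng(42)
mask = rng.random(X_true.shape) < 0.1
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Step 1: fill each gap from the 5 nearest rows (illustrative settings).
knn_imputer = KNNImputer(n_neighbors=5, weights="distance")
X_knn = knn_imputer.fit_transform(X_missing)

# Baseline for Step 2: fill each gap with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)


def rmse_on_hidden(X_filled):
    """Root-mean-square error over the entries that were hidden."""
    return float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))


rmse_knn = rmse_on_hidden(X_knn)
rmse_mean = rmse_on_hidden(X_mean)
print(f"RMSE on hidden entries, KNN:  {rmse_knn:.4f}")
print(f"RMSE on hidden entries, mean: {rmse_mean:.4f}")
```

KNN imputation is distance-based, so on unscaled data it is usually worth scaling the features first.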

In [None]:
# write your code from here

## Feature Scaling & Normalization Best Practices:

**Task 1**: Standardization
- Step 1: Standardize features using StandardScaler.
- Step 2: Observe how standardization affects the data distribution (mean, standard deviation, and shape of each feature).
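
The steps above could be sketched as follows. The bundled wine dataset is an assumption made so the example runs offline; any numeric feature table would do. The summary compares mean and standard deviation before and after scaling, and the skewness check illustrates that standardization shifts and rescales each feature without changing the shape of its distribution.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine dataset is used purely as example data.
X = load_wine(as_frame=True).data

# Step 1: standardize every feature to zero mean and unit variance.
scaler = StandardScaler()
X_std = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

# Step 2: compare summary statistics before and after scaling.
# StandardScaler uses the population standard deviation, hence ddof=0.
summary = pd.DataFrame({
    "mean_before": X.mean(),
    "std_before": X.std(ddof=0),
    "mean_after": X_std.mean(),
    "std_after": X_std.std(ddof=0),
    "skew_before": X.skew(),
    "skew_after": X_std.skew(),
})
print(summary.round(3))
```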

In [None]:
# write your code from here

**Task 2**: Min-Max Scaling
- Step 1: Scale features to lie between 0 and 1 using MinMaxScaler.
- Step 2: Compare with standardization.
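
A possible sketch of these two steps, again assuming the bundled wine dataset as example data. It scales the features with `MinMaxScaler` and `StandardScaler` side by side and prints the resulting ranges, which shows that min-max scaling bounds every feature to [0, 1] while standardized values are centered on zero and unbounded.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumption: the bundled wine dataset is used purely as example data.
X = load_wine(as_frame=True).data

# Step 1: scale every feature to the [0, 1] range.
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
X_minmax = pd.DataFrame(minmax_scaler.fit_transform(X), columns=X.columns)

# Step 2: standardize the same data for comparison.
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Per-feature value ranges under each method.
comparison = pd.DataFrame({
    "minmax_min": X_minmax.min(),
    "minmax_max": X_minmax.max(),
    "standard_min": X_std.min(),
    "standard_max": X_std.max(),
})
print(comparison.round(3))
```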

In [None]:
# write your code from here

**Task 3**: Robust Scaling
- Step 1: Scale features using RobustScaler, which is useful for data with outliers.
- Step 2: Compare the scaled output with that of the other scaling methods (StandardScaler and MinMaxScaler).
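
The comparison could look like the sketch below. The data is synthetic and hypothetical: 200 typical values around 50 plus 5 extreme values at 1000, chosen only to make the effect of outliers visible. `RobustScaler` centers on the median and scales by the interquartile range, so the outliers barely influence how the typical values are spread after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature: 200 typical values plus 5 extreme outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=50.0, scale=5.0, size=200)
outliers = np.full(5, 1000.0)
values = np.concatenate([inliers, outliers]).reshape(-1, 1)

# Step 1 and Step 2: apply RobustScaler and the two other scalers.
scalers = {
    "robust": RobustScaler(),      # median / IQR
    "standard": StandardScaler(),  # mean / standard deviation
    "minmax": MinMaxScaler(),      # min / max
}
scaled = {name: s.fit_transform(values) for name, s in scalers.items()}

# Width of the range occupied by the 200 typical points after scaling.
# A tiny width means the outliers squeezed the typical values together.
inlier_spread = {
    name: float(arr[:200].max() - arr[:200].min())
    for name, arr in scaled.items()
}
for name, spread in inlier_spread.items():
    print(f"{name:>8}: spread of typical values = {spread:.4f}")
```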

In [None]:
# write your code from here

## Feature Selection Techniques:
### Removing Highly Correlated Features:

**Task 1**: Correlation Matrix
- Step 1: Compute correlation matrix.
- Step 2: Remove one feature from each highly correlated pair (correlation > 0.9).
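
A common way to implement these steps is sketched below, assuming the bundled breast cancer dataset as example data. Two choices here are assumptions rather than part of the task: the correlation is taken in absolute value (so strong negative correlations also count), and from each correlated pair the later column is the one dropped.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled breast cancer dataset is used as example data.
X = load_breast_cancer(as_frame=True).data

# Step 1: absolute Pearson correlation between every pair of features.
corr = X.corr().abs()

# Keep only the strict upper triangle so each pair is inspected once
# and a feature is never compared with itself.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Step 2: drop a column if it correlates above the threshold with any
# column that comes before it.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_reduced = X.drop(columns=to_drop)

print(f"Dropped {len(to_drop)} of {X.shape[1]} features:")
print(to_drop)
```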

In [None]:
# write your code from here

### Using Mutual Information & Variance Thresholds:

**Task 2**: Mutual Information
- Step 1: Compute mutual information between features and target.
- Step 2: Retain features with high mutual information scores.
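
These steps might be sketched as follows for a classification target, assuming the bundled breast cancer dataset. The cutoff `k = 10` is a hypothetical value chosen for illustration; for a continuous target, `mutual_info_regression` would take the place of `mutual_info_classif`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Assumption: the bundled breast cancer dataset is used as example data.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Step 1: estimate mutual information between each feature and the target.
# The estimator is randomized, so fix the seed for repeatable scores.
mi = mutual_info_classif(X, y, random_state=0)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores.round(3))

# Step 2: retain the k highest-scoring features (k is a hypothetical choice).
k = 10
top_features = mi_scores.head(k).index.tolist()
X_selected = X[top_features]
print(f"\nRetained {X_selected.shape[1]} features: {top_features}")
```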

In [None]:
# write your code from here

**Task 3**: Variance Threshold
- Step 1: Implement VarianceThreshold to remove features with low variance.
- Step 2: Analyze the impact on the feature space (which features are removed and how many remain).
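
A minimal sketch of these steps, assuming the bundled breast cancer dataset. The cutoff of `0.01` is a hypothetical value applied to the raw, unscaled features; variance depends on each feature's units, so a sensible threshold has to be chosen per dataset.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold

# Assumption: the bundled breast cancer dataset is used as example data.
X = load_breast_cancer(as_frame=True).data

# Step 1: remove features whose variance does not exceed the threshold.
threshold = 0.01  # hypothetical cutoff on raw (unscaled) variances
selector = VarianceThreshold(threshold=threshold)
selector.fit(X)

support = selector.get_support()  # boolean mask of retained features
kept = X.columns[support]
removed = X.columns[~support]
X_reduced = X[kept]

# Step 2: inspect how the feature space changed.
print(f"Features before: {X.shape[1]}, after: {X_reduced.shape[1]}")
print("Removed features and their variances:")
print(pd.Series(selector.variances_, index=X.columns)[removed])
```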

In [None]:
# write your code from here