## Handling Missing Values in Large-scale ML Pipelines:

**Task 1**: Impute with Mean or Median
- Step 1: Load a dataset (e.g., the Boston Housing dataset) and, if it has no missing values, introduce some artificially.
- Step 2: Identify columns with missing values.
- Step 3: Impute missing values using the mean or median of the respective columns.

In [1]:
# write your code from here
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer

# Step 1: Load dataset with missing values
# Boston Housing dataset doesn't have missing values by default, 
# so let's artificially introduce some for demonstration.
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame

# Artificially introduce missing values randomly for demo
np.random.seed(42)
missing_mask = np.random.rand(*df.shape) < 0.1  # 10% missing values
df = df.mask(missing_mask)

# Step 2: Identify columns with missing values
missing_cols = df.columns[df.isnull().any()]
print(f"Columns with missing values:\n{missing_cols.tolist()}")

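# Step 3: Impute missing values using mean or median
# Note: CHAS is a binary indicator, so a mean-imputed value may not be a
# valid category; it is imputed numerically here only for demonstration.
# Create two imputers for demonstration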
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')

# Impute using mean
df_mean_imputed = df.copy()
df_mean_imputed[missing_cols] = mean_imputer.fit_transform(df_mean_imputed[missing_cols])

# Impute using median
df_median_imputed = df.copy()
df_median_imputed[missing_cols] = median_imputer.fit_transform(df_median_imputed[missing_cols])

print("\nSample data after mean imputation:")
print(df_mean_imputed.head())

print("\nSample data after median imputation:")
print(df_median_imputed.head())


Columns with missing values:
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

Sample data after mean imputation:
       CRIM         ZN      INDUS  CHAS       NOX     RM        AGE     DIS  \
0  0.006320  18.000000   2.310000   0.0  0.538000  6.575  67.925658  4.0900   
1  0.027310   0.000000   7.070000   0.0  0.469000  6.421  78.900000  4.9671   
2  0.027290  11.308114   7.070000   0.0  0.554603  7.185  61.100000  4.9671   
3  3.860641   0.000000   2.180000   0.0  0.458000  6.998  45.800000  6.0622   
4  3.860641   0.000000  11.211645   0.0  0.458000  7.147  54.200000  6.0622   

   RAD         TAX    PTRATIO       B      LSTAT  MEDV  
0  1.0  296.000000  18.427438  396.90   4.980000  24.0  
1  2.0  242.000000  17.800000  396.90   9.140000  21.6  
2  2.0  407.217105  17.800000  392.83   4.030000  34.7  
3  3.0  222.000000  18.700000  394.63   2.940000  33.4  
4  3.0  222.000000  18.700000  396.90  12.774039  36.2  

Sample data

**Task 2**: Impute with the Most Frequent Value
- Step 1: Use the Titanic dataset and identify columns with missing values.
- Step 2: Impute categorical columns using the most frequent value.
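
The two steps above can be sketched as follows. To stay self-contained, this sketch does not download the Titanic data; `titanic_toy` is a hypothetical stand-in with a few Titanic-style column names and made-up values, and the choice to treat every non-numeric column as categorical is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with Titanic-style columns (values are illustrative only).
titanic_toy = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S", "Q", "S", np.nan],
    "Sex": ["male", "female", "male", np.nan, "male", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0],
})

# Step 1: identify columns with missing values.
missing_counts = titanic_toy.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print("Columns with missing values:", cols_with_missing)

# Step 2: impute only the categorical (non-numeric) columns with their mode.
cat_cols = titanic_toy.select_dtypes(exclude="number").columns.tolist()
mode_imputer = SimpleImputer(strategy="most_frequent")
titanic_imputed = titanic_toy.copy()
titanic_imputed[cat_cols] = mode_imputer.fit_transform(titanic_toy[cat_cols])

print(titanic_imputed)
```

The numeric `Age` column is left untouched here, so it still contains missing values; it would normally be handled with a numeric strategy such as the median.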

**Task 3**: Advanced Imputation - k-Nearest Neighbors
- Step 1: Implement KNN imputation using the KNNImputer from sklearn.
- Step 2: Compare KNN imputation with simpler methods (e.g., mean imputation), for example by hiding known values and measuring how closely each method reconstructs them.
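
One way to approach this task is sketched below. It assumes synthetic data with strongly correlated columns (a setting that favors KNN), hides about 10% of the entries, and compares reconstruction error against mean imputation. The data, the helper `rmse_on_hidden`, and `n_neighbors=5` are all illustrative choices; on real data the gap can be smaller or even reversed.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic data: three columns driven by one shared signal, so that
# similar rows (neighbors) carry information about a missing entry.
n_samples = 300
base = rng.normal(size=n_samples)
X_true = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=n_samples),
    -base + rng.normal(scale=0.1, size=n_samples),
])

# Hide roughly 10% of the entries, keeping at least one observed value per row.
mask = rng.random(X_true.shape) < 0.1
mask[mask.all(axis=1), 0] = False
X_missing = np.where(mask, np.nan, X_true)

# Step 1: KNN imputation, with mean imputation as the simple baseline.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)


def rmse_on_hidden(X_imputed):
    """Root-mean-square error measured only on the entries that were hidden."""
    return float(np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)))


# Step 2: compare reconstruction error of the two methods.
rmse_mean = rmse_on_hidden(X_mean)
rmse_knn = rmse_on_hidden(X_knn)
print(f"RMSE (mean imputation): {rmse_mean:.3f}")
print(f"RMSE (KNN imputation):  {rmse_knn:.3f}")
```

Because `KNNImputer` relies on distances between rows, features on very different scales are usually scaled before imputation.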

In [2]:
# write your code from here

## Feature Scaling & Normalization Best Practices:

**Task 1**: Standardization
- Step 1: Standardize features using StandardScaler.
- Step 2: Observe how standardization changes each feature's mean and standard deviation (the shape of the distribution is unchanged).
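
A minimal sketch of these steps, assuming two synthetic features on very different scales (the data is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features: one roughly normal, one skewed and much larger in scale.
X = np.column_stack([
    rng.normal(loc=50.0, scale=10.0, size=500),
    rng.exponential(scale=1000.0, size=500),
])

# Step 1: standardize each column to zero mean and unit variance.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Step 2: compare summary statistics before and after.
print("mean before:", X.mean(axis=0))
print("std before: ", X.std(axis=0))
print("mean after: ", X_std.mean(axis=0))
print("std after:  ", X_std.std(axis=0))
```

Standardization is a shift and rescale of each column, so the skewed second feature remains skewed afterwards.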

In [3]:
# write your code from here

**Task 2**: Min-Max Scaling
- Step 1: Scale features to lie between 0 and 1 using MinMaxScaler.
- Step 2: Compare with standardization.
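
The following sketch illustrates both steps on synthetic data (an assumption made so the example runs without downloading anything). It scales the same array with `MinMaxScaler` and `StandardScaler` and prints the resulting ranges side by side.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on different scales (illustrative data only).
X = np.column_stack([
    rng.normal(loc=50.0, scale=10.0, size=500),
    rng.exponential(scale=1000.0, size=500),
])

# Step 1: rescale each column to the [0, 1] interval.
X_minmax = MinMaxScaler().fit_transform(X)

# Step 2: standardize the same data for comparison.
X_std = StandardScaler().fit_transform(X)

print("min-max range: ", X_minmax.min(axis=0), "to", X_minmax.max(axis=0))
print("standard range:", X_std.min(axis=0), "to", X_std.max(axis=0))
```

Min-max scaling fixes the output range but not the spread within it, while standardization fixes the mean and variance but leaves the range unbounded.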

In [4]:
# write your code from here

**Task 3**: Robust Scaling
- Step 1: Scale features using RobustScaler, which is useful for data with outliers.
- Step 2: Compare the scaled output with that of StandardScaler and MinMaxScaler.
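
As a rough sketch of this comparison, the example below injects a few extreme outliers into a synthetic feature (an illustrative assumption) and measures how much of the scaled range remains for the non-outlier values under each scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)

# One synthetic feature with five injected outliers (illustrative data only).
x = rng.normal(loc=10.0, scale=2.0, size=200)
x[:5] = 1000.0
X = x.reshape(-1, 1)

# Step 1: RobustScaler centers on the median and scales by the interquartile range.
scalers = {
    "robust": RobustScaler(),
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
}
scaled = {name: s.fit_transform(X).ravel() for name, s in scalers.items()}

# Step 2: measure the spread of the inliers (everything after the first 5 rows).
inlier_spread = {
    name: float(values[5:].max() - values[5:].min())
    for name, values in scaled.items()
}
for name, spread in inlier_spread.items():
    print(f"{name:>8}: inlier spread = {spread:.4f}")
```

With the mean, standard deviation, minimum, and maximum all pulled by the outliers, the other two scalers compress the inliers into a narrow band, whereas the median and interquartile range are barely affected.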

In [5]:
# write your code from here

## Feature Selection Techniques:
### Removing Highly Correlated Features:

**Task 1**: Correlation Matrix
- Step 1: Compute correlation matrix.
- Step 2: For each pair of features with absolute correlation above 0.9, drop one of the two.
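
A minimal sketch of this task on synthetic data follows. Two choices here are assumptions about the intended rule: correlation is taken in absolute value, and for each highly correlated pair the later column is the one dropped. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic features: "a_copy" is a noisy near-duplicate of "a" (illustrative data).
a = rng.normal(size=n_samples)
features = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=n_samples),
    "b": rng.normal(size=n_samples),
    "c": rng.normal(size=n_samples),
})

# Step 1: absolute Pearson correlation matrix.
corr = features.corr().abs()

# Keep only the strict upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Step 2: drop any column that is highly correlated with an earlier column.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", reduced.columns.tolist())
```

Masking the lower triangle matters: without it, both members of a correlated pair would be flagged and removed.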

In [6]:
# write your code from here

### Using Mutual Information & Variance Thresholds:

**Task 2**: Mutual Information
- Step 1: Compute mutual information between features and target.
- Step 2: Retain features with high mutual information scores (e.g., the top k features, or those above a chosen threshold).
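
The sketch below shows one possible reading of these steps for a regression target. The synthetic data, the column names, and the choice to keep the top `k = 2` features are all assumptions for illustration; for a classification target, `mutual_info_classif` would be the analogous function.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_samples = 1000

# Synthetic features: two drive the target (one non-linearly), one is pure noise.
X = pd.DataFrame({
    "informative_linear": rng.normal(size=n_samples),
    "informative_nonlinear": rng.uniform(-3.0, 3.0, size=n_samples),
    "noise": rng.normal(size=n_samples),
})
y = (
    2.0 * X["informative_linear"]
    + 3.0 * np.sin(X["informative_nonlinear"])
    + rng.normal(scale=0.1, size=n_samples)
)

# Step 1: estimate mutual information between each feature and the target.
mi_scores = pd.Series(
    mutual_info_regression(X, y, random_state=0),
    index=X.columns,
).sort_values(ascending=False)
print(mi_scores)

# Step 2: retain the k highest-scoring features (k is an illustrative choice).
k = 2
selected = mi_scores.index[:k].tolist()
X_selected = X[selected]
print("Selected features:", selected)
```

Unlike a correlation filter, mutual information can pick up the sine-shaped dependence, which has little linear correlation with the target over this range.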

In [7]:
# write your code from here

**Task 3**: Variance Threshold
- Step 1: Implement VarianceThreshold to remove features with low variance.
- Step 2: Analyze the impact on the feature space (e.g., compare the number of features before and after filtering).
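
These steps might look like the sketch below. The synthetic columns and the threshold of 0.05 are illustrative assumptions, not recommended values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic features with very different variances (illustrative data only).
X = pd.DataFrame({
    "constant": np.ones(n_samples),
    # 198 zeros and 2 ones: variance is about 0.0099.
    "almost_constant": np.r_[np.zeros(n_samples - 2), np.ones(2)],
    "varied": rng.normal(size=n_samples),
})

# Step 1: remove features whose variance does not exceed the threshold.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

# Step 2: inspect what was kept and how the feature space shrank.
kept = X.columns[selector.get_support()].tolist()
print("Variances:", dict(zip(X.columns, selector.variances_.round(4))))
print("Kept features:", kept)
print("Shape before:", X.shape, "after:", X_reduced.shape)
```

Variance depends on a feature's units, so this filter is typically applied before standardization; after standardization every non-constant feature has variance 1.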

In [8]:
# write your code from here