# **Implementing a Simple MLP for Disease Prediction Using NHANES Dataset**

## **Introduction**

In this assignment, you will preprocess the National Health and Nutrition Examination Survey (NHANES) dataset and implement a simple Multi-Layer Perceptron (MLP) from scratch using NumPy. The goal is to predict the occurrence of either Cancer or Liver disease based on various health and demographic features.

## **Dataset Overview**

The NHANES dataset provides a comprehensive set of health and nutritional data collected through interviews and medical examinations. It includes information on demographics, dietary habits, medical history, and laboratory results.

### **Data Components:**

- **Demographics:** Information about age, gender, ethnicity, etc.
- **Diet:** Dietary intake and related questions.
- **Examination:** Medical and physiological measurements.
- **Labs:** Laboratory test results.

You can refer to the following links for detailed explanations of each component:

- [Demographics](https://wwwn.cdc.gov/Nchs/Nhanes/Search/DataPage.aspx?Component=demographic&CycleBeginYear=2013)
- [Diet](https://wwwn.cdc.gov/Nchs/Nhanes/Search/DataPage.aspx?Component=diet&CycleBeginYear=2013)
- [Examination](https://wwwn.cdc.gov/Nchs/Nhanes/Search/DataPage.aspx?Component=examination&CycleBeginYear=2013)
- [Labs](https://wwwn.cdc.gov/Nchs/Nhanes/Search/DataPage.aspx?Component=labs&CycleBeginYear=2013)

## **Tasks Overview**

1. **Data Preprocessing**
2. **Feature Engineering**
3. **Implementing MLP from Scratch**
4. **Training and Evaluating the Model**

---

## **1. Data Preprocessing**

Data preprocessing is a critical step in any machine learning project. It involves cleaning and transforming raw data to make it suitable for analysis and modeling.

### **1.1. Import Necessary Libraries**


In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from scipy import stats
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.under_sampling import RandomUnderSampler

### **1.2. Load the Dataset**

You will be provided with four CSV files corresponding to different components of the NHANES dataset. Your first task is to load these datasets.

In [None]:
# ##TODO: Load the datasets into pandas DataFrames
df1 = ##TODO
df2 = ##TODO
df3 = ##TODO
df4 = ##TODO

**Explanation:**

- `df1` to `df4` correspond to Demographics, Diet, Examination, and Labs datasets respectively.
- Use `pd.read_csv()` to load each CSV file.

### **1.3. Merge the Datasets**

After loading the datasets, you need to merge them into a single DataFrame for easier analysis.

In [None]:
# ##TODO: Merge the DataFrames on a common column, typically 'SEQN'
merged_df = ##TODO

**Explanation:**

- The `SEQN` column is a unique identifier that can be used to merge the datasets.
- Ensure that the merge is performed correctly to include all relevant data.
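
As a rough illustration of merging on `SEQN`, the sketch below joins four toy DataFrames with `functools.reduce`. The column values are made up, and the outer join is an assumption: the assignment does not say which join type to use, so choose the one that matches how you want to treat participants missing from a component.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the four NHANES components (values are made up).
demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 51, 27]})
diet = pd.DataFrame({"SEQN": [1, 2], "DR1TKCAL": [2100.0, 1850.0]})
exam = pd.DataFrame({"SEQN": [1, 3], "BMXBMI": [24.5, 31.2]})
labs = pd.DataFrame({"SEQN": [2, 3], "LBXTC": [180.0, 210.0]})

# An outer join keeps every participant and fills absent measurements with NaN.
frames = [demo, diet, exam, labs]
toy_merged = reduce(lambda left, right: left.merge(right, on="SEQN", how="outer"), frames)

# SEQN should still identify each row uniquely after the merge.
print(toy_merged["SEQN"].is_unique, toy_merged.shape)
```

Checking that `SEQN` is still unique after merging is a cheap way to catch accidental row duplication.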

### **1.4. Exploratory Data Analysis (EDA)**

Perform EDA to understand the structure and statistical properties of the dataset.

In [None]:
# Display the shape of the dataset
merged_df.##TODO

# Display basic statistics
merged_df.##TODO

# Display the first few rows
merged_df.##TODO

**Explanation:**

- Use `.shape` to get the dimensions.
- Use `.describe()` for statistical summary.
- Use `.head()` to view the first few rows.

### **1.5. Handle Columns with Fixed Values**

Remove columns that have only one unique value as they do not contribute to the predictive power.

In [None]:
# ##TODO: Identify and drop columns with only one unique value
dropcols = ##TODO
merged_df.drop(dropcols, axis=1, inplace=True)

**Explanation:**

- Columns with a single unique value are not useful for modeling.
- Identify such columns using `.nunique()` and drop them.
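
The idea can be tried on a toy DataFrame first. The column names below are hypothetical, and treating an all-missing column as "constant" is an assumption of this sketch rather than a requirement of the assignment.

```python
import pandas as pd

toy = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "CYCLE": ["2013-2014"] * 4,                # same value in every row
    "AGE": [34, 51, 27, 63],
    "ALL_MISSING": [None, None, None, None],   # no observed value at all
})

# nunique() ignores NaN by default, so an all-missing column has 0 unique values.
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```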

### **1.6. Handle Missing Values and Inconsistencies**

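Replace the special codes made up entirely of `7`s or `9`s (for example `7`, `77`, `9`, `99`, which NHANES typically uses for "Refused" and "Don't know") with `NaN`, then handle missing values appropriately.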


In [None]:
# Define a function to replace specific patterns with NaN
def replace_with_nan(element):
    try:
        if re.match(pattern, str(int(element))):
            return np.nan
        return element
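    except (ValueError, TypeError, OverflowError):
        # Missing or non-numeric entries cannot be cast to int; leave them unchanged
        return element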

# Replace '7's with NaN
pattern = r'^7+$'
merged_df = merged_df.##TODO

# Replace '9's with NaN
pattern = r'^9+$'
merged_df = merged_df.##TODO

# Drop columns with more than 30% missing values
merged_df.##TODO

**Explanation:**

- Use regex patterns to identify and replace values.
- Drop columns that have a high percentage of missing values to maintain data quality.
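
A minimal sketch of both steps on toy data follows. The helper `code_to_nan` is a hypothetical variant of `replace_with_nan` that takes the pattern as an argument, and the column-wise `Series.map` call is only one of several ways to apply it.

```python
import re

import numpy as np
import pandas as pd

def code_to_nan(value, pattern):
    """Return NaN when the integer form of `value` matches `pattern`."""
    try:
        return np.nan if re.match(pattern, str(int(value))) else value
    except (ValueError, TypeError, OverflowError):
        return value  # NaN, None, and non-numeric strings are left untouched

toy = pd.DataFrame({
    "A": [1.0, 7.0, 2.0, 3.0, 1.0],
    "B": [9.0, 9.0, 777.0, 3.0, np.nan],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
})

for pattern in (r"^7+$", r"^9+$"):
    # Apply the helper to every cell, one column at a time.
    toy = toy.apply(lambda col: col.map(lambda v: code_to_nan(v, pattern)))

# Share of missing values per column; keep columns at or below 30%.
missing_share = toy.isna().mean()
toy = toy.loc[:, missing_share <= 0.30]
print(missing_share.to_dict(), list(toy.columns))
```

Note that the pattern cannot tell a special code from a genuine measurement: a real count of `7` or `9` is replaced as well, which is a limitation of this approach.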

### **1.7. Segregate Numerical and Categorical Variables**

Differentiate between numerical and categorical variables for appropriate preprocessing.

In [None]:
# Identify numerical variables
numerical = ##TODO

print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :', numerical)
merged_df[numerical].head()

# Identify categorical variables
categorical = ##TODO
merged_df[categorical].head()

**Explanation:**

- Numerical variables are typically of type `float64` or `int64`.
- Categorical variables might be of type `object` or `int64` with limited unique values.
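
One possible heuristic, sketched on toy data, treats a column as categorical when it is non-numeric or has only a few distinct values. The cut-off `MAX_CATEGORIES` and the column names are hypothetical; the right threshold for the real data is a judgment call.

```python
import pandas as pd

toy = pd.DataFrame({
    "GENDER": [1, 2, 2, 1, 2, 1],                   # coded category stored as int
    "BMI": [22.1, 30.4, 27.8, 24.0, 35.2, 19.9],    # continuous measurement
    "LANG": ["en", "es", "en", "en", "es", "en"],   # text category
})

MAX_CATEGORIES = 3  # hypothetical cut-off; tune it for the real data

toy_categorical = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col]) or toy[col].nunique() <= MAX_CATEGORIES
]
toy_numerical = [col for col in toy.columns if col not in toy_categorical]
print(toy_categorical, toy_numerical)
```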

### **1.8. Handle Outliers in Numerical Variables**

Detect and visualize outliers in numerical features.

In [None]:
# Check for missing values in numerical variables
merged_df[numerical].isnull().sum()

# View summary statistics
print(round(merged_df[numerical].describe(), 2))

# Function to find columns with outliers
def find_outlier_cols(df):
    outlier_cols = []
    for col in df.columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Flag the column if at least one value lies outside the IQR fences
        if ((df[col] < lower_bound) | (df[col] > upper_bound)).any():
            outlier_cols.append(col)
    return outlier_cols

outlier_cols = find_outlier_cols(merged_df[numerical])
print(outlier_cols)
len(outlier_cols), len(numerical)

**Visualization:**

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(2, 2, 1)
fig = merged_df.boxplot(column='DMDBORN4')
fig.set_title('')
fig.set_ylabel('DMDBORN4')

plt.subplot(2, 2, 2)
fig = merged_df.boxplot(column='DMDCITZN')
fig.set_title('')
fig.set_ylabel('DMDCITZN')

plt.subplot(2, 2, 3)
fig = merged_df.boxplot(column='FIALANG')
fig.set_title('')
fig.set_ylabel('FIALANG')

plt.subplot(2, 2, 4)
fig = merged_df.boxplot(column='FIAPROXY')
fig.set_title('')
fig.set_ylabel('FIAPROXY')
plt.show()

plt.figure(figsize=(15,10))

plt.subplot(2, 2, 1)
fig = merged_df.DMDBORN4.hist(bins=10)
fig.set_xlabel('DMDBORN4')
fig.set_ylabel('HIST')

plt.subplot(2, 2, 2)
fig = merged_df.DMDCITZN.hist(bins=10)
fig.set_xlabel('DMDCITZN')
fig.set_ylabel('HIST')

plt.subplot(2, 2, 3)
fig = merged_df.FIALANG.hist(bins=10)
fig.set_xlabel('FIALANG')
fig.set_ylabel('HIST')

plt.subplot(2, 2, 4)
fig = merged_df.FIAPROXY.hist(bins=10)
fig.set_xlabel('FIAPROXY')
fig.set_ylabel('HIST')
plt.show()

**Explanation:**

- Boxplots and histograms help visualize the distribution and detect outliers.
- Outliers can be handled by removing or transforming the data if necessary.
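
As one example of "transforming" rather than removing, the sketch below clips values to the same 1.5 × IQR fences used by `find_outlier_cols`. Clipping is only one option and is an assumption here; the assignment does not prescribe a specific treatment.

```python
import pandas as pd

values = pd.Series([4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 12.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of values outside the fences.
is_outlier = (values < lower) | (values > upper)

# Clipping pulls extreme values back to the nearest fence and keeps every row.
clipped = values.clip(lower=lower, upper=upper)
print(int(is_outlier.sum()), clipped.max())
```

Clipping preserves the sample size, which matters when the positive class is already rare.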

### **1.9. Prepare Target Variables**

Load and preprocess the target variables for Cancer and Liver disease.

In [None]:
# Load the Questionnaire file
target = pd.read_csv('/content/questionnaire.csv')

# Replace '7's and '9's with NaN
pattern = r'^9+$'
target = target.applymap(replace_with_nan)
pattern = r'^7+$'
target = target.applymap(replace_with_nan)

# Merge targets with the main dataset
liver_values = target.merge(merged_df, on='SEQN', how='right')[['SEQN', 'MCQ160L']]
cancer_values = target.merge(merged_df, on='SEQN', how='right')[['SEQN', 'MCQ220']]

# Drop NaN values from target DataFrames
cancer_values.dropna(subset=['MCQ220'], inplace=True)
liver_values.dropna(subset=['MCQ160L'], inplace=True)

# Merge back to get final datasets
final_liver_data = liver_values.merge(merged_df, on='SEQN', how='left').drop(['SEQN', 'MCQ160L'], axis=1)
final_cancer_data = cancer_values.merge(merged_df, on='SEQN', how='left').drop(['SEQN', 'MCQ220'], axis=1)

# Drop 'SEQN' from target DataFrames
cancer_values = cancer_values.drop(['SEQN'], axis=1)
liver_values = liver_values.drop(['SEQN'], axis=1)

final_cancer_data.head()
final_liver_data.head()

**Explanation:**

- The target variables `MCQ220` (Cancer) and `MCQ160L` (Liver) indicate the presence (`1`) or absence (`2`) of the respective diseases.
- Handle missing values by dropping rows with `NaN` in target variables.

---

## **2. Feature Engineering**

Transform raw data into meaningful features to enhance model performance.

### **2.1. Segregate Categorical and Numerical Features**


In [None]:
# Identify categorical and numerical features for Cancer data
cancer_categorical = ##TODO
print(final_cancer_data[cancer_categorical].head())

cancer_numerical = ##TODO
print(final_cancer_data[cancer_numerical].head())

# Identify categorical and numerical features for Liver data
liver_categorical = ##TODO
print(final_liver_data[liver_categorical].head())

liver_numerical = ##TODO
print(final_liver_data[liver_numerical].head())

**Explanation:**

- Categorical features are typically of type `object` or have a limited number of unique integer values.
- Numerical features are of type `float64` or have many unique integer values.

### **2.2. Impute Missing Values**

#### **2.2.1. Numerical Features**

Use median imputation for numerical features to handle missing values.


In [None]:
# Impute missing values in Cancer numerical features
for col in cancer_numerical:
    col_median = ##TODO
    final_cancer_data[col].##TODO

# Impute missing values in Liver numerical features
for col in liver_numerical:
    col_median = ##TODO
    final_liver_data[col].##TODO

print(final_cancer_data[cancer_numerical].isnull().sum())
print(final_liver_data[liver_numerical].isnull().sum())

**Explanation:**

- Median imputation is robust to outliers and suitable for numerical data.
- Replace `NaN` values with the median of each column.
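
A small worked example shows why the median is preferred over the mean when a column contains an extreme value. The column name is hypothetical. The sketch assigns the result of `fillna` back to the column; recent pandas versions warn when `inplace=True` is used on a column selection, so this form is the more portable one.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"LAB": [1.0, 2.0, np.nan, 3.0, 100.0]})

col_median = toy["LAB"].median()  # NaN is skipped: median of [1, 2, 3, 100]
col_mean = toy["LAB"].mean()      # pulled upward by the single extreme value

# Assign the filled column back instead of modifying a selection in place.
toy["LAB"] = toy["LAB"].fillna(col_median)
print(col_median, col_mean)
```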

#### **2.2.2. Categorical Features**

Use mode imputation for categorical features to handle missing values.


In [None]:
# Impute missing values in Cancer categorical features
for col in cancer_categorical:
    col_mode = ##TODO
    final_cancer_data[col].##TODO

# Impute missing values in Liver categorical features
for col in liver_categorical:
    col_mode = ##TODO
    final_liver_data[col].##TODO

print(final_cancer_data[cancer_categorical].isnull().sum())
print(final_liver_data[liver_categorical].isnull().sum())

**Explanation:**

- Mode imputation fills missing categorical values with the most frequent category.
- Ensure no missing values remain after imputation.
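
Mode imputation can be sketched the same way on a toy column (the name `EDU` is hypothetical). `Series.mode()` returns a Series because several values can tie for most frequent, so one of them has to be picked explicitly.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"EDU": [3.0, 3.0, np.nan, 1.0, 4.0, np.nan]})

# mode() returns its results sorted, so [0] is the smallest value among any ties.
col_mode = toy["EDU"].mode()[0]
toy["EDU"] = toy["EDU"].fillna(col_mode)
print(col_mode, toy["EDU"].tolist())
```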

### **2.3. Encode Categorical Variables**

Convert categorical variables into numerical format using one-hot encoding.


In [None]:
# One-hot encode categorical variables for Cancer data
final_cancer_data = pd.##TODO

# One-hot encode categorical variables for Liver data
final_liver_data = pd.##TODO

**Explanation:**

- Use `pd.get_dummies()` to convert categorical variables into binary indicators.
- This step is essential for feeding categorical data into the MLP.
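
The sketch below shows `pd.get_dummies()` on a toy frame with hypothetical columns. Many NHANES categories are stored as integer codes, and `get_dummies` only encodes such columns when they are listed in `columns`, so passing the list of categorical column names is the assumption made here.

```python
import pandas as pd

toy = pd.DataFrame({"RACE": [1, 2, 3, 1], "BMI": [22.1, 30.4, 27.8, 24.0]})

# Integer-coded categories are left as-is unless they are named in `columns`.
encoded = pd.get_dummies(toy, columns=["RACE"], dtype=float)
print(encoded.columns.tolist())
```

Requesting a float `dtype` keeps the indicators numeric, which is convenient for the later NumPy matrix operations.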

### **2.4. Feature Selection**

Select the most relevant features based on correlation to reduce dimensionality and improve model performance.


In [None]:
# Compute correlation matrices
cancer_corr_matrix = ##TODO
liver_corr_matrix = ##TODO

# Plot heatmaps for Cancer data
abs_corr_matrix = np.abs(cancer_corr_matrix)
cancer_corr_pairs = abs_corr_matrix.unstack().sort_values(ascending=False)
top_pairs = cancer_corr_pairs[:1000]
unique_pairs = top_pairs.drop_duplicates()
top_columns = np.unique([col for col, _ in unique_pairs.keys()])
top_corr_matrix = final_cancer_data[top_columns].corr()

plt.figure(figsize=(50, 50))
sns.heatmap(top_corr_matrix, annot=False, cmap='coolwarm')
plt.show()

# Plot heatmaps for Liver data
abs_corr_matrix = np.abs(liver_corr_matrix)
liver_corr_pairs = abs_corr_matrix.unstack().sort_values(ascending=False)
top_pairs = liver_corr_pairs[:1000]
unique_pairs = top_pairs.drop_duplicates()
top_columns = np.unique([col for col, _ in unique_pairs.keys()])
top_corr_matrix = final_liver_data[top_columns].corr()

plt.figure(figsize=(50, 50))
sns.heatmap(top_corr_matrix, annot=False, cmap='coolwarm')
plt.show()

**Explanation:**

- High correlation between features can lead to multicollinearity.
- Selecting a subset of features with high correlation to the target reduces complexity.

### **2.5. Select Top Features**

Choose up to 150 features using a correlation threshold.

#### **2.5.1. Cancer Data**


In [None]:
import random

# Set correlation threshold
threshold = 0.8

# Get pairs with correlation higher than threshold
high_corr_pairs = ##TODO

# Initialize set for selected features
selected_features = set()

# Select features based on high correlation
for pair, corr in high_corr_pairs:
    selected_feature = ##TODO
    selected_features.add(selected_feature)
    if len(selected_features) == 150:
        break

# Warning if less than 150 features are selected
if len(selected_features) < 150:
    print(f"Warning: Only {len(selected_features)} features were selected. Consider lowering the threshold.")

# Create DataFrame with selected features
selected_cancer_df = final_cancer_data[list(selected_features)]

#### **2.5.2. Liver Data**

In [None]:
# Set correlation threshold
threshold = 0.8

# Get pairs with correlation higher than threshold
high_corr_pairs = ##TODO

# Initialize set for selected features
selected_features = set()

# Select features based on high correlation
for pair, corr in high_corr_pairs:
    selected_feature = ##TODO
    selected_features.add(selected_feature)
    if len(selected_features) == 150:
        break

# Warning if less than 150 features are selected
if len(selected_features) < 150:
    print(f"Warning: Only {len(selected_features)} features were selected. Consider lowering the threshold.")

# Create DataFrame with selected features
selected_liver_df = final_liver_data[list(selected_features)]

**Explanation:**

- Features with a correlation higher than the threshold are considered.
- Randomly select one feature from highly correlated pairs to avoid redundancy.
- Ensure that 150 features are selected; adjust the threshold if necessary.
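
The sketch below illustrates one way to read the steps above: list each feature pair once, keep the pairs above the threshold, and pick one member of each pair at random. The toy data, the use of the upper triangle, and the NumPy random generator are all assumptions; the notebook's `##TODO` cells may be intended differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.05, size=200),  # near-duplicate of f1
    "f3": rng.normal(size=200),                     # unrelated feature
})

threshold = 0.8
abs_corr = toy.corr().abs()

# Keep the strict upper triangle: each pair appears once, self-correlations are excluded.
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_corr_pairs = pairs[pairs > threshold].sort_values(ascending=False)

selected = set()
for (col_a, col_b), corr in high_corr_pairs.items():
    selected.add(str(rng.choice([col_a, col_b])))  # one representative per pair
print(high_corr_pairs, selected)
```

Excluding the diagonal matters: every feature correlates perfectly with itself, so without this step the self-pairs would dominate the top of the sorted list.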

### **2.6. Scale the Features and Apply PCA**

Normalize the data and reduce dimensionality using Principal Component Analysis (PCA).


In [None]:
# Scale and apply PCA for Liver data
cols = selected_liver_df.columns
scaler = StandardScaler()
selected_liver_df = scaler.fit_transform(selected_liver_df)
selected_liver_df = pd.DataFrame(selected_liver_df, columns=cols)  # pass the Index itself; [cols] would build a MultiIndex

pca = PCA(0.90)
liverprincipalComponents = pca.fit_transform(selected_liver_df)
liverprincipalDf = pd.DataFrame(data=liverprincipalComponents,
                           columns=['Principal Component ' + str(i) for i in range(1, liverprincipalComponents.shape[1] + 1)])
liverprincipalDf.head()

# Scale and apply PCA for Cancer data
cols = selected_cancer_df.columns
scaler = StandardScaler()
selected_cancer_df = scaler.fit_transform(selected_cancer_df)
selected_cancer_df = pd.DataFrame(selected_cancer_df, columns=cols)  # pass the Index itself; [cols] would build a MultiIndex

pca = PCA(0.90)
cancerprincipalComponents = pca.fit_transform(selected_cancer_df)
cancerprincipalDf = pd.DataFrame(data=cancerprincipalComponents,
                           columns=['Principal Component ' + str(i) for i in range(1, cancerprincipalComponents.shape[1] + 1)])
cancerprincipalDf.head()

**Explanation:**

- `StandardScaler` standardizes features by removing the mean and scaling to unit variance.
- PCA reduces the number of features while retaining 90% of the variance, simplifying the model.
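
To make the two steps concrete, here is a NumPy-only sketch of standardization followed by PCA via SVD on synthetic data. It illustrates the idea and is not scikit-learn's implementation; the data, the variable names, and the rule "smallest number of components whose cumulative ratio reaches 90%" are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
# Six observed features driven by two latent factors plus a little noise.
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(300, 6))

# Standardize each column to zero mean and unit (population) variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of `components` are the principal directions.
_, singular_values, components = np.linalg.svd(X_std, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# Smallest number of components whose cumulative explained variance reaches 90%.
n_components = int(np.searchsorted(np.cumsum(explained_ratio), 0.90) + 1)
X_reduced = X_std @ components[:n_components].T
print(n_components, X_reduced.shape)
```

Because the six features come from only two underlying factors, a small number of components is enough to retain most of the variance.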

### **2.7. Encode Target Labels**

Recode the target labels so that `1` (disease present) stays `1` and `2` (disease absent) becomes `0`.


In [None]:
# Convert Cancer labels: 1 -> 1 (present), 2 -> 0 (absent)
cancer_values[##TODO] = 0
cancer_values.head()

# Convert Liver labels: 1 -> 1 (present), 2 -> 0 (absent)
liver_values[##TODO] = 0
liver_values.head()

**Explanation:**

- Binary classification requires labels to be `0` and `1`.
- Replace `2` with `0` to indicate absence of disease.

### **2.8. Split the Data into Training and Testing Sets**

Divide the dataset into training and testing subsets.


In [None]:
# Split Cancer data
cancer_x_train, cancer_x_test, cancer_y_train, cancer_y_test = train_test_split(
    cancerprincipalDf, cancer_values, test_size=0.2, random_state=0)

# Split Liver data
liver_x_train, liver_x_test, liver_y_train, liver_y_test = train_test_split(
    liverprincipalDf, liver_values, test_size=0.2, random_state=0)

print(cancer_x_train.shape, cancer_x_test.shape, liver_x_train.shape, liver_x_test.shape)

**Explanation:**

- Use an 80/20 split for training and testing.
- Ensures that the model is trained on a majority of the data and tested on unseen data.

### **2.9. Handle Class Imbalance**

Use under-sampling to balance the classes in the training data.


In [None]:
# Under-sampling for Liver dataset
rus = RandomUnderSampler(random_state=0)
liver_x_resampled, liver_y_resampled = rus.fit_resample(liver_x_train, liver_y_train)

# Under-sampling for Cancer dataset
rus = RandomUnderSampler(random_state=0)
cancer_x_resampled, cancer_y_resampled = rus.fit_resample(cancer_x_train, cancer_y_train)

**Explanation:**

- Imbalanced datasets can bias the model towards the majority class.
- Under-sampling reduces the size of the majority class to balance the dataset.
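
Conceptually, random under-sampling keeps a random subset of each class equal in size to the rarest class. The NumPy sketch below illustrates that idea only; `random_under_sample` is a hypothetical helper and not the `imblearn` implementation.

```python
import numpy as np

def random_under_sample(X, y, seed=0):
    """Down-sample every class to the size of the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid returning the classes in contiguous blocks
    return X[keep], y[keep]

X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)  # 16 negatives, 4 positives
X_bal, y_bal = random_under_sample(X, y)
print(X_bal.shape, np.bincount(y_bal))
```

Only the training split is resampled in this assignment; the test split keeps its original class proportions.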

---

## **3. Implementing MLP from Scratch**

In this section, you will implement a simple MLP with one hidden layer using NumPy. The network will consist of an input layer, a hidden layer with 10 neurons, and an output layer. The sigmoid activation function will be used, and Binary Cross-Entropy will be the loss function.

### **3.1. Define Data Preprocessor Class**

This class handles splitting the resampled data into training and validation sets.

In [None]:
class DataPreprocessor:
    def __init__(self):
        '''
        Initialize with resampled training data and test data
        '''
        self.X = ##TODO
        self.y = ##TODO
        self.X_test = ##TODO
        self.y_test = ##TODO

    def preprocess(self):
        self.X_train, self.X_val, self.y_train, self.y_val = train_test_split(
            self.X, self.y, test_size=0.2, random_state=42)

**Explanation:**

- Initializes with resampled training data (`X` and `y`) and test data.
- Splits the training data further into training and validation sets for monitoring overfitting.

### **3.2. Define MLP Class**

This class defines the structure and operations of the MLP.


In [None]:
class MLP:
    def __init__(self, input_size, hidden_size, output_size):
        '''
        Initialize weights and biases
        '''
        self.weights1 = ##TODO
        self.bias1 = ##TODO
        self.weights2 = ##TODO
        self.bias2 = ##TODO

    def sigmoid(self, x):
        '''
        Implement the sigmoid activation function
        '''
        ##TODO

    def sigmoid_derivative(self, x):
        '''
        Implement the derivative of the sigmoid function
        '''
        ##TODO

    def forward(self, X):
        '''
        Perform forward propagation
        '''
        self.z1 = ##TODO
        self.a1 = ##TODO
        self.z2 = ##TODO
        self.a2 = ##TODO
        return self.a2

    def backward(self, X, y, output, learning_rate):
        '''
        Perform backward propagation and update weights and biases
        '''
        m = X.shape[0]

        dz2 = ##TODO
        dw2 = ##TODO
        db2 = ##TODO

        dz1 = ##TODO
        dw1 = ##TODO
        db1 = ##TODO

        self.weights2 -= ##TODO
        self.bias2 -= ##TODO
        self.weights1 -= ##TODO
        self.bias1 -= ##TODO

**Explanation:**

- **Initialization (`__init__`):**
  - Initialize weights with small random values.
  - Initialize biases with zeros.
  
- **Activation Functions:**
  - `sigmoid`: Activation function for neurons.
  - `sigmoid_derivative`: Derivative used during backpropagation.
  
- **Forward Propagation (`forward`):**
  - Compute activations for hidden and output layers.
  
- **Backward Propagation (`backward`):**
  - Compute gradients and update weights and biases by gradient descent. Because the `Trainer` calls `backward` once per batch, this amounts to mini-batch gradient descent.
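
When implementing backpropagation by hand, a finite-difference gradient check is a useful sanity test. The sketch below is a standalone functional version of a one-hidden-layer sigmoid network, not the `MLP` class itself. The function names, weight scale, and layer sizes are assumptions, and the sigmoid derivative is written in terms of the activation `a1`, which may differ from how you choose to define `sigmoid_derivative`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def forward(params, X):
    W1, b1, W2, b2 = params
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return a1, a2

def gradients(params, X, y):
    W1, b1, W2, b2 = params
    m = X.shape[0]
    a1, a2 = forward(params, X)
    dz2 = (a2 - y) / m                   # BCE with a sigmoid output simplifies to (a2 - y)
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)   # sigmoid'(z1) expressed through a1
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
params = [rng.normal(scale=0.1, size=(3, 4)), np.zeros((1, 4)),
          rng.normal(scale=0.1, size=(4, 1)), np.zeros((1, 1))]

# Central finite difference on one entry of W1, compared with the analytic gradient.
h = 1e-6
analytic = gradients(params, X, y)[0][0, 0]
params[0][0, 0] += h
loss_plus = bce(y, forward(params, X)[1])
params[0][0, 0] -= 2 * h
loss_minus = bce(y, forward(params, X)[1])
params[0][0, 0] += h  # restore the original weight
numeric = (loss_plus - loss_minus) / (2 * h)
print(analytic, numeric)
```

If the two numbers disagree by more than a tiny tolerance, the usual suspects are a missing division by the batch size or a derivative applied to the wrong quantity.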

### **3.3. Define Trainer Class**

This class handles the training process, including batching, loss calculation, and accuracy monitoring.

In [None]:
class Trainer:
    def __init__(self, model, X_train, y_train, X_val, y_val, epochs=100, learning_rate=0.1, batch_size=32):
        self.model = model
        self.X_train = X_train
        self.y_train = y_train
        self.X_val = X_val
        self.y_val = y_val
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.train_losses = []
        self.train_accuracies = []
        self.val_losses = []
        self.val_accuracies = []

    def train(self):
        m = self.X_train.shape[0]

        for epoch in range(self.epochs):
            # Shuffle the training data
            indices = ##TODO
            X_shuffled = self.X_train[indices]
            y_shuffled = self.y_train[indices]

            for i in range(0, m, self.batch_size):
                # Get the current batch
                X_batch = ##TODO
                y_batch = ##TODO

                # Forward pass
                output = ##TODO

                # Backward pass
                self.model.backward(##TODO)

            # Calculate training loss and accuracy
            train_output = ##TODO
            train_loss = ##TODO
            train_accuracy = ##TODO

            # Calculate validation loss and accuracy
            val_output = ##TODO
            val_loss = ##TODO
            val_accuracy = ##TODO

            # Store the metrics
            self.train_losses.append(train_loss)
            self.train_accuracies.append(train_accuracy)
            self.val_losses.append(val_loss)
            self.val_accuracies.append(val_accuracy)

            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}/{self.epochs}, Train Loss: {train_loss:.4f}, Train Acc: {train_accuracy:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_accuracy:.4f}")

    def plot_results(self):
        plt.figure(figsize=(12, 5))
        plt.subplot(1, 2, 1)
        plt.plot(self.train_losses, label='Train')
        plt.plot(self.val_losses, label='Validation')
        plt.title('Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.legend()

        plt.subplot(1, 2, 2)
        plt.plot(self.train_accuracies, label='Train')
        plt.plot(self.val_accuracies, label='Validation')
        plt.title('Accuracy')
        plt.xlabel('Epoch')
        plt.ylabel('Accuracy')
        plt.legend()

        plt.tight_layout()
        plt.show()

**Explanation:**

- **Training Process (`train`):**
  - Shuffles the data at each epoch to ensure randomness.
  - Processes data in batches for efficient computation.
  - Performs forward and backward passes to update the model.
  - Calculates and stores loss and accuracy for both training and validation sets.
  
- **Plotting Results (`plot_results`):**
  - Visualizes the loss and accuracy over epochs to monitor training progress and detect overfitting.
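
The metric and batching steps can be prototyped independently of the `Trainer` class. The helpers below are hypothetical names, and the `0.5` decision threshold and the clipping constant are assumptions. The batching helper indexes rows with an integer array, which works for NumPy arrays; pandas DataFrames would first need `.to_numpy()`.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean BCE; clipping keeps log() finite when a prediction is exactly 0 or 1."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    return float(np.mean((y_prob >= threshold).astype(int) == y_true))

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs; the last batch may be smaller."""
    indices = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

y_true = np.array([[1], [0], [1], [0]])
y_prob = np.array([[0.9], [0.2], [0.4], [0.0]])
loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)

X = np.arange(20).reshape(10, 2)
y = np.arange(10).reshape(10, 1)
batch_sizes = [xb.shape[0] for xb, _ in iterate_minibatches(X, y, 4, np.random.default_rng(0))]
print(loss, acc, batch_sizes)
```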

### **3.4. Define Tester Class**

This class evaluates the trained model on the test dataset.


In [None]:
class Tester:
    def __init__(self, model, X_test, y_test):
        self.model = model
        self.X_test = X_test
        self.y_test = y_test

    def test(self):
        '''
        Perform testing by forwarding test data through the model and computing loss and accuracy
        '''
        test_output = ##TODO
        test_loss = ##TODO
        test_accuracy = ##TODO

        print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")

        return test_output

**Explanation:**

- **Testing Process (`test`):**
  - Performs a forward pass with the test data.
  - Computes the Binary Cross-Entropy loss and accuracy.
  - Prints and returns the test results.

---

## **4. Training and Evaluating the Model**

Finally, you will tie everything together to train and evaluate your MLP model.

### **4.1. Main Execution**

In [None]:
# Main execution
if __name__ == "__main__":
    # Preprocess data
    preprocessor = DataPreprocessor()
    preprocessor.preprocess()

    # Initialize and train model
    input_size = preprocessor.X_train.shape[1]
    hidden_size = 10
    output_size = 1

    model = MLP(input_size, hidden_size, output_size)
    trainer = Trainer(model, preprocessor.X_train, preprocessor.y_train,
                      preprocessor.X_val, preprocessor.y_val,
                      epochs=100, learning_rate=0.1, batch_size=32)
    trainer.train()

    # Plot training results
    trainer.plot_results()

    # Test the model
    tester = Tester(model, preprocessor.X_test, preprocessor.y_test)
    test_output = tester.test()

    # Print some example predictions
    print("\nExample predictions:")
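    y_true = np.ravel(preprocessor.y_test)
    y_prob = np.ravel(test_output)
    for i in range(5):
        # One sigmoid output per sample: threshold at 0.5 (argmax over a single value is always 0)
        predicted_label = int(y_prob[i] >= 0.5)
        print(f"True: {int(y_true[i])}, Predicted: {predicted_label}")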

**Explanation:**

- **Data Preprocessing:**
  - Initializes the `DataPreprocessor` and splits the data.
  
- **Model Initialization and Training:**
  - Initializes the `MLP` with appropriate sizes.
  - Creates a `Trainer` instance and trains the model.
  
- **Plotting Results:**
  - Visualizes the training and validation loss and accuracy.
  
- **Testing:**
  - Evaluates the model on the test dataset.
  - Prints example predictions to compare true and predicted labels.

---

## **5. Instructions for Completing the Assignment**

Throughout the notebook, you will encounter `##TODO` sections. Your tasks are to:

1. **Load and Merge Datasets:**
   - Use `pd.read_csv()` to load each CSV file into a DataFrame.
   - Merge the DataFrames on the `SEQN` column using `pd.merge()`.

2. **Identify and Drop Columns:**
   - Use `.nunique()` to find columns with a single unique value.
   - Drop these columns using `.drop()`.

3. **Handle Missing Values:**
   - Use the provided `replace_with_nan` function to replace specific patterns with `NaN`.
   - Use `.applymap()` to apply this function across the DataFrame (in pandas 2.1 and later the same operation is available as `DataFrame.map()`).
   - Drop columns with more than 30% missing values.

4. **Segregate Numerical and Categorical Variables:**
   - Identify numerical columns (typically `float64` or `int64` with many unique values).
   - Identify categorical columns (typically `object` or `int64` with few unique values).

5. **Impute Missing Values:**
   - For numerical columns, compute the median and replace `NaN` values.
   - For categorical columns, compute the mode and replace `NaN` values.

6. **Encode Categorical Variables:**
   - Use `pd.get_dummies()` to perform one-hot encoding on categorical columns.

7. **Feature Selection:**
   - Compute the correlation matrix using `.corr()`.
   - Select top features based on correlation threshold.
   - Ensure that 150 features are selected, adjusting the threshold if necessary.

8. **Scale and Apply PCA:**
   - Use `StandardScaler` to normalize the data.
   - Apply PCA to reduce dimensionality while retaining 90% variance.

9. **Encode Target Labels:**
   - Recode disease labels so that `1` (present) stays `1` and `2` (absent) becomes `0`.

10. **Handle Class Imbalance:**
    - Use `RandomUnderSampler` to balance the classes in the training data.

11. **Implement MLP:**
    - Initialize weights and biases with appropriate dimensions and values.
    - Implement the sigmoid activation function and its derivative.
    - Perform forward and backward propagation.
    - Update weights and biases using gradients.

12. **Train and Evaluate the Model:**
    - Train the MLP using the `Trainer` class.
    - Plot the loss and accuracy curves.
    - Test the model and evaluate its performance.

---

## **Submission Guidelines**

- **Notebook Completion:** Ensure all `##TODO` sections are properly implemented.
- **Code Quality:** Write clean, readable, and well-documented code.
- **Visualizations:** Include relevant plots for EDA and training progress.
- **Report:** Summarize your findings, challenges faced, and insights gained from the model's performance.

---

Good luck with your assignment! If you have any questions or need further clarifications, feel free to reach out.