<a href="https://colab.research.google.com/github/micah-shull/pipelines/blob/main/pipelines_04_pytorch_sklearn_pipeline_wrapper_2_adding_complexity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Adding complexity to a model gradually, rather than starting with a large architecture, is a well-established approach in machine learning and deep learning. Working iteratively is beneficial for several reasons:

## Reasons for Incrementally Adding Complexity

1. **Understanding the Model**:
   - **Starting Simple**: Beginning with a simple model allows you to establish a baseline performance and understand the basic behavior of the model.
   - **Diagnosing Issues**: Simple models are easier to debug. If there are any issues, they are more straightforward to identify and fix.

2. **Preventing Overfitting**:
   - **Controlled Complexity**: Gradually increasing complexity helps in controlling overfitting. You can monitor how the model's performance on the training and validation data changes as you add more layers, neurons, or other complexities.
   - **Regularization**: Adding complexity slowly allows you to implement and fine-tune regularization techniques such as dropout, L2 regularization, or early stopping.

3. **Efficient Resource Use**:
   - **Resource Management**: Simple models require less computational power and memory, making them faster to train and easier to iterate upon. This is especially important in environments with limited resources.
   - **Scalability**: You can incrementally scale up your model as needed, optimizing resource usage and training time.

4. **Improved Model Performance**:
   - **Layer-Wise Optimization**: Adding layers or neurons one step at a time lets you attribute any change in validation performance to that specific addition, which helps in identifying the most effective architecture.
   - **Hyperparameter Tuning**: Incremental complexity allows for systematic hyperparameter tuning, ensuring each added complexity contributes positively to the model's performance.

5. **Building Intuition**:
   - **Learning Process**: This approach helps build intuition about how different architectural changes impact model performance. It’s a valuable learning process for understanding deep learning principles.
   - **Domain Knowledge**: Incorporating domain knowledge gradually into the model architecture can lead to better and more interpretable models.

### Practical Steps for Incremental Complexity

1. **Start with a Simple Model**:
   - Begin with a straightforward model, such as a single-layer neural network.
   - Establish a baseline performance metric.

2. **Monitor Performance Metrics**:
   - Evaluate the model using relevant performance metrics (accuracy, F1-score, etc.).
   - Use cross-validation to ensure the model generalizes well.

3. **Gradually Increase Complexity**:
   - Add more layers or neurons.
   - Introduce dropout layers for regularization.
   - Experiment with different activation functions.

4. **Tune Hyperparameters**:
   - Adjust learning rates, batch sizes, and the number of epochs.
   - Use techniques like grid search or random search for systematic hyperparameter tuning.

5. **Evaluate and Iterate**:
   - After each change, re-evaluate the model’s performance.
   - Compare against the baseline and previous iterations to ensure improvements.
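
The loop described in these steps can be sketched with scikit-learn alone. This is a minimal illustration on synthetic data, not the PyTorch setup used later in this notebook: the `make_classification` data, the candidate list, and the `min_gain` threshold are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption: not the credit-default dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidates ordered from simplest to most complex.
candidates = [
    ("baseline_logreg", LogisticRegression(max_iter=1000)),
    ("mlp_1_hidden", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
    ("mlp_2_hidden", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)),
]

min_gain = 0.005  # hypothetical: smallest F1 gain that justifies extra complexity
results = {}
best_name, best_score = None, float("-inf")

for name, model in candidates:
    # 5-fold cross-validated F1 for the positive class
    score = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    results[name] = score
    # Keep the more complex model only if it clearly beats the current best
    if best_name is None or score > best_score + min_gain:
        best_name, best_score = name, score
    print(f"{name}: mean F1 = {score:.3f}")

print("selected:", best_name)
```

Requiring a minimum gain before accepting a larger model is one way to keep the comparison against the baseline honest; the right threshold depends on how noisy the validation scores are.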





### Step 1: Load and Preprocess the Dataset



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
df = pd.read_excel(url, header=1)

# Rename columns to lower case and replace spaces with underscores
df.columns = [col.lower().replace(' ', '_') for col in df.columns]

# Select features and target
target = 'default_payment_next_month'
X = df.drop(columns=[target, 'id'])
y = df[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

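# Identify column types by dtype (integer-coded categoricals such as sex or
# education are treated as numeric here, so categorical_features may be empty)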
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the preprocessing pipeline
preprocessing_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit and transform the data
X_train_processed = preprocessing_pipeline.fit_transform(X_train)
X_test_processed = preprocessing_pipeline.transform(X_test)

# Convert to PyTorch tensors
import torch
X_train_tensor = torch.tensor(X_train_processed, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
X_test_tensor = torch.tensor(X_test_processed, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)
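
One caveat about the dtype-based column detection above: in this dataset, columns such as `sex`, `education`, and `marriage` are stored as integer codes, so `select_dtypes(include=['object', 'category'])` will not pick them up, and they are scaled rather than one-hot encoded (checking `X_train.dtypes` confirms this). The sketch below uses a made-up toy frame to show the effect and one possible alternative, listing the categorical columns explicitly. The toy values and the `categorical_cols` list are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame mimicking integer-coded categorical columns (values are made up).
toy = pd.DataFrame({
    "limit_bal": [20000, 120000, 90000, 50000],
    "sex": [1, 2, 2, 1],
    "education": [2, 2, 1, 3],
})

# dtype-based detection finds nothing because the codes are integers.
auto_categorical = toy.select_dtypes(include=["object", "category"]).columns.tolist()
print(auto_categorical)  # []

# Listing the columns explicitly routes them to the one-hot encoder instead.
categorical_cols = ["sex", "education"]
numeric_cols = [c for c in toy.columns if c not in categorical_cols]

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)  # (4, 6): 1 scaled column + 2 sex levels + 3 education levels
```

Whether one-hot encoding these columns helps the model is an empirical question; the reported results in this notebook were produced with the dtype-based detection.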

### Step 2: Define a Simple PyTorch Neural Network Model

Start with the simplest possible network: a single linear layer followed by a sigmoid, which is equivalent to logistic regression.

In [3]:
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 1)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        return x
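
As a side note on the sigmoid output: the model above applies `torch.sigmoid` in `forward` and is later trained with `nn.BCELoss`. A common alternative is to return raw logits and use `nn.BCEWithLogitsLoss`, which applies the sigmoid inside the loss in a numerically more stable way. The sketch below, on random made-up tensors, only checks that the two formulations give the same loss value; it is not a change to the notebook's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single linear layer that outputs raw logits (no sigmoid applied yet).
linear = nn.Linear(5, 1)
x = torch.randn(8, 5)                       # made-up inputs
y = torch.randint(0, 2, (8, 1)).float()     # made-up binary targets

logits = linear(x)

# Formulation used in this notebook: sigmoid in the model, then BCELoss.
loss_sigmoid = nn.BCELoss()(torch.sigmoid(logits), y)

# Alternative: keep logits and let the loss apply the sigmoid internally.
loss_logits = nn.BCEWithLogitsLoss()(logits, y)

print(torch.allclose(loss_sigmoid, loss_logits, atol=1e-5))
```

If the model returns logits, predictions would need `torch.sigmoid(logits) > 0.5` (or equivalently `logits > 0`) at inference time.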

### Step 3: Define the sklearn Wrapper

Create the sklearn wrapper for the simple PyTorch model.

In [4]:
from sklearn.base import BaseEstimator, ClassifierMixin
import torch.optim as optim

class SklearnNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, learning_rate=0.001, epochs=50, batch_size=64):
        # Store hyperparameters only: sklearn's clone() and set_params() expect
        # __init__ to do nothing else, so the network is built in fit().
        self.input_dim = input_dim
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X, y):
        # Build a fresh model on every fit so repeated calls do not resume training
        self.model_ = SimpleNN(self.input_dim)
        criterion = nn.BCELoss()
        optimizer = optim.Adam(self.model_.parameters(), lr=self.learning_rate)
        # reshape(-1, 1) accepts targets of shape (n,) or (n, 1)
        X_tensor = torch.tensor(X, dtype=torch.float32)
        y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
        train_dataset = torch.utils.data.TensorDataset(X_tensor, y_tensor)
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)

        self.model_.train()
        for epoch in range(self.epochs):
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = self.model_(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()
        return self

    def predict(self, X):
        self.model_.eval()
        with torch.no_grad():
            outputs = self.model_(torch.tensor(X, dtype=torch.float32))
            # Threshold the sigmoid outputs at 0.5 to get class labels
            predictions = (outputs > 0.5).float()
        # Return a 1-D array of 0.0/1.0 labels
        return predictions.numpy().reshape(-1)

### Step 4: Train and Evaluate the Simple Model

Train and evaluate the simple PyTorch neural network model.


In [5]:
# Create an instance of SklearnNN
input_dim = X_train_tensor.shape[1]
nn_estimator = SklearnNN(input_dim=input_dim)

# Fit the model
nn_estimator.fit(X_train_tensor.numpy(), y_train_tensor.numpy())

# Predict on the test set
test_predictions = nn_estimator.predict(X_test_tensor.numpy())

# Evaluate the model
from sklearn.metrics import classification_report
print(classification_report(y_test_tensor.numpy(), test_predictions))

              precision    recall  f1-score   support

         0.0       0.82      0.97      0.89      4687
         1.0       0.69      0.24      0.35      1313

    accuracy                           0.81      6000
   macro avg       0.76      0.60      0.62      6000
weighted avg       0.79      0.81      0.77      6000



## Adding Layers of Complexity

Let's gradually add more layers and complexity to the model and evaluate the performance.

#### Enhanced PyTorch Neural Network Model

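Add two hidden layers (64 and 32 units by default) with ReLU activations, and apply dropout after each hidden layer for regularization.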


In [6]:
class EnhancedNN(nn.Module):
    def __init__(self, input_dim, hidden1_dim=64, hidden2_dim=32, dropout_rate=0.5):
        super(EnhancedNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden1_dim)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(hidden1_dim, hidden2_dim)
        self.dropout2 = nn.Dropout(dropout_rate)
        self.fc3 = nn.Linear(hidden2_dim, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        x = torch.sigmoid(self.fc3(x))
        return x
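
Because the enhanced model contains dropout layers, it behaves differently in training and evaluation mode, which is why the wrappers in this notebook call `train()` before fitting and `eval()` before predicting. The following small sketch, using a standalone `nn.Dropout` on a made-up tensor of ones, illustrates the difference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

dropout.train()   # training mode: units are zeroed at random
train_out = dropout(x)
# Surviving units are scaled by 1 / (1 - p) = 2.0 so the expected activation is unchanged.
print(train_out.unique())

dropout.eval()    # evaluation mode: dropout acts as the identity
eval_out = dropout(x)
print(torch.equal(eval_out, x))  # True
```

Forgetting to switch to evaluation mode would make predictions noisy, since random units would still be dropped at inference time.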

#### Enhanced sklearn Wrapper

Update the wrapper to use the enhanced model.

In [7]:
class SklearnEnhancedNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, hidden1_dim=64, hidden2_dim=32, dropout_rate=0.5, learning_rate=0.001, epochs=50, batch_size=64):
        # Store hyperparameters only: sklearn's clone() and set_params() expect
        # __init__ to do nothing else, so the network is built in fit().
        self.input_dim = input_dim
        self.hidden1_dim = hidden1_dim
        self.hidden2_dim = hidden2_dim
        self.dropout_rate = dropout_rate
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X, y):
        # Build a fresh model on every fit so hyperparameter changes take effect
        self.model_ = EnhancedNN(self.input_dim, self.hidden1_dim, self.hidden2_dim, self.dropout_rate)
        criterion = nn.BCELoss()
        optimizer = optim.Adam(self.model_.parameters(), lr=self.learning_rate)
        # reshape(-1, 1) accepts targets of shape (n,) or (n, 1)
        X_tensor = torch.tensor(X, dtype=torch.float32)
        y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
        train_dataset = torch.utils.data.TensorDataset(X_tensor, y_tensor)
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)

        self.model_.train()  # enable dropout during training
        for epoch in range(self.epochs):
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = self.model_(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()
        return self

    def predict(self, X):
        self.model_.eval()  # disable dropout for inference
        with torch.no_grad():
            outputs = self.model_(torch.tensor(X, dtype=torch.float32))
            # Threshold the sigmoid outputs at 0.5 to get class labels
            predictions = (outputs > 0.5).float()
        # Return a 1-D array of 0.0/1.0 labels
        return predictions.numpy().reshape(-1)


### Train and Evaluate the Enhanced Model

Train the enhanced model with the same training settings as the simple model and compare its classification report with the baseline. In the reports shown in this notebook, recall for class 1 rises from 0.24 to 0.32 and its F1-score from 0.35 to 0.43, while overall accuracy barely changes (0.81 to 0.82). Comparing each new architecture against the previous one in this way shows whether the added complexity is actually paying off.

In [8]:

# Create an instance of SklearnEnhancedNN
nn_estimator = SklearnEnhancedNN(input_dim=input_dim)

# Fit the model
nn_estimator.fit(X_train_tensor.numpy(), y_train_tensor.numpy())

# Predict on the test set
test_predictions = nn_estimator.predict(X_test_tensor.numpy())

# Evaluate the model
print(classification_report(y_test_tensor.numpy(), test_predictions))

              precision    recall  f1-score   support

         0.0       0.83      0.96      0.89      4687
         1.0       0.69      0.32      0.43      1313

    accuracy                           0.82      6000
   macro avg       0.76      0.64      0.66      6000
weighted avg       0.80      0.82      0.79      6000



## Adding More Complexity

Let's add more complexity to the neural network model by adding a third hidden layer and widening the earlier layers. Activation functions, dropout rates, and other hyperparameters are further things to experiment with, but they are left unchanged here so that the effect of the extra layer can be seen in isolation.

### Step 1: Define a More Complex PyTorch Neural Network Model

We'll use three hidden layers (128, 64, and 32 units by default), each followed by ReLU and dropout.








In [None]:
class MoreComplexNN(nn.Module):
    def __init__(self, input_dim, hidden1_dim=128, hidden2_dim=64, hidden3_dim=32, dropout_rate=0.5):
        super(MoreComplexNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden1_dim)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(hidden1_dim, hidden2_dim)
        self.dropout2 = nn.Dropout(dropout_rate)
        self.fc3 = nn.Linear(hidden2_dim, hidden3_dim)
        self.dropout3 = nn.Dropout(dropout_rate)
        self.fc4 = nn.Linear(hidden3_dim, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        x = torch.relu(self.fc3(x))
        x = self.dropout3(x)
        x = torch.sigmoid(self.fc4(x))
        return x
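
To get a feel for how much capacity each step adds, it can help to count trainable parameters. The sketch below rebuilds the three architectures as `nn.Sequential` stacks with the same layer sizes as `SimpleNN`, `EnhancedNN`, and `MoreComplexNN`. The helper `count_parameters` and the value `input_dim = 23` are assumptions for illustration (23 is the feature count expected after dropping `id` and the target; the actual value comes from `X_train_tensor.shape[1]`).

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters (weights and biases)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

input_dim = 23  # assumption: number of columns after preprocessing

simple = nn.Sequential(nn.Linear(input_dim, 1))

enhanced = nn.Sequential(
    nn.Linear(input_dim, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(32, 1),
)

more_complex = nn.Sequential(
    nn.Linear(input_dim, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(32, 1),
)

counts = {
    "simple": count_parameters(simple),
    "enhanced": count_parameters(enhanced),
    "more_complex": count_parameters(more_complex),
}
for name, n in counts.items():
    print(f"{name}: {n} parameters")
# simple: 24, enhanced: 3649, more_complex: 13441
```

Dropout and ReLU add no parameters, so the growth comes entirely from the linear layers.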

### Step 2: Update the sklearn Wrapper

Update the wrapper to use the more complex model.

In [None]:
class SklearnMoreComplexNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, hidden1_dim=128, hidden2_dim=64, hidden3_dim=32, dropout_rate=0.5, learning_rate=0.001, epochs=50, batch_size=64):
        # Store hyperparameters only: sklearn's clone() and set_params() expect
        # __init__ to do nothing else, so the network is built in fit().
        self.input_dim = input_dim
        self.hidden1_dim = hidden1_dim
        self.hidden2_dim = hidden2_dim
        self.hidden3_dim = hidden3_dim
        self.dropout_rate = dropout_rate
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X, y):
        # Build a fresh model on every fit so hyperparameter changes take effect
        self.model_ = MoreComplexNN(self.input_dim, self.hidden1_dim, self.hidden2_dim, self.hidden3_dim, self.dropout_rate)
        criterion = nn.BCELoss()
        optimizer = optim.Adam(self.model_.parameters(), lr=self.learning_rate)
        # reshape(-1, 1) accepts targets of shape (n,) or (n, 1)
        X_tensor = torch.tensor(X, dtype=torch.float32)
        y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
        train_dataset = torch.utils.data.TensorDataset(X_tensor, y_tensor)
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)

        self.model_.train()  # enable dropout during training
        for epoch in range(self.epochs):
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = self.model_(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()
        return self

    def predict(self, X):
        self.model_.eval()  # disable dropout for inference
        with torch.no_grad():
            outputs = self.model_(torch.tensor(X, dtype=torch.float32))
            # Threshold the sigmoid outputs at 0.5 to get class labels
            predictions = (outputs > 0.5).float()
        # Return a 1-D array of 0.0/1.0 labels
        return predictions.numpy().reshape(-1)

### Step 3: Train and Evaluate the More Complex Model

In [None]:
# Create an instance of SklearnMoreComplexNN
nn_estimator = SklearnMoreComplexNN(input_dim=input_dim)

# Fit the model
nn_estimator.fit(X_train_tensor.numpy(), y_train_tensor.numpy())

# Predict on the test set
test_predictions = nn_estimator.predict(X_test_tensor.numpy())

# Evaluate the model
from sklearn.metrics import classification_report
print(classification_report(y_test_tensor.numpy(), test_predictions))

              precision    recall  f1-score   support

         0.0       0.83      0.96      0.89      4687
         1.0       0.69      0.32      0.43      1313

    accuracy                           0.82      6000
   macro avg       0.76      0.64      0.66      6000
weighted avg       0.80      0.82      0.79      6000



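### Classification Report Analysis

The report shown for the more complex model matches the enhanced model's report to two decimal places, so the third hidden layer brought no measurable gain on this test split. The metrics below therefore point to a problem that extra capacity alone does not solve.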

### Key Metrics to Focus On

1. **Precision**:
   - **Class 0 (No Default)**: 0.83
   - **Class 1 (Default)**: 0.69
   - Precision is relatively high for both classes, but there is a noticeable drop for class 1.

2. **Recall**:
   - **Class 0 (No Default)**: 0.96
   - **Class 1 (Default)**: 0.32
   - Recall for class 0 is very high, but recall for class 1 is very low: the model misses roughly two-thirds of the actual defaults.

3. **F1-score**:
   - **Class 0 (No Default)**: 0.89
   - **Class 1 (Default)**: 0.43
   - The F1-score for class 1 is much lower than for class 0, reflecting poor performance in correctly predicting defaults.

4. **Support**:
   - **Class 0 (No Default)**: 4687 instances
   - **Class 1 (Default)**: 1313 instances
   - There is a significant imbalance between the two classes: class 0 has more than three times as many instances as class 1.

### Observations

1. **Imbalance in Class Distribution**:
   - The support values indicate a clear imbalance in the dataset, with many more instances of class 0 (no default) compared to class 1 (default).

2. **High Precision, Low Recall for Class 1**:
   - The precision for class 1 is reasonable (0.69), but the recall is very low (0.32). This means that while the model is usually right when it does predict a default, it misses a large number of actual defaults.
   - This low recall for class 1 suggests that the model is biased towards predicting class 0, which is a common issue when dealing with imbalanced datasets.

3. **Overall Performance**:
   - The overall accuracy is relatively high at 0.82, but this is mainly driven by the high number of correct predictions for class 0.
   - The macro-average recall (0.64) sits well below the weighted-average recall (0.82). The macro average weights both classes equally, so this gap reflects how much worse the model performs on class 1 than on class 0.
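
To put the 0.82 accuracy in context, it helps to compare it with the accuracy of a trivial classifier that always predicts the majority class. The arithmetic below uses only the support values from the report above.

```python
# Test-set support taken from the classification report
support = {0: 4687, 1: 1313}
total = sum(support.values())

# Accuracy of a model that always predicts the majority class (no default)
majority_baseline = max(support.values()) / total
# How many class-0 instances there are per class-1 instance
imbalance_ratio = support[0] / support[1]

print(f"majority-class accuracy: {majority_baseline:.3f}")  # 0.781
print(f"imbalance ratio: {imbalance_ratio:.2f}")            # 3.57
```

An accuracy of 0.82 is therefore only about four percentage points above what always predicting "no default" would achieve, which is why accuracy alone is a weak indicator here.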

### Why Addressing Class Imbalance is Important

- **Improving Recall for Class 1**:
  - Addressing class imbalance can help improve the recall for class 1, ensuring that more instances of actual defaults are correctly identified by the model.

- **Balanced Model Performance**:
  - Balancing the classes can help the model to learn to differentiate better between the two classes, leading to more balanced precision, recall, and F1-scores across both classes.

- **Preventing Model Bias**:
  - An imbalanced dataset can lead to a model that is biased towards the majority class, as the model might learn to predict the majority class more often simply because it is overrepresented in the training data.

### Recommended Approach

- **Using SMOTE (Synthetic Minority Over-sampling Technique)**:
  - SMOTE is a popular technique for addressing class imbalance. It works by generating synthetic examples for the minority class by interpolating between existing examples.
  - This helps in creating a more balanced dataset, which can lead to better model performance, especially in terms of recall for the minority class.
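
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the core step only, not the `imbalanced-learn` implementation: the toy points, the helper name `synthesize_one`, and the choice `k=2` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples in 2-D (made-up values, no duplicates).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize_one(samples, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    base = samples[rng.integers(len(samples))]
    distances = np.linalg.norm(samples - base, axis=1)
    # argsort puts the base point itself first (distance 0), so skip it.
    neighbour_indices = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbour_indices)]
    lam = rng.random()  # interpolation factor in [0, 1)
    return base + lam * (neighbour - base), base, neighbour

new_point, base, neighbour = synthesize_one(minority, k=2, rng=rng)
print("base:", base, "neighbour:", neighbour, "synthetic:", new_point)
```

Each synthetic point lies on the line segment between an existing minority sample and one of its neighbours, so new samples stay within the region already occupied by the minority class.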


## Imbalanced Data Handling

Let's implement SMOTE (Synthetic Minority Over-sampling Technique) to address the class imbalance and then retrain the model on the balanced dataset. Here's the step-by-step implementation:

### Step 1: Install imbalanced-learn Library

First, ensure you have the `imbalanced-learn` library installed. You can install it using pip if you haven't already:

```bash
pip install imbalanced-learn
```









### Step 2: Apply SMOTE to Balance the Dataset

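We'll use SMOTE to oversample the minority class in the training data only. The test set is left untouched so that evaluation still reflects the original class distribution.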

In [None]:
from imblearn.over_sampling import SMOTE
import torch

# Apply SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train_resampled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_resampled.values, dtype=torch.float32).unsqueeze(1)
X_test_tensor = torch.tensor(X_test_processed, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)

### Train and Evaluate the Model

Now, let's retrain `SklearnMoreComplexNN` on the balanced training set and evaluate it on the original test set. Balancing the training data should improve recall for the minority class (defaults), usually at some cost in precision and overall accuracy. Let's see how the performance metrics change after addressing the class imbalance.

In [None]:
# Create an instance of SklearnMoreComplexNN
input_dim = X_train_tensor.shape[1]
nn_estimator = SklearnMoreComplexNN(input_dim=input_dim)

# Fit the model
nn_estimator.fit(X_train_tensor.numpy(), y_train_tensor.numpy())

# Predict on the test set
test_predictions = nn_estimator.predict(X_test_tensor.numpy())

# Evaluate the model
from sklearn.metrics import classification_report
print(classification_report(y_test_tensor.numpy(), test_predictions))

              precision    recall  f1-score   support

         0.0       0.88      0.76      0.82      4687
         1.0       0.43      0.64      0.51      1313

    accuracy                           0.73      6000
   macro avg       0.66      0.70      0.67      6000
weighted avg       0.78      0.73      0.75      6000



Let's analyze the results after applying SMOTE to balance the dataset and retraining the model with the `MoreComplexNN` architecture:

### Classification Report Analysis

### Key Metrics to Focus On

1. **Precision**:
   - **Class 0 (No Default)**: 0.88
   - **Class 1 (Default)**: 0.43
   - Precision for class 0 is high, indicating that when the model predicts no default, it is usually correct. Precision for class 1 is much lower: fewer than half of the predicted defaults are actual defaults, so there are many false positives.

2. **Recall**:
   - **Class 0 (No Default)**: 0.76
   - **Class 1 (Default)**: 0.64
   - Recall for class 0 has decreased compared to before applying SMOTE, but recall for class 1 has significantly improved. This means the model is now better at identifying actual defaults.

3. **F1-score**:
   - **Class 0 (No Default)**: 0.82
   - **Class 1 (Default)**: 0.51
   - The F1-score for class 1 has improved from 0.43 to 0.51, reflecting a better balance of precision and recall when predicting defaults. The F1-score for class 0 has dropped from 0.89 to 0.82.

4. **Support**:
   - **Class 0 (No Default)**: 4687 instances
   - **Class 1 (Default)**: 1313 instances
   - The number of instances for each class remains the same as expected since support values are based on the original test set.

5. **Overall Accuracy**:
   - The overall accuracy is 0.73, which is lower than the initial accuracy before applying SMOTE. However, this is expected because the model is now also focusing on identifying the minority class.

### Interpretation

1. **Improved Recall for Class 1**:
   - The recall for class 1 (defaults) has improved from 0.32 to 0.64. This is a significant improvement and indicates that the model is now correctly identifying more default cases.

2. **Balanced Performance**:
   - There is a trade-off between precision and recall for class 1. Precision has decreased from 0.69 to 0.43, but recall has doubled from 0.32 to 0.64, and the F1-score for class 1 has risen from 0.43 to 0.51.
   - The macro averages, which weight both classes equally, show the more balanced performance: macro recall rises from 0.64 to 0.70 and macro F1 from 0.66 to 0.67. The weighted averages fall (weighted F1 from 0.79 to 0.75) because they are dominated by class 0, whose scores dropped.

3. **Accuracy vs. Recall Trade-off**:
   - The overall accuracy has decreased from 0.82 to 0.73, but this is not necessarily a bad thing. Accuracy is not always the best metric for imbalanced datasets. The improvement in recall for class 1 is more important in this context, as it means the model is better at identifying defaults, which could be critical in real-world applications.
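
The trade-off can be made more concrete by converting the reported recall values back into approximate counts on the test set. Because the report rounds to two decimals, the numbers below are estimates, not exact confusion-matrix entries; the helper `approx_counts` is introduced here only for illustration.

```python
n_default, n_no_default = 1313, 4687  # test-set support from the reports

def approx_counts(recall_default, recall_no_default):
    """Estimate defaults caught and false alarms from rounded recall values."""
    caught = round(recall_default * n_default)
    false_alarms = round((1 - recall_no_default) * n_no_default)
    return caught, false_alarms

before = approx_counts(0.32, 0.96)  # before SMOTE
after = approx_counts(0.64, 0.76)   # after SMOTE

print("before SMOTE (defaults caught, false alarms):", before)  # (420, 187)
print("after SMOTE  (defaults caught, false alarms):", after)   # (840, 1125)
```

Roughly 420 additional defaults are caught, at the price of roughly 940 additional false alarms. Whether that exchange is worthwhile depends on the relative cost of a missed default versus a wrongly flagged customer.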

### Next Steps

To further improve the model's performance, consider the following steps:

1. **Tune Hyperparameters**:
   - Perform hyperparameter tuning to find the optimal settings for the model. This can include experimenting with different learning rates, batch sizes, dropout rates, and numbers of epochs.

2. **Experiment with Different Architectures**:
   - Try different neural network architectures, including deeper networks or different activation functions, to see if they provide better performance.

3. **Regularization Techniques**:
   - Implement regularization techniques like L2 regularization (weight decay) to prevent overfitting and improve generalization.

4. **Adjust Decision Threshold**:
   - Experiment with different decision thresholds for the classification. The wrappers in this notebook use a fixed threshold of 0.5 in `predict`; lowering it generally raises recall for class 1 at the cost of precision.
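
As a small illustration of the threshold idea, the sketch below applies several thresholds to made-up probabilities and labels and reports precision and recall for the positive class. The arrays `y_true` and `y_prob` are invented for the example; with the wrappers in this notebook, the sigmoid outputs would first need to be exposed (for instance through a `predict_proba`-style method, which the notebook does not define).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities standing in for model outputs.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.40, 0.55, 0.90])

recalls = {}
precisions = {}
for threshold in (0.5, 0.4, 0.3):
    # Predict class 1 whenever the probability reaches the threshold
    y_pred = (y_prob >= threshold).astype(int)
    recalls[threshold] = recall_score(y_true, y_pred)
    precisions[threshold] = precision_score(y_true, y_pred)
    print(f"threshold={threshold}: "
          f"precision={precisions[threshold]:.2f}, recall={recalls[threshold]:.2f}")
# threshold=0.5: precision=0.67, recall=0.50
# threshold=0.4: precision=0.60, recall=0.75
# threshold=0.3: precision=0.57, recall=1.00
```

In practice the threshold would be chosen on a validation set rather than the test set, so that the final evaluation remains unbiased.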

