<a href="https://colab.research.google.com/github/micah-shull/pipelines/blob/main/pipelines_04_pytorch_sklearn_pipeline_wrapper_1_explained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Step 1: Load and Preprocess the Dataset

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
df = pd.read_excel(url, header=1)

# Rename columns to lower case and replace spaces with underscores
df.columns = [col.lower().replace(' ', '_') for col in df.columns]

# Select features and target
target = 'default_payment_next_month'
X = df.drop(columns=[target])
y = df[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the preprocessing pipeline
preprocessing_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit and transform the data
X_train_processed = preprocessing_pipeline.fit_transform(X_train)
X_test_processed = preprocessing_pipeline.transform(X_test)

# Convert to PyTorch tensors
import torch
X_train_tensor = torch.tensor(X_train_processed, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
X_test_tensor = torch.tensor(X_test_processed, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)
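
One thing worth checking in this step: `select_dtypes` decides column types purely from the pandas dtype. A category that is stored as integer codes therefore lands in `numeric_features` and gets scaled rather than one-hot encoded. Whether that applies to this dataset is best confirmed with `df.dtypes`. The sketch below uses a small made-up DataFrame (the column names and values are illustrative assumptions, not taken from the dataset) to show the effect and one possible manual override.

```python
import pandas as pd

# Toy frame: 'education' is a category stored as integer codes (hypothetical data)
toy = pd.DataFrame({
    "limit_bal": pd.Series([20000.0, 120000.0, 90000.0], dtype="float64"),
    "education": pd.Series([1, 2, 2], dtype="int64"),
    "city": pd.Categorical(["a", "b", "a"]),
})

# Same dtype-based selection as in the cell above
auto_numeric = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
auto_categorical = toy.select_dtypes(include=["object", "category"]).columns.tolist()

# Manual override: move integer-coded categories to the categorical list
forced_categorical = ["education"]
numeric_cols = [c for c in auto_numeric if c not in forced_categorical]
categorical_cols = auto_categorical + forced_categorical
```

The overridden lists can then be passed to `ColumnTransformer` in place of the automatically detected ones.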

#### Step 2: Define a Simple PyTorch Neural Network Model

In [None]:
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 1)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        return x
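
With a single `nn.Linear(input_dim, 1)` followed by a sigmoid, `SimpleNN` computes the same function as logistic regression: the sigmoid of a weighted sum plus a bias. A minimal plain-Python sketch of that forward pass, using made-up weights purely for illustration:

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, bias):
    """Weighted sum of the inputs plus bias, passed through a sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical parameters for a two-feature input
weights = [0.5, -0.25]
bias = 0.0

p_zero = forward([0.0, 0.0], weights, bias)  # z = 0, so the output is exactly 0.5
p_pos = forward([2.0, 0.0], weights, bias)   # z = 1, so the output is above 0.5
```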

#### Step 3: Define the sklearn Wrapper

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
import torch.optim as optim

class SklearnNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, learning_rate=0.001):
        self.input_dim = input_dim
        self.learning_rate = learning_rate
        self.model = SimpleNN(self.input_dim)

    def fit(self, X, y):
        criterion = nn.BCELoss()
        optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
        train_dataset = torch.utils.data.TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32).unsqueeze(1))
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

        for epoch in range(50):
            self.model.train()
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = criterion(outputs, targets.view(-1, 1))
                loss.backward()
                optimizer.step()
        return self

    def predict(self, X):
        self.model.eval()
        with torch.no_grad():
            outputs = self.model(torch.tensor(X, dtype=torch.float32))
            predictions = (outputs > 0.5).float()
        return predictions.numpy().squeeze()
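
One caveat with building the network inside `__init__`: scikit-learn's estimator conventions expect `__init__` to only store its arguments and expect fitted state to be created in `fit`. Tools such as `GridSearchCV` clone the estimator and then call `set_params`, which would not rebuild a model created in the constructor. The following is a sketch of a variant that follows that convention, not the wrapper used in this notebook. The class name `LazySklearnNN`, the `epochs` parameter, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# Mixin listed first, as scikit-learn's developer guide recommends
class LazySklearnNN(ClassifierMixin, BaseEstimator):
    """Hypothetical variant of SklearnNN that builds its model inside fit."""

    def __init__(self, learning_rate=0.001, epochs=5):
        # Only store hyperparameters here so clone() and set_params() work
        self.learning_rate = learning_rate
        self.epochs = epochs

    def fit(self, X, y):
        X_t = torch.tensor(np.asarray(X), dtype=torch.float32)
        y_t = torch.tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)
        self.classes_ = np.array([0, 1])  # binary 0/1 labels assumed
        # The input width is read from the data instead of being passed in
        self.model_ = nn.Sequential(nn.Linear(X_t.shape[1], 1), nn.Sigmoid())
        optimizer = torch.optim.Adam(self.model_.parameters(), lr=self.learning_rate)
        criterion = nn.BCELoss()
        loader = torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(X_t, y_t), batch_size=64, shuffle=True)
        self.model_.train()
        for _ in range(self.epochs):
            for inputs, targets in loader:
                optimizer.zero_grad()
                loss = criterion(self.model_(inputs), targets)
                loss.backward()
                optimizer.step()
        return self

    def predict(self, X):
        self.model_.eval()
        with torch.no_grad():
            probs = self.model_(torch.tensor(np.asarray(X), dtype=torch.float32))
        return (probs > 0.5).int().numpy().ravel()


# Synthetic data, only to show the estimator inside a Pipeline and a grid search
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()), ("nn", LazySklearnNN())])
search = GridSearchCV(pipe, {"nn__learning_rate": [0.01, 0.1]}, cv=3)
search.fit(X_demo, y_demo)
demo_predictions = search.predict(X_demo)
```

Because the model is created in `fit`, each grid-search candidate starts from freshly initialized weights, and no `input_dim` argument is needed.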

#### Step 4: Train and Evaluate the Model

In [None]:
# Create an instance of SklearnNN
input_dim = X_train_tensor.shape[1]
nn_estimator = SklearnNN(input_dim=input_dim)

# Fit the model
nn_estimator.fit(X_train_tensor.numpy(), y_train_tensor.numpy())

# Predict on the test set
test_predictions = nn_estimator.predict(X_test_tensor.numpy())

# Evaluate the model
from sklearn.metrics import classification_report
print(classification_report(y_test_tensor.numpy(), test_predictions))

### Summary

- **Step 1**: Load and preprocess the dataset.
- **Step 2**: Define a simple PyTorch neural network model.
- **Step 3**: Create a sklearn wrapper for the PyTorch model.
- **Step 4**: Train and evaluate the model using the sklearn wrapper.

This provides a simple yet complete example of integrating a PyTorch neural network with scikit-learn for training and evaluation.



### Step-by-Step Process

#### 1. Simple PyTorch Neural Network Model

We'll start with the simplest possible neural network model and then enhance it incrementally.

In [None]:
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 1)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        return x

#### 2. Define the sklearn Wrapper

We'll create the sklearn wrapper with basic `fit` and `predict` methods.

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
import torch.optim as optim

class SklearnNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, learning_rate=0.001):
        self.input_dim = input_dim
        self.learning_rate = learning_rate
        self.model = SimpleNN(self.input_dim)

    def fit(self, X, y):
        criterion = nn.BCELoss()
        optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
        train_dataset = torch.utils.data.TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32).unsqueeze(1))
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

        for epoch in range(50):
            self.model.train()
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = criterion(outputs, targets.view(-1, 1))
                loss.backward()
                optimizer.step()
        return self

    def predict(self, X):
        self.model.eval()
        with torch.no_grad():
            outputs = self.model(torch.tensor(X, dtype=torch.float32))
            predictions = (outputs > 0.5).float()
        return predictions.numpy().squeeze()

### Example Workflow

The dataset is the one loaded and preprocessed in Step 1, so the tensors `X_train_tensor`, `y_train_tensor`, `X_test_tensor`, and `y_test_tensor` are reused here to train and evaluate the simple model.

### Train and Evaluate the Model

In [None]:
# Create an instance of SklearnNN
input_dim = X_train_tensor.shape[1]
nn_estimator = SklearnNN(input_dim=input_dim)

# Fit the model
nn_estimator.fit(X_train_tensor.numpy(), y_train_tensor.numpy())

# Predict on the test set
test_predictions = nn_estimator.predict(X_test_tensor.numpy())

# Evaluate the model
from sklearn.metrics import classification_report
print(classification_report(y_test_tensor.numpy(), test_predictions))

### Incrementally Adding Features

Let's enhance our model by adding more layers, dropout, and other hyperparameters.

#### Enhanced PyTorch Neural Network Model

In [None]:
class EnhancedNN(nn.Module):
    def __init__(self, input_dim, hidden1_dim=64, hidden2_dim=32, dropout_rate=0.5):
        super(EnhancedNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden1_dim)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(hidden1_dim, hidden2_dim)
        self.dropout2 = nn.Dropout(dropout_rate)
        self.fc3 = nn.Linear(hidden2_dim, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        x = torch.sigmoid(self.fc3(x))
        return x
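
The dropout layers in `EnhancedNN` behave differently depending on the module's mode, which is why the wrappers call `train()` before fitting and `eval()` before predicting. A small sketch with a standalone `nn.Dropout` (the all-ones input tensor is made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

# Training mode: each entry is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout passes the input through unchanged
dropout.eval()
eval_out = dropout(x)

zero_fraction = (train_out == 0).float().mean().item()
```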

#### Enhanced sklearn Wrapper

In [None]:
class SklearnEnhancedNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, hidden1_dim=64, hidden2_dim=32, dropout_rate=0.5, learning_rate=0.001):
        self.input_dim = input_dim
        self.hidden1_dim = hidden1_dim
        self.hidden2_dim = hidden2_dim
        self.dropout_rate = dropout_rate
        self.learning_rate = learning_rate
        self.model = EnhancedNN(self.input_dim, self.hidden1_dim, self.hidden2_dim, self.dropout_rate)

    def fit(self, X, y):
        criterion = nn.BCELoss()
        optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
        train_dataset = torch.utils.data.TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32).unsqueeze(1))
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

        for epoch in range(50):
            self.model.train()
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = criterion(outputs, targets.view(-1, 1))
                loss.backward()
                optimizer.step()
        return self

    def predict(self, X):
        self.model.eval()
        with torch.no_grad():
            outputs = self.model(torch.tensor(X, dtype=torch.float32))
            predictions = (outputs > 0.5).float()
        return predictions.numpy().squeeze()

#### Train and Evaluate the Enhanced Model

In [None]:
# Create an instance of SklearnEnhancedNN
input_dim = X_train_tensor.shape[1]
nn_estimator = SklearnEnhancedNN(input_dim=input_dim)

# Fit the model
nn_estimator.fit(X_train_tensor.numpy(), y_train_tensor.numpy())

# Predict on the test set
test_predictions = nn_estimator.predict(X_test_tensor.numpy())

# Evaluate the model
print(classification_report(y_test_tensor.numpy(), test_predictions))

#### Summary

1. **Define a Simple PyTorch Neural Network**: Start with a simple neural network model.
2. **Create a sklearn Wrapper**: Define a wrapper that builds the neural network and exposes training settings such as the learning rate as constructor parameters.
3. **Incrementally Enhance the Model**: Add more layers, dropout, and other hyperparameters to the PyTorch model and update the wrapper accordingly.
4. **Train and Evaluate the Model**: Use the sklearn wrapper to fit the model, make predictions, and evaluate its performance.

This approach allows you to leverage the strengths of both scikit-learn and PyTorch, creating a powerful and flexible workflow for building and tuning neural network models.

The code inside the `SklearnNN` wrapper looks like PyTorch code because it essentially is PyTorch code. The purpose of the wrapper is to integrate the PyTorch neural network model into the scikit-learn framework, allowing you to use scikit-learn's tools for model training, evaluation, and hyperparameter tuning.

#### Why Use PyTorch Code Inside the Wrapper?

1. **Model Training**:
   - The `fit` method in the wrapper uses PyTorch's training loop, including forward passes, loss calculation, backpropagation, and optimizer steps.
   - This is necessary because the model itself is a PyTorch model, and the training process involves operations specific to PyTorch.

2. **Prediction**:
   - The `predict` method also uses PyTorch for making predictions. It puts the model in evaluation mode and performs forward passes to generate predictions.

3. **Compatibility**:
   - The wrapper keeps all PyTorch-specific logic behind the `fit` and `predict` methods that scikit-learn expects, so the model can be passed to scikit-learn utilities without those utilities knowing anything about PyTorch.

### Detailed Breakdown

Let's break down the key parts of the wrapper:

#### Initialization

```python
class SklearnNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, learning_rate=0.001):
        self.input_dim = input_dim
        self.learning_rate = learning_rate
        self.model = SimpleNN(self.input_dim)
```

- **Initialization (`__init__`)**:
  - Parameters like `input_dim` and `learning_rate` are defined and stored.
  - An instance of the `SimpleNN` model is created using the provided `input_dim`.
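
Storing each constructor argument under an attribute of the same name is what lets `BaseEstimator` supply `get_params` and `set_params`, which utilities such as `clone` and `GridSearchCV` rely on. A small sketch with a stand-in estimator (the `Demo` class is hypothetical and contains no model; it only illustrates the parameter plumbing):

```python
from sklearn.base import BaseEstimator, clone

class Demo(BaseEstimator):
    """Hypothetical estimator that only stores its constructor arguments."""

    def __init__(self, input_dim, learning_rate=0.001):
        self.input_dim = input_dim
        self.learning_rate = learning_rate

est = Demo(input_dim=10)
params = est.get_params()            # dict built from the __init__ signature
est.set_params(learning_rate=0.01)   # updates the attribute of the same name
est_copy = clone(est)                # new, unfitted instance with the same parameters
```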

#### Training the Model

```python
    def fit(self, X, y):
        criterion = nn.BCELoss()
        optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
        train_dataset = torch.utils.data.TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32).unsqueeze(1))
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

        for epoch in range(50):
            self.model.train()
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = criterion(outputs, targets.view(-1, 1))
                loss.backward()
                optimizer.step()
        return self
```

- **Loss Function (`criterion`)**:
  - Uses binary cross-entropy loss, which is common for binary classification tasks.

- **Optimizer**:
  - Uses the Adam optimizer to update model parameters.

- **DataLoader**:
  - Converts the input arrays to PyTorch tensors, wraps them in a `TensorDataset`, and creates a `DataLoader` that serves shuffled mini-batches of 64 samples.

- **Training Loop**:
  - Runs for a fixed 50 epochs, setting the model to training mode at the start of each.
  - For every mini-batch: clears old gradients, runs a forward pass, computes the loss, backpropagates, and updates the parameters.
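
For reference, binary cross-entropy averages `-(y * log(p) + (1 - y) * log(1 - p))` over the samples in a batch, which is what `nn.BCELoss` computes with its default mean reduction. A plain-Python sketch with made-up predictions:

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy for predicted probabilities and 0/1 targets."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)

# Hypothetical predictions for two samples with true labels 1 and 0
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])  # small loss
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])  # large loss
```

The loss grows quickly as a confident prediction turns out to be wrong, which is the signal backpropagation uses to adjust the weights.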

#### Making Predictions

```python
    def predict(self, X):
        self.model.eval()
        with torch.no_grad():
            outputs = self.model(torch.tensor(X, dtype=torch.float32))
            predictions = (outputs > 0.5).float()
        return predictions.numpy().squeeze()
```

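- **Evaluation Mode**:
  - `self.model.eval()` switches layers such as dropout and batch normalization to their inference behavior. `SimpleNN` has neither, so the call has no effect here, but it matters for models like `EnhancedNN`.

- **Forward Pass**:
  - Runs a forward pass inside `torch.no_grad()`, so no gradients are tracked.
  - Thresholds the sigmoid output at 0.5 to produce binary predictions (0 or 1).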
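
One detail of the thresholding step: the model output has shape `(n_samples, 1)`, and `squeeze()` collapses it to `(n_samples,)`. For a single-row input, though, `squeeze()` returns a 0-dimensional array, whereas `ravel()` keeps one dimension. A NumPy sketch, with made-up probabilities standing in for the model output:

```python
import numpy as np

# Stand-ins for model outputs of shape (n_samples, 1)
batch_probs = np.array([[0.2], [0.7], [0.5]])
single_probs = np.array([[0.7]])

# A probability of exactly 0.5 is not greater than 0.5, so it maps to class 0
batch_labels = (batch_probs > 0.5).astype(int).squeeze()

single_squeezed = (single_probs > 0.5).astype(int).squeeze()  # 0-dimensional
single_raveled = (single_probs > 0.5).astype(int).ravel()     # one-dimensional
```

If single-sample prediction matters, `ravel()` may be the safer choice inside `predict`.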

### Conclusion

The `SklearnNN` wrapper allows you to integrate a PyTorch neural network model into the scikit-learn framework. Inside the wrapper, you use PyTorch code to handle model training and prediction because the model itself is a PyTorch model. This approach leverages the strengths of both frameworks: the flexibility and power of PyTorch for neural network modeling and the ease of use and integration capabilities of scikit-learn for preprocessing, model selection, and evaluation.

If you were to use a TensorFlow neural network model, you would need to use TensorFlow code within the wrapper to handle the training and prediction processes. The idea is to wrap the TensorFlow model in a scikit-learn compatible interface so that you can leverage scikit-learn's utilities while using TensorFlow for the neural network operations.

#### TensorFlow Neural Network with scikit-learn Wrapper

Let's go through an example of how to integrate a TensorFlow neural network with scikit-learn.

#### Step 1: Define a Simple TensorFlow Neural Network

We'll start with a simple neural network model using TensorFlow and Keras.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

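def create_simple_nn(input_dim):
    # Declare the input shape explicitly, then add a single sigmoid output unit
    model = Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))
    model.add(Dense(1, activation='sigmoid'))
    return model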
```

#### Step 2: Define the sklearn Wrapper

Next, we create the sklearn wrapper for the TensorFlow model. This involves defining a class that inherits from `BaseEstimator` and `ClassifierMixin`, similar to the PyTorch example.

```python
from sklearn.base import BaseEstimator, ClassifierMixin

class SklearnTFNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, learning_rate=0.001, epochs=50, batch_size=64):
        self.input_dim = input_dim
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        self.model = create_simple_nn(self.input_dim)
        self.model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate),
                           loss='binary_crossentropy',
                           metrics=['accuracy'])

    def fit(self, X, y):
        self.model.fit(X, y, epochs=self.epochs, batch_size=self.batch_size, verbose=0)
        return self

    def predict(self, X):
        predictions = self.model.predict(X, verbose=0)
        # Flatten the (n_samples, 1) output to the 1-D label array scikit-learn expects
        return (predictions > 0.5).astype("int32").ravel()
```

### Example Workflow

Let's use the same dataset and preprocess it, then train and evaluate our TensorFlow model using the sklearn wrapper.

#### Load and Preprocess the Dataset

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
df = pd.read_excel(url, header=1)

# Rename columns to lower case and replace spaces with underscores
df.columns = [col.lower().replace(' ', '_') for col in df.columns]

# Select features and target
target = 'default_payment_next_month'
X = df.drop(columns=[target])
y = df[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the preprocessing pipeline
preprocessing_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit and transform the data
X_train_processed = preprocessing_pipeline.fit_transform(X_train)
X_test_processed = preprocessing_pipeline.transform(X_test)

# Convert to NumPy arrays
X_train_array = X_train_processed.toarray() if hasattr(X_train_processed, 'toarray') else X_train_processed
X_test_array = X_test_processed.toarray() if hasattr(X_test_processed, 'toarray') else X_test_processed
```

### Train and Evaluate the TensorFlow Model

```python
# Create an instance of SklearnTFNN
input_dim = X_train_array.shape[1]
nn_estimator = SklearnTFNN(input_dim=input_dim)

# Fit the model
nn_estimator.fit(X_train_array, y_train.values)

# Predict on the test set
test_predictions = nn_estimator.predict(X_test_array)

# Evaluate the model
from sklearn.metrics import classification_report
print(classification_report(y_test, test_predictions))
```

### Explanation

1. **TensorFlow Neural Network**:
   - The `create_simple_nn` function defines a simple neural network with a single layer.
   - The wrapper's `__init__` compiles the model with an Adam optimizer, binary cross-entropy loss, and an accuracy metric.

2. **sklearn Wrapper**:
   - The `SklearnTFNN` class initializes the TensorFlow model with the specified parameters.
   - The `fit` method trains the TensorFlow model on the provided data.
   - The `predict` method generates predictions using the trained TensorFlow model.

3. **Preprocessing**:
   - Data preprocessing is handled using scikit-learn pipelines, ensuring consistency between training and testing datasets.

4. **Training and Evaluation**:
   - The model is trained with the `fit` method, and test-set predictions come from the `predict` method.
   - The classification report compares those predictions with the true labels, reporting precision, recall, F1-score, and support per class.

By using TensorFlow code within the wrapper, the TensorFlow model is trained and queried through the same `fit`/`predict` interface as any scikit-learn estimator, so it can be used alongside scikit-learn's preprocessing and evaluation tools.