<a href="https://colab.research.google.com/github/micah-shull/pipelines/blob/main/pipelines_07_pytorch_pipeline_05_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Feature Engineering Order of Operations

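In feature engineering, some transformations can safely be applied before splitting the data into training and testing sets, while others must be fitted after the split to avoid data leakage. The deciding question is whether the transformation learns anything from the data. Row-wise calculations that use only a record's own values are safe before the split, whereas anything that computes statistics across rows (means, category frequencies, target averages) should be fitted on the training set alone. Here's how you can determine which transformations to apply before or after the split: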

### Apply Before Train-Test Split:
1. **Interaction Features**: Creating new features by combining existing ones.
2. **Date Features**: Extracting features from date columns.
3. **Binning**: Converting continuous variables into categorical variables, using bin edges that are fixed in advance rather than derived from the data.
4. **Ratio Features**: Creating ratio features (e.g., payment-to-bill ratios).
5. **Aggregations**: Creating features based on per-row aggregations over certain periods (e.g., a customer's average bill amount across months).

### Apply After Train-Test Split:
1. **Normalization/Standardization**: Scaling features to have a mean of zero and a standard deviation of one, or scaling to a specific range.
2. **Imputation**: Filling missing values with a specific strategy (mean, median, etc.).
3. **Encoding Categorical Variables**: One-hot encoding or label encoding.
4. **Target Encoding**: Encoding categorical variables using the target variable. The per-category statistics must be computed from the training set only.
5. **Polynomial Features**: Generating polynomial and interaction features.
6. **SMOTE, ADASYN, or Other Resampling Techniques**: Addressing class imbalance by resampling the training set only; the test set keeps its original class distribution.
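
To make the resampling point concrete, here is a minimal sketch of oversampling after the split. SMOTE and ADASYN come from the separate `imbalanced-learn` package, which this notebook does not import, so plain random oversampling with `sklearn.utils.resample` is swapped in as a stand-in. The data is synthetic and the names `demo`, `train_df`, and `test_df` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 22% positives), not the credit card dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.normal(size=1000),
    "y": (rng.random(1000) < 0.22).astype(int),
})

# Split first, so the test set is never touched by resampling
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["y"]
)

majority = train_df[train_df["y"] == 0]
minority = train_df[train_df["y"] == 1]

# Randomly duplicate minority rows until both classes are the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled]).sample(
    frac=1.0, random_state=42
)

print(train_balanced["y"].value_counts())
print(test_df["y"].value_counts())
```

The test set keeps its original class ratio, so the reported metrics still reflect the distribution the model would face in practice.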

### Why Apply Certain Transformations After the Train-Test Split?
- **Prevent Data Leakage**: Ensuring that no information from the test set is used to influence the training process.
- **Consistent Scaling**: Calculating the parameters (mean, standard deviation) of scaling on the training set and then applying them to both the training and test sets.
- **Proper Imputation**: Filling missing values based on the distribution of the training set.
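
The points above can be sketched with a scaler that is fitted on the training rows and then reused on the test rows. This is an illustration on synthetic data, not the notebook's own pipeline; the names `X_demo`, `X_tr`, and `X_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a numeric feature matrix
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y_demo = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std learned from training rows only
X_te_scaled = scaler.transform(X_te)      # the same parameters are reused on the test rows

# Training columns are centered exactly; test columns are only approximately centered
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

The test means come out close to zero but not exactly zero, which is expected: the test rows are scaled with statistics they did not contribute to.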

### Summary

- **Before Train-Test Split**: Apply feature engineering steps that derive new features from each row's own values, such as interaction features, fixed-edge binning, ratio features, and per-row aggregated features.
- **After Train-Test Split**: Fit transformations that learn parameters from the data, such as imputation, scaling, one-hot encoding, and target encoding, on the training set only, then apply them consistently to both training and test sets.

This ordering keeps test-set information out of training and ensures that every fitted transformation is applied identically to both sets.
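
One caveat on binning: it is split-independent only when the bin edges are fixed in advance. A call such as `pd.cut(..., bins=5)` derives its edges from the minimum and maximum of whatever data it sees, so running it on the full dataset lets the test rows influence the edges. A minimal sketch of one way around this, assuming small made-up age values (`train_age` and `test_age` are hypothetical), is to learn the edges on the training rows and reuse them:

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for the training and test splits
train_age = pd.Series([21, 25, 30, 38, 45, 52, 60, 79])
test_age = pd.Series([19, 33, 85])

# Learn five equal-width bins from the training rows and keep the edges
_, edges = pd.cut(train_age, bins=5, retbins=True)

# Open the outer edges so test values outside the training range still get a bin
edges[0], edges[-1] = -np.inf, np.inf

# Reuse the same edges for both splits; labels=False returns integer bin codes
train_binned = pd.cut(train_age, bins=edges, labels=False)
test_binned = pd.cut(test_age, bins=edges, labels=False)

print(train_binned.tolist())
print(test_binned.tolist())
```

Opening the outer edges is a design choice; without it, test values below the training minimum or above the training maximum would become missing values.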




In [1]:
import pandas as pd
import numpy as np
import torch
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import classification_report
from model_pipeline import (
    load_data_from_url, clean_column_names, remove_id_column, convert_categorical,
    split_data, train_model, calculate_class_weights, convert_to_tensors,
    preprocess_data, define_preprocessor, SimpleNN, SklearnSimpleNN,
)

# Define dataset-specific parameters
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
categorical_columns = ['sex', 'education', 'marriage']
target = 'default_payment_next_month'

# Load and preprocess data
data = load_data_from_url(url)
data = clean_column_names(data)
data = remove_id_column(data)
data = convert_categorical(data, categorical_columns=categorical_columns)

# Rename columns
def rename_columns(df):
    rename_dict = {
        'pay_0': 'pay_1'
    }
    df = df.rename(columns=rename_dict)
    return df

data = rename_columns(data)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = split_data(data, target=target)

# Define preprocessor and preprocess the data
preprocessor = define_preprocessor(X_train)
X_train_processed, X_test_processed = preprocess_data(preprocessor, X_train, X_test)

# Convert data to tensors
X_train_tensor, y_train_tensor, X_test_tensor, y_test_tensor = convert_to_tensors(X_train_processed, y_train, X_test_processed, y_test)


### Before

In [2]:
# Train the model without feature engineering
nn_estimator_before = SklearnSimpleNN(input_dim=X_train_tensor.shape[1], pos_weight=calculate_class_weights(y_train)[1])
nn_estimator_before = train_model(nn_estimator_before, X_train_tensor, y_train_tensor)

# Evaluate the model and save the report before feature engineering
def evaluate_model(nn_estimator, X_test_tensor, y_test_tensor, label):
    y_pred = nn_estimator.predict(X_test_tensor.numpy())
    report = classification_report(y_test_tensor.numpy(), y_pred, output_dict=True)
    print(f"Classification Report ({label}):")
    print(classification_report(y_test_tensor.numpy(), y_pred))
    return report

report_before = evaluate_model(nn_estimator_before, X_test_tensor, y_test_tensor, "Before Feature Engineering")


Classification Report (Before Feature Engineering):
              precision    recall  f1-score   support

         0.0       0.87      0.87      0.87      4673
         1.0       0.53      0.52      0.53      1327

    accuracy                           0.79      6000
   macro avg       0.70      0.70      0.70      6000
weighted avg       0.79      0.79      0.79      6000



In [3]:
# Apply feature engineering to the entire dataset
def create_interaction_features(df):
    df['limit_bal_age'] = df['limit_bal'] * df['age']
    return df

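def target_encode(df, target, categorical_columns):
    # NOTE: the category means are computed from every row of df. Calling this
    # on the full dataset before the split (as done below) lets test-set labels
    # leak into the feature; for a leak-free estimate, fit on the training rows only.
    for col in categorical_columns:
        mean_target = df.groupby(col)[target].mean()
        df[col + '_target_enc'] = df[col].map(mean_target)
    return df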

def bin_features(df, column, bins):
    df[column + '_binned'] = pd.cut(df[column], bins=bins)
    return df

def create_payment_to_bill_ratios(df):
    for i in range(1, 7):
        df[f'pay_to_bill_ratio_{i}'] = df[f'pay_amt{i}'] / df[f'bill_amt{i}'].replace(0, np.nan)
    return df

def create_payment_to_limit_ratios(df):
    for i in range(1, 7):
        df[f'pay_to_limit_ratio_{i}'] = df[f'pay_amt{i}'] / df['limit_bal']
    return df

def create_bill_to_limit_ratios(df):
    for i in range(1, 7):
        df[f'bill_to_limit_ratio_{i}'] = df[f'bill_amt{i}'] / df['limit_bal']
    return df

def create_lagged_payment_differences(df):
    for i in range(1, 6):
        df[f'pay_amt_diff_{i}'] = df[f'pay_amt{i+1}'] - df[f'pay_amt{i}']
    return df

def create_debt_ratio_features(df):
    # Note: same formula as bill_to_limit_ratio_{i} above, so these columns are redundant
    for i in range(1, 7):
        df[f'debt_ratio_{i}'] = df[f'bill_amt{i}'] / df['limit_bal']
    return df

def create_average_payment_and_bill(df):
    df['avg_payment'] = df[[f'pay_amt{i}' for i in range(1, 7)]].mean(axis=1)
    df['avg_bill'] = df[[f'bill_amt{i}' for i in range(1, 7)]].mean(axis=1)
    return df

def create_payment_timeliness_features(df):
    for i in range(1, 7):
        df[f'pay_on_time_{i}'] = (df[f'pay_{i}'] <= 0).astype(int)
    return df

def create_total_payment_and_bill(df):
    df['total_payment'] = df[[f'pay_amt{i}' for i in range(1, 7)]].sum(axis=1)
    df['total_bill'] = df[[f'bill_amt{i}' for i in range(1, 7)]].sum(axis=1)
    return df

def create_bill_difference_features(df):
    for i in range(1, 6):
        df[f'bill_diff_{i}'] = df[f'bill_amt{i+1}'] - df[f'bill_amt{i}']
    return df

data = create_interaction_features(data)
data = target_encode(data, target, categorical_columns)
data = bin_features(data, column='age', bins=5)
data = create_payment_to_bill_ratios(data)
data = create_payment_to_limit_ratios(data)
data = create_bill_to_limit_ratios(data)
data = create_lagged_payment_differences(data)
data = create_debt_ratio_features(data)
data = create_average_payment_and_bill(data)
data = create_payment_timeliness_features(data)
data = create_total_payment_and_bill(data)
data = create_bill_difference_features(data)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = split_data(data, target=target)

# Define preprocessor and preprocess the data
preprocessor = define_preprocessor(X_train)
X_train_processed, X_test_processed = preprocess_data(preprocessor, X_train, X_test)

# Convert data to tensors
X_train_tensor, y_train_tensor, X_test_tensor, y_test_tensor = convert_to_tensors(X_train_processed, y_train, X_test_processed, y_test)

# Train the model after feature engineering
nn_estimator_after = SklearnSimpleNN(input_dim=X_train_tensor.shape[1], pos_weight=calculate_class_weights(y_train)[1])
nn_estimator_after = train_model(nn_estimator_after, X_train_tensor, y_train_tensor)

# Evaluate the model and save the report after feature engineering
report_after = evaluate_model(nn_estimator_after, X_test_tensor, y_test_tensor, "After Feature Engineering")

Classification Report (After Feature Engineering):
              precision    recall  f1-score   support

         0.0       0.86      0.87      0.87      4673
         1.0       0.54      0.52      0.53      1327

    accuracy                           0.79      6000
   macro avg       0.70      0.69      0.70      6000
weighted avg       0.79      0.79      0.79      6000



### Compare Report

In [4]:
import pandas as pd

def compare_classification_reports(report_before, report_after):
    # Convert reports to DataFrame
    report_before_df = pd.DataFrame(report_before).transpose()
    report_after_df = pd.DataFrame(report_after).transpose()

    # Merge reports
    comparison_df = report_before_df.join(report_after_df, lsuffix='_before', rsuffix='_after')

    # Calculate percentage change
    comparison_df['precision_change'] = (comparison_df['precision_after'] - comparison_df['precision_before']) / comparison_df['precision_before'] * 100
    comparison_df['recall_change'] = (comparison_df['recall_after'] - comparison_df['recall_before']) / comparison_df['recall_before'] * 100
    comparison_df['f1-score_change'] = (comparison_df['f1-score_after'] - comparison_df['f1-score_before']) / comparison_df['f1-score_before'] * 100

    print("Comparison of Classification Report Metrics:")
    print(comparison_df[['precision_before', 'precision_after', 'precision_change',
                         'recall_before', 'recall_after', 'recall_change',
                         'f1-score_before', 'f1-score_after', 'f1-score_change']])

    return comparison_df

# Compare the classification reports before and after feature engineering
comparison_df = compare_classification_reports(report_before, report_after)


Comparison of Classification Report Metrics:
              precision_before  precision_after  precision_change  \
0.0                   0.865446         0.864012         -0.165714   
1.0                   0.533384         0.535575          0.410618   
accuracy              0.793333         0.794000          0.084034   
macro avg             0.699415         0.699793          0.054045   
weighted avg          0.792005         0.791373         -0.079871   

              recall_before  recall_after  recall_change  f1-score_before  \
0.0                0.869891      0.872887       0.344403         0.867663   
1.0                0.523738      0.516202      -1.438849         0.528517   
accuracy           0.793333      0.794000       0.084034         0.793333   
macro avg          0.696814      0.694544      -0.325758         0.698090   
weighted avg       0.793333      0.794000       0.084034         0.792655   

              f1-score_after  f1-score_change  
0.0                 0.868427 

In [5]:
report_before

{'0.0': {'precision': 0.8654460293804556,
  'recall': 0.8698908624010272,
  'f1-score': 0.8676627534685166,
  'support': 4673},
 '1.0': {'precision': 0.533384497313891,
  'recall': 0.5237377543330821,
  'f1-score': 0.5285171102661597,
  'support': 1327},
 'accuracy': 0.7933333333333333,
 'macro avg': {'precision': 0.6994152633471733,
  'recall': 0.6968143083670546,
  'f1-score': 0.6980899318673381,
  'support': 6000},
 'weighted avg': {'precision': 0.7920050872050669,
  'recall': 0.7933333333333333,
  'f1-score': 0.7926550420469287,
  'support': 6000}}

In [6]:
report_after

{'0.0': {'precision': 0.8640118618936666,
  'recall': 0.8728867964904772,
  'f1-score': 0.8684266553119012,
  'support': 4673},
 '1.0': {'precision': 0.5355746677091477,
  'recall': 0.5162019593067069,
  'f1-score': 0.5257099002302379,
  'support': 1327},
 'accuracy': 0.794,
 'macro avg': {'precision': 0.6997932648014071,
  'recall': 0.694544377898592,
  'f1-score': 0.6970682777710695,
  'support': 6000},
 'weighted avg': {'precision': 0.7913725024465239,
  'recall': 0.794,
  'f1-score': 0.7926291329796733,
  'support': 6000}}

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 73 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   limit_bal                   30000 non-null  int64   
 1   sex                         30000 non-null  category
 2   education                   30000 non-null  category
 3   marriage                    30000 non-null  category
 4   age                         30000 non-null  int64   
 5   pay_1                       30000 non-null  int64   
 6   pay_2                       30000 non-null  int64   
 7   pay_3                       30000 non-null  int64   
 8   pay_4                       30000 non-null  int64   
 9   pay_5                       30000 non-null  int64   
 10  pay_6                       30000 non-null  int64   
 11  bill_amt1                   30000 non-null  int64   
 12  bill_amt2                   30000 non-null  int64   
 13  bill_amt3       

The `target_encode` function implements target encoding, a feature engineering technique that represents each category of a categorical variable by the mean (or another statistic) of the target variable for that category. This method is most useful for high-cardinality categorical features, where one-hot encoding would create too many dummy variables; the three columns encoded here (`sex`, `education`, `marriage`) have only a handful of levels each.

### Explanation of Target Encoding

Let's break down what the `target_encode` function does step by step:

1. **Group by Category and Compute Mean Target**:
   - The function groups the dataframe by each unique value of the categorical feature.
   - It then computes the mean of the target variable for each category.

2. **Map the Mean Target to the Original Data**:
   - The function maps the computed mean target values back to the original dataframe and stores them in a new column named `<column>_target_enc`. The original categorical column is kept alongside it.
   
### Benefits of Target Encoding

- **Dimensionality Reduction**: Unlike one-hot encoding, which can significantly increase the number of features, target encoding results in only one new feature for each categorical variable.
- **Capturing Impact on Target**: Target encoding captures the relationship between the categorical feature and the target variable, which can be beneficial for certain machine learning models.

### Potential Issues

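- **Data Leakage**: If not applied correctly, target encoding can lead to data leakage, where information from the test set influences the training set. This is why it's important to compute the encoding from the training data only and then apply the same mapping to the test data. Note that `target_encode` as called in this notebook computes the means over the full dataset before the split, so the "after" metrics reported above include this leakage.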
- **Overfitting**: Target encoding can sometimes lead to overfitting, especially if there are categories with few examples. Regularization techniques can be applied to mitigate this.
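
Both issues can be addressed together. The following is a minimal sketch, not the notebook's `target_encode`: the helper names `fit_target_encoding` and `apply_target_encoding`, the `smoothing` parameter, and the toy data are all assumptions made for illustration. The mapping is learned from the training rows only, and each category mean is shrunk toward the global mean so that rare categories are not trusted too much.

```python
import pandas as pd

def fit_target_encoding(train_df, col, target, smoothing=10.0):
    """Learn a smoothed category -> mean-target mapping from training rows only."""
    global_mean = train_df[target].mean()
    stats = train_df.groupby(col, observed=True)[target].agg(["mean", "count"])
    # Weighted average of category mean and global mean; small counts shrink more
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return smoothed, global_mean

def apply_target_encoding(df, col, mapping, global_mean):
    """Map categories to learned values; unseen categories get the global mean."""
    return df[col].map(mapping).astype(float).fillna(global_mean)

# Toy data: category "d" appears only in the test rows
train = pd.DataFrame({
    "education": ["a", "a", "b", "b", "b", "c"],
    "default": [1, 0, 0, 0, 1, 1],
})
test = pd.DataFrame({"education": ["a", "c", "d"]})

mapping, global_mean = fit_target_encoding(train, "education", "default", smoothing=2.0)
train["education_target_enc"] = apply_target_encoding(train, "education", mapping, global_mean)
test["education_target_enc"] = apply_target_encoding(test, "education", mapping, global_mean)

print(mapping)
print(test)
```

Even this version lets each training row see its own label through the category mean, so out-of-fold encoding within the training set is a common further refinement.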





Determining which features improve the performance of a machine learning model is an iterative process that involves several steps. Here are the general steps you can follow to evaluate the impact of different features on your model's performance:

### 1. **Baseline Model**:
   - Start by building a baseline model with minimal preprocessing and feature engineering.
   - Evaluate the baseline model's performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score, AUC-ROC).

### 2. **Feature Addition/Removal**:
   - Incrementally add or remove features and observe changes in model performance.
   - This can be done manually or using automated methods like recursive feature elimination.

### 3. **Cross-Validation**:
   - Use cross-validation to ensure that the performance improvements are consistent and not due to random chance.
   - Cross-validation helps in providing a more robust estimate of the model's performance.

### 4. **Feature Importance**:
   - For models that provide feature importance (e.g., tree-based models like Random Forest, Gradient Boosting), examine the feature importance scores.
   - Identify the most influential features according to the model.

### 5. **Statistical Tests**:
   - Perform statistical tests to determine the significance of individual features.
   - Techniques like ANOVA, chi-square tests, and mutual information can help in understanding the relationship between features and the target variable.

### 6. **Model Performance Metrics**:
   - Monitor and compare key performance metrics before and after adding/removing features.
   - Common metrics include accuracy, precision, recall, F1-score, and AUC-ROC.

### 7. **Model Interpretation Tools**:
   - Use tools like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to understand the impact of features on model predictions.
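
As a rough illustration of step 5, here is a minimal sketch that scores features by mutual information with the target using scikit-learn's `mutual_info_classif`. The data is synthetic (generated with `make_classification`), and the names `X_mi`, `y_mi`, and `feature_i` are hypothetical; on real data, compute the scores on the training set only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: with shuffle=False the first two columns are the informative ones
X_mi, y_mi = make_classification(
    n_samples=1000,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=42,
)
names = [f"feature_{i}" for i in range(X_mi.shape[1])]

# Estimate the mutual information between each feature and the class label
mi = mutual_info_classif(X_mi, y_mi, random_state=42)
mi_scores = pd.Series(mi, index=names).sort_values(ascending=False)

print(mi_scores)
```

Mutual information scores each feature on its own, so it can miss features that only help in combination with others. Treat it as a screening step rather than a final verdict.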

### Example: Feature Importance with Random Forest

Here’s an example of how you might use a Random Forest model to assess feature importance:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Fit a Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_processed, y_train)

# Evaluate the model
y_pred = rf.predict(X_test_processed)
print(classification_report(y_test, y_pred))

# Get feature importances. The names come from the fitted preprocessor because
# one-hot encoding changes the column count, so X_train.columns would not line up
# (get_feature_names_out needs a reasonably recent scikit-learn)
importances = rf.feature_importances_
feature_names = preprocessor.get_feature_names_out()
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
print(feature_importance_df)

# Cross-validation
cv_scores = cross_val_score(rf, X_train_processed, y_train, cv=5)
print(f'Cross-validation scores: {cv_scores}')
print(f'Mean CV score: {np.mean(cv_scores)}')
```
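
The cross-validation call above runs on data that was already preprocessed, so the scaler and imputer have seen every fold. A minimal sketch of a stricter setup wraps the preprocessing and the model in a single `Pipeline`, so that each fold fits its own scaler. This runs on synthetic data with logistic regression as a stand-in model; `X_cv`, `y_cv`, and `pipe` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data, roughly 78% / 22% like the class split reported above
X_cv, y_cv = make_classification(
    n_samples=500, n_features=10, weights=[0.78, 0.22], random_state=42
)

# Scaling lives inside the pipeline, so it is refitted on each training fold
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_f1 = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="f1")

print(f"F1 per fold: {cv_f1}")
print(f"Mean F1: {cv_f1.mean():.3f} (std {cv_f1.std():.3f})")
```

Reporting the spread across folds alongside the mean helps judge whether a difference as small as the before/after change seen earlier is larger than fold-to-fold noise.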

### Example: Using SHAP for Model Interpretation

SHAP values provide a way to understand the contribution of each feature to individual predictions:

```python
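import shap

# Initialize the SHAP explainer for the fitted Random Forest
explainer = shap.TreeExplainer(rf)
# Explain a subsample; the full training set can be slow with 100 trees
X_sample = X_train_processed[:1000]
shap_values = explainer.shap_values(X_sample)
# Depending on the SHAP version, a binary classifier returns either a list of
# per-class arrays or one array with a trailing class axis; keep class 1
if isinstance(shap_values, list):
    shap_values = shap_values[1]
elif shap_values.ndim == 3:
    shap_values = shap_values[..., 1]

# Plot SHAP summary
shap.summary_plot(shap_values, X_sample, feature_names=feature_names)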
```

### Summary Steps:
1. **Build a baseline model and evaluate its performance.**
2. **Incrementally add or remove features and observe changes in performance.**
3. **Use cross-validation to ensure consistent performance improvements.**
4. **Examine feature importance scores from models that support it.**
5. **Perform statistical tests to assess the significance of features.**
6. **Monitor key performance metrics to compare different feature sets.**
7. **Use model interpretation tools like SHAP or LIME to understand feature contributions.**

This iterative process helps in identifying the most impactful features and refining the model for better performance.
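
The first three steps can be sketched as a small ablation loop: define named feature sets, score each with the same cross-validated model, and compare. Everything here is illustrative. The data is synthetic, logistic regression stands in for the neural network, and the names `abl_df`, `y_abl`, and `feature_sets` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in which the label depends on a payment-to-bill ratio
rng = np.random.default_rng(42)
n = 800
abl_df = pd.DataFrame({
    "bill": rng.uniform(100, 5000, n),
    "payment": rng.uniform(0, 5000, n),
    "noise": rng.normal(size=n),
})
abl_df["pay_to_bill"] = abl_df["payment"] / abl_df["bill"]
y_abl = (abl_df["pay_to_bill"] + rng.normal(scale=0.3, size=n) < 0.5).astype(int)

# Feature sets to compare: the baseline, then the baseline plus one engineered feature
feature_sets = {
    "baseline": ["bill", "payment", "noise"],
    "baseline + ratio": ["bill", "payment", "noise", "pay_to_bill"],
}

# Same model and folds for every feature set, so only the features differ
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
results = {
    name: cross_val_score(model, abl_df[cols], y_abl, cv=5, scoring="f1").mean()
    for name, cols in feature_sets.items()
}

print(pd.Series(results, name="mean_cv_f1"))
```

Adding one feature group at a time makes it possible to attribute a change in the score to that group, which is harder when a dozen features are added at once.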

### Updated model_pipeline.py script

In [8]:
script_content = """
import pandas as pd
import numpy as np
import torch
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Load the dataset from a URL
def load_data_from_url(url):
    df = pd.read_excel(url, header=1)
    return df

# Clean column names
def clean_column_names(df):
    df.columns = [col.lower().replace(' ', '_') for col in df.columns]
    return df

# Remove the 'id' column
def remove_id_column(df):
    if 'id' in df.columns:
        df = df.drop(columns=['id'])
    return df

# Convert specified columns to categorical type
def convert_categorical(df, categorical_columns):
    df[categorical_columns] = df[categorical_columns].astype('category')
    return df

# Split the data into training and testing sets
def split_data(df, target):
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    return X_train, X_test, y_train, y_test

# Define the preprocessor
def define_preprocessor(X_train):
    numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

    return preprocessor

# Preprocess the data
def preprocess_data(preprocessor, X_train, X_test):
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    return X_train_processed, X_test_processed

# Calculate class weights for imbalanced datasets
def calculate_class_weights(y_train):
    return len(y_train) / (2 * np.bincount(y_train))

# Convert data to PyTorch tensors
def convert_to_tensors(X_train_processed, y_train, X_test_processed, y_test):
    X_train_tensor = torch.tensor(X_train_processed, dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
    X_test_tensor = torch.tensor(X_test_processed, dtype=torch.float32)
    y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)
    return X_train_tensor, y_train_tensor, X_test_tensor, y_test_tensor

# Define the neural network model
class SimpleNN(torch.nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = torch.nn.Linear(input_dim, 32)
        self.fc2 = torch.nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Updated SklearnSimpleNN class definition
class SklearnSimpleNN(BaseEstimator, ClassifierMixin):
    def __init__(self, input_dim, learning_rate=0.001, epochs=50, batch_size=64, pos_weight=1.0):
        self.input_dim = input_dim
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        self.pos_weight = pos_weight
        self.model = SimpleNN(self.input_dim)

    def fit(self, X, y):
        criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(self.pos_weight, dtype=torch.float32))
        optimizer = torch.optim.Adam(self.model.parameters(), lr=self.learning_rate)
        train_dataset = torch.utils.data.TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32).unsqueeze(1))
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)

        for epoch in range(self.epochs):
            self.model.train()
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = criterion(outputs, targets.view(-1, 1))
                loss.backward()
                optimizer.step()
        return self

    def predict(self, X):
        self.model.eval()
        with torch.no_grad():
            if isinstance(X, np.ndarray):
                X = torch.tensor(X, dtype=torch.float32)
            elif isinstance(X, pd.DataFrame):
                X = torch.tensor(X.values, dtype=torch.float32)
            outputs = self.model(X)
            probabilities = torch.sigmoid(outputs)
            predictions = (probabilities > 0.5).float()
        return predictions.numpy().squeeze()

def train_model(nn_estimator, X_train_tensor, y_train_tensor):
    nn_estimator.fit(X_train_tensor.numpy(), y_train_tensor.numpy())
    return nn_estimator

def evaluate_model(nn_estimator, X_test_tensor, y_test_tensor):
    y_pred = nn_estimator.predict(X_test_tensor.numpy())
    print(classification_report(y_test_tensor.numpy(), y_pred))

"""

# Write the functions to model_pipeline.py (mode "w" overwrites any existing file)
with open("model_pipeline.py", "w") as file:
    file.write(script_content)

print("Functions written successfully to model_pipeline.py")

# Reload the script to make the functions available for use
import importlib
import model_pipeline
importlib.reload(model_pipeline)


Functions written successfully to model_pipeline.py


<module 'model_pipeline' from '/content/model_pipeline.py'>