# **Tabular Classification with Modlee: An End-to-End Tutorial**

In this tutorial, we'll guide you through a complete project using the Modlee package for tabular data classification. We'll use a diabetes dataset to show you how to:

1. Prepare the Data: Load and preprocess the dataset, including scaling and splitting into training and validation sets.
2. Use Modlee for Model Training: Train a model using Modlee's framework.
3. Implement and Train a Custom Model: Build a custom model, integrate it with Modlee, and train it.
4. Evaluate the Model: Assess the performance of the custom model on the validation data.

## Tips
For best performance, ensure that the runtime is set to use a GPU (`Runtime > Change runtime type > T4 GPU`).

## Help & Questions

If you have any questions, please reach out on our [Discord](https://discord.gg/dncQwFdN9m).

You can also use our [documentation](https://docs.modlee.ai/README.html) as a reference for using our package.

# **Environment Setup**
## Step 1: Installing the Required Packages

First, we need to make sure that we have the necessary packages installed. We will need `modlee` and its related packages.

In [None]:
!pip3 install modlee torch torchvision pytorch-lightning torchtext==0.18.0

## Step 2: Importing Libraries and Setting Up Environment

In this section, we import the necessary libraries and set up the environment. We include a workaround for SSL verification errors that can occur in some environments. Note that the workaround disables certificate verification for HTTPS requests made from this session.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import os
import modlee
import lightning.pytorch as pl
from torch.utils.data import DataLoader, TensorDataset, random_split
from sklearn.model_selection import train_test_split
import ssl
ssl._create_default_https_context = ssl._create_unverified_context


## Step 3: Setting Up Modlee API Key

Here, we will set our Modlee API key and initialize the Modlee package.
Make sure that you have a Modlee account and an API key [from the dashboard](https://www.dashboard.modlee.ai/).
Replace `replace-with-your-api-key` with your API key.

In [None]:
# Set the API key to an environment variable,
# to simulate setting this in your shell profile
os.environ['MODLEE_API_KEY'] = "replace-with-your-api-key"
modlee.init(api_key=os.environ['MODLEE_API_KEY'])
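
Typing a key directly into a notebook makes it easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the key has already been exported as `MODLEE_API_KEY` in your shell. The helper name `read_api_key` is hypothetical and is not part of Modlee.

```python
import os

def read_api_key(env=None, name="MODLEE_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    Hypothetical helper for illustration; not a Modlee function.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Reject a missing key and the tutorial's placeholder value
    if not key or key == "replace-with-your-api-key":
        raise RuntimeError(f"Set the {name} environment variable to your Modlee API key")
    return key

# Demonstration with an explicit mapping, so the sketch runs without a real key
demo_key = read_api_key({"MODLEE_API_KEY": "example-key"})
print(demo_key)
```

The returned value can then be passed to `modlee.init(api_key=...)` in the same way as the cell above.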


# **Dataset Preparation**
## Step 1: Downloading the Dataset

For this example, we will manually download the diabetes dataset from Kaggle and upload it to your Google Colab environment.

1. Visit the [Diabetes CSV dataset page](https://www.kaggle.com/datasets/saurabh00007/diabetescsv) on Kaggle.
2. Click the "Download" button to save the dataset (diabetes.csv) to your local machine.
3. In your Colab notebook, click on the file icon on the left side and upload the CSV file.

This section ensures that the dataset is ready for use in the subsequent steps. The path to the dataset in the Colab environment will be `/content/diabetes.csv`, which you can then use in your data processing code.


## Step 2: Defining a Custom Dataset Class

We define a custom dataset class, `TabularDataset`, for handling our tabular data. This class subclasses `TensorDataset` (itself a `torch.utils.data.Dataset`) and manages the features (`X`) and labels (`y`) of our dataset.

In [None]:
class TabularDataset(TensorDataset):
    def __init__(self, data, target):
        self.data = torch.tensor(data, dtype=torch.float32)  # Convert features to tensors
        self.target = torch.tensor(target, dtype=torch.long) # Convert labels to long integers for classification

    def __len__(self):
        return len(self.data) # Return the size of the dataset

    def __getitem__(self, idx):
        return self.data[idx], self.target[idx] # Return a single sample from the dataset

## Step 3: Loading and Preprocessing the Data

This section explains how to load and prepare the diabetes dataset for training.

First, the data is loaded into a `Pandas DataFrame`, with `features (X)` and `labels (y)` separated. The labels are in the `'Outcome'` column. The features are then scaled using `StandardScaler` to ensure consistent input for the model.

Next, a custom `TabularDataset` is created to handle this data in `PyTorch`. The dataset is split into training and validation sets, and `DataLoader` instances are created to efficiently manage data during training.

Finally, the function returns these `DataLoaders` for use in the model training process.
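
To make the scaling step concrete, here is a rough pure-Python sketch of the per-column arithmetic that standardization performs: subtract the column mean and divide by the population standard deviation (`ddof=0`, which is what `StandardScaler` uses). The function name `standardize` and the sample values are made up for illustration.

```python
import statistics

def standardize(column):
    """Scale one feature column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    # Population standard deviation (ddof=0); fall back to 1.0 for constant columns
    std = statistics.pstdev(column) or 1.0
    return [(value - mean) / std for value in column]

glucose = [85.0, 120.0, 155.0]  # made-up values, not taken from the dataset
scaled = standardize(glucose)
print(scaled)
```

After scaling, each column has a mean of roughly 0 and a standard deviation of roughly 1, so features measured in different units contribute on a comparable scale.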

In [None]:
def get_diabetes_dataloaders(batch_size=32, val_split=0.2, shuffle=True):
    dataset_path = "/content/diabetes.csv"  # Path of the uploaded file in Colab
    df = pd.read_csv(dataset_path) # Load the CSV file into a DataFrame
    X = df.drop('Outcome', axis=1).values # Features (X) - drop the target column
    y = df['Outcome'].values # Labels (y) - the target column
    scaler = StandardScaler() # Initialize the scaler for feature scaling
    X_scaled = scaler.fit_transform(X) # Scale the features
    dataset = TabularDataset(X_scaled, y) # Create a TabularDataset instance

    # Split the dataset into training and validation sets
    dataset_size = len(dataset)
    val_size = int(val_split * dataset_size)
    train_size = dataset_size - val_size
    train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

    # Create DataLoader instances for training and validation
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)  # No need to shuffle validation data

    return train_dataloader, val_dataloader

# Generate the DataLoaders
train_dataloader, val_dataloader = get_diabetes_dataloaders(batch_size=32, val_split=0.2, shuffle=True)
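
The split arithmetic in the function above can be checked by hand. The sketch below mirrors it in plain Python and also counts batches per epoch, relying on the fact that `DataLoader` keeps the last partial batch by default. The helper `split_sizes` is hypothetical, and the row count of 768 is an assumed figure for illustration; use `len(df)` for the real value.

```python
import math

def split_sizes(dataset_size, val_split=0.2, batch_size=32):
    """Mirror the train/validation size arithmetic used in the tutorial."""
    val_size = int(val_split * dataset_size)  # int() truncates the fractional part
    train_size = dataset_size - val_size
    # With drop_last=False (the DataLoader default), a partial final batch is kept
    train_batches = math.ceil(train_size / batch_size)
    val_batches = math.ceil(val_size / batch_size)
    return train_size, val_size, train_batches, val_batches

# 768 is an assumed row count, not read from the file
print(split_sizes(768))  # (615, 153, 20, 5)
```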

# **Defining the Custom Model**
Here, we define a simple feedforward neural network, `TabularClassifier`, with several fully connected layers and dropout for regularization.

The network has four fully connected layers: three hidden layers of 128, 64, and 32 units, followed by an output layer with one unit per class (2 here). Each hidden layer uses the `SELU` activation for non-linearity, followed by `AlphaDropout` to reduce overfitting. This small architecture is a reasonable baseline for tabular classification.
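
For readers unfamiliar with SELU, the following is a small scalar sketch of the function, using the fixed constants from its standard definition. It is only meant to illustrate the shape of the activation; in the model itself, `torch.selu` applies the same formula element-wise to tensors.

```python
import math

# Fixed constants from the SELU definition
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    """Scalar SELU: scaled identity for x > 0, scaled exponential curve otherwise."""
    return SCALE * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))

for value in (-2.0, 0.0, 1.0):
    print(value, selu(value))
```

Positive inputs are multiplied by a constant slightly above 1, while negative inputs saturate towards `-SCALE * ALPHA`. `AlphaDropout` is designed to be paired with this activation.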

In [None]:

class TabularClassifier(modlee.model.TabularClassificationModleeModel):
    def __init__(self, input_dim, num_classes=2):
        super().__init__()
        self.fc1 = torch.nn.Linear(input_dim, 128)  # First hidden layer
        self.dropout1 = torch.nn.AlphaDropout(0.1)  # Dropout to prevent overfitting

        self.fc2 = torch.nn.Linear(128, 64)  # Second hidden layer
        self.dropout2 = torch.nn.AlphaDropout(0.1)  # Dropout to prevent overfitting

        self.fc3 = torch.nn.Linear(64, 32)  # Third hidden layer
        self.dropout3 = torch.nn.AlphaDropout(0.1)  # Dropout to prevent overfitting

        self.fc4 = torch.nn.Linear(32, num_classes)  # Output layer

        self.loss_fn = torch.nn.CrossEntropyLoss()

    def forward(self, x):
        x = torch.selu(self.fc1(x))  # Apply SELU activation to the first layer
        x = self.dropout1(x)  # Apply dropout

        x = torch.selu(self.fc2(x))  # Apply SELU activation to the second layer
        x = self.dropout2(x)  # Apply dropout

        x = torch.selu(self.fc3(x))  # Apply SELU activation to the third layer
        x = self.dropout3(x)  # Apply dropout

        x = self.fc4(x)  # Output layer returns raw logits; CrossEntropyLoss applies softmax internally
        return x
    
    def training_step(self, batch, batch_idx):
        x, y_target = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y_target.squeeze()) # Calculate the loss
        return {"loss": loss}

    def validation_step(self, val_batch, batch_idx):
        x, y_target = val_batch
        y_pred = self(x)
        val_loss = self.loss_fn(y_pred, y_target.squeeze()) # Calculate validation loss
        return {'val_loss': val_loss}

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.001, momentum=0.9)  # Define the optimizer
        return optimizer
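
As a quick sanity check on the size of this network, the sketch below counts trainable parameters by hand. Each `Linear` layer has `in_features * out_features` weights plus `out_features` biases, and the dropout layers have none. It assumes `input_dim=8`, which matches the `in_features` value in the saved model graph shown later in this tutorial.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Layer widths: input (assumed 8), three hidden layers, output (2 classes)
layer_sizes = [8, 128, 64, 32, 2]
total = sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)  # 11554
```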

# **Integrating the Custom Model with Modlee**


The `TabularClassifier` class defined above integrates with Modlee's framework by subclassing `modlee.model.TabularClassificationModleeModel`. It handles the forward pass, calculates the loss during training and validation, and sets up the optimizer (SGD), so the model can be trained and evaluated within the Modlee environment.

# **Training the Model**

Here, we train the `TabularClassifier` using PyTorch Lightning. We first read the input dimension from the dataset and set the number of classes. After initializing the model, we use PyTorch Lightning's `Trainer` to train the model for one epoch, running validation on the validation set. This setup simplifies managing the training and validation process.

In [None]:
# Get the input dimension
original_train_dataset = train_dataloader.dataset.dataset # Access the original dataset
input_dim = len(original_train_dataset[0][0])
num_classes = 2  # Binary classification

# Initialize the Modlee model
modlee_model = TabularClassifier(input_dim=input_dim, num_classes=num_classes)

# Train the model using PyTorch Lightning
with modlee.start_run() as run:
    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(
        model=modlee_model,
        train_dataloaders=train_dataloader,
        val_dataloaders=val_dataloader
    )


# **Evaluating the Model**
After training, we evaluate the model by predicting on the validation set and calculating the accuracy. We disable gradient computation for efficiency, collect predictions and true labels, and use `accuracy_score` to measure performance. Finally, we print the accuracy to assess the model's effectiveness.

In [9]:
from sklearn.metrics import accuracy_score

# Evaluate the model's performance
modlee_model.eval() # Set the model to evaluation mode
y_pred = []
y_true = []
with torch.no_grad(): # Disable gradient computation
    for batch in val_dataloader:
        X_batch, y_batch = batch
        outputs = modlee_model(X_batch) # Get model predictions
        predictions = torch.argmax(outputs, dim=1)  # Pick the class with the highest logit
        y_pred.extend(predictions.numpy()) # Store predictions
        y_true.extend(y_batch.numpy()) # Store true labels

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"Model accuracy: {accuracy:.2f}")

Model accuracy: 0.73
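
To show what the evaluation loop computes, here is a small pure-Python sketch of the same two steps: taking the argmax over each row of logits, then measuring the fraction of predictions that match the labels. The logits and labels are made up for illustration, and the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose argmax matches the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == t) for p, t in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four samples and two classes
logits = [[2.0, -1.0], [0.1, 0.4], [-0.5, 0.3], [1.2, 0.7]]
labels = [0, 1, 0, 0]
print(f"Model accuracy: {accuracy(logits, labels):.2f}")  # Model accuracy: 0.75
```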


# **Document and Inspect Artifacts**

After training, we inspect the artifacts saved by Modlee, including the model graph and various statistics.

In [10]:
import sys

# Get the path to the last run's saved data
last_run_path = modlee.last_run_path()
print(f"Run path: {last_run_path}")

# Get the path to the saved artifacts
artifacts_path = os.path.join(last_run_path, 'artifacts')
artifacts = os.listdir(artifacts_path)
print(f"Saved artifacts: {artifacts}")

# Set the artifacts path as an environment variable
os.environ['ARTIFACTS_PATH'] = artifacts_path

# Add the artifacts directory to the system path
sys.path.insert(0, artifacts_path)

Run path: /Users/mansiagrawal/Documents/modlee_pypi/src/modlee/notebooks_tests/mlruns/0/39c72338970842388573289203ee6568
Saved artifacts: ['model_metafeatures', 'model_size', 'model_summary.txt', 'checkpoints', 'model.py', 'cached_vars', 'stats_rep', 'model', 'model_graph.py', 'model_graph.txt', 'data_metafeatures']


In [11]:
# Print out the first few lines of the model
print("Model graph:")
!sed -n -e 1,15p $ARTIFACTS_PATH/model_graph.py
!echo "        ..."
!sed -n -e 58,68p $ARTIFACTS_PATH/model_graph.py
!echo "        ..."

Model graph:

import torch, onnx2torch
from torch import tensor
class Model(torch.nn.Module):
    
    def __init__(self):
        super().__init__()
        setattr(self,'Gemm', torch.nn.modules.linear.Linear(**{'in_features':8,'out_features':128}))
        setattr(self,'Selu', torch.nn.modules.activation.SELU(**{'inplace':False}))
        setattr(self,'Gemm_1', torch.nn.modules.linear.Linear(**{'in_features':128,'out_features':64}))
        setattr(self,'Selu_1', torch.nn.modules.activation.SELU(**{'inplace':False}))
        setattr(self,'Gemm_2', torch.nn.modules.linear.Linear(**{'in_features':64,'out_features':32}))
        setattr(self,'Selu_2', torch.nn.modules.activation.SELU(**{'inplace':False}))
        setattr(self,'Gemm_3', torch.nn.modules.linear.Linear(**{'in_features':32,'out_features':2}))
    def forward(self, input_1):
        ...
        ...


In [12]:
# Print the first lines of the data metafeatures
print("Data metafeatures:")
!head -20 $ARTIFACTS_PATH/stats_rep

Data metafeatures:
{
  "batch_element_0_mean": [
    -0.027250396087765694,
    0.0191038828343153,
    -0.019142156466841698,
    -0.004866223782300949,
    0.03995737433433533,
    0.025532608851790428,
    -0.004861955530941486,
    -0.014338756911456585
  ],
  "batch_element_0_median": [
    -0.2509521245956421,
    -0.12188771367073059,
    0.149640753865242,
    0.15453319251537323,
    -0.38030633330345154,
    0.0009419787675142288,
    -0.32579922676086426,
    -0.3608474135398865
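
The output above is cut off by `head -20`. As an alternative, the sketch below loads the file in Python and summarizes its keys. It assumes `stats_rep` is a JSON file, as the printed output suggests. The helper `summarize_stats` is hypothetical, and the demonstration writes a tiny made-up file rather than reading a real run.

```python
import json
import os
import tempfile

def summarize_stats(path, max_keys=5):
    """Map the first few keys to their list length (or raw value if not a list)."""
    with open(path, encoding="utf-8") as handle:
        stats = json.load(handle)  # Assumes the file is valid JSON
    summary = {}
    for key in list(stats)[:max_keys]:
        value = stats[key]
        summary[key] = len(value) if isinstance(value, list) else value
    return summary

# Demonstration on a made-up file; for a real run, pass
# os.path.join(artifacts_path, "stats_rep") instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stats_rep")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump({"batch_element_0_mean": [0.1, -0.2],
                   "batch_element_0_median": [0.0, 0.3]}, handle)
    demo_summary = summarize_stats(demo_path)

print(demo_summary)
```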


# **Great Job!**

We've successfully completed a machine learning project using the Modlee package for tabular data classification. Here's what we covered:

- Loaded and prepared the diabetes dataset: We imported, scaled, and split the data into training and validation sets.
- Built and trained a model: We used the `TabularClassifier` and integrated it with Modlee's framework for efficient training.
- Evaluated the model: We assessed the model's performance using accuracy on the validation set.

This project has given you a solid foundation in using Modlee for tabular data. You now know how to prepare data, train models, and evaluate their performance. Keep exploring and building on this knowledge!