# Third-Party Models and an Abstraction for Tokenization and Intent Classification

#### This notebook has two parts:
1. Using a Hugging Face Transformer
2. Demonstration of a Transformers-like interface for custom models (own tokenizer and model abstraction), with train, evaluate, and predict functions that can be used to train any model.

## Intent Classification with BERT
I previously trained and fine-tuned a language model with an attention mechanism, but I did not generalize it. Here I use Hugging Face Transformers instead.
This is an example intent classification model using BERT and Hugging Face Transformers.
Steps:
1. Load data
2. Tokenize data
3. Create PyTorch Dataset
4. Train model
5. Evaluate model
6. Save model


In [13]:
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from torch.utils.data import Dataset, DataLoader
import torch
import config_bert as cfg
import warnings
from transformers import TrainerCallback
from machine_learning.model_utils import get_or_create_experiment
from machine_learning.IntentTokenizer import IntentTokenizer
import mlflow
import mlflow.pytorch
import pandas as pd

warnings.filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
print(device)

mps


## Loading the Tokenizer and Model
I use the pretrained BERT tokenizer, which converts text into tokens the model can understand. The model is trained to classify the intent of the text. I use `BertForSequenceClassification`, a pretrained BERT model with a single linear classification layer on top, which suits sequence classification tasks like ours.
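
To make the tokenization step concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers such as BERT's use. The toy vocabulary and the helper names `wordpiece_split` and `toy_tokenize` are illustrative assumptions. The real `bert-base-uncased` vocabulary has roughly 30,000 entries, and the real tokenizer also handles punctuation and text normalization.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_split(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary, for illustration only
TOY_VOCAB = {"flight", "##s", "air", "##fare", "##line", "to", "boston"}
print(toy_tokenize("Flights to Boston airfare", TOY_VOCAB))
```

Words that are missing from the vocabulary as a whole are broken into known pieces where possible, which is why the model rarely sees a fully unknown token.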

In [14]:
# Load the pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load the pre-trained model for sequence classification with the number of labels
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(cfg.le.classes_))
model=model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## PyTorch Dataset
The dataset wraps the encodings produced by the tokenizer and the integer labels produced by the label encoder. It is then used to train the model.

In [15]:
class IntentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length)
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Load data
train_df = pd.read_csv('data/atis/train.tsv', sep='\t', header=None, names=["text", "label"])
test_df = pd.read_csv('data/atis/test.tsv', sep='\t', header=None, names=["text", "label"])

# Assume the second column is the label and the first column is the text
train_texts = train_df["text"].tolist()
test_texts = test_df["text"].tolist()

# The labels are categorical, so they must be integer-encoded before training.
# The fitted LabelEncoder (cfg.le) and the encoded labels come from config_bert.
from sklearn.preprocessing import LabelEncoder
label_encoder = cfg.le
num_labels = len(cfg.le.classes_)

# Tokenize the text and create datasets
max_length = 256  # Max length of the text sequence, you might need to adjust this based on your dataset
train_dataset = IntentDataset(train_texts, cfg.train_labels, tokenizer, max_length)
test_dataset = IntentDataset(test_texts, cfg.test_labels, tokenizer, max_length)
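
The `truncation=True, padding=True, max_length=...` arguments used in `IntentDataset` can be illustrated with a small dependency-free sketch. It assumes that `padding=True` pads to the longest sequence in the batch and that the padding id is 0, as it is for BERT. The function name `pad_and_truncate` and the token ids are made up for illustration. The real tokenizer additionally keeps special tokens such as `[SEP]` in place when it truncates, which this sketch does not do.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target_len = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three sentences of different lengths
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 5, 102]]
encoded = pad_and_truncate(batch, max_length=5)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what allows `__getitem__` to return tensors that can be stacked into a batch.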

## Training
The hyperparameters are defined in `TrainingArguments`. A small callback logs evaluation metrics to MLflow, and the model is then trained.

In [10]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=1e-5,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_strategy="steps",        # or "epoch"
    logging_steps=50,                # log every 50 steps
    save_strategy="no"               # do not save intermediate checkpoints
)

class MLflowLoggingCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Log metrics with MLflow here
        if metrics:
            for key, value in metrics.items():
                mlflow.log_metric(key, value, step=state.global_step)

try:
    # Start an MLflow run and log the hyperparameters
    mlflow.start_run()
    mlflow.log_param("epochs", training_args.num_train_epochs)
    mlflow.log_param("batch_size", training_args.per_device_train_batch_size)
    mlflow.log_param("learning_rate", training_args.learning_rate)
    mlflow.log_param("weight_decay", training_args.weight_decay)
    mlflow.log_param("warmup_steps", training_args.warmup_steps)
    mlflow.log_param("max_length", max_length)
    mlflow.log_param("num_labels", num_labels)
    mlflow.log_param("model", "bert-base-uncased")
except Exception as exc:
    # MLflow logging is optional: report the problem and continue with training
    print(f"MLflow parameter logging skipped: {exc}")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[MLflowLoggingCallback()]
)
trainer.train()
mlflow.end_run()

| Step | Training Loss |
|------|---------------|
| 50   | 3.0494 |
| 100  | 2.1804 |
| 150  | 1.1373 |
| 200  | 0.8117 |
| 250  | 0.556  |
| 300  | 0.4456 |
| 350  | 0.3133 |
| 400  | 0.2608 |
| 450  | 0.2047 |
| 500  | 0.185  |
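
The shape of this loss curve depends in part on the learning-rate schedule. The sketch below reproduces a linear warmup followed by linear decay, which to my knowledge is the default schedule of `Trainer`. It assumes the default base learning rate of 5e-5 (no `learning_rate` is set above), a single device, no gradient accumulation, and the 4,634 training rows reported later in this notebook. The function name `linear_warmup_lr` is hypothetical.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed setup: 4634 training rows, batch size 16, 3 epochs
steps_per_epoch = math.ceil(4634 / 16)
total_steps = steps_per_epoch * 3
print("steps per epoch:", steps_per_epoch, "total steps:", total_steps)

for step in (50, 250, 500, 685, total_steps):
    lr = linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Under these assumptions, `warmup_steps=500` covers more than half of the run, so the peak learning rate is only reached late in training. A shorter warmup may be worth trying on a dataset of this size.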


## Save Model

In [16]:
trainer.save_model('results/final_bert_evaluated')

## Evaluate Model

In [17]:
trainer.evaluate()

{'eval_loss': 0.22830472886562347,
 'eval_runtime': 1.0164,
 'eval_samples_per_second': 836.246,
 'eval_steps_per_second': 13.773,
 'epoch': 3.0}
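
The evaluation output above contains only the loss, because no metric function was passed to the `Trainer`. As a rough sketch, accuracy and macro F1 can be computed from logits and labels as shown below. The helper names and the toy data are assumptions for illustration. A function like this could be adapted and passed through the `compute_metrics` argument of `Trainer`, but that wiring is not shown here.

```python
def argmax_rows(logits):
    """Index of the largest logit in each row, i.e. the predicted class id."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def classification_metrics(y_true, y_pred):
    """Accuracy and macro-averaged F1 over all classes seen in y_true or y_pred."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


# Toy logits for five examples and three classes (made-up numbers)
toy_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 2.2, 0.4], [0.0, 1.1, 0.9], [0.3, 0.2, 3.0]]
toy_labels = [0, 0, 1, 1, 2]
toy_metrics = classification_metrics(toy_labels, argmax_rows(toy_logits))
print(toy_metrics)
```

Macro F1 is useful on ATIS because the intent classes are heavily imbalanced, so accuracy alone can hide poor performance on rare intents.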

# Building a Generic Hugging Face-like Interface
Hugging Face provides its own tokenizer and training interface that abstract away the PyTorch implementation. I show a similar approach here. The classes are implemented in the `machine_learning` directory.

In [18]:
import pandas as pd
import torch
import torch.nn as nn
from machine_learning.IntentTokenizer import IntentTokenizer
from machine_learning.IntentClassifierLSTMWithAttention import IntentClassifierLSTMWithAttention
from machine_learning.model_utils import train, evaluate, predict
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

# Load and preprocess the data
train_df = pd.read_csv('data/atis/train.tsv', sep='\t', header=None, names=["text", "label"])
test_df = pd.read_csv('data/atis/test.tsv', sep='\t', header=None, names=["text", "label"])

Using device: mps


### Own Tokenizer Implementation

In [19]:
tokenizer = IntentTokenizer(train_df)
tokenizer.save_state("models/IntentClassifierLSTMWithAttention_tokenizer.pickle", "models/IntentClassifierLSTMWithAttention_le.pickle")

inside IntentTokenizer
Vocabulary Size: 890
Encoding labels for the first time and adding unknown class.
Label Encoding: {'abbreviation': 0, 'aircraft': 1, 'aircraft+flight+flight_no': 2, 'airfare': 3, 'airfare+flight_time': 4, 'airline': 5, 'airline+flight_no': 6, 'airport': 7, 'capacity': 8, 'cheapest': 9, 'city': 10, 'distance': 11, 'flight': 12, 'flight+airfare': 13, 'flight_no': 14, 'flight_time': 15, 'ground_fare': 16, 'ground_service': 17, 'ground_service+ground_fare': 18, 'meal': 19, 'quantity': 20, 'restriction': 21, '<unknown>': 22}
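
The internals of `IntentTokenizer` live in the `machine_learning` directory and are not shown here. As a minimal sketch of the idea, assuming a whitespace word-level vocabulary with reserved padding and out-of-vocabulary ids and an extra `<unknown>` label class, a tokenizer of this kind might look as follows. The class name `ToyIntentTokenizer` and all of its methods are hypothetical.

```python
class ToyIntentTokenizer:
    """Word-level tokenizer with padding/OOV ids and an '<unknown>' label class."""

    PAD, OOV = 0, 1  # reserved ids: padding and out-of-vocabulary words

    def __init__(self, texts, labels):
        words = sorted({w for text in texts for w in text.lower().split()})
        self.word_to_id = {w: i + 2 for i, w in enumerate(words)}
        # Known classes in sorted order, with the catch-all class appended last
        classes = sorted(set(labels)) + ["<unknown>"]
        self.label_to_id = {c: i for i, c in enumerate(classes)}

    def encode_text(self, text, length):
        """Map words to ids, then truncate or pad to a fixed length."""
        ids = [self.word_to_id.get(w, self.OOV) for w in text.lower().split()]
        ids = ids[:length]
        return ids + [self.PAD] * (length - len(ids))

    def encode_label(self, label):
        """Map a label to its id, falling back to '<unknown>' for unseen labels."""
        return self.label_to_id.get(label, self.label_to_id["<unknown>"])


toy_tok = ToyIntentTokenizer(
    ["show flights to boston", "cheapest airfare to denver"],
    ["flight", "airfare"],
)
print(toy_tok.encode_text("show flights to paris", length=6))
print(toy_tok.label_to_id, toy_tok.encode_label("meal"))
```

Reserving an `<unknown>` class means that a label unseen during training still maps to a valid class id instead of raising an error.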


### Get Data Tensors and Loaders in One Go

I use a tensor dataset made of two tensors: the first holds the token sequences and the second holds the labels.

In [20]:
# Example usage
train_data = tokenizer.process_data(train_df,device=device)
test_data = tokenizer.process_data(test_df,device=device)
print("Number of training samples:", train_data.tensors[0].size())
print("Number of test samples:", test_data.tensors[0].size())

# Create DataLoaders
batch_size = 32
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
print("Number of training batches:", len(train_loader))
print("Number of test batches:", len(test_loader))

Number of training samples: torch.Size([4634, 46])
Number of test samples: torch.Size([850, 30])
Number of training batches: 145
Number of test batches: 27
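
The batch counts follow directly from the dataset sizes: with the default `drop_last=False`, a `DataLoader` yields one extra, smaller batch for the remainder. A short check of that arithmetic, with a hypothetical helper name `num_batches`:

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for the given dataset and batch size."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


print(num_batches(4634, 32))   # 145 training batches
print(num_batches(850, 32))    # 27 test batches
print(4634 % 32, 850 % 32)     # sizes of the final, partial batches: 26 and 18
```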


### Define Hyperparameters

In [21]:
# Define the loss function and the hyperparameters
loss_function = nn.CrossEntropyLoss()
learning_rate = 0.01              # If you set this too high, it might explode. If too low, it might not learn
weight_decay = 1e-7               # Regularization strength
dropout_rate = 0.3                 # Dropout rate
embedding_dim = 64                # Size of each embedding vector
hidden_dim = 128                 # Number of features in the hidden state of the LSTM
batch_size = 32                  # Number of samples in each batch
output_dim = len(IntentTokenizer.le.classes_)  # Number of classes
num_epochs = 5            # Number of times to go through the entire dataset
vocab_size = tokenizer.max_vocab_size + 1  # The size of the vocabulary
# Create a string that summarizes these parameters
params_str = f"Vocab Size: {vocab_size}\n" \
             f"Embedding Dim: {embedding_dim}\n" \
             f"Hidden Dim: {hidden_dim}\n" \
             f"Output Dim: {output_dim}\n" \
             f"Dropout Rate: {dropout_rate}\n" \
             f"learning Rate: {learning_rate}\n" \
             f"epochs: {num_epochs}"
print(params_str)

Vocab Size: 891
Embedding Dim: 64
Hidden Dim: 128
Output Dim: 23
Dropout Rate: 0.3
learning Rate: 0.01
epochs: 5


### Train, Evaluate, and Predict Abstraction
With three or four lines of code, you can train and evaluate almost any intent classification model.

In [22]:
# Pick a model, train it, and evaluate it on the test set.
# To switch models, uncomment the one you want to train and comment out the other.
# IntentClassifierLSTM is a simple LSTM model. IntentClassifierLSTMWithAttention is an LSTM model with attention.
# The latter performs better: the difference in accuracy between the two models is about 3%.

# model = IntentClassifierLSTM(vocab_size, embedding_dim, hidden_dim, output_dim, dropout_rate).to(device)
model = IntentClassifierLSTMWithAttention(vocab_size, embedding_dim, hidden_dim, output_dim, dropout_rate).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
train(model, optimizer, loss_function, train_loader, num_epochs)
evaluate(model, loss_function, test_loader)

Epoch [1/5], Loss: 0.7383, Accuracy: 0.8312
Epoch [2/5], Loss: 0.2805, Accuracy: 0.9312
Epoch [3/5], Loss: 0.1614, Accuracy: 0.9592
Epoch [4/5], Loss: 0.1452, Accuracy: 0.9640
Epoch [5/5], Loss: 0.0954, Accuracy: 0.9756
Test Loss: 0.3663
Test Accuracy: 0.9400


0.94
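
The architecture of `IntentClassifierLSTMWithAttention` is defined in the `machine_learning` directory and is not shown here. As an illustration of what "LSTM with attention" commonly means, the sketch below pools a sequence of hidden states into one context vector using softmax weights from a dot product with a query vector. This is one generic formulation, not necessarily the one used in that class, and the names and numbers are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each time step by softmax(h_t . query) and sum into one context vector."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights


# Three time steps with two hidden features each (toy values)
toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 1.0]  # in a real model this vector would be learned
context, weights = attention_pool(toy_states, toy_query)
print("attention weights:", [round(w, 3) for w in weights])
print("context vector:", [round(c, 3) for c in context])
```

A plain LSTM classifier typically uses only the final hidden state, whereas attention pooling lets every time step contribute in proportion to its weight.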

### Saving the Model and Tokenizer

In [23]:
# Save the model and tokenizer for serving.
model_name = "IntentClassifierLSTMWithAttention"
torch.save(model.to(torch.device("cpu")),f"models/{model_name}.pth")
tokenizer.save_state(f"models/{model_name}_tokenizer.pickle", f"models/{model_name}_le.pickle")

### Model Serving

In [24]:
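# Serve the model
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
# The file stores the whole pickled model, so PyTorch 2.6 and later need weights_only=False
model_serve = torch.load(f"models/{model_name}.pth", weights_only=False).to(device)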

### Model Predictions

In [25]:
# Predict on a query
max_query_length = 50
query_text = "what airlines off from love field between 6 and 10 am on june sixth"
query = pd.DataFrame({"text": [query_text]})
prediction = predict(model_serve, query,tokenizer,device)
print(f"Predicted label: {prediction}")

Predicted label: ['airline']
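
The `predict` helper returns only the label. For serving, a confidence score is often useful as well. The sketch below shows one way to derive it, assuming access to the raw logits of the model: apply a softmax, take the top probability, and fall back to `<unknown>` when it is below a threshold. The function name, the threshold of 0.5, and the example logits are all assumptions, not part of `machine_learning.model_utils`.

```python
import math


def predict_with_confidence(logits, class_names, threshold=0.5, fallback="<unknown>"):
    """Return (label, confidence), using the fallback label for low-confidence predictions."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    label = class_names[best] if confidence >= threshold else fallback
    return label, confidence


toy_classes = ["airfare", "airline", "flight"]
confident = predict_with_confidence([0.2, 3.1, 0.5], toy_classes)
uncertain = predict_with_confidence([1.0, 1.1, 0.9], toy_classes)
print(confident)
print(uncertain)
```

Softmax probabilities are not calibrated by default, so any threshold should be tuned on held-out data rather than taken at face value.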


# Conclusion
In these notebooks, I have accomplished the following:
1. Showed how to train a simple model and add an attention mechanism to improve accuracy.
2. Built a better model through hyperparameter tuning and parameter logging.
3. Covered model management, the model registry, and experiment management, which are very important parts of machine learning engineering.
4. Evaluated the model on test data and on its performance at production time, including confidence scores and a performance improvement from a distillation approach (using GPT-4 to create OOS data and fine-tune our best model) to improve production accuracy and performance.
5. Used a pretrained Hugging Face Transformers model, fine-tuned on the ATIS dataset.
6. Built a Transformers-like abstract interface to the LSTM with Attention model, i.e. hiding PyTorch and only allowing parameters to pass through (code in the `machine_learning` folder).

I hope this resolves many of the questions in the challenge. There is much more one could do: visualize the impact of attention and embeddings, implement A/B testing, add logging in production, discuss more distillation approaches, and so on. Looking forward to more fun :)
