## Natural language understanding from traditional methods to large language models

Intent classification (IC) and slot labeling (SL) are two key components of the natural language understanding (NLU) block in a dialogue system.

There is a dependency between the intent type (e.g. play music) and the possible slots (e.g. artist name, song, genre). Traditional systems often use a cascaded approach or a naive joint approach, in which an error in intent classification propagates to the slot-labeling task.
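To make this dependency concrete, the sketch below shows how a cascaded pipeline lets an intent error propagate into slot labeling. The intent names, slot names and the `ALLOWED_SLOTS` table are illustrative assumptions and are not taken from any dataset.

```python
# Hypothetical table of the slot types each intent permits (illustrative only).
ALLOWED_SLOTS = {
    "play_music": {"artist_name", "song", "genre"},
    "set_alarm": {"time", "date"},
}

def cascaded_slot_filter(predicted_intent, candidate_slots):
    """Keep only the candidate slots that the predicted intent permits."""
    allowed = ALLOWED_SLOTS.get(predicted_intent, set())
    return {name: value for name, value in candidate_slots.items() if name in allowed}

candidates = {"artist_name": "miles davis", "genre": "jazz"}

# Correct intent: both slots survive.
right = cascaded_slot_filter("play_music", candidates)
# Wrong intent: the same slots are discarded, so the intent error propagates.
wrong = cascaded_slot_filter("set_alarm", candidates)
print(right)
print(wrong)
```

A joint model avoids this hard filtering step by predicting the intent and the slots together.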

Addressing both tasks properly can improve both the performance and the efficiency of a dialogue system. Instruction-tuned large language models can perform the two tasks jointly.

The goal of this project is to compare traditional methods with more recent methods based on large language models for NLU.

- Implement IC and SL using methods based on word embeddings (e.g. word2vec or GloVe).
- Implement IC and SL by fine-tuning a pre-trained language model (e.g. BERT or T5).
- Implement IC and SL using in-context learning without any fine-tuning (e.g. OLMo-7B-Instruct or Gemma-2B-Instruct).
- Run all the above experiments on two standard datasets (e.g. NLU-Evaluation benchmark, SNIPS, Banking77).
- Compare all the systems, analyze the results and summarize your findings.

**References**
* Liu et al., Benchmarking Natural Language Understanding Services for Building Conversational Agents
* Dataset: NLU-Evaluation Benchmark
* Weld et al., A Survey of Joint Intent Detection and Slot-Filling Models in Natural Language Understanding
* Han et al., Bi-directional Joint Neural Networks for Intent Classification and Slot Filling
* Hugging Face Transformers, for pre-trained models

*First, specify the directory where the NLTK data should be downloaded, if you have not done so already.*

In [None]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from nltk.tokenize import word_tokenize
import nltk

download_dir = '' # your `nltk_data` download directory

nltk.download('punkt_tab', download_dir=download_dir)
nltk.data.path.append(download_dir)

file_path = 'NLU-Evaluation-Data-master/AnnotatedData/NLU-Data-Home-Domain-Annotated-All.csv'

### Data Processing

In [2]:
def get_data(file):
    data = pd.read_csv(file, delimiter=';')
    return data

In [None]:
data = get_data(file_path)
data.head()

The columns *"intent"*, *"scenario"* and *"answer_annotation"* are relevant for our **IC** and **SL** tasks.

In [None]:
data.info()

The *"status"* and *"notes"* columns also contain **NaN** values, so we inspect their distinct values:

In [None]:
status_values = data['status'].unique()
notes_values = data['notes'].unique()

print('Status values:', status_values)
print('Notes values:', notes_values)

In [None]:
status_values = data['status'].value_counts()
notes_values = data['notes'].value_counts()

print('Status values:', status_values)
print('Notes values:', notes_values)

How many **NaN** values are in those columns:

In [None]:
NaN_status_values = data['status'].isna().sum()
NaN_notes_values = data['notes'].isna().sum()

print('NaN status values:', NaN_status_values)
print('NaN notes values:', NaN_notes_values)

#### Data Cleaning
We remove the rows whose *"status"* value starts with **'IRR_'**, as these utterances are ignored by the post-processing scripts. We then fill the remaining NaN values: empty strings for *"status"* and *"notes"*, and fixed placeholder values for the other columns.

In [None]:
data.isnull().sum()

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning on the column assignments below
data = data[~data['status'].str.startswith('IRR_', na=False)].copy()

data['userid'] = data['userid'].fillna('1.0')
data['scenario'] = data['scenario'].fillna('audio')
data['intent'] = data['intent'].fillna('volume_mute')
data['status'] = data['status'].fillna('')
data['notes'] = data['notes'].fillna('')
data['answer_normalised'] = data['answer_normalised'].fillna('stop')
data['answer'] = data['answer'].fillna('stop')
data['question'] = data['question'].fillna('Write what you would tell your PDA in the foll...')

data['answerid'] = data.index

data.reset_index(drop=True, inplace=True)
data.head()

In [None]:
data.isnull().sum()

In [None]:
sentences = data['answer_normalised'] 
intents = data['intent']

print("Sample Sentences:")
print(sentences.head())
print("\nSample Intents:")
print(intents.head())

We also encode the labels, using the intents.

In [None]:
label_encoder = LabelEncoder()
encoded_intents = label_encoder.fit_transform(intents)

print("Intent Label Mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"{i}: {label}")
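As a minimal sketch of what the label encoding does, the plain-Python version below maps each distinct label to an integer in sorted order and back again. The function name `fit_label_mapping` and the toy intent names are hypothetical and used only for illustration.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer id, in sorted order."""
    classes = sorted(set(labels))
    to_id = {label: i for i, label in enumerate(classes)}
    return classes, to_id

# Made-up intent names, for illustration only.
toy_intents = ["weather_query", "alarm_set", "weather_query", "music_play"]

classes, to_id = fit_label_mapping(toy_intents)
encoded = [to_id[label] for label in toy_intents]   # label -> id
decoded = [classes[i] for i in encoded]             # id -> label
print(classes)
print(encoded)
```

Decoding with the same mapping is what later turns a predicted class id back into an intent name.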

### **Intent Classification (IC)** and **Slot-Labeling (SL)** using Word2Vec embeddings

We split the dataset into training (70%), validation (15%) and test (15%) subsets using the ``train_test_split`` function from the *sklearn.model_selection* module.

In [None]:
X_train, X_val_test, y_train, y_val_test = train_test_split(sentences, encoded_intents, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

print("Training Samples:", len(X_train), "(70%)")
print("Validation Samples:", len(X_val), "(15%)")
print("Testing Samples:", len(X_test), "(15%)")
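The 70/15/15 proportions follow from the two-stage split: 30% is held out first, and that part is then halved. The sketch below works through the arithmetic, assuming each held-out part is rounded up (which is, to our understanding, how a fractional `test_size` is handled). The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, holdout_fraction=0.3, test_share=0.5):
    """Return (train, validation, test) sizes for a two-stage split."""
    n_holdout = math.ceil(n_samples * holdout_fraction)  # first split: train vs. the rest
    n_train = n_samples - n_holdout
    n_test = math.ceil(n_holdout * test_share)           # second split: validation vs. test
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

sizes = split_sizes(1000)
print(sizes)
```

Whatever the rounding, the three sizes always add up to the number of samples.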

We tokenize text data from training, validation, and test datasets using the ``word_tokenize`` function.

In [62]:
X_train_tokens = X_train.apply(lambda x: word_tokenize(x.lower()))
X_val_tokens = X_val.apply(lambda x: word_tokenize(x.lower()))
X_test_tokens = X_test.apply(lambda x: word_tokenize(x.lower()))

In [None]:
print(X_train.head())
print()
print(X_train_tokens.head())

Then we train the Word2Vec model on the tokenized training data.

In [64]:
w2v_model = Word2Vec(X_train_tokens.tolist(), vector_size=100, window=5, min_count=1, workers=4)

Averaging the word embeddings of a sentence gives a sentence-level embedding; a sentence with no known words maps to the zero vector.

In [65]:
def get_sentence_embedding(tokens, model):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)
    
X_train_emb = np.array([get_sentence_embedding(tokens, w2v_model) for tokens in X_train_tokens])
X_val_emb = np.array([get_sentence_embedding(tokens, w2v_model) for tokens in X_val_tokens])
X_test_emb = np.array([get_sentence_embedding(tokens, w2v_model) for tokens in X_test_tokens])
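The averaging step can be checked by hand on a toy example. The sketch below uses plain Python lists and made-up 2-dimensional vectors; `average_embedding` and `toy_vectors` are hypothetical names, not part of the notebook.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(component) / len(known) for component in zip(*known)]

# Made-up 2-dimensional word vectors.
toy_vectors = {"play": [1.0, 0.0], "jazz": [0.0, 1.0]}

# "some" is out of vocabulary and is skipped.
emb = average_embedding(["play", "some", "jazz"], toy_vectors, dim=2)
empty = average_embedding(["unknown"], toy_vectors, dim=2)
print(emb)
print(empty)
```

Averaging discards word order, so "play jazz" and "jazz play" get the same embedding; this is one reason such features are limiting for token-level tasks.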

Now we can train one or several models on the training data. Here we will use the **Logistic Regression** and the **Random Forest** classifiers.

In [None]:
# Logistic Regression
lr_model = LogisticRegression(max_iter=500, random_state=42)
lr_model.fit(X_train_emb, y_train)

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_emb, y_train)

In [67]:
def evaluate_model(model, X, y_true, dataset_name, labels, print_matrix=False, print_report=False):
    y_pred = model.predict(X)
    accuracy = accuracy_score(y_true, y_pred)

    print(f"Accuracy on {dataset_name} set: {accuracy:.4f}")

    if print_report:
        print(f"Classification Report for {dataset_name} set:")
        print(classification_report(y_true, y_pred, target_names=labels))
    
    if print_matrix:
        conf_matrix = confusion_matrix(y_true, y_pred)

        plt.figure(figsize=(12, 10))
        ax = sns.heatmap(
            conf_matrix,
            annot=False,
            cmap="viridis",
            fmt="d", 
            xticklabels=labels,
            yticklabels=labels,
            cbar_kws={'label': 'Count'}
        )

        ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", fontsize=8)
        ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=8)

        plt.xlabel("Predicted Label", fontsize=12)
        plt.ylabel("True Label", fontsize=12)
        plt.title(f"Confusion Matrix for {dataset_name} data", fontsize=14)

        plt.tight_layout()
        plt.show()

Logistic Regression validation and testing evaluation

In [None]:
print("Logistic Regression:")
evaluate_model(lr_model, X_val_emb, y_val, "Validation", label_encoder.classes_)
evaluate_model(lr_model, X_test_emb, y_test, "Test", label_encoder.classes_, print_matrix=True, print_report=True)

Random Forest validation and testing evaluation

In [None]:
print("Random Forest:")
evaluate_model(rf_model, X_val_emb, y_val, "Validation", label_encoder.classes_)
evaluate_model(rf_model, X_test_emb, y_test, "Test", label_encoder.classes_, print_matrix=True)

Overall accuracy on the validation and test sets ranges from 0.46 to 0.51.

In each confusion matrix, the diagonal holds the correct predictions, where the true label matches the predicted label. The brighter a diagonal cell, the more samples of that class were classified correctly.

All other cells are misclassifications: a brighter off-diagonal cell means that pair of classes is confused more often.
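Because the heatmap shows raw counts, a bright diagonal cell can simply reflect a frequent class. A minimal sketch, on made-up labels, of how per-class recall can be read from the matrix by dividing each diagonal entry by its row total (rows are true labels, columns are predictions):

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with true labels as rows and predicted labels as columns."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Diagonal entry divided by the row total, for each class."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls

# Made-up labels for three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

matrix = confusion_counts(y_true, y_pred, 3)
recalls = per_class_recall(matrix)
print(matrix)
print(recalls)
```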

In [None]:
def predict_intent(sentence, model, w2v_model, label_encoder):
    tokens = word_tokenize(sentence.lower())
    emb = get_sentence_embedding(tokens, w2v_model)
    pred = model.predict([emb])
    return label_encoder.inverse_transform(pred)[0]

sentence = "What is the weather in Paris?"
print(f"Sentence: {sentence}")
print(f"Predicted Intent: {predict_intent(sentence, lr_model, w2v_model, label_encoder)}")

In [None]:
sentence = "Wake me up at 7 am"
print(f"Sentence: {sentence}")
print(f"Predicted Intent: {predict_intent(sentence, lr_model, w2v_model, label_encoder)}")

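We can fill the missing values of the *"suggested_entities"* column to complete the dataset. Note that the function below fills them with the predicted intent label, not with entity annotations.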

In [None]:
def predict_missing_entities(row, model, w2v_model, label_encoder):
    if pd.isna(row['suggested_entities']):
        tokens = word_tokenize(row['answer_normalised'].lower())
        emb = get_sentence_embedding(tokens, w2v_model)
        pred = model.predict([emb])
        predicted_intent = label_encoder.inverse_transform(pred)[0]
        return predicted_intent
    return row['suggested_entities']

data['suggested_entities'] = data.apply(lambda row: predict_missing_entities(row, lr_model, w2v_model, label_encoder), axis=1)
data.isnull().sum()

Let's also try to predict the scenarios

In [None]:
scenario_encoder = LabelEncoder()
encoded_scenarios = scenario_encoder.fit_transform(data['scenario'])

print("Scenarios Label Mapping:")
for i, label in enumerate(scenario_encoder.classes_):
    print(f"{i}: {label}")

In [None]:
X_train_scenario, X_val_test_scenario, y_train_scenario, y_val_test_scenario = train_test_split(sentences, encoded_scenarios, test_size=0.3, random_state=42)
X_val_scenario, X_test_scenario, y_val_scenario, y_test_scenario = train_test_split(X_val_test_scenario, y_val_test_scenario, test_size=0.5, random_state=42)

X_train_tokens_scenario = X_train_scenario.apply(lambda x: word_tokenize(x.lower()))
X_val_tokens_scenario = X_val_scenario.apply(lambda x: word_tokenize(x.lower()))
X_test_tokens_scenario = X_test_scenario.apply(lambda x: word_tokenize(x.lower()))

X_train_emb_scenario = np.array([get_sentence_embedding(tokens, w2v_model) for tokens in X_train_tokens_scenario])
X_val_emb_scenario = np.array([get_sentence_embedding(tokens, w2v_model) for tokens in X_val_tokens_scenario])
X_test_emb_scenario = np.array([get_sentence_embedding(tokens, w2v_model) for tokens in X_test_tokens_scenario])

lr_model_scenario = LogisticRegression(max_iter=500, random_state=42)
lr_model_scenario.fit(X_train_emb_scenario, y_train_scenario)

print("Logistic Regression (Scenario Prediction):")
evaluate_model(lr_model_scenario, X_val_emb_scenario, y_val_scenario, "Validation", scenario_encoder.classes_)
evaluate_model(lr_model_scenario, X_test_emb_scenario, y_test_scenario, "Test", scenario_encoder.classes_, print_matrix=True)

### Now let's work on slot-labeling

First, we extract the slots from the sentences.

In [None]:
def extract_slots(annotation):
    slots = []
    if isinstance(annotation, str):
        entities = re.findall(r'\[([^\]]+)\]', annotation)
        for entity in entities:
            # Split on the first ':' only, so values that contain a colon stay intact
            slot_type, slot_value = entity.split(':', 1)
            slots.append((slot_type.strip(), slot_value.strip()))
            # print(f"Slot Type: {slot_type}, Slot Value: {slot_value}")
    return slots

data['slots'] = data['answer_annotation'].apply(extract_slots)
data[['answer_annotation', 'slots']].head()
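To illustrate the extraction on a single string, here is a small self-contained sketch. The example annotation and its `[type : value]` format are assumptions inferred from the regular expression above, and `parse_annotation` is a hypothetical helper.

```python
import re

def parse_annotation(annotation):
    """Return (slot_type, slot_value) pairs from '[type : value]' spans."""
    pairs = []
    for span in re.findall(r"\[([^\]]+)\]", annotation):
        # partition splits on the first ':' only
        slot_type, _, slot_value = span.partition(":")
        pairs.append((slot_type.strip(), slot_value.strip()))
    return pairs

# Assumed annotation format, for illustration only.
example = "wake me up at [time : five am] [date : this week]"
pairs = parse_annotation(example)
print(pairs)
```

Splitting on the first colon only keeps values such as `5:30` intact.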

In [None]:
slots = data['slots']
slots = slots.explode()
slots = slots.apply(pd.Series)
slots.columns = ['slot_type', 'slot_value']
slots = slots.reset_index(drop=True)
slots['slot_type'] = slots['slot_type'].str.strip()
slot_types = slots['slot_type'].unique().tolist()
print("Slot Types:", slot_types)

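We then assign a BIO tag to each token: `B-` marks the first token of a slot, `I-` the following tokens of the same slot, and `O` any token outside a slot.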

In [None]:
def get_bio_labels(sentence, slots):
    tokens = word_tokenize(sentence)
    labels = ['O'] * len(tokens)
    for slot_type, slot_value in slots:
        slot_tokens = word_tokenize(slot_value)
        slot_tokens_len = len(slot_tokens)
        for i in range(len(tokens)):
            if tokens[i:i+slot_tokens_len] == slot_tokens:
                labels[i] = f"B-{slot_type.strip()}"
                labels[i+1:i+slot_tokens_len] = [f"I-{slot_type.strip()}"] * (slot_tokens_len - 1)
    return labels

data['bio_labels'] = data.apply(lambda row: get_bio_labels(row['answer_normalised'], row['slots']), axis=1)
data[['answer_normalised', 'slots', 'bio_labels']].head()
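A worked example makes the tagging scheme easier to follow. The sketch below uses whitespace tokenization instead of `word_tokenize` and tags only the first occurrence of each slot value; both are simplifying assumptions, and `bio_tags` is a hypothetical helper.

```python
def bio_tags(tokens, slots):
    """Tag the first occurrence of each slot value with B-/I- labels."""
    tags = ["O"] * len(tokens)
    for slot_type, slot_value in slots:
        value_tokens = slot_value.split()
        n = len(value_tokens)
        if n == 0:
            continue  # nothing to tag for an empty slot value
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == value_tokens:
                tags[start] = f"B-{slot_type}"
                for k in range(start + 1, start + n):
                    tags[k] = f"I-{slot_type}"
                break  # tag only the first match
    return tags

tokens = "wake me up at five am this week".split()
tags = bio_tags(tokens, [("time", "five am"), ("date", "this week")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```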

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

X = data['answer_normalised'].tolist()
X = [word_tokenize(x.lower()) for x in X]
y = data['bio_labels'].tolist()

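def align_bio_labels(tokens, bio_labels):
    # Copy the labels onto the tokens position by position; tokens beyond the
    # end of bio_labels keep the 'O' tag
    aligned_labels = ['O'] * len(tokens)
    label_index = 0
    for i, token in enumerate(tokens):
        if label_index < len(bio_labels) and bio_labels[label_index] != 'O':
            aligned_labels[i] = bio_labels[label_index]
            label_index += 1
        elif label_index < len(bio_labels):
            label_index += 1
    return aligned_labels

# Align each sentence separately (the function expects one token list and one label list)
aligned_bio_labels = [align_bio_labels(tokens, labels) for tokens, labels in zip(X, y)]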

mlb = MultiLabelBinarizer()
y_encoded = mlb.fit_transform(aligned_bio_labels)

# bio_labels = mlb.classes_.tolist()

print("BIO Labels Mapping:")
for i, label in enumerate(mlb.classes_):
    print(f"{i}: {label}")

In [73]:
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

sl_w2v_model = Word2Vec(X_train, vector_size=100, window=5, min_count=1, workers=4)

X_train_emb = np.array([get_sentence_embedding(tokens, sl_w2v_model) for tokens in X_train])
X_val_emb = np.array([get_sentence_embedding(tokens, sl_w2v_model) for tokens in X_val])
X_test_emb = np.array([get_sentence_embedding(tokens, sl_w2v_model) for tokens in X_test])

In [None]:
lr_model = OneVsRestClassifier(LogisticRegression(max_iter=500, random_state=42))
lr_model.fit(X_train_emb, y_train)

print("Logistic Regression (Slot Labeling):")
evaluate_model(lr_model, X_val_emb, y_val, "Validation", mlb.classes_)
evaluate_model(lr_model, X_test_emb, y_test, "Test", mlb.classes_)

In [None]:
rf_model = OneVsRestClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
rf_model.fit(X_train_emb, y_train)

print("Random Forest (Slot Labeling):")
evaluate_model(rf_model, X_val_emb, y_val, "Validation", mlb.classes_)
evaluate_model(rf_model, X_test_emb, y_test, "Test", mlb.classes_)

In [None]:
def predict_slots(sentence, model, w2v_model, mlb):
    tokens = word_tokenize(sentence.lower())
    emb = get_sentence_embedding(tokens, w2v_model)
    pred = model.predict([emb])
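    # The classifier predicts which tags occur in the sentence, not where they occur,
    # so the tags are assigned to the first tokens in order, not to their true positions
    pred_labels = mlb.inverse_transform(pred)[0]
    return align_bio_labels(tokens, pred_labels)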

sentence = "Wake me up at 5 am tomorrow"
print(f"Sentence: {sentence}")
print(f"Predicted Slots: {predict_slots(sentence, lr_model, sl_w2v_model, mlb)}")

In [None]:
sentence = "Can you remind me to buy milk at 5 pm?"
print(f"Sentence: {sentence}")
print(f"Predicted Slots: {predict_slots(sentence, lr_model, sl_w2v_model, mlb)}")

In [None]:
sentence = "wake me up at nine am on friday"
print(f"Sentence: {sentence}")
print(f"Predicted Slots: {predict_slots(sentence, lr_model, sl_w2v_model, mlb)}")

## Let's continue by fine-tuning a pre-trained language model like BERT

### Intent Classification

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

X = data['answer_normalised']
y = data['intent']

ic_label_encoder = LabelEncoder()
ic_encoded_intents = ic_label_encoder.fit_transform(y)

X_train, X_val_test, y_train, y_val_test = train_test_split(X, ic_encoded_intents, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

X_train = X_train.reset_index(drop=True)
y_train = pd.Series(y_train).reset_index(drop=True)
X_val = X_val.reset_index(drop=True)
y_val = pd.Series(y_val).reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = pd.Series(y_test).reset_index(drop=True)

# Dataset class for tokenized inputs
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoded = self.tokenizer(text, max_length=self.max_length, padding="max_length", truncation=True, return_tensors="pt")
        return {"input_ids": encoded["input_ids"].squeeze(0), "attention_mask": encoded["attention_mask"].squeeze(0), "label": torch.tensor(label, dtype=torch.long),}

In [None]:
print(X_train.head())
print(y_train)

In [39]:
ic_bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ic_train_dataset = TextDataset(X_train, y_train, ic_bert_tokenizer, max_length=32)
ic_val_dataset = TextDataset(X_val, y_val, ic_bert_tokenizer, max_length=32)

ic_train_loader = DataLoader(ic_train_dataset, batch_size=16, shuffle=True)
ic_val_loader = DataLoader(ic_val_dataset, batch_size=16)

In [None]:
# Pre-trained BERT model
ic_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(ic_label_encoder.classes_))
ic_model = ic_model.to(device)

ic_optimizer = torch.optim.Adam(ic_model.parameters(), lr=5e-5)

In [41]:
# Training function
def ic_train(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for batch in tqdm(dataloader, desc="Training"):
        optimizer.zero_grad()

        # Move inputs to GPU
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass
        loss.backward()
        optimizer.step()
    return total_loss / len(dataloader)

# Validation function
def ic_evaluate(model, dataloader, device):
    model.eval()
    preds, true_labels = [], []
    total_loss = 0
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            # Move inputs to GPU
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            # Forward pass
            # Pass the labels so that outputs.loss is computed (it is None otherwise)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()
            logits = outputs.logits
            preds.extend(torch.argmax(logits, dim=1).cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    accuracy = accuracy_score(true_labels, preds)
    return total_loss / len(dataloader), accuracy

In [None]:
epochs = 3
for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    train_loss = ic_train(ic_model, ic_train_loader, ic_optimizer, device)
    val_loss, val_accuracy = ic_evaluate(ic_model, ic_val_loader, device)
    print(f"Training Loss: {train_loss:.4f}")
    print(f"Validation Loss: {val_loss:.4f}")
    print(f"Validation Accuracy: {val_accuracy:.4f}")

ic_model.save_pretrained("fine_tuned_bert")
ic_bert_tokenizer.save_pretrained("fine_tuned_bert")

In [None]:
ic_model = BertForSequenceClassification.from_pretrained("fine_tuned_bert").to(device)
ic_bert_tokenizer = BertTokenizer.from_pretrained("fine_tuned_bert")

test_dataset = TextDataset(X_test, y_test, ic_bert_tokenizer, max_length=128)
test_loader = DataLoader(test_dataset, batch_size=16)

# ic_evaluate returns (loss, accuracy)
test_loss, test_accuracy = ic_evaluate(ic_model, test_loader, device)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

def predict_intent_bert(sentence, model, tokenizer, label_encoder, device):
    inputs = tokenizer(sentence, max_length=128, padding="max_length", truncation=True, return_tensors="pt")
    inputs = {key: val.to(device) for key, val in inputs.items()}
    model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    logits = outputs.logits
    pred = torch.argmax(logits, dim=1).cpu().numpy()[0]
    return label_encoder.inverse_transform([pred])[0]

sentence = "What is the weather in Paris?"
print(f"Sentence: {sentence}")
print(f"Predicted Intent: {predict_intent_bert(sentence, ic_model, ic_bert_tokenizer, ic_label_encoder, device)}")

In [None]:
sentence = "Could you please turn the lights on?"
print(f"Sentence: {sentence}")
print(f"Predicted Intent: {predict_intent_bert(sentence, ic_model, ic_bert_tokenizer, ic_label_encoder, device)}")

### Slot-Labeling

In [None]:
from transformers import BertTokenizerFast, BertForTokenClassification
from sklearn.preprocessing import LabelEncoder
import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
import numpy as np

# Load tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Encode BIO labels
bio_label_encoder = LabelEncoder()
bio_label_encoder.fit([label for labels in data['bio_labels'] for label in labels])
num_labels = len(bio_label_encoder.classes_)

# Dataset class for token-level tasks
class TokenDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        labels = self.labels[idx]

        # Tokenize the text and align the labels
        encoded = self.tokenizer(
            word_tokenize(text),  # same tokenization that produced bio_labels
            is_split_into_words=True,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

        word_ids = encoded.word_ids(batch_index=0)
        label_ids = [-100] * len(word_ids)  # Initialize with -100 for ignored tokens

        for i, word_id in enumerate(word_ids):
            if word_id is not None:  # Skip special tokens like [CLS] and [SEP]
                label_ids[i] = bio_label_encoder.transform([labels[word_id]])[0]

        return {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
            "labels": torch.tensor(label_ids, dtype=torch.long)
        }

# Prepare data
X = data['answer_normalised']
y = data['bio_labels']  # Use the list of labels directly

# Split data
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

# Create datasets and dataloaders
train_dataset = TokenDataset(X_train.tolist(), y_train.tolist(), tokenizer, max_length=32)
val_dataset = TokenDataset(X_val.tolist(), y_val.tolist(), tokenizer, max_length=32)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

# Load pre-trained BERT model for token classification
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# Training function
def train(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for batch in tqdm(dataloader, desc="Training"):
        optimizer.zero_grad()

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    return total_loss / len(dataloader)

# Evaluation function
def evaluate(model, dataloader, device):
    model.eval()
    total_loss = 0
    preds, true_labels = [], []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            logits = outputs.logits.cpu().numpy()
            label_ids = labels.cpu().numpy()

            # Collect predictions and true labels (ignoring -100)
            for i, label in enumerate(label_ids):
                preds.extend(np.argmax(logits[i], axis=1)[label != -100])
                true_labels.extend(label[label != -100])

    accuracy = accuracy_score(true_labels, preds)
    return total_loss / len(dataloader), accuracy

# Training loop
epochs = 3
for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    train_loss = train(model, train_loader, optimizer, device)
    val_loss, val_accuracy = evaluate(model, val_loader, device)
    print(f"Training Loss: {train_loss:.4f}")
    print(f"Validation Loss: {val_loss:.4f}")
    print(f"Validation Accuracy: {val_accuracy:.4f}")

# Save the model and tokenizer
model.save_pretrained("slot_labeling_bert")
tokenizer.save_pretrained("slot_labeling_bert")
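The label alignment inside `TokenDataset` is the subtle part of this cell: BERT may split a word into several sub-word pieces, and special tokens carry no label. A minimal sketch of the idea in plain Python, with hand-written `word_ids` standing in for what a fast tokenizer would return; the helper `align_labels` and its `label_all_subwords` switch are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def align_labels(word_ids, word_label_ids, label_all_subwords=True):
    """Expand word-level label ids to sub-word positions."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)               # special or padding token
        elif word_id != previous or label_all_subwords:
            aligned.append(word_label_ids[word_id])    # copy the word's label
        else:
            aligned.append(IGNORE_INDEX)               # later piece of the same word
        previous = word_id
    return aligned

# Three words; the second word is split into two pieces.
word_ids = [None, 0, 1, 1, 2, None]
word_label_ids = [0, 5, 6]

all_pieces = align_labels(word_ids, word_label_ids)
first_piece_only = align_labels(word_ids, word_label_ids, label_all_subwords=False)
print(all_pieces)
print(first_piece_only)
```

The cell above labels every piece of a word; labeling only the first piece is a common alternative that stops long words from counting several times in the loss and in the accuracy.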


In [None]:
sl_model = BertForTokenClassification.from_pretrained("slot_labeling_bert").to(device)
sl_tokenizer = BertTokenizerFast.from_pretrained("slot_labeling_bert")

test_dataset = TokenDataset(X_test.tolist(), y_test.tolist(), sl_tokenizer, max_length=128)
test_loader = DataLoader(test_dataset, batch_size=16)

test_loss, test_accuracy = evaluate(sl_model, test_loader, device)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

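def predict_slots_bert(sentence, model, tokenizer, label_encoder, device):
    tokens = word_tokenize(sentence.lower())
    encoded = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
    word_ids = encoded.word_ids(batch_index=0)
    inputs = {key: val.to(device) for key, val in encoded.items()}
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    preds = torch.argmax(logits, dim=2).cpu().numpy()[0]
    # Keep the prediction of the first sub-word of each word; skip [CLS] and [SEP]
    labels, seen = [], set()
    for pred, word_id in zip(preds, word_ids):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            labels.append(label_encoder.inverse_transform([pred])[0])
    return labels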

sentence = "Wake me up at 5 am tomorrow"
print(f"Sentence: {sentence}")
print(f"Predicted Slots: {predict_slots_bert(sentence, sl_model, sl_tokenizer, bio_label_encoder, device)}")

## Implement IC and SL using in-context learning without any fine-tuning (e.g. Flan-T5-base from Google)

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base", legacy=False)
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

In [None]:
intent_prompt = """
Classify the intent of the following sentences:

Text: Wake me up at 7 am
Intent: set_alarm

Text: What is the weather in Paris?
Intent: get_weather

Text: Play some jazz music
Intent: play_music

Text: Set a reminder for my meeting at 3 pm
Intent: set_reminder

Text: Turn off the lights
Intent: control_lights

Text: {}
Intent:
"""

def predict_output_flanT5(sentence, prompt, model, tokenizer):
    input_text = prompt.format(sentence)
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    outputs = model.generate(input_ids, max_length=50)
    intent = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return intent

sentence = "Book a table for two at the restaurant"
predicted_intent = predict_output_flanT5(sentence, intent_prompt, model, tokenizer)
print(f"Sentence: {sentence}")
print(f"Predicted Intent: {predicted_intent}")

In [None]:
correct_predictions = 0
total_predictions = len(data)

for i in range(total_predictions):
    sentence = data['answer_normalised'][i]
    actual_intent = data['intent'][i]
    predicted_intent = predict_output_flanT5(sentence, intent_prompt, model, tokenizer)
    
    if predicted_intent == actual_intent:
        correct_predictions += 1

accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy:.4f}")

Let's see whether listing all the available intents in the prompt improves the accuracy.

In [None]:
# Sort the intents so that the prompt is identical from one run to the next
intents_list = '[' + ', '.join(sorted(set(data['intent']))) + ']'
print(intents_list)

In [None]:
intent_prompt = f"""
Here are all the possible intents:
Intents: {intents_list}

Classify the intent of the following sentence:

Text: {{}}
Intent:
"""

correct_predictions = 0
total_predictions = len(data)

for i in range(total_predictions):
    sentence = data['answer_normalised'][i]
    actual_intent = data['intent'][i]
    predicted_intent = predict_output_flanT5(sentence, intent_prompt, model, tokenizer)
    
    if predicted_intent == actual_intent:
        correct_predictions += 1

accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy:.4f}")
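Exact string matching is strict for generated text: a stray space, a capital letter or a small spelling variation counts as an error. One possible mitigation, sketched below under our own assumptions, is to normalise the generated string and map it to the closest valid label with the standard library's `difflib`. The helper `normalise_prediction`, the 0.6 cutoff and the toy label set are hypothetical.

```python
import difflib

def normalise_prediction(generated, valid_labels, cutoff=0.6):
    """Map a generated string to the closest valid label, or None if nothing is close."""
    cleaned = generated.strip().lower().replace(" ", "_")
    if cleaned in valid_labels:
        return cleaned
    matches = difflib.get_close_matches(cleaned, valid_labels, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Made-up label set, for illustration only.
valid = ["alarm_set", "weather_query", "music_play"]

print(normalise_prediction(" Alarm Set ", valid))
print(normalise_prediction("weather_querry", valid))
print(normalise_prediction("xyz", valid))
```

Reporting accuracy both with and without such a normalisation keeps the comparison with the other systems transparent.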

## Slot Labeling with Google Flan-T5

In [None]:
slot_prompt = """
Extract the slots from the following sentences:

Text: Wake me up at 7 am
Slots: [time: 7 am]

Text: Book a table for two at the restaurant
Slots: [number: two, location: restaurant]

Text: Set a reminder for my meeting at 3 pm
Slots: [event: meeting, time: 3 pm]

Text: Turn off the lights in the living room
Slots: [action: turn off, object: lights, location: living room]

Text: {}
Slots:
"""

sentence = "Remind me to buy milk at 5 pm"
predicted_slots = predict_output_flanT5(sentence, slot_prompt, model, tokenizer)
print(f"Sentence: {sentence}")
print(f"Predicted Slots: {predicted_slots}")

In [None]:
correct_predictions = 0
total_predictions = len(data)

for i in range(total_predictions):
    sentence = data['answer_normalised'][i]
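    # The model returns a string such as "[time: 7 am]", so render the gold slots
    # in the same format; comparing against the list of BIO tags can never match
    actual_slots = '[' + ', '.join(f"{t.strip()}: {v.strip()}" for t, v in data['slots'][i]) + ']'
    predicted_slots = predict_output_flanT5(sentence, slot_prompt, model, tokenizer)

    if predicted_slots.strip() == actual_slots: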
        correct_predictions += 1

accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy:.4f}")
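Exact match on the whole slot string gives no credit for partially correct outputs. A slot-level precision, recall and F1 is a common alternative; the sketch below parses the `[type: value, ...]` format used in the prompt and scores it against gold pairs. The helpers are hypothetical, and the parser assumes that slot values contain no commas.

```python
def parse_slot_string(text):
    """Parse '[type: value, type: value]' into a set of (type, value) pairs."""
    inner = text.strip().strip("[]")
    pairs = set()
    for part in inner.split(","):
        slot_type, sep, slot_value = part.partition(":")
        if sep and slot_type.strip() and slot_value.strip():
            pairs.add((slot_type.strip().lower(), slot_value.strip().lower()))
    return pairs

def slot_prf(predicted, gold):
    """Precision, recall and F1 over sets of (type, value) pairs."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("event", "meeting"), ("time", "3 pm")}
pred = parse_slot_string("[event: meeting, time: 5 pm]")
precision, recall, f1 = slot_prf(pred, gold)
print(precision, recall, f1)
```

In this example one of the two predicted slots is correct, so the output is scored as half right instead of entirely wrong.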

In [None]:
samples = data.sample(5)
for i, row in samples.iterrows():
    sentence = row['answer_normalised']
    actual_intent = row['intent']
    actual_slots = row['bio_labels']
    
    predicted_intent = predict_output_flanT5(sentence, intent_prompt, model, tokenizer)
    predicted_slots = predict_output_flanT5(sentence, slot_prompt, model, tokenizer)
    
    print(f"Sentence: {sentence}")
    print(f"Actual Intent: {actual_intent}")
    print(f"Predicted Intent: {predicted_intent}")
    print(f"Actual Slots: {actual_slots}")
    print(f"Predicted Slots: {predicted_slots}")
    print()

In [None]:
print(slot_types)

In [None]:
prompt = f"""
Here are all the possible labels:
Labels: {slot_types}

Give a list of slots (labels in BIO format) corresponding to the following sentence:

Example:
Text: email dad how is the weather this week
Slots: ['O', 'B-relation', 'O', 'O', 'O', 'O', 'B-date', 'I-date']

Text: {{}}
Slots:
"""

correct_predictions = 0
# total_predictions = len(data)
total_predictions = 50

for i in range(total_predictions):
    sentence = data['answer_normalised'][i]
    # Compare against the string form of the gold tag list, since the model returns text
    actual_slots = str(data['bio_labels'][i])
    predicted_slots = predict_output_flanT5(sentence, prompt, model, tokenizer)

    if predicted_slots.strip() == actual_slots:
        correct_predictions += 1

accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy:.4f}")

samples = data.sample(5)
for i, row in samples.iterrows():
    sentence = row['answer_normalised']
    actual_slots = row['bio_labels']
    
    predicted_slots = predict_output_flanT5(sentence, prompt, model, tokenizer)
    
    print(f"Sentence: {sentence}")
    print(f"Actual Slots: {actual_slots}")
    print(f"Predicted Slots: {predicted_slots}")
    print()
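When the model is asked for a Python-style list of BIO tags, its output can be parsed and scored per token instead of being compared as one string. A minimal sketch under our own assumptions: `parse_tag_list` and `token_accuracy` are hypothetical helpers, and any output that is not a list of strings, or has the wrong length, is scored as 0.

```python
import ast

def parse_tag_list(generated):
    """Parse a generated string such as "['O', 'B-date']" into a list of tags."""
    try:
        value = ast.literal_eval(generated.strip())
    except (ValueError, SyntaxError):
        return None
    if isinstance(value, list) and all(isinstance(tag, str) for tag in value):
        return [tag.strip() for tag in value]
    return None

def token_accuracy(predicted, gold):
    """Share of positions where the predicted tag equals the gold tag."""
    if predicted is None or not gold or len(predicted) != len(gold):
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

gold_tags = ["O", "B-date", "I-date"]
parsed = parse_tag_list("['O', 'B-date', 'O']")
print(parsed, token_accuracy(parsed, gold_tags))
print(parse_tag_list("time: 5 pm"))
```

`ast.literal_eval` only evaluates Python literals, so parsing untrusted model output this way does not execute code.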