<a href="https://colab.research.google.com/github/manojmandal27/Text-Classification-BERT-LLM/blob/main/LLM_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement

Intent identification using a BERT model.

## Learning Objectives

At the end of the mini-project, you will be able to:

* Read the intent, questions and responses data
* Load a pre-trained BERT model
* Fine-tune the BERT model
* Get the predictions for each question

## Overview

The intent identification problem is framed as a text classification task, where a BERT model is trained to classify intent. Once the model is fine-tuned, a conversation tool is set up. For each user question, the model first predicts the intent, and a response is selected from a predefined set of responses corresponding to the predicted intent as the answer to the input question.

## Dataset

The dataset defines several intent classes. Each intent comes with a set of example questions that belong to it and a pool of suitable responses.
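For orientation, the sketch below shows the record layout that the reading code in this notebook expects: a top-level `intents` list whose entries carry `intent`, `text`, and `responses` keys. The sample values are abridged from the output printed in the reading section, and `flatten_examples` is a hypothetical helper that is not part of the notebook.

```python
# A tiny in-memory sample with the same keys the notebook reads from Intent.json
# (values abridged; the real file has many more intents).
sample = {
    "intents": [
        {
            "intent": "Greeting",
            "text": ["Hi", "Hello"],
            "responses": ["Hi human, please tell me your GeniSys user"],
        },
        {
            "intent": "CourtesyGreeting",
            "text": ["How are you?"],
            "responses": ["Hello, I am great, how are you? Please tell me your GeniSys user"],
        },
    ]
}


def flatten_examples(data):
    """Turn the nested structure into (question, intent) training pairs."""
    pairs = []
    for entry in data["intents"]:
        for question in entry.get("text", []):
            pairs.append((question, entry["intent"]))
    return pairs


pairs = flatten_examples(sample)
print(pairs)
```

Each pair is one training example for the classifier: the question is the input text and the intent name is the label.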

In [None]:
# Download the intent dataset (Intent.json)

#@title Download the Dataset
!wget https://cdn.exec.talentsprint.com/static/aimlops/c3/Intent.json

--2024-09-22 03:28:21--  https://cdn.exec.talentsprint.com/static/aimlops/c3/Intent.json
Resolving cdn.exec.talentsprint.com (cdn.exec.talentsprint.com)... 172.105.52.210
Connecting to cdn.exec.talentsprint.com (cdn.exec.talentsprint.com)|172.105.52.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69866 (68K) [application/json]
Saving to: ‘Intent.json’


2024-09-22 03:28:22 (292 KB/s) - ‘Intent.json’ saved [69866/69866]



### Import Necessary Packages

In [None]:
# Please feel free to add/remove installations here

# Initial Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Read the Intent, Questions, and Response Data

In [None]:

# Step 1: Read the intent, questions, and response data from Intent.json
# Function to read intent, questions, and responses
def extract_intents(data):
    intents = []
    # Access the list of dictionaries for the key 'intents'
    for entry in data['intents']:
        intent = entry.get('intent', 'Unknown')
        questions = entry.get('text', [])
        responses = entry.get('responses', [])
        intents.append({
            'intent': intent,
            'questions': questions,
            'responses': responses
        })
    return intents

# Load the downloaded JSON file
import json
with open('Intent.json', 'r') as f:
    data = json.load(f)

# Extract intents data from the JSON
intents_data = extract_intents(data)

# Display the extracted intents
for intent in intents_data:
    print(f"Intent: {intent['intent']}")
    print(f"Questions: {intent['questions']}")
    print(f"Responses: {intent['responses']}")
    print("-" * 50)

Intent: Greeting
Questions: ['Hi', 'Hi there', 'Hola', 'Hello', 'Hello there', 'Hya', 'Hya there']
Responses: ['Hi human, please tell me your GeniSys user', 'Hello human, please tell me your GeniSys user', 'Hola human, please tell me your GeniSys user']
--------------------------------------------------
Intent: GreetingResponse
Questions: ['My user is Adam', 'This is Adam', 'I am Adam', 'It is Adam', 'My user is Bella', 'This is Bella', 'I am Bella', 'It is Bella']
Responses: ['Great! Hi <HUMAN>! How can I help?', 'Good! Hi <HUMAN>, how can I help you?', 'Cool! Hello <HUMAN>, what can I do for you?', 'OK! Hola <HUMAN>, how can I help you?', 'OK! hi <HUMAN>, what can I do for you?']
--------------------------------------------------
Intent: CourtesyGreeting
Questions: ['How are you?', 'Hi how are you?', 'Hello how are you?', 'Hola how are you?', 'How are you doing?', 'Hope you are doing well?', 'Hello hope you are doing well?']
Responses: ['Hello, I am great, how are you? Please tell me

Preprocessing:

### Tokenize the Questions

In [None]:
!pip install transformers




In [None]:
from transformers import AutoTokenizer

# Load the tokenizer for distilbert-base-cased
checkpoint = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Function to tokenize questions using the distilbert tokenizer
def tokenize_questions_with_distilbert(intents_data):
    tokenized_data = []

    for intent in intents_data:
        tokenized_questions = [tokenizer(question, padding='max_length', truncation=True, return_tensors='pt') for question in intent['questions']]
        tokenized_data.append({
            'intent': intent['intent'],
            'tokenized_questions': tokenized_questions
        })

    return tokenized_data

# Tokenize the questions from the extracted intents data (from Step 1)
tokenized_intents_distilbert = tokenize_questions_with_distilbert(intents_data)

# Display the tokenized questions (for demonstration, showing the input IDs)
for tokenized_intent in tokenized_intents_distilbert:
    print(f"Intent: {tokenized_intent['intent']}")
    for question in tokenized_intent['tokenized_questions']:
        print(f"Tokenized Question (input_ids): {question['input_ids']}")
    print("-" * 50)




Streaming output truncated to the last 5000 lines.
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]])
--------------------------------------------------
Intent: NameQuery
Tokenized Question (input_ids): tensor([[ 101, 1327, 1110, 1240, 1271,  136,  102,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,  
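To make the printed tensors easier to read: with `padding='max_length'`, every question is right-padded with zeros up to a fixed length, and the attention mask marks which positions hold real tokens. The sketch below mimics that in plain Python. `pad_to_length` is a hypothetical helper, its truncation is simplified compared with the real tokenizer, and the token IDs are copied from the `NameQuery` output above.

```python
def pad_to_length(token_ids, max_length, pad_id=0):
    """Right-pad a list of token IDs and build the matching attention mask."""
    if len(token_ids) > max_length:
        # Simplified truncation: keep the first max_length IDs.
        # (The real tokenizer also keeps the closing special token.)
        token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# IDs from the NameQuery example above; 101 and 102 are the [CLS] and [SEP] markers.
ids = [101, 1327, 1110, 1240, 1271, 136, 102]
input_ids, attention_mask = pad_to_length(ids, max_length=12)
print(input_ids)
print(attention_mask)
```

Because the questions in this dataset are short, most of each padded sequence is zeros. Passing a smaller `max_length` to the tokenizer is one option worth considering to reduce compute.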

### Create the Train Data with Tokenized Questions and Intent Labels

In [None]:
## Add your code here
from transformers import AutoTokenizer
from sklearn.preprocessing import LabelEncoder
import torch

# Load the tokenizer for distilbert-base-cased
checkpoint = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Function to tokenize questions using the distilbert tokenizer
def tokenize_questions_with_distilbert(intents_data):
    tokenized_data = []
    labels = []

    for intent in intents_data:
        tokenized_questions = [tokenizer(question, padding='max_length', truncation=True, return_tensors='pt') for question in intent['questions']]
        tokenized_data.extend(tokenized_questions)  # Collect all tokenized questions
        labels.extend([intent['intent']] * len(intent['questions']))  # Repeat intent label for each question

    return tokenized_data, labels

# Step 1: Extract intents and questions
intents_data = extract_intents(data)  # From Step 1

# Step 2: Tokenize the questions and gather corresponding intent labels
tokenized_questions, labels = tokenize_questions_with_distilbert(intents_data)

# Step 3: Encode the intent labels into numerical values
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Prepare the final train dataset
train_data = []
for tokenized_question, label in zip(tokenized_questions, encoded_labels):
    input_ids = tokenized_question['input_ids'].squeeze()  # Remove the extra dimension
    attention_mask = tokenized_question['attention_mask'].squeeze()

    # Create a tuple of input IDs, attention mask, and the corresponding label
    train_data.append((input_ids, attention_mask, torch.tensor(label)))

# Display the first few examples of train data
for i in range(3):
    print(f"Input IDs: {train_data[i][0]}")
    print(f"Attention Mask: {train_data[i][1]}")
    print(f"Label: {train_data[i][2]}")
    print("-" * 50)


Input IDs: tensor([ 101, 8790,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    
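As a rough illustration of what the label-encoding step does, `LabelEncoder` assigns each distinct intent name an integer ID based on the sorted class names, and `inverse_transform` maps IDs back to names. The plain-Python sketch below reproduces that behaviour on a few made-up labels. It is an approximation for illustration, not scikit-learn's implementation.

```python
labels = ["Greeting", "Greeting", "CourtesyGreeting", "NameQuery"]

# Sorted unique class names define the integer IDs.
classes = sorted(set(labels))
class_to_id = {name: idx for idx, name in enumerate(classes)}

# Forward mapping (like fit_transform) and reverse mapping (like inverse_transform).
encoded = [class_to_id[name] for name in labels]
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
print(decoded)
```

The same fitted encoder has to be reused at prediction time so that a predicted ID maps back to the correct intent name.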



### Load a Pre-Trained BERT Model

In [None]:
## Add your code here
from transformers import AutoModelForSequenceClassification

# Define the pre-trained BERT model and number of output labels
checkpoint = 'distilbert-base-cased'  # We will use the distilbert-base-cased model
num_labels = len(set(encoded_labels))  # Number of unique intents

# Load the pre-trained BERT model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

# Display model architecture
print(model)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Prepare the Model to Fine-Tune

In [None]:
## Add your code here
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW  # PyTorch's AdamW; the copy in transformers is deprecated
from transformers import get_scheduler
from tqdm.auto import tqdm

# Prepare training data (Step-3 output)
input_ids = torch.stack([item[0] for item in train_data])
attention_masks = torch.stack([item[1] for item in train_data])
labels = torch.stack([item[2] for item in train_data])

# Create a TensorDataset and DataLoader for batching
train_dataset = TensorDataset(input_ids, attention_masks, labels)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)  # Set appropriate batch size

# Define the optimizer (AdamW)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Scheduler for learning rate decay
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# Loss function: cross-entropy. The model computes this loss itself when
# `labels` are passed to it, so `loss_fn` is only needed for manual loss computation.
loss_fn = torch.nn.CrossEntropyLoss()

# Check if a GPU is available and move model to GPU if possible
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Model preparation summary
print(f"Model loaded on: {device}")
print(f"Batch size: 16, Epochs: {num_epochs}, Learning rate: {optimizer.param_groups[0]['lr']}")


Model loaded on: cuda
Batch size: 16, Epochs: 3, Learning rate: 5e-05
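The `"linear"` schedule with zero warmup steps lowers the learning rate from its starting value to zero over the course of training. The sketch below recomputes that curve in plain Python. `linear_lr` and `steps_per_epoch` are hypothetical helpers, and the dataset size of 140 is an assumption chosen only so that the total comes to 27 steps, the figure shown on the training progress bar.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


def steps_per_epoch(num_examples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


# Hypothetical dataset size; the notebook does not print the real count.
num_examples = 140
total_steps = 3 * steps_per_epoch(num_examples, batch_size=16)

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 5e-5, total_steps))
```

With only a few dozen updates in total, the learning rate drops quickly. Training for more epochs lengthens the schedule as well as the training itself.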


### Train the Model using the Tokenized Questions and Intent Labels

In [None]:
## Add your code here
# Fine-tuning the model (Training loop)

# Set the model to train mode
model.train()

# Initialize progress bar for tracking
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")

    for batch in train_dataloader:
        # Unpack the batch and send inputs to the appropriate device (CPU or GPU)
        batch_input_ids = batch[0].to(device)
        batch_attention_masks = batch[1].to(device)
        batch_labels = batch[2].to(device)

        # Forward pass: Get model predictions (logits)
        outputs = model(
            input_ids=batch_input_ids,
            attention_mask=batch_attention_masks,
            labels=batch_labels
        )

        # Loss computation (CrossEntropy)
        loss = outputs.loss

        # Backward pass: Compute gradients
        loss.backward()

        # Optimizer step: Update model weights
        optimizer.step()
        lr_scheduler.step()  # Update learning rate
        optimizer.zero_grad()  # Reset gradients

        # Update progress bar
        progress_bar.update(1)

    # Note: this reports the loss of the last batch only, not the epoch average
    print(f"Epoch {epoch + 1} completed, Loss: {loss.item()}")

print("Training complete!")


  0%|          | 0/27 [00:00<?, ?it/s]

Epoch 1/3
Epoch 1 completed, Loss: 3.0709595680236816
Epoch 2/3
Epoch 2 completed, Loss: 2.8387722969055176
Epoch 3/3
Epoch 3 completed, Loss: 2.7341010570526123
Training complete!
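Two small calculations can help interpret these numbers. The loop prints the loss of the last batch in each epoch, so an average over all batches gives a steadier signal. In addition, the cross-entropy of a uniform guess over K classes is ln(K), which serves as a baseline. The sketch below shows both; the per-batch losses and the class count of 22 are made-up values, not figures from this run.

```python
import math


def chance_level_loss(num_classes):
    """Cross-entropy of a uniform guess over num_classes classes."""
    return math.log(num_classes)


def mean_loss(batch_losses):
    """Average of the per-batch loss values collected during one epoch."""
    return sum(batch_losses) / len(batch_losses)


# Hypothetical per-batch losses for one epoch (not taken from the run above).
batch_losses = [3.2, 3.1, 3.0, 2.9]
print(f"mean epoch loss: {mean_loss(batch_losses):.3f}")

# Hypothetical class count; in the notebook this would be `num_labels`.
print(f"chance-level loss for 22 classes: {chance_level_loss(22):.3f}")
```

If the reported loss is still close to `math.log(num_labels)` after training, the model has learned little beyond guessing, and more epochs may help.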


### Create a Function to get the Predictions for each Question

In [None]:
## Add your code here
import torch
from transformers import AutoTokenizer
import torch.nn.functional as F

# Function to get predictions for a list of questions
def predict_intent(questions, model, tokenizer, label_encoder, device):
    model.eval()  # Set the model to evaluation mode

    predictions = []

    with torch.no_grad():  # Disable gradient calculation for inference
        for question in questions:
            # Tokenize the input question
            inputs = tokenizer(
                question,
                padding="max_length",
                truncation=True,
                return_tensors="pt"
            ).to(device)  # Move to device (CPU/GPU)

            # Forward pass to get the logits
            outputs = model(**inputs)
            logits = outputs.logits

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=1)

            # Get the predicted label (index of the highest probability)
            predicted_label_id = torch.argmax(probs, dim=1).item()

            # Convert the predicted label ID back to the intent name
            predicted_label = label_encoder.inverse_transform([predicted_label_id])[0]

            # Store the prediction
            predictions.append(predicted_label)

    return predictions

# Example usage of the function
questions_to_predict = ["What is the weather today?", "Tell me a joke." , "Tell me how are you?"]
predicted_intents = predict_intent(questions_to_predict, model, tokenizer, label_encoder, device)

# Display predictions
for question, intent in zip(questions_to_predict, predicted_intents):
    print(f"Question: {question}")
    print(f"Predicted Intent: {intent}")
    print("-" * 50)


Question: What is the weather today?
Predicted Intent: CourtesyGreeting
--------------------------------------------------
Question: Tell me a joke.
Predicted Intent: Jokes
--------------------------------------------------
Question: Tell me how are you?
Predicted Intent: CourtesyGreeting
--------------------------------------------------
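Because `argmax` always returns some intent, a question that may fall outside the training examples, such as the weather question above, is still assigned a label. One possible mitigation is to fall back to a default when the top probability is low. The sketch below shows the idea in plain Python; `pick_label`, the threshold of 0.5, and the fallback name are assumptions, not part of the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits, class_names, threshold=0.5, fallback="Unknown"):
    """Return the top class, or `fallback` when its probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return fallback, probs[best]
    return class_names[best], probs[best]


names = ["CourtesyGreeting", "Jokes", "NameQuery"]
print(pick_label([0.2, 4.0, 0.1], names))  # one logit clearly dominates
print(pick_label([0.3, 0.2, 0.1], names))  # nearly uniform logits
```

Softmax preserves the ordering of the logits, so the predicted class is the same with or without it. The probabilities only matter when a confidence value like this is needed.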


### Create a Function to Choose the Response based on the Intent Prediction

In [None]:
# Create a dictionary to map intents to responses from the JSON data
intent_response_map = {item['intent']: item['responses'] for item in intents_data}

# Function to choose a response based on the predicted intent
def get_response_from_intent(predicted_intents, intent_response_map):
    responses = []

    for intent in predicted_intents:
        # Look up the full list of candidate responses for this intent
        # (falls back to a single string if the intent is unknown)
        response = intent_response_map.get(intent, "Sorry, I don't understand that.")
        responses.append(response)

    return responses

# Example usage
questions_to_predict = ["What's the weather today?", "Tell me a joke."]
predicted_intents = predict_intent(questions_to_predict, model, tokenizer, label_encoder, device)
responses = get_response_from_intent(predicted_intents, intent_response_map)

# Display the responses for each question
for question, response in zip(questions_to_predict, responses):
    print(f"Question: {question}")
    print(f"Response: {response}")
    print("-" * 50)


Question: What's the weather today?
Response: ['Hello, I am great, how are you? Please tell me your GeniSys user', 'Hello, how are you? I am great thanks! Please tell me your GeniSys user', 'Hello, I am good thank you, how are you? Please tell me your GeniSys user', 'Hi, I am great, how are you? Please tell me your GeniSys user', 'Hi, how are you? I am great thanks! Please tell me your GeniSys user', 'Hi, I am good thank you, how are you? Please tell me your GeniSys user', 'Hi, good thank you, how are you? Please tell me your GeniSys user']
--------------------------------------------------
Question: Tell me a joke.
Response: ["I met a Dutch girl with inflatable shoes last week, phoned her up to arrange a date but unfortunately she'd popped her clogs.  ", "So I said 'Do you want a game of Darts?' He said, 'OK then', I said nearest to bull starts'. He said, 'Baa', I said, 'Moo', he said, You're closest'.  ", "The other day I sent my girlfriend a huge pile of snow. I rang her up; I said 
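As the output shows, the lookup returns the whole list of candidate responses for an intent. To answer with a single response, as the Overview describes, one option is to pick one candidate at random. The sketch below illustrates that; `choose_response` and `demo_map` are hypothetical names, and the sample responses are taken from the `Greeting` intent.

```python
import random


def choose_response(intent, intent_response_map, rng=random):
    """Pick one response string for the intent, with a fixed fallback."""
    candidates = intent_response_map.get(intent)
    if not candidates:
        return "Sorry, I don't understand that."
    return rng.choice(candidates)


demo_map = {
    "Greeting": [
        "Hi human, please tell me your GeniSys user",
        "Hello human, please tell me your GeniSys user",
    ]
}

# A seeded generator makes the choice repeatable between runs.
print(choose_response("Greeting", demo_map, rng=random.Random(0)))
print(choose_response("Weather", demo_map))
```

This version always returns a string, so the caller does not have to handle a list in one case and a string in the other.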

### Connect the above 2 Functions to take a Question from the User and Respond with Intent and the Answer

In [None]:
def ask_and_respond(question, model, tokenizer, label_encoder, intent_response_map, device):
    # Step 1: Predict the intent of the input question
    predicted_intents = predict_intent([question], model, tokenizer, label_encoder, device)

    # Step 2: Get the corresponding response based on the predicted intent
    responses = get_response_from_intent(predicted_intents, intent_response_map)

    # Return the predicted intent and response
    return predicted_intents[0], responses[0]

# Example predefined responses for intents from the JSON data
intent_response_map = {item['intent']: item['responses'] for item in intents_data}

# Example usage: Take a question from the user
user_question = input("Ask me a question: ")

# Call the function that predicts intent and generates the response
predicted_intent, response = ask_and_respond(user_question, model, tokenizer, label_encoder, intent_response_map, device)

# Display the predicted intent and response
print(f"Predicted Intent: {predicted_intent}")
print(f"Response: {response}")


Ask me a question: Hi How are you today?
Predicted Intent: CourtesyGreeting
Response: ['Hello, I am great, how are you? Please tell me your GeniSys user', 'Hello, how are you? I am great thanks! Please tell me your GeniSys user', 'Hello, I am good thank you, how are you? Please tell me your GeniSys user', 'Hi, I am great, how are you? Please tell me your GeniSys user', 'Hi, how are you? I am great thanks! Please tell me your GeniSys user', 'Hi, I am good thank you, how are you? Please tell me your GeniSys user', 'Hi, good thank you, how are you? Please tell me your GeniSys user']
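The cell above handles a single question. A conversation tool would typically keep asking until the user quits, which could look like the sketch below. `chat_loop` and its parameters are hypothetical, and `fake_respond` stands in for `ask_and_respond` so that the example runs without the model.

```python
def chat_loop(respond, read=input, write=print, quit_words=("quit", "exit")):
    """Repeatedly read a question, answer it, and stop on a quit word."""
    turns = 0
    while True:
        question = read("Ask me a question: ").strip()
        if question.lower() in quit_words:
            break
        if not question:
            continue  # ignore empty input
        intent, response = respond(question)
        write(f"Predicted Intent: {intent}")
        write(f"Response: {response}")
        turns += 1
    return turns


def fake_respond(question):
    """Stand-in for ask_and_respond; always returns the same intent and reply."""
    return "Greeting", "Hi human, please tell me your GeniSys user"


# Scripted input instead of a live prompt, for demonstration.
scripted = iter(["Hi", "", "quit"])
transcript = []
turns = chat_loop(fake_respond, read=lambda prompt: next(scripted), write=transcript.append)
print(turns, transcript)
```

In the notebook, `respond` could be a small wrapper such as `lambda q: ask_and_respond(q, model, tokenizer, label_encoder, intent_response_map, device)`, with `read` and `write` left at their defaults.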
