## Load the dataset

### Subtask:
Load the `small` configuration of the "clinc_oos" dataset using the `load_dataset` function.


**Reasoning**:
Import the necessary function and load the specified dataset.



In [1]:
from datasets import load_dataset

dataset = load_dataset("clinc_oos", "small")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Explore the dataset

### Subtask:
Understand the structure and content of the dataset, including the intent labels.


**Reasoning**:
Print the `DatasetDict` to see its splits and columns, then print the `intent` feature to see the label names and the integer IDs they map to.



In [2]:
print(dataset)
# print(dataset['train'].head())  # fails: Dataset has no head() method; slice instead, e.g. dataset['train'][:5]
print(dataset['train'].features['intent'])

DatasetDict({
    train: Dataset({
        features: ['text', 'intent'],
        num_rows: 7600
    })
    validation: Dataset({
        features: ['text', 'intent'],
        num_rows: 3100
    })
    test: Dataset({
        features: ['text', 'intent'],
        num_rows: 5500
    })
})
ClassLabel(names=['restaurant_reviews', 'nutrition_info', 'account_blocked', 'oil_change_how', 'time', 'weather', 'redeem_rewards', 'interest_rate', 'gas_type', 'accept_reservations', 'smart_home', 'user_name', 'report_lost_card', 'repeat', 'whisper_mode', 'what_are_your_hobbies', 'order', 'jump_start', 'schedule_meeting', 'meeting_schedule', 'freeze_account', 'what_song', 'meaning_of_life', 'restaurant_reservation', 'traffic', 'make_call', 'text', 'bill_balance', 'improve_credit_score', 'change_language', 'no', 'measurement_conversion', 'timer', 'flip_coin', 'do_you_have_pets', 'balance', 'tell_joke', 'last_maintenance', 'exchange_rate', 'uber', 'car_rental', 'credit_limit', 'oos', 'shopping_list', 'ex

**Reasoning**:
An earlier version of the previous cell called `dataset['train'].head()`, which fails because `Dataset` objects do not have a `head()` method. Slicing with `dataset['train'][:5]` returns the first five training examples as a dictionary of columns. The cell below prints the dataset structure, those examples, and the intent labels with their integer IDs.



In [3]:
print(dataset)
print(dataset['train'][:5])
print(dataset['train'].features['intent'])

DatasetDict({
    train: Dataset({
        features: ['text', 'intent'],
        num_rows: 7600
    })
    validation: Dataset({
        features: ['text', 'intent'],
        num_rows: 3100
    })
    test: Dataset({
        features: ['text', 'intent'],
        num_rows: 5500
    })
})
{'text': ['can you walk me through setting up direct deposits to my bank of internet savings account', 'i want to switch to direct deposit', 'set up direct deposit for me', 'how do i go about setting up direct deposit', 'i need to get my paycheck direct deposited to my chase account'], 'intent': [108, 108, 108, 108, 108]}
ClassLabel(names=['restaurant_reviews', 'nutrition_info', 'account_blocked', 'oil_change_how', 'time', 'weather', 'redeem_rewards', 'interest_rate', 'gas_type', 'accept_reservations', 'smart_home', 'user_name', 'report_lost_card', 'repeat', 'whisper_mode', 'what_are_your_hobbies', 'order', 'jump_start', 'schedule_meeting', 'meeting_schedule', 'freeze_account', 'what_song', 'meaning_of_
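
The integers in the `intent` column index into the `names` list of the `ClassLabel` feature, so the ID `108` in the examples above refers to position 108 in that list. As a minimal, library-free sketch of the lookup, the snippet below builds both directions of the mapping from the first six names visible in the output. `label_names`, `id2label`, and `label2id` are illustrative names; in the notebook the full list would come from `dataset['train'].features['intent'].names`.

```python
# First six label names, copied from the ClassLabel output above.
label_names = [
    "restaurant_reviews",
    "nutrition_info",
    "account_blocked",
    "oil_change_how",
    "time",
    "weather",
]

# Integer ID -> name, and name -> integer ID.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[5])
print(label2id["time"])
```

The `ClassLabel` feature offers the same lookups through its `int2str` and `str2int` methods, so building these dictionaries by hand is only needed when the mapping has to live outside the dataset object.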

## Preprocess the data

### Subtask:
Prepare the text data for BERT by tokenizing and encoding it with the tokenizer that matches the `bert-base-uncased` checkpoint.


**Reasoning**:
Load a BERT tokenizer, define a function to tokenize and encode the text data, apply this function to all dataset splits, remove the original text column, and rename the intent column to labels.



In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

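def tokenize_function(examples):
    # padding="max_length" pads every utterance to the tokenizer's maximum
    # length (512 tokens for bert-base-uncased); truncation=True cuts longer inputs.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# batched=True passes batches of examples to the tokenizer, which is faster
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Drop the raw text and rename the target column to the name the model's forward pass expects
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("intent", "labels")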

print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7600
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3100
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5500
    })
})
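
`padding="max_length"` pads every utterance to the tokenizer's maximum length, which is 512 tokens for `bert-base-uncased`, even though the utterances shown earlier are single short sentences. The sketch below is a rough, library-free illustration of how many token positions that costs compared with padding each batch only to its longest member. The token lengths are made-up values, not measurements from this dataset, and `padded_positions` is a hypothetical helper. In `transformers`, per-batch padding is what `DataCollatorWithPadding` provides when the text is tokenized without `padding`.

```python
MAX_LENGTH = 512  # model_max_length of bert-base-uncased


def padded_positions(batch_lengths, pad_to=None):
    """Total token positions after padding a batch to pad_to (default: the longest item)."""
    target = pad_to if pad_to is not None else max(batch_lengths)
    return target * len(batch_lengths)


# Hypothetical token counts for one batch of eight short utterances.
hypothetical_lengths = [18, 9, 8, 11, 16, 12, 7, 10]

fixed = padded_positions(hypothetical_lengths, pad_to=MAX_LENGTH)
dynamic = padded_positions(hypothetical_lengths)

print(f"padding to max_length: {fixed} positions")
print(f"padding to longest in batch: {dynamic} positions")
```

Because self-attention cost grows with sequence length, shorter padded batches would likely reduce training time, although the exact saving depends on the real length distribution of the data.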


## Load a pre-trained BERT model

### Subtask:
Load a pre-trained BERT model suitable for sequence classification.


**Reasoning**:
Import the necessary class and load the pre-trained BERT model for sequence classification.



In [5]:
from transformers import AutoModelForSequenceClassification

num_labels = tokenized_datasets['train'].features['labels'].num_classes
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Fine-tune the BERT model

### Subtask:
Train the BERT model on the loaded dataset for intent classification.


**Reasoning**:
Import necessary classes, define training arguments, instantiate the Trainer, and start the training process.



In [6]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

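def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the true label IDs
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    # Weighted averages account for class support; zero_division=0 scores classes
    # that were never predicted as 0 instead of emitting an UndefinedMetricWarning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='weighted', zero_division=0
    )
    acc = accuracy_score(labels, preds)
    logger.info(f"Accuracy: {acc}, F1: {f1}, Precision: {precision}, Recall: {recall}")
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}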

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # evaluate after every epoch; older transformers releases call this evaluation_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

trainer.train()


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: nandanadileep29 (nandanadileep29-indian-institute-of-technology-madras) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 3.658651 | 0.584839 | 0.55092 | 0.675511 | 0.584839 |
| 2 | 4.414900 | 2.720546 | 0.82 | 0.807654 | 0.849404 | 0.82 |
| 3 | 3.134200 | 2.428687 | 0.851613 | 0.842159 | 0.871704 | 0.851613 |
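
In the results above, the Recall column matches Accuracy in every epoch. That is expected with `average='weighted'`: each class's recall is weighted by its number of true examples, so the weights cancel against the recall denominators and the sum reduces to overall accuracy. The check below is a small pure-Python sketch on made-up labels (not data from this run); `accuracy` and `weighted_recall` are illustrative helpers, not the scikit-learn implementations.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights equal to each class's support."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    weighted_sum = sum(support[c] * (correct[c] / support[c]) for c in support)
    return weighted_sum / len(y_true)


# Made-up labels for three classes with unequal support.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

print(accuracy(y_true, y_pred))
print(weighted_recall(y_true, y_pred))
```

Weighted precision and F1 do not simplify this way, which is why those columns differ from Accuracy.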




TrainOutput(global_step=1425, training_loss=3.408562654194079, metrics={'train_runtime': 2483.4649, 'train_samples_per_second': 9.181, 'train_steps_per_second': 0.574, 'total_flos': 6006957498777600.0, 'train_loss': 3.408562654194079, 'epoch': 3.0})
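
The `global_step=1425` in the output and the `No log` entry for epoch 1 both follow from the batch size. The arithmetic below assumes a single device with no gradient accumulation, and relies on the `TrainingArguments` default of `logging_steps=500`, which the training cell does not override.

```python
import math

train_examples = 7600   # rows in the train split
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs
logging_steps = 500     # TrainingArguments default, not set in the notebook

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# The training loss is first logged at step 500, which falls after the end of epoch 1.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)
```

Since the first epoch ends before the first logging step, no training loss is available when epoch 1 is evaluated. Passing a smaller `logging_steps` to `TrainingArguments` should make a value appear there.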

## Inference

### Subtask:
Use the fine-tuned BERT model to classify the intent of a new utterance.

**Reasoning**:
Define a function to predict the intent of a given text using the trained model and tokenizer, then test this function with an example utterance.

In [7]:
import torch

def predict_intent(text):
    # Switch off dropout so predictions are deterministic
    model.eval()

    # Tokenize the input text; a single utterance needs no padding
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # Move inputs to the same device as the model
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}

    # Perform inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted class ID from the logits of the single input
    logits = outputs.logits
    predicted_class_id = logits.argmax(dim=-1).item()

    # Map the class ID back to its intent name using the dataset features
    predicted_intent = dataset['train'].features['intent'].int2str(predicted_class_id)

    return predicted_intent

# Test with an example utterance
example_utterance = "What is the weather like today?"
predicted_intent = predict_intent(example_utterance)
print(f"The predicted intent for '{example_utterance}' is: {predicted_intent}")

example_utterance_2 = "Can you tell me a joke?"
predicted_intent_2 = predict_intent(example_utterance_2)
print(f"The predicted intent for '{example_utterance_2}' is: {predicted_intent_2}")

The predicted intent for 'What is the weather like today?' is: weather
The predicted intent for 'Can you tell me a joke?' is: tell_joke
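
`predict_intent` returns only the top label. When a confidence score is also useful, for example to see how strongly the model prefers `oos` for an unfamiliar utterance, the logits can be passed through a softmax. The following is a library-free sketch on made-up logits for three of the labels; `softmax` and `top_prediction` are hypothetical helpers, and in the notebook the same step could be done with `torch.softmax(logits, dim=-1)`.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for three labels; real logits would have one entry per intent class.
labels = ["weather", "tell_joke", "oos"]
hypothetical_logits = [4.0, 1.0, 0.5]

label, confidence = top_prediction(hypothetical_logits, labels)
print(label, round(confidence, 3))
```

A softmax probability is not a calibrated measure of certainty, so any threshold applied to it would need to be tuned on the validation split.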
