<a href="https://colab.research.google.com/github/priyanka011011/NextWordSuggestion/blob/main/Bert_empathetic_dialogues.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Next Word Suggestion**

### **Overview**
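Next word suggestion is treated here as a masked language modelling (MLM) task: the model receives a sentence in which one token is replaced by `[MASK]` and predicts the most likely words for that position.

**Sample sentence:**

     we are [MASK].

**Possible completions:**

     we are eating.

     we are dancing.

     we are playing.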
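
To make the idea concrete, the following minimal sketch ranks candidate words for the masked position by applying a softmax to per-word scores. The candidate list and the scores are made-up illustrative values, not outputs of any real model, and `softmax` and `top_k_suggestions` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_suggestions(candidates, scores, k=3):
    """Return the k most probable candidates with their probabilities."""
    probs = softmax(scores)
    ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores for the [MASK] in "we are [MASK]." (illustrative only)
candidates = ["eating", "dancing", "playing", "table"]
toy_scores = [2.0, 1.5, 1.0, -3.0]

suggestions = top_k_suggestions(candidates, toy_scores, k=3)
for word, prob in suggestions:
    print(f"we are {word}.  (p={prob:.2f})")
```

A real model scores every token in its vocabulary in the same way; the fill-mask pipeline used at the end of this notebook reports the highest-ranked ones.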


### **Methodology**
In this project, we take a pre-trained model from Hugging Face and fine-tune it on the Empathetic Dialogues dataset. The model used is `bert-base-uncased`.

### **AI Application**


*   NLP

### **Business Segments**


*   Lifestyle & Social Media, Media & Publishing
*   Business & Private Sector

### **Data**
 [Dataset link](https://huggingface.co/datasets/empathetic_dialogues)












In [5]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
!pip install datasets transformers accelerate


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## **IMPORT LIBRARIES**

In [7]:
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments
from torch.utils.data import DataLoader
from transformers import default_data_collator
from torch.optim import AdamW
from accelerate import Accelerator
from transformers import get_scheduler
from tqdm.auto import tqdm
import torch
import math
from transformers import pipeline


## **LOAD THE DATASET**

In [8]:
dataset = load_dataset("empathetic_dialogues", split='train')



## **PRE-PROCESSING THE DATA**

In [9]:
# Keep only the 'utterance' column and rename it to 'text'
dataset = dataset.remove_columns([col for col in dataset.column_names if col != 'utterance'])
dataset = dataset.rename_column('utterance', 'text')

## **MODEL**



### **BERT-BASE-UNCASED** (a pre-trained model from Hugging Face)

In [10]:
model_checkpoint = "bert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
bert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> BERT number of parameters: {round(bert_num_parameters)}M'")

'>>> BERT number of parameters: 110M'


In [12]:
text = "I love [MASK]."

## **Loading the AutoTokenizer**

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [14]:
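def tokenize_function(examples):
    # Pad or truncate every utterance to 512 tokens. Note that with short
    # utterances, the 128-token chunks built later will contain mostly [PAD].
    result = tokenizer(examples["text"], max_length=512,
                       padding='max_length', truncation=True, return_tensors="pt")
    # Fast tokenizers can map each token back to the word it came from
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result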

In [15]:
tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=["text"]
)
tokenized_datasets



Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    num_rows: 76673
})

In [16]:
tokenizer.model_max_length

512

## **Creating Chunks of Data**

In [17]:
chunk_size = 128


In [18]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets



Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
    num_rows: 306692
})
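
The effect of `group_texts` is easiest to see on a toy input. The sketch below applies the same concatenate-then-chunk logic to plain Python lists, with a chunk size of 4 instead of 128. `chunk_batch`, `toy_chunk_size` and `toy_batch` are illustrative names that do not appear in the notebook.

```python
def chunk_batch(examples, chunk_size):
    """Concatenate every column of a batch, then split it into equal chunks."""
    # Flatten each column: [[1, 2, 3], [4, 5]] -> [1, 2, 3, 4, 5]
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so every chunk has exactly chunk_size tokens
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # For masked language modelling the labels start as a copy of the inputs
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result

toy_chunk_size = 4
toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
chunks = chunk_batch(toy_batch, toy_chunk_size)
print(chunks["input_ids"])  # two full chunks; tokens 9 and 10 are dropped
```

In the notebook every utterance is padded to 512 tokens before chunking, so each one contributes exactly 512 / 128 = 4 chunks, which is why 76,673 rows become 306,692.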

## **Downsampling the dataset using train_test_split**

In [19]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets.train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset



DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [20]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
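
The sketch below illustrates in plain Python what random masking with `mlm_probability=0.15` amounts to, assuming the collator follows the standard BERT recipe: each non-special token is selected with probability 0.15, and of the selected tokens about 80% become `[MASK]`, 10% become a random token and 10% stay unchanged. `mask_tokens` is a hypothetical helper written for illustration, not the library's implementation.

```python
import random

MASK_ID = 103        # id of [MASK] in the bert-base-uncased vocabulary
IGNORE_INDEX = -100  # label value that the loss function skips

def mask_tokens(input_ids, special_ids, vocab_size, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; skip most ordinary tokens as well
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

# Toy sequence: [CLS] + 200 arbitrary token ids + [SEP]
toy_ids = [101] + list(range(2000, 2200)) + [102]
masked_ids, labels = mask_tokens(
    toy_ids, special_ids={0, 101, 102}, vocab_size=30522, rng=random.Random(0)
)
num_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{num_selected} of {len(toy_ids)} tokens selected for prediction")
```

Because the masking is random, applying it on the fly during training gives the model a different set of masked positions each time it sees the same chunk.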

## **Training Arguments from Transformers**

In [22]:
batch_size = 32
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

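# Note: these arguments are kept for reference; the custom Accelerate
# training loop later in this notebook does not read them.
training_args = TrainingArguments(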
    output_dir=f"{model_name}-finetuned-empathetic dialogues",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

In [23]:
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
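
The first line of `insert_random_mask` converts a batch stored as a dictionary of columns into a list of per-example dictionaries, which is the shape a collator expects. A small sketch with made-up token ids shows the transformation; `toy_batch` is an illustrative name.

```python
# A batch as a dict of columns, with two examples (toy values)
toy_batch = {
    "input_ids": [[101, 2057, 102], [101, 2024, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# zip(*values) walks the columns in parallel, yielding one tuple per example;
# zipping the keys with each tuple rebuilds a dict for that example.
features = [dict(zip(toy_batch, values)) for values in zip(*toy_batch.values())]

for feature in features:
    print(feature)
```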

In [24]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


## **Training the model**

In [25]:
batch_size = 32
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [26]:
optimizer = AdamW(model.parameters(), lr=5e-5)

## **Training With Accelerator**

In [27]:
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

## **Scheduling Learning Rate**

In [28]:
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
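
As a rough illustration of the learning rate this produces, the sketch below assumes the `"linear"` schedule means a linear warmup (none here) followed by a linear decay to zero over `num_training_steps`. `linear_lr` is a hypothetical helper, not part of the library.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step under linear warmup and linear decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from base_lr to 0
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
steps_per_epoch = math.ceil(10_000 / 32)  # 10,000 training chunks, batch size 32
total_steps = 3 * steps_per_epoch         # 3 epochs

for step in (0, total_steps // 2, total_steps):
    print(f"step {step:4d}: lr = {linear_lr(step, base_lr, total_steps):.2e}")
```

With 313 batches per epoch, three epochs give 939 optimizer steps in total.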

In [29]:
output_dir = "bert-base-uncased-finetuned-empathetic dialogues"

In [30]:
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        # The collated training batch already has exactly the keys the model expects
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        # Drop the leftover masked_token_type_ids column, which the model does not accept
        inputs = {k: v for k, v in batch.items() if k != "masked_token_type_ids"}
        with torch.no_grad():
            # Use the updated inputs dictionary
            outputs = model(**inputs)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)


  0%|          | 0/939 [00:00<?, ?it/s]

>>> Epoch 0: Perplexity: 7.321173127004363
>>> Epoch 1: Perplexity: 6.316510675379457
>>> Epoch 2: Perplexity: 6.269209842800283

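The perplexity reported above is the exponential of the mean evaluation loss. The sketch below reproduces that calculation on a few hypothetical loss values and inverts it for the final-epoch figure. `perplexity_from_losses` and `toy_losses` are illustrative names.

```python
import math

def perplexity_from_losses(losses):
    """Perplexity is exp(mean cross-entropy loss); very large losses map to infinity."""
    mean_loss = sum(losses) / len(losses)
    try:
        return math.exp(mean_loss)
    except OverflowError:
        return float("inf")

# Hypothetical per-batch losses (illustrative only)
toy_losses = [2.1, 1.9, 2.0]
ppl = perplexity_from_losses(toy_losses)
print(f"toy perplexity: {ppl:.3f}")

# Inverting the formula: the mean loss implied by the final-epoch perplexity
implied_loss = math.log(6.269209842800283)
print(f"implied mean loss: {implied_loss:.2f}")
```

Read this way, the final-epoch perplexity of about 6.27 corresponds to a mean evaluation loss of roughly 1.84, and a lower perplexity means the model assigns higher probability to the correct masked tokens.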

## **Making Predictions for the [MASK] Token**

In [32]:
mask_filler = pipeline(
    "fill-mask", model="/content/bert-base-uncased-finetuned-empathetic dialogues"
)

In [35]:
import pandas as pd

def predict(sample=''):
    """Print and return the top fill-mask predictions for a sentence containing [MASK]."""
    preds = mask_filler(sample)
    df = pd.DataFrame(preds)
    # Round the probabilities to two decimals for readability
    df['score'] = df['score'].round(2)
    print(f'Sample Text: {sample}')
    print('-'*50)
    print('Model Predictions')
    print('-'*50)
    print(df)
    print('-'*50)
    return df

In [38]:
df = predict(sample='I am [MASK] to the shop.')

Sample Text: I am [MASK] to the shop.
--------------------------------------------------
Model Predictions
--------------------------------------------------
   score  token token_str                   sequence
0   0.67   2183     going    i am going to the shop.
1   0.23   3753    headed   i am headed to the shop.
2   0.02   2746    coming   i am coming to the shop.
3   0.02   4439   driving  i am driving to the shop.
4   0.01   5825   heading  i am heading to the shop.
--------------------------------------------------


In [41]:
df2 = predict(sample='I am [MASK] in the park.')

Sample Text: I am [MASK] in the park.
--------------------------------------------------
Model Predictions
--------------------------------------------------
   score  token token_str                   sequence
0   0.33   3788   walking  i am walking in the park.
1   0.08   2025       not      i am not in the park.
2   0.06   2041       out      i am out in the park.
3   0.03   9083    parked   i am parked in the park.
4   0.03   2894     alone    i am alone in the park.
--------------------------------------------------
