## Training a Chatbot using GPT-2

This code example demonstrates how to fine-tune the GPT-2 language model as a chatbot. GPT-2, which stands for "Generative Pre-trained Transformer 2," is a language model released by OpenAI in 2019. It is based on the Transformer architecture and was pre-trained on a large corpus of diverse text data.

Language models such as GPT-2 are trained to predict the next token in a sequence of text given the preceding context. By pre-training on a large amount of text, they learn to generate coherent and contextually relevant text. They can then be fine-tuned on a specific task, such as a chatbot, to generate responses to user input.
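
To make the next-token objective concrete, here is a minimal sketch. It uses a whitespace split in place of a real tokenizer, which is an assumption for illustration only (GPT-2 itself uses byte-pair encoding). Each prefix of the sequence is paired with the token that follows it.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace splitting stands in for a real tokenizer here.
tokens = "hello how are you".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

During training, the model is asked to assign high probability to each target given its context.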

Hugging Face's `transformers` library provides a convenient way to access and use pre-trained language models, including GPT-2. These pre-trained models are available for various natural language processing tasks and architectures, allowing developers to leverage their power and versatility.

In this code example, the `GPT2LMHeadModel` class from the `transformers` library is used to initialize the GPT-2 model. It is the GPT-2 Transformer with a language modeling head on top (a linear layer whose weights are tied to the input embeddings), which makes it suitable for text generation.

The GPT-2 model is initialized with pre-trained weights from Hugging Face's model repository. The `from_pretrained()` method downloads and loads the weights, so the model can be used out of the box. The `"gpt2"` variant chosen here is the smallest GPT-2 model, with about 124 million parameters (the original release quoted 117 million). Larger variants, such as `"gpt2-medium"`, `"gpt2-large"`, and `"gpt2-xl"`, are also available and can be chosen according to the task requirements and available computational resources.
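
As a rough check on model size, the parameter count of the base checkpoint can be tallied from its published hyperparameters (a vocabulary of 50,257 tokens, 1,024 positions, hidden size 768, and 12 layers). The sketch below assumes the standard GPT-2 layout with tied input and output embeddings; the function name `gpt2_param_count` is illustrative. It also shows how little the count grows when four tokens are added to the vocabulary, as this example does later, assuming all four are new to the vocabulary.

```python
# Published hyperparameters of the base "gpt2" checkpoint.
VOCAB_SIZE = 50257
N_POSITIONS = 1024
N_EMBD = 768
N_LAYER = 12


def gpt2_param_count(vocab_size, n_positions=N_POSITIONS, n_embd=N_EMBD, n_layer=N_LAYER):
    """Count parameters of a GPT-2 style model with tied input/output embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Attention: fused query/key/value projection, then the output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward block: expand to 4x the hidden size, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two layer norms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * n_embd)
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * per_layer + final_layer_norm


base = gpt2_param_count(VOCAB_SIZE)
# Four extra tokens: <pad>, <startofstring>, <endofstring>, and <bot>:
resized = gpt2_param_count(VOCAB_SIZE + 4)
print(base, resized - base)
```

The tally comes to 124,439,808 parameters, and the four extra tokens add only 4 × 768 = 3,072 embedding weights.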

Hugging Face's `transformers` library also provides pre-trained tokenizers, like the `GPT2Tokenizer` class, that are specifically designed for each model architecture. These tokenizers handle the tokenization and encoding of text, enabling you to easily prepare input data for the GPT-2 model.

In summary, this example fine-tunes the GPT-2 model from Hugging Face's `transformers` library on chat data so that it generates a reply to a user's message. The sections below cover the dataset class, the training script, and interactive use of the trained model.


### ChatData Class

The `ChatData` class is a custom class that extends the `Dataset` class from the `torch.utils.data` module. It is designed to handle the chat data for training a chatbot model.

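```python
import json

from torch.utils.data import Dataset


class ChatData(Dataset):
    def __init__(self, path: str, tokenizer):
        # Load the raw conversations; the context manager closes the file.
        with open(path, "r") as f:
            self.data = json.load(f)

        # Flatten every utterance of every dialog into one list.
        self.X = []
        for i in self.data:
            for j in i['dialog']:
                self.X.append(j['text'])

        # Pair each utterance with the one that follows it. The last entry
        # has no successor and is left unchanged.
        for idx in range(len(self.X) - 1):
            self.X[idx] = "<startofstring> " + self.X[idx] + " <bot>: " + self.X[idx+1] + " <endofstring>"

        # Keep only the first 5000 examples to bound training time.
        self.X = self.X[:5000]

        # Tokenize into fixed-length tensors of 40 tokens per example.
        self.X_encoded = tokenizer(self.X, max_length=40, truncation=True, padding="max_length", return_tensors="pt")
        self.input_ids = self.X_encoded['input_ids']
        self.attention_mask = self.X_encoded['attention_mask']

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return (self.input_ids[idx], self.attention_mask[idx])
```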

The `ChatData` class has the following components:

- **Constructor (`__init__`):** The constructor initializes the `ChatData` object. It takes two parameters: `path`, which represents the path to the chat data JSON file, and `tokenizer`, which is an instance of the tokenizer used for encoding the text.

- **Loading and Preprocessing Data:** Inside the constructor, the chat data is loaded from the JSON file using `json.load()`, and stored in the `self.data` attribute.

  The text from the chat data is extracted and stored in the `self.X` list. This is done by iterating over the `self.data` object, accessing the 'dialog' field, and appending the 'text' field of each dialog entry to the `self.X` list.

  The chat data is then preprocessed by combining each entry with the one that follows it into a single training string. The `<startofstring>`, `<bot>:`, and `<endofstring>` tags mark the start of the example, the start of the bot's reply, and the end of the example, respectively. The following entry is read with the `self.X[idx+1]` expression inside the `for` loop. Because the entries form one flat list, a pair can span two different conversations, and the final entry has no successor and is left unchanged.

  The number of entries in `self.X` is limited to 5000 by assigning `self.X = self.X[:5000]`. This caps the amount of data used for training. You can adjust this number based on your dataset size and resource availability.

- **Encoding and Tokenization:** The text data in `self.X` is tokenized and encoded by calling the `tokenizer` object with the following parameters: `max_length=40` to limit the length of the encoded sequences, `truncation=True` to truncate longer sequences, `padding="max_length"` to pad shorter sequences up to that length, and `return_tensors="pt"` to return PyTorch tensors.

  The encoded input sequences and attention masks are stored in the `self.X_encoded` dictionary. The input IDs and attention masks are extracted from `self.X_encoded` and stored in the `self.input_ids` and `self.attention_mask` attributes, respectively.

- **Data Access Methods (`__len__` and `__getitem__`):** The `__len__` method returns the length of the dataset, which corresponds to the number of chat data entries in `self.X`. The `__getitem__` method is used to access individual data items given an index (`idx`). It returns a tuple containing the input IDs and attention masks corresponding to the given index.
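
The effect of `truncation=True` and `padding="max_length"` can be pictured with plain lists. This is a minimal sketch of the idea, not the tokenizer's actual implementation; the helper name `pad_or_truncate`, the token IDs, and `pad_id=0` are all made up for illustration. Padding is added on the right, which is the default side for the GPT-2 tokenizer.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # number of pad slots to append
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token IDs; pad_id=0 is an arbitrary choice for this sketch.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions hold real tokens (1) and which are padding (0).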

The `ChatData` class facilitates the preparation and access of the chat data for training a chatbot model. It encapsulates the loading, preprocessing, encoding, and tokenization steps required to prepare the data for model training.
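
The pairing loop in the constructor works on one flattened list, so an utterance at the end of one conversation is paired with the opening of the next. A possible alternative, sketched here on hypothetical toy data in the same JSON shape, pairs turns only within each dialog. The helper name `build_examples` is illustrative and not part of the original code.

```python
def build_examples(data):
    """Pair each utterance with the reply that follows it in the same dialog."""
    examples = []
    for conversation in data:
        texts = [turn["text"] for turn in conversation["dialog"]]
        # zip stops at the shorter input, so the last turn is never a prompt.
        for prompt, reply in zip(texts, texts[1:]):
            examples.append(f"<startofstring> {prompt} <bot>: {reply} <endofstring>")
    return examples


# Toy data mirroring the structure that ChatData reads: a list of conversations,
# each holding a "dialog" list of {"text": ...} entries.
toy_data = [
    {"dialog": [{"text": "hi"}, {"text": "hello there"}, {"text": "how are you"}]},
    {"dialog": [{"text": "any hobbies"}, {"text": "i am a huge gamer"}]},
]
examples = build_examples(toy_data)
for example in examples:
    print(example)
```

With this variant, "how are you" is never paired with "any hobbies", because the two utterances belong to different dialogs.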



```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from ChatData import ChatData
from torch.optim import Adam
from torch.utils.data import DataLoader
import tqdm
import torch

def train(chatData, model, optim):

    epochs = 12

    for i in tqdm.tqdm(range(epochs)):
        for X, a in chatData:
            X = X.to(device)
            a = a.to(device)
            optim.zero_grad()
            # Passing the inputs as labels makes the model compute the
            # next-token prediction loss itself.
            loss = model(X, attention_mask=a, labels=X).loss
            loss.backward()
            optim.step()
        # Checkpoint the weights and print a sample reply after each epoch.
        torch.save(model.state_dict(), "model_state.pt")
        print(infer("hello how are you"))

def infer(inp):
    # Format the prompt the same way as the training examples.
    inp = "<startofstring> " + inp + " <bot>: "
    inp = tokenizer(inp, return_tensors="pt")
    X = inp["input_ids"].to(device)
    a = inp["attention_mask"].to(device)
    output = model.generate(X, attention_mask=a)
    output = tokenizer.decode(output[0])
    return output


device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"pad_token": "<pad>",
                                "bos_token": "<startofstring>",
                                "eos_token": "<endofstring>"})
tokenizer.add_tokens(["<bot>:"])

model = GPT2LMHeadModel.from_pretrained("gpt2")
# The vocabulary grew, so the embedding matrix has to grow with it.
model.resize_token_embeddings(len(tokenizer))

model = model.to(device)

# print(tokenizer.decode(model.generate(**tokenizer("hey i was good at basketball but ",
#                          return_tensors="pt"))[0]))

chatData = ChatData("chat_data.json", tokenizer)
chatData = DataLoader(chatData, batch_size=64)

model.train()

optim = Adam(model.parameters(), lr=1e-3)

print("training .... ")
train(chatData, model, optim)
```

```text
<startofstring> I love iphone! i just bought new iphone! <bot>: Thats good for you, i'm not very into new tech <endofstring>
training .... 
  0%| 0/12 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  8%| 1/12 [01:01<11:17, 61.60s/it]
<startofstring> hello how are you <bot>: <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 17%| 2/12 [02:03<10:16, 61.66s/it]
<startofstring> hello how are you <bot>: Hi <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 25%| 3/12 [03:05<09:16, 61.83s/it]
<startofstring> hello how are you <bot>: Hi <bot>: Hi <bot>: i am a huge gamer <endofstring> <pad> <pad> <pad> <pad>
 33%| 4/12 [04:07<08:15, 61.94s/it]
<startofstring> hello how are you <bot>: Hi <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 42%| 5/12 [05:09<07:14, 62.03s/it]
<startofstring> hello how are you <bot>: Hi, how are doing? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 50%| 6/12 [06:11<06:12, 62.03s/it]
<startofstring> hello how are you <bot>: i am a huge gamer <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 58%| 7/12 [07:13<05:10, 62.10s/it]
<startofstring> hello how are you <bot>: Hi, how are doing? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 67%| 8/12 [08:16<04:08, 62.19s/it]
<startofstring> hello how are you <bot>: Hi <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 75%| 9/12 [09:18<03:06, 62.20s/it]
<startofstring> hello how are you <bot>: Hi, how are doing? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 83%| 10/12 [10:20<02:04, 62.18s/it]
<startofstring> hello how are you <bot>: Hi, how are doing? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
 92%| 11/12 [11:22<01:02, 62.21s/it]
<startofstring> hello how are you <bot>: i am a huge gamer <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
100%| 12/12 [12:25<00:00, 62.10s/it]
<startofstring> hello how are you <bot>: Hi <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
```

### Code Explanation

The code performs the following steps:

1. **Import Libraries:** The necessary libraries and modules are imported, including `GPT2LMHeadModel`, `GPT2Tokenizer`, `ChatData`, `Adam`, `DataLoader`, `tqdm`, and `torch`.

2. **Define Training Function:** The `train` function is defined to train the chatbot model. It takes `chatData` (the batched chat data), `model` (the GPT-2 model), and `optim` (the optimizer) as inputs. Inside the function, the number of epochs is set to 12. Each epoch iterates over the batches, runs the forward pass to compute the loss, backpropagates it, and updates the model's weights with an optimizer step. The model's state is saved after each epoch, and the `infer` function is called to generate a response to the input "hello how are you" so that progress can be inspected.

3. **Define Inference Function:** The `infer` function is defined to generate responses given an input string (`inp`). It prepares the input by adding special tokens and encoding it using the tokenizer. The input is then passed to the model for generation. The generated output is decoded using the tokenizer and returned as the response.

4. **Set Device:** The `device` variable is set to "cuda" if a CUDA-capable GPU is available, "mps" if Apple's Metal Performance Shaders (MPS) backend is available, or "cpu" otherwise.

5. **Initialize Tokenizer:** The GPT-2 tokenizer is initialized using the `GPT2Tokenizer.from_pretrained()` method with the "gpt2" variant. The tokens `<pad>`, `<startofstring>`, and `<endofstring>` are registered as the padding, beginning-of-sequence, and end-of-sequence special tokens, and `<bot>:` is added as an ordinary extra token.

6. **Initialize Model:** The GPT-2 model is initialized using the `GPT2LMHeadModel.from_pretrained()` method with the "gpt2" variant. The token embeddings are resized to match the tokenizer's vocabulary size using the `resize_token_embeddings()` method. The model is then moved to the chosen device.

7. **Prepare Chat Data:** The `ChatData` class is initialized, taking the path to a JSON file containing the chat data and the tokenizer instance. This class manages the chat data for training.

8. **Data Loading:** A `DataLoader` is created to load the chat data in batches for efficient training. The batch size is set to 64.

9. **Set Model in Training Mode:** The model is put into training mode by calling `model.train()`.

10. **Initialize Optimizer:** An Adam optimizer is initialized with a learning rate of 1e-3, taking the model's parameters as input.

11. **Start Training:** The training process is initiated by calling the `train` function, passing the chat data, model, and optimizer as arguments. The progress is displayed using the `tqdm` library.
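
To get a feel for how much work each epoch involves, the batching performed by the `DataLoader` can be imitated with plain slicing. This is a minimal sketch that assumes the dataset holds the full 5000 entries and that no shuffling is used (as in the code above); `iter_batches` is an illustrative helper, not a PyTorch function.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items, in order."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


dataset_size = 5000   # cap applied in ChatData
batch_size = 64       # value passed to DataLoader
epochs = 12           # value set in train()

batches = list(iter_batches(list(range(dataset_size)), batch_size))
steps_per_epoch = math.ceil(dataset_size / batch_size)
print(steps_per_epoch, len(batches[-1]), steps_per_epoch * epochs)
```

Under these assumptions each epoch consists of 79 optimizer steps, the last on a partial batch of 8 examples, for 948 steps in total.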

Running the code trains the GPT-2 model on the provided chat data and prints a sample response after each epoch.
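
The training call passes `labels=X`, so padded positions also contribute to the loss, which may be one reason `<pad>` shows up in the sampled replies. A common refinement, not used in the original code, is to set the label at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default and that Hugging Face models also skip in `labels`. The sketch below shows the idea on plain lists with made-up token IDs; with tensors, the same effect could be obtained with `X.masked_fill(a == 0, -100)`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids as labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


# Hypothetical IDs: 50257 stands in for the <pad> token added to the tokenizer.
input_ids = [50258, 31373, 703, 50260, 17250, 50259, 50257, 50257]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Only the positions holding real tokens then influence the gradient.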

### Post-Training: Using the Fine-Tuned GPT-2 Model

Now that we have trained our model on a GPU, we can proceed to interact with the fine-tuned GPT-2 model using the `infer` function and our custom data.

To chat with the model, you can call the `infer` function and provide it with a user input. The function will process the input, generate a response using the fine-tuned GPT-2 model, and return the generated response.
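
Unless sampling or beam search is requested, `generate` decodes greedily: it repeatedly appends the most likely next token until an end-of-sequence token appears or a length limit is reached. The sketch below imitates that loop with a hypothetical lookup table standing in for the model; `next_token_table`, `toy_model`, and `greedy_generate` are illustrative names and not part of `transformers`. The warning in the training output shows generation using ID 50256 as the end-of-sequence token, which is GPT-2's original `<|endoftext|>` rather than the newly added `<endofstring>`; this may explain why replies run on into `<pad>` tokens.

```python
def greedy_generate(next_token, prompt, eos_token, max_new_tokens=20):
    """Append tokens one at a time until eos_token or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        tokens.append(token)
        if token == eos_token:
            break
    return tokens


# Hypothetical stand-in for a model: maps the last token to the most likely next one.
next_token_table = {
    "<bot>:": "Hi",
    "Hi": "<endofstring>",
    "<endofstring>": "<pad>",
    "<pad>": "<pad>",
}


def toy_model(tokens):
    return next_token_table[tokens[-1]]


prompt = ["<startofstring>", "hello", "<bot>:"]
stopped = greedy_generate(toy_model, prompt, eos_token="<endofstring>")
# If the configured stop token is never emitted, decoding runs to the limit.
unstopped = greedy_generate(toy_model, prompt, eos_token="<|endoftext|>", max_new_tokens=5)
print(stopped)
print(unstopped)
```

In the first call decoding stops right after `<endofstring>`; in the second it continues into padding until the token budget is used up.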

Make sure that the fine-tuned model is loaded and the `infer` function is accessible in the same environment. 

Additionally, if you want to check the GPU status, you can open the terminal in Jupyter Notebook and run the `nvidia-smi` command. This command displays information about the NVIDIA GPUs installed in your system, including their status, memory usage, and any running processes.

With the trained weights in memory, `infer` can be called in a loop to hold a simple conversation with the fine-tuned model, as the next cell does.

Please note that the specific commands and setup may vary depending on your environment and the resources available.

Enjoy chatting with your fine-tuned GPT-2 model!


```python
import re

print("infer from model:")
while True:
    inp = input()
    response = infer(inp)  # Replace `infer` with the appropriate function for generating chatbot responses

    # Extract bot's response between <bot> and <endofstring> tags
    pattern = r"<bot>: (.*?) <endofstring>"
    matches = re.search(pattern, response, re.DOTALL)
    if matches:
        bot_response = matches.group(1).strip()
    else:
        bot_response = ""  # Empty response if no match is found

    print("Bot's response:", bot_response)
```


```text
infer from model:
hi
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Bot's response: hi!

how are you?
Bot's response: I am not sure what i am a very experienced person.

tell me about yourself
Bot's response: i am not sure. i am a very experienced person.

how can you help me
Bot's response: i am not sure it is. i am a huge gamer

what are your favorite games
Bot's response: i am a huge gamer

KeyboardInterrupt: Interrupted by user
```
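
The extraction step in the chat loop can be isolated into a small function and tried on decoded strings like those printed during training. This is a minimal sketch using the same regular expression as the loop; the function name `extract_bot_response` is illustrative.

```python
import re

# Same pattern as in the chat loop: lazily capture everything between the tags.
BOT_REPLY = re.compile(r"<bot>: (.*?) <endofstring>", re.DOTALL)


def extract_bot_response(decoded):
    """Return the text between '<bot>: ' and ' <endofstring>', or '' if absent."""
    match = BOT_REPLY.search(decoded)
    return match.group(1).strip() if match else ""


complete = "<startofstring> hello how are you <bot>: Hi, how are doing? <endofstring> <pad> <pad>"
unfinished = "<startofstring> hello how are you <bot>: Hi <pad> <pad> <pad>"
print(repr(extract_bot_response(complete)))
print(repr(extract_bot_response(unfinished)))
```

When the model never emits `<endofstring>`, as in the second string, the pattern does not match and the function returns an empty string, which is why the loop can print an empty reply.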