***Fundamentals of Artificial Intelligence***

> **Lab 6:** *Natural Language Processing and Chat Bots* <br>

> **Performed by:** *Dan Hariton*, group *FAF-211* <br>

> **Verified by:** Elena Graur, asist. univ.

#### Imports

In [2]:
import os
import pandas as pd
import torch
from torch import optim
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration
from nltk.translate.bleu_score import sentence_bleu
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from warnings import filterwarnings

MODEL = 't5-base'
SOURCE_LEN = 512
TARGET_LEN = 512

filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device}")

Using cpu


#### Task 1

Set up the Telegram Bot. Interact with BotFather on Telegram to obtain an API token. Create your Telegram Bot (its name should follow the pattern FIA_Surname_Name_FAF_21x). Make sure you are able to receive and send requests to it.

1. Bot link: [FIA-Hariton-Dan-FAF-211](https://t.me/CityGuideHD_bot)

2. Run `app.py` to start the bot.
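
`app.py` is not included in this report. As a rough sketch of what sending and receiving requests involves, the Telegram Bot API exposes methods such as `getUpdates` and `sendMessage` under `https://api.telegram.org/bot<token>/<method>`. The helper name `build_url` and the token used here are placeholders for illustration and are not taken from `app.py`.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.telegram.org"


def build_url(token, method, **params):
    """Build the URL for a Telegram Bot API method call (hypothetical helper)."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{API_ROOT}/bot{token}/{method}{query}"


# Placeholder token; a real one is issued by BotFather and should stay secret.
poll_url = build_url("123:ABC", "getUpdates", timeout=30, offset=5)
send_url = build_url("123:ABC", "sendMessage", chat_id=42, text="Hello there")
```

Fetching `poll_url` would return pending updates as JSON, and fetching `send_url` would deliver the text to the given chat.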

#### Task 2

Create a dataset that will serve as a training set for your model. It should follow the rules:
- an entry consists of two parts: the question and the answer;
- there are at least 75 entries written by you in your dataset;
- questions should be something tourists or locals can ask about a new city.

You can increase your dataset by adding open-source data. However, you MUST clearly show the questions written by you. Split your dataset into train and validation.

*Hint: it is recommended to split it into 80% and 20%, but you can adjust it according to your needs.*

#### Dataset

In [3]:
dataset = pd.read_csv("data.csv")

questions = dataset['question'].tolist()
answers = dataset['answer'].tolist()
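
Since every entry must contain both a question and an answer, it can help to drop incomplete rows before training. The following is a minimal sketch using only the standard library. The column names `question` and `answer` match `data.csv` as used above, while the function name `load_pairs` and the sample rows are made up for illustration.

```python
import csv
import io


def load_pairs(csv_text):
    """Return (question, answer) pairs, skipping rows where either part is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = []
    for row in reader:
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if question and answer:
            pairs.append((question, answer))
    return pairs


# Illustrative sample, not taken from the real dataset.
sample = "question,answer\nWhere is the museum?,Next to the park.\nEmpty answer?,\n"
pairs = load_pairs(sample)
```

A check such as `len(pairs) >= 75` could then confirm that the minimum number of entries is met.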

#### Task 3

Use Tensorflow or Pytorch to implement the architecture of the Neural Network you are planning to use. It is highly recommended to use a Seq2Seq model (implement an LSTM or GRU architecture). You are NOT allowed to use pre-built or existing solutions (yep, connecting to GPT will not work).

In [7]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def tokenize(data, max_len):
    return tokenizer(data, padding=True, truncation=True, return_tensors="pt", max_length=max_len)

In [8]:
class Seq2SeqDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs['input_ids'])

    def __getitem__(self, idx):
        return {
            'input_ids': self.inputs['input_ids'][idx],
            'attention_mask': self.inputs['attention_mask'][idx],
            'labels': self.targets['input_ids'][idx]
        }
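
Because the answers are tokenized with `padding=True`, the `labels` returned above contain padding ids (0 for T5). Hugging Face models skip label positions set to `-100` when computing the loss, so padding is often masked that way. This was not done in this lab. The sketch below shows the idea on plain lists, and the function name `mask_padding` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss computation skips


def mask_padding(label_ids, pad_token_id=0):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


# Two label sequences padded to the same length with id 0.
batch = [[363, 19, 1, 0, 0], [8, 1784, 13, 3411, 1]]
masked = mask_padding(batch)
```

With tensors, the same effect could be obtained with `labels.masked_fill(labels == tokenizer.pad_token_id, -100)`. Labels masked this way would likely need the padding id restored before being decoded in `evaluate`.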

In [9]:
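# Hold out 20% of the pairs for validation. The training set below is built from the
# full `questions`/`answers` lists, so the validation pairs are also seen during training.
_, questions_val, _, answers_val = train_test_split(questions, answers, test_size=0.2)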

tokenized_questions_train = tokenize(questions, SOURCE_LEN)
tokenized_answers_train = tokenize(answers, TARGET_LEN)
train_dataset = Seq2SeqDataset(tokenized_questions_train, tokenized_answers_train)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Save the tokenized training questions; check_tokens() later uses them to reject questions with unseen tokens
os.makedirs('models', exist_ok=True)
torch.save(tokenized_questions_train, 'models/used_tokens.pt')

tokenized_questions_val = tokenize(questions_val, SOURCE_LEN)
tokenized_answers_val = tokenize(answers_val, TARGET_LEN)
val_dataset = Seq2SeqDataset(tokenized_questions_val, tokenized_answers_val)
val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=True)
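
If the reported score should reflect unseen questions, the training and validation sets have to be disjoint. Here is a minimal sketch of an 80/20 split in plain Python. The function name `split_pairs` and the fixed seed are assumptions for illustration, and `train_test_split` can do the same when all four of its return values are used.

```python
import random


def split_pairs(questions, answers, val_fraction=0.2, seed=0):
    """Shuffle (question, answer) pairs and split them into disjoint train/val lists."""
    pairs = list(zip(questions, answers))
    random.Random(seed).shuffle(pairs)
    n_val = max(1, round(len(pairs) * val_fraction))
    return pairs[n_val:], pairs[:n_val]


# Dummy data standing in for the real question/answer lists.
demo_questions = [f"question {i}" for i in range(10)]
demo_answers = [f"answer {i}" for i in range(10)]
train_pairs, val_pairs = split_pairs(demo_questions, demo_answers)
```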

#### Task 4

Train your model and fine-tune it based on the chosen performance metrics.

In [10]:
class Seq2SeqModel(nn.Module):
    def __init__(self, model_name=MODEL):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

In [11]:
def train(epochs: int = 10, file_name: str = 'model.pth'):
    model = Seq2SeqModel().to(device)
    optimizer = optim.AdamW(model.parameters(), lr=5e-4)

    for epoch in range(epochs):
        model.train()
        loop = tqdm(train_dataloader, leave=True)

        for batch in loop:
            loop.set_description(f"Epoch {epoch}")

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Update progress bar
            loop.set_postfix(loss=loss.item())

    os.makedirs('models', exist_ok=True)
    torch.save(model.state_dict(), f'models/{file_name}')

    return model

In [12]:
def evaluate(model, val_loader):
    model.eval()
    total_bleu = 0
    num_samples = 0

    with torch.no_grad():
        loop = tqdm(val_loader, leave=True)
        for batch in loop:
            loop.set_description("Evaluating: ")

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Generate answers, then decode predictions and references back to text
            outputs = model.model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=TARGET_LEN)
            predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            references = tokenizer.batch_decode(labels, skip_special_tokens=True)

            for p, r in zip(predictions, references):
                total_bleu += sentence_bleu([r.split()], p.split())
                num_samples += 1

    return total_bleu / num_samples
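
To make the metric more concrete, the sketch below computes a simplified BLEU by hand: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is not the NLTK implementation. It uses n-grams up to length 2 and no smoothing, whereas `sentence_bleu` defaults to n-grams up to length 4. The function names and example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def simple_bleu(reference, hypothesis, max_n=2):
    """Geometric mean of n-gram precisions times the brevity penalty."""
    precisions = [modified_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Hypotheses no longer than the reference are penalized (factor 1 at equal length).
    if len(hypothesis) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(hypothesis))
    return penalty * math.exp(log_mean)


reference = "the museum is open daily".split()
score = simple_bleu(reference, "the museum is closed daily".split())
```

In this example 4 of 5 unigrams and 2 of 4 bigrams match the reference, so the score is the square root of 0.8 × 0.5.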

In [13]:
TRAINED_MODEL = 'model10base.pth'

model = train(10, TRAINED_MODEL)
bleu_score = evaluate(model, val_dataloader)

print(f"BLEU Score: {bleu_score}")

Epoch 0:   0%|          | 0/5 [00:00<?, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
Epoch 0: 100%|██████████| 5/5 [00:16<00:00,  3.22s/it, loss=2.54]
Epoch 1: 100%|██████████| 5/5 [00:12<00:00,  2.52s/it, loss=2.1] 
Epoch 2: 100%|██████████| 5/5 [00:12<00:00,  2.59s/it, loss=1.25]
Epoch 3: 100%|██████████| 5/5 [00:12<00:00,  2.49s/it, loss=1.03] 
Epoch 4: 100%|██████████| 5/5 [00:12<00:00,  2.40s/it, loss=0.832]
Epoch 5: 100%|██████████| 5/5 [00:11<00:00,  2.39s/it, loss=0.613]
Epoch 6: 100%|██████████| 5/5 [00:11<00:00,  2.38s/it, loss=0.507]
Epoch 7: 100%|██████████| 5/5 [00:12<00:00,  2.51s/it, loss=0.26] 
Epoch 8: 100%|██████████| 5/5 [00:12<00:00,  2.52s/it, loss=0.268]
Epoch 9: 100%|██████████| 5/5 [00:11<00:00,  2.37s/it, loss=0.161]
Evaluating: : 100%|██████████| 1/1 [00:01<00:00,  

BLEU Score: 0.75





#### Task 5

Integrate your model into your Telegram ChatBot, so that the sent messages are taken as input by the model and its output is sent back as a reply.

In [14]:
def check_tokens(tokens):
    # Accept a question only if every one of its token ids appeared in the training questions
    saved_tokens = torch.load('models/used_tokens.pt')['input_ids']
    return all(t in saved_tokens for t in tokens)
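
`check_tokens` reloads the saved tensor on every call and scans it once per token. A possible alternative, sketched here in plain Python, is to collect the known ids into a set once and look tokens up in it. The function names are hypothetical, and the ids are the ones printed for the test question in this report.

```python
def build_known_ids(rows):
    """Collect every token id that occurs in the tokenized training questions."""
    known = set()
    for row in rows:
        known.update(int(token) for token in row)
    return known


def unknown_ids(token_ids, known):
    """Return the token ids of a question that never occurred in training."""
    return [int(token) for token in token_ids if int(token) not in known]


# Stand-in for the saved training ids; 0 is the padding id.
known = build_known_ids([[363, 19, 8, 0], [13, 58, 1, 0]])
missing = unknown_ids([363, 19, 8, 1784, 13, 3411, 58, 1], known)
```

Here `missing` holds 1784 and 3411, the ids of "capital" and "Japan" that the report's output marks as `False`. Because `int()` also accepts single-element tensors, the same functions should work on the rows of `input_ids`.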

In [19]:
model = Seq2SeqModel().to(device)
model.load_state_dict(torch.load(f'models/{TRAINED_MODEL}', map_location=device))

def generate_answer(question, model, tokenizer):
    model.eval()
    input_ids = tokenizer(question, return_tensors="pt").input_ids.to(device)

    if not check_tokens(input_ids[0]):
        print(input_ids[0])

        for token in input_ids[0]:
            print(f"{token}: {tokenizer.decode(token)}: {check_tokens([token])}")

        return "Invalid tokens"

    output = model.model.generate(input_ids, max_length=TARGET_LEN)

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    return decoded

# test_question = "How do I get to the Hokage’s office?"
test_question = "What is the capital of Japan?"
display(generate_answer(test_question, model, tokenizer))

tensor([ 363,   19,    8, 1784,   13, 3411,   58,    1])
363: What: True
19: is: True
8: the: True
1784: capital: False
13: of: True
3411: Japan: False
58: ?: True
1: </s>: True


'Invalid tokens'
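
To use the model from the bot, each incoming message text has to be passed to `generate_answer` and the result sent back to the same chat. The sketch below shows only that mapping, from an update (in the JSON shape returned by the Bot API's `getUpdates`) to the parameters of a `sendMessage` call. `build_reply` and `fake_answer` are hypothetical names, and the real `app.py` may be organized differently.

```python
def build_reply(update, answer_fn):
    """Turn one Telegram update into sendMessage parameters, or None if it has no text."""
    message = update.get("message") or {}
    text = message.get("text")
    chat_id = (message.get("chat") or {}).get("id")
    if not text or chat_id is None:
        return None
    return {"chat_id": chat_id, "text": answer_fn(text)}


def fake_answer(question):
    """Stand-in for the model call, so the sketch runs without loading T5."""
    return f"You asked: {question}"


update = {"update_id": 1, "message": {"chat": {"id": 42}, "text": "Where is the museum?"}}
reply = build_reply(update, fake_answer)
```

In the notebook, `answer_fn` could be `lambda q: generate_answer(q, model, tokenizer)`.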

Acknowledgement: Corneliu Catabluga from 213 helped me with the T5 transformer by explaining how it works and how to use it as a Seq2Seq model.