***Fundamentals of Artificial Intelligence***

> **Lab 6:** *Natural Language Processing and Chat Bots* <br>

> **Performed by:** *Corneliu Catlabuga*, group *FAF-213* <br>

> **Verified by:** Elena Graur, asist. univ.

#### Imports

In [2]:
import os
import pandas as pd
import torch
from torch import optim
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration
from nltk.translate.bleu_score import sentence_bleu
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from warnings import filterwarnings

MODEL = 't5-base'
SOURCE_LEN = 512
TARGET_LEN = 512

filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device}")

Using cpu


#### Task 1

Set up the Telegram Bot. Interact with BotFather on Telegram to obtain an API token. Create your Telegram Bot (its name should follow the pattern FIA_Surname_Name_FAF_21x). Make sure you are able to receive and send requests to it.

1. Bot link: [FIA_Catlabuga_Corneliu_FAF_213](https://t.me/fiacorneliucatlabugafaf213bot)

2. Run `app.py` to start the bot.

#### Task 2

Create a dataset that will serve as a training set for your model. It should follow these rules:
- an entry consists of two parts: the question and the answer;
- there are at least 75 entries written by you in your dataset;
- questions should be something tourists or locals can ask about a new city.

You can increase your dataset by adding open-source data. However, you MUST clearly show the questions written by you. Split your dataset into train and validation.

*Hint: it is recommended to split it into 80% and 20%, but you can adjust it according to your needs.*
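
The 80/20 split from the hint can be sketched without extra dependencies as follows. The helper name `split_pairs`, the fixed seed, and the toy question/answer pairs are assumptions made for illustration; the notebook itself relies on `train_test_split` from scikit-learn.

```python
import random

def split_pairs(questions, answers, val_fraction=0.2, seed=42):
    """Shuffle question/answer pairs together and split them into train and validation."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    pairs = list(zip(questions, answers))
    random.Random(seed).shuffle(pairs)  # local RNG keeps the split reproducible
    n_val = max(1, round(len(pairs) * val_fraction))
    return pairs[n_val:], pairs[:n_val]  # (train, validation)

# Toy data standing in for dataset.csv (hypothetical entries)
questions = [f"question {i}" for i in range(10)]
answers = [f"answer {i}" for i in range(10)]
train_pairs, val_pairs = split_pairs(questions, answers)
print(len(train_pairs), len(val_pairs))
```

Shuffling the pairs as a unit keeps every question aligned with its answer, and returning two disjoint lists keeps validation entries out of the training set.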

#### Dataset

In [3]:
dataset = pd.read_csv('dataset.csv')

questions = dataset['question'].tolist()
answers = dataset['answer'].tolist()

#### Task 3

Use TensorFlow or PyTorch to implement the architecture of the Neural Network you are planning to use. It is highly recommended to use a Seq2Seq model (implement an LSTM or GRU architecture). You are NOT allowed to use pre-built or existing solutions (yep, connecting to GPT will not work).

In [4]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def tokenize(data, max_len):
    return tokenizer(data, padding=True, truncation=True, return_tensors="pt", max_length=max_len)

In [5]:
class Seq2SeqDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs['input_ids'])

    def __getitem__(self, idx):
        return {
            'input_ids': self.inputs['input_ids'][idx],
            'attention_mask': self.inputs['attention_mask'][idx],
            'labels': self.targets['input_ids'][idx]
        }

In [None]:
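# NOTE: only the validation half of the split is kept; the model is trained on the
# full dataset, so the validation pairs are also seen during training and the
# BLEU score computed on them is optimistic.
_, questions_val, _, answers_val = train_test_split(questions, answers, test_size=0.2)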

tokenized_questions_train = tokenize(questions, SOURCE_LEN)
tokenized_answers_train = tokenize(answers, TARGET_LEN)
train_dataset = Seq2SeqDataset(tokenized_questions_train, tokenized_answers_train)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

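# Save the token ids seen in the training questions; at inference time they are
# used to reject questions that contain tokens the model was never trained on
os.makedirs('models', exist_ok=True)  # the directory may not exist before the first training run
torch.save(tokenized_questions_train, 'models/used_tokens.pt')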

tokenized_questions_val = tokenize(questions_val, SOURCE_LEN)
tokenized_answers_val = tokenize(answers_val, TARGET_LEN)
val_dataset = Seq2SeqDataset(tokenized_questions_val, tokenized_answers_val)
val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=False)  # order does not matter for evaluation
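
One detail worth checking in the preparation above: the padded answer ids are used directly as `labels`, so padding positions also contribute to the loss. A common remedy is to replace padding ids with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; the helper name `mask_padding` is hypothetical, the ids are made up, and `pad_id=0` assumes the T5 tokenizer's padding id.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, pad_id=0, ignore_index=IGNORE_INDEX):
    """Replace padding ids in a batch of label sequences so the loss skips them."""
    return [[ignore_index if tok == pad_id else tok for tok in seq] for seq in label_ids]

# Two toy label sequences padded to the same length (ids are made up)
batch = [[363, 19, 1, 0, 0], [8774, 1, 0, 0, 0]]
print(mask_padding(batch))
```

On tensors the same effect is usually achieved with `labels.masked_fill(labels == pad_id, -100)`. If this is applied, the `-100` entries would have to be mapped back to the padding id before the references are decoded for BLEU.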

#### Task 4

Train your model and fine-tune it based on the chosen performance metrics.

In [10]:
class Seq2SeqModel(nn.Module):
    def __init__(self, model_name=MODEL):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

In [75]:
def train(epochs: int = 10, file_name: str = 'model.pth'):
    model = Seq2SeqModel().to(device)
    optimizer = optim.AdamW(model.parameters(), lr=5e-4)

    for epoch in range(epochs):
        model.train()
        loop = tqdm(train_dataloader, leave=True)

        for batch in loop:
            loop.set_description(f"Epoch {epoch}")

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Update progress bar
            loop.set_postfix(loss=loss.item())

    os.makedirs('models', exist_ok=True)
    torch.save(model.state_dict(), f'models/{file_name}')

    return model

In [76]:
def evaluate(model, val_loader):
    model.eval()
    total_bleu = 0
    num_samples = 0

    with torch.no_grad():
        loop = tqdm(val_loader, leave=True)
        for batch in loop:
            loop.set_description("Evaluating: ")

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            outputs = model.model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=TARGET_LEN)
            predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            references = tokenizer.batch_decode(labels, skip_special_tokens=True)

            for p, r in zip(predictions, references):
                total_bleu += sentence_bleu([r.split()], p.split())
                num_samples += 1

    return total_bleu / num_samples
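
To make the metric used in `evaluate` more concrete, here is a minimal, dependency-free sketch of sentence-level BLEU with a single reference: clipped n-gram precisions for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration, not NLTK's implementation (no smoothing, one reference only); the function names and example sentences are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence BLEU against one reference (both are token lists)."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes hypotheses shorter than the reference
    ratio = len(reference) / len(hypothesis)
    brevity_penalty = 1.0 if ratio <= 1 else math.exp(1 - ratio)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "the museum opens at nine in the morning".split()
print(simple_bleu(reference, reference))
print(simple_bleu(reference, "the museum opens at nine".split()))
print(simple_bleu(reference, "take bus number five".split()))
```

Because short answers often share no 4-gram with the reference, unsmoothed BLEU can drop to zero for them; NLTK provides `SmoothingFunction` in `nltk.translate.bleu_score` for that case.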

In [None]:
TRAINED_MODEL = 'model10base.pth'

model = train(10, TRAINED_MODEL)
bleu_score = evaluate(model, val_dataloader)

print(f"BLEU Score: {bleu_score}")

Epoch 0: 100%|██████████| 8/8 [00:15<00:00,  1.94s/it, loss=2.3] 
Epoch 1: 100%|██████████| 8/8 [00:13<00:00,  1.71s/it, loss=1.35]
Epoch 2: 100%|██████████| 8/8 [00:08<00:00,  1.10s/it, loss=1.11]
Epoch 3: 100%|██████████| 8/8 [00:09<00:00,  1.22s/it, loss=0.859]
Epoch 4: 100%|██████████| 8/8 [00:09<00:00,  1.22s/it, loss=0.729]
Epoch 5: 100%|██████████| 8/8 [00:09<00:00,  1.22s/it, loss=0.622]
Epoch 6: 100%|██████████| 8/8 [00:09<00:00,  1.19s/it, loss=0.399]
Epoch 7: 100%|██████████| 8/8 [00:09<00:00,  1.22s/it, loss=0.264]
Epoch 8: 100%|██████████| 8/8 [00:09<00:00,  1.22s/it, loss=0.259]
Epoch 9: 100%|██████████| 8/8 [00:10<00:00,  1.26s/it, loss=0.228]
Evaluating: : 100%|██████████| 2/2 [00:00<00:00,  2.22it/s]

BLEU Score: 0.6664665989118275





#### Task 5

Integrate your model into your Telegram ChatBot, so that the sent messages are taken as input by the model and its output is sent back as a reply.

In [None]:
def check_tokens(tokens):
    # Return True only if every token id also occurs in the tokenized training questions
    saved_tokens = torch.load('models/used_tokens.pt')['input_ids']
    return all(t in saved_tokens for t in tokens)
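
`check_tokens` reloads the saved tensor and scans it for every token. If that becomes a bottleneck, the seen ids could be collected into a set once and reused. The sketch below illustrates this on plain lists; `build_seen_ids` and `all_seen` are hypothetical names, and the ids are toy values (with a tensor, one would flatten it and call `.tolist()` first).

```python
def build_seen_ids(tokenized_questions):
    """Collect every token id that occurs in the tokenized training questions."""
    return {tok for seq in tokenized_questions for tok in seq}

def all_seen(token_ids, seen_ids):
    """Return True only if every id in token_ids was seen during training."""
    return all(tok in seen_ids for tok in token_ids)

# Toy ids standing in for the contents of models/used_tokens.pt
seen = build_seen_ids([[363, 19, 8, 1, 0], [2840, 54, 27, 1, 0]])
print(all_seen([363, 27, 1], seen))
print(all_seen([12838, 31, 7, 1], seen))
```

Set membership is a constant-time lookup on average, so the cost of the check no longer grows with the size of the training set.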

In [None]:
model = Seq2SeqModel().to(device)
# map_location lets weights saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load(f'models/{TRAINED_MODEL}', map_location=device))

def generate_answer(question, model, tokenizer):
    model.eval()
    input_ids = tokenizer(question, return_tensors="pt").input_ids.to(device)

    if not check_tokens(input_ids[0]):
        print(input_ids[0])

        for token in input_ids[0]:
            print(f"{token}: {tokenizer.decode(token)}: {check_tokens([token])}")

        return "Invalid tokens"

    output = model.model.generate(input_ids, max_length=TARGET_LEN)

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    return decoded

# Test example
test_question = "Andy's"
display(generate_answer(test_question, model, tokenizer))

tensor([12838,    31,     7,     1])
12838: Andy: False
31: ': True
7: s: True
1: </s>: True


'Invalid tokens'

#### Telegram Bot

(Code can be found in `app.py` and `utils.py`)

*utils.py*

In [None]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL = 't5-base'
TRAINED_MODEL = 'model10base.pth'

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

class Seq2SeqModel(nn.Module):
    # Model class
    ...

model = Seq2SeqModel().to(device)
model.load_state_dict(torch.load(f'models/{TRAINED_MODEL}', map_location=device))

def check_tokens(tokens):
    # Check that every token id occurs in the tokenized training questions
    ...

def generate_answer(question: str) -> str:
    # Generate an answer for the given question
    ...

*app.py*

In [None]:
from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, MessageHandler
from dotenv import dotenv_values
from warnings import filterwarnings

from utils import generate_answer

filterwarnings("ignore")
config = dotenv_values(".env")


async def start(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Response for the /start command
    ...

async def respond(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Handle user messages
    ...


app = ApplicationBuilder().token(config['TELEGRAM_TOKEN']).build()
app.add_handler(CommandHandler(callback=start, command='start'))
app.add_handler(MessageHandler(callback=respond, filters=None))
app.run_polling()


#### Task 6

Handle potential errors that may occur, such as model errors or invalid inputs.

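1. The app first splits the message into sentences, and the bot responds to each sentence separately.
2. After a sentence is tokenized, each token is checked against the token ids seen in the training questions (saved in `models/used_tokens.pt`). A sentence containing unseen tokens is rejected instead of being passed to the model.
3. For any uncaught exception, the bot responds with a default error message.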
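
The three steps above can be sketched as a single function. This is an illustration under assumptions, not the code from `app.py`: the regex-based sentence splitting, the name `answer_message`, the reply texts, and the stand-in `fake_generate` are all hypothetical. In the bot, the `generate_answer` function from `utils.py` would be passed in instead.

```python
import re

DEFAULT_ERROR = "Sorry, something went wrong. Please try again."  # hypothetical text

def answer_message(message, generate_answer):
    """Answer each sentence of a message separately, with a fallback on errors."""
    try:
        # Split after ., ! or ? followed by whitespace; drop empty pieces
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", message) if s.strip()]
        if not sentences:
            return ["Please send a question."]
        return [generate_answer(sentence) for sentence in sentences]
    except Exception:
        # Any uncaught model or tokenizer error ends up here
        return [DEFAULT_ERROR]

def fake_generate(question):
    """Stand-in for the real model call; raises on one input to show the fallback."""
    if "crash" in question:
        raise RuntimeError("model failure")
    return f"Answer to: {question}"

print(answer_message("Where is the park? Is it open today?", fake_generate))
print(answer_message("Please crash now.", fake_generate))
```

Passing the answer function in as a parameter keeps the error handling testable without loading the model.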

#### Collaborations

1. *Beatricia Golban* FAF-213 - helped with the model implementation.
2. *Dan Hariton* FAF-211 - helped with the model and bot implementation.