***Fundamentals of Artificial Intelligence***

> **Lab 6:** *Natural Language Processing and Chat Bots* <br>

> **Performed by:** *Corneliu Catlabuga*, group *FAF-213* <br>

> **Verified by:** Elena Graur, asist. univ.

#### Imports

In [34]:
import os
import re
from collections import Counter
from warnings import filterwarnings

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, T5ForConditionalGeneration

filterwarnings("ignore")
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device}")

Using cuda


#### Task 1

Set up the Telegram Bot. Interact with BotFather on Telegram to obtain an API token. Create your Telegram Bot (its name should follow the pattern FIA_Surname_Name_FAF_21x). Make sure you are able to receive and send requests to it.

1. Bot link: [FIA_Catlabuga_Corneliu_FAF_213](https://t.me/FAFCatlabugaCorneliuFAF213bot)

2. Run `app.py` to start the bot.

#### Task 2

Create a dataset that will serve as a training set for your model. It should follow the rules:
- an entry consists of two parts: the question and the answer;
- there are at least 75 entries written by you in your dataset;
- questions should be something tourists or locals can ask about a new city.

You can increase your dataset by adding open-source data. However, you MUST clearly show the questions written by you. Split your dataset into train and validation.

*Hint: it is recommended to split it into 80% and 20%, but you can adjust it according to your needs.*

#### Dataset

In [35]:
dataset = pd.read_csv('dataset.csv')

questions = dataset['question'].astype(str).tolist()
answers = dataset['answer'].astype(str).tolist()


def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9?.!,']", " ", text)
    return text.strip()

questions = [clean_text(question) for question in questions]
answers = [clean_text(answer) for answer in answers]
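
The task asks for a train/validation split, which the cell above does not perform. A minimal sketch of an 80/20 split over paired lists could look like this; `split_pairs`, `val_fraction` and `seed` are names assumed for illustration, and the sample pairs are placeholders rather than entries from `dataset.csv`.

```python
import random

def split_pairs(questions, answers, val_fraction=0.2, seed=42):
    """Shuffle question/answer pairs together and split them into train and validation."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    pairs = list(zip(questions, answers))
    random.Random(seed).shuffle(pairs)  # local RNG keeps the split reproducible
    n_val = int(len(pairs) * val_fraction)
    return pairs[n_val:], pairs[:n_val]

# Placeholder data (not taken from the real dataset)
sample_q = [f"question {i}" for i in range(10)]
sample_a = [f"answer {i}" for i in range(10)]
train_pairs, val_pairs = split_pairs(sample_q, sample_a)
print(len(train_pairs), len(val_pairs))  # 8 2
```

Shuffling the pairs as one list keeps every question attached to its own answer, and the fixed seed makes the validation set identical across runs.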

#### Task 3

Use TensorFlow or PyTorch to implement the architecture of the Neural Network you are planning to use. It is highly recommended to use a Seq2Seq model (implement an LSTM or GRU architecture). You are NOT allowed to use pre-built or existing solutions (yep, connecting to GPT will not work).

In [36]:
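tokenizer = AutoTokenizer.from_pretrained("t5-base")

def tokenize(text: str) -> list:
    # Assumed definition (the original helper is not shown): T5 subword tokens
    return tokenizer.tokenize(text)

def build_vocab(texts: list) -> dict:
    # Count every token in the corpus; the most frequent tokens get the lowest indices
    tokens = [token for sentence in texts for token in tokenize(sentence)]
    freq = Counter(tokens)
    # Indices 0 and 1 are reserved for the special tokens below
    vocab = {word: idx+2 for idx, (word, _) in enumerate(freq.most_common())}
    vocab['<UNK>'] = 0
    vocab['<PAD>'] = 1
    return vocab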

In [37]:
def vectorize(sentences, vocab, max_len):
    vectors = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        vector = [vocab.get(token, vocab['<UNK>']) for token in tokens]
        if len(vector) < max_len:
            vector += [vocab['<PAD>']] * (max_len - len(vector))
        else:
            vector = vector[:max_len]
        vectors.append(vector)
    return np.array(vectors)
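
To make the index layout concrete, here is a self-contained toy version of the two helpers above. It assumes whitespace splitting as a stand-in for `tokenize` and returns plain lists instead of a NumPy array; `toy_tokenize`, `toy_build_vocab` and `toy_vectorize` are illustrative names only.

```python
from collections import Counter

def toy_tokenize(sentence):
    # Stand-in tokenizer (assumption): split on whitespace
    return sentence.split()

def toy_build_vocab(texts):
    freq = Counter(tok for s in texts for tok in toy_tokenize(s))
    vocab = {word: idx + 2 for idx, (word, _) in enumerate(freq.most_common())}
    vocab['<UNK>'] = 0
    vocab['<PAD>'] = 1
    return vocab

def toy_vectorize(sentence, vocab, max_len):
    ids = [vocab.get(tok, vocab['<UNK>']) for tok in toy_tokenize(sentence)]
    ids = ids[:max_len]                                   # truncate long inputs
    return ids + [vocab['<PAD>']] * (max_len - len(ids))  # pad short inputs

vocab = toy_build_vocab(["where is the museum", "where is the park"])
padded = toy_vectorize("where is the zoo", vocab, max_len=6)
truncated = toy_vectorize("where is the museum where is the park", vocab, max_len=6)
print(padded)     # [2, 3, 4, 0, 1, 1]
print(truncated)  # [2, 3, 4, 5, 2, 3]
```

The unseen word `zoo` maps to `<UNK>` (0), the short sentence is filled with `<PAD>` (1), and the long sentence is cut at `max_len`.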

In [38]:
question_vocab = build_vocab(questions)
answer_vocab = build_vocab(answers)

idx_to_word = {idx: word for word, idx in question_vocab.items()}

MAX_LEN_Q = 20
MAX_LEN_A = 20

questions_vec = vectorize(questions, question_vocab, MAX_LEN_Q)
answers_vec = vectorize(answers, answer_vocab, MAX_LEN_A)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


#### Task 4

Train your model and fine-tune it based on the chosen performance metrics.

In [39]:
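class Seq2SeqModel(nn.Module):
    def __init__(self, model_name="t5-base"):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask=None, labels=None):
        # Delegate to T5; when labels are passed, the returned object carries the loss
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        return outputs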

In [111]:
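# QADataset is not defined in this notebook; the training loop below expects each
# item to be a dict with 'input_ids', 'attention_mask' and 'labels'
train_dataset = QADataset(questions_vec, answers_vec)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)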

In [None]:
def train(epochs: int = 10, file_name: str = 'model.pth'):
    model = Seq2SeqModel().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

    for epoch in range(epochs):
        model.train()
        loop = tqdm(train_loader)
        for batch in loop:
            loop.set_description(f"Epoch {epoch}")

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Clear gradients left over from the previous step
            optimizer.zero_grad()

            # T5 computes the cross-entropy loss itself when labels are passed
            outputs = model.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()

            # Update progress bar
            loop.set_postfix(loss=loss.item())

    os.makedirs('models', exist_ok=True)
    torch.save(model.state_dict(), f"./models/{file_name}")

In [41]:
train(5, 'model5base.pth')

Epoch 0: 100%|██████████| 8/8 [00:07<00:00,  1.10it/s, loss=2.14]
Epoch 1: 100%|██████████| 8/8 [00:07<00:00,  1.11it/s, loss=1.6] 
Epoch 2: 100%|██████████| 8/8 [00:09<00:00,  1.16s/it, loss=1.22]
Epoch 3: 100%|██████████| 8/8 [00:06<00:00,  1.23it/s, loss=0.749]
Epoch 4: 100%|██████████| 8/8 [00:06<00:00,  1.22it/s, loss=0.668]
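
The logs above report training loss only, and the notebook does not say which performance metric was chosen. One option, assumed here purely for illustration, is token-overlap F1 between a generated answer and the reference answer; `token_f1` is a hypothetical helper, not part of the original code.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string is a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Four of five tokens match on each side, so the score is approximately 0.8
score = token_f1("the museum opens at nine", "the museum opens at ten")
```

Averaging this score over the validation pairs after each epoch would give a single number to compare runs by, alongside the loss.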


#### Task 5

Integrate your model into your Telegram ChatBot, so that the sent messages are taken as input by the model and its output is sent back as a reply.

In [None]:
model = Seq2SeqModel().to(device)
model.load_state_dict(torch.load('models/model5base.pth', map_location=device))

def generate_answer(question, model, tokenizer):
    model.eval()
    input_ids = tokenizer(question, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        output = model.model.generate(input_ids)

    # Decode first, then check the text (not the token tensor) for a final period
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    if not answer.endswith('.'):
        answer += '.'

    return answer

# Test example
test_question = "Can I book a table at DeMars in advance?"
display(generate_answer(test_question, model, tokenizer))

'Yes, you can book a table at the Luna-City app or visit the restaurant.'
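
`app.py` is not shown in this notebook, so the following is only a sketch of how an incoming update could be turned into a reply through the Telegram Bot API's `sendMessage` method. `api_url` and `build_reply` are hypothetical names, and `answer_fn` stands for any callable, such as a wrapper around `generate_answer`.

```python
def api_url(token: str, method: str) -> str:
    # Bot API endpoints have the form https://api.telegram.org/bot<token>/<method>
    return f"https://api.telegram.org/bot{token}/{method}"

def build_reply(update: dict, answer_fn):
    """Return a sendMessage payload for a text update, or None for other updates."""
    message = update.get("message") or {}
    text = message.get("text")
    chat_id = (message.get("chat") or {}).get("id")
    if text is None or chat_id is None:
        return None  # e.g. stickers, photos or edited messages
    return {"chat_id": chat_id, "text": answer_fn(text)}

# Example with a stub in place of the model (placeholder data)
update = {"update_id": 1, "message": {"chat": {"id": 123}, "text": "Where is the museum?"}}
payload = build_reply(update, lambda q: f"You asked: {q}")
```

The resulting payload would then be sent as a POST request to `api_url(token, "sendMessage")`; keeping the payload construction separate from the network call makes it testable without a live bot.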

#### Task 6

Handle potential errors that may occur, such as model errors or invalid inputs.
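
No code is given for this task. A minimal sketch, assuming a hypothetical `safe_answer` wrapper, an arbitrarily chosen 500-character limit, and fallback messages invented for illustration:

```python
import logging

MAX_QUESTION_CHARS = 500  # arbitrary limit (assumption)

def safe_answer(question, answer_fn) -> str:
    """Validate the input and shield the caller from model failures."""
    if not isinstance(question, str) or not question.strip():
        return "Please send a text question."
    if len(question) > MAX_QUESTION_CHARS:
        return "That question is too long, please shorten it."
    try:
        answer = answer_fn(question.strip())
    except Exception:
        # Log the traceback for debugging, but reply with a friendly message
        logging.exception("Model failed on input %r", question)
        return "Sorry, something went wrong while generating an answer."
    if not isinstance(answer, str) or not answer.strip():
        return "Sorry, I do not have an answer for that."
    return answer

def broken_model(_question):
    # Stub that simulates a model failure
    raise RuntimeError("simulated model error")

print(safe_answer("   ", str.upper))      # Please send a text question.
print(safe_answer("hello", str.upper))    # HELLO
```

Wrapping the model call this way means a failure inside generation produces a reply instead of crashing the bot's update loop.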