***Fundamentals of Artificial Intelligence***

> **Lab 6:** *Natural Language Processing and Chat Bots* <br>

> **Performed by:** *Corneliu Catlabuga*, group *FAF-213* <br>

> **Verified by:** Elena Graur, university assistant

#### Imports

In [263]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch import optim
from torch.jit import script, trace
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import torch.nn as nn
import torch.nn.functional as F

#### Task 1

Set up the Telegram Bot. Interact with BotFather on Telegram to obtain an API token. Create your Telegram Bot (its name should follow the pattern FIA_Surname_Name_FAF_21x). Make sure you are able to receive and send requests to it.

1. Bot link: [FIA_Catlabuga_Corneliu_FAF_213](https://t.me/FAFCatlabugaCorneliuFAF213bot)

2. Run `app.py` to start the bot.
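
The contents of `app.py` are not shown in this report. As a rough sketch of what sending a message to the bot involves, assuming the plain Telegram Bot API over HTTPS and no particular wrapper library, the request could be built as below. The helper names and the placeholder token in the usage line are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.telegram.org"


def build_api_url(token: str, method: str) -> str:
    """Return the Bot API endpoint for a method such as 'getMe' or 'sendMessage'."""
    return f"{API_ROOT}/bot{token}/{method}"


def build_send_message_request(token: str, chat_id: int, text: str) -> urllib.request.Request:
    """Prepare (but do not send) a sendMessage POST request."""
    payload = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(build_api_url(token, "sendMessage"), data=payload, method="POST")


def send_message(token: str, chat_id: int, text: str) -> dict:
    """Send the request and decode the JSON reply (needs network access and a real token)."""
    request = build_send_message_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical token, for illustration only; nothing is sent here.
request = build_send_message_request("123:ABC", 42, "hello world")
print(request.full_url)
```

Keeping request construction separate from the network call makes it possible to check the URL and payload without a live token.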

#### Task 2

Create a dataset that will serve as the training set for your model. It should follow these rules:
- an entry consists of two parts: the question and the answer;
- there are at least 75 entries written by you in your dataset;
- questions should be something tourists or locals can ask about a new city.

You can increase your dataset by adding open-source data. However, you MUST clearly show the questions written by you. Split your dataset into train and validation.

*Hint: it is recommended to split it into 80% and 20%, but you can adjust it according to your needs.*
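
One way to keep the hand-written entries identifiable while splitting 80/20 is to carry a source flag on every row. The sketch below uses only the standard library and generated placeholder rows. The `source` field and its `own`/`open` values are assumptions for illustration, not columns confirmed to exist in `dataset.csv`.

```python
import random


def split_entries(entries, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the entries and split it into (train, validation)."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical rows: 'source' marks whether a pair was written by the author ('own')
# or taken from open-source data ('open').
entries = [
    {"question": f"question {i}", "answer": f"answer {i}", "source": "own" if i < 75 else "open"}
    for i in range(100)
]

train, validation = split_entries(entries)
own_count = sum(1 for row in entries if row["source"] == "own")
print(len(train), len(validation), own_count)
```

Filtering on the flag then shows directly which questions were written by hand.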

#### Dataset

In [264]:
dataset = pd.read_csv('dataset.csv')

#### Task 3

Use TensorFlow or PyTorch to implement the architecture of the Neural Network you are planning to use. It is highly recommended to use a Seq2Seq model (implement an LSTM or GRU architecture). You are NOT allowed to use pre-built or existing solutions (yep, connecting to GPT will not work).

In [265]:
def tokenize(sentence, vocabulary):
    # Lowercase, split on whitespace, and map unknown words to <UNK>
    return [vocabulary.get(word, vocabulary['<UNK>']) for word in sentence.lower().split()]
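
Because `split()` only breaks on whitespace, `restaurant?` and `restaurant` end up as two different vocabulary entries. A possible alternative, shown here as a sketch and not as the tokenizer used for the results in this report, separates punctuation with a regular expression. The names `tokenize_words` and `tokenize_ids` are hypothetical.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_words(sentence):
    """Lowercase the sentence and split it into words and punctuation marks."""
    return TOKEN_PATTERN.findall(sentence.lower())


def tokenize_ids(sentence, vocabulary):
    """Map each token to its index, falling back to <UNK> for unseen tokens."""
    return [vocabulary.get(token, vocabulary['<UNK>']) for token in tokenize_words(sentence)]


print(tokenize_words("Where can I find Martia restaurant?"))
```

Note that this pattern also splits contractions at the apostrophe, which may or may not be desirable for a small dataset.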

In [266]:
class ChatDataset(Dataset):
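    def __init__(self, questions, answers, vocabulary):
        sos, eos = vocabulary['<SOS>'], vocabulary['<EOS>']
        self.questions = [tokenize(question, vocabulary) for question in questions]
        # Wrap answers in <SOS>/<EOS> so the decoder learns where a reply starts and stops
        self.answers = [[sos] + tokenize(answer, vocabulary) + [eos] for answer in answers]
        self.vocabulary = vocabulary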

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, index):
        return torch.tensor(self.questions[index]), torch.tensor(self.answers[index])

In [294]:
train_data, validation_data = train_test_split(dataset, test_size=0.2)

special_tokens = ['<PAD>', '<UNK>', '<SOS>', '<EOS>']
# vocabulary = {'<PAD>': 0, '<UNK>': 1}
vocabulary = {token: index for index, token in enumerate(special_tokens)}

for sentence in pd.concat([dataset['question'], dataset['answer']]):
    for word in sentence.lower().split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

vocab_reverse = {idx: word for word, idx in vocabulary.items()}

In [268]:
def collate_fn(batch):
    questions, answers = zip(*batch)

    questions = pad_sequence(questions, batch_first=True, padding_value=vocabulary['<PAD>'])
    answers = pad_sequence(answers, batch_first=True, padding_value=vocabulary['<PAD>'])

    return questions, answers

In [269]:
train_dataset = ChatDataset(train_data['question'], train_data['answer'], vocabulary)
validation_dataset = ChatDataset(validation_data['question'], validation_data['answer'], vocabulary)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
validation_loader = DataLoader(validation_dataset, batch_size=32, collate_fn=collate_fn)

#### Task 4

Train your model and fine-tune it based on the chosen performance metrics.
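
The task leaves the choice of metric open. As one hedged possibility, token-level accuracy and an exact-match rate can be computed on validation predictions while ignoring padding. The sketch works on plain lists of token indices of equal length per pair. `PAD_ID = 0` matches the position of `<PAD>` in `special_tokens`, while the function names are hypothetical.

```python
PAD_ID = 0  # index of <PAD> in the vocabulary


def token_accuracy(predicted, target, pad_id=PAD_ID):
    """Fraction of non-padding target positions that were predicted correctly."""
    correct = total = 0
    for pred_seq, target_seq in zip(predicted, target):
        for pred_tok, target_tok in zip(pred_seq, target_seq):
            if target_tok == pad_id:
                continue  # padded positions carry no information
            total += 1
            correct += int(pred_tok == target_tok)
    return correct / total if total else 0.0


def exact_match(predicted, target, pad_id=PAD_ID):
    """Fraction of sequences whose non-padding positions are all correct."""
    if not target:
        return 0.0
    hits = 0
    for pred_seq, target_seq in zip(predicted, target):
        hits += all(p == t for p, t in zip(pred_seq, target_seq) if t != pad_id)
    return hits / len(target)


# Toy example: the second sequence has one wrong token.
predicted = [[5, 6, 3, 9], [7, 8, 3, 0]]
target = [[5, 6, 3, 0], [7, 9, 3, 0]]
print(token_accuracy(predicted, target), exact_match(predicted, target))
```

Tracking such a metric on the validation split, and not only the training loss, would show whether longer training still helps.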

In [270]:
class Encoder(nn.Module):
    def __init__(self, input_dimension, embedding_dimension, hidden_dimension):
        super().__init__()
        self.embedding = nn.Embedding(input_dimension, embedding_dimension)
        self.rnn = nn.GRU(embedding_dimension, hidden_dimension, batch_first=True)
        self.hidden_dimension = hidden_dimension

    def forward(self, source):
        embedded = self.embedding(source)
        _, hidden = self.rnn(embedded)
        return hidden

In [271]:
class Decoder(nn.Module):
    def __init__(self, output_dimension, embedding_dimension, hidden_dimension):
        super().__init__()
        self.embedding = nn.Embedding(output_dimension, embedding_dimension)
        self.rnn = nn.GRU(embedding_dimension, hidden_dimension, batch_first=True)
        self.fc_out = nn.Linear(hidden_dimension, output_dimension)

    def forward(self, trigger, hidden):
        embedded = self.embedding(trigger)
        output, hidden = self.rnn(embedded, hidden)
        predictions = self.fc_out(output)
        return predictions, hidden

In [281]:
input_dimension = len(vocabulary)
output_dimension = len(vocabulary)
embedding_dimension = 256
hidden_diemnsion = 512
learning_rate = 0.001
batch_size = 32

encoder = Encoder(input_dimension, embedding_dimension, hidden_diemnsion)
decoder = Decoder(output_dimension, embedding_dimension, hidden_diemnsion)
model = nn.ModuleList([encoder, decoder])

criterion = nn.CrossEntropyLoss(ignore_index=vocabulary['<PAD>'])
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

def train(epochs: int = 10, file_name: str = 'model.pth'):
    encoder.train()
    decoder.train()

    for epoch in range(epochs):
        epoch_loss = 0

        for questions, answers in train_loader:
            optimizer.zero_grad()
            hidden = encoder(questions)  # Encoder output (hidden state)

            input_seq = answers[:, :-1]  # Remove <EOS>
            target_seq = answers[:, 1:]  # Shift for teacher forcing

            batch_size, seq_len = target_seq.size()
            outputs = torch.zeros(batch_size, seq_len, output_dimension)

            dec_input = input_seq[:, 0].unsqueeze(1)  # Start with first token
            for t in range(seq_len):
                output, hidden = decoder(dec_input, hidden)
                outputs[:, t, :] = output.squeeze(1)
                dec_input = target_seq[:, t].unsqueeze(1)  # Teacher forcing

            # outputs = outputs.view(-1, output_dimension)
            # target_seq = target_seq.view(-1)
            outputs = outputs.reshape(-1, output_dimension)
            target_seq = target_seq.reshape(-1)
            loss = criterion(outputs, target_seq)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

        print(f'Epoch {epoch+1}/{epochs}, Loss: {epoch_loss/len(train_loader):.4f}')

    os.makedirs('models', exist_ok=True)
    torch.save(encoder.state_dict(), f'models/enc_{file_name}')
    torch.save(decoder.state_dict(), f'models/dec_{file_name}')

In [None]:
train(10, 'm10.pth')
train(20, 'm20.pth')
train(50, 'm50.pth')
# train(100, 'm100.pth')
# train(300, 'm300.pth')
# train(1000, 'm1000.pth')

Epoch 1/50, Loss: 0.0031
Epoch 2/50, Loss: 0.0032
Epoch 3/50, Loss: 0.0031
Epoch 4/50, Loss: 0.0031
Epoch 5/50, Loss: 0.0032
Epoch 6/50, Loss: 0.0032
Epoch 7/50, Loss: 0.0030
Epoch 8/50, Loss: 0.0031
Epoch 9/50, Loss: 0.0034
Epoch 10/50, Loss: 0.0033
Epoch 11/50, Loss: 0.0030
Epoch 12/50, Loss: 0.0032
Epoch 13/50, Loss: 0.0030
Epoch 14/50, Loss: 0.0030
Epoch 15/50, Loss: 0.0032
Epoch 16/50, Loss: 0.0030
Epoch 17/50, Loss: 0.0033
Epoch 18/50, Loss: 0.0031
Epoch 19/50, Loss: 0.0031
Epoch 20/50, Loss: 0.0029
Epoch 21/50, Loss: 0.0030
Epoch 22/50, Loss: 0.0032
Epoch 23/50, Loss: 0.0033
Epoch 24/50, Loss: 0.0033
Epoch 25/50, Loss: 0.0031
Epoch 26/50, Loss: 0.0031
Epoch 27/50, Loss: 0.0032
Epoch 28/50, Loss: 0.0031
Epoch 29/50, Loss: 0.0031
Epoch 30/50, Loss: 0.0032
Epoch 31/50, Loss: 0.0031
Epoch 32/50, Loss: 0.0033
Epoch 33/50, Loss: 0.0031
Epoch 34/50, Loss: 0.0030
Epoch 35/50, Loss: 0.0030
Epoch 36/50, Loss: 0.0031
Epoch 37/50, Loss: 0.0033
Epoch 38/50, Loss: 0.0035
Epoch 39/50, Loss: 0.

#### Task 5

Integrate your model into your Telegram ChatBot, so that the sent messages are taken as input by the model and its output is sent back as a reply.
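
As a sketch of the glue between the bot and the model, assuming updates arrive as the JSON objects returned by the Bot API `getUpdates` method, each update can be reduced to a chat id and a reply text. Here `respond_fn` stands in for a call to the model's `respond` function; `handle_update` and `next_offset` are hypothetical names.

```python
def handle_update(update, respond_fn):
    """Turn one Bot API update into (chat_id, reply_text), or None if it has no text."""
    message = update.get("message") or {}
    text = message.get("text")
    chat_id = (message.get("chat") or {}).get("id")
    if text is None or chat_id is None:
        return None  # stickers, photos, edited messages, etc. are skipped
    return chat_id, respond_fn(text)


def next_offset(updates, current_offset=0):
    """getUpdates confirms processed updates when called with highest update_id + 1."""
    if not updates:
        return current_offset
    return max(update["update_id"] for update in updates) + 1


# Stand-in for the model: echoes the question in upper case.
updates = [
    {"update_id": 10, "message": {"chat": {"id": 7}, "text": "hi"}},
    {"update_id": 11, "message": {"chat": {"id": 7}, "sticker": {}}},
]
replies = [handle_update(update, str.upper) for update in updates]
print(replies, next_offset(updates))
```

Passing the model call in as a function keeps the Telegram handling testable without loading the trained weights.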

In [288]:
def preprocess_question(question, vocab):
    """
    Tokenizes and numericalizes a question using the vocabulary.
    """
    tokens = question.lower().split()
    indices = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    return torch.tensor([indices], dtype=torch.long)

In [327]:
def respond(question, encoder, decoder, vocab, vocab_reverse, max_len=20):
    # Preprocess the question
    question_tensor = preprocess_question(question, vocab)
    if question_tensor.numel() == 0:  # Empty or whitespace-only input
        return ''

    # Inference only: switch to eval mode and skip gradient tracking
    encoder.eval()
    decoder.eval()
    output_indices = []

    with torch.no_grad():
        # Encode the question
        hidden = encoder(question_tensor)

        # Start decoding from the <SOS> token
        input_token = torch.tensor([[vocab['<SOS>']]])

        # Decode one token at a time
        for _ in range(max_len):
            output, hidden = decoder(input_token, hidden)
            next_token = output.argmax(2).item()  # Pick the most probable token
            output_indices.append(next_token)

            if next_token == vocab['<EOS>']:  # Stop at <EOS> token
                break
            input_token = torch.tensor([[next_token]])

    # Convert indices back to tokens, excluding padding and EOS
    response = ' '.join([vocab_reverse[idx] for idx in output_indices if idx not in {vocab['<PAD>'], vocab['<EOS>']}])
    return response

In [328]:
m = '50'

encoder = Encoder(input_dimension, embedding_dimension, hidden_diemnsion)
decoder = Decoder(output_dimension, embedding_dimension, hidden_diemnsion)
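# weights_only=True restricts unpickling to tensors, which is enough for a state_dict
encoder.load_state_dict(torch.load(f'models/enc_m{m}.pth', weights_only=True))
decoder.load_state_dict(torch.load(f'models/dec_m{m}.pth', weights_only=True))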

response = respond(
    "Where can I find Martia restaurant?",
    encoder,
    decoder,
    vocabulary,
    vocab_reverse
)

display(response)


'to restaurant marf yes venood much marf hub, or do get pizza area herbs all cost? has has has has'

#### Task 6

Handle potential errors that may occur, such as model errors or invalid inputs.
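
A minimal sketch of such handling, assuming the model is called through a single function, is to validate the incoming text first and to catch any exception raised during inference so that the bot always sends some reply. The wrapper name `safe_respond`, the reply strings, and the 500-character limit are all hypothetical choices.

```python
import logging

logger = logging.getLogger(__name__)

FALLBACK_REPLY = "Sorry, I could not process that message."
EMPTY_REPLY = "Please send a question as text."
MAX_QUESTION_CHARS = 500  # arbitrary limit, chosen for illustration


def safe_respond(question, respond_fn, max_chars=MAX_QUESTION_CHARS):
    """Validate the input, call the model, and fall back to a fixed reply on failure."""
    # Invalid input: not a string, or nothing but whitespace
    if not isinstance(question, str) or not question.strip():
        return EMPTY_REPLY

    question = question.strip()[:max_chars]  # truncate overly long messages
    try:
        reply = respond_fn(question)
    except Exception:
        # Model errors are logged with a traceback, but never reach the user
        logger.exception("Model failed on question %r", question)
        return FALLBACK_REPLY

    # An empty generation is treated like a failure
    return reply if isinstance(reply, str) and reply.strip() else FALLBACK_REPLY


print(safe_respond("   ", lambda q: q))
print(safe_respond("Where is the museum?", lambda q: "near the park"))
```

The broad `except Exception` is deliberate in this sketch: at the boundary with the chat user, any model failure should degrade to a polite message instead of crashing the bot.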