# Introduction to the Modeling Problem

In this project, we use Natural Language Processing (NLP) to tackle a compelling challenge: developing a chatbot that not only holds contextually relevant conversations but also emulates the unique conversational styles of characters from the popular TV show "Rick and Morty". We chose to work with Microsoft's DialoGPT, a variant of the GPT-2 model that is fine-tuned specifically for dialogue.

## A Unique Problem to Crack

Our project involves distinct challenges, each representing a key stage in the development of our chatbot:

1. **Data Acquisition and Cleaning**: Our first task was to obtain and clean a dataset with enough depth and richness to effectively train our model. This required sourcing and preprocessing dialogues from the TV show.

2. **Understanding Character Styles**: Each character in "Rick and Morty" boasts a unique conversational style, peppered with their own idiosyncrasies. The complexity of capturing the nuances of their dialogue, including humor, colloquialisms, and catchphrases, represented a significant challenge.

3. **Text Generation**: Beyond generating relevant responses, our model needed to convincingly replicate the distinct conversational style of the character in question. The DialoGPT variant of GPT-2 was our tool of choice, thanks to its strong performance in generating conversational responses.

4. **Evaluation**: Our final hurdle lay in evaluating our model's performance. Beyond the realm of traditional accuracy metrics, we had to assess whether the generated text genuinely encapsulated the character's style. Establishing pertinent evaluation metrics represented a unique and interesting problem in itself.

## Goal of Our Project

The primary aim of our project is to use Natural Language Processing (NLP) techniques to develop an interactive artificial personality. Our focus is not only on contextually relevant conversation but also on emulating the distinctive conversational styles of characters from the TV show "Rick and Morty". By fine-tuning DialoGPT, a variant of GPT-2 specialized for dialogue, we seek to craft a chatbot that engages users in fun interactions and brings the personas of "Rick and Morty" into our NLP model. Through this project we aim to expand our knowledge of conversational AI models and to test their boundaries.

## Extending NLP-Powered Chatbot to Real Business Solutions and Beyond

The application of our NLP-powered chatbot could extend beyond creating a fun character from one popular show. With a successful Artificial Persona we can create unique and interactive user experiences on various platforms, presenting exciting opportunities for businesses and content creators.

1. **Website with Cool Characters:** A business could use our NLP Artificial Persona chatbot to design engaging websites featuring interactive and entertaining characters. These characters could be original creations or inspired by popular media franchises.

2. **Interactive YouTubers:** Content creators, especially YouTubers, can use our NLP-powered Artificial Persona to bring their personas to life in a whole new way. Imagine a YouTube channel where the creator's virtual character interacts with viewers in real time, responding to comments and engaging in witty banter. If the Artificial Persona is indistinguishable from the real person, this personalized and interactive experience could foster a stronger sense of connection between the content creator and their audience.

3. **Customizable Chatbots for Customer Support:** Beyond entertainment, our chatbot technology can serve as a valuable tool for businesses to streamline customer support services. Companies can create customized chatbots with distinct personalities that align with their brand identity. These chatbots can efficiently handle customer inquiries, provide helpful information, and resolve issues, all while maintaining a friendly and engaging conversational style.

4. **Educational Applications:** In the realm of education, our NLP-powered chatbot can be utilized to create interactive and adaptive learning experiences. Students can engage with virtual characters that act as intelligent tutors, providing personalized explanations, quizzes, and feedback based on individual learning preferences.

5. **Virtual Influencers:** As the world of social media continues to evolve, virtual influencers have gained popularity. Our chatbot technology can enable the creation of virtual influencers with unique personalities, who can interact with followers on social media platforms. These virtual influencers can promote products, share content, and engage in conversations with users, offering a fresh approach to influencer marketing.

The potential applications of our NLP-powered Artificial Persona chatbot are vast and diverse. From enriching user experiences on websites to transforming the way content creators interact with their audience, and from streamlining customer support to improving education and influencer marketing, the possibilities are wide-ranging. Through this project, we are exploring the capabilities of NLP models and how a successful model could integrate with today's world.


## Choice of Model

In this project, we decided to utilize Microsoft's DialoGPT model. As a variation of the GPT-2 model fine-tuned explicitly for generating conversational responses, DialoGPT seemed well-suited for our task of creating a chatbot that mirrors the dialogue style in the "Rick and Morty" show. This model selection was also backed by the promising results demonstrated by DialoGPT in generating contextually relevant and human-like conversational responses.

In the realm of Natural Language Processing (NLP), several other model families could have been considered. For instance, sequence-to-sequence (Seq2Seq) models built on recurrent networks such as LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) could have been potential choices. However, these recurrent models often struggle with long sequences due to the vanishing gradient problem, making them less ideal for dialogue systems where the context can be quite lengthy.

On the other hand, Transformer-based models like GPT-2 mitigate this limitation with their attention mechanism, which connects each token directly to the tokens that precede it. Although the GPT-2 model is already quite powerful, DialoGPT takes it a step further by training this base model on a large-scale dialogue dataset. This makes DialoGPT better at understanding and generating dialogue, which aligns well with our project goal.

Our aim was to mimic the humorous style of "Rick and Morty" accurately, with a focus on the character of Rick, while generating coherent and relevant responses in a conversational setting. The DialoGPT model provided a solid foundation for this, allowing us to fine-tune it further on our own dataset.


# Data Deep Dive

In this section, we delve into our dataset, which is the cornerstone of any Machine Learning project. We'll explore the data acquisition process, our method for preparing the data for our model, and some challenges associated with the nature of the data and our approach. 

## Data Acquisition

Our dataset comes from Kaggle, specifically the [Rick and Morty Scripts dataset](https://www.kaggle.com/andradaolteanu/rickmorty-scripts) posted by user Andrada Olteanu. This dataset provides lines of dialogue from different characters across various episodes of the show.

The dataset structure is as follows:

- **index**: A simple row identifier.
- **season no.**: The season in which the dialogue line appears.
- **episode no.**: The episode in which the dialogue line appears.
- **episode name**: The name of the episode.
- **name**: The character who speaks the line.
- **line**: The line of dialogue itself.

Here are the first few lines for context:

```
index	season no.	episode no.	episode name	name	line
0	1	1	Pilot	Rick	Morty! You gotta come on. Jus'... you gotta come with me.
1	1	1	Pilot	Morty	What, Rick? What’s going on?
2	1	1	Pilot	Rick	I got a surprise for you, Morty.
```

## Data Preparation

One of our major tasks in data preparation is formatting this dataset for training our model, which requires a context-response structure. Each response has the `n` previous lines of dialogue as its context. We chose `n = 7`, so each response is paired with the seven preceding lines. Our working assumption is that the last seven lines usually provide sufficient context for understanding the current response.
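The windowing described above can be sketched as follows. This is a minimal illustration, not the exact code used in the project: the function name `build_context_rows` and the toy transcript are hypothetical, and the sketch assumes the dialogue lines are already in chronological order and ignores episode boundaries.

```python
def build_context_rows(lines, n=7):
    """Pair each line with the n lines that precede it.

    Each row is ordered [response, most recent context, ..., oldest context].
    """
    rows = []
    for i in range(n, len(lines)):
        response = lines[i]
        # Walk backwards so that the most recent line comes first
        context = [lines[j] for j in range(i - 1, i - 1 - n, -1)]
        rows.append([response] + context)
    return rows


# Hypothetical toy transcript standing in for the `line` column
toy_lines = [f"line {k}" for k in range(10)]
rows = build_context_rows(toy_lines, n=7)
print(len(rows))  # 3 rows: the responses are lines 7, 8 and 9
print(rows[0])
```

The first `n` lines of the transcript never appear as responses, because they do not have a full context window.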

Once our dataset is structured appropriately, we divide it into training and test subsets. This split allows us to train our model on one set of data (training) and evaluate its performance on a separate set (test) that it has not seen before.
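A simple way to perform such a split is sketched below. The helper name `train_test_split_rows`, the 10% test fraction, and the fixed seed are assumptions chosen for illustration; the split ratio actually used is not stated here.

```python
import random


def train_test_split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of rows and hold out test_fraction of them for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical placeholder rows standing in for the context-response rows
example_rows = [[f"response {k}", f"context {k}"] for k in range(20)]
train_rows, test_rows = train_test_split_rows(example_rows, test_fraction=0.1)
print(len(train_rows), len(test_rows))  # 18 2
```

One caveat with a random split: neighbouring rows share most of their context lines, so some test contexts will overlap with training rows. Splitting by episode would avoid this, at the cost of a less flexible split ratio.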

Finally, we convert our dataset into a format suitable for our DialoGPT model. This requires concatenating the context lines and the response into a single string for each row, separated by a special end-of-sequence token. This token lets the model identify where each turn ends.
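As a rough sketch of this concatenation step, the function below joins one row into a single training string. The name `row_to_training_string` is hypothetical, the row layout (response first, then context from newest to oldest) is an assumption, and `<|endoftext|>` is assumed to be the value of `tokenizer.eos_token` for DialoGPT.

```python
EOS_TOKEN = "<|endoftext|>"  # assumption: the value of tokenizer.eos_token for DialoGPT


def row_to_training_string(row, eos_token=EOS_TOKEN):
    """Turn [response, newest context, ..., oldest context] into one string.

    The turns are put back into chronological order, and every turn,
    including the final response, is followed by the end-of-sequence token.
    """
    chronological = list(reversed(row))
    return "".join(turn + eos_token for turn in chronological)


example_row = [
    "I got a surprise for you, Morty.",
    "What, Rick? What's going on?",
    "Morty! You gotta come on.",
]
text = row_to_training_string(example_row)
print(text)
```

Ending the string with the token as well gives the model an explicit signal for where the response stops.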

## Reflecting on the Data

There are a few key considerations worth discussing about our dataset and the way we have chosen to use it:

1. **Limited Data**: Our dataset consists of only about 2,500 lines of dialogue. This is quite a small dataset for a machine learning project. Because the GPT-2 model was initially trained on a much larger dataset, fine-tuning it on our smaller dataset may lead to overfitting, where the model performs well on the training data but fails to generalize to unseen data.

2. **Character Diversity**: Each line in the dataset can come from a different character, adding another level of complexity to the task. The model needs to learn not only the response to a given line but also which character is likely to say that response. Given the limited size of our dataset, this could affect the model's ability to accurately capture each character's style.

3. **Context Length**: The choice of seven previous lines as context is somewhat arbitrary and could affect the model's performance. If the context is too short, the model may not have enough information to generate an appropriate response. If it is too long, each training example uses more tokens, which raises memory and compute cost, may run into the model's maximum sequence length, and can dilute the signal with lines that are no longer relevant.


# Input Data Shape and Its Use in DialoGPT (A GPT-2 Variant)

The input data to GPT-2 (and in our case, the DialoGPT variant we're using) is structured as a 2D tensor, where the first dimension represents individual instances (in our case, dialogues) and the second dimension represents the token ids within each instance.

This is because, at its core, GPT-2 is a transformer model, which was designed to handle sequence data, and text data is fundamentally a sequence of words or tokens. By representing our text data as a 2D tensor, we can maintain the sequence nature of our data while enabling the model to process multiple instances simultaneously for efficient batch processing.

To better understand the shape of our input data, consider the following example:

```python
inputs = [["I", "am", "GPT-2"], ["Hello", "world"]]
```

This list of lists represents two sentences, each of which is a sequence of words. After tokenization and numerical encoding, our data might look something like this:

```python
inputs = [[9, 84, 30522], [15496, 2327]]
```

However, because tensor operations require our data to be in a regular shape (i.e., each instance must have the same length), we need to pad our data to account for sentences of different lengths:

```python
inputs = [[9, 84, 30522, 0], [15496, 2327, 0, 0]]
```

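Now our input data is a 2D tensor of shape (2, 4). The first dimension, of length 2, represents our two sentences, and the second dimension, of length 4, represents the token positions within each sentence. The `0` values stand in for padding here. This is only illustrative: in GPT-2's actual vocabulary, id 0 is an ordinary token, and the tokenizer defines no padding token by default. In practice a dedicated padding id (commonly the end-of-text token) is used together with an attention mask that tells the model which positions to ignore.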
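The padding step can be sketched with NumPy as below. The helper `pad_batch` is hypothetical, and the choice of `50256` (GPT-2's end-of-text id) as the padding id is an assumption made for illustration, since id 0 is a regular token in GPT-2's vocabulary.

```python
import numpy as np


def pad_batch(sequences, pad_id):
    """Right-pad variable-length id lists to a rectangular array plus an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask


PAD_ID = 50256  # assumption: reuse GPT-2's end-of-text id as the padding id
input_ids, attention_mask = pad_batch([[9, 84, 30522], [15496, 2327]], PAD_ID)
print(input_ids.shape)  # (2, 3)
print(attention_mask)
```

The attention mask matters as much as the padding id: it is what lets the model ignore the padded positions during attention and loss computation.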

During training, GPT-2 uses this 2D tensor as input into its self-attention mechanism. In a nutshell, self-attention allows the model to weigh the importance of each word within a sentence when predicting the next word. The weights are learned during training and depend on the context provided by the other words in the sentence. This is how GPT-2 is able to generate contextually relevant responses.


## Transcript Data

Let's take a look at the dialogue example:

| response	| context	| context/0	| context/1	| context/2	| context/3	| context/4	| context/5 |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| What do you think of this... flying vehicle, M...	| We gotta go, gotta get outta here, come on. Go... | Ow! Ow! You're tugging me too hard!	| Come on, I got a surprise for you. Come on, h...	| It's the middle of the night. What are you tal...	| I got a surprise for you, Morty.	| What, Rick? What’s going on?	| Morty! You gotta come on. Jus'... you gotta co... |

Our first step in preparing this dialogue for input into our model is to concatenate all the context and response strings into one string for each row, in chronological order, with an end-of-sequence marker (written `<eos>` here) between consecutive lines:

```
"Morty! You gotta come on. Jus'... you gotta co... <eos> What, Rick? What’s going on? <eos> I got a surprise for you, Morty. <eos> It's the middle of the night. What are you tal... <eos> Come on, I got a surprise for you. Come on, h... <eos> Ow! Ow! You're tugging me too hard! <eos> We gotta go, gotta get outta here, come on. Go... <eos> What do you think of this... flying vehicle, M..."
```

Next, we use the DialoGPT tokenizer to tokenize this string and convert it into numerical form:

```python
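from transformers import AutoTokenizer

# Load the tokenizer that matches the DialoGPT-small checkpoint
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')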
input_ids = tokenizer.encode(
    "Morty! You gotta come on. Jus'... you gotta co... <eos> What, Rick? What’s going on? <eos> I got a surprise for you, Morty. <eos> It's the middle of the night. What are you tal... <eos> Come on, I got a surprise for you. Come on, h... <eos> Ow! Ow! You're tugging me too hard! <eos> We gotta go, gotta get outta here, come on. Go... <eos> What do you think of this... flying vehicle, M...", 
    return_tensors='pt'
)
print(input_ids.shape)
```

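The output of this snippet has the form `torch.Size([1, N])`, where `N` is the number of tokens in our dialogue string. The tensor is 2D because `return_tensors='pt'` adds a leading batch dimension, even when we're only processing one dialogue.

Note that the literal text `<eos>` is not a special token in this tokenizer's vocabulary, so it is split into several ordinary sub-word tokens (the recurring ids `1279, 68, 418, 29` in the tensor below). DialoGPT's own end-of-sequence token is `tokenizer.eos_token` (`<|endoftext|>`), which encodes to a single id, so using it as the separator would be the more conventional choice.

For our example, the shape is `(1, 132)` and the tensor is: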

```python
tensor([[   44,   419,    88,     0,   921, 17753,  1282,   319,    13,   449,
           385,     6,   986,   345, 17753,   763,   986,  1279,    68,   418,
            29,  1867,    11,  8759,    30,  1867,   447,   247,    82,  1016,
           319,    30,  1279,    68,   418,    29,   314,  1392,   257,  5975,
           329,   345,    11, 30395,    13,  1279,    68,   418,    29,   632,
           338,   262,  3504,   286,   262,  1755,    13,  1867,   389,   345,
          3305,   986,  1279,    68,   418,    29,  7911,   319,    11,   314,
          1392,   257,  5975,   329,   345,    13,  7911,   319,    11,   289,
           986,  1279,    68,   418,    29, 11960,     0, 11960,     0,   921,
           821, 27762,  2667,   502,  1165,  1327,     0,  1279,    68,   418,
            29,   775, 17753,   467,    11, 17753,   651,   503,  8326,   994,
            11,  1282,   319,    13,  1514,   986,  1279,    68,   418,    29,
          1867,   466,   345,   892,   286,   428,   986,  7348,  4038,    11,
           337,   986]])
```

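When we process a batch of dialogues, the shape would be `(B, N)`, where `B` is the batch size and `N` is the length of the longest dialogue in the batch. The other dialogues in the batch are padded until they match this length. Because id `0` is a regular token in this vocabulary (it appears several times in the tensor above), the padding should use a dedicated padding id rather than literal zeros.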



# DialoGPT - A GPT-2 Variant Model Architecture

The GPT-2 (Generative Pretrained Transformer 2) model, developed by OpenAI, is a large-scale transformer-based language model. It builds upon the architecture of the original GPT model, scaling up the number of parameters and the amount of training data. The GPT-2 model is designed to generate human-like text by predicting the next token in a given sequence. This makes it particularly suitable for tasks like text generation, translation, summarization, and more. In the context of our assignment, the GPT-2 architecture forms the basis for the DialoGPT model, which we use to create a chatbot that mimics the character Rick from Rick and Morty.

<img src="assets/dialogpt.png" alt="dialogpt" height="200">

## Transformer Architecture

The Transformer model, introduced in the paper "Attention is All You Need" by Vaswani et al., is the backbone of the GPT-2 model. The Transformer model is based on a self-attention mechanism and does away with recurrence and convolutions entirely. This architecture allows the model to process input sequences in parallel, rather than sequentially, leading to significant improvements in training speed.

The GPT-2 model uses a modified version of the Transformer that keeps only the decoder part of the original architecture. The decoder consists of a stack of identical layers, each with two sub-layers: a masked multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers. Unlike the original Transformer, which applies layer normalization after each sub-layer, GPT-2 applies it at the input of each sub-layer and adds one more layer normalization after the final block.

In our chatbot application, the Transformer architecture allows the model to generate responses in a conversational context, considering the entire context of the conversation rather than just the immediate previous response.
<img src="assets/gpt2_architecture.png" alt="gpt_architecture" height="400">

## Self-Attention Mechanism

The self-attention mechanism, implemented in the Transformer as scaled dot-product attention, is a key component of the architecture. It lets the model weigh the importance of the other tokens in the sequence when computing the representation of each token. In GPT-2 the attention is masked, so each position can attend only to itself and to earlier positions. In other words, it helps the model decide where to "pay attention" when generating text.

In the context of our chatbot, the self-attention mechanism allows the model to generate responses that are contextually relevant and coherent. For example, if a user asks the chatbot a question about a specific episode of Rick and Morty, the self-attention mechanism helps the model to focus on the relevant parts of the conversation history when generating a response.
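To make the mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention with a causal mask. It is a simplification under stated assumptions: the function name, the random weights, and the toy dimensions are illustrative, and the real model uses multiple heads with learned projections.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a causal mask.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len)
    # Position i may only attend to positions <= i
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (subtract the row maximum for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8 features each
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = causal_self_attention(x, w_q, w_k, w_v)
print(output.shape)          # (4, 8)
print(np.round(weights, 2))  # lower-triangular; each row sums to 1
```

Each row of `weights` shows how much one token draws on itself and on the tokens before it; the zeros above the diagonal are the effect of the causal mask.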

<img src="assets/self_attention.webp" alt="self_attention" height="300">

## Positional Encoding

Positional encoding gives the Transformer information about the position of each token in the input sequence. Since the Transformer has no recurrence or convolutions, it has no inherent sense of token order. Positional encoding solves this problem by adding a position-dependent vector to each input embedding. The original Transformer uses fixed sinusoidal patterns for these vectors, whereas GPT-2 learns one embedding vector per position during training.

In the context of our chatbot, positional information lets the model distinguish the order of the turns in a conversation. For example, it can tell which line came last and should be answered, and which earlier lines serve only as background.
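The idea of adding a position vector to each token embedding can be sketched as follows. This example uses a lookup table of position vectors, as GPT-2-style models do; the tables here are random placeholders rather than trained weights, and the sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, d_model = 100, 16, 8  # toy sizes, not GPT-2's real ones

# Both tables would be learned during training; here they are random placeholders
token_embedding = rng.normal(size=(vocab_size, d_model))
position_embedding = rng.normal(size=(max_positions, d_model))


def embed(token_ids):
    """Return token embedding + position embedding for one sequence of ids."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return token_embedding[token_ids] + position_embedding[positions]


embedded = embed([5, 7, 5])
print(embedded.shape)                            # (3, 8)
print(np.allclose(embedded[0], embedded[2]))     # False: same token, different position
```

The same token id (5) ends up with two different vectors because it occurs at two different positions, which is exactly the information the attention layers need to recover word order.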

<img src="assets/positional_encoding.png" alt="positional_encoding" height="200">

## Layer Normalization

Layer normalization is a technique used in the GPT-2 model to stabilize training. It normalizes each token's activations across the feature dimension, rather than normalizing each feature across the batch as batch normalization does. In other words, the mean and variance used for normalization are computed over all the features of a layer for a single training example, so the result does not depend on the other examples in the batch. A learned gain and bias are then applied to the normalized values.

In the context of our chatbot, layer normalization keeps the scale of the activations from drifting as they pass through the stacked layers, which makes fine-tuning more stable. This matters for a chatbot application, where the model needs to handle a wide variety of inputs and still generate coherent and contextually relevant responses.
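A minimal NumPy sketch of the computation is shown below. The function signature and the `eps` value are assumptions for illustration; in a trained model, `gain` and `bias` are learned parameters rather than ones and zeros.

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each row (one token's feature vector) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias


# Two feature vectors with the same pattern but very different scales
features = np.array([[1.0, 2.0, 3.0, 4.0],
                     [10.0, 20.0, 30.0, 40.0]])
normalized = layer_norm(features, gain=np.ones(4), bias=np.zeros(4))
print(np.round(normalized, 3))
```

Although the second row is ten times larger than the first, both rows come out almost identical after normalization, which illustrates how layer normalization removes differences in scale.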

<img src="assets/layer_norm.png" alt="layer_norm" height="200">

## Feed-Forward Networks

Each layer of the GPT-2 model contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a non-linear activation in between: the original Transformer uses ReLU, while GPT-2 uses GELU. Within a layer the same parameters are shared across all positions, but each layer has its own parameters.

The feed-forward networks in the GPT-2 model serve to increase the representational power of the model. They allow the model to learn more complex patterns in the data, which is crucial for a chatbot application. For example, the model needs to understand complex conversational patterns, detect sarcasm or humor, and generate responses that are not only contextually relevant but also in line with the character's personality (in this case, Rick from Rick and Morty).
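The position-wise feed-forward network can be sketched in a few lines of NumPy. This sketch uses the GELU activation (in its tanh approximation), which is what GPT-2 uses; the weights are random placeholders and the dimensions are toy values, so it illustrates the shape of the computation rather than the trained model.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def feed_forward(x, w1, b1, w2, b2):
    """Expand each position to a wider hidden size, apply GELU, project back."""
    return gelu(x @ w1 + b1) @ w2 + b2


rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32  # toy sizes; GPT-2 uses d_hidden = 4 * d_model
x = rng.normal(size=(5, d_model))  # 5 positions
w1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
w2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

y = feed_forward(x, w1, b1, w2, b2)
print(y.shape)  # (5, 8)
```

Because the same weights are applied to every row of `x`, processing one position on its own gives the same result as processing it as part of the full sequence.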

<img src="assets/feed_forward.png" alt="feed_forward" height="200">

## Model Size and Parameters

The GPT-2 model comes in several sizes, ranging from "small" (117 million parameters as originally reported, often cited as 124 million) to "extra large" (1.5 billion parameters). The size of the model (i.e., the number of parameters) is a key factor that determines the model's capacity to learn from data. A larger model can learn more complex patterns in the data, but it also requires more computational resources to train and run.

In the context of our chatbot, we need to balance the model size against the available computational resources and the complexity of the task. DialoGPT itself is released in small, medium, and large checkpoints; the tokenizer snippet earlier loads `microsoft/DialoGPT-small`, the smallest of them. If we want the chatbot to generate highly creative and nuanced responses that closely mimic the character Rick, a larger checkpoint might help. However, with limited computational resources and a dataset as small as ours, a smaller model is cheaper to fine-tune and might suffice.
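To see where the parameters of the smallest model come from, the count can be reproduced with a short calculation. The hyperparameters below (12 layers, hidden size 768, vocabulary of 50,257 tokens, 1,024 positions) are the commonly published values for the smallest GPT-2 configuration and are assumptions not stated elsewhere in this report; the function name is hypothetical.

```python
def gpt2_parameter_count(n_layer, d_model, vocab_size, n_positions):
    """Count weights and biases of a GPT-2-style decoder with tied output embeddings."""
    attention = d_model * 3 * d_model + 3 * d_model  # fused query/key/value projection
    attention += d_model * d_model + d_model         # attention output projection
    d_hidden = 4 * d_model
    mlp = d_model * d_hidden + d_hidden              # expand to the hidden size
    mlp += d_hidden * d_model + d_model              # project back to d_model
    layer_norms = 2 * 2 * d_model                    # two layer norms, gain + bias each
    per_layer = attention + mlp + layer_norms
    embeddings = vocab_size * d_model + n_positions * d_model
    final_layer_norm = 2 * d_model
    return n_layer * per_layer + embeddings + final_layer_norm


small = gpt2_parameter_count(n_layer=12, d_model=768, vocab_size=50257, n_positions=1024)
print(f"{small:,}")  # 124,439,808
```

Under these assumptions the total is roughly 124 million, of which about 39 million sit in the token embedding table alone. Doubling `d_model` roughly quadruples the per-layer cost, which is why the larger variants grow so quickly.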



# DialoGPT and its Adaptation from GPT-2

DialoGPT is a variant of the GPT-2 model, specifically fine-tuned for conversational responses. It was trained on a large dataset of conversations extracted from Reddit comment threads, allowing it to generate human-like conversational responses. This makes DialoGPT particularly suitable for chatbot applications, such as our Rick and Morty character chatbot.

## Differences and Similarities with GPT-2

While DialoGPT and GPT-2 share the same underlying Transformer architecture, there are some key differences between the two models, mainly due to the fine-tuning process used to train DialoGPT.

### Similarities

- **Architecture**: Both GPT-2 and DialoGPT are based on the Transformer architecture, which includes components like self-attention mechanism, positional encoding, layer normalization, and feed-forward networks.

- **Training Objective**: Both models are trained autoregressively on a decoder-only Transformer with masked self-attention, learning to predict each token from the tokens that precede it.

- **Generative Models**: Both GPT-2 and DialoGPT are generative models, meaning they generate text by predicting the next word in a sequence.
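The generative loop shared by both models can be sketched with a toy stand-in for the network. Everything here is hypothetical: `toy_next_token_logits` replaces the real model with a rule that always favours the next id, and greedy selection is only one of several decoding strategies.

```python
def toy_next_token_logits(token_ids, vocab_size=5):
    """Hypothetical stand-in for a language model: favours (last token + 1) % vocab_size."""
    logits = [0.0] * vocab_size
    logits[(token_ids[-1] + 1) % vocab_size] = 1.0
    return logits


def greedy_generate(prompt_ids, eos_id, max_new_tokens=10):
    """Append the most likely next token until eos_id or the length limit is reached."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(token_ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        token_ids.append(next_id)
        if next_id == eos_id:
            break
    return token_ids


print(greedy_generate([0, 1], eos_id=4))  # [0, 1, 2, 3, 4]
```

In a chatbot, the prompt would be the tokenized conversation history, and generation stops when the model emits its end-of-sequence token, which marks the end of the reply.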

### Differences

- **Training Data**: While GPT-2 is trained on a diverse range of internet text, DialoGPT is further trained on a large corpus of conversational exchanges extracted from Reddit comment threads. This allows DialoGPT to generate more conversational and contextually appropriate responses.

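- **Input Format**: DialoGPT inherits GPT-2's byte-level BPE tokenizer and vocabulary, so tokenization itself is the same in both models. What differs is how the input is framed: a multi-turn dialogue is concatenated into a single sequence, with the end-of-text token used to mark turn boundaries.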

- **Fine-Tuning**: DialoGPT is fine-tuned for dialogue generation tasks, which makes it more suitable for chatbot applications compared to GPT-2.

In the context of our chatbot, the fine-tuning process allows DialoGPT to generate responses that are more conversational and contextually appropriate. This is crucial for creating a chatbot that can hold a natural, engaging conversation while mimicking the character Rick from Rick and Morty.
