# Introduction to the Modeling Problem

In this project, we use Natural Language Processing (NLP) to tackle a specific challenge: developing a chatbot that not only holds contextually relevant conversations but also emulates the distinctive conversational styles of characters from the popular TV show "Rick and Morty". We chose to work with Microsoft's DialoGPT model, a variant of GPT-2 that is fine-tuned specifically for dialogue.

Our project involves distinct challenges, each representing a key stage in the development of our chatbot:

1. **Data Acquisition and Cleaning**: Our first task was to obtain and clean a dataset with enough depth and richness to effectively train our model. This required sourcing and preprocessing dialogues from the TV show.

2. **Understanding Character Styles**: Each character in "Rick and Morty" boasts a unique conversational style, peppered with their own idiosyncrasies. The complexity of capturing the nuances of their dialogue, including humor, colloquialisms, and catchphrases, represented a significant challenge.

3. **Text Generation**: Beyond generating relevant responses, our model needed to convincingly replicate the distinct conversational style of the character in question. Our tool of choice was DialoGPT, a GPT-2 variant fine-tuned for generating conversational responses.

4. **Evaluation**: Our final hurdle lay in evaluating our model's performance. Apart from traditional accuracy metrics, we had to assess whether the generated text genuinely encapsulated the character's style. Establishing evaluation metrics represented a unique and interesting problem in itself.

___

## Goal of Our Project

The primary aim of our project is to leverage Natural Language Processing (NLP) techniques to develop an interactive and unique artificial personality. Our focus is not only to ensure contextually relevant conversations but also to emulate the distinctive conversational styles of characters from the TV show "Rick and Morty". By utilizing the DialoGPT model, a fine-tuned variant of the GPT-2 model specialized for dialogues, we seek to craft a chatbot that engages users in fun interactions, bringing the personas of "Rick and Morty" into our NLP model. Through this project we aim to expand our knowledge of conversational AI models and test their boundaries.

___

## Extending NLP-Powered Chatbot to Real Business Solutions and Beyond

The application of our NLP-powered chatbot could extend beyond recreating characters from one popular show. With a successful Artificial Persona we can create unique and interactive user experiences on various platforms, presenting opportunities for businesses and content creators.

1. **Website with Cool Characters:** A business could use our NLP Artificial Persona chatbot to design engaging websites featuring interactive and entertaining characters. These characters could be original creations or inspired by popular media franchises.

2. **Interactive YouTubers:** Content creators, especially YouTubers, can use our NLP-powered Artificial Persona to bring their personas to life in a whole new way. Imagine a YouTube channel where the creator's virtual character interacts with viewers in real time, responding to comments and engaging in witty banter. If the Artificial Persona is indistinguishable from the real person, this personalized and interactive experience could foster a stronger sense of connection between the content creator and their audience.

3. **Customizable Chatbots for Customer Support:** Beyond entertainment, our chatbot technology can serve as a valuable tool for businesses to streamline customer support services. Companies can create customized chatbots with distinct personalities that align with their brand identity. These chatbots can efficiently handle customer inquiries, provide helpful information, and resolve issues, all while maintaining a friendly and engaging conversational style.

4. **Educational Applications:** In the realm of education, our NLP-powered chatbot can be utilized to create interactive and adaptive learning experiences. Students can engage with virtual characters that act as intelligent tutors, providing personalized explanations, quizzes, and feedback based on individual learning preferences.

5. **Virtual Influencers:** As the world of social media continues to evolve, virtual influencers have gained popularity. Our chatbot technology can enable the creation of virtual influencers with unique personalities, who can interact with followers on social media platforms. These virtual influencers can promote products, share content, and engage in conversations with users, offering a fresh approach to influencer marketing.

The potential applications of our NLP-powered Artificial Persona chatbot are diverse: enhancing user experiences on websites, changing the way content creators interact with their audience, and supporting customer service, education, and influencer marketing. Through this project, we are exploring the capabilities of NLP models and how a successful model could integrate with today's world.

___

## Choice of Model

In this project, we decided to utilize Microsoft's DialoGPT model. As a variation of the GPT-2 model fine-tuned explicitly for generating conversational responses, DialoGPT seemed well-suited for our task of creating a chatbot that mirrors the dialogue style in the "Rick and Morty" show. This model selection was also backed by the promising results demonstrated by DialoGPT in generating contextually relevant and human-like conversational responses.

In the realm of Natural Language Processing (NLP), multiple models could have been considered. For instance, sequence-to-sequence (Seq2Seq) models like LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Units) networks could have been potential choices. However, these recurrent neural network-based models often struggle with long sequences due to the vanishing gradient problem, making them less ideal for dialog systems where the context could be quite lengthy.

On the other hand, Transformer-based models like GPT-2 overcome this limitation with their attention mechanism. Although the GPT-2 model is already quite powerful, DialoGPT takes it a step further by fine-tuning this base model on a large-scale dialogue dataset. This fine-tuning makes DialoGPT more apt at understanding and generating dialogue, a feature that aligns perfectly with our project goal.

Our aim was to mimic the humorous style of "Rick and Morty" accurately, with a focus on the character of Rick, while also generating contextually coherent and relevant responses in a conversational setting. The DialoGPT model provided a solid foundation for this, allowing us to fine-tune it further on our dataset.

___

# Data Deep Dive

In this section, we delve into our dataset, which is the cornerstone of any Machine Learning project. We'll explore the data acquisition process, our method for preparing the data for our model, and some challenges associated with the nature of the data and our approach. 

## Data Acquisition

Our dataset comes from Kaggle, specifically the [Rick and Morty Scripts dataset](https://www.kaggle.com/andradaolteanu/rickmorty-scripts) posted by user Andrada Olteanu. This dataset provides lines of dialogue from different characters across various episodes of the show.

The dataset structure is as follows:

- **index**: A simple row identifier.
- **season no.**: The season in which the dialogue line appears.
- **episode no.**: The episode in which the dialogue line appears.
- **episode name**: The name of the episode.
- **name**: The character who speaks the line.
- **line**: The line of dialogue itself.

Here are the first few lines for context:

```
index	season no.	episode no.	episode name	name	line
0	1	1	Pilot	Rick	Morty! You gotta come on. Jus'... you gotta come with me.
1	1	1	Pilot	Morty	What, Rick? What’s going on?
2	1	1	Pilot	Rick	I got a surprise for you, Morty.
```
___

## Data Preparation

One of our major tasks in data preparation is formatting this dataset for training our model, which requires a context-response structure. Each response will have the `n` previous responses as its context. We decided that `n = 7`, meaning each response has the seven preceding lines of dialogue as its context. The reason for this decision is that, in most conversations, the last 7 exchanges provide a sufficient context for understanding the current response.

Once our dataset is structured appropriately, we divide it into training and test subsets. This split allows us to train our model on one set of data (training) and evaluate its performance on a separate set (test) that it has not seen before.

Finally, we convert our dataset into a format suitable for our DialoGPT model. This requires concatenating the lines of each row into a single sequence, with a special end-of-sequence (EOS) token after each line. This token enables the model to identify where each utterance ends.
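
The context-window idea can be illustrated on a toy list of lines. This is a minimal sketch, not the project's actual preprocessing code: the helper names `build_context_rows` and `join_with_eos` are hypothetical, the toy lines are placeholders, and the EOS text `<|endoftext|>` is assumed to be the one used by GPT-2-style tokenizers.

```python
def build_context_rows(lines, n):
    """Pair each line with the n lines before it, newest first."""
    rows = []
    for i in range(n, len(lines)):
        # Index i is the response; i-1 ... i-n are its context, newest first.
        rows.append([lines[j] for j in range(i, i - n - 1, -1)])
    return rows


def join_with_eos(row, eos="<|endoftext|>"):
    """Concatenate a row oldest-first, ending every utterance with the EOS text."""
    return "".join(utterance + eos for utterance in reversed(row))


# Toy dialogue (placeholder strings, not taken from the dataset)
toy_lines = ["a", "b", "c", "d"]
rows = build_context_rows(toy_lines, n=2)
print(rows)                    # [['c', 'b', 'a'], ['d', 'c', 'b']]
print(join_with_eos(rows[0]))  # a<|endoftext|>b<|endoftext|>c<|endoftext|>
```

Each row stores the response first and the context from newest to oldest, so the row is reversed before joining to restore chronological order.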



### Data Preparation Code

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Args is the notebook's configuration class, defined in an earlier cell
args = Args()
# Read the dataset from the working directory
all_rick = pd.read_csv('RickAndMortyScripts.csv')
contexted = []
# Number of previous lines used as context
n = 7

# Build context-response rows: each line of dialogue is paired with the n lines before it
for i in range(n, len(all_rick['line'])):
    row = []
    prev = i - 1 - n
    # Walk backwards from the response (index i) to the oldest context line (index i - n)
    for j in range(i, prev, -1):
        row.append(all_rick['line'][j])
    contexted.append(row)

columns = ['response', 'context']
columns = columns + ['context/' + str(i) for i in range(n - 1)]

# Each DataFrame row is one context-response pair: the response first,
# then the context lines from newest to oldest
df = pd.DataFrame.from_records(contexted, columns=columns)
# Split 9:1, so 90% is used for training and the remaining 10% for validation
trn_df, val_df = train_test_split(df, test_size=0.1)
df.head(5)
```

|   | response | context | context/0 | context/1 | context/2 | context/3 | context/4 | context/5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What do you think of this... flying vehicle, M... | We gotta go, gotta get outta here, come on. Go... | Ow! Ow! You're tugging me too hard! | Come on, I got a surprise for you. Come on, h... | It's the middle of the night. What are you tal... | I got a surprise for you, Morty. | What, Rick? What’s going on? | Morty! You gotta come on. Jus'... you gotta co... |
| 1 | Yeah, Rick... I-it's great. Is this the surprise? | What do you think of this... flying vehicle, M... | We gotta go, gotta get outta here, come on. Go... | Ow! Ow! You're tugging me too hard! | Come on, I got a surprise for you. Come on, h... | It's the middle of the night. What are you tal... | I got a surprise for you, Morty. | What, Rick? What’s going on? |
| 2 | Morty. I had to... I had to do it. I had— I ha... | Yeah, Rick... I-it's great. Is this the surprise? | What do you think of this... flying vehicle, M... | We gotta go, gotta get outta here, come on. Go... | Ow! Ow! You're tugging me too hard! | Come on, I got a surprise for you. Come on, h... | It's the middle of the night. What are you tal... | I got a surprise for you, Morty. |
| 3 | What?! A bomb?! | Morty. I had to... I had to do it. I had— I ha... | Yeah, Rick... I-it's great. Is this the surprise? | What do you think of this... flying vehicle, M... | We gotta go, gotta get outta here, come on. Go... | Ow! Ow! You're tugging me too hard! | Come on, I got a surprise for you. Come on, h... | It's the middle of the night. What are you tal... |
| 4 | We're gonna drop it down there just get a whol... | What?! A bomb?! | Morty. I had to... I had to do it. I had— I ha... | Yeah, Rick... I-it's great. Is this the surprise? | What do you think of this... flying vehicle, M... | We gotta go, gotta get outta here, come on. Go... | Ow! Ow! You're tugging me too hard! | Come on, I got a surprise for you. Come on, h... |


Then we encode our text data into a sequence of integers, each representing a token, which is then flattened into a single list for the model to process.

```python
def construct_conv(row, tokenizer, eos=True):
    """
    Encode one context-response row into a single flat list of token IDs.
    """
    flatten = lambda l: [item for sublist in l for item in sublist]
    # Encode each utterance as token IDs and append the EOS token ID;
    # reversing the row puts the oldest context first and the response last
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    # Flatten the list of encoded utterances into a single list
    conv = flatten(conv)
    return conv
```

This is our custom `ConversationDataset` class, which inherits from PyTorch's `Dataset` class. It handles the preparation of conversational data for model training or evaluation, caches the processed data for efficiency, and returns each item as a PyTorch tensor.

```python
import logging
import os
import pickle

import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

logger = logging.getLogger(__name__)


class ConversationDataset(Dataset):
    """
    Custom dataset that inherits from PyTorch's Dataset class.
    It handles the preparation of conversational data for model training or evaluation.
    """
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):
        # Reserve room for any special tokens the tokenizer adds to a single sequence
        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        # Location of the cached, already-tokenized examples
        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        # Reuse the cache if it exists and overwriting was not requested
        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            # Tokenize every context-response row into one flat list of token IDs
            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        # Return the example as a PyTorch tensor of token IDs
        return torch.tensor(self.examples[item], dtype=torch.long)
```

## Reflecting on the Data

There are a few key considerations worth discussing about our dataset and the way we have chosen to use it:

1. **Limited Data**: Our dataset consists of only about 2,500 lines of dialogue, which is small for a machine learning project. Because the pretrained model has far more parameters than such a dataset can constrain, fine-tuning on it may lead to overfitting, where the model performs well on the training data but fails to generalize to unseen data.

2. **Character Diversity**: Each line in the dataset can come from a different character, adding another level of complexity to the task. The model needs to learn not only the response to a given line but also which character is likely to say that response. Given the limited size of our dataset, this could affect the model's ability to accurately capture each character's style.

3. **Context Length**: The choice of seven previous lines as context is somewhat arbitrary and could affect the model's performance. If the context is too short, the model may not have enough information to generate an appropriate response. If it is too long, each training example grows, which raises memory and compute costs and may pull in lines from an unrelated earlier exchange.

___

# Exploratory Data Analysis
Let's turn our attention to the initial step of any data science project: Exploratory Data Analysis (EDA).

EDA is a critical step that allows us to understand the data we're working with. By visualizing, summarizing, and interpreting the data, we can gain valuable insights that guide our model selection and training process. It also helps us identify any potential issues with the data, such as missing values or outliers, that could impact the performance of our models.


### Unigrams, Bigrams, and Trigrams

In the field of Natural Language Processing (NLP), unigrams, bigrams, and trigrams are essential components of text analysis. They represent sequences of one, two, and three words respectively. For example, given the text "come on Morty", the unigrams are "come", "on", "Morty"; the bigrams are "come on", "on Morty"; and the trigram is "come on Morty".
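
The definition above can be checked with a few lines of plain Python. This is a minimal sketch that splits on whitespace only; the helper name `ngrams` is ours, and a real analysis would normally also lowercase the text and strip punctuation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "come on Morty".split()
print(ngrams(tokens, 1))  # [('come',), ('on',), ('Morty',)]
print(ngrams(tokens, 2))  # [('come', 'on'), ('on', 'Morty')]
print(ngrams(tokens, 3))  # [('come', 'on', 'Morty')]

# Counting n-grams across several lines gives the frequencies behind a chart
bigram_counts = Counter()
for line in ["come on Morty", "come on Rick"]:
    bigram_counts.update(ngrams(line.split(), 2))
print(bigram_counts.most_common(1))  # [(('come', 'on'), 2)]
```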

One common application of unigrams, bigrams, and trigrams is in the creation of a Term Document Matrix (TDM). A TDM is a mathematical matrix that describes the frequency of terms (unigrams, bigrams, trigrams, etc.) that occur in a collection of documents. In a TDM, rows represent the terms found in the documents, columns represent the documents, and the values represent the frequency of each term in each document.

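Below is the Python code that defines functions to build count matrices of unigrams, bigrams, and trigrams from a collection of texts. Note that scikit-learn's `CountVectorizer` returns the transposed layout, a document-term matrix, in which rows are documents and columns are n-grams.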

```python
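import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Function to create a matrix of n-gram counts (one row per document, one column per n-gram)
def create_tdm(text, ngram_range):
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    tdm = vectorizer.fit_transform(text)
    # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
    return pd.DataFrame(tdm.toarray(), columns=vectorizer.get_feature_names_out())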

# Function to create unigrams
def create_unigrams(text):
    return create_tdm(text, (1, 1))

# Function to create bigrams
def create_bigrams(text):
    return create_tdm(text, (2, 2))

# Function to create trigrams
def create_trigrams(text):
    return create_tdm(text, (3, 3))
```

These functions helped to analyze the frequency of unigrams, bigrams, and trigrams in our text data, which can provide insights into the most common words and phrases, the relationships between words, and the overall structure of the text.

<img src="assets/Morty_unigram_frequency_chart.png" alt="Morty unigram frequency chart" height="250">
<img src="assets/Morty_bigram_frequency_chart.png" alt="Morty bigram frequency chart" height="250">
<img src="assets/Morty_trigram_frequency_chart.png" alt="Morty trigram frequency chart" height="250">
<img src="assets/Rick_unigram_frequency_chart.png" alt="Rick unigram frequency chart" height="250">
<img src="assets/Rick_bigram_frequency_chart.png" alt="Rick bigram frequency chart" height="250">
<img src="assets/Rick_trigram_frequency_chart.png" alt="Rick trigram frequency chart" height="250">

### Dialogue Analysis: Who Talks the Most?

In any dialogue-based dataset, one interesting aspect to explore is the distribution of dialogues among the characters. This can provide insights into the dynamics of the character interactions and the prominence of each character in the dataset.

In the case of our Rick and Morty dataset, it's not surprising to find that Rick and Morty, being the titular characters, have the most dialogues. Following them are Beth, Jerry, and Summer, completing the Smith family.

Interestingly, we've chosen to treat 'Pickle Rick' as a character on his own, separate from 'Rick'. This decision was made due to the distinct personality that Pickle Rick exhibits in the show, which could be considered somewhat more 'sour' than the 'true' Rick.

The chart below shows the number of lines spoken by each character.

<img src="assets/character_line_frequencies_chart.png" alt="Character line frequencies chart" height="300">
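
The per-character counts behind a chart like this can be sketched with the standard library. The rows below are toy data: the first three lines are the opening lines of the dataset, and the "Pickle Rick" entry is a placeholder added only to show that he is counted as a separate speaker.

```python
from collections import Counter

# Toy (name, line) pairs; a real run would use the `name` column of the CSV
toy_rows = [
    ("Rick", "Morty! You gotta come on. Jus'... you gotta come with me."),
    ("Morty", "What, Rick? What's going on?"),
    ("Rick", "I got a surprise for you, Morty."),
    ("Pickle Rick", "(placeholder line)"),
]

# Count how many lines each speaker has
line_counts = Counter(name for name, _ in toy_rows)
for name, count in line_counts.most_common():
    print(name, count)
```

With the dataset loaded into pandas, `all_rick['name'].value_counts()` should give the same kind of tally.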

___

# Input Data Shape and Its Use in DialoGPT (A GPT-2 Variant)

The input data to GPT-2 (and in our case, the DialoGPT variant we're using) is structured as a 2D tensor, where the first dimension indexes individual instances (in our case, dialogues) and the second dimension indexes the tokens within each instance.

This is because, at its core, GPT-2 is a transformer model, which was designed to handle sequence data, and text data is fundamentally a sequence of words or tokens. By representing our text data as a 2D tensor, we can maintain the sequence nature of our data while enabling the model to process multiple instances simultaneously for efficient batch processing.

To better understand the shape of our input data, consider the following example:

```python
inputs = [["I", "am", "GPT-2"], ["Hello", "world"]]
```

This list of lists represents two sentences, each of which is a sequence of words. After tokenization and numerical encoding, our data might look something like this:

```python
inputs = [[9, 84, 30522], [15496, 2327]]
```

However, because tensor operations require our data to be in a regular shape (i.e., each instance must have the same length), we need to pad our data to account for sentences of different lengths:

```python
inputs = [[9, 84, 30522, 0], [15496, 2327, 0, 0]]
```

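Now our input data is a 2D tensor of shape (2, 4). The first dimension, of length 2, represents our two sentences, and the second dimension, of length 4, represents the tokens within each sentence. The '0' values stand in for padding that we've added to make each sentence the same length. Note that in GPT-2's actual vocabulary, ID 0 is an ordinary token and the tokenizer defines no padding token by default, so padded positions are typically marked with a separate attention mask instead of being identified by their value.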
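
Padding is usually paired with an attention mask that records which positions hold real tokens. The sketch below is illustrative and not part of the project's training code: `pad_batch` is a hypothetical helper, and it pads to the longest sequence in the batch (length 3 here) instead of the fixed length of 4 used in the example above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to a common length and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[9, 84, 30522], [15496, 2327]])
print(ids)   # [[9, 84, 30522], [15496, 2327, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```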

During training, GPT-2 uses this 2D tensor as input into its self-attention mechanism. In a nutshell, self-attention allows the model to weigh the importance of each word within a sentence when predicting the next word. The weights are learned during training and depend on the context provided by the other words in the sentence. This is how GPT-2 is able to generate contextually relevant responses.

___

## Transcript Data

Let's take a look at the dialogue example:

| response	| context	| context/0	| context/1	| context/2	| context/3	| context/4	| context/5 |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| What do you think of this... flying vehicle, M...	| We gotta go, gotta get outta here, come on. Go... | Ow! Ow! You're tugging me too hard!	| Come on, I got a surprise for you. Come on, h...	| It's the middle of the night. What are you tal...	| I got a surprise for you, Morty.	| What, Rick? What’s going on?	| Morty! You gotta come on. Jus'... you gotta co... |

Our first step in preparing this dialogue for input into our model is to concatenate all the context and response strings into one string for each row, oldest line first, with an end-of-sequence marker (written here as `<eos>`) between utterances:

```
"Morty! You gotta come on. Jus'... you gotta co... <eos> What, Rick? What’s going on? <eos> I got a surprise for you, Morty. <eos> It's the middle of the night. What are you tal... <eos> Come on, I got a surprise for you. Come on, h... <eos> Ow! Ow! You're tugging me too hard! <eos> We gotta go, gotta get outta here, come on. Go... <eos> What do you think of this... flying vehicle, M..."
```

Next, we use the DialoGPT tokenizer to tokenize this string and convert it into numerical form:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
input_ids = tokenizer.encode(
    "Morty! You gotta come on. Jus'... you gotta co... <eos> What, Rick? What’s going on? <eos> I got a surprise for you, Morty. <eos> It's the middle of the night. What are you tal... <eos> Come on, I got a surprise for you. Come on, h... <eos> Ow! Ow! You're tugging me too hard! <eos> We gotta go, gotta get outta here, come on. Go... <eos> What do you think of this... flying vehicle, M...", 
    return_tensors='pt'
)
print(input_ids.shape)
```

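The output of this snippet has the form `torch.Size([1, N])`, where `N` is the number of tokens in our dialogue string. The tensor is 2D because `return_tensors='pt'` makes the tokenizer add a leading batch dimension, even when we're only processing one dialogue.

For our example the shape is `(1, 132)`, and the tensor is shown below. Note that the literal text `<eos>` used in this illustration is not the tokenizer's special end-of-sequence token, so it is split into several ordinary tokens (the recurring `1279, 68, 418, 29`). The training code in `construct_conv` instead appends `tokenizer.eos_token_id` directly.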

```python
tensor([[   44,   419,    88,     0,   921, 17753,  1282,   319,    13,   449,
           385,     6,   986,   345, 17753,   763,   986,  1279,    68,   418,
            29,  1867,    11,  8759,    30,  1867,   447,   247,    82,  1016,
           319,    30,  1279,    68,   418,    29,   314,  1392,   257,  5975,
           329,   345,    11, 30395,    13,  1279,    68,   418,    29,   632,
           338,   262,  3504,   286,   262,  1755,    13,  1867,   389,   345,
          3305,   986,  1279,    68,   418,    29,  7911,   319,    11,   314,
          1392,   257,  5975,   329,   345,    13,  7911,   319,    11,   289,
           986,  1279,    68,   418,    29, 11960,     0, 11960,     0,   921,
           821, 27762,  2667,   502,  1165,  1327,     0,  1279,    68,   418,
            29,   775, 17753,   467,    11, 17753,   651,   503,  8326,   994,
            11,  1282,   319,    13,  1514,   986,  1279,    68,   418,    29,
          1867,   466,   345,   892,   286,   428,   986,  7348,  4038,    11,
           337,   986]])
```

When we process a batch of dialogues, the shape is `(B, N)`, where `B` is the batch size and `N` is the length of the longest dialogue in the batch. The shorter dialogues in the batch are padded until they match this length.



___

# DialoGPT - A GPT-2 Variant Model Architecture

The GPT-2 (Generative Pretrained Transformer 2) model, developed by OpenAI, is a large-scale transformer-based language model. It builds upon the architecture of the original GPT model, with improvements in the model size, training data, and parameters. The GPT-2 model is designed to generate human-like text by predicting the next word in a given sequence of words. This makes it particularly suitable for tasks like text generation, translation, summarization, and more. In the context of our assignment, the GPT-2 architecture forms the basis for the DialoGPT model, which we use to create a chatbot that mimics the character Rick from Rick and Morty.
<img src="assets/dialogpt.png" alt="dialogpt" height="200">

## Transformer Architecture

The Transformer model, introduced in the paper "Attention is All You Need" by Vaswani et al., is the backbone of the GPT-2 model. The Transformer model is based on a self-attention mechanism and does away with recurrence and convolutions entirely. This architecture allows the model to process input sequences in parallel, rather than sequentially, leading to significant improvements in training speed.

The GPT-2 model uses a modified version of the Transformer that includes only the decoder part of the original model. The decoder consists of a stack of identical layers, each with two sub-layers: a masked multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection wraps each sub-layer. Unlike the original Transformer, GPT-2 applies layer normalization at the input of each sub-layer instead of after it, and adds one more layer normalization after the final layer.

In our chatbot application, the Transformer architecture allows the model to generate responses in a conversational context, considering the entire context of the conversation rather than just the immediate previous response.

<img src="assets/gpt2_architecture.png" alt="gpt_architecture" height="400">

## Self-Attention Mechanism

The self-attention mechanism, implemented as scaled dot-product attention, is a key component of the Transformer architecture. It allows the model to weigh the importance of the other tokens in the input sequence when computing the representation of each token. In other words, it helps the model decide where to "pay attention" when generating text. In GPT-2 the attention is masked (causal), so each position can attend only to itself and to earlier positions.

In the context of our chatbot, the self-attention mechanism allows the model to generate responses that are contextually relevant and coherent. For example, if a user asks the chatbot a question about a specific episode of Rick and Morty, the self-attention mechanism helps the model to focus on the relevant parts of the conversation history when generating a response.

<img src="assets/self_attention.webp" alt="self_attention" height="300">
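
To make the mechanism concrete, here is a minimal single-head sketch in plain Python with a causal mask. It is an illustration under simplifying assumptions, not GPT-2's implementation: a real model first projects the input with learned query, key, and value matrices and runs several heads in parallel, whereas this sketch uses the raw input vectors for all three.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of vectors, one per position.
    """
    d = len(q[0])
    outputs, all_weights = [], []
    for t in range(len(q)):
        # Position t may only look at positions 0..t (the causal mask)
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors
        out = [sum(w * v[s][j] for s, w in enumerate(weights))
               for j in range(len(v[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_self_attention(x, x, x)
print(weights[0])  # [1.0]: the first position can only attend to itself
print(weights[2])  # three weights that sum to 1
```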

## Positional Encoding

Positional encoding gives the Transformer information about the position of each token in the input sequence. Since the Transformer doesn't have any recurrence or convolutions, it has no inherent sense of token order. Positional encoding solves this problem by adding a position-dependent vector to each input embedding. The original Transformer used fixed sinusoidal vectors, whereas GPT-2 learns one embedding vector per position during training.

In the context of our chatbot, positional information helps the model generate responses that make sense in the flow of the conversation. For example, it lets the model tell which utterance came most recently, so a follow-up question can be related to the question that preceded it.

<img src="assets/positional_encoding.png" alt="positional_encoding" height="200">
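
A minimal sketch of how position information enters the model, assuming learned position embeddings of the kind GPT-2 uses. The table sizes and random initial values are toy choices for illustration; in a trained model both tables are learned parameters.

```python
import random

random.seed(0)
vocab_size, max_positions, d_model = 10, 8, 4

# Two lookup tables: one vector per token ID and one vector per position
token_emb = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(max_positions)]


def embed(token_ids):
    """Sum each token's embedding with the embedding of its position."""
    return [
        [t + p for t, p in zip(token_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(token_ids)
    ]


# The same token ID at two positions gets two different input vectors
vectors = embed([3, 3])
print(vectors[0] != vectors[1])  # True
```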

## Layer Normalization

Layer normalization is a technique used in the GPT-2 model to stabilize the neural network's learning process. It normalizes the inputs across the features instead of normalizing each feature across the batch as in batch normalization. In other words, for each token position in a single training case, layer normalization computes the mean and variance over that position's feature vector, normalizes with them, and then applies a learned gain and bias.

In the context of our chatbot, layer normalization helps to stabilize the learning process and improve the model's performance. It keeps the scale of the features from changing drastically between layers, which makes training a deep stack of layers more reliable. This is particularly important for a chatbot application, where the model needs to handle a wide variety of inputs and generate coherent and contextually relevant responses.

<img src="assets/layer_norm.png" alt="layer_norm" height="200">
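
The computation for a single feature vector can be sketched in plain Python. The function below is illustrative: the gain and bias default to 1 and 0, while in the model they are learned, and `eps` is a small constant assumed here for numerical stability.

```python
import math


def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then rescale."""
    d = len(x)
    gain = gain or [1.0] * d
    bias = bias or [0.0] * d
    mean = sum(x) / d
    var = sum((xi - mean) ** 2 for xi in x) / d
    return [g * (xi - mean) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gain, bias)]


y = layer_norm([1.0, 2.0, 3.0, 4.0])
print(y)
print(sum(y) / len(y))  # approximately 0
```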

## Feed-Forward Networks

Each layer of the GPT-2 model contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a GELU activation function in between (the original Transformer used ReLU). The same feed-forward weights are applied at every position within a layer, but each layer has its own parameters.

The feed-forward networks in the GPT-2 model serve to increase the representational power of the model. They allow the model to learn more complex patterns in the data, which is crucial for a chatbot application. For example, the model needs to understand complex conversational patterns, detect sarcasm or humor, and generate responses that are not only contextually relevant but also in line with the character's personality (in this case, Rick from Rick and Morty).

<img src="assets/feed_forward.png" alt="feed_forward" height="200">
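
A minimal sketch of a position-wise feed-forward block in plain Python. It uses the tanh approximation of GELU, which we believe matches GPT-2's released code, and toy weights chosen only for illustration. In GPT-2 the hidden layer is four times wider than the model dimension.

```python
import math


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    """Compute x @ weight + bias, with weight stored as [in_dim][out_dim]."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]


def feed_forward(x, w1, b1, w2, b2):
    """Expand to the hidden width, apply GELU, project back to the model width."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy weights: model width 2, hidden width 4
w1 = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -1.0]]
b1 = [0.0] * 4
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]
print(feed_forward([1.0, 2.0], w1, b1, w2, b2))
```

The function processes one position at a time, which is what "applied to each position separately and identically" means in practice.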

## Model Size and Parameters

The GPT-2 model comes in several sizes, ranging from "small" (117 million parameters) to "extra large" (1.5 billion parameters). The size of the model (i.e., the number of parameters) is a key factor that determines the model's capacity to learn from data. A larger model can learn more complex patterns in the data, but it also requires more computational resources to train and run.

In the context of our chatbot, we need to balance the model size with the available computational resources and the complexity of the task. For example, if we want the chatbot to generate highly creative and nuanced responses that closely mimic the character Rick, we might opt for a larger model. However, if we have limited computational resources or if the chatbot's responses don't need to be highly complex, a smaller model might suffice.
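
To see where the parameters of a GPT-2-style model come from, the count can be worked out from the architecture. The sketch below assumes the published configuration of the smallest GPT-2 (12 layers, width 768, vocabulary of 50,257 tokens, 1,024 positions) and an output layer that shares weights with the token embedding. Under those assumptions the total comes to about 124 million, slightly above the 117 million figure quoted at release.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions):
    """Count the weights of a GPT-2-style decoder with a tied output layer."""
    # Token and position embedding tables
    embeddings = vocab_size * d_model + n_positions * d_model
    # Attention: fused query/key/value projection plus the output projection
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to 4 * d_model, then project back
    d_ff = 4 * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per layer, each with a gain and a bias vector
    layer_norms = 2 * (2 * d_model)
    final_layer_norm = 2 * d_model
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_layer_norm


small = gpt2_param_count(n_layer=12, d_model=768, vocab_size=50257, n_positions=1024)
print(f"{small:,}")  # 124,439,808
```

Roughly a third of these weights sit in the embedding tables, which is why the smallest model is less "small" than its depth suggests.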

## Loss Function

The GPT-2 model, like many other language models, uses a loss function to measure the difference between its predictions and the actual outcomes. This loss function is crucial for training the model, as it provides a metric that the model can aim to minimize through gradient descent.

In the case of GPT-2, the loss function used is the Cross-Entropy Loss. This is a common choice for classification problems, including language modeling, which can be thought of as the task of predicting the next word in a sequence (a form of classification).

### Cross-Entropy Loss

Cross-Entropy Loss, which for one-hot targets is the same as the Negative Log Likelihood, measures the dissimilarity between the model's predictions and the actual outcomes. It is calculated as follows:

```
Cross-Entropy Loss = - Σ_i (y_i * log(p_i))
```

Where:
- `y_i` is the actual outcome for word `i` (1 for the correct word, 0 for all other words in the vocabulary)
- `p_i` is the probability the model assigns to word `i` being the next word

The model outputs a probability distribution over the entire vocabulary for the next word in a sequence. The Cross-Entropy Loss is then calculated between this predicted distribution and the actual outcome (which is a one-hot encoded vector representing the actual next word).
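
A small worked example may help. Because the target is one-hot, every term of the sum vanishes except the one for the correct word, so the loss reduces to the negative log of the probability the model gave that word. The four-word vocabulary and the scores below are made up purely for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """-sum(y_i * log p_i) with a one-hot y reduces to -log p[target]."""
    return -math.log(probs[target_index])

vocab = ["wubba", "lubba", "dub", "morty"]  # hypothetical 4-word vocabulary
logits = [2.0, 0.5, 0.1, -1.0]              # made-up model scores for the next word
probs = softmax(logits)

# Loss when the true next word is the model's favourite vs. its least likely guess.
loss_if_wubba = cross_entropy(probs, vocab.index("wubba"))
loss_if_morty = cross_entropy(probs, vocab.index("morty"))
```

The loss is small when the true next word is the one the model favoured and large when the model considered it unlikely.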

<img src="assets/loss.png" alt="loss" height="200">

___



# DialoGPT and its Adaptation from GPT-2

DialoGPT is a variant of the GPT-2 model, fine-tuned specifically for conversational responses. It was trained on a large dataset of conversation-like exchanges extracted from Reddit comment threads (about 147 million, according to its authors), which lets it generate human-like conversational responses. This makes DialoGPT particularly suitable for chatbot applications, such as our Rick and Morty character chatbot.

## Differences and Similarities with GPT-2

While DialoGPT and GPT-2 share the same underlying Transformer architecture, there are some key differences between the two models, mainly due to the fine-tuning process used to train DialoGPT.

### Similarities

- **Architecture**: Both GPT-2 and DialoGPT are based on the Transformer architecture, which includes components like self-attention mechanism, positional encoding, layer normalization, and feed-forward networks.

- **Training Method**: Both models are decoder-only Transformers trained autoregressively: masked self-attention ensures that each position can attend only to earlier positions when predicting the next token.

- **Generative Models**: Both GPT-2 and DialoGPT are generative models, meaning they generate text by predicting the next word in a sequence.

- **Loss Function**: Another similarity between GPT-2 and DialoGPT is the loss function used during training. Both models use the cross-entropy loss function, a common choice for training language models.

    Cross-entropy loss measures the dissimilarity between the model's predictions and the actual data. In the context of language models like GPT-2 and DialoGPT, the model's prediction is the probability distribution of the next word in the sequence, and the actual data is the true next word.

    The cross-entropy loss for a single prediction can be calculated using the formula:

    ```
    H(p, q) = - Σ p(x) log(q(x))
    ```

    where:
    - `p` is the true probability distribution (in this case, a one-hot encoded vector of the true next word)
    - `q` is the predicted probability distribution (the model's output)
    - `Σ` denotes the sum over all classes (words in the vocabulary)
    - `log` is the natural logarithm

    The goal during training is to minimize this loss, i.e., to make the model's predictions as close as possible to the actual data.

    In the context of our chatbot, minimizing the cross-entropy loss means that the model is effectively learning to predict the next word in a dialogue, given the previous words. This allows the model to generate coherent and contextually appropriate responses, mimicking the character Rick from Rick and Morty.




### Differences

- **Training Data**: While GPT-2 is trained on a diverse range of internet text, DialoGPT is specifically fine-tuned on a dataset of internet dialogues. This allows DialoGPT to generate more conversational and contextually appropriate responses.

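- **Tokenization and Input Format**: DialoGPT reuses GPT-2's byte-level BPE tokenizer, so the vocabulary itself is not a difference. What changes is how the input is laid out: the turns of a conversation are concatenated into a single sequence, with the end-of-text token marking the end of each turn.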

- **Fine-Tuning**: DialoGPT is fine-tuned for dialogue generation tasks, which makes it more suitable for chatbot applications compared to GPT-2.

In the context of our chatbot, the fine-tuning process allows DialoGPT to generate responses that are more conversational and contextually appropriate. This is crucial for creating a chatbot that can hold a natural conversation with users while mimicking the character Rick from Rick and Morty.
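
DialoGPT treats a multi-turn conversation as one long sequence in which every turn ends with the end-of-text token. The sketch below shows that layout at the string level, before tokenization. The helper name, the `max_context_turns` parameter, and the example dialogue are invented for illustration and are not taken from our dataset.

```python
EOS_TOKEN = "<|endoftext|>"  # GPT-2's end-of-text marker, used here as the turn separator

def build_dialogue_string(turns, max_context_turns=None):
    """Join dialogue turns into one string, ending every turn with EOS."""
    if max_context_turns is not None:
        # Keep only the most recent turns (0 keeps none).
        keep = min(max_context_turns, len(turns))
        turns = turns[len(turns) - keep:]
    return "".join(turn.strip() + EOS_TOKEN for turn in turns)

conversation = [
    "Rick, where are we going?",
    "To a dimension with better ice cream, Morty.",
    "Is it safe?",
]
full_history = build_dialogue_string(conversation)
last_two_turns = build_dialogue_string(conversation, max_context_turns=2)
```

Limiting the number of context turns is one simple way to keep the concatenated sequence within the model's maximum length.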

___


### Comparison of DialoGPT Small and Large

DialoGPT was released in three sizes: small (about 117 million parameters), medium (about 345 million), and large (about 762 million). We compare the "small" and "large" variants here. The primary difference between them lies in their capacity, which is determined by the number of parameters they have. The "large" variant has more parameters than the "small" one, which generally allows it to capture more complex patterns in the data.

#### Effect on Results

The increased capacity of the DialoGPT large model can lead to improved performance in generating text that mimics the character Rick Sanchez. It can potentially capture more nuanced aspects of Rick's personality and provide more contextually appropriate responses.

However, the larger model also requires more computational resources to train and use, which can be a limiting factor depending on the available hardware. Additionally, larger models can sometimes overfit to the training data, especially if the dataset is small. This means that they might perform exceptionally well on the training data but struggle to generalize to new, unseen data.

On the other hand, the DialoGPT small model, while less capable than the large model, is more computationally efficient. It can be trained and used on less powerful hardware, making it a more accessible option. However, its performance might not be as good as the large model, especially on complex tasks like mimicking a specific character's personality.

In the context of our Rick Sanchez chatbot, choosing between the small and large DialoGPT models involves a trade-off between performance and computational efficiency. The best choice would depend on the specific requirements of the project, including the available hardware, the size of the training data, and the desired level of performance.
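
A back-of-the-envelope calculation gives a feel for the hardware side of this trade-off. The sketch below assumes 32-bit floats, approximate published parameter counts (117 million for small, 762 million for large), and an Adam-style optimizer that keeps two extra buffers per parameter. It ignores activations, which often dominate in practice, so treat the numbers as lower bounds rather than measurements.

```python
def training_memory_gib(n_params: int, bytes_per_value: int = 4) -> dict:
    """Rough memory for weights, gradients and Adam's two moment buffers (no activations)."""
    gib = 1024 ** 3
    weights = n_params * bytes_per_value / gib
    return {
        "weights_only": weights,            # roughly what inference needs
        "weights_grads_adam": 4 * weights,  # weights + gradients + 2 Adam moments
    }

small = training_memory_gib(117_000_000)  # DialoGPT-small, approximate size
large = training_memory_gib(762_000_000)  # DialoGPT-large, approximate size
ratio = large["weights_grads_adam"] / small["weights_grads_adam"]

print(f"small: {small['weights_grads_adam']:.2f} GiB, large: {large['weights_grads_adam']:.2f} GiB")
```

Under these assumptions the large model needs about six and a half times the memory of the small one before a single activation is stored.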

We used DialoGPT-small for most of our training runs. It captured conversational nuances robustly enough for our purposes, and its size fit our limited time and computing resources.

___

# Training and Fine-Tuning DialoGPT

After understanding the architecture and the underlying mechanisms of the GPT-2 and DialoGPT models, the next step is to delve into the practical aspects of implementing these models. Specifically, we'll focus on the training and fine-tuning process, which is crucial for adapting the pre-trained DialoGPT model to our specific task: creating a chatbot that mimics the character Rick from Rick and Morty.

Fine-tuning is a process that involves further training a pre-trained model on a new dataset. In our case, this new dataset would ideally be a collection of dialogues from the Rick and Morty series. The fine-tuning process allows us to adapt the general language understanding capabilities of DialoGPT to the specific language style and nuances of the character Rick.

In the following sections, we'll provide a step-by-step guide on how to fine-tune the DialoGPT model, including the code and explanations for each step.


In [33]:
"""

Python code for a Jupyter notebook that fine-tunes Microsoft's DialoGPT language model
(DialoGPT-small in the configuration below) so that it speaks in the style of Rick,
a character from the animated series Rick and Morty.


Important libraries used:
PyTorch (machine learning framework)
scikit-learn (data splitting & preprocessing)
transformers (to download/fine-tune the existing model)
nltk (evaluation metrics)
"""
import glob, logging, os, pickle, random, re, shutil
from typing import Dict, List, Tuple

import pandas as pd
import numpy as np
import torch
import nltk
from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

from tqdm.notebook import tqdm, trange
from pathlib import Path


from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)

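from torch.utils.tensorboard import SummaryWriter

# 'amp' comes from NVIDIA Apex and is only needed when args.fp16 is True
try:
    from apex import amp
except ImportError:
    amp = None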

Model Initial Configuration

In [35]:
# Configuration code
# Set up a logger for this script so that it can write log messages to the console or a file
logger = logging.getLogger(__name__)
# MODEL_WITH_LM_HEAD_MAPPING (from the Hugging Face transformers library) maps each configuration class
# to its language-model class, so its keys are the supported configuration classes
MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
# Collect the model-type string (e.g. 'gpt2') of every supported configuration class
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

Modelling Pipeline:

We define the Args class, encapsulating essential hyperparameters and settings for our model training process.

In [38]:
class Args():
    """
    Argument class definition, define hyperparameters and setting for the model training.
    """
    def __init__(self):
        # Place to store the trained model
        self.output_dir = 'output-small'
        self.model_type = 'gpt2'
        # Model used for training
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        # Directory where the preprocessed datasets will be cached
        self.cache_dir = 'cached-large'
        # Maximum sequence length that this model might ever use
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        # Currently disabled, every evaluation will be done once the training is over
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        # Number of training epochs
        self.num_train_epochs = 3
        # If positive, overrides num_train_epochs and sets the total number of training steps
        self.max_steps = -1
        self.warmup_steps = 0
        # How often logging should happen
        self.logging_steps = 1000
        # How often checkpoints should be saved
        self.save_steps = 300
        self.save_total_limit = None
        # Evaluate all checkpoints saved
        self.eval_all_checkpoints = True
        # Whether to use CUDA
        self.no_cuda = False
        # Overwrite the result by default
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        # Training initialisation seed
        self.seed = 42
        # For distributed training
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

In [8]:
def set_seed(args):
    """
    Set the seed of every random number generator in use so that results are reproducible.
    Expects args.n_gpu to have been set before it is called.
    """
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

In [9]:
def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    """
    To get a list of saved model checkpoint directories in sorted order.
    Can be used to find the most recent checkpoint.
    """
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted
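
The reason for extracting the step number with a regular expression, rather than sorting the directory names directly, is that string order and numeric order disagree. A minimal sketch with made-up paths (the helper name is ours and it works on plain strings, without touching the file system):

```python
import re

def sort_checkpoint_names(paths, checkpoint_prefix="checkpoint"):
    """Sort checkpoint paths by their numeric step suffix, ignoring non-matching paths."""
    keyed = []
    for path in paths:
        match = re.match(r".*{}-([0-9]+)".format(checkpoint_prefix), path)
        if match:
            keyed.append((int(match.group(1)), path))
    return [path for _, path in sorted(keyed)]

paths = [
    "output-small/checkpoint-1200",
    "output-small/checkpoint-300",
    "output-small/checkpoint-900",
    "output-small/notes.txt",
]
by_step = sort_checkpoint_names(paths)                    # numeric order: 300, 900, 1200
by_name = sorted(p for p in paths if "checkpoint-" in p)  # string order puts 1200 first
```

With plain string sorting, `checkpoint-1200` would be treated as the oldest checkpoint and deleted first when rotating.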

In [10]:
def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    """
    To limit the total number of model checkpoints saved by deleting the oldest when the limit is exceeded.
    """
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check whether the current number of checkpoints exceeds the limit
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)

In our training process, we prepare inputs and labels for each batch, calculate the loss, and adjust the model parameters through backpropagation. We account for multiple GPUs and gradient accumulation, and use mixed precision training if enabled. After each optimization step we update the learning rate scheduler, reset the gradients, and increment the global step counter. The learning rate and running loss are logged to TensorBoard at regular intervals, and checkpoints are saved along the way.
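
Before the loop starts, the training function works out how many optimizer updates the run will contain. The sketch below mirrors that bookkeeping for a single-process run in plain Python. The dataset size of 1,000 examples is hypothetical, while the batch size and epoch count follow the `Args` defaults above.

```python
def training_plan(num_examples, per_gpu_batch_size, n_gpu, grad_accum_steps, num_epochs):
    """Mirror the step bookkeeping done before the training loop (incomplete batches dropped)."""
    batch_size = per_gpu_batch_size * max(1, n_gpu)
    batches_per_epoch = num_examples // batch_size  # drop_last=True discards the remainder
    updates_per_epoch = batches_per_epoch // grad_accum_steps
    return {
        "effective_batch_size": batch_size * grad_accum_steps,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * num_epochs,
    }

plan = training_plan(
    num_examples=1000, per_gpu_batch_size=4, n_gpu=1, grad_accum_steps=1, num_epochs=3
)
print(plan)
```

For this configuration each epoch contains 250 updates, 750 in total. Raising `grad_accum_steps` to 4 would keep memory use the same while making each update see 16 examples.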

In [None]:
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """
    Main function for training the model.
    """

    # Set up Tensorboard writer, which is a tool for visualizing training metrics
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    # Calculate the total training batch size, which depends on the number of GPUs
    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    # Define a function called 'collate' to handle the padding of input sequences.
    # 'examples' is a list of tensors where each tensor represents a sequence of encoded tokens from the text.
    # Note that sequences can have different lengths because the text entries are not of the same length.
    def collate(examples: List[torch.Tensor]):

        # Check if a pad token is defined in the tokenizer.
        # Pad token is used to fill up sequences to the same length for efficient computation.
        # The if condition is checking whether a pad token exists for this specific tokenizer.
        if tokenizer._pad_token is None:

            # If no pad token is defined in the tokenizer, we use the 'pad_sequence' function from PyTorch.
            # This function pads every sequence to the length of the longest sequence in 'examples', using zeros by default.
            # 'batch_first=True' gives the result the shape (batch_size, max_length), so dimension 0 indexes the sequences in the batch.
            return pad_sequence(examples, batch_first=True)

        # If pad token exists in the tokenizer, we use the 'pad_sequence' function with the pad token's id.
        # 'padding_value' parameter is set to 'tokenizer.pad_token_id', so sequences are padded with the pad token instead of zeros.
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)


    # Creates a DataLoader for the training data
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    # Calculates the total number of training steps and adjusts the number of epochs if necessary
    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model
    # Resizes the model's token embeddings to match the tokenizer's vocabulary size.
    model.resize_token_embeddings(len(tokenizer))

    # The following sets up the optimizer and learning rate scheduler for the model,
    # taking into account whether to apply weight decay or not.

    # 'Weight decay' is a regularization technique that helps to prevent the model from overfitting.
    # It shrinks the weights slightly at every update (AdamW applies this directly to the weights
    # rather than through the loss), which keeps them small and the model less complex.

    # 'no_decay' is a list of name fragments for parameters that should NOT be decayed.
    # Biases and LayerNorm weights are conventionally excluded because shrinking them brings
    # little regularization benefit. Note that GPT-2's layer norms are named 'ln_1', 'ln_2' and 'ln_f'
    # in the Hugging Face implementation, so the 'LayerNorm.weight' pattern may not match them;
    # with args.weight_decay = 0.0 this makes no difference here.
    no_decay = ["bias", "LayerNorm.weight"]

    # 'optimizer_grouped_parameters' specifies the settings of parameters for the optimizer.
    # The model parameters are divided into two groups based on whether weight decay should be applied.
    optimizer_grouped_parameters = [
        {
            # The first group includes parameters where weight decay should be applied.
            # This group includes all parameters except those specified in 'no_decay'.
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            # Apply weight decay specified in 'args'.
            "weight_decay": args.weight_decay,
        },
        {
            # The second group includes parameters where weight decay should NOT be applied.
            # This group includes parameters specified in 'no_decay'.
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            # No weight decay is applied.
            "weight_decay": 0.0,
        },
    ]

    # AdamW is an optimizer that's an extension of Adam, with a more correct weight decay implementation.
    # It's used to update model parameters based on gradients computed during backpropagation.
    # Learning rate, epsilon (for numerical stability), and weight decay are specified in the settings.
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon, no_deprecation_warning=True)

    # 'scheduler' adjusts the learning rate over the course of training, which can help model convergence.
    # The learning rate rises linearly from 0 during the 'warmup' steps (none when warmup_steps = 0)
    # and then decreases linearly to 0 at the final training step.
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )


    # Check if there exists a save for optimizer and scheduler and load
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))  # Check if optimizer state exists
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))  # Check if scheduler state exists
    ):

        # If both optimizer and scheduler states exist, load them.
        # 'load_state_dict' is a PyTorch method used to load a model’s (or optimizer’s/scheduler’s) saved parameters.

        # Loading the optimizer state
        # The optimizer state consists of the values of optimization variables that the optimizer uses to update model parameters. 
        # These could be values like moving averages of gradients, which are crucial for optimizers like Adam.
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))

        # Loading the scheduler state
        # The scheduler state is used to determine the learning rate at any given step in the training process.
        # This could be important for schedulers that decrease the learning rate over time.
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    # Multi-GPU mode
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed-training mode
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Logs information about the training setup, including the number of examples, epochs, batch size, total optimization steps
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    # Continue training from existing checkpoint
    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")
    tr_loss, logging_loss = 0.0, 0.0

    # Begin main training loop.
    # The function 'zero_grad' resets the gradients to zero before starting backpropagation.
    # Backpropagation is the method to calculate the gradient of the loss function with respect to the parameters.
    model.zero_grad()

    # Create a progress bar using 'trange' for visual feedback in the console.
    # 'trange' is a tqdm function to instantly make a progress bar.
    # 'epochs_trained' is the starting epoch number if you're resuming training, else it's 0.
    # 'args.num_train_epochs' is the total number of epochs the model will train for.
    # 'desc' sets the description displayed on the progress bar.
    # 'disable' disables the progress bar if the script is not running on the main process (to avoid cluttering the console).
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )

    # 'set_seed' ensures reproducibility by setting the seed for generating random numbers.
    set_seed(args)

    # Iterate over all epochs.
    for _ in train_iterator:

        # Create another progress bar for iterations within an epoch.
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])

        # Iterate over each batch of data in the current epoch.
        for step, batch in enumerate(epoch_iterator):

            # If resuming from the last training, skip the initial steps that were already trained.
            # 'steps_trained_in_current_epoch' is a variable keeping track of the number of steps trained in the current epoch.
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            # For causal language models like GPT, the inputs also serve as the labels.
            # The model predicts the next token at every position; the one-position shift between
            # inputs and targets is applied inside the model's forward pass.
            inputs, labels = (batch, batch)

            # In some cases, the sequence length might exceed the model's maximum length (1024 for GPT-2).
            # This line skips such sequences.
            if inputs.shape[1] > 1024: continue

            # '.to(args.device)' ensures that the input and label tensors are on the correct device (GPU or CPU), 
            # based on the configuration settings.
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)

            # Change the model to training mode.
            # This activates the training-specific operations like dropout.
            model.train()

            # Perform a forward pass through the model.
            # This generates the output predictions from the input sequences.
            # Passing 'labels' makes the model compute and return the language-modeling loss as well.
            outputs = model(inputs, labels=labels)

            # The first element of 'outputs' is the loss - a measure of how far the model's predictions are from the actual data.
            loss = outputs[0]

            # If using multiple GPUs, take the mean of the losses.
            # The loss is calculated separately on each GPU for different portions of the data.
            # The mean loss provides a single value to optimize.
            if args.n_gpu > 1:
                loss = loss.mean()

            # If gradient accumulation is enabled, divide the loss by the accumulation steps.
            # This is because the gradients from each step are added together,
            # and so the effective learning rate would be 'args.gradient_accumulation_steps' times higher than the set value.
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            # Backward propagation (backpropagation) works backwards through the network, starting from the loss.
            # It calculates the gradient, i.e. the partial derivatives of the loss function with respect to the parameters.
            if args.fp16:
                # When using 16-bit floating-point numbers (half-precision), use 'amp.scale_loss' to adjust the loss value.
                # This helps prevent numerical underflow (numbers too small to be represented in the chosen format).
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                # For full-precision training (32-bit floating-point numbers), just perform a regular backward pass.
                loss.backward()

            # Accumulate the training loss for logging.
            tr_loss += loss.item()

            # If gradient accumulation steps have completed, perform a step of optimization.
            if (step + 1) % args.gradient_accumulation_steps == 0:
                # Clip the gradient to avoid exploding gradients problem.
                # This is where the gradients become too large and cause the learning process to fail.
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

                # 'optimizer.step()' updates the parameters based on the gradients.
                optimizer.step()

                # 'scheduler.step()' updates the learning rate for the next step.
                scheduler.step()

                # After updating the parameters, reset the gradients back to zero for the next batch.
                model.zero_grad()

                # Keep track of the total number of steps taken.
                global_step += 1


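                # Log the learning rate and running loss to TensorBoard (plus evaluation metrics if enabled).
                # Note: evaluate() as defined below also expects df_trn and df_val, so they must be
                # passed in here before evaluate_during_training is switched on.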
                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    if (args.local_rank == -1 and args.evaluate_during_training):
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                # Save a model checkpoint every 'save_steps' optimization steps
                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step
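
The learning rate schedule used in the training function is easier to picture with numbers. The plain-Python sketch below reproduces the shape of a linear schedule with warmup as a multiplier on the base learning rate. The function name and the run length of 750 steps are our own choices for illustration. It is meant to mirror the behaviour of `get_linear_schedule_with_warmup`, not to replace it.

```python
def linear_warmup_decay_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 down to 0 at the last step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 5e-5
total_steps = 750  # hypothetical run length
for warmup in (0, 100):
    schedule = [
        base_lr * linear_warmup_decay_factor(s, warmup, total_steps)
        for s in (0, 50, 100, 375, 750)
    ]
    print(warmup, schedule)
```

With `warmup_steps = 0`, as in our configuration, training starts at the full learning rate and decays from the first step onwards.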

This code evaluates our model's performance on the validation dataset, calculating the average loss and perplexity, which are then logged and saved for further analysis.

In [12]:
def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    """
    Evaluates the performance of the model on the validation dataset.
    """
    # Directory where the evaluation results will be saved
    eval_output_dir = args.output_dir

    # Load the processed data
    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    # Dataloader for the evaluation data
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
    )

    # Multi-GPU mode
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Log information about the evaluation settings
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    # Change to evaluation mode
    model.eval()

    # Main evaluation loop, iterates over all batches of data. Each batch is processed by the model
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    # Calculates the average loss and the perplexity of the model on the evaluation data
    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))
    result = {"perplexity": perplexity}

    # Writes the evaluation results to a file and also logs them
    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result

Let's break down this code and explain what each part does in the context of fine-tuning the DialoGPT model.

---

```python
def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
```
This is the main function for evaluating the model's performance on the validation dataset. It takes as input the model, the tokenizer, the training and validation data, and a prefix string that names the subdirectory (for example, a checkpoint folder) where the results file is written.

```python
eval_output_dir = args.output_dir
```
This line sets the directory where the evaluation results will be saved.

```python
eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
```
This line loads the validation data and processes it into a format that can be used by the model. The `load_and_cache_examples` function tokenizes the text and converts it into tensors.

```python
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
```
This line sets the batch size for evaluation. The batch size is the number of examples that are processed at the same time. The total batch size is the per-GPU batch size multiplied by the number of GPUs.

```python
def collate(examples: List[torch.Tensor]):
    if tokenizer._pad_token is None:
        return pad_sequence(examples, batch_first=True)
    return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
```
This function is used to combine the examples into batches. It pads the sequences so that they all have the same length.
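
To make the padding step concrete, here is a minimal pure-Python sketch of what right-padding a batch does. It mimics the behavior of `pad_sequence(..., batch_first=True)` on plain lists rather than tensors; the helper name `pad_batch` and the token ids are illustrative assumptions, not part of the notebook.

```python
from typing import List


def pad_batch(examples: List[List[int]], padding_value: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in examples)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in examples]


batch = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
```

When the tokenizer defines no pad token, `collate` falls back to the default padding value of `0`, as in the sketch.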

```python
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(
    eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
)
```
These lines create a dataloader for the validation data. The dataloader is a PyTorch object that efficiently loads the data in batches.
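
With `drop_last=True`, an incomplete final batch is discarded, so the number of evaluation steps is the dataset size divided by the batch size, rounded down. The arithmetic below is a small sketch using the figures from our evaluation log (190 validation examples, batch size 4); the function name is illustrative.

```python
def num_batches(num_examples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches a sequential loader yields for the given settings."""
    full, remainder = divmod(num_examples, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1


print(num_batches(190, 4, drop_last=True))   # 47 batches; the 2 leftover examples are skipped
print(num_batches(190, 4, drop_last=False))  # 48 batches; the last one holds 2 examples
```

The first result matches the 47 evaluation iterations shown in the log, which means two validation examples are never scored.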

```python
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)
```
If there are multiple GPUs available, this line wraps the model in a `DataParallel` object to enable parallel processing over multiple GPUs.

```python
model.eval()
```
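This line switches the model to evaluation mode, which disables training-only behavior such as dropout so that outputs are deterministic. It does not by itself stop gradient tracking or parameter updates; gradients are disabled separately with `torch.no_grad()`, and no optimizer step is taken in this function.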

```python
for batch in tqdm(eval_dataloader, desc="Evaluating"):
    inputs, labels = (batch, batch)
    inputs = inputs.to(args.device)
    labels = labels.to(args.device)
```
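This is the main evaluation loop. It iterates over all batches of the validation data and moves each batch to the target device. The same tensor serves as both `inputs` and `labels` because, in causal language modeling, the model is scored on how well it predicts each next token of the input sequence.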
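
Passing one tensor as both `inputs` and `labels` works because causal language models in the `transformers` library shift the labels internally, so the prediction at position *t* is scored against the token at position *t+1*. The following sketch illustrates that pairing on a plain list of token ids; the ids and the helper name are made up for illustration.

```python
from typing import List, Tuple


def next_token_pairs(token_ids: List[int]) -> List[Tuple[int, int]]:
    """Pair each input position with the token it is trained to predict."""
    # Drop the last input (nothing follows it) and the first label (nothing precedes it).
    return list(zip(token_ids[:-1], token_ids[1:]))


print(next_token_pairs([101, 7, 42, 9]))  # [(101, 7), (7, 42), (42, 9)]
```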

```python
with torch.no_grad():
    outputs = model(inputs, labels=labels)
    lm_loss = outputs[0]
    eval_loss += lm_loss.mean().item()
```
These lines compute the model's loss on the current batch. The `torch.no_grad()` context manager is used to prevent PyTorch from building a computational graph for these operations, which saves memory.

```python
eval_loss = eval_loss / nb_eval_steps
perplexity = torch.exp(torch.tensor(eval_loss))
result = {"perplexity": perplexity}
```
These lines compute the average loss and the perplexity of the model on the validation data. Perplexity is a common metric for evaluating language models; lower perplexity means better performance.
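
Perplexity is the exponential of the average cross-entropy loss. The sketch below reproduces the calculation with the standard library only; the per-batch loss values are invented for illustration and are not taken from our run.

```python
import math
from typing import Sequence


def perplexity_from_losses(batch_losses: Sequence[float]) -> float:
    """Average the per-batch mean losses, then exponentiate."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


print(perplexity_from_losses([0.10, 0.08, 0.09]))  # roughly 1.094
```

A loss of 0 gives a perplexity of 1, the lowest possible value, so a perplexity close to 1 corresponds to an average loss close to 0.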

```python
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
with open(output_eval_file, "w") as writer:
    logger.info("***** Eval results {} *****".format(prefix))
    for key in sorted(result.keys()):
        logger.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))
```
These lines write the evaluation results to a file and also log them.

---

This function is a key part of the fine-tuning workflow: it lets us monitor the model's performance on held-out data across checkpoints and check that it is improving. Lower validation perplexity indicates that the model fits dialogue in the style of Rick from Rick and Morty more closely, although perplexity alone does not measure how in-character the generated replies are.

In [13]:
def main(df_trn, df_val):
    """
    Entry point of the script, responsible for training and evaluation processes.
    """
    args = Args()

    # Whether to continue from previous training and/or checkpoint
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Use the GPU when one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    args.n_gpu = torch.cuda.device_count()
    args.device = device

    # Configure logging to track training progress and help with debugging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    set_seed(args)
    # Model and tokenizer initialization
    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    logger.info("Training/evaluation parameters %s", args)

    # Load the training data and call the train function
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Save the model, configuration, arguments, and tokenizer to the output directory
    if args.do_train:
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load the finetuned model
        model = AutoModelForCausalLM.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Load the validation data and call the evaluation function
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]

        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)

        # Main evaluation loop: iterate over the checkpoints to be evaluated
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            print(checkpoint)
            print(global_step)
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelForCausalLM.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            results[f"perplexity_{global_step}"] = result["perplexity"]

    return results

In [14]:
# Run the main function, train and evaluate the model
main(trn_df, val_df)




07/31/2023 09:10:42 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x7f86806cd0d0>
07/31/2023 09:10:42 - INFO - __main__ -   Creating features from dataset file at cached-large
07/31/2023 09:10:43 - INFO - __main__ -   Saving features into cached file cached-large/gpt2_cached_lm_512
07/31/2023 09:10:43 - INFO - __main__ -   ***** Running training *****
07/31/2023 09:10:43 - INFO - __main__ -     Num examples = 1708
07/31/2023 09:10:43 - INFO - __main__ -     Num Epochs = 3
07/31/2023 09:10:43 - INFO - __main__ -     Instantaneous batch size per GPU = 4
07/31/2023 09:10:43 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 4
07/31/2023 09:10:43 - INFO - __main__ -     Gradient Accumulation steps = 1
07/31/2023 09:10:43 - INFO - __main__ -     Total optimization steps = 1281


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/427 [00:00<?, ?it/s]

07/31/2023 09:13:03 - INFO - __main__ -   Saving model checkpoint to output-large/checkpoint-300
07/31/2023 09:13:10 - INFO - __main__ -   Saving optimizer and scheduler states to output-large/checkpoint-300


Iteration:   0%|          | 0/427 [00:00<?, ?it/s]

07/31/2023 09:15:29 - INFO - __main__ -   Saving model checkpoint to output-large/checkpoint-600
07/31/2023 09:15:36 - INFO - __main__ -   Saving optimizer and scheduler states to output-large/checkpoint-600


Iteration:   0%|          | 0/427 [00:00<?, ?it/s]

07/31/2023 09:17:54 - INFO - __main__ -   Saving model checkpoint to output-large/checkpoint-900
07/31/2023 09:18:01 - INFO - __main__ -   Saving optimizer and scheduler states to output-large/checkpoint-900
07/31/2023 09:20:21 - INFO - __main__ -   Saving model checkpoint to output-large/checkpoint-1200
07/31/2023 09:20:28 - INFO - __main__ -   Saving optimizer and scheduler states to output-large/checkpoint-1200
07/31/2023 09:21:04 - INFO - __main__ -    global_step = 1281, average loss = 0.7405934791567035
07/31/2023 09:21:04 - INFO - __main__ -   Saving model checkpoint to output-large
07/31/2023 09:21:17 - INFO - __main__ -   Evaluate the following checkpoints: ['output-large/checkpoint-1200', 'output-large/checkpoint-300', 'output-large/checkpoint-600', 'output-large/checkpoint-900', 'output-large']


output-large/checkpoint-1200
1200


07/31/2023 09:21:22 - INFO - __main__ -   Creating features from dataset file at cached-large
07/31/2023 09:21:23 - INFO - __main__ -   Saving features into cached file cached-large/gpt2_cached_lm_512
07/31/2023 09:21:23 - INFO - __main__ -   ***** Running evaluation checkpoint-1200 *****
07/31/2023 09:21:23 - INFO - __main__ -     Num examples = 190
07/31/2023 09:21:23 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/47 [00:00<?, ?it/s]

07/31/2023 09:21:28 - INFO - __main__ -   ***** Eval results checkpoint-1200 *****
07/31/2023 09:21:28 - INFO - __main__ -     perplexity = tensor(1.0979)


output-large/checkpoint-300
300


07/31/2023 09:21:40 - INFO - __main__ -   Creating features from dataset file at cached-large
07/31/2023 09:21:40 - INFO - __main__ -   Saving features into cached file cached-large/gpt2_cached_lm_512
07/31/2023 09:21:40 - INFO - __main__ -   ***** Running evaluation checkpoint-300 *****
07/31/2023 09:21:40 - INFO - __main__ -     Num examples = 190
07/31/2023 09:21:40 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/47 [00:00<?, ?it/s]

07/31/2023 09:21:45 - INFO - __main__ -   ***** Eval results checkpoint-300 *****
07/31/2023 09:21:45 - INFO - __main__ -     perplexity = tensor(2.7415)


output-large/checkpoint-600
600


07/31/2023 09:21:56 - INFO - __main__ -   Creating features from dataset file at cached-large
07/31/2023 09:21:56 - INFO - __main__ -   Saving features into cached file cached-large/gpt2_cached_lm_512
07/31/2023 09:21:56 - INFO - __main__ -   ***** Running evaluation checkpoint-600 *****
07/31/2023 09:21:56 - INFO - __main__ -     Num examples = 190
07/31/2023 09:21:56 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/47 [00:00<?, ?it/s]

07/31/2023 09:22:02 - INFO - __main__ -   ***** Eval results checkpoint-600 *****
07/31/2023 09:22:02 - INFO - __main__ -     perplexity = tensor(1.4079)


output-large/checkpoint-900
900


07/31/2023 09:22:09 - INFO - __main__ -   Creating features from dataset file at cached-large
07/31/2023 09:22:09 - INFO - __main__ -   Saving features into cached file cached-large/gpt2_cached_lm_512
07/31/2023 09:22:09 - INFO - __main__ -   ***** Running evaluation checkpoint-900 *****
07/31/2023 09:22:09 - INFO - __main__ -     Num examples = 190
07/31/2023 09:22:09 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/47 [00:00<?, ?it/s]

07/31/2023 09:22:14 - INFO - __main__ -   ***** Eval results checkpoint-900 *****
07/31/2023 09:22:14 - INFO - __main__ -     perplexity = tensor(1.1442)


output-large
large


07/31/2023 09:22:21 - INFO - __main__ -   Creating features from dataset file at cached-large
07/31/2023 09:22:21 - INFO - __main__ -   Saving features into cached file cached-large/gpt2_cached_lm_512
07/31/2023 09:22:21 - INFO - __main__ -   ***** Running evaluation  *****
07/31/2023 09:22:21 - INFO - __main__ -     Num examples = 190
07/31/2023 09:22:21 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/47 [00:00<?, ?it/s]

07/31/2023 09:22:26 - INFO - __main__ -   ***** Eval results  *****
07/31/2023 09:22:26 - INFO - __main__ -     perplexity = tensor(1.0951)


{'perplexity_1200': tensor(1.0979),
 'perplexity_300': tensor(2.7415),
 'perplexity_600': tensor(1.4079),
 'perplexity_900': tensor(1.1442),
 'perplexity_large': tensor(1.0951)}
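
The key `perplexity_large` above comes from how `main` derives the step label: `checkpoint.split("-")[-1]` returns `large` for the final `output-large` directory, because that name also contains a hyphen. One possible alternative, shown only as a sketch, is to match the `checkpoint-<step>` suffix explicitly. The helper name `step_label` and the `final` fallback label are our own assumptions, not part of the training script.

```python
import re


def step_label(checkpoint_dir: str, default: str = "final") -> str:
    """Return the step number from a '.../checkpoint-<step>' path, else a default label."""
    match = re.search(r"checkpoint-(\d+)/?$", checkpoint_dir)
    return match.group(1) if match else default


for path in ["output-large/checkpoint-300", "output-large/checkpoint-1200", "output-large"]:
    print(path, "->", step_label(path))
```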

In [40]:
# Load the finetuned model
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir, padding_side='left')
model = AutoModelForCausalLM.from_pretrained('output-large')


In [16]:
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.meteor_score import meteor_score
import locale
locale.getpreferredencoding = lambda: "UTF-8"
import nltk
nltk.download('wordnet')
from nltk.translate.bleu_score import SmoothingFunction

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [42]:
def generate_text(prompt_text, flag):
    """
    Generate a response from the model for a given prompt.

    If flag is True, the prompt is appended to the running chat history
    (from the second turn on); otherwise the prompt is answered on its own.
    """
    # Declared global so the history persists across calls; without this, reading
    # chat_history_ids below would raise UnboundLocalError when flag is True
    global chat_history_ids

    # Encode the input and append the EOS token to mark the end of the turn (PyTorch tensor)
    new_user_input_ids = tokenizer.encode(prompt_text + tokenizer.eos_token, return_tensors='pt')

    # Prepend the chat history only for multi-turn use (relies on the global `step` counter)
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if flag and step > 0 else new_user_input_ids

    # Generate a response using top-k and nucleus (top-p) sampling
    chat_history_ids = model.generate(
        bot_input_ids, max_length=200,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        do_sample=True,
        top_k=100,
        top_p=0.7,
        temperature=0.8
    )

    # Decode only the newly generated tokens back into text
    result = "RickBot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True))

    return result

In [21]:
# Prepare reference data from validation dataframe
references = val_df['response'].tolist()
prompt_texts = ["How are you, Rick?", "Play a song for me", "What do you think about Elon Musk?"]

# Evaluate BLEU and METEOR score for each prompt
for prompt in prompt_texts:
    generated_text = generate_text(prompt, False)
    print(f"prompt: {prompt}")
    print(f"generated: {generated_text}")
    reference_tokens = [reference_text.split() for reference_text in references]
    generated_tokens = [generated_text.split()]

    # Calculate the BLEU score
    chencherry = SmoothingFunction()
    bleu_score = corpus_bleu([reference_tokens], generated_tokens, smoothing_function=chencherry.method1)
    print(f"BLEU: {bleu_score}")

    # Calculate the METEOR score
    meteor_scores = [meteor_score([reference], generated_tokens[0]) for reference in reference_tokens]
    max_meteor_score = max(meteor_scores)
    print(f"METEOR: {max_meteor_score}")

prompt: How are you, Rick?
generated: RickBot: Good. You?
BLEU: 0.11362193664674995
METEOR: 0.2380952380952381
prompt: Play a song for me
generated: RickBot: Rick, you're doing this bit while your brain is melting.
BLEU: 0.02950234363196405
METEOR: 0.21276595744680854
prompt: What do you think about Elon Musk?
generated: RickBot: He's okay.
BLEU: 0.11362193664674995
METEOR: 0.049019607843137254


# Results

## Sampling with standard model (Not Finetuned)

In [30]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    # pretty print the last output tokens from the bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User: Who are you?


DialoGPT: I am the one who knocks.


>> User: Do you know Morty?


DialoGPT: I do not.


>> User: why don't aliens ever visit our planet?


DialoGPT: Because they're not real.


>> User: What if the universe is infinite, surely there are aliens


DialoGPT: I don't know.


>> User: What do you know?


DialoGPT: I don't know.


### Base Model Performance (Not Fine-Tuned)

The base DialoGPT model is pretrained on general conversational data and can generate human-like text. However, without fine-tuning, it lacks the specific personality traits and knowledge of the character Rick Sanchez. Let's evaluate its performance based on the given dialogues:

#### Coherence

The model's responses are generally coherent in the sense that they form grammatically correct and complete sentences. However, they lack coherence in the context of the conversation. For example, "I am the one who knocks" in reply to "Who are you?" is a well-formed sentence, but it is a well-known line from the TV show Breaking Bad and has nothing to do with Rick Sanchez or the conversation at hand.

#### Relevance

The model's responses address the questions on the surface but are not relevant to the persona. For instance, when asked "Do you know Morty?", the model responds with "I do not." The answer is on-topic, yet it is wrong for Rick Sanchez, the character the model is supposed to mimic, who is Morty's grandfather and shares a close relationship with him in the series.

#### Personality

The model does not exhibit the personality traits of Rick Sanchez. Rick is known for his sarcastic and cynical personality, his scientific knowledge, and his disregard for social norms. The model's responses do not reflect these traits.

#### Robustness

The model is robust in the sense that it can generate a response to any input. However, the quality and relevance of the responses can vary significantly, as seen in the dialogues.

#### Context Awareness

The model lacks context awareness. For example, it doesn't recognize the relationship between Rick and Morty, and it doesn't provide a thoughtful response to the question about the possibility of aliens in an infinite universe.

### Conclusion

The base DialoGPT model, while a capable general-purpose chatbot, does not effectively simulate the personality of Rick Sanchez. It lacks context awareness, doesn't always provide relevant responses, and doesn't exhibit Rick's distinctive personality traits. This highlights the importance of fine-tuning the model on dialogues from the Rick and Morty series to create a more accurate and engaging Rick Sanchez chatbot. Fine-tuning should help the model learn the nuances of Rick's personality and improve its performance in all the evaluated areas.

## Sampling with Finetuned Model

In [44]:
# Chat with RickBot for 5 turns
for step in range(5):
    print(generate_text(input(">> User:"), False))

>> User: Hey, Rick, why don't aliens ever visit our planet?


RickBot: Because our government keeps destroying their interdimensional portals and we're too busy building giant robots to care.


>> User: Giant robots, Rick? Seriously? Where are they hiding those?


RickBot: In my balls.


>> User: Uh, gross, Rick! Can't you just be serious for once?


RickBot: I'm serious. And don't call me that!


>> User: Rick, that's your name! What else am I supposed to call you?!


RickBot: Sweetie, could I get Get that syringe now?


>> User: Oh geez, Rick, what syringe? You're not going to experiment on me again, are you?


RickBot: Oh, man, Rick. I can't believe we're sitting around, standing around in Mr. Goldenfold's house. It's really weird.


### Fine-tuned Model Performance

The fine-tuned DialoGPT model, after being trained on dialogues from the Rick and Morty series, shows a significant improvement in simulating the personality of Rick Sanchez. Let's evaluate its performance based on the given dialogues:

#### Coherence

The model's responses are coherent and form complete sentences. They also show improved coherence in the context of the conversation. For example, when asked about the existence of giant robots, the model responds with a humorous and sarcastic remark, which is in line with Rick's character.

#### Relevance

The model's responses are relevant to the questions asked. For instance, when asked about aliens visiting our planet, the model provides a creative and humorous response that fits within the sci-fi theme of Rick and Morty.

#### Personality

The model exhibits the personality traits of Rick Sanchez. It captures Rick's sarcastic humor, his disregard for social norms, and his scientific knowledge. For example, the model's response "Because our government keeps destroying their interdimensional portals and we're too busy building giant robots to care" captures Rick's cynical and sarcastic personality.

#### Robustness

The model is robust and can generate a response to any input. The quality and relevance of the responses have improved significantly after fine-tuning, as seen in the dialogues.

#### Context Awareness

The model shows mixed context awareness. Because `generate_text` is called with `flag=False` in this session, each reply is generated from the current user message alone, without the chat history; any continuity comes from the user's messages restating the topic. This helps explain the moments where the thread is lost, such as the sudden mention of a syringe and Mr. Goldenfold's house.

### Limitations and Influence of the Dataset

One notable issue is the occasional loss of coherence, particularly in maintaining the persona of Rick Sanchez consistently throughout the conversation.

For instance, in the last line of the dialogue, the model says, "Oh, man, Rick. I can't believe we're sitting around, standing around in Mr. Goldenfold's house. It's really weird." This response is problematic because the model, which is supposed to be simulating Rick, refers to Rick in the third person, indicating a loss of persona coherence.

This issue could be influenced by the diversity of dialogues in the training dataset. If the dataset contains dialogues from multiple characters, the model might learn to switch between different personas. While this could be beneficial in a multi-character chatbot, it's not desirable in our case, where the model is supposed to consistently simulate a single character, Rick Sanchez.

To mitigate this issue, we could refine the dataset to include only dialogues where Rick is speaking. This would help the model learn to maintain Rick's persona more consistently. Additionally, we could experiment with techniques like persona conditioning, where additional information about the persona is provided to the model to help it maintain the persona more effectively.
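
As a rough sketch of the first mitigation, the snippet below keeps only exchanges whose response is spoken by Rick. The record layout (`speaker`, `context`, `response` keys) and the sample lines are hypothetical placeholders, not quotes from the transcript; the actual column names depend on how the dataset was preprocessed.

```python
from typing import Dict, List

# Hypothetical records: the preceding line, the reply, and who speaks the reply.
dialogue_pairs: List[Dict[str, str]] = [
    {"speaker": "Rick", "context": "Where are we going?", "response": "Quick adventure, in and out."},
    {"speaker": "Morty", "context": "Get in the ship.", "response": "Oh geez, okay."},
    {"speaker": "Rick", "context": "Is this safe?", "response": "Safe is a strong word."},
]


def keep_speaker(pairs: List[Dict[str, str]], speaker: str) -> List[Dict[str, str]]:
    """Keep only the pairs whose response is spoken by the given character."""
    return [pair for pair in pairs if pair["speaker"].strip().lower() == speaker.lower()]


rick_only = keep_speaker(dialogue_pairs, "Rick")
print(len(rick_only))  # 2
```

Lines from other characters can still appear in the `context` field, so the model keeps seeing what Rick is replying to while only learning to produce Rick's side.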


In [43]:
# Chat with RickBot for 5 turns
for step in range(5):
    print(generate_text(input(">> User:"), False))

>> User: Who are you?


RickBot: Some random guy with a gun.


>> User: Do you know Morty?


RickBot: What? N-No, I don't know Morty.


>> User: Isn't he your grandson?


RickBot: He is my grandson.


>> User: Then why did you lie?


RickBot: Because I don't respect therapy, because I'm a scientist. Because I invent, transform, create, and destroy for a living, and when I don’t like something about the world, I change it. And I don-t think going to a rented office in a strip mall to listen to some agent of averageness explain which words mean which feelings has ever helped anyone do anything.  I think it has helped a lot of people get comfortable and stop panicking, which is a state of mind we value in the animals we eat, but not something I want for myself. I’m not a cow. I-I’ll never let you go.


>> User: Thanks for that Rick


RickBot: You're welcome Morty.


### Fine-tuned Model Performance

The fine-tuned DialoGPT model continues to show its ability to simulate the personality of Rick Sanchez from Rick and Morty. Let's evaluate its performance based on the new dialogues:

#### Coherence

The model's responses are coherent and form complete sentences. For example, when asked why it lied about knowing Morty, the model provides a long, rambling response that is characteristic of Rick. The reply does not actually answer the question, though, and it appears to closely follow one of Rick's monologues from the show, which suggests that the model may be reproducing training text rather than composing a new answer.

#### Relevance

The model's responses are relevant to the questions asked. For instance, when asked about Morty, the model initially denies knowing him but later admits that Morty is his grandson. This kind of unpredictable behavior is in line with Rick's character.

#### Personality

The model exhibits the personality traits of Rick Sanchez. It captures Rick's sarcastic humor, his disregard for social norms, and his scientific knowledge. For example, the model's response to the question about lying captures Rick's cynical and dismissive attitude towards therapy, which is a recurring theme in the series.

#### Robustness

The model is robust and can generate a response to any input. The quality and relevance of the responses have improved significantly after fine-tuning, as seen in the dialogues.

#### Context Awareness

The model shows partial context awareness. Here too, `generate_text` is called with `flag=False`, so earlier turns are not passed to the model and each reply depends only on the current message. This is consistent with the sudden shift from denying knowing Morty to admitting that he is his grandson.

### Conclusion

The fine-tuned DialoGPT model continues to show improvement in simulating the personality of Rick Sanchez. It provides coherent, relevant, and context-aware responses that capture Rick's distinctive personality traits. However, maintaining persona coherence over multiple turns of conversation remains a challenge. This could be addressed by refining the training process to better capture the persona of Rick and by using more advanced techniques to maintain the context over longer conversations. Despite these challenges, the model's performance in these dialogues demonstrates the potential of fine-tuning language models for character-specific chatbot applications.

___

# Fine-Tuning Parameters in DialoGPT

In machine learning, the choice of hyperparameters can significantly affect a model's performance. While fine-tuning DialoGPT, we varied several of them to improve the results. Here are the key parameters we experimented with and the impact they had on the model's performance:

## Size of Context Data

The context size, or the amount of previous dialogue the model considers when generating a response, is a crucial parameter. A larger context size allows the model to consider more of the conversation history, which can lead to more relevant and coherent responses. However, it also increases the computational complexity of the model, leading to longer training times. It's also worth noting that while a larger context size can improve performance, it can also lead to overfitting if the model starts to memorize specific sequences from the training data instead of learning general patterns.
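
To illustrate what the context size controls, the sketch below builds training rows in which each response is paired with the `n` utterances that precede it. The function name, the row layout, and the sample lines are illustrative assumptions rather than the notebook's preprocessing code.

```python
from typing import List


def build_context_rows(utterances: List[str], n: int) -> List[List[str]]:
    """Return rows of [response, previous line, ..., n-th previous line]."""
    rows = []
    for i in range(n, len(utterances)):
        response = utterances[i]
        # Most recent context first, walking back n lines.
        context = [utterances[j] for j in range(i - 1, i - 1 - n, -1)]
        rows.append([response] + context)
    return rows


lines = ["line A", "line B", "line C", "line D"]
for row in build_context_rows(lines, n=2):
    print(row)
# ['line C', 'line B', 'line A']
# ['line D', 'line C', 'line B']
```

A larger `n` yields fewer rows per transcript and longer inputs per row, which is where the extra training cost comes from.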

<img src="assets/para1.png" alt="p1" height="250">

## Training Epochs

The number of training epochs, or complete passes through the entire training dataset, is another important parameter. Training for more epochs allows the model to learn more from the data, which can lead to improved performance. However, there's a risk of overfitting if the model is trained for too many epochs, as it might start to memorize the training data instead of learning to generalize from it. It's also worth noting that the optimal number of epochs can depend on various factors, including the size and complexity of the dataset and the capacity of the model.
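
The number of epochs translates directly into optimizer steps. As a worked check against the training log above (1708 examples, batch size 4, gradient accumulation of 1, 3 epochs), the sketch below reproduces the reported 427 iterations per epoch and 1281 total optimization steps. The function name is ours, and the formula assumes the last partial batch of an epoch is kept.

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int,
                             grad_accum_steps: int, num_epochs: int) -> int:
    """Optimizer updates for a run: batches per epoch, grouped by accumulation, times epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = batches_per_epoch // grad_accum_steps
    return updates_per_epoch * num_epochs


print(math.ceil(1708 / 4))                      # 427 iterations per epoch
print(total_optimization_steps(1708, 4, 1, 3))  # 1281
```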

<img src="assets/para2.png" alt="p2" height="250">

## Learning Rate

The learning rate is a critical hyperparameter that controls how large a step the optimizer takes each time it updates the model's weights. A smaller learning rate makes smaller changes per update and therefore needs more training steps to converge. A larger learning rate can speed up learning, but it can also make training unstable and hurt final performance. It is often worth trying several learning rates, or using learning rate schedules or adaptive optimizers, to find a good setting.

<img src="assets/para3.png" alt="p3" height="250">
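
The trade-off can be seen on a toy problem. The sketch below runs plain gradient descent on the function f(w) = w² with three learning rates; it is a deliberately simplified stand-in for real training, and the function name and values are chosen only for illustration.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w     # derivative of w**2
        w = w - lr * grad  # the learning rate scales every update
    return w


small = gradient_descent(lr=0.01)  # stable but slow
medium = gradient_descent(lr=0.4)  # stable and fast
large = gradient_descent(lr=1.1)   # each step overshoots the minimum and diverges

print(abs(small), abs(medium), abs(large))
```

With the smallest rate the weight has barely moved toward the minimum at 0 after 20 steps, while the largest rate moves it further away on every step.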

## Block Size

The block size, or the number of input tokens the model considers at a time, can impact both the computational efficiency and the performance of the model. A smaller block size can lead to faster training, but it might also limit the model's ability to learn longer-term dependencies in the data. On the other hand, a larger block size can improve the model's performance by allowing it to consider more context, but it also increases the computational complexity of the model.

<img src="assets/para4.png" alt="p4" height="250">
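
As a small illustration, the sketch below cuts a token sequence into non-overlapping blocks, mirroring the chunking in the `TextDataset` class of the philosophy experiment. The helper name `split_into_blocks` and the toy token ids are assumptions made for this example.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a token sequence into non-overlapping blocks; a trailing partial block is dropped."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


token_ids = list(range(10))  # stand-in for real token ids

for block_size in (2, 4, 8):
    blocks = split_into_blocks(token_ids, block_size)
    used = len(blocks) * block_size
    # block size, number of training examples, tokens left unused
    print(block_size, len(blocks), len(token_ids) - used)
```

Larger blocks give the model more context per example but yield fewer examples from the same data, and more tokens can be lost at the end of the sequence.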

## Large vs Small Model

The size of the model, determined by the number of parameters it has, can significantly impact its performance. Larger models have more capacity to learn complex patterns in the data, which can lead to improved performance. However, they also require more computational resources to train and are more prone to overfitting, especially when the amount of training data is limited. On the other hand, smaller models are more computationally efficient and less prone to overfitting, but they might not have enough capacity to learn complex tasks.

<img src="assets/para5.png" alt="p5" height="250">
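
To give a feel for the sizes involved, the sketch below counts the parameters of a GPT-2-style decoder from its layer count and embedding width. It assumes the standard GPT-2 layer layout with tied input and output embeddings, which DialoGPT inherits; the function name `gpt2_param_count` is introduced here for illustration.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, n_positions=1024):
    """Count parameters of a GPT-2-style decoder with tied input/output embeddings."""
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # per block: attention (4*d*d + 4*d), MLP (8*d*d + 5*d), two LayerNorms (4*d)
    per_block = 12 * n_embd**2 + 13 * n_embd
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * per_block + final_layer_norm


# (layers, embedding width) of the small, medium, and large GPT-2 configurations
for name, (n_layer, n_embd) in {"small": (12, 768), "medium": (24, 1024), "large": (36, 1280)}.items():
    print(name, f"{gpt2_param_count(n_layer, n_embd):,}")
```

The count grows with the square of the embedding width, which is why moving from the small to the large configuration multiplies the number of parameters roughly sixfold.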

## Optimizer

The choice of optimizer can also impact the model's performance. Different optimizers use different strategies to update the model's weights based on the gradients of the loss function, which can lead to differences in training speed and final performance. However, the impact of the optimizer can also depend on other factors, like the learning rate and the complexity of the model and the data.
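
As one concrete difference between update strategies, the sketch below compares a plain SGD step with the first step of Adam for gradients of very different sizes. It is a scalar toy example under the usual Adam defaults, not the training code we ran; the function names are hypothetical.

```python
import math


def sgd_step(grad, lr):
    """Plain SGD: the update is proportional to the gradient."""
    return -lr * grad


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update, starting from zero moment estimates, with bias correction."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad**2
    m_hat = m / (1 - beta1)  # bias correction for step t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)


for grad in (0.01, 1.0, 100.0):
    print(grad, sgd_step(grad, lr=0.001), adam_first_step(grad, lr=0.001))
```

The SGD step scales with the gradient, whereas the first Adam step has a size close to the learning rate regardless of gradient scale, which is one reason the two optimizers need different learning rates.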

In conclusion, fine-tuning the parameters of the DialoGPT model is a complex task that requires careful consideration of various factors. The optimal settings can depend on the specific task, the available computational resources, and the characteristics of the training data.

<img src="assets/para6.png" alt="p6" height="250">

___

# Exploring Different Datasets and Models

In our quest to create a chatbot that mimics the character Rick Sanchez from Rick and Morty, we didn't limit ourselves to just one approach. We experimented with different datasets and models to test their impact on the results. Given the limited time we had, we had to be strategic in our choices and focused on options that could potentially enhance the chatbot's performance.

## Philosophy Dataset

One of the datasets we experimented with was a philosophy dataset. This dataset contains dialogues and texts from various philosophical works. Given Rick's character, which often delves into existential and philosophical themes, we hypothesized that training the model on this dataset could help it generate more philosophically inclined responses, akin to Rick's character.

## Shakespeare Dataset

We also experimented with a Shakespeare dataset. This dataset contains the complete works of Shakespeare. Given the dramatic and distinctive language used in Shakespeare's works, we hypothesized that this dataset could help the model generate more creative and dramatic responses, adding a unique flavor to the chatbot's dialogues.

## Pygmalion Model

In addition to the DialoGPT models, we also experimented with the Pygmalion model, a larger language model with 6 billion parameters. Given its larger capacity, we hypothesized that this model could potentially generate more nuanced and contextually appropriate responses. However, due to the model's size and the computational resources required to fine-tune it, we were only able to sample from the pre-trained model and couldn't fine-tune it within our timeframe.

Despite the challenges and time constraints, these experiments provided valuable insights into the potential of different datasets and models in creating a character-specific chatbot. In the following sections, we'll delve into the results of these experiments, starting with the philosophy dataset.

___

# Experiment with Philosophical Texts

## Dataset

We used a dataset of philosophical texts, [available on Kaggle](https://www.kaggle.com/datasets/christopherlemke/philosophical-texts?resource=download&select=sentences.csv), reading its `sentences.csv` file. The dataset is rich in complex language structures and abstract concepts, making it an interesting choice for training the GPT-2 model.

## Training

We fine-tuned the pre-trained GPT-2 model on this dataset for one epoch and reached a training loss of about 3.5.

<img src="assets/philo_loss.png" alt="a2" height="300">
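
A loss value is easier to interpret as a perplexity. Assuming the reported loss is the mean per-token cross-entropy in nats, which is what the Hugging Face GPT-2 language modeling head returns, the conversion is a single exponential:

```python
import math

loss = 3.5  # mean cross-entropy per token, in nats, as reported above
perplexity = math.exp(loss)
print(f"perplexity ~ {perplexity:.1f}")
```

A perplexity of about 33 means that, on average, the model is as uncertain about the next token as if it were choosing uniformly among roughly 33 candidates.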

___




In [None]:
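import pandas as pd
import torch
from torch.utils.data import Dataset


class TextDataset(Dataset):
    """Concatenate all sentences, tokenize them once, and cut the result into fixed-size blocks."""

    def __init__(self, tokenizer, file_path, block_size):
        df = pd.read_csv(file_path)
        text = " ".join(df['sentence'].tolist())

        self.examples = []
        tokenized_text = tokenizer.encode(text)

        # Non-overlapping blocks; a trailing partial block is dropped
        for i in range(0, len(tokenized_text) - block_size + 1, block_size):
            self.examples.append(tokenized_text[i : i + block_size])

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i])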

In [None]:
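import torch
from torch.nn import DataParallel
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def train(model, dataloader, optimizer, device, epochs):
    """Fine-tune a causal language model with the standard next-token prediction loss."""
    model.train()
    for epoch in range(epochs):
        print(f"epoch: {epoch + 1} / {epochs}")
        for idx, batch in enumerate(dataloader):
            # For causal LM training the labels are the inputs; the model shifts them internally
            inputs = batch.to(device)
            outputs = model(inputs, labels=inputs)
            # DataParallel returns one loss per GPU, so reduce to a scalar before backward()
            loss = outputs[0].mean()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            if idx % 100 == 0:
                print(f"Epoch: {epoch}, Batch: {idx}, Loss: {loss.item():.4f}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model = DataParallel(model)  # splits each batch across all visible GPUs
model.to(device)
# Note: 1e-3 is a high learning rate for fine-tuning a pre-trained model;
# values around 1e-5 to 1e-4 are more common
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

block_size = 128
batch_size = 32
epochs = 1
file_path = 'sentences.csv'

dataset = TextDataset(tokenizer, file_path, block_size)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

train(model, dataloader, optimizer, device, epochs)

# Save the underlying model (unwrapped from DataParallel) and the tokenizer
model.module.save_pretrained('./my_model_directory')
tokenizer.save_pretrained('./my_model_directory')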

## Sampling

We then sampled from the model using a prompt related to metaphysics. The generated text was compared to a reference text using the BLEU score, a widely used metric for evaluating the quality of machine-generated text.



In [None]:
from nltk.translate.bleu_score import sentence_bleu

# to generate text
def generate_text(prompt_text):
    model.module.eval()
    generated = torch.tensor(tokenizer.encode(prompt_text)).unsqueeze(0)
    generated = generated.to(device)

    # Create attention mask
    attention_mask = torch.ones(generated.shape, dtype=torch.long, device=device)

    sample_outputs = model.module.generate(
        generated,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        max_length=1000,
        num_return_sequences=1,
    )

    return tokenizer.decode(sample_outputs[0], skip_special_tokens=True)

reference_text = """The fundamental question of metaphysics is what to say: what reason can we hope to have recourse to in the present case of all the arguments? The reader will find that though I may proceed to ask him whether the two notions of reason are the same. For I ask, first, whether such ideas will admit of correction and solution, or do neither. But what I was saying, replied, we shall arrive at the same point of time about these questions; and, if, as has appeared, they are the only two, no one will assert, that reason possesses the authority of that power, and that reason, according to the hypothesis of this, is the only medium between them. For, in an empirical science, the absolute totality of the two conceptions and the former can be proved either by a universal law of the completeness of the latter, or of the former, from without the factious and necessary completeness of the object. The question to be equally apparent contradictions. If there are only one part of the two conceptions of the third in the series of the two principles and not in any given series of ideas, it, of the series must, one and never the other predicates; then I shall not merely assume both, the possibility; and that the totality of the two statements are valid, that there can be at least, that, this proposition must not exist in the one, each conception of the second statements of the series consists of an empirical statements; which I can be contradictory statements, by the four, by reason, I cannot be possible only by the unity, that unity as in contradiction.
For, that the unconditioned in this, cannot be one and must be the possibility, is not, the absolute necessity neither must be, must never contradict, therefore, by the unconditioned in this may be as, be, we cannot be the only of the two propositions, contain in the totality of two, be the fourth and not in each, or in the mode, be proved, be also in the two propositions, only one can never one, one, contradict, that if one series has no parts, the complete series. The object is impossible, I shall require, I shall be not in that there must exist in the case the unconditioned the conclusion, that there are the whole, the three only be a necessary, that, of which are two, in relation, although the second only exist either of the conception. The proposition of the principle of the two must be either, can be either. For, in some, I am unable to answer, without contradiction that, are all phenomena; this can be proved, as a contingent and as to be determined only by experience. And, no one must cease to exist, or must be equally impossible. But what a transcendent, is impossible. I can never be denied by experience. The division of conceptions, from the subject, therefore, must not exist in both, and must cease to be impossible; or, as the conception. This division of phenomena are transcendent; I say, I cannot, from pure conceptions, by means of the totality of reason, or, must have, never cease to exist by means only one principle. For I can never be given by means of contradiction. I have, not in no reason; and there are not empirical propositions and not an absolute non-theology. Now the conditions of an impossible series of phenomena as the non-existence. Nor can be in a transcendent propositions, are, transcendental synthetical propositions. A, no absolute totality, must be therefore, cannot be a series of the series of the two distinct principles of a transcendental unity or a conception, and are never without condition of the conditions. A priori.
There must cease, which, without means, not to be given in relation to a thing. The world, no conception; and cannot be contingent. But a, of a possible things which is not absolutely beyond, all, and, I cannot be possible. But the contradictory propositions, never cease to be subject to the condition; and, and must be, in general, according to a non-existence, is void of the series; I can only as to be non-existence in the only in existence. We have only as conditioned.  The transcendental in conformity with the conception, and the transcendent, cannot be the series of the non-in conformity to the mode of things. There must cease to one in nature. There must be, if a, if transcendental; but contain either of, that which does not contain either in a regress, of conditions. On the series of phenomena, if we cannot be in itself of parts. And this mode of the series of propositions of all the series. But these can be the same. To infinity of the conditioned. In the series. All the synthesis of phenomena, because that is, if it be considered merely as contained"""
prompt_text = "The fundamental question of metaphysics is"
generated_text = generate_text(prompt_text)
print(generated_text)

# Tokenize the reference and generated texts
reference_tokens = reference_text.split()
generated_tokens = generated_text.split()

# Calculate the BLEU score
bleu_score = sentence_bleu([reference_tokens], generated_tokens)

print(bleu_score)


### Prompt

The prompt used for generating the text was:

```plaintext
"The fundamental question of metaphysics is"
```

### Generated Text

The model generated the following text:

```plaintext
The fundamental question of metaphysics is, whether the principle is not merely regulative and its effects may be empirical, or if not synthetical, be judged as natural. The objective method of transcendental ideas is a priori, and of course the conception of any of these ideas from the principles. And when we have given these directions to our understanding, we can infer no but its synthetical principles. In order to prove this, we must first remove the difficulties of the question from the principle of the determination of the empirical conception, that is, to infer it synthetically from the principle of the possibility of the existence of any synthetical proposition. In such cases we have in common with the speculative reason an entire and absolute impossibility. Hence we can no longer assume this conception to be transcendental, and can no longer believe in any other, than for this reason, that it could conceive that such a principle in an abstracted conception, and contingent connexion to any object as that can never would contradict all the existence. But since, that the object must never have a universal use of the understanding of understanding and understanding can never be the conception in any of the object as its object, without any determinate objects of our experience in itself. For the objective reason has no experience. But reason, therefore, the object, I admit it possesses a different ideas of its own priori distinct conceptions, and in the internal conceptions of the possibility of its internal intuition, must be the possibility, without any empirical unity. Now I know only possible experience, its conception of these, the existence is not in general conditions of experience a contingent causes. 
But, in so far as it no empirical principles of the understanding, the objects of the external intuitions, this universal principles in a possible objects only, the possibility without in accordance with universal laws according to be a transcendental principles. In the existence, a transcendental conceptions of a transcendental reason in its operations of external experience or of the objects, the causality to be, a priori principles of these principles. Now, if it can never a priori and subject of the internal intuition of the internal experience as subject, without which the internal sense can proceed with this, the internal intuition must be, a priori of the conception. A priori and an external intuition a priori in phenomena a priori. For it has a necessary conjoined all phenomena; and I do not establish that a priori a priori to the determination of that self, the universal validity of the possibility of things a priori a priori, by means of the causalitya priori judgments, as a given phenomenon in accordance with which, must be valid for every internal experience without being a purpose, a natural science. Now, in the mode of reason. But in an empirical intuition, all our sensuous object, and the conversely, must have any object, the universal laws according to that the understanding, as we must be of experience. Hence, a priori; and the former, the conjoin the objects without which can not a principle of the possibilitya priori; and in relation of the possibility. In both of possible experience, are impossible. In order we must, in the first principles of naturea mere conceptions, a prioria relation to be transcendental hypotheses of the empirical cognition, all phenomena, be considered as principles which can have no empirical causes of the unity of the existencewhich cannot be independent of our external things according to the series of the existence and to be determinate; these two contradictory principles. 
Now, if any universal ideas, and of the absolute existence. But that we can be not be thus in the object in a priori. The object of the non- is not internal substances as phenomena, and a priori, in relation to phenomena as independent of no one must precediae, therefore only of a priori, can not in themselves the unconditioned, it must exist as to be given in one of the series of things in the possibilitya priori inasmuch a priori. But the internal purposiveness of the conception of experience. If this can exist in no internal existence according to the nature, without any empirical unity and can never, or, it is in the empirical, it is not a transcendental, in relation to the conception, the possibility according to the series of a pure reason; therefore the conception, are transcendent, in the objectin relation to the object. Thus we may be assured as possible, in itself. The internal, of an object and not as existing by reason, the internal subject.  The only. The principles cannot be deduced priori. For it may not be applied to phenomena, with certain priori in the formal conditions of the possibility. Of the universal principles, the possibility of things a priori, is a priori according to the conditions, in general, are only necessary. The internal experience.
```

### BLEU Score

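The BLEU score for the generated text compared to the reference text was 0.0741 (0.0740927759524612 before rounding). BLEU ranges from 0 to 1, with higher scores indicating more n-gram overlap with the reference text. A score of 0.0741 means the two texts share vocabulary and some short phrases but few longer word sequences. A low score is expected in this setting: BLEU was designed for tasks such as machine translation, where a good output stays close to a reference, whereas sampled open-ended text has no single correct continuation.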
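
To show what BLEU actually measures, the sketch below computes the clipped n-gram precision that BLEU combines (as a geometric mean over n = 1 to 4, multiplied by a brevity penalty). It is a simplified illustration written for this report, not the NLTK implementation; the function names and the two toy sentences are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: share of candidate n-grams that also occur in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / sum(cand_counts.values())


# Toy sentences, for illustration only
reference = "the object of pure reason is not empirical".split()
candidate = "the object of experience is not a priori".split()

for n in (1, 2, 3, 4):
    print(n, modified_precision(reference, candidate, n))
```

The precision falls quickly as n grows: the two sentences share most single words but no four-word sequence, which is the pattern behind low BLEU scores for freely sampled text.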

## Analysis

The generated text exhibits a complex structure and uses terminology consistent with philosophical discourse. However, the low BLEU score indicates that the generated text may not closely follow the reference text's specific ideas and structure.

### Positives

- **Complexity**: The generated text reproduces the vocabulary and sentence constructions of complex philosophical writing.
- **Relevance**: The text is relevant to the prompt and engages with metaphysical questions.

### Limitations

- **Coherence**: The generated text loses coherence as it grows longer. The opening sentences are mostly grammatical, but later ones become fragmentary and do not follow logically from one another.
- **Alignment with Reference**: The low BLEU score suggests that the generated text does not closely align with the reference text, indicating potential issues with capturing the specific nuances and arguments of the philosophical discourse.

## Learnings

Training the GPT-2 model on a dataset of philosophical texts resulted in the generation of complex and relevant text. However, the experiment also highlighted challenges in maintaining coherence and alignment with specific philosophical arguments. This experiment underscores the complexity of training models on abstract and nuanced domains like philosophy and provides valuable insights for future work in this area.

___

# Experiment with Shakespearean Texts

After exploring the philosophical domain, we decided to venture into the realm of literature, specifically the works of William Shakespeare. Shakespeare's texts are known for their unique style, rich vocabulary, and complex narrative structures, making them an interesting choice for training our model.

## Model Setup

In [None]:
import time

out_dir = "out-shakespeare"
eval_interval = 5
eval_iters = 40
wandb_log = False  # feel free to turn on
wandb_project = "shakespeare"
wandb_run_name = "ft-" + str(time.time())

dataset = "shakespeare"
init_from = "gpt2-medium"  # ~350M parameters (the larger variants are gpt2-large and gpt2-xl)

# only save checkpoints if the validation loss improves
always_save_checkpoint = False

# the number of examples per iter:
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# finetune at constant LR
learning_rate = 3e-5
decay_lr = False
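
The arithmetic in the config comments above can be checked directly. The sketch below assumes the training script's default `block_size` of 1024 and takes the token count from the config comment; `epochs_covered` is a name introduced here for illustration.

```python
batch_size = 1
gradient_accumulation_steps = 32
block_size = 1024         # context length from the training script's default config
dataset_tokens = 301_966  # figure quoted in the config comment
max_iters = 20

tokens_per_iter = batch_size * gradient_accumulation_steps * block_size
iters_per_epoch = dataset_tokens / tokens_per_iter
epochs_covered = max_iters / iters_per_epoch

print(tokens_per_iter, round(iters_per_epoch, 1), round(epochs_covered, 2))
```

With `max_iters = 20`, the run therefore sees a little more than two epochs' worth of tokens.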

In [None]:
"""
This training script can be run both on a single gpu in debug mode,
and also in a larger training run with distributed data parallel (ddp).

To run on a single GPU, example:
$ python train.py --batch_size=32 --compile=False

To run with DDP on 4 gpus on 1 node, example:
$ torchrun --standalone --nproc_per_node=4 train.py

To run with DDP on 8 gpus per node across 2 nodes, example:
- Run on the first (master) node with example IP 123.456.123.456:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
- Run on the worker node:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
(If your cluster does not have Infiniband interconnect prepend NCCL_IB_DISABLE=1)
"""

import os
import time
import math
import pickle
from contextlib import nullcontext

import numpy as np
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

from model import GPTConfig, GPT

# -----------------------------------------------------------------------------
# default config values designed to train a gpt2 (124M) on OpenWebText
# I/O
out_dir = 'out'
eval_interval = 2000
log_interval = 1
eval_iters = 200
eval_only = False # if True, script exits right after the first eval
always_save_checkpoint = True # if True, always save a checkpoint after each eval
init_from = 'scratch' # 'scratch' or 'resume' or 'gpt2*'
# wandb logging
wandb_log = False # disabled by default
wandb_project = 'owt'
wandb_run_name = 'gpt2' # 'run' + str(time.time())
# data
dataset = 'openwebtext'
gradient_accumulation_steps = 5 * 8 # used to simulate larger batch sizes
batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size
block_size = 1024
# model
n_layer = 12
n_head = 12
n_embd = 768
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
bias = False # do we use bias inside LayerNorm and Linear layers?
# adamw optimizer
learning_rate = 6e-4 # max learning rate
max_iters = 600000 # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 2000 # how many steps to warm up for
lr_decay_iters = 600000 # should be ~= max_iters per Chinchilla
min_lr = 6e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla
# DDP settings
backend = 'nccl' # 'nccl', 'gloo', etc.
# system
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = 'bfloat16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
compile = True # use PyTorch 2.0 to compile the model to be faster
# -----------------------------------------------------------------------------
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
exec(open('configurator.py').read()) # overrides from command line or config file
config = {k: globals()[k] for k in config_keys} # will be useful for logging
# -----------------------------------------------------------------------------

# various inits, derived attributes, I/O setup
ddp = int(os.environ.get('RANK', -1)) != -1 # is this a ddp run?
if ddp:
    init_process_group(backend=backend)
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
    seed_offset = ddp_rank # each process gets a different seed
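    # world_size processes train simultaneously, so scale the per-process
    # gradient accumulation steps down proportionally (also correct across nodes)
    assert gradient_accumulation_steps % ddp_world_size == 0
    gradient_accumulation_steps //= ddp_world_size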
else:
    # if not ddp, we are running on a single gpu, and one process
    master_process = True
    seed_offset = 0
    ddp_world_size = 1
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

if master_process:
    os.makedirs(out_dir, exist_ok=True)
torch.manual_seed(1337 + seed_offset)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
# note: float16 data type will automatically use a GradScaler
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

# poor man's data loader
data_dir = os.path.join('data', dataset)
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

# init these up here, can override if init_from='resume' (i.e. from a checkpoint)
iter_num = 0
best_val_loss = 1e9

# attempt to derive vocab_size from the dataset
meta_path = os.path.join(data_dir, 'meta.pkl')
meta_vocab_size = None
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    meta_vocab_size = meta['vocab_size']
    print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")

# model init
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout) # start with model_args from command line
if init_from == 'scratch':
    # init a new model from scratch
    print("Initializing a new model from scratch")
    # determine the vocab size we'll use for from-scratch training
    if meta_vocab_size is None:
        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
elif init_from == 'resume':
    print(f"Resuming training from {out_dir}")
    # resume training from a checkpoint.
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    checkpoint_model_args = checkpoint['model_args']
    # force these config attributes to be equal otherwise we can't even resume training
    # the rest of the attributes (e.g. dropout) can stay as desired from command line
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    # create the model
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    # fix the keys of the state dictionary :(
    # honestly no idea how checkpoints sometimes get this prefix, have to debug more
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    # initialize from OpenAI GPT-2 weights
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    # read off the created config params, so we can store them into checkpoint correctly
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)
# crop down the model block size if desired, using model surgery
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size # so that the checkpoint will have the right value
model.to(device)

# initialize a GradScaler. If enabled=False scaler is a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# optimizer
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
if init_from == 'resume':
    optimizer.load_state_dict(checkpoint['optimizer'])
checkpoint = None # free up memory

# compile the model
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model) # requires PyTorch 2.0

# wrap model into DDP container
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])

# helps estimate an arbitrarily accurate loss over either split using many batches
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

# logging
if wandb_log and master_process:
    import wandb
    wandb.init(project=wandb_project, name=wandb_run_name, config=config)

# training loop
X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()
local_iter_num = 0 # number of iterations in the lifetime of this process
raw_model = model.module if ddp else model # unwrap DDP container if needed
running_mfu = -1.0
while True:

    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if wandb_log:
            wandb.log({
                "iter": iter_num,
                "train/loss": losses['train'],
                "val/loss": losses['val'],
                "lr": lr,
                "mfu": running_mfu*100, # convert to percentage
            })
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        if ddp:
            # in DDP training we only need to sync gradients at the last micro step.
            # the official way to do this is with model.no_sync() context manager, but
            # I really dislike that this bloats the code and forces us to repeat code
            # looking at the source of that context manager, it just toggles this variable
            model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0 and master_process:
        # get loss as float. note: this is a CPU-GPU sync point
        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5: # let the training loop settle a bit
            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > max_iters:
        break

if ddp:
    destroy_process_group()
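
The cosine-with-warmup schedule used by `get_lr` in the training script can be checked in isolation. The sketch below is a standalone copy of that function that takes its hyperparameters as arguments; the values in `settings` are hypothetical and chosen only for illustration, not the ones used in our run.

```python
import math

def get_lr(it, learning_rate, min_lr, warmup_iters, lr_decay_iters):
    """Cosine decay with linear warmup, mirroring the schedule in the training script."""
    # 1) linear warmup from 0 up to learning_rate
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, stay at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)

# Hypothetical settings, for illustration only
settings = dict(learning_rate=3e-5, min_lr=3e-6, warmup_iters=10, lr_decay_iters=300)
for it in (0, 5, 10, 155, 300, 301):
    print(f"iter {it:>3}: lr = {get_lr(it, **settings):.2e}")
```

The learning rate starts at zero, peaks at `learning_rate` when warmup ends, and settles at `min_lr` once `lr_decay_iters` is reached.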


## Dataset

For this experiment, we used a dataset containing the works of William Shakespeare. Known for his unique style, rich vocabulary, and complex narrative structures, Shakespeare's texts present an interesting challenge for language models.

## Training

We fine-tuned the GPT-2 model on this dataset for 300 steps, reaching a loss of around 1.2. The loss is the cross-entropy of the model's next-token predictions, so lower values mean the model assigns higher probability to the text that actually follows.
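To make the loss value easier to interpret, it can be converted to perplexity. The sketch below assumes the reported loss is a mean cross-entropy in nats per token (the PyTorch default); the `perplexity` helper is ours and not part of the training script. The uniform baseline assumes the standard GPT-2 vocabulary of 50,257 tokens.

```python
import math

def perplexity(mean_cross_entropy_nats):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy_nats)

loss = 1.2  # approximate loss reported above
print(f"perplexity: {perplexity(loss):.2f}")  # prints "perplexity: 3.32"

# For comparison: the loss of a model that guesses uniformly over the
# GPT-2 vocabulary (assumed here to be 50,257 tokens).
uniform_loss = math.log(50257)
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

A perplexity near 3.3 means that, on average, the model is about as uncertain as if it were choosing among roughly three equally likely next tokens.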

<img src="assets/Shakespeare Finetune.png" alt="a1" height="300">

## Sampling

We then sampled from the model using a dialogue prompt. The generated text was compared to a reference text using the BLEU score, a widely used metric for evaluating the quality of machine-generated text.

### Generated Dialogue

The model generated the following dialogue:

```plaintext
RICHARD:
Thou art a merchant of deceit: if I falter,
I trade thee to the first; if I weep,
I trade thee to the second; if I
speak untruths, I trade thee to the third; if I wage war,
I trade thee to the fourth: so barter or trade,
I tell thee once more, thou shalt not trade my
dignity.

ISABELLA:
And if thou pilfer, thou shalt not trade thy honor.

RICHARD:
I do not pilfer; I trade the pilfered treasures.

RICHARD:
Thou comprehend'st not what thou trade'st; thou, a maiden,
Thou art ever a casualty, a trinket of no value:
Thou hast no claim, no claim, but to be traded.
```

### BLEU Score

The BLEU score for the generated text was 0.65. BLEU ranges from 0 to 1 and measures n-gram overlap with the reference text, with higher scores indicating closer overlap. A score of 0.65 therefore indicates substantial word- and phrase-level overlap with the reference, although BLEU by itself does not measure style or narrative quality.
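
For readers unfamiliar with the metric, here is a minimal, unsmoothed sentence-level BLEU in plain Python. It is a sketch of the standard definition (clipped n-gram precision up to 4-grams, combined with a brevity penalty), not the exact scoring code used for the number above. The candidate and reference lines are taken from the sample dialogue purely as an illustration; they are not the reference text used in our evaluation.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU for one candidate against one reference (token lists)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

# Illustrative inputs only
candidate = "thou shalt not trade my dignity".split()
reference = "thou shalt not trade thy honor".split()
print(f"BLEU: {sentence_bleu(candidate, reference):.3f}")
```

Because BLEU only counts matching n-grams, two lines that share an opening phrase can score well even when they differ in meaning.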

## Analysis

The generated dialogue exhibits the stylistic elements and language complexity characteristic of Shakespeare's works. However, the dialogue's content and the characters' interactions may not fully capture the depth and nuance of Shakespearean narratives.

### Positives

- **Stylistic Alignment**: The generated text demonstrates a strong alignment with Shakespeare's unique style and language complexity.
- **Relevance**: The dialogue is relevant and coherent, with characters responding to each other in a manner that is consistent with the context.

### Limitations

- **Narrative Depth**: While the dialogue is stylistically aligned with Shakespeare's works, it does not fully capture the depth and nuance of Shakespearean narratives. In the sample above, RICHARD speaks twice in a row at the end, and the exchange circles around the single motif of trading rather than developing a scene.

## Conclusion

Training the GPT-2 model on a dataset of Shakespearean texts resulted in the generation of text that stylistically aligns with Shakespeare's works. However, the experiment also highlighted the challenge of capturing the depth and nuance of Shakespearean narratives. This experiment underscores the complexity of training models on nuanced domains like literature and provides valuable insights for future work in this area.


# Experiment with Pygmalion Model

### Model

We used Pygmalion-6B (`PygmalionAI/pygmalion-6b`), a 6-billion-parameter language model fine-tuned from GPT-J-6B for dialogue. Because it is designed for persona-driven conversation, it was an interesting choice for our task.


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList
import torch

tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-6b")
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b").to('cuda')

print("Loaded Model")

class MyStoppingCriteria(StoppingCriteria):
    """Stop generation once target_sequence appears in the newly generated text."""
    def __init__(self, target_sequence, prompt_length):
        super().__init__()
        self.target_sequence = target_sequence
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the tokens generated after the prompt
        generated_text = tokenizer.decode(input_ids[0][self.prompt_length:])
        # Stop as soon as the model starts writing the user's next turn
        return self.target_sequence in generated_text


# Start with the persona and a single user turn
conversation_history = """Rick's Persona: Rick Sanchez, a 60 year old genius scientist who goes on wacky adventures, has a sarcastic and funny tone and a large ego.
<START>
You: Who are you?
Rick:"""

while True:
    encoded_input = tokenizer(conversation_history, return_tensors='pt')
    input_ids = encoded_input['input_ids'].cuda()
    prompt_length = input_ids.shape[1]
    output_ids = model.generate(
        input_ids,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.5,
        max_new_tokens=256,
        stopping_criteria=StoppingCriteriaList([MyStoppingCriteria("You:", prompt_length)]),
    )
    # Keep only the new tokens, then drop the trailing "You:" that triggered the stop
    response = tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
    response = response.split("You:")[0].strip()
    print("Rick: ", response)

    user_input = input()
    # conversation_history already ends with "Rick:", so append only the reply and the next turn
    conversation_history += " " + response + "\nYou: " + user_input + "\nRick:"


### Sampling 1

We started the conversation with a brief persona description for Rick Sanchez and a single user turn. The model was then used to generate Rick's responses.
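
The prompt layout described here (a persona line, a `<START>` marker, the dialogue turns, and a trailing speaker label for the model to complete) can be assembled with a small helper. This is a sketch: `build_prompt` is a hypothetical function of ours, not part of the Pygmalion or `transformers` APIs, and it simply reproduces the string format used in the code cell above.

```python
def build_prompt(name, persona, turns):
    """Assemble a persona prompt in the layout used above.

    name    -- the character the model should speak as, e.g. "Rick"
    persona -- a short free-text description of the character
    turns   -- list of (speaker, text) pairs, oldest first
    """
    lines = [f"{name}'s Persona: {persona}", "<START>"]
    for speaker, text in turns:
        lines.append(f"{speaker}: {text}")
    # End with the character's label so the model continues as that character
    lines.append(f"{name}:")
    return "\n".join(lines)

persona = ("Rick Sanchez, a 60 year old genius scientist who goes on wacky "
           "adventures, has a sarcastic and funny tone and a large ego.")
prompt = build_prompt("Rick", persona, [("You", "Who are you?")])
print(prompt)
```

Rebuilding the prompt from a list of turns, rather than appending to one long string, makes it harder to duplicate or drop a speaker label by accident.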

### Conversation

Here's the conversation that was generated:

```plaintext
You: Who are you?
Rick: Hey, that's a pretty cute nickname for a guy like me. I'm Rick, the guy who's in charge of the asylum. I keep the asylum running and keep order in the castle. Oh, and I have superhuman strength. But that's not the point, now, miss Samara.

You: Hi Rick, I'm Morty.
Rick: Hey, Miss Morty. I'm the guy in charge of the asylum, but you're probably not aware of that. I've got superhuman strength, and I'm visiting the asylum because I've gone nuts and lost my mind. My nickname's to Morty, but I call you Samara if you got it straight. You can just call me R.

You: Hey Rick, are you ok? I'm your grandson Morty
Rick: MORTY: YES, I'M OK, SIR! I'M JUST... I'M JUST... I'M SO Nervous, that's why I'm in the asylum! My family weren't even supposed to visit me, and now I've been kidnapped by you and you're bringing me back to the asylum! You're the only person in the world who's been able to get me out of the asylum, and now you're bringing me back to deal with other people's problems! I'm just so nervous!

You: Ahh geez Rick
Rick: MORTY: PLEASE, WHAT'S THE matter, Dad? *I call again, a little more urgently this time*
```

### Analysis

The generated conversation exhibits some interesting characteristics.

### Positives

- **Creativity**: The model generates creative responses, introducing a new narrative about an asylum.
- **Relevance**: The responses react to the user's inputs and keep returning to the same asylum narrative across turns.

### Limitations

- **Character Consistency**: The model does not consistently maintain the character of Rick Sanchez. Rick's responses do not align with his established persona as a genius scientist with a sarcastic and funny tone. Instead, the model introduces a new narrative about an asylum, which is not consistent with Rick's character.
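- **Coherence**: The conversation lacks coherence in some parts. For example, Rick refers to Morty as "Miss Morty" and "Samara," which does not make sense in the context of their established relationship. In the last two turns the reply also begins with "MORTY:", so the model is speaking as the wrong character.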
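
Replies that begin with another character's label, such as the `MORTY:` prefix in the transcript above, can be flagged automatically. The sketch below is a hypothetical post-processing check that was not part of our pipeline; the `leaked_speaker` name and the single-word label pattern are our assumptions.

```python
import re

# A single alphabetic word followed by a colon at the start of the reply
SPEAKER_LABEL = re.compile(r"^\s*([A-Za-z]+):")

def leaked_speaker(reply, expected="Rick"):
    """Return the speaker label a reply starts with if it is not the expected one."""
    match = SPEAKER_LABEL.match(reply)
    if match and match.group(1).lower() != expected.lower():
        return match.group(1)
    return None

replies = [
    "Hey, Miss Morty. I'm the guy in charge of the asylum.",
    "MORTY: YES, I'M OK, SIR!",
]
for reply in replies:
    print(leaked_speaker(reply), "<-", reply)
```

A flagged reply could then be regenerated or truncated, although this only catches explicit speaker labels and not subtler drifts in persona.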

___ 
### Sampling 2

We continued the conversation with Rick Sanchez using the Pygmalion model. The conversation took an interesting turn towards an alien invasion narrative.

### Conversation

Here's the conversation that was generated:

```plaintext
You: Hey, Rick, why don't aliens ever visit our planet?
Rick: 'cause they know what they're gonna get.

You: Does that mean they are real?
Rick: Yeah, and they're coming to our planet to get our lunch.

You: What do we do about that
Rick: *I take out a pair of binoculars from my jacket, one on each side. I hold them up to my eyes and I stare at the approaching mob of aliens. They're coming for our planet...*

You: Ah Geez Rick
Rick: *As soon as you start talking, I drop the binoculars and hold up my arms. I start waving my arms and yelling.*
**Don't be scared!**  *I say to the aliens.*
**We're not gonna hurt you!**
```

### Analysis

The generated conversation exhibits some interesting characteristics.

### Positives

- **Creativity**: The model continues to generate creative responses, introducing a new narrative about an alien invasion.
- **Relevance**: The responses are relevant to the user's inputs, maintaining a consistent narrative throughout the conversation.

### Limitations

- **Character Consistency**: The model still struggles to maintain the character of Rick Sanchez. After two short replies, it switches to asterisk-delimited role-play narration and has Rick reassure the aliens, which does not match his established persona as a genius scientist with a sarcastic and funny tone.
- **Coherence**: The conversation lacks coherence in some parts. For example, Rick's actions and responses to the alien invasion do not align with his character's established traits and behaviors.

### Learnings
The Pygmalion model, with its 6 billion parameters, is considerably larger than the DialoGPT models used elsewhere in this project. In our tests it generated creative and contextually relevant responses.

In our experiment, we used Pygmalion to simulate conversations with the character Rick Sanchez from the "Rick and Morty" series. We provided the model with a brief personality description of Rick and let it generate responses based on user inputs.

The results were intriguing. The model demonstrated a high level of creativity, introducing new narratives and responding to user inputs in a contextually relevant manner. This suggests that the Pygmalion model has a strong potential for generating engaging and dynamic conversations.

However, the model also exhibited some limitations. Despite being provided with a character persona, the model struggled to maintain the specific character traits of Rick Sanchez consistently throughout the conversation. The responses sometimes lacked coherence and did not always align with Rick's established character traits.

This experiment highlighted the potential of using larger, well-trained models like Pygmalion for character-based conversation generation. However, it also underscored the challenges in maintaining character consistency and coherence, particularly when the model is only provided with a brief character description.

___


# Conclusion and Summary

In this project, we set out to create a chatbot that could emulate the personality of Rick Sanchez from the popular TV show "Rick and Morty". We used the DialoGPT model, a variant of the GPT-2 model, and fine-tuned it on a dataset of dialogues from the show.

The fine-tuned model demonstrated a significant improvement in emulating Rick's personality compared to the standard DialoGPT model. It was able to generate responses that were not only coherent and contextually relevant but also captured the unique personality traits of Rick Sanchez. However, we also observed some limitations, such as occasional lapses in coherence and the model's tendency to lose its persona in extended dialogues.

We also experimented with different parameter settings in the DialoGPT model, such as the size of the context data, training epochs, learning rate, block size, and optimizer. We found that increasing the size of the context for training and the number of training epochs generally improved the model's performance. However, changing the block size and optimizer did not significantly affect the performance.

In addition to the "Rick and Morty" dataset, we also experimented with other datasets and models. We trained the GPT-2 model on a dataset of philosophical texts and another of Shakespearean texts. The models generated text that stylistically aligned with the respective domains, demonstrating the versatility of the GPT-2 model. However, capturing the depth and nuance of these domains proved to be a challenge.

In conclusion, this project demonstrated the potential of transformer-based models like GPT-2 and DialoGPT in generating text that emulates a specific style or personality. However, it also highlighted the challenges in capturing the depth and nuance of complex domains. Future work could focus on improving the model's ability to maintain its persona in extended dialogues and exploring more sophisticated techniques for fine-tuning the model on nuanced domains.
