<h1> Introduction to minGPT - A PyTorch re-implementation of GPT </h1>

minGPT is originally a PyTorch re-implementation of GPT, a highly successful language modeling framework developed by OpenAI. Created by Andrej Karpathy, minGPT is a lightweight and efficient implementation of GPT, designed to be easy to use and highly customizable. With its modular architecture and flexible design, minGPT is a powerful tool for researchers and practitioners working in natural language processing and related fields. And don't worry, despite its name, minGPT is not small-minded! ;) 
<br/> This assignment will introduce you to the basics of minGPT and how to use it for language modeling tasks.




OpenAI's generative pre-trained transformer (GPT) was first introduced in "Improving Language
Understanding by Generative Pre-Training" [1], and has since become one of the most topical
models in the field of ML and a common topic of conversation worldwide.
Because the model already has a significant societal impact, many questions have been
raised since its public release. This project aims to create a homework assignment on
the implementation of the model architecture and word embedding, with an additional twist:
we want to prompt students to reflect on the societal impact of the model themselves.

<i> [1] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language
understanding by generative pre-training. 2018. </i>

<h3> GPT (current) Gold Standard </h3>

Before we dive into minGPT, let's look at the current gold standard of GPT models, GPT-3.5. While GPT has several use cases, let's look at one of the most commonly used ones, ChatGPT! ChatGPT is a large language model based on the GPT-3.5 architecture, trained by OpenAI to generate human-like text. It is designed to respond to human prompts with natural language responses, making it useful for a variety of applications such as chatbots, automated content generation, and more. Under the hood, ChatGPT uses a deep neural network to learn from massive amounts of data, allowing it to generate text that is coherent and contextually relevant.
<br/> You can interact with ChatGPT here: http://chat.openai.com (sign up using Google and your Berkeley account).
<br/>
Have a short conversation with ChatGPT about Transformers, GPT models, and the basic workings of ChatGPT. Limit your conversation to three questions. 
<br/><br/><b> Answer the following questions: </b>
1. What questions did you ask ChatGPT?
2. In 2-3 sentences, what did you learn from the conversation?
3. Were you satisfied with the responses? On a scale of 1-5, rate the conversation you had with ChatGPT.

Now that you have interacted with ChatGPT, let's work on a much simpler, scaled-down version, minGPT! In this assignment, we will train minGPT to be a character-level language model on some arbitrary text input. 
<br/> <h5> Flax and JAX </h5>
In this homework, we will be exploring the world of [Flax](http://flax.readthedocs.io/) and [JAX](https://github.com/google/jax), two powerful deep learning libraries made by Google that can be used to reimplement the PyTorch version of minGPT. JAX and Flax are both built on the concept of functional programming and provide a high-level interface for building and training neural networks, which can simplify the development process and reduce the amount of boilerplate code that needs to be written. Flax and JAX are designed to take advantage of modern hardware accelerators such as GPUs and TPUs, which can significantly speed up the training process and reduce the time-to-deployment for new models.
<br/> By using these libraries, you will gain a deeper understanding of the underlying principles of neural networks and develop your skills in functional programming. You will also have the opportunity to compare and contrast the PyTorch and Flax/JAX versions of minGPT, gaining valuable insights into the similarities and differences between these powerful tools. So let's roll up our sleeves and dive into the world of Flax and JAX!

<h4> Train a character-level GPT on some text data </h4>

In [None]:
# First, some imports!
!pip install flax
!pip install optax
!pip install jax
!pip install transformers
!pip install git+https://github.com/deepmind/dm-haiku

import jax
import jax.numpy as jnp
import haiku as hk
from functools import partial
import torch
from torch.utils.data import Dataset
import numpy as np
np.random.seed(182)

from train import trainer, train_config

import model

<h3> 1. Attention is all we need! </h3>

In this section, we are going to focus specifically on the attention mechanism that is at the heart of GPT's architecture. Attention is a critical component of many modern neural networks, and it plays a particularly important role in natural language processing tasks such as language modeling and machine translation. By completing this section of the homework, you will gain a deeper understanding of how attention works and how it can be used to improve the performance of language models. 
<br/>You will now implement causal self-attention for (min) GPT! You will implement the code in the `model/model.py` file. Read the instructions in the docstring and then fill in the code in the places marked `#YOUR CODE HERE`.
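
To make the idea concrete before you open `model/model.py`, here is a minimal single-head sketch in NumPy. It is not the Flax code the assignment asks for: the function name, the weight matrices `w_q`, `w_k`, `w_v`, and the toy shapes are all hypothetical, and it leaves out multiple heads, dropout, and the output projection. It only shows the two key steps, scaled dot-product scores and a lower-triangular mask that stops each position from attending to later ones.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention for one sequence x of shape (T, C)."""
    T = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # each has shape (T, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T) scaled dot products
    # Causal mask: position t may only look at positions <= t.
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy inputs (illustrative sizes only).
rng = np.random.default_rng(0)
T, C, d = 5, 8, 4
x = rng.normal(size=(T, C))
w_q, w_k, w_v = (rng.normal(size=(C, d)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
```

In this sketch `weights` is lower triangular and each row sums to 1, so the first output row depends only on the first token.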

In [None]:
# some tests here
from tests.test_attention import TestAttention
TestAttention().autograde()

<h3> 2. Unpacking the MLP: Layer by Layer for Better Language Modeling </h3>

The multi-layer perceptron (MLP) in GPT, also known as the feedforward network, is an essential component that helps to improve the model's ability to learn from sequential data. While the self-attention mechanism in GPT allows the model to attend to different parts of the input sequence, the MLP is responsible for processing and transforming the attended features before they are fed into the next layer. This additional non-linearity helps to capture more complex patterns and dependencies between the input tokens, leading to better performance on a wide range of language modeling tasks.
<br/> In this section, you are going to implement the MLP for (min) GPT! You will implement the code in the `model/model.py` file. Read the instructions in the docstring and then fill in the code in the places marked `#YOUR CODE HERE`.
<br/> HINT: Read the documentation [here](https://flax.readthedocs.io/en/latest/api_reference/_autosummary/flax.linen.Dense.html)
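
As a rough picture of what such a feedforward block computes, here is a NumPy sketch. It assumes the layout used in the original PyTorch minGPT (a dense layer that expands the width by 4x, a GELU non-linearity, and a dense layer that projects back); the docstring in `model/model.py` is the authority on what this assignment expects. The function names and toy shapes are hypothetical.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    """Position-wise feedforward block: expand, apply GELU, project back."""
    hidden = gelu(x @ w1 + b1)     # (T, 4C)
    return hidden @ w2 + b2        # (T, C)

# Toy inputs (illustrative sizes only).
rng = np.random.default_rng(0)
T, C = 5, 8
x = rng.normal(size=(T, C))
w1, b1 = 0.02 * rng.normal(size=(C, 4 * C)), np.zeros(4 * C)
w2, b2 = 0.02 * rng.normal(size=(4 * C, C)), np.zeros(C)
y = mlp(x, w1, b1, w2, b2)
```

Note that each row (token position) is transformed independently here; mixing information across positions is the job of attention, not the MLP.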

In [None]:
# some tests here
from tests.test_mlp import TestMLP
TestMLP().autograde()

<h3> 3. From Text to Numbers: Understanding Encoding in Natural Language Processing </h3>

The encoding section focuses on one of the most important aspects of natural language processing: converting textual data into numerical representations that can be understood and processed by machine learning models. In this section, we will explore the different types of encoding methods commonly used in NLP, including one-hot encoding, word embeddings, and more. By the end of this section, you should have a better understanding of how encoding works and why it is crucial for many language-based applications. You will implement the code in the `encoding/bpe.py` file. Read the instructions in the docstring and then fill in the code in the places marked `#YOUR CODE HERE`.
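
The file name suggests byte-pair encoding (BPE), although that is an assumption: the text above does not spell out the algorithm. Under that assumption, a single BPE merge step can be sketched in plain Python as follows. The helper names and the new token id `256` are hypothetical and are not taken from `encoding/bpe.py`.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of token ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))      # start from raw bytes (values 0..255)
pair = most_frequent_pair(ids)        # the byte pair for "aa"
merged = merge_pair(ids, pair, 256)   # 256 is the first id not used by a byte
```

Repeating this step grows the vocabulary one merged token at a time while the encoded sequence gets shorter (here from 11 ids to 9).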

In [4]:
# some tests here
from tests.test_encoding import TestEncoder
TestEncoder().autograde()

<h3> 4. Crossing the Finish Line: Completing the MinGPT model</h3>

We're almost there! In this section, you will complete the final parts of the minGPT model.

In [None]:
from tests import test_set_params
# run test

In [None]:
# Let's run some unittests to see if your implementation is correct!
from tests import unittest

<h3> 5. Unleashing the power of minGPT: Training (and testing) the model! </h3>

We will now train minGPT to be a character-level language model on an input text file. The input text file is an extract of Shakespearean text of about 1.1 MB. <br/> To use minGPT as a character-level language model, we first preprocess the input text by converting it into a sequence of characters. We then train the model to predict the next character in the sequence given the preceding characters as input. Once the model is trained, we can use it to generate new text by feeding it a starting sequence (context) and iteratively sampling new characters from the model's predicted distribution until the desired length of text is generated. By using minGPT as a character-level language model, we can generate new text that is similar in style and content to the input data, making it a useful tool for tasks such as text generation, language modeling, and autocompletion.
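
The generation loop described above can be sketched without any neural network. In the toy example below, a table of bigram counts stands in for the trained model (minGPT conditions on a much longer context), and the tiny training string, the `generate` helper, and its arguments are all hypothetical. The loop itself has the same shape: look at the context, sample the next character from a distribution, append it, and repeat.

```python
import random
from collections import Counter, defaultdict

text = "to be or not to be"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character
ids = [stoi[ch] for ch in text]

# Stand-in "model": count which character id follows which.
counts = defaultdict(Counter)
for prev, nxt in zip(ids, ids[1:]):
    counts[prev][nxt] += 1

def generate(context, n_new, rng):
    """Extend `context` by sampling up to `n_new` characters, one at a time."""
    out = [stoi[ch] for ch in context]
    for _ in range(n_new):
        following = counts[out[-1]]
        if not following:          # no observed successor: stop early
            break
        candidates, weights = zip(*following.items())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return "".join(itos[i] for i in out)

sample = generate("to", 20, random.Random(0))
```

Every adjacent character pair in `sample` also occurs in the training string, which is the bigram analogue of generated text being "similar in style" to the input.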

In [17]:
from torch.utils.data import Dataset
class TextDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(set(data))  # unique characters, in a stable order
        data_size, vocab_size = len(data), len(chars)
        print('The input data has %d characters. %d of these characters are unique. These characters include uppercase and lower case letters, as well as punctuations.'
        % (data_size, vocab_size))
        
        self.stoi = {ch:i for i,ch in enumerate(chars)}
        self.itos = {i:ch for i,ch in enumerate(chars)} # will be used for prediction/text generation task
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data

    def __getitem__(self, idx):
        text_block = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        encoded_txt = [self.stoi[char] for char in text_block]
        x = torch.tensor(encoded_txt[:-1], dtype=torch.long)
        y = torch.tensor(encoded_txt[1:], dtype=torch.long)
        return x, y

    def __len__(self):
        return (len(self.data) - self.block_size)

Let's now load the input Shakespearean text file and look at the composition of the text.

In [19]:
# Let's load in the input data of Shakespearean text
with open('./gpt_text_input/shakespeare.txt', 'r') as f:
    shakespeare_txt = f.read()
train_dataset = TextDataset(shakespeare_txt, block_size = 128)

The input data has 1115393 characters. 65 of these characters are unique. These characters include uppercase and lower case letters, as well as punctuations.
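
To see what a single training example looks like, the windowing logic of `TextDataset.__getitem__` can be mirrored in plain Python on a short string. This is only an illustrative sketch: the string, the block size of 4, and the `get_example` helper are hypothetical, and plain lists are used in place of tensors.

```python
data = "hello world"
block_size = 4
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}

def get_example(idx):
    """Return (input ids, target ids) for the window starting at idx."""
    block = data[idx: idx + block_size + 1]    # block_size + 1 characters
    encoded = [stoi[ch] for ch in block]
    return encoded[:-1], encoded[1:]           # targets are inputs shifted by one

x, y = get_example(0)                  # built from the window "hello"
num_examples = len(data) - block_size  # same formula as TextDataset.__len__
```

Here `x` encodes "hell" and `y` encodes "ello", so position `t` of `y` is the character that follows position `t` of `x`. Applying the same length formula to the Shakespeare file with `block_size = 128` gives 1115393 - 128 = 1115265 windows.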


Go through the `train_config.py` file in the `train` directory. It contains the parameters we will use to train the model. You can play around with these parameters after you have trained the model for the first time. 