# SimpleGPT
### HW3 @ DL Course, Dr. Soleymani

*Full Name:* Mohammadjavad Maheronnaghsh

The objective of this notebook is to create and train a decoder-only model, which is a custom and scaled-down version of GPT, using the specified dataset.



### import libraries

In [18]:
# Import necessary libraries for data manipulation
import pandas as pd
import numpy as np

# Import PyTorch and submodules for neural network construction and operations
import torch
import torch.nn as nn
from torch.nn import functional as F

### Download dataset

In [19]:
!wget https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends.csv

--2024-04-28 21:53:02--  https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5383844 (5.1M) [text/plain]
Saving to: ‘friends.csv.1’


2024-04-28 21:53:02 (106 MB/s) - ‘friends.csv.1’ saved [5383844/5383844]



## Hyperparameters

In [20]:
batch_size = 16
block_size = 32  # Length of sequence fed into the model
max_iters = 5000  # Maximum number of training iterations
eval_interval = 100  # Interval for evaluating the model on validation data
learning_rate = 1e-3

n_embd = 64  # Dimensionality of the embeddings
n_head = 4   # Number of attention heads
n_layer = 4  # Number of transformer layers

eval_iters = 200  # Number of iterations to run during evaluation

device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(1337)


<torch._C.Generator at 0x7a08a3a56a70>

## Preparing dateset

In [21]:
friends_df = pd.read_csv('friends.csv')
friends_df.head()

Unnamed: 0,text,speaker,season,episode,scene,utterance
0,There's nothing to tell! He's just some guy I ...,Monica Geller,1,1,1,1
1,"C'mon, you're going out with the guy! There's ...",Joey Tribbiani,1,1,1,2
2,"All right Joey, be nice. So does he have a hum...",Chandler Bing,1,1,1,3
3,"Wait, does he eat chalk?",Phoebe Buffay,1,1,1,4
4,"(They all stare, bemused.)",Scene Directions,1,1,1,5


In [22]:
friends_df = friends_df.drop(['episode','season','scene','utterance'], axis='columns')
friends_df = friends_df[friends_df['speaker'].str.contains('Scene')==False].copy()
friends_df['speaker'] = friends_df['speaker'].apply(lambda sp: sp.lower().capitalize().split(' ')[0])

friends_df.head()

Unnamed: 0,text,speaker
0,There's nothing to tell! He's just some guy I ...,Monica
1,"C'mon, you're going out with the guy! There's ...",Joey
2,"All right Joey, be nice. So does he have a hum...",Chandler
3,"Wait, does he eat chalk?",Phoebe
5,"Just, 'cause, I don't want her to go through w...",Phoebe


In [23]:
# Generate the dataset text
text = '\n\n'.join(f"{row['speaker']}:\n{row['text']}" for _, row in friends_df.iterrows())
print("Length of dataset in characters:", len(text))

Length of dataset in characters: 3774765


In [24]:
# Print the first 1000 characters of the dataset text
print(text[:1000])

Monica:
There's nothing to tell! He's just some guy I work with!

Joey:
C'mon, you're going out with the guy! There's gotta be something wrong with him!

Chandler:
All right Joey, be nice. So does he have a hump? A hump and a hairpiece?

Phoebe:
Wait, does he eat chalk?

Phoebe:
Just, 'cause, I don't want her to go through what I went through with Carl- oh!

Monica:
Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.

Chandler:
Sounds like a date to me.

Chandler:
Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked.

#all#:
Oh, yeah. Had that dream.

Chandler:
Then I look down, and I realize there's a phone... there.

Joey:
Instead of...?

Chandler:
That's right.

Joey:
Never had that dream.

Phoebe:
No.

Chandler:
All of a sudden, the phone starts to ring. Now I don't know what to do, everybody starts looking at me.

Monica:
And they weren't looking at you before?!


In [25]:
# Create a vocabulary and encode/decode functions
chars = sorted(set(text))
vocab_size = len(chars)
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(string):
    return [char_to_id[char] for char in string]

def decode(ids):
    return ''.join(id_to_char[id] for id in ids)

In [26]:
print(vocab_size)
print(chars)

88
['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '}']


In [27]:
# Prepare the data for model training
data = torch.LongTensor(encode(text))
train_part = int(0.9 * len(data))
train_data, val_data = data[:train_part], data[train_part:]


# Display information about the prepared data
print(f"Vocabulary Size: {vocab_size}")
print(f"Training Data Length: {len(train_data)}")
print(f"Validation Data Length: {len(val_data)}")

Vocabulary Size: 88
Training Data Length: 3397288
Validation Data Length: 377477


## Utils

In [28]:
def get_random_batch(data_source, block_size, batch_size):
    """
    Generates a random batch of input and label tensors from the data source.

    Parameters:
    - data_source: The dataset from which to sample.
    - block_size: The size of each sequence to be sampled.
    - batch_size: The number of sequences per batch.

    Returns:
    - A tuple of input and label tensors for the batch.
    """
    indices = torch.randint(high=len(data_source) - block_size, size=(batch_size,))
    inputs = torch.stack([data_source[idx: idx + block_size] for idx in indices]).to(device)
    labels = torch.stack([data_source[idx + 1: idx + block_size + 1] for idx in indices]).to(device)
    return inputs, labels


def estimate_loss(model, data_sources, block_size, batch_size, eval_iters):
    """
    Estimates the model's loss on different data splits.

    Parameters:
    - model: The model to evaluate.
    - data_sources: A dictionary of datasets for each split.
    - block_size: The size of each sequence block.
    - batch_size: The number of sequences per batch.
    - eval_iters: The number of iterations for evaluation.


    Returns:
    - A dictionary with the mean loss for each data split.
    """
    losses_dict = {}
    model.eval()
    with torch.no_grad():
        for split, data_source in data_sources.items():
            losses = [model(*get_random_batch(data_source, block_size, batch_size))[1].item() for _ in range(eval_iters)]
            losses_dict[split] = torch.tensor(losses).mean()
    model.train()
    return losses_dict

def generate_text(model, initial_idx, block_size, max_new_tokens):
    """
    Generates text by sampling from the model's predictions.

    Parameters:
    - model: The model to use for text generation.
    - initial_idx: The initial indices for generation.
    - block_size: The size of the block to consider for each prediction.
    - max_new_tokens: The maximum number of tokens to generate.


    Returns:
    - A tensor of indices representing the generated text.
    """
    idx = initial_idx
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits, _ = model(idx_cond)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
    model.train()
    return idx


def train_model(model, train_data, val_data, block_size, batch_size, max_iters, eval_interval, optimizer):
    """
    Trains the model on the training data and evaluates it on the validation data.

    Parameters:
    - model: The model to train.
    - train_data: The training dataset.
    - val_data: The validation dataset.
    - block_size: The size of each sequence block.
    - batch_size: The number of sequences per batch.
    - max_iters: The maximum number of iterations for training.
    - eval_interval: The interval at which to evaluate the model.
    - optimizer: The optimizer for training the model.

    Returns:
    - The trained model.
    """
    data_sources = {'train': train_data, 'val': val_data}
    for iteration in range(max_iters):
        if iteration % eval_interval == 0 or iteration == max_iters - 1:
            losses = estimate_loss(model, data_sources, block_size, batch_size, eval_iters)
            print(f"Iteration {iteration}: Train Loss {losses['train']:.4f}, Val Loss {losses['val']:.4f}")

        inputs, labels = get_random_batch(train_data, block_size, batch_size)
        optimizer.zero_grad()
        _, loss = model(inputs, labels)
        loss.backward()
        optimizer.step()

    return model



#Model architecture

The Generative Pre-trained Transformer (GPT) model represents a significant breakthrough in the field of natural language processing (NLP) and beyond, thanks to its ability to generate human-like text based on the input it receives. Its architecture is based on the Transformer model, which allows it to effectively capture the context and semantics of the input text over long distances, making it particularly adept at tasks such as language modeling, text generation, and even complex reasoning tasks.

Here's a brief overview of the decoder-only architecture(like GPT) and steps you can follow to implement its components:

## 1. Understanding the Transformer Block

The core of the decoder-only architecture is the Transformer block, which consists of two main components: multi-head self-attention and position-wise feed-forward networks. Each block applies these components in sequence, each followed by layer normalization and a residual connection.


*   **Multi-Head Self-Attention:** This mechanism allows the model to weigh the importance of different words in the input sequence differently, providing a dynamic way to aggregate context from the entire sequence.

![MHSA](https://miro.medium.com/v2/resize:fit:720/format:webp/1*PiZyU-_J_nWixsTjXOUP7Q.png)

*   **Position-wise Feed-Forward Networks:** These are simple, fully connected neural networks applied to each position separately and identically. This means they look at each word (or token) in isolation and then transform it.

## 2. Understanding the whole architecture
To build a decode-only architecture, you would generally follow these steps:



*   **Embedding Layer:** This is where the model learns representations for each token in the vocabulary and for each possible position in the input sequence. The embeddings for tokens and their positions are summed to produce a single representation for each token that captures both its meaning and its position in the sequence.

*   **Stack of Transformer Blocks:** The heart of the model. Several Transformer blocks are stacked on top of each other to allow the model to learn complex relationships between tokens in the input sequence. Each block includes multi-head self-attention and feed-forward networks, as explained above.

*   **Output Layer:** After passing through the Transformer blocks, the output is normalized and then passed through a linear layer that projects it back to the size of the vocabulary. This produces a set of logits that can be used, with a softmax layer, to generate probabilities for each token in the vocabulary being the next token in the sequence.

![](https://miro.medium.com/v2/resize:fit:700/0*77memcl1VYIdpE8f.png)






---
Now for implementing SimpleGPT model you should code the components described above. Here's a approach to doing so:


1.   **SelfAttentionHead:** Implement the self-attention mechanism with key, query, and value projections. Don't forget to apply masking to ignore future tokens in the sequence when calculating attention scores.
2.   **MultiHeadSelfAttention:** Aggregate multiple self-attention heads, allowing the model to focus on different parts of the input sequence simultaneously.
3.   **FeedForward:** Implement the position-wise feed-forward network with a simple sequence of linear layers and activation functions.
4.   **TransformerBlock:** Combine the multi-head self-attention and feed-forward network, adding normalization and residual connections around each.
5.   **SimpleGPT:** Assemble the model by starting with embedding layers for tokens and positions, stacking several Transformer blocks, and then adding the output layer to produce logits.


## Transformer block

In [54]:
class SelfAttentionHead(nn.Module):
    """
    Implements a single head of self-attention.

    This module applies self-attention on the input data, allowing the model to weigh the importance of different tokens within the same input sequence.

    Args:
        n_embd (int): Dimensionality of the embeddings.
        head_size (int): Size of each attention head.

    Attributes:
        key, query, value (nn.Linear): Linear transformations for computing self-attention mechanism's components.
    """

    def __init__(self, n_embd, head_size):
        super().__init__()
        ######################  TODO  ########################
        ######################  TODO  ########################
        self.head_size = head_size
        self.key = nn.Linear(n_embd, head_size)
        self.query = nn.Linear(n_embd, head_size)
        self.value = nn.Linear(n_embd, head_size)
        ######################  TODO  ########################
        ######################  TODO  ########################

    def forward(self, x):
        """
        Forward pass for self-attention head.

        Args:
            x (torch.Tensor): The input tensor (batch_size, seq_length, n_embd).

        Returns:
            torch.Tensor: Output tensor after applying self-attention.
        """
        ######################  TODO  ########################
        ######################  TODO  ########################
        # Make sure to create a mask and use it on the attention weights.
        # You can do this by using torch.tril to make a lower triangle mask and masked_fill_ in PyTorch to put the mask in place
        keys = self.key(x)
        queries = self.query(x)
        values = self.value(x)
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / self.head_size**0.5
        weights = torch.softmax(scores, dim=-1)
        out = torch.matmul(weights, values)
        ######################  TODO  ########################
        ######################  TODO  ########################
        return out


class MultiHeadSelfAttention(nn.Module):
    """
    Implements multi-head self-attention by running several self-attention mechanisms in parallel.

    Args:
        num_heads (int): Number of attention heads.
        input_size (int): Size of each input token.
        head_size (int): Size of each attention head.

    Attributes:
        heads (nn.ModuleList): ModuleList containing all the self-attention heads.
        projection (nn.Linear): Linear layer to project the concatenated outputs of all heads back to the input_size dimensions.
    """

    def __init__(self, num_heads, n_embd, head_size):
        super().__init__()
        ######################  TODO  ########################
        ######################  TODO  ########################
        self.heads = nn.ModuleList([SelfAttentionHead(n_embd, head_size) for _ in range(num_heads)])
        self.projection = nn.Linear(num_heads * head_size, n_embd)

        ######################  TODO  ########################
        ######################  TODO  ########################
    def forward(self, x):
        """
        Forward pass for multi-head self-attention.

        Args:
            x (torch.Tensor): The input tensor (batch_size, seq_length, input_size).

        Returns:
            torch.Tensor: Output tensor after applying multi-head self-attention.
        """
        ######################  TODO  ########################
        ######################  TODO  ########################
        attention_outputs = [head(x) for head in self.heads]
        concatenated_attention = torch.cat(attention_outputs, dim=-1)
        out = self.projection(concatenated_attention)

        ######################  TODO  ########################
        ######################  TODO  ########################
        return out


class FeedForward(nn.Module):
    """
    Implements a simple feed-forward neural network as part of the transformer block.

    Args:
        n_embd (int): Dimensionality of the embeddings.

    Attributes:
        net (nn.Sequential): A sequence of linear layers and a ReLU activation function.
    """

    def __init__(self, n_embd):
        super().__init__()
        ######################  TODO  ########################
        ######################  TODO  ########################
        self.net =nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd)
        )
        ######################  TODO  ########################
        ######################  TODO  ########################
    def forward(self, x):
        """Perform forward pass through the feedforward layer.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output tensor after feedforward computation.

        """
        ######################  TODO  ########################
        ######################  TODO  ########################
        output = self.net(x)

        ######################  TODO  ########################
        ######################  TODO  ########################
        return output

In [60]:
class TransformerBlock(nn.Module):
    """
    Implements a Transformer block with self-attention and feed-forward layers.

    This class combines multi-head self-attention and a position-wise feed-forward network,
    each followed by layer normalization and residual connections.

    Args:
        n_embd (int): Dimensionality of the embeddings.
        num_heads (int): Number of heads in the multi-head self-attention component.

    Attributes:
        self_attention (MultiHeadSelfAttention): The multi-head self-attention module.
        feed_forward (FeedForward): The feed-forward neural network module.
        norm1, norm2 (nn.LayerNorm): Layer normalization modules.
    """

    def __init__(self, n_embd, num_heads):
        super().__init__()
        ######################  TODO  ########################
        ######################  TODO  ########################
        self.self_attention = MultiHeadSelfAttention(num_heads, n_embd, n_embd // num_heads)
        self.feed_forward = FeedForward(n_embd)
        self.norm1 = nn.LayerNorm(n_embd)
        self.norm2 = nn.LayerNorm(n_embd)
        ######################  TODO  ########################
        ######################  TODO  ########################
    def forward(self, x):
        """
        Forward pass of the Transformer block.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_length, input_size).

        Returns:
            torch.Tensor: Output tensor of the same shape as input.
        """
        ######################  TODO  ########################
        ######################  TODO  ########################
        # x_normalized = self.norm1(x)
        attention_output = self.self_attention(x)
        x_residual_attention = x + attention_output
        x = self.norm1(x_residual_attention)
        # x_normalized_ffn = self.norm2(x_residual_attention)
        ffn_output = self.feed_forward(x)
        x = x + ffn_output
        x = self.norm2(x)
        ######################  TODO  ########################
        ######################  TODO  ########################
        return x

## Model

In [75]:
class SimpleGPT(nn.Module):
    """SimpleGPT model for sequence generation tasks.

    This model consists of an embedding layer for tokens and positions, followed by a stack of transformer blocks.
    It then applies layer normalization and a linear layer to generate logits for the vocabulary.

    Args:
        vocab_size (int): Size of the vocabulary.
        n_embd (int): Dimensionality of the token embeddings and hidden layers.
        block_size (int): Size of the input sequence block.
        n_layer (int): Number of transformer blocks.
        n_head (int): Number of attention heads.

    Attributes:
        token_embeddings (nn.Embedding): Embedding layer for tokens.
        position_embeddings (nn.Embedding): Embedding layer for positions.
        blocks (nn.Sequential): Sequential module containing transformer blocks.
        layer_norm (nn.LayerNorm): Layer normalization module.
        lm_head (nn.Linear): Linear layer for generating logits.

    """

    def __init__(self, vocab_size, n_embd, block_size, n_layer, n_head):
        super().__init__()
        ######################  TODO  ########################
        ######################  TODO  ########################
        self.token_embeddings = nn.Embedding(vocab_size, n_embd)
        self.position_embeddings = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[TransformerBlock(n_embd, n_head) for _ in range(n_layer)])
        self.layer_norm = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        ######################  TODO  ########################
        ######################  TODO  ########################

    def forward(self, idx, targets=None):
        """Perform forward pass through the SimpleGPT model.

        Args:
            idx (torch.Tensor): Input tensor containing token indices.
            targets (torch.Tensor, optional): Target tensor containing token indices for computing the loss.

        Returns:
            tuple: Tuple containing logits tensor and optional loss tensor.

        """
        ######################  TODO  ########################
        ######################  TODO  ########################
        # hint: token_emb = self.token_embeddings(inputs) + self.position_embeddings(torch.arange(inputs_sequence_length))

        # TODO #
        x = self.token_embeddings(idx) + self.position_embeddings(torch.arange(idx.shape[1], device=idx.device))
        # block_outputs = self.blocks(x)
        for block in self.blocks:
            x = block(x)
        output = self.layer_norm(x)
        logits = self.lm_head(output)

        loss = None
        if targets is not None:
            # Compute loss if targets are provided
            loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))


        ######################  TODO  ########################
        ######################  TODO  ########################
        return logits, loss


In [76]:
# Initialize the model and move it to the appropriate device
model = SimpleGPT(vocab_size=vocab_size, n_embd=n_embd, block_size=block_size, n_layer=n_layer, n_head=n_head).to(device)

# Calculate the number of parameters in the model
num_parameters = sum(p.numel() for p in model.parameters())
print(f'Number of parameters = {num_parameters}')

Number of parameters = 213464


In [None]:
# Print the model structure
print(model)

# training and evaluation the model

In [77]:
# Example of generating output with the initial model (before training)
initial_idx = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_output = generate_text(model, initial_idx, block_size, max_new_tokens=2000)
decoded_output = decode(generated_output[0].tolist())
print(decoded_output)


QgpGOUSK;7M4p(o%HyMw{jtfO_` t{ZWD4[mD!{14%#%r%# !PVlYEYajdr&"6yzc)BWq;ufItH$55aA)m,z3HO!}E8y)RNO;Vt>c _)r u`.#{FSAqDEf  k>n`E*iU.CYNQmZ [om'dEeK7F)QB5 h3F
BT%N)yE4`xE(mbEO.h;L6Q5Ef`5qKTVVEJPO[[`mu.U97RVkET77yggXrV5hFzA[B&[(!C3df#gQfql$ni#Bl#'vK"BEBMTV)2OFQK,_vggh
Cu X.5AEHEHknFr?O",VyAEHA_Xn6B5&S9n0TkW$[Srn%f${(YCaE$O5a3Vdvw4V`,KJhLkW-N;YE g 'i"Eo)f5SrrNOTpU6PJhEBEwxjTowfvC`aaJ{G0BOz5,NwU,(J}?"$"Ap{ahNMyUB07o%b*qCO1vc)7E
%/!7`8)kJQ[jdE,;;`AtFt#BBfIEOAI}KDBN-T6n5v+(at#?(h}mX%t"XaG?)/h#xR+gO"?7E0n50Z!YQv{ !KcVfv`b ;I7aIAS&#6rXvK4)tIDB27W>%!tI4zVNb'bTn.xYwr% :`$%7.#J7Te:_EYv)uH7vx97?h/%kHFAHnBI.V.EX0JFGLW}JEk7xrCx`BrBKJq%%>pffygoh&%g54dqf7k%EvEB_iJhqO)b7qjya?83>EYw9(z[w+3V{gFf:# D>i5_$%gE'Y+sy_TPz0CUTV1'SBR`w
v>OF.[i>U_V54"g}qF`6$W"0
b9ln5oF%5(V5V9_G`of[4z?hCVV0% RFj'[5K Bo.1rry7BOh#LW;`cPWB`OPREzDgxY$tdLez$);F`mfta5LV_m0vr5#GCBqqarF0lLkxrLI!"Bo`%`9voEa%M%5g(B[JC2[M }``V0>txUiSgErU[;;ly$[KS_w>-V$J`4CMvO0FrqKz#t&"n-(;d5Iq(#6RW_ [mYLZrF?SCQ (o)AHF5zqrCrVS"}viP`0og)a:H%w7?RH a4NZf$o;.iSF?5S

In [78]:
# training
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
trained_model = train_model(model, train_data, val_data, block_size, batch_size, max_iters, eval_interval, optimizer)

Iteration 0: Train Loss 4.6094, Val Loss 4.6144
Iteration 100: Train Loss 2.4274, Val Loss 2.4276
Iteration 200: Train Loss 0.4057, Val Loss 0.4202
Iteration 300: Train Loss 0.1522, Val Loss 0.1596
Iteration 400: Train Loss 0.1120, Val Loss 0.1183
Iteration 500: Train Loss 0.0987, Val Loss 0.1048
Iteration 600: Train Loss 0.0928, Val Loss 0.0937
Iteration 700: Train Loss 0.0882, Val Loss 0.0899
Iteration 800: Train Loss 0.0836, Val Loss 0.0870
Iteration 900: Train Loss 0.0825, Val Loss 0.0821
Iteration 1000: Train Loss 0.0796, Val Loss 0.0810
Iteration 1100: Train Loss 0.0781, Val Loss 0.0797
Iteration 1200: Train Loss 0.0765, Val Loss 0.0783
Iteration 1300: Train Loss 0.0736, Val Loss 0.0755
Iteration 1400: Train Loss 0.0739, Val Loss 0.0744
Iteration 1500: Train Loss 0.0723, Val Loss 0.0723
Iteration 1600: Train Loss 0.0702, Val Loss 0.0711
Iteration 1700: Train Loss 0.0700, Val Loss 0.0705
Iteration 1800: Train Loss 0.0691, Val Loss 0.0694
Iteration 1900: Train Loss 0.0701, Val Loss

In [80]:
# Example of generating output with the trained model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_output = generate_text(trained_model, initial_idx, block_size, max_new_tokens=2000)
decoded_output = decode(generated_output[0].tolist())
print(decoded_output)


i
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiPiPi! PiieiiPiPao.. Wiki!!!

Kikebs?!
 Limes.

Raici!:
Yeah, Phaha have any!

Monica:
The thos ChanI, I'm llke a! A me das. A yournif gonny.on? Phoeb!! Hean't!

Monica:
Y'know! Okay, I know would:
Joey.
 I py. I dow you. A howbe banxys, do's us yis's vidit you not you not's andnicenmndinto pins it like's "hom!! Thby any.

Monica:
you just just but not, ploh thery, whos...

have:
She all, what ant heli you seyoud you Mart, ane sebon! I'm of thour jave Phoebl you thisk whighat or any and thiple. It in gica. WodiS thorve hyre! Okay.. I pet.

Joey:
I I know a know I mrip?

Monica:
monning thabyd ac's sooke! How!.

Krally:
Hst's 'm hyou that hoke.. 
Sid it's but, bres!! You, Gigk in you w'u've aid anyt anyy.

Janybe:
Bf heae hie?

Meahan:
Sl'j't okayght. What's whok just'lk for's ch loth lwes. Hary.

Joey:
Ok.ey. Yee I-thang anying.

Rail:
Chandl! Keck, fook, thavem have loke you sorge ainke good you thrcuexe fi the luthin't ste! Akakm ju