References:
- https://www.youtube.com/watch?v=kCc8FmEb1nY

Papers:
- Attention is All You Need (2017) https://arxiv.org/abs/1706.03762
- GPT-3 https://arxiv.org/abs/2005.14165

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

print('PyTorch version:', torch.__version__)

PyTorch version: 2.0.0+cu118


## Building a Generative-Pretrained Transformer (GPT)

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT.

- Building nanoGPT
- Decoder-only Transformer
- Model Size: ~1M parameters
- Training code is ~200 lines of code

In [2]:
"""
I'd like to focus on is just to train a Transformer-based language model and 
in our case it's going to be a character level language model 
I still think that is a very educational with respect to how
these systems work so I don't want to train on the chunk of Internet 
we need a smaller data set in this case I propose
that we work with my favorite toy data set it's called Tiny Shakespeare

Decoder-only model
"""

"\nI'd like to focus on is just to train a Transformer-based language model and \nin our case it's going to be a character level language model \nI still think that is a very educational with respect to how\nthese systems work so I don't want to train on the chunk of Internet \nwe need a smaller data set in this case I propose\nthat we work with my favorite toy data set it's called Tiny Shakespeare\n\nDecoder-only model\n"

### Reading and Exploring the Data
Dataset: TinyShakespeare
- Link: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
- Size: 1MB
- Length in characters: 1.1M
- Vocab size in characters: 65

In [3]:
# We always start with a dataset to train on. 
# Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# File size is about 1MB

--2023-05-06 15:11:10--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-05-06 15:11:10 (21.9 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [4]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [5]:
print("length of dataset in characters:", len(text))

# working with roughly 1M characters

length of dataset in characters: 1115394


In [6]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [7]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)

# All the characters in our dataset in sorted order
print('Characters:', ''.join(chars))
print('Vocab Size:', vocab_size)

Characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab Size: 65


### Tokenization
- encode and decode

Advanced Tokenizers (not used in this notebook):
- https://github.com/openai/tiktoken
- https://github.com/google/sentencepiece

In [8]:
"""
develop some strategy to tokenize the input text

tokenize they mean convert the raw text as a string to some
sequence of integers 

here we are going to be building a character level language model 
so we're simply going to be translating individual characters into integers
"""

"\ndevelop some strategy to tokenize the input text\n\ntokenize they mean convert the raw text as a string to some\nsequence of integers \n\nhere we are going to be building a character level language model \nso we're simply going to be translating individual characters into integers\n"

In [9]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s] 
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l]) 

print('Encoded String:', encode("hii there"))
print('Decoded String:', decode(encode("hii there")))

Encoded String: [46, 47, 47, 1, 58, 46, 43, 56, 43]
Decoded String: hii there


In [10]:
"""
encode and decode are essentially our tokenizers
"""

'\nencode and decode are essentially our tokenizers\n'

In [11]:
"""
this is only one of many possible encodings or many possible sort of
tokenizers and it's a very simple one but there's many other schemas 
that people have come up with in practice so
for example Google uses a sentence piece 
https://github.com/google/sentencepiece

uh so sentence piece 
will also encode text into integers but in a different
schema and using a different vocabulary and sentencepiece is a sub-word 
tokenizer and what that means is that you're not encoding entire words 
but you're not also encoding individual
characters it's it's a sub word unit level and that's usually what's adopted
in practice

openai has this Library called tick token that uses 
a byte pair encoding tokenizer
um and that's what GPT uses
https://github.com/openai/tiktoken

basically you can trade off the code book size and the sequence lengths so
you can have a very long sequences of integers with very small vocabularies 
or you can have a short sequences of integers with very large vocabularies 
and so typically people use
in practice the sub word encodings

but I'd like to keep our tokenizer very simple so we're using character level
tokenizer and that means that we have very small code books 
we have very simple encode
and decode functions but we do get very long sequences as a result 
but that's the level at which we're going to stick with this lecture 
because it's the simplest thing
"""

"\nthis is only one of many possible encodings or many possible sort of\ntokenizers and it's a very simple one but there's many other schemas \nthat people have come up with in practice so\nfor example Google uses a sentence piece \nhttps://github.com/google/sentencepiece\n\nuh so sentence piece \nwill also encode text into integers but in a different\nschema and using a different vocabulary and sentencepiece is a sub-word \ntokenizer and what that means is that you're not encoding entire words \nbut you're not also encoding individual\ncharacters it's it's a sub word unit level and that's usually what's adopted\nin practice\n\nopenai has this Library called tick token that uses \na byte pair encoding tokenizer\num and that's what GPT uses\nhttps://github.com/openai/tiktoken\n\nbasically you can trade off the code book size and the sequence lengths so\nyou can have a very long sequences of integers with very small vocabularies \nor you can have a short sequences of integers with very l

In [12]:
"""
so now that we have an encoder and a decoder effectively a
tokenizer we can tokenize the entire training set of Shakespeare 
so here's a chunk of code that does that
and I'm going to start to use the pytorch library and specifically 
the torch.tensor from the pytorch library
"""

"\nso now that we have an encoder and a decoder effectively a\ntokenizer we can tokenize the entire training set of Shakespeare \nso here's a chunk of code that does that\nand I'm going to start to use the pytorch library and specifically \nthe torch.tensor from the pytorch library\n"

In [13]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org

data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:1000]) 
# the 1000 characters we looked at earier will to the GPT look like this

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [14]:
"""
the entire data set of text is re-represented as just it just stretched out as 
a single very large uh sequence of integers
"""

'\nthe entire data set of text is re-represented as just it just stretched out as \na single very large uh sequence of integers\n'

### Split Dataset into Train and Validation Sets

In [15]:
"""
split dataset into train and validation splits
"""

# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]

# withold the last 10% at the end as validation data
val_data = data[n:]

"""
this will help us understand to what extent our model is overfitting 

so we're going to basically hide and keep the validation data on the side 
because we don't want just a perfect memorization
of this exact Shakespeare 

we want a neural network that sort of creates Shakespeare like text and so 
it should be fairly likely for it to produce the actual like stowed away uh true
Shakespeare text

we're going to use this to get a sense of the overfitting 
"""

"\nthis will help us understand to what extent our model is overfitting \n\nso we're going to basically hide and keep the validation data on the side \nbecause we don't want just a perfect memorization\nof this exact Shakespeare \n\nwe want a neural network that sort of creates Shakespeare like text and so \nit should be fairly likely for it to produce the actual like stowed away uh true\nShakespeare text\n\nwe're going to use this to get a sense of the overfitting \n"

### Data Loader: Batches of Chunks of Data
- block_size

In [16]:
"""
now we would like to start plugging these text sequences or integer sequences 
into the Transformer so that it can train and learn those patterns
"""

'\nnow we would like to start plugging these text sequences or integer sequences \ninto the Transformer so that it can train and learn those patterns\n'

In [17]:
"""
The important thing to realize is we're never going to actually feed the
entire text into Transformer all at once that would be computationally 
very expensive and prohibitive 

so when we actually train a Transformer on a lot of these data sets we only work 
with chunks of the data set and when we train the
Transformer we basically sample random little chunks out of the training 
set and train them just chunks at a time and

these chunks have basically some kind of a length and as a maximum length 
now the maximum length typically at least in the code I usually write is 
called block_size

can find it on different names like context length or something like that
"""

"\nThe important thing to realize is we're never going to actually feed the\nentire text into Transformer all at once that would be computationally \nvery expensive and prohibitive \n\nso when we actually train a Transformer on a lot of these data sets we only work \nwith chunks of the data set and when we train the\nTransformer we basically sample random little chunks out of the training \nset and train them just chunks at a time and\n\nthese chunks have basically some kind of a length and as a maximum length \nnow the maximum length typically at least in the code I usually write is \ncalled block_size\n\ncan find it on different names like context length or something like that\n"

#### Block Size Example - Time Dimension

In [18]:
block_size = 8

# let me look at the first train data characters the first block size + 1 chars
# I'll explain why plus one in a second
train_data[:block_size+1]

# this is the first 9 characters in the sequence in the training set

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [19]:
"""
when you sample a chunk of data like this so say that these nine characters
out of the training set, this actually has multiple examples packed into it

that's because all of these characters follow each other

when we plug it into a Transformer is we're going to actually simultaneously 
train it to make prediction at every one of these positions

in the in a chunk of nine characters, there's actually eight individual
examples packed in there

Example:
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

when in the context of 18, 47 likely comes next 
in the context of 18 and 47, 56 comes next 
in the context of 18 47 56, 57 can come next and so on 

so that's the eight individual examples
"""

"\nwhen you sample a chunk of data like this so say that these nine characters\nout of the training set, this actually has multiple examples packed into it\n\nthat's because all of these characters follow each other\n\nwhen we plug it into a Transformer is we're going to actually simultaneously \ntrain it to make prediction at every one of these positions\n\nin the in a chunk of nine characters, there's actually eight individual\nexamples packed in there\n\nExample:\ntensor([18, 47, 56, 57, 58,  1, 15, 47, 58])\n\nwhen in the context of 18, 47 likely comes next \nin the context of 18 and 47, 56 comes next \nin the context of 18 47 56, 57 can come next and so on \n\nso that's the eight individual examples\n"

In [20]:
# Code Example to illustrate

# x is the input to the Transformer
# x is the first block size characters
x = train_data[:block_size]

# y will be the next block size characters so it's offset by one
# y are the targets for each position in the input
y = train_data[1:block_size+1]

# iterating over all the block size of 8
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [21]:
"""
these are the eight examples hidden in a chunk of nine characters that 
we uh sampled from the training set 
"""

'\nthese are the eight examples hidden in a chunk of nine characters that \nwe uh sampled from the training set \n'

In [22]:
"""
I want to mention one more thing we train on all the eight examples here
with context between one all the way up to context of block size and 
we train on that not just for
computational reasons because we happen to have the sequence already or 
something like that it's not just done for efficiency 

it's also done to make the Transformer Network be used to seeing contexts 
all the way from as little as one all the way to block size
and we'd like the transform to be used to seeing everything in between 
and that's going to be useful later during
inference because while we're sampling we can start 
the sampling generation with as little as one character of
context and the Transformer knows how to predict 
the next character with all the way up to just one context of one and so
then it can predict everything up to block size and after block size 
we have to start truncating because the
Transformer will never receive more than block size inputs 
when it's predicting the next character

we've looked at the Time Dimension of the tensors that are going to be feeding
into the Transformer
"""

"\nI want to mention one more thing we train on all the eight examples here\nwith context between one all the way up to context of block size and \nwe train on that not just for\ncomputational reasons because we happen to have the sequence already or \nsomething like that it's not just done for efficiency \n\nit's also done to make the Transformer Network be used to seeing contexts \nall the way from as little as one all the way to block size\nand we'd like the transform to be used to seeing everything in between \nand that's going to be useful later during\ninference because while we're sampling we can start \nthe sampling generation with as little as one character of\ncontext and the Transformer knows how to predict \nthe next character with all the way up to just one context of one and so\nthen it can predict everything up to block size and after block size \nwe have to start truncating because the\nTransformer will never receive more than block size inputs \nwhen it's predicting th

#### Batch Dimension

In [23]:
"""
Another Important Dimension: Batch

as we're sampling these chunks of text we're going to be actually every time
we're going to feed them into a Transformer 

we're going to have many 
batches of multiple chunks of text that are all like stacked up in a single
tensor and that's just done for efficiency just so that we can 
keep the gpus busy because they are very good at
parallel processing of data and so we just want 
to process multiple chunks all at the same
time but those chunks are processed completely independently 
they don't talk to each other and so on
"""

"\nAnother Important Dimension: Batch\n\nas we're sampling these chunks of text we're going to be actually every time\nwe're going to feed them into a Transformer \n\nwe're going to have many \nbatches of multiple chunks of text that are all like stacked up in a single\ntensor and that's just done for efficiency just so that we can \nkeep the gpus busy because they are very good at\nparallel processing of data and so we just want \nto process multiple chunks all at the same\ntime but those chunks are processed completely independently \nthey don't talk to each other and so on\n"

In [24]:
torch.manual_seed(1337) # for reproducibility
"""
because we're going to start sampling random locations in the dataset 
to pull chunks from I am setting the seed so that um in 
the random number generator so that the numbers I see here are going 
to be the same numbers you see later if you try to reproduce this
"""

batch_size = 4 # how many independent sequences will we process in parallel?
"""
how many independent sequences we are processing every 
forward backward pass of the Transformer
"""

block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    """
    generate a small batch of data of inputs x and targets y
    """
    data = train_data if split == 'train' else val_data

    ix = torch.randint(len(data) - block_size, (batch_size,))
    """
    when I Generate random positions to grab a chunk out of 
    I actually grab I actually generate
    batch size number of random offsets so because this is four we are IX is
    going to be a four numbers that are randomly generated between 0 
    and len(data) - block_size 
    so it's just random offsets into the training set
    """

    x = torch.stack([data[i:i+block_size] for i in ix])
    """
    X's as I explained are the
    first block size characters starting at I
    """
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    """
    Y's are the offset by one of that so just add plus one
    """

    """
    we're going to get those chunks for every one of integers I in IX and
    use a torch.stack to take all those one-dimensional tensors as we saw here
    and we're going to um stack them up at rows and so they all become 
    a row in a four by eight tensor (4x8)
    """
    return x, y

xb, yb = get_batch('train')
print('inputs: x') # input to the Transformer
print(xb.shape)
"""
the input X is the four by eight tensor
four uh rows of eight columns

each one of these is a chunk of the
training set and then the targets here are in the associated array Y and 
they will come in
through the Transformer all the way at the end to create the loss function so
they will give us the correct answer for every single position inside X

this four by eight array contains a total of 32 examples and 
they're completely independent as far as the Transformer is concerned
"""


print(xb)
print('targets: y')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs: x
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets: y
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44

In [25]:
"""
so you can sort of see this spelled out 
these are the 32 independent examples packed in 
to a single batch of the input X and then the desired targets are in y

this integer tensor of X is going to feed into the Transformer and 
that Transformer is going  to simultaneously process all these
examples and then look up the correct um integers to predict 
in every one of these positions in the tensor y
"""

'\nso you can sort of see this spelled out \nthese are the 32 independent examples packed in \nto a single batch of the input X and then the desired targets are in y\n\nthis integer tensor of X is going to feed into the Transformer and \nthat Transformer is going  to simultaneously process all these\nexamples and then look up the correct um integers to predict \nin every one of these positions in the tensor y\n'

### Simplest Baseline: Bigram Language Model

In [26]:
""""
now that we have our batch of input, that we'd like to feed into a Transformer 

let's start basically feeding this into neural networks
"""
print(xb.shape)
print(xb) # our input to the transformer

torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


In [27]:
print(vocab_size)

65


#### Make Predictions about What Comes Next

In [28]:
"""
start off with the simplest possible neural network which in the case of 
language modeling in my opinion is the Bigram Language Model and 
we've covered the background language model in my Makemore series 
in a lot of depth
"""

import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    """
    constructing a bigram language model which is a subclass of nn.Module
    """

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next 
        # token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        # creating a token embedding table
        # of size (vocab_size x vocab_size)

        # we're using nn.embedding which is a very thin wrapper 
        # around basically a tensor of shape of (vocab_size x vocab_size)

    def forward(self, idx, targets=None):
        # inputs X here which I rename to idx

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        # arranged into a batch (B) by time (T) by channel (C) tensor
        # in this case, batch is 4, time is 8, channel is vocab_size or 65

        """
        we're going to interpret this as the logits 
        which are basically the scores for the next character in the sequence

        and so what's happening here is we are predicting what comes next 
        based on just the individual identity of a single
        token and you can do that because um I mean currently the tokens 
        are not talking to each other and they're not
        seeing any context except for they're just seeing themselves 
        so I'm a I'm a token number five and then I can
        actually make pretty decent predictions about what comes next just by 
        knowing that I'm token five because some characters know 
        cert follow other characters in in typical scenarios
        """

        return logits

m = BigramLanguageModel(vocab_size)

# calling the model by passing inputs (xb) and targets (yb)
out = m(xb, yb)

"""
we currently get the predictions, the scores, the logits for
every one of the four by eight positions 
"""
print(out.shape)

torch.Size([4, 8, 65])


#### Evaluate Loss Function
- Cross Entropy (Negative Log Likelihood Loss)
- https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html

In [29]:
"""
we'd like to evaluate the loss function and 
so in Makemore series we saw that a good way to measure a loss or 
like a quality of the predictions, is to use the negative log likelihood loss 
which is also implemented in PyTorch under the name
cross entropy
"""

import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next 
        # token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # inputs X here which I rename to idx

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        # unpact those numbers
        B, T, C = logits.shape

        """
        what I like to do I like to take basically give names to the dimensions 
        so launches.shape is B by T by C and
        unpack those numbers and then 
        
        let's say that logits equals logits.view
        and we want it to be a b times c b times T by C 
        so just a two-dimensional array
        
        right so we're going to take all the we're going to take all of these um
        positions here and we're going to uh stretch them out in a 
        one-dimensional sequence and 
        preserve the channel Dimension as the second dimension 
        
        so we're just kind of like stretching 
        out the array so it's two-dimensional and in that case 
        it's going to better conform to 
        what PyTorch sort of expects in its dimensions
        """
        logits = logits.view(B*T, C)

        """
        we have to do the same to targets
        because currently targets are of shape B by T and 
        we want it to be just B times T so one dimensional now 
        
        alternatively  you could always still just do -1 
        because PyTorch will guess what this should be
        """
        targets = targets.view(B*T) # targets.view(-1)

        # loss is the cross entropy on the predictions (logits) and the targets
        """
        this measures the quality of the logits with respect to the Targets 
        in other words we have the identity of the next character 
        so how well are we predicting the next character based on logits

        intuitively the correct um the correct dimension of logits uh
        depending on whatever the target is should have a very high number 
        and all the other dimensions should be very low number right

        intuitively we want to measure this
        """
        loss = F.cross_entropy(logits, targets)

        return logits, loss
    

m = BigramLanguageModel(vocab_size)

# passing inputs and targets
logits, loss = m(xb, yb)
print(logits.shape)

# can now evaluate our loss
"""
currently we see that the loss is 4.87

because our we have 65 possible vocabulary elements 
we can actually guess at what the loss should be and 

in particular we covered negative log likelihood in a lot of detail 
we are expecting log or ln of 1 over 65 and negative of that (-ln(1/65))
so we're expecting the loss to be about 4.1217 but we're getting 4.87

that's telling us that the initial predictions are not super diffuse 
they've got a little bit of entropy so we're guessing wrong

but we are able to evaluate the loss
"""
print(loss)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)


#### Generate from the Model

In [30]:
"""
now that we can evaluate the quality of the model on some data 

we'd likely also be able to generate from the model so let's do the generation
"""

import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next 
        # token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # inputs X here which I rename to idx

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
          loss = None
        else:
          B, T, C = logits.shape
          logits = logits.view(B*T, C)
          targets = targets.view(B*T)
          # loss is the cross entropy on the predictions and the targets
          loss = F.cross_entropy(logits, targets)

        return logits, loss

    # take the the same kind of input idx here
    def generate(self, idx, max_new_tokens):
        """
        idx - basically is the current context of some characters in some batch
            - of size (B,T) array

        basically this (B,T) and 
        make it a B by T plus one plus two plus three 
        as many as we want max_new_tokens 
        
        so this is the generation from the model
        """

        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step (-1)
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)

            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
    

m = BigramLanguageModel(vocab_size)

# passing inputs and targets
logits, loss = m(xb, yb)
print(logits.shape)

# can now evaluate our loss
print(loss)

# generating 100 tokens
idx = torch.zeros((1, 1), dtype=torch.long) # (B, T) = (1, 1)
"""
I'm creating a batch will be just one time will be just one so 
I'm creating a little one by one
tensor and it's holding a zero and the D type the data type is integer
so 0 is going to be how we kick off the generation and remember that zero is uh
is the element standing for a new line character so 
it's kind of like a reasonable thing to to feed in as the
very first character in a sequence to be the new line
"""

"""
so it's going to be idx which 
we're going to feed in here then we're going to ask for 100 tokens

and then m.generate() will continue that
"""
print(decode(m.generate(idx, 
                        max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


##### No Comments

In [31]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next 
        # token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), 
                        max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


In [32]:
"""
so obviously it's garbage and the reason it's garbage is because 
this is a totally random model so next up
we're going to want to train this model

one more thing I wanted to point out here is 
this function is written to be General
but it's kind of like ridiculous right now because 
we're feeding in all this we're building
out this context and we're concatenating it all and 
we're always feeding it all
into the model but that's kind of ridiculous because 
this is just a simple background model
so to make for example this prediction about K we only needed this W but 
actually what we fed into the model is
we fed the entire sequence and then 
we only looked at the very last piece and predicted k
so the only reason I'm writing it in this way is because right now 
this is a bygram model but I'd like to keep this
function fixed and I'd like it to work later when our character is actually
basically look further in the history and 

so right now 
the history is not used so this looks silly but eventually the
history will be used and so that's why we want to do it this way 

so just a quick comment on that
"""

"\nso obviously it's garbage and the reason it's garbage is because \nthis is a totally random model so next up\nwe're going to want to train this model\n\none more thing I wanted to point out here is \nthis function is written to be General\nbut it's kind of like ridiculous right now because \nwe're feeding in all this we're building\nout this context and we're concatenating it all and \nwe're always feeding it all\ninto the model but that's kind of ridiculous because \nthis is just a simple background model\nso to make for example this prediction about K we only needed this W but \nactually what we fed into the model is\nwe fed the entire sequence and then \nwe only looked at the very last piece and predicted k\nso the only reason I'm writing it in this way is because right now \nthis is a bygram model but I'd like to keep this\nfunction fixed and I'd like it to work later when our character is actually\nbasically look further in the history and \n\nso right now \nthe history is not 

### Training the Bigram Model
- Optimizer: AdamW

In [33]:
"""
let's train the model so it becomes a bit less random
"""

# in Makemore we use SGD optimizer, here we use AdamW

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

"""
in the make more series we've only ever used stochastic gradient descent
the simplest possible Optimizer which you can get using the SGD instead 

but I want to use Adam which is a much more advanced and popular Optimizer and 
it works extremely well 

Learning Rate
for a typical good setting for the learning rate is roughly 3e-4 
but for very very small networks luck is the case here you can
get away with much much higher learning rates running -3 
or even higher probably

but let me create the optimizer object which will basically take the gradients 
and update the parameters using the gradients
"""

"\nin the make more series we've only ever used stochastic gradient descent\nthe simplest possible Optimizer which you can get using the SGD instead \n\nbut I want to use Adam which is a much more advanced and popular Optimizer and \nit works extremely well \n\nLearning Rate\nfor a typical good setting for the learning rate is roughly 3e-4 \nbut for very very small networks luck is the case here you can\nget away with much much higher learning rates running -3 \nor even higher probably\n\nbut let me create the optimizer object which will basically take the gradients \nand update the parameters using the gradients\n"

In [34]:
"""
our batch size up above was only 4 

so let me actually use something bigger let's say 32
"""
batch_size = 32
n_steps = 10000

for steps in range(n_steps): # increase number of steps for good results... 
    
    """
    for some number of steps um 
    we are sampling a new batch of data
    """
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    """
    we're evaluating the loss 
    we're zeroing out all the gradients from the previous step
    """
    optimizer.zero_grad(set_to_none=True)

    """
    getting the gradients for all the parameters
    """
    loss.backward()

    """
    using those gradients to update our parameters 
    so typical training loop as we saw in the
    Makemore series
    """
    optimizer.step()
    
    if steps == 0:
      print('Initial loss:', loss.item())

print('Final loss:', loss.item())

Initial loss: 4.704006195068359
Final loss: 2.5727508068084717


In [35]:
# so this is the simplest possible model
print(decode(m.generate(idx, 
                        max_new_tokens=500)[0].tolist()))

"""
certainly not Shakespeare but the model is making progress 
so that is the simplest possible model
"""


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercc.
hathe; d!
My hind tt hinig t ouchos tes; st yo hind wotte grotonear 'so it t jod weancotha:
h hay.JUCle n prids, r loncave w hollular s O:
HIs; ht anjx?

DUThinqunt.

LaZAnde.
athave l.
KEONH:
ARThanco be y,-hedarwnoddy scace, tridesar, wnl'shenous s ls, theresseys
PlorseelapinghiybHen yof GLUCEN t l-t E:
I hisgothers je are!-e!
QLYotouciullle'z


'\ncertainly not Shakespeare but the model is making progress \nso that is the simplest possible model\n'

In [36]:
"""
obviously that this is a very simple model because 
the tokens are not talking to each other 

so given the previous context of whatever was generated 
we're only looking at the very last character to make the predictions 
about what comes next

so now these uh now these tokens
have to start talking to each other and figuring out what is in the context 
so that they can make better predictions for what comes next

this is how we're going to kick off the Transformer
"""

"\nobviously that this is a very simple model because \nthe tokens are not talking to each other \n\nso given the previous context of whatever was generated \nwe're only looking at the very last character to make the predictions \nabout what comes next\n\nso now these uh now these tokens\nhave to start talking to each other and figuring out what is in the context \nso that they can make better predictions for what comes next\n\nthis is how we're going to kick off the Transformer\n"

### Port Our Code to a Script: bigram.py
- https://github.com/karpathy/ng-video-lecture/commit/83f7d22b80a866e337a069dbc17b677f53a6b5a9

In [37]:
"""
I took the code that we developed in this Jupiter notebook and 
I converted it to be a script and  
I'm doing this because I just want to simplify our intermediate work into 
just the final product that we have at this point

File: bigram.py

New Additions:
1. Enabled gpu if available - run on cuda
2. estimate_loss()
3. model.eval(), model.train() phases
-- it is a good practice to Think Through what mode your neural network is in 
because some layers will have different behaviors at 
inference time or training time

4. @torch.no_grad() - more memory efficient when we don't intend to do
back propagation
"""

import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
# ================================================================
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2

# added gpu capability if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)

eval_iters = 200
# ================================================================

torch.manual_seed(1337) # for reproducibility

# Read Data
# ================================================================ 
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# ================================================================

# Encoder and Decoder
# ================================================================ 
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s] 
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])
# ================================================================

# Create Train and Test Splits
# ================================================================
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# ================================================================

# Data Loading or Loader - gets a batch of the inputs and targets
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) # when we load the data, we move to device
    return x, y

"""
this context manager torch.nograd and this is just telling pytorch 
that everything that happens
inside this function we will not call that backward on and 

so pytorch can be a
lot more efficient with its memory use because it doesn't have to store all 
the intermediate variables because we're
never going to call backward and 
so it can it can be a lot more memory efficient in that way

a good practice to tell PyTorch when we don't intend to do back propagation
"""
@torch.no_grad()
def estimate_loss():
    """
    it averages up the loss over multiple batches 
    so in particular 
    we're going to iterate eval_iter times and 
    we're going to
    basically get our loss and then we're going to get the average loss 
    for both splits and so this will be a lot less
    noisy

    when we call the estimate loss we're going to report the pretty
    accurate train and validation loss
    """
    out = {}
    model.eval() # setting phases
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # setting phases
    return out

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next 
        # token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device) # when we create the model, we want to move the model
# parameters to device

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
# ================================================================
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        # when we call the estimate loss 
        # we're going to report the pretty accurate train and validation loss
        losses = estimate_loss()
        print(f"""step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}""")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# when I'm creating the context that feeds into generate 
# I have to make sure that I create on the device

"""
I ran this code it was giving me the train loss and val loss
and we see that we convert to somewhere around 2.5 with the bigram model 

and then here's
the sample that we produced at the end and so we have everything packaged up in
the script and we're in a good position now to iterate on this
"""
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

Device: cuda
step 0: train loss 4.7305, val loss 4.7241
step 300: train loss 2.8110, val loss 2.8249
step 600: train loss 2.5434, val loss 2.5682
step 900: train loss 2.4932, val loss 2.5088
step 1200: train loss 2.4863, val loss 2.5035
step 1500: train loss 2.4665, val loss 2.4921
step 1800: train loss 2.4683, val loss 2.4936
step 2100: train loss 2.4696, val loss 2.4846
step 2400: train loss 2.4638, val loss 2.4879
step 2700: train loss 2.4738, val loss 2.4911



CEThik brid owindakis b, bth

HAPet bobe d e.
S:
O:3 my d?
LUCous:
Wanthar u qur, t.
War dXENDoate awice my.

Hastarom oroup
Yowhthetof isth ble mil ndill, ath iree sengmin lat Heriliovets, and Win nghir.
Swanousel lind me l.
HAshe ce hiry:
Supr aisspllw y.
Hentofu n Boopetelaves
MPOLI s, d mothakleo Windo whth eisbyo the m dourive we higend t so mower; te

AN ad nterupt f s ar igr t m:

Thin maleronth,
Mad
RD:

WISo myrangoube!
KENob&y, wardsal thes ghesthinin couk ay aney IOUSts I&fr y ce.
J


#### No Comments
- https://github.com/karpathy/ng-video-lecture/blob/master/bigram.py

In [38]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.7305, val loss 4.7241
step 300: train loss 2.8110, val loss 2.8249
step 600: train loss 2.5434, val loss 2.5682
step 900: train loss 2.4932, val loss 2.5088
step 1200: train loss 2.4863, val loss 2.5035
step 1500: train loss 2.4665, val loss 2.4921
step 1800: train loss 2.4683, val loss 2.4936
step 2100: train loss 2.4696, val loss 2.4846
step 2400: train loss 2.4638, val loss 2.4879
step 2700: train loss 2.4738, val loss 2.4911



CEThik brid owindakis b, bth

HAPet bobe d e.
S:
O:3 my d?
LUCous:
Wanthar u qur, t.
War dXENDoate awice my.

Hastarom oroup
Yowhthetof isth ble mil ndill, ath iree sengmin lat Heriliovets, and Win nghir.
Swanousel lind me l.
HAshe ce hiry:
Supr aisspllw y.
Hentofu n Boopetelaves
MPOLI s, d mothakleo Windo whth eisbyo the m dourive we higend t so mower; te

AN ad nterupt f s ar igr t m:

Thin maleronth,
Mad
RD:

WISo myrangoube!
KENob&y, wardsal thes ghesthinin couk ay aney IOUSts I&fr y ce.
J


In [39]:
"""
we have everything packaged up in the script and we're in a good position 
now to iterate on this okay so we are almost ready to start writing our very
first self-attention block for processing these tokens 
"""

"\nwe have everything packaged up in the script and we're in a good position \nnow to iterate on this okay so we are almost ready to start writing our very\nfirst self-attention block for processing these tokens \n"

### Self-Attention

#### Version 1: Averaging Past Context with for loops
The Weakest Form of Aggregation

##### The Mathematical Trick in Self-Attention

In [40]:
"""
a mathematical trick that is used in the Self-Attention inside a Transformer 
and is really just like at the heart of an efficient implementation 
of Self-Attention
"""

'\na mathematical trick that is used in the Self-Attention inside a Transformer \nand is really just like at the heart of an efficient implementation \nof Self-Attention\n'

In [41]:
# toy example illustrating how matrix multiplication can 
# be used for a "weighted aggregation"
torch.manual_seed(42)

B,T,C = 4,8,2 
# batch, time, channels - we have some information at each point
# in the sequence
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [42]:
"""
now what we would like to do is 
we would like these um tokens 
so we have up to eight tokens here in a batch and these
eight tokens are currently not talking to each other and 
we would like them to talk to each other 
we'd like to couple them

the token for example at the fifth location 
it should not communicate with tokens in the sixth seventh and eighth location 
because those are future tokens in the sequence 
the token on the fifth location should only talk to 
the one in the fourth third second and first

so it's only so information only flows from previous context 
to the current timestamp and 
we cannot get any information from the future because 
we are about to try to predict the future

so what is the easiest way for tokens to communicate 
okay the easiest way I would say is okay if we are up to if we're a
fifth token and I'd like to communicate with my past 
the simplest way we can do that is to just do a weight is to just
do an AVERAGE OF ALL THE PRECEDING ELEMENTS

so for example if I'm the fifth token 
I would like to take the channels that make up that are information at my step 
but then also the channels from the four step, third step, second step 
and the first step 
I'd like to average those up and then 
that would become sort of like a feature vector 
that summarizes me in the context of my history

now of course just doing a sum or like an average is an extremely weak form of
interaction 
like this communication is extremely lossy 
we've lost a ton of information about the spatial arrangements of 
all those tokens 
but that's okay for now 

we'll see how we can bring that information back later
"""

"\nnow what we would like to do is \nwe would like these um tokens \nso we have up to eight tokens here in a batch and these\neight tokens are currently not talking to each other and \nwe would like them to talk to each other \nwe'd like to couple them\n\nthe token for example at the fifth location \nit should not communicate with tokens in the sixth seventh and eighth location \nbecause those are future tokens in the sequence \nthe token on the fifth location should only talk to \nthe one in the fourth third second and first\n\nso it's only so information only flows from previous context \nto the current timestamp and \nwe cannot get any information from the future because \nwe are about to try to predict the future\n\nso what is the easiest way for tokens to communicate \nokay the easiest way I would say is okay if we are up to if we're a\nfifth token and I'd like to communicate with my past \nthe simplest way we can do that is to just do a weight is to just\ndo an AVERAGE OF ALL THE

In [43]:
"""
for now what we would like to do is
for every single batch element independently 
for every teeth token in that sequence
we'd like to now calculate the average of all the vectors in all the previous
tokens and also at this token

so let's write that out
"""

"\nfor now what we would like to do is\nfor every single batch element independently \nfor every teeth token in that sequence\nwe'd like to now calculate the average of all the vectors in all the previous\ntokens and also at this token\n\nso let's write that out\n"

In [44]:
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
"""
bow = bag of words

kind of like um a term that people use when you are just
averaging up things so it's just a bag of words basically there's a 
word stored on every one of these eight locations
and we're doing a bag of words such as averaging

in the beginning we initialize at zero (torch.zeros)
"""
for b in range(B): # iterating over batch dimension
    for t in range(T): # iterating over time dimension
    
        xprev = x[b,:t+1] # of shape (t,C) - t, how many elements in the past
                          # C, all the 2D information fro mthese tokens
        xbow[b,t] = torch.mean(xprev, 0)
        """
        doing the average or the mean over the zeroth dimension 
        so I'm averaging out the time here

        I'm just going to get a little C one-dimensional Vector which 
        I'm going to store in X background words
        """

In [45]:
x[0] # the zeroth batch element

tensor([[ 1.9269,  1.4873],
        [ 0.9007, -2.1055],
        [ 0.6784, -1.2345],
        [-0.0431, -1.6047],
        [-0.7521,  1.6487],
        [-0.3925, -1.4036],
        [-0.7279, -0.5594],
        [-0.7688,  0.7624]])

In [46]:
xbow[0] # the last row is an average of all the elements

tensor([[ 1.9269,  1.4873],
        [ 1.4138, -0.3091],
        [ 1.1687, -0.6176],
        [ 0.8657, -0.8644],
        [ 0.5422, -0.3617],
        [ 0.3864, -0.5354],
        [ 0.2272, -0.5388],
        [ 0.1027, -0.3762]])

In [47]:
"""
you see how the at the first location here you see that the two are equal and
that's because it's we're just doing an average of this one token

but here this one is now an average of
these two and now this one is an average of these three
and so on 

this last one is the average of all of these elements so vertical
average just averaging up all the tokens

now gives this outcome here
so this is all well and good but this is very inefficient
"""

"\nyou see how the at the first location here you see that the two are equal and\nthat's because it's we're just doing an average of this one token\n\nbut here this one is now an average of\nthese two and now this one is an average of these three\nand so on \n\nthis last one is the average of all of these elements so vertical\naverage just averaging up all the tokens\n\nnow gives this outcome here\nso this is all well and good but this is very inefficient\n"

##### The Trick in Self-Attention: Matrix Multiply as Weighted Aggregation

In [48]:
"""
the trick is that we can be very very efficient about
doing this using matrix multiplication so that's the mathematical trick
"""

"\nthe trick is that we can be very very efficient about\ndoing this using matrix multiplication so that's the mathematical trick\n"

In [49]:
# toy example illustrating how matrix multiplication can be 
# used for a "weighted aggregation"
torch.manual_seed(42)

a = torch.ones(3, 3) # 3x3 matrix
b = torch.randint(0,10,(3,2)).float() # 3x2 matrix

c = a @ b # (3x3) x (3x2) = 3x2 matrix

print('a=')
print(a)
print('---')
print('b=')
print(b)
print('---')
print('c=a*b')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
---
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
---
c=a*b
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


In [50]:
"""
okay so how are these numbers in C achieved right 

so this number in the top of C
"""

'\nokay so how are these numbers in C achieved right \n\nso this number in the top of C\n'

###### Trick: torch.tril

In [51]:
"""
now the trick here is uh the following this is just a boring number of
um it's just a boring array of all ones but 

torch has this function called tril
which is short for a triangular uh something like that and 
you can wrap it in torch.ones and 
it will just return the lower triangular portion of this
"""

torch.tril(torch.ones(3,3))

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [52]:
# toy example illustrating how matrix multiplication can be 
# used for a "weighted aggregation"
torch.manual_seed(42)

a = torch.tril(torch.ones(3, 3)) # 3x3 matrix
b = torch.randint(0,10,(3,2)).float() # 3x2 matrix

c = a @ b # (3x3) x (3x2) = 3x2 matrix

print('a=')
print(a)
print('---')
print('b=')
print(b)
print('---')
print('c=a*b')
print(c)

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
---
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
---
c=a*b
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


In [53]:
"""
and so basically depending on how many ones and zeros 
we have here we are basically doing a sum currently of a
variable number of these rows and that gets deposited into C
"""

'\nand so basically depending on how many ones and zeros \nwe have here we are basically doing a sum currently of a\nvariable number of these rows and that gets deposited into C\n'

In [54]:
"""
Dot Product
Matrix Multiplication
"""

'\nDot Product\nMatrix Multiplication\n'

###### Trick: Divide by the Sum

In [55]:
# toy example illustrating how matrix multiplication can be 
# used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('---')
print('b=')
print(b)
print('---')
print('c=a*b')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
---
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
---
c=a*b
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [56]:
"""
the trick here is uh the following this is just a boring number of
um it's just a boring array of all ones but torch has this function called tril

which is short for a triangular uh something like that and you can wrap
it in torched at once and it will just return 
the lower triangular portion of this
"""

"\nthe trick here is uh the following this is just a boring number of\num it's just a boring array of all ones but torch has this function called tril\n\nwhich is short for a triangular uh something like that and you can wrap\nit in torched at once and it will just return \nthe lower triangular portion of this\n"

In [57]:
"""
so basically depending on how many ones and zeros we have here we are basically doing a sum currently of a
variable number of these rows and that gets deposited into C

So currently we're doing sums because
these are ones but we can also do average right and you can start to see how we could do average of the rows of B
uh sort of in an incremental fashion
"""

"\nso basically depending on how many ones and zeros we have here we are basically doing a sum currently of a\nvariable number of these rows and that gets deposited into C\n\nSo currently we're doing sums because\nthese are ones but we can also do average right and you can start to see how we could do average of the rows of B\nuh sort of in an incremental fashion\n"

#### Version 2: Using Matrix Multiply

In [58]:
"""
see how we can vectorize this and make it much more efficient
"""

'\nsee how we can vectorize this and make it much more efficient\n'

In [59]:
# version 2: using matrix multiply for a weighted aggregation

# wei = weights
"""
we are going to produce an array a but
here I'm going to call it way short for weights but this is our "a" Matrix

this is how much of every row we want to average up and it's going to 
be an average because you can see it in these rows sum to 1
"""
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [60]:
"""
our b is going to be X
"""
xbow2 = wei @ x 
# wei is (T, T)
# x is (B, T, C)
# (T, T) @ (B, T, C) ----> (B, T, C)
# it will create a batch Dimension here and this is a batch matrix multiply
# and so it will apply this matrix multiplication in all 
# the batch elements in parallel
# (B, T, T) @ (B, T, C) ----> (B, T, C)
"""
and individually and then for each batch element there will 
be a T by T multiplying T by C exactly as we had
below
"""

torch.allclose(xbow, xbow2) # will show True because they are the same

True

In [61]:
xbow[0], xbow2[0]

(tensor([[ 1.9269,  1.4873],
         [ 1.4138, -0.3091],
         [ 1.1687, -0.6176],
         [ 0.8657, -0.8644],
         [ 0.5422, -0.3617],
         [ 0.3864, -0.5354],
         [ 0.2272, -0.5388],
         [ 0.1027, -0.3762]]),
 tensor([[ 1.9269,  1.4873],
         [ 1.4138, -0.3091],
         [ 1.1687, -0.6176],
         [ 0.8657, -0.8644],
         [ 0.5422, -0.3617],
         [ 0.3864, -0.5354],
         [ 0.2272, -0.5388],
         [ 0.1027, -0.3762]]))

In [62]:
"""
the trick is we were able to use batched Matrix multiply 
to do this uh aggregation really

and it's weighted aggregation and the weights are specified in this T by T array

and we're basically doing weighted sums and uh 
these weighted sums are according to the weights inside here (wei)

they take on sort of this triangular form
and so that means that a token at the T-th Dimension will only get uh sort of 
um information from the um tokens preceding it so 

that's exactly what we want
"""

"\nthe trick is we were able to use batched Matrix multiply \nto do this uh aggregation really\n\nand it's weighted aggregation and the weights are specified in this T by T array\n\nand we're basically doing weighted sums and uh \nthese weighted sums are according to the weights inside here (wei)\n\nthey take on sort of this triangular form\nand so that means that a token at the T-th Dimension will only get uh sort of \num information from the um tokens preceding it so \n\nthat's exactly what we want\n"

#### Version 3: Adding Softmax

In [63]:
"""
finally I would like to rewrite it in one more way
and we're going to see why that's useful

this is the third version and it's also identical to the first and second

it uses Softmax
"""

"\nfinally I would like to rewrite it in one more way\nand we're going to see why that's useful\n\nthis is the third version and it's also identical to the first and second\n\nit uses Softmax\n"

In [64]:
# lower triangular matrix with all 1s
tril = torch.tril(torch.ones(T, T))
tril

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [65]:
# begins as all zeros
wei = torch.zeros((T,T))
wei

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [66]:
wei = wei.masked_fill(tril == 0, float('-inf'))
wei

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [67]:
"""
if I take a Softmax along every single row (since dim is -1)
what is that going to do 

well softmax is um
it's also like a normalization operation right and so spoiler 
alert you get the exact same Matrix

in softmax we're going
to exponentiate every single one of these and then we're going 
to divide by the sum

this is also the uh the same way to produce this mask
"""
wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [68]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))

"""
now the reason that this is a bit more interesting and the reason 
we're going to end up using it and solve a tension
is that these weights here begin uh with zero
and 

you can think of this as like an interaction strength or like 
an affinity so basically it's telling us how much of
each token from the past do we want to Aggregate and average up
"""
wei = torch.zeros((T,T))

"""
this line is saying tokens from the past cannot communicate by setting
them to negative Infinity we're saying that we will not aggregate 
anything from those tokens

so basically this then goes through softmax
"""
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

"""
this is the aggregation through matrix multiplication
"""
xbow3 = wei @ x


torch.allclose(xbow, xbow3) # both matrix are equivalent

True

In [69]:
"""
Focus on wei
so what this is now is you can think of these as um these zeros are currently 
just set by
us to be zero but a quick preview is that these affinities between the tokens
are not going to be just constant at zero they're going to be data dependent 
these tokens are going to start looking
at each other and some tokens will find other tokens more or less interesting 
and depending on what their values are
they're going to find each other interesting to different amounts 
and I'm going to call those affinities I think

wei = wei.masked_fill(tril == 0, float('-inf'))
here we are saying the future cannot communicate with the past 
we're going to clamp them

then when we normalize and sum we're going to aggregate sort of their values 
depending on how interesting they find
each other

that's the preview for Self-Attenton
"""

"\nFocus on wei\nso what this is now is you can think of these as um these zeros are currently \njust set by\nus to be zero but a quick preview is that these affinities between the tokens\nare not going to be just constant at zero they're going to be data dependent \nthese tokens are going to start looking\nat each other and some tokens will find other tokens more or less interesting \nand depending on what their values are\nthey're going to find each other interesting to different amounts \nand I'm going to call those affinities I think\n\nwei = wei.masked_fill(tril == 0, float('-inf'))\nhere we are saying the future cannot communicate with the past \nwe're going to clamp them\n\nthen when we normalize and sum we're going to aggregate sort of their values \ndepending on how interesting they find\neach other\n\nthat's the preview for Self-Attenton\n"

###### No Comments

In [70]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True

#### TLDR

In [71]:
"""
basically long story short from this entire section is that 

you can do weighted aggregations of your past elements
by having by using matrix multiplication of a lower triangular fashion

then the elements here in the lower triangular part are telling you 
how much of each element fuses into this position
"""

'\nbasically long story short from this entire section is that \n\nyou can do weighted aggregations of your past elements\nby having by using matrix multiplication of a lower triangular fashion\n\nthen the elements here in the lower triangular part are telling you \nhow much of each element fuses into this position\n'

### Minor Code Cleanup
- https://github.com/karpathy/ng-video-lecture/commit/8050fde82a6380f7f7b0645e3dec8a02c984ef47

In [72]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
# ================================================================
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2

# added gpu capability if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

eval_iters = 200

n_embd = 32 # number of embedding dimensions
# ================================================================

torch.manual_seed(1337) # for reproducibility

# Read Data
# ================================================================ 
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# ================================================================

# Encoder and Decoder
# ================================================================ 
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# ================================================================

# Create Train and Test Splits
# ================================================================
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# ================================================================

# data loading - gets a batch of the inputs and targets
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

"""
this context manager torch.nograd and this is just telling pytorch 
that everything that happens
inside this function we will not call that backward on and 

so pytorch can be a
lot more efficient with its memory use because it doesn't have to store all 
the intermediate variables because we're
never going to call backward and 
so it can it can be a lot more memory efficient in that way

a good practice to tell Pi torch when we don't intend to do back propagation
"""
@torch.no_grad()
def estimate_loss():
    """
    it averages up the loss over multiple batches 
    so in particular 
    we're going to iterate invalider times and 
    we're going to
    basically get our loss and then we're going to get the average loss 
    for both splits and so this will be a lot less
    noisy

    when we call the estimate loss we're going to report the pretty
    accurate train and validation loss
    """
    out = {}
    model.eval() # setting phases
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # setting phases
    return out

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next 
        # token from a lookup table
        """
        I want to do is I don't want to actually create I want to create like a
        level of interaction here where we don't directly go to the embedding 
        for the um logits but instead we go through this
        intermediate phase because we're going to start making that bigger
        """
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

        # lm_head = language model head
        self.lm_head = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C), C = n_embd
        """
        going to give us token embeddings

        then to go from the token embeddings to the logits 
        we're going to need a linear layer so self.lm head let's call it
        short for language modeling head is n linear 
        from an embed up to vocab size
        """

        logits = self.lm_head(tok_emb) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# no need to pass vocab_size into the constructor, already defined as a
# global variable
model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
# ================================================================
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

"""
I ran this code it was giving me the train loss and val loss
and we see that we convert to somewhere around 2.5 with the bigram model 

and then here's
the sample that we produced at the end and so we have everything packaged up in
the script and we're in a good position now to iterate on this
"""

step 0: train loss 4.3886, val loss 4.3734
step 300: train loss 2.5267, val loss 2.5399
step 600: train loss 2.4998, val loss 2.5315
step 900: train loss 2.4903, val loss 2.5085
step 1200: train loss 2.4967, val loss 2.5128
step 1500: train loss 2.4809, val loss 2.5020
step 1800: train loss 2.4858, val loss 2.5149
step 2100: train loss 2.4865, val loss 2.5000
step 2400: train loss 2.4882, val loss 2.5127
step 2700: train loss 2.5006, val loss 2.5117



CExthantrid owindike on, ble

HAPen bube d e.
S:
Ond my d?
LUMuss ar hthar usqur, t. bar dilasoaten wice my.

Hastacom o mup
Yowhthetof isth ble mil; dilll,

W:

Yees, hein lat Hetidrovets, and Wh p.
Gore y jomes l lind me l.
MAshe cechiry ptupr aisspllwhy.
Hurinde n Boopetelaves
MPORIII od mothakleo Windo wh t eiibys woutit,

Hive cend iend t so mower; te

AN ad nterupt f s ar irist m:

Thin maleronth,
Mad
RD:

Whio myr f-bube!
KENobuisarardsal this aresthidin couk ay aney Iry ts I fr t ce.
J


"\nI ran this code it was giving me the train loss and val loss\nand we see that we convert to somewhere around 2.5 with the bigram model \n\nand then here's\nthe sample that we produced at the end and so we have everything packaged up in\nthe script and we're in a good position now to iterate on this\n"

### Positional Encoding
- https://github.com/karpathy/ng-video-lecture/commit/28e5fd789dc24231d6047f7a897e3b6aec95642a

In [73]:
"""
next up so far we've taken these in indices (idx) and 
we've encoded them based on the identity of the tokens inside idx
the next thing that people very often do is that we're not just encoding the identity of these tokens but also their
position

we're going to have a second position uh embedding table here
"""

"\nnext up so far we've taken these in indices (idx) and \nwe've encoded them based on the identity of the tokens inside idx\nthe next thing that people very often do is that we're not just encoding the identity of these tokens but also their\nposition\n\nwe're going to have a second position uh embedding table here\n"

In [74]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
# ================================================================
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2

# added gpu capability if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

eval_iters = 200

n_embd = 32 # number of embedding dimensions
# ================================================================

torch.manual_seed(1337) # for reproducibility

# Read Data
# ================================================================ 
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# ================================================================

# Encoder and Decoder
# ================================================================ 
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# ================================================================

# Create Train and Test Splits
# ================================================================
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# ================================================================

# data loading - gets a batch of the inputs and targets
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

"""
this context manager torch.nograd and this is just telling pytorch 
that everything that happens
inside this function we will not call that backward on and 

so pytorch can be a
lot more efficient with its memory use because it doesn't have to store all 
the intermediate variables because we're
never going to call backward and 
so it can it can be a lot more memory efficient in that way

a good practice to tell Pi torch when we don't intend to do back propagation
"""
@torch.no_grad()
def estimate_loss():
    """
    it averages up the loss over multiple batches 
    so in particular 
    we're going to iterate invalider times and 
    we're going to
    basically get our loss and then we're going to get the average loss 
    for both splits and so this will be a lot less
    noisy

    when we call the estimate loss we're going to report the pretty
    accurate train and validation loss
    """
    out = {}
    model.eval() # setting phases
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # setting phases
    return out

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next 
        # token from a lookup table
        """
        I want to do is I don't want to actually create I want to create like a
        level of interaction here where we don't directly go to the embedding 
        for the um logits but instead we go through this
        intermediate phase because we're going to start making that bigger
        """
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# no need to pass vocab_size into the constructor, already defined as a
# global variable
model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
# ================================================================
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)


"""
I ran this code it was giving me the train loss and val loss
and we see that we convert to somewhere around 2.5 with the bigram model 

and then here's
the sample that we produced at the end and so we have everything packaged up in
the script and we're in a good position now to iterate on this
"""

print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


step 0: train loss 4.4801, val loss 4.4801
step 300: train loss 2.5404, val loss 2.5566
step 600: train loss 2.5160, val loss 2.5335
step 900: train loss 2.4967, val loss 2.5149
step 1200: train loss 2.5106, val loss 2.5254
step 1500: train loss 2.4853, val loss 2.5109
step 1800: train loss 2.4966, val loss 2.5198
step 2100: train loss 2.4949, val loss 2.5100
step 2400: train loss 2.4937, val loss 2.5102
step 2700: train loss 2.5040, val loss 2.5114



CExthantrid owindikis s, bll

HAPen bube t e.
S:
O:
IS:
Folatangs:
Wanthar u qurthe. bar dilasoate awice my.

Hastatom o mup
Yowhthatof isth ble mil; dilll,

W:

Ye s, hain latisttid ov ts, and Wh pomano.
Swanous l lind me l.
MIshe ce hiry ptupr aisspllw y. w'stoul noroopetelaves
Momy ll, d mothake o Windo wh t eiibys the m douris TENGByore s poo mo th; te

AN ad nthrupt f s ar irist m:

Thin maleronth, af Pre?

Whio myr f-
LI har,
S:


Thardsal this ghesthidin cour ay aney Iry ts I f my ce hy


### The Crux: Self-Attention

In [75]:
"""
we're going to implement a small Self-Attention for a single individual Head 
as they're called
"""

"\nwe're going to implement a small Self-Attention for a single individual Head \nas they're called\n"

#### Version 4

In [76]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)
"""
so we have 4 by 8 arrangement of tokens
at each token is currently 32-dimensional
"""

print(x.shape)
print(x)

torch.Size([4, 8, 32])
tensor([[[ 0.1808, -0.0700, -0.3596,  ..., -0.8016,  1.5236,  2.5086],
         [-0.6631, -0.2513,  1.0101,  ...,  1.5333,  1.6097, -0.4032],
         [-0.8345,  0.5978, -0.0514,  ..., -0.4370, -1.0012, -0.4094],
         ...,
         [-0.8961,  0.0662, -0.0563,  ...,  2.1382,  0.5114,  1.2191],
         [ 0.1910, -0.3425,  1.7955,  ...,  0.3699, -0.5556, -0.3983],
         [-0.5819, -0.2208,  0.0135,  ..., -1.9079, -0.5276,  1.0807]],

        [[ 0.4562, -1.0917, -0.8207,  ...,  0.0512, -0.6576, -2.5729],
         [ 0.0210,  1.0060, -1.2492,  ...,  0.7859, -1.1501,  1.3132],
         [ 2.2007, -0.2195,  0.5427,  ..., -0.6445,  1.0834, -0.7995],
         ...,
         [ 0.3091,  1.1661, -2.1821,  ...,  0.6151,  0.6763,  0.6228],
         [ 0.0943, -0.3156,  0.7850,  ..., -1.5735,  1.3876,  0.7251],
         [ 0.6455, -0.3313, -1.0390,  ...,  0.0895, -0.3748, -0.4781]],

        [[-0.6067,  1.8328,  0.2931,  ...,  1.0041,  0.8656,  0.1688],
         [-0.2352, -0.

In [77]:
tril = torch.tril(torch.ones(T, T))
tril

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [78]:
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

print(out.shape)

torch.Size([4, 8, 32])


#### Single Head

In [79]:
# let's see a single Head perform self-attention
head_size = 16

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)

# communication/interaction happens now
# -2 = second last dimension, -1 = last dimension
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [80]:
wei

tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
         [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
         [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
         [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],

        [[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1687, 0.8313, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2477, 0.0514, 0.7008, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4410, 0.0957, 0.3747, 0.0887, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0069, 0.0456, 0.0300, 0.7748, 0.1427, 0.0000, 0.0000, 0.0000],
         [0.0660, 0.089

In [81]:
# look at zeroth row
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

#### Notes on Attention
1. Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights
2. There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
3. Each example across batch dimension is of course processed completely independently and never "talk" to each other
4. In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
5. "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
6. "Scaled" attention additional divides `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

##### Note 6

In [82]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)

# try with and without scaling by head_size
wei = q @ k.transpose(-2, -1) # * head_size**-0.5

In [83]:
k.var(), q.var(), wei.var()

(tensor(1.0449), tensor(1.0700), tensor(17.4690))

In [84]:
# try with and without scaling by head_size
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [85]:
k.var(), q.var(), wei.var()

(tensor(1.0449), tensor(1.0700), tensor(1.0918))

In [86]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [87]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) 
# gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

### Inserting a Single Self-Attention Block to Our Network
- https://github.com/karpathy/ng-video-lecture/commit/10024b146809927c603aef91e8646f2a65e659ca

In [88]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
# ================================================================
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000 # 3000
eval_interval = 500 # 300
learning_rate = 1e-3 # 1e-2

# added gpu capability if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

eval_iters = 200

n_embd = 32 # number of embedding dimensions
# ================================================================

torch.manual_seed(1337) # for reproducibility

# Read Data
# ================================================================ 
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# ================================================================

# Encoder and Decoder
# ================================================================ 
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# ================================================================

# Create Train and Test Splits
# ================================================================
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# ================================================================

# data loading - gets a batch of the inputs and targets
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

"""
this context manager torch.nograd and this is just telling pytorch 
that everything that happens
inside this function we will not call that backward on and 

so pytorch can be a
lot more efficient with its memory use because it doesn't have to store all 
the intermediate variables because we're
never going to call backward and 
so it can it can be a lot more memory efficient in that way

a good practice to tell Pi torch when we don't intend to do back propagation
"""
@torch.no_grad()
def estimate_loss():
    """
    it averages up the loss over multiple batches 
    so in particular 
    we're going to iterate invalider times and 
    we're going to
    basically get our loss and then we're going to get the average loss 
    for both splits and so this will be a lot less
    noisy

    when we call the estimate loss we're going to report the pretty
    accurate train and validation loss
    """
    out = {}
    model.eval() # setting phases
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # setting phases
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        """
        I want to do is I don't want to actually create I want to create like a
        level of interaction here where we don't directly go to the embedding 
        for the um logits but instead we go through this
        intermediate phase because we're going to start making that bigger
        """
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_head = Head(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.sa_head(x) # apply one head of self-attention (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# no need to pass vocab_size into the constructor, already defined as a
# global variable
model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
# ================================================================
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

"""
I ran this code it was giving me the train loss and val loss
and we see that we convert to somewhere around 2.5 with the bigram model 

and then here's
the sample that we produced at the end and so we have everything packaged up in
the script and we're in a good position now to iterate on this
"""

step 0: train loss 4.2000, val loss 4.2047
step 500: train loss 2.6911, val loss 2.7087
step 1000: train loss 2.5196, val loss 2.5303
step 1500: train loss 2.4775, val loss 2.4829
step 2000: train loss 2.4408, val loss 2.4523
step 2500: train loss 2.4272, val loss 2.4435
step 3000: train loss 2.4130, val loss 2.4327
step 3500: train loss 2.3956, val loss 2.4212
step 4000: train loss 2.4041, val loss 2.3992
step 4500: train loss 2.3980, val loss 2.4084

Whent iknt,
Thowi, ht son, bth

Hiset bobe ale.
S:
O-' st dalilanss:
Want he us he, vet?
Wedilas ate awice my.

HDET:
ANGo oug
Yowhavetof is he ot mil ndill, aes iree sen cie lat Herid ovets, and Win ngarigoerabous lelind peal.
-hule onchiry ptugr aiss hew ye wllinde norod atelaves
Momy yowod mothake ont-wou whth eiiby we ati dourive wee, ired thoouso er; th
To kad nteruptef so;
ARID Wam:
ENGCI inleront ffaf Pre?

Wh om.

He-
LIERCKENIGUICar adsal aces ard thinin cour ay aney Iry ts I fr af ve y


"\nI ran this code it was giving me the train loss and val loss\nand we see that we convert to somewhere around 2.5 with the bigram model \n\nand then here's\nthe sample that we produced at the end and so we have everything packaged up in\nthe script and we're in a good position now to iterate on this\n"

#### No Comments

In [89]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_head = Head(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.sa_head(x) # apply one head of self-attention. (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.2000, val loss 4.2047
step 500: train loss 2.6911, val loss 2.7087
step 1000: train loss 2.5196, val loss 2.5303
step 1500: train loss 2.4775, val loss 2.4829
step 2000: train loss 2.4408, val loss 2.4523
step 2500: train loss 2.4272, val loss 2.4435
step 3000: train loss 2.4130, val loss 2.4327
step 3500: train loss 2.3956, val loss 2.4212
step 4000: train loss 2.4041, val loss 2.3992
step 4500: train loss 2.3980, val loss 2.4084

Whent iknt,
Thowi, ht son, bth

Hiset bobe ale.
S:
O-' st dalilanss:
Want he us he, vet?
Wedilas ate awice my.

HDET:
ANGo oug
Yowhavetof is he ot mil ndill, aes iree sen cie lat Herid ovets, and Win ngarigoerabous lelind peal.
-hule onchiry ptugr aiss hew ye wllinde norod atelaves
Momy yowod mothake ont-wou whth eiiby we ati dourive wee, ired thoouso er; th
To kad nteruptef so;
ARID Wam:
ENGCI inleront ffaf Pre?

Wh om.

He-
LIERCKENIGUICar adsal aces ard thinin cour ay aney Iry ts I fr af ve y


### Multi-Head Self-Attention
- https://github.com/karpathy/ng-video-lecture/commit/a6e0bee43163848076df568eb78799d524306ac9

In [90]:
"""
so now we've implemented the scale.product attention 
now next up in the attention is all you need paper 
there's something called multi-head attention and what is multi-head attention 

it's just applying multiple attentions in parallel and 
concatenating the results
"""

"\nso now we've implemented the scale.product attention \nnow next up in the attention is all you need paper \nthere's something called multi-head attention and what is multi-head attention \n\nit's just applying multiple attentions in parallel and \nconcatenating the results\n"

In [91]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
# ================================================================
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000 # 3000
eval_interval = 500 # 300
learning_rate = 1e-3 # 1e-2

# added gpu capability if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

eval_iters = 200

n_embd = 32 # number of embedding dimensions
# ================================================================

torch.manual_seed(1337) # for reproducibility

# Read Data
# ================================================================ 
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# ================================================================

# Encoder and Decoder
# ================================================================ 
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# ================================================================

# Create Train and Test Splits
# ================================================================
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# ================================================================

# data loading - gets a batch of the inputs and targets
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

"""
this context manager torch.nograd and this is just telling pytorch 
that everything that happens
inside this function we will not call that backward on and 

so pytorch can be a
lot more efficient with its memory use because it doesn't have to store all 
the intermediate variables because we're
never going to call backward and 
so it can it can be a lot more memory efficient in that way

a good practice to tell Pi torch when we don't intend to do back propagation
"""
@torch.no_grad()
def estimate_loss():
    """
    it averages up the loss over multiple batches 
    so in particular 
    we're going to iterate invalider times and 
    we're going to
    basically get our loss and then we're going to get the average loss 
    for both splits and so this will be a lot less
    noisy

    when we call the estimate loss we're going to report the pretty
    accurate train and validation loss
    """
    out = {}
    model.eval() # setting phases
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # setting phases
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        """
        I want to do is I don't want to actually create I want to create like a
        level of interaction here where we don't directly go to the embedding 
        for the um logits but instead we go through this
        intermediate phase because we're going to start making that bigger
        """
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # self.sa_head = Head(n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4) 
        # i.e. 4 heads of 8-dimensional self-attention
        
        self.lm_head = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        # x = self.sa_head(x) # apply one head of self-attention (B,T,C)
        x = self.sa_heads(x) # apply one head of self-attention. (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# no need to pass vocab_size into the constructor, already defined as a
# global variable
model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
# ================================================================
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

"""
I ran this code it was giving me the train loss and val loss
and we see that we convert to somewhere around 2.5 with the bigram model 

and then here's
the sample that we produced at the end and so we have everything packaged up in
the script and we're in a good position now to iterate on this
"""

step 0: train loss 4.2248, val loss 4.2250
step 500: train loss 2.6663, val loss 2.6809
step 1000: train loss 2.5107, val loss 2.5189
step 1500: train loss 2.4394, val loss 2.4447
step 2000: train loss 2.3769, val loss 2.3890
step 2500: train loss 2.3459, val loss 2.3606
step 3000: train loss 2.3163, val loss 2.3361
step 3500: train loss 2.2867, val loss 2.3138
step 4000: train loss 2.2861, val loss 2.2796
step 4500: train loss 2.2692, val loss 2.2816

Whent if bridcowilfakis s, bt madiret bobe to tarver-'t thealleauss:
Want he us hat vet?
Wedtlaccane awice my.

HDY'n om oroug
Youts, tof is heirt mil nowlit,
Whiiree--viecin lat Het drov the and Wing.

DWAFeransesel lind peall liser cochiry ptur; aiss hiwty. Huntike normopeeelave whomy.
Whoulllelake ont---o whr Ceviby wey thour rive wees ime st so mo lif thure kadmn,
Turt for are;
Dor my monge inledooth, af Pre?

WISo myay I sok!
Whied is:
Sadsal the E'd steruin cour ay andy I yous I frouf voul


"\nI ran this code it was giving me the train loss and val loss\nand we see that we convert to somewhere around 2.5 with the bigram model \n\nand then here's\nthe sample that we produced at the end and so we have everything packaged up in\nthe script and we're in a good position now to iterate on this\n"

#### No Comments

In [92]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)


# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4) # i.e. 4 heads of 8-dimensional self-attention
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.sa_heads(x) # apply one head of self-attention. (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.2248, val loss 4.2250
step 500: train loss 2.6663, val loss 2.6809
step 1000: train loss 2.5107, val loss 2.5189
step 1500: train loss 2.4394, val loss 2.4447
step 2000: train loss 2.3769, val loss 2.3890
step 2500: train loss 2.3459, val loss 2.3606
step 3000: train loss 2.3163, val loss 2.3361
step 3500: train loss 2.2867, val loss 2.3138
step 4000: train loss 2.2861, val loss 2.2796
step 4500: train loss 2.2692, val loss 2.2816

Whent if bridcowilfakis s, bt madiret bobe to tarver-'t thealleauss:
Want he us hat vet?
Wedtlaccane awice my.

HDY'n om oroug
Youts, tof is heirt mil nowlit,
Whiiree--viecin lat Het drov the and Wing.

DWAFeransesel lind peall liser cochiry ptur; aiss hiwty. Huntike normopeeelave whomy.
Whoulllelake ont---o whr Ceviby wey thour rive wees ime st so mo lif thure kadmn,
Turt for are;
Dor my monge inledooth, af Pre?

WISo myay I sok!
Whied is:
Sadsal the E'd steruin cour ay andy I yous I frouf voul


In [93]:
"""
I ran the same thing and then we now get
this down to 2.28 roughly and the output is still the generation is still not
amazing but clearly the validation loss is improving because 
we were at 2.4 just now

and so it helps to have multiple communication channels because 
obviously these tokens have a lot to talk about
and they want to find the consonants the vowels they want to find the vowels 
just from certain positions they want to find
any kinds of different things and so it helps 
to create multiple independent channels of communication gather lots of
different types of data and then 
decode the output now 
"""

'\nI ran the same thing and then we now get\nthis down to 2.28 roughly and the output is still the generation is still not\namazing but clearly the validation loss is improving because \nwe were at 2.4 just now\n\nand so it helps to have multiple communication channels because \nobviously these tokens have a lot to talk about\nand they want to find the consonants the vowels they want to find the vowels \njust from certain positions they want to find\nany kinds of different things and so it helps \nto create multiple independent channels of communication gather lots of\ndifferent types of data and then \ndecode the output now \n'

### Feed Forward Layers of Transformer Block
- https://github.com/karpathy/ng-video-lecture/commit/97dd3f9dee3dbb6445adcddb527370dc76010e41

In [94]:
"""
going back to the paper for a second of course I didn't explain
this figure in full detail but we are starting to see some components of 
what we've already implemented 

we have the positional encodings, the token encodings that add 
we have the masked multi-headed attention implemented

now here's another multi-headed tension which is a cross attention 
to an encoder which we haven't we're not going to implement in this
case I'm going to come back to that later but 

I want you to notice that there's a feed forward part here and then 
this is grouped into a block that gets repeated again and again 

now the feed forward part here is just a simple multi-layer perceptron
um so the multi-headed so here position wise feed forward networks is just a
simple little MLP 

so I want to start basically in a similar fashion also adding computation
into the network and this computation is on the per node level
"""

"\ngoing back to the paper for a second of course I didn't explain\nthis figure in full detail but we are starting to see some components of \nwhat we've already implemented \n\nwe have the positional encodings, the token encodings that add \nwe have the masked multi-headed attention implemented\n\nnow here's another multi-headed tension which is a cross attention \nto an encoder which we haven't we're not going to implement in this\ncase I'm going to come back to that later but \n\nI want you to notice that there's a feed forward part here and then \nthis is grouped into a block that gets repeated again and again \n\nnow the feed forward part here is just a simple multi-layer perceptron\num so the multi-headed so here position wise feed forward networks is just a\nsimple little MLP \n\nso I want to start basically in a similar fashion also adding computation\ninto the network and this computation is on the per node level\n"

In [95]:
"""
I took the code that we
developed in this Jupiter notebook and I converted it to be a script and 
I'm doing this because I just want to
simplify our intermediate work into just the final product that we have 
at this point

bigram.py

New additions
1. Enabled gpu - run on cuda
2. estimate_loss()
3. model.eval(), model.train() phases
4. @torch.no_grad() - 
"""

import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
# ================================================================
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000 # 3000
eval_interval = 500 # 300
learning_rate = 1e-3 # 1e-2

# added gpu capability if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

eval_iters = 200

n_embd = 32 # number of embedding dimensions
# ================================================================

torch.manual_seed(1337) # for reproducibility

# Read Data
# ================================================================ 
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# ================================================================

# Encoder and Decoder
# ================================================================ 
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# ================================================================

# Create Train and Test Splits
# ================================================================
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# ================================================================

# data loading - gets a batch of the inputs and targets
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

"""
this context manager torch.nograd and this is just telling pytorch 
that everything that happens
inside this function we will not call that backward on and 

so pytorch can be a
lot more efficient with its memory use because it doesn't have to store all 
the intermediate variables because we're
never going to call backward and 
so it can it can be a lot more memory efficient in that way

a good practice to tell Pi torch when we don't intend to do back propagation
"""
@torch.no_grad()
def estimate_loss():
    """
    it averages up the loss over multiple batches 
    so in particular 
    we're going to iterate invalider times and 
    we're going to
    basically get our loss and then we're going to get the average loss 
    for both splits and so this will be a lot less
    noisy

    when we call the estimate loss we're going to report the pretty
    accurate train and validation loss
    """
    out = {}
    model.eval() # setting phases
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # setting phases
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        """
        I want to do is I don't want to actually create I want to create like a
        level of interaction here where we don't directly go to the embedding 
        for the um logits but instead we go through this
        intermediate phase because we're going to start making that bigger
        """
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # self.sa_head = Head(n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4) 
        self.ffwd = FeedFoward(n_embd)
        # i.e. 4 heads of 8-dimensional self-attention
        
        self.lm_head = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        # x = self.sa_head(x) # apply one head of self-attention (B,T,C)
        x = self.sa_heads(x) # apply one head of self-attention. (B,T,C)
        x = self.ffwd(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# no need to pass vocab_size into the constructor, already defined as a
# global variable
model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
# ================================================================
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

"""
I ran this code it was giving me the train loss and val loss
and we see that we convert to somewhere around 2.5 with the bigram model 

and then here's
the sample that we produced at the end and so we have everything packaged up in
the script and we're in a good position now to iterate on this
"""

step 0: train loss 4.2022, val loss 4.2019
step 500: train loss 2.6144, val loss 2.6230
step 1000: train loss 2.4766, val loss 2.4768
step 1500: train loss 2.3985, val loss 2.3938
step 2000: train loss 2.3277, val loss 2.3451
step 2500: train loss 2.2955, val loss 2.3156
step 3000: train loss 2.2826, val loss 2.2922
step 3500: train loss 2.2455, val loss 2.2727
step 4000: train loss 2.2436, val loss 2.2459
step 4500: train loss 2.2292, val loss 2.2417

And they tridcowf,
The lay ble
bairet bube to tarvirt.

MBRCELTUS:
Far baparuus hith bubar dilth ane awith my.

HDER:
Ay onoth
Yowns, to uit he cove lind lincaes if ees, hain lat Heacl ov the and to pomant.

Wables lill dite litens;
Honcelly:
Augh aiss hit yevell nal nordopetelavle
Momtell, demet aklloal-nou wher eiibys to th dour warce hidend to-LOR:
Bhe the the danterth po so;
Ang. Wam:

EDI youled atw, af Pried my of.

HKING ERCKH:
Puis:
Arost Waced and to din cour ay aney Rry to chan thour y


"\nI ran this code it was giving me the train loss and val loss\nand we see that we convert to somewhere around 2.5 with the bigram model \n\nand then here's\nthe sample that we produced at the end and so we have everything packaged up in\nthe script and we're in a good position now to iterate on this\n"

#### No Comments

In [96]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4) # i.e. 4 heads of 8-dimensional self-attention
        self.ffwd = FeedFoward(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.sa_heads(x) # apply one head of self-attention. (B,T,C)
        x = self.ffwd(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.2022, val loss 4.2019
step 500: train loss 2.6144, val loss 2.6230
step 1000: train loss 2.4766, val loss 2.4768
step 1500: train loss 2.3985, val loss 2.3938
step 2000: train loss 2.3277, val loss 2.3451
step 2500: train loss 2.2955, val loss 2.3156
step 3000: train loss 2.2826, val loss 2.2922
step 3500: train loss 2.2455, val loss 2.2727
step 4000: train loss 2.2436, val loss 2.2459
step 4500: train loss 2.2292, val loss 2.2417

And they tridcowf,
The lay ble
bairet bube to tarvirt.

MBRCELTUS:
Far baparuus hith bubar dilth ane awith my.

HDER:
Ay onoth
Yowns, to uit he cove lind lincaes if ees, hain lat Heacl ov the and to pomant.

Wables lill dite litens;
Honcelly:
Augh aiss hit yevell nal nordopetelavle
Momtell, demet aklloal-nou wher eiibys to th dour warce hidend to-LOR:
Bhe the the danterth po so;
Ang. Wam:

EDI youled atw, af Pried my of.

HKING ERCKH:
Puis:
Arost Waced and to din cour ay aney Rry to chan thour y


### Residual Connections
- https://github.com/karpathy/ng-video-lecture/commit/5c3a2d299592603581129fdb14c5b06f5c50938c
- https://arxiv.org/abs/1512.03385

In [97]:
"""
we're starting to actually get like a pretty deep neural net and 
deep neural Nets uh suffer from optimization issues and I
think that's where we're kind of like slightly starting to run into 

so we need one more idea that we can borrow from
the um Transformer paper to resolve those difficulties 

now there are two optimizations that dramatically help
with the depth of these networks and make sure 
that the networks remain optimizable 

let's talk about the first one the first one in this diagram is 
you see this Arrow here and then this arrow and this Arrow those
are skip connections or sometimes called residual connections 

they come from this paper uh the
procedural learning form and recognition from about 2015. 
https://arxiv.org/abs/1512.03385

that introduced the concept now 
these are basically what it means is you transform the data but then you have
a skip connection with addition from the previous features 
now the way I like to visualize it that I prefer is the following 
here the computation happens from the top to bottom and
basically you have this uh residual pathway and 
you are free to Fork off from the residual pathway perform some

computation and then project back to the residual pathway via addition and so 
you go from the the inputs to the
targets only the plus and plus and plus and the reason this is useful is 
because during that propagation remember from
our micrograd video earlier addition distributes gradients equally to 
both of its branches that that fat as the input
and so the supervision or the gradients from the loss basically hop
through every addition node all the way to the input and then also Fork off into
the residual blocks 

but basically you have this gradient Super Highway 
that goes directly from
the supervision all the way to the input, unimpeded and then 
these virtual blocks are usually initialized in the beginning
so they contribute very very little if anything to the residual pathway 
they are initialized that way so in the
beginning they are sort of almost kind of like not there but then 
during the optimization they come online over time
and they start to contribute but at least at the initialization you can go
from directly supervision to the input gradient is unimpeded and 
just close and then the blocks over time kick in and so
that dramatically helps with the optimization 
"""

"\nwe're starting to actually get like a pretty deep neural net and \ndeep neural Nets uh suffer from optimization issues and I\nthink that's where we're kind of like slightly starting to run into \n\nso we need one more idea that we can borrow from\nthe um Transformer paper to resolve those difficulties \n\nnow there are two optimizations that dramatically help\nwith the depth of these networks and make sure \nthat the networks remain optimizable \n\nlet's talk about the first one the first one in this diagram is \nyou see this Arrow here and then this arrow and this Arrow those\nare skip connections or sometimes called residual connections \n\nthey come from this paper uh the\nprocedural learning form and recognition from about 2015. \nhttps://arxiv.org/abs/1512.03385\n\nthat introduced the concept now \nthese are basically what it means is you transform the data but then you have\na skip connection with addition from the previous features \nnow the way I like to visualize it that I 

In [98]:
"""
I took the code that we
developed in this Jupiter notebook and I converted it to be a script and 
I'm doing this because I just want to
simplify our intermediate work into just the final product that we have 
at this point

bigram.py

New additions
1. Enabled gpu - run on cuda
2. estimate_loss()
3. model.eval(), model.train() phases
4. @torch.no_grad() - 
"""

import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
# ================================================================
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000 # 3000
eval_interval = 500 # 300
learning_rate = 1e-3 # 1e-2

# added gpu capability if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

eval_iters = 200

n_embd = 32 # number of embedding dimensions
# ================================================================

torch.manual_seed(1337) # for reproducibility

# Read Data
# ================================================================ 
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# ================================================================

# Encoder and Decoder
# ================================================================ 
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# ================================================================

# Create Train and Test Splits
# ================================================================
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# ================================================================

# data loading - gets a batch of the inputs and targets
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

"""
this context manager torch.nograd and this is just telling pytorch 
that everything that happens
inside this function we will not call that backward on and 

so pytorch can be a
lot more efficient with its memory use because it doesn't have to store all 
the intermediate variables because we're
never going to call backward and 
so it can it can be a lot more memory efficient in that way

a good practice to tell Pi torch when we don't intend to do back propagation
"""
@torch.no_grad()
def estimate_loss():
    """
    it averages up the loss over multiple batches 
    so in particular 
    we're going to iterate invalider times and 
    we're going to
    basically get our loss and then we're going to get the average loss 
    for both splits and so this will be a lot less
    noisy

    when we call the estimate loss we're going to report the pretty
    accurate train and validation loss
    """
    out = {}
    model.eval() # setting phases
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # setting phases
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        """
        I want to do is I don't want to actually create I want to create like a
        level of interaction here where we don't directly go to the embedding 
        for the um logits but instead we go through this
        intermediate phase because we're going to start making that bigger
        """
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # self.sa_head = Head(n_embd)
        # self.sa_heads = MultiHeadAttention(4, n_embd//4) 
        # self.ffwd = FeedFoward(n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )
        # i.e. 4 heads of 8-dimensional self-attention
        
        self.lm_head = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        # x = self.sa_head(x) # apply one head of self-attention (B,T,C)
        # x = self.sa_heads(x) # apply one head of self-attention. (B,T,C)
        # x = self.ffwd(x) # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# no need to pass vocab_size into the constructor, already defined as a
# global variable
model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
# ================================================================
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

"""
I ran this code it was giving me the train loss and val loss
and we see that we convert to somewhere around 2.5 with the bigram model 

and then here's
the sample that we produced at the end and so we have everything packaged up in
the script and we're in a good position now to iterate on this
"""

step 0: train loss 4.6328, val loss 4.6313
step 500: train loss 2.3721, val loss 2.3673
step 1000: train loss 2.2588, val loss 2.2626
step 1500: train loss 2.1726, val loss 2.1961
step 2000: train loss 2.1308, val loss 2.1718
step 2500: train loss 2.0991, val loss 2.1479
step 3000: train loss 2.0615, val loss 2.1318
step 3500: train loss 2.0522, val loss 2.1113
step 4000: train loss 2.0198, val loss 2.1015
step 4500: train loss 1.9975, val loss 2.0942
step 4999: train loss 1.9896, val loss 2.0735

And they bridce.

SOROLOUS:
Ay, a selk of our tarther'ds me?
That suard that us hath buby, dilay a endway, my feanstar, zoknow
You some fuitio be this now
Whige miseets, hein latisely movets, and the now on you muself in you liet uprecce in the priness him you lord.
In Bookes, and whome:
Whed moth?

Ko Winso what eis as the modour fall ey, me sto-deal the the deard nubrupt for treagis! muft wity.

MUENTIUS:
Marred my of.

HKING ESLOUKES:
Wardads.
Wice age, thisin cour a save
Hiry the have for

"\nI ran this code it was giving me the train loss and val loss\nand we see that we convert to somewhere around 2.5 with the bigram model \n\nand then here's\nthe sample that we produced at the end and so we have everything packaged up in\nthe script and we're in a good position now to iterate on this\n"

#### No Comments

In [99]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.6328, val loss 4.6313
step 500: train loss 2.3721, val loss 2.3673
step 1000: train loss 2.2588, val loss 2.2626
step 1500: train loss 2.1726, val loss 2.1961
step 2000: train loss 2.1308, val loss 2.1718
step 2500: train loss 2.0991, val loss 2.1479
step 3000: train loss 2.0615, val loss 2.1318
step 3500: train loss 2.0522, val loss 2.1113
step 4000: train loss 2.0198, val loss 2.1015
step 4500: train loss 1.9975, val loss 2.0942
step 4999: train loss 1.9896, val loss 2.0735

And they bridce.

SOROLOUS:
Ay, a selk of our tarther'ds me?
That suard that us hath buby, dilay a endway, my feanstar, zoknow
You some fuitio be this now
Whige miseets, hein latisely movets, and the now on you muself in you liet uprecce in the priness him you lord.
In Bookes, and whome:
Whed moth?

Ko Winso what eis as the modour fall ey, me sto-deal the the deard nubrupt for treagis! muft wity.

MUENTIUS:
Marred my of.

HKING ESLOUKES:
Wardads.
Wice age, thisin cour a save
Hiry the have for

### Layer Normalization
- https://arxiv.org/abs/1607.06450

Implemented in PyTorch
- https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html

In [100]:
"""
layer Norm is very very similar to batch norm 

so remember back to our make more series part three we implemented
batch normalization and batch normalization basically just made sure that across the batch
Dimension any individual neuron had unit gaussian
distribution so it was zero mean and unit standard deviation one standard deviation output
so what I did here is I'm copy pasting The Bachelor 1D that we developed in our makemore series
and see here we can initialize for example this module and we can have a batch of 32 100 dimensional vectors
feeding through the bathroom layer so what this does is it guarantees
that when we look at just the zeroth column it's a zero mean one standard deviation
so it's normalizing every single column of this input now the rows are not going to be
normalized by default because we're just normalizing columns so let's now implement the layer Norm uh it's very
complicated look we come here we change this from 0 to 1. so we don't normalize
The Columns we normalize the rows and now we've implemented layer Norm
so now the columns are not going to be normalized but the rows are going to be normalized
for every individual example it's 100 dimensional Vector is normalized in this way and because our computation Now does
not span across examples we can delete all of this buffers stuff because we can
always apply this operation and don't need to maintain any running buffers so
we don't need the buffers we don't There's no distinction between
training and test time and we don't need these running buffers we do keep gamma and beta we don't need
the momentum we don't care if it's training or not and this is now a layer Norm
"""

"\nlayer Norm is very very similar to batch norm \n\nso remember back to our make more series part three we implemented\nbatch normalization and batch normalization basically just made sure that across the batch\nDimension any individual neuron had unit gaussian\ndistribution so it was zero mean and unit standard deviation one standard deviation output\nso what I did here is I'm copy pasting The Bachelor 1D that we developed in our makemore series\nand see here we can initialize for example this module and we can have a batch of 32 100 dimensional vectors\nfeeding through the bathroom layer so what this does is it guarantees\nthat when we look at just the zeroth column it's a zero mean one standard deviation\nso it's normalizing every single column of this input now the rows are not going to be\nnormalized by default because we're just normalizing columns so let's now implement the layer Norm uh it's very\ncomplicated look we come here we change this from 0 to 1. so we don't normalize\

#### Implementation

In [101]:
"""
Implement Layer Normalization
"""

class LayerNorm1d: # (used to be BatchNorm1d)
  
  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)
  
  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # batch mean
    xvar = x.var(1, keepdim=True) # batch variance
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out
  
  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [102]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [103]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))

#### Note

In [104]:
"""
before I incorporate the layer Norm I just wanted to note that as
I said very few details about the Transformer have changed in the 
last five years but this is actually something that slightly departs from the
original paper you see that the ADD and Norm is applied after the transformation

but um in now it is a bit more basically common to apply the layer Norm before
the transformation 

so there's a reshuffling of the layer Norms uh so 
this is called the pre-norm formulation
and that's the one that we're going to implement as well so slight deviation 
from the original paper 

basically we need two layer Norms layer
Norm one is an N dot layer norm and we tell it how many
um what is the embedding dimension and we need 
the second layer Norm and then here the layer rooms are
applied immediately on x so self.layer number one in applied on x
and salt on layer number two applied on X before it goes into sulfur tension 
and feed forward
and the size of the layer Norm here is an embeds of 32. so when the layer Norm
is normalizing our features it is the normalization here
happens the mean and the variance are taking over 32 numbers so the batch 
and the time act as batch Dimensions both of
them so this is kind of like a per token transformation that just normalizes the
features and makes them a unit mean unit gaussian at initialization
"""

"\nbefore I incorporate the layer Norm I just wanted to note that as\nI said very few details about the Transformer have changed in the \nlast five years but this is actually something that slightly departs from the\noriginal paper you see that the ADD and Norm is applied after the transformation\n\nbut um in now it is a bit more basically common to apply the layer Norm before\nthe transformation \n\nso there's a reshuffling of the layer Norms uh so \nthis is called the pre-norm formulation\nand that's the one that we're going to implement as well so slight deviation \nfrom the original paper \n\nbasically we need two layer Norms layer\nNorm one is an N dot layer norm and we tell it how many\num what is the embedding dimension and we need \nthe second layer Norm and then here the layer rooms are\napplied immediately on x so self.layer number one in applied on x\nand salt on layer number two applied on X before it goes into sulfur tension \nand feed forward\nand the size of the layer No

In [105]:
"""
I took the code that we
developed in this Jupiter notebook and I converted it to be a script and 
I'm doing this because I just want to
simplify our intermediate work into just the final product that we have 
at this point

bigram.py

New additions
1. Enabled gpu - run on cuda
2. estimate_loss()
3. model.eval(), model.train() phases
4. @torch.no_grad() - 
"""

import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
# ================================================================
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000 # 3000
eval_interval = 500 # 300
learning_rate = 1e-3 # 1e-2

# added gpu capability if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

eval_iters = 200

n_embd = 32 # number of embedding dimensions
# ================================================================

torch.manual_seed(1337) # for reproducibility

# Read Data
# ================================================================ 
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# ================================================================

# Encoder and Decoder
# ================================================================ 
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# ================================================================

# Create Train and Test Splits
# ================================================================
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
# ================================================================

# data loading - gets a batch of the inputs and targets
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

"""
this context manager torch.nograd and this is just telling pytorch 
that everything that happens
inside this function we will not call that backward on and 

so pytorch can be a
lot more efficient with its memory use because it doesn't have to store all 
the intermediate variables because we're
never going to call backward and 
so it can it can be a lot more memory efficient in that way

a good practice to tell Pi torch when we don't intend to do back propagation
"""
@torch.no_grad()
def estimate_loss():
    """
    it averages up the loss over multiple batches 
    so in particular 
    we're going to iterate invalider times and 
    we're going to
    basically get our loss and then we're going to get the average loss 
    for both splits and so this will be a lot less
    noisy

    when we call the estimate loss we're going to report the pretty
    accurate train and validation loss
    """
    out = {}
    model.eval() # setting phases
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # setting phases
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        """
        I want to do is I don't want to actually create I want to create like a
        level of interaction here where we don't directly go to the embedding 
        for the um logits but instead we go through this
        intermediate phase because we're going to start making that bigger
        """
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # self.sa_head = Head(n_embd)
        # self.sa_heads = MultiHeadAttention(4, n_embd//4) 
        # self.ffwd = FeedFoward(n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            nn.LayerNorm(n_embd),
        )
        # i.e. 4 heads of 8-dimensional self-attention
        
        self.lm_head = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        # x = self.sa_head(x) # apply one head of self-attention (B,T,C)
        # x = self.sa_heads(x) # apply one head of self-attention. (B,T,C)
        # x = self.ffwd(x) # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# no need to pass vocab_size into the constructor, already defined as a
# global variable
model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
# ================================================================
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

"""
I ran this code it was giving me the train loss and val loss
and we see that we convert to somewhere around 2.5 with the bigram model 

and then here's
the sample that we produced at the end and so we have everything packaged up in
the script and we're in a good position now to iterate on this
"""

step 0: train loss 4.3072, val loss 4.3054
step 500: train loss 2.3786, val loss 2.3743
step 1000: train loss 2.2618, val loss 2.2563
step 1500: train loss 2.1680, val loss 2.1906
step 2000: train loss 2.1240, val loss 2.1591
step 2500: train loss 2.0784, val loss 2.1302
step 3000: train loss 2.0517, val loss 2.1186
step 3500: train loss 2.0452, val loss 2.1062
step 4000: train loss 2.0192, val loss 2.0996
step 4500: train loss 1.9921, val loss 2.0934
step 4999: train loss 1.9838, val loss 2.0680

When before
will aff Our felt maderen buber weranth the led?
Thathe art this us hathert?
F dilay anessway, my feanstal mzorn heavens, tof is heart milend lib,
Whiire, sengein;
Stistlidrevens, and the now on you mes liling me littise, oncely spear; ais allw you:
That I mand and at gonour you, my thake onWindot her eignase, and dour was in him,
And Long encore
To king thrust for are
grean whit, dy ale of whipfierr?

KIS
But Her, be!
Athed is wards beaces and thising mustear tey Iry to chan you!

"\nI ran this code it was giving me the train loss and val loss\nand we see that we convert to somewhere around 2.5 with the bigram model \n\nand then here's\nthe sample that we produced at the end and so we have everything packaged up in\nthe script and we're in a good position now to iterate on this\n"

#### No Comments

In [106]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            nn.LayerNorm(n_embd),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.3103, val loss 4.3100
step 500: train loss 2.3808, val loss 2.3804
step 1000: train loss 2.2503, val loss 2.2550
step 1500: train loss 2.1566, val loss 2.1836
step 2000: train loss 2.1200, val loss 2.1594
step 2500: train loss 2.0701, val loss 2.1246
step 3000: train loss 2.0397, val loss 2.1207
step 3500: train loss 2.0342, val loss 2.1001
step 4000: train loss 2.0054, val loss 2.0900
step 4500: train loss 1.9905, val loss 2.0904
step 4999: train loss 1.9755, val loss 2.0625

And they bridce.

SOROROTES:
KING PANTIBbeed enaStirn'd the gatands:
Wanther us heart. Wardethat ane away, my feanstatue of my
Yout proof is heart milend lixcaes is ensen cin;
Stiselid ove the the me now on that spelplind me litles;
Honce by prupernisell why mold name.
Book this down'd
Is would thake of in on her eights would dour was genfience poor of his but that non this suke; ign the flow I male of whith Pried my of.

HKING ESLA,
So is wards.
Wices a for hoppecs.

DUKARD
HIONG EDWARD VES:

### Scaling Up the Model
- https://github.com/karpathy/ng-video-lecture/commit/482b15d53a3a330e7515d11dfb9702839dd5f586

Cosmetic Changes
- n_layer
- dropout

In [107]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
#open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))

10.788929 M parameters
step 0: train loss 4.2846, val loss 4.2820


KeyboardInterrupt: ignored

### Reference: Full Finished Code

You may want to refer directly to the git repo instead though.

In [108]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) 
    # when device becomes cuda up then we need to make sure 
    # that when we load the  data we move it to device
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")

        # Corrected in the tutorial
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device) 
# when we create the model we want to move the model parameters to device

# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# when I'm creating the context that feeds into generate 
# I have to make sure that I create on the device
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

0.209729 M parameters
step 0: train loss 4.4109, val loss 4.4016
step 100: train loss 2.6511, val loss 2.6604
step 200: train loss 2.4941, val loss 2.4903
step 300: train loss 2.3907, val loss 2.4023
step 400: train loss 2.3215, val loss 2.3287
step 500: train loss 2.2544, val loss 2.2712
step 600: train loss 2.1955, val loss 2.2051
step 700: train loss 2.1671, val loss 2.1900
step 800: train loss 2.1254, val loss 2.1529
step 900: train loss 2.0779, val loss 2.1124
step 1000: train loss 2.0565, val loss 2.0908
step 1100: train loss 2.0292, val loss 2.0837
step 1200: train loss 1.9997, val loss 2.0475
step 1300: train loss 1.9891, val loss 2.0360
step 1400: train loss 1.9549, val loss 2.0084
step 1500: train loss 1.9353, val loss 1.9998
step 1600: train loss 1.9234, val loss 2.0112
step 1700: train loss 1.9145, val loss 1.9927
step 1800: train loss 1.8807, val loss 1.9749
step 1900: train loss 1.8771, val loss 1.9624
step 2000: train loss 1.8524, val loss 1.9698
step 2100: train loss 1.

### Notes

#### Encoder, Decoder, Encoder-Decoder

In [109]:
"""
what we implemented here is a Decoder-only Transformer 
so there's no component here 
this part is called the Encoder and 
there's no Cross-Attention block here

our block only has a Self-Attention and the Feed Forward so it is missing this
third in between piece here 
this piece does Cross-Attention 
so we don't have it

and we don't have the Encoder 
we just have the Decoder and 
the reason we have a Dcoder-only
is because we are just generating text and it's unconditioned on anything 
we're just we're just blabbering on according
to a given data set 

what makes it a Decoder is that we are using the Triangular mask 
in our Transformer so 
it has this Auto regressive property 
where we can just go and sample from it

so the fact that it's using the Triangular triangular mask 
to mask out the attention makes it a Decoder and it
can be used for Language Modeling
"""

# French to English translation example:

# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>

"""
now the reason that the original paper had an Encoder-Decoder architecture is
because it is a machine translation paper so 
it is concerned with a different setting 
in particular it expects some tokens that encode say for example French
and then it is expected to decode the translation in English

typically these here are special tokens so you are expected 
to read in this and condition on it and
then you start off the generation with a special token called <START> 
so this is a special new token that you introduce and
always place in the beginning and then the network is expected 
to output "neural networks are awesome" and then a
special <END> token to finish a generation

so this part here will be decoded
exactly as we we've done it neural networks are awesome will 
be identical to what we did

but unlike what we did 
they want to condition the generation on some
additional information and in that case 
this additional information is the French sentence 
that they should be translating

so what they do now is 
they bring in the Encoder 
now the encoder reads this part here so 
we're only going to take the part of French and 
we're going to create tokens from it exactly as we've seen in our video and
we're going to put a Transformer on it but 
there's going to be no triangular mask and so all the tokens are allowed
to talk to each other as much as they want and 
they're just encoding whatever's the content of this French sentence 

once they've encoded it they've they basically come out in the
top here and then what happens here is in our Decoder 
which does the language modeling
there's an additional connection here to the outputs of the Encoder

and that is brought in through a Cross-Attention 
so the queries are still generated from X but now 
the keys and the values are coming from the side 
the keys and the values are coming from the top
generated by the nodes that came outside of the Encoder and 
those tops, the keys and the values
there the top of it , feed it on the side 
into every single block of the decoder and so 
that's why there's an additional Cross-Attention

and really what it's doing is 
it's conditioning the decoding not just on
the past of this current decoding 
but also on having seen the full 
fully encoded French prompt sort of and so 

it's an Encoder-Decoder model 
which is why we have those two Transformers 
an additional block and so on so 
we did not do this because we have
no we have nothing to encode there's no conditioning 
we just have a text file and 
we just want to imitate it and 
that's why we are using a decoder only
Transformer exactly as done in GPT
"""

'\nnow the reason that the original paper had an Encoder-Decoder architecture is\nbecause it is a machine translation paper so \nit is concerned with a different setting \nin particular it expects some tokens that encode say for example French\nand then it is expected to decode the translation in English\n\ntypically these here are special tokens so you are expected \nto read in this and condition on it and\nthen you start off the generation with a special token called <START> \nso this is a special new token that you introduce and\nalways place in the beginning and then the network is expected \nto output "neural networks are awesome" and then a\nspecial <END> token to finish a generation\n\nso this part here will be decoded\nexactly as we we\'ve done it neural networks are awesome will \nbe identical to what we did\n\nbut unlike what we did \nthey want to condition the generation on some\nadditional information and in that case \nthis additional information is the French sentence \nt

#### ChatGPT, GPT-3, Pre-training, Fine-tuning, RLHF

In [110]:
"""
so let's now bring things back to 
ChatGPT, GPT-3, Pretraining vs. Finetuning, RLHF

what would it look like if we wanted to train ChatGPT ourselves and 
how does it relate to what we learned today

well to train in ChatGPT there are roughly two stages:
(1) Pre-training
(2) Fine-tuning

First: Pre-training
In the pre-training stage we are training on a large chunk of internet and 
just trying to get a first Decoder-only Transformer to Babel text
so it's very very similar to what we've done ourselves 
except we've done like a tiny little baby pre-training step and so 
in our case uh this is how you print a number of parameters 

I printed it and it's about 10 million so this Transformer 
that I created here to create little Shakespeare um Transformer 
was about 10 million parameters 
our data set is roughly 1 million uh characters 
so roughly 1 million tokens 
but you have to remember that opening uses different vocabulary
they're not on the Character level they use these um subword chunks of words and
so they have a vocabulary of 50,000 roughly elements and 
so their sequences are a bit more condensed
so our data set the Shakespeare data set would be probably around 300,000 tokens
in the OpenAI vocabulary roughly so 
we trained about 10 million parameter
model and roughly 300,000 tokens 

when you go to the GPT-3 paper
and you look at the Transformers that they trained 
they trained a number of Transformers of
different sizes but the biggest Transformer here has 175 billion parameters 
so ours is again 10 million
they used this number of layers in the Transformer 
This is the n embed 
this is the number of heads and this is
the head size and then this is the batch size so ours was 65
and the learning rate is similar 

now when they train this Transformer they trained on 300 billion tokens
so again remember ours is about 300,000 so this is uh about a million fold
increase and this number would not be even that large by today's standards 
you'd be going up uh one trillion and
above so they are training a significantly larger model
on a good chunk of the internet and that is the pre-training stage 

but otherwise
these hyper parameters should be fairly recognizable to you and 
the architecture is actually like nearly identical to
what we implemented ourselves but 
of course it's a massive infrastructure challenge to train this 
you're talking about typically thousands of gpus having to you know
talk to each other 
to train models of this size so that's just a pre-training stage 

now after you complete the pre-training stage 
you don't get something that responds to
your questions with answers and is not helpful and Etc 
you get a document completer right so it babbles 
but it doesn't babbles Shakespeare 
it babbles the internet 
it will create arbitrary news articles and documents and 
it will try to complete documents because that's what it's trained for 
it's trying to complete the sequence so 
when you give it a question it would just uh
potentially just give you more questions 
it would follow with more questions 
it will do whatever it looks like the some closed document would do 
in the training data on the internet and 
so who knows you're getting kind of like undefined behavior 
it might basically answer with two questions with other questions 
it might ignore your question it might just
try to complete some news article it's totally underlined as we say 

Second: Fine-tuning
the second fine tuning stage is to actually align it to be an assistant and 
this is the second stage and so this ChatGPT blog post 
https://openai.com/blog/chatgpt
from OpenAI talks a little bit about how the stage is achieved 

we basically um
there's roughly three steps to  this stage

(1) Fine-tuning
so what  they do here is they start to collect training data that
looks specifically like what an assistant would do so 
you have documents that have the format where the question is on top and then 
an answer is below and 
they have a large number of these but 
probably not on the order of the internet 
this is probably on the
order of maybe thousands of examples and so 
they then fine-tuned the model to basically only focus on documents 
that look like that and so
you're starting to slowly align it so 
it's going to expect a question at the top and 
it's going to expect to complete the answer
and uh these very very large models are very sample efficient 
during their fine tuning so this actually somehow works
but that's just step one that's just fine-tuning

(2)
then they actually have more steps where okay the second step is
you let the model respond and then different Raters 
look at the different responses and rank them for their
preference as to which one is better than the other 

they use that to train a reward model so they can predict
basically using a different network 
how much of any candidate response would be
desirable and then once they have a reward model they run PPO 
which is a form of policy gradient reinforcement learning optimizer 
to fine-tune this sampling policy so
that the answers that  ChatGPT now generates, are expected to score a high
reward according to the reward model and so 

basically there's a whole the alining stage here or fine-tuning stage
it's got multiple steps in between there as well and 
it takes the model from  being a document completer to a question answerer and 
that's like a whole separate stage 

a lot of this data is not available publicly it is internal to OpenAI and 
it's much harder to replicate this stage um and 

so that's roughly what would give you a ChatGPT and 

nanoGPT focuses on the pre-training stage 
"""

"\nso let's now bring things back to \nChatGPT, GPT-3, Pretraining vs. Finetuning, RLHF\n\nwhat would it look like if we wanted to train ChatGPT ourselves and \nhow does it relate to what we learned today\n\nwell to train in ChatGPT there are roughly two stages:\n(1) Pre-training\n(2) Fine-tuning\n\nFirst: Pre-training\nIn the pre-training stage we are training on a large chunk of internet and \njust trying to get a first Decoder-only Transformer to Babel text\nso it's very very similar to what we've done ourselves \nexcept we've done like a tiny little baby pre-training step and so \nin our case uh this is how you print a number of parameters \n\nI printed it and it's about 10 million so this Transformer \nthat I created here to create little Shakespeare um Transformer \nwas about 10 million parameters \nour data set is roughly 1 million uh characters \nso roughly 1 million tokens \nbut you have to remember that opening uses different vocabulary\nthey're not on the Character level the

#### Conclusion

In [111]:
"""
so we trained to summarize a Decoder-only Transformer 
following this famous paper Attention is All You Need from 2017

and so that's basically a GPT 
we trained it on a Tiny Shakespeare and got sensible results 

all of the training code is roughly 200 lines of code 
I will be releasing this code base so also it comes with
all the git log commits along the way as we built it up 

in addition to this code I'm going to release the notebook of course 
the Google collab and I hope that gave you a sense for how
you can train um these models like say GPT-3 there will be 
architecturally basically identical to
what we have but they are somewhere between ten thousand and one million times 
bigger depending on how you count

and so that's all I have for now 
we did not talk about any of the fine-tuning
stages that would typically go on top of this so if you're interested 
in something that's not just language modeling but you actually want to you
know say perform tasks or you want them to be aligned in a specific way or you
want to detect sentiment or anything like that 
basically anytime you don't want something that's just a document completer 

you have to complete further stages of fine-tuning which we did not cover and 
that could be simple supervised fine-tuning (SFT) or 
it can be something more fancy 
like we see in chargept we actually train a reward model and then
do rounds of PPO to align it with respect to the reward model so 
there's a lot more that can be done on top of it 

I think for now we're starting to get to about two hours mark so I'm going to
um kind of finish here I hope you enjoyed the lecture and 

uh yeah go forth and transform, see you later
"""

"\nso we trained to summarize a Decoder-only Transformer \nfollowing this famous paper Attention is All You Need from 2017\n\nand so that's basically a GPT \nwe trained it on a Tiny Shakespeare and got sensible results \n\nall of the training code is roughly 200 lines of code \nI will be releasing this code base so also it comes with\nall the git log commits along the way as we built it up \n\nin addition to this code I'm going to release the notebook of course \nthe Google collab and I hope that gave you a sense for how\nyou can train um these models like say GPT-3 there will be \narchitecturally basically identical to\nwhat we have but they are somewhere between ten thousand and one million times \nbigger depending on how you count\n\nand so that's all I have for now \nwe did not talk about any of the fine-tuning\nstages that would typically go on top of this so if you're interested \nin something that's not just language modeling but you actually want to you\nknow say perform tasks 

## Dependencies

In [112]:
!pip install session-info

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [113]:
import session_info

session_info.show()