# Generating some text

Generative AI has been the buzzword since November 2022, when ChatGPT came out. Before that, GANs (Generative Adversarial Networks) had their hour of glory around 2016 and produced spectacular results as computing resources grew (see, for example, [thispersondoesnotexist.com](https://thispersondoesnotexist.com/)).

More recently, the Transformer architecture made it possible to build Large Language Models (LLMs). 

Let's have a look, starting with extremely naive attempts to build some text: 
 
0. random letters
1. random words 
2. words with frequencies of sequences
3. mini transformer (courtesy of Andrej Karpathy)


*Jm Torres, August 2023, torrejm@fr.ibm.com*

# 0. Random letters.

Of course, the results are awful, but it is very easy to do, so let's do it. 
With 26 characters, we can generate strings and try to make them look like text. 

1. absolutely random chars,
2. random chars grouped in "words" of random length,
3. random chars grouped in "words" of random length, but now using the actual letter frequencies of a given language,
4. then a realistic distribution of word lengths (measured on actual text).

In [None]:
# 1. Just a list of random characters
from random import randint
size = 200 # number of characters
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] # list of characters

# print `size` characters drawn uniformly at random
for i in range(size):
    print(chars[randint(0,len(chars)-1)], end ="")

In [None]:
# 2. Random chars with randomly placed blanks (random "words")
from random import randint
size = 20 # number of words to produce

#word_lengths = [1,2,3,4,5,6,7,8,9,10]
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] # list of characters
text = ""
for i in range(size): 
    l = randint(1,10) # in this example, words are at most 10 characters long
    for _ in range(l):
        text += chars[randint(0,len(chars)-1)]
    text += " "

print(text)

Next, we can use the frequency of each letter in real text. These frequencies are available here for a set of languages: 

https://fr.wikipedia.org/wiki/Fr%C3%A9quence_d%27apparition_des_lettres

(we could build this from a large text) 
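
As a rough sketch of how such a table could be built from a text, the snippet below counts the letters a to z and returns the counts in alphabetical order, which is the shape `random.choices` expects for its `weights` argument. The function name `letter_weights` and the sample sentence are placeholders, not part of the notebook, and accented letters are simply ignored here, which is a simplification for French.

```python
from collections import Counter
import string

def letter_weights(text):
    """Return 26 counts (a to z), usable as `weights` for random.choices."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return [counts[ch] for ch in string.ascii_lowercase]

# Placeholder text; in practice this would be a large corpus read from a file.
sample = "The quick brown fox jumps over the lazy dog"
weights = letter_weights(sample)
print(weights)
```

Only the relative sizes of the counts matter, so there is no need to convert them to percentages before passing them to `choices`.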

In [None]:
# 3. Characters are still drawn at random, but now with frequencies
#    resembling those of a real language (English here, French commented out).
from random import choices
size = 200 # number of characters
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] # list of characters
#freq = [71,11,32,37,121,11,12,11,66,3,3,50, 26, 64, 50, 25, 6, 61,65,59, 45, 11, 2,4,5,2 ] # fr
freq = [82,15,28,42,127,22,20,61,70,2,8,40, 24, 67, 75, 20, 9, 60,63,90,27, 10, 24,2,20,7,] # en

# draw `size` characters with the given weights and print them as one string
print(''.join(choices(chars, weights = freq, k = size)))

Then we cut at random to make words:

In [None]:
# 3. Same as above, cut into words of random length
from random import choices, randint
size = 20 # number of words
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] # list of characters
#freq = [71,11,32,37,121,11,12,11,66,3,3,50, 26, 64, 50, 25, 6, 61,65,59, 45, 11, 2,4,5,2 ]
freq = [82,15,28,42,127,22,20,61,70,2,8,40, 24, 67, 75, 20, 9, 60,63,90,27, 10, 24,2,20,7,]
text = ""
for i in range(size): 
    l = randint(1,10) # in this example, words are at most 10 characters long
    for _ in range(l):
        text += choices(chars, freq)[0]
    text += " "

print(text)

With a word-length distribution that is more realistic than uniform random:

In [None]:
# 4.a: build the distribution of word lengths
import string

long = {} # word length -> number of occurrences
# mots = "bonjour tout le monde comment ca va".split()
with open('hp.txt','r') as f:
    text = f.read()
    text = ' '.join(text.split()) # collapse repeated whitespace
    text = text.lower() # lowercase everything
    text = text.translate(str.maketrans('','', string.punctuation)) # strip punctuation
    mots = text.split() # list of words
    # print(mots)

for mot in mots:
    long[len(mot)] = long.get(len(mot),0) + 1
    
long_mot = []
freq_mot = []

print("  length quantity")
print("-------- --------")
for k,v in long.items():
    if k < 25:  # cutoff: length of the longest French word
        print(f"{k:8.0f}  {v:7.0f}")
        long_mot.append(k)
        freq_mot.append(v)

lf = list(zip(long_mot, freq_mot))
ll,ff = [[i for i, j in sorted(lf)],
       [j for i, j in sorted(lf)]]

print(ll)
print(ff)

import matplotlib.pyplot as plt
plt.plot(ll, ff)
plt.show()

In [None]:
# 4.b: word lengths drawn from the measured distribution (ll, ff come from cell 4.a)
from random import choices
size = 20 # number of words
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] # list of characters
freq = [82,15,28,42,127,22,20,61,70,2,8,40, 24, 67, 75, 20, 9, 60,63,90,27, 10, 24,2,20,7]

text = ""
for i in range(size): 
    l = choices(ll,ff)[0]
    for _ in range(l):
        text += choices(chars, freq)[0]
    text += " "

print(text)

For English, the frequencies to use would be (from a to z): 

`freq = [82,15,28,42,127,22,20,61,70,2,8,40, 24, 67, 75, 20, 9, 60,63,90,27, 10, 24,2,20,7,]`

For German: 


For Spanish: 


For French: 

`freq = [71,11,32,37,121,11,12,11,66,3,3,50, 26, 64, 50, 25, 6, 61,65,59, 45, 11, 2,4,5,2]`

... it still does not really look like a text ... 

# 1. Random words

With random letters we get nowhere: not a single valid word, and we cannot even tell whether it is English or another language... 

So let's start using words: uniformly random words (5.0), then words drawn according to their frequency (5.1). The result looks like English text but, as we could expect, it has absolutely no meaning. 

Randomness alone is not useful in this case...

In [None]:
# 5.0: words drawn uniformly at random from the text
import string
from random import choice

long = 100 # number of words to generate
# mots = "bonjour tout le monde comment ca va".split()
with open('hp.txt','r') as f:
    text = f.read()
    text = ' '.join(text.split()) # collapse repeated whitespace
    text = text.lower() # lowercase everything
    text = text.translate(str.maketrans('','', string.punctuation)) # strip punctuation
    words = text.split() # list of words
    # print(words)

t = ""
for i in range(long): 
    # choice() always picks a valid index, including the first and last word
    t += choice(words) + " "

print(t)
    

In [None]:
# 5.1: words drawn according to their frequency in the text
import string
from random import choices

long = 50 # number of words to generate

with open('hp.txt','r') as f:
    text = f.read()
    text = ' '.join(text.split()) # collapse repeated whitespace
    text = text.lower() # lowercase everything
    text = text.translate(str.maketrans('','', string.punctuation)) # strip punctuation
    words = text.split() # list of words
    # print(words)

    
#words = "bonjour tout le monde comment ca va le monde".split()
# count how often each word occurs in the text
fr = {}
for word in words:
    fr[word] = fr.get(word,0) + 1 

#print(fr)    

mots = list(fr.keys())
freq = list(fr.values())
t = ""
for i in range(long): 
    t += choices(mots,freq)[0] + " "

print(t)

# 2. Trying better: from an existing text, let's build a graph that counts, for each word, how often each other word follows it 

Generating text from grammatical rules is a difficult problem (very difficult in French), which is why other strategies have emerged, such as this one (bigrams). 


From text to graph (directed and weighted): 
- Vertices: words  
- Edges: 
    - directed: 'I' links to 'eat' (and not 'eat' to 'I')
    - weighted: proportional to the frequency of the link ("need -> money" is more frequent than "need -> hippopotamus")
    
From this text: `je vais à la gare, je vais à la maison pour manger des pâtes, il mange des fruits, elle mange des fruits, je chante à la maison, je vais manger des fuits, à la maison je vais dormir`, this graph can be built: 

![le graphe correspondant](graphe.png)


Once built, the graph can be used to generate a sequence of words: starting from a word, we repeatedly follow an edge to a next word, chosen at random according to the edge weights.

The larger the set of texts used to build the graph, the better the quality of the output. 

Let's try a small one ... 
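
Before the class-based version, the idea can be sketched with plain dictionaries. This is a minimal illustration and not the implementation used in this notebook: the helper names `build_bigram_counts` and `generate`, and the toy corpus, are made up for the example.

```python
from collections import Counter, defaultdict
import random

def build_bigram_counts(words):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def generate(followers, start, length, rng=random):
    """Walk the graph from `start`, picking each next word by edge weight."""
    out = [start]
    for _ in range(length - 1):
        options = followers.get(out[-1])
        if not options:  # dead end: this word never had a successor
            break
        out.append(rng.choices(list(options), weights=list(options.values()))[0])
    return out

# Hypothetical toy corpus, chosen to echo the 'I' / 'eat' / 'need' examples.
words = "i eat bread and i need money but i need money now".split()
followers = build_bigram_counts(words)
print(dict(followers["i"]))  # edge weights leaving the vertex 'i'
print(" ".join(generate(followers, "i", 8, random.Random(0))))
```

The dead-end check matters in practice: the last word of a corpus may have no outgoing edge, and `random.choices` raises an error on an empty population.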

### a. building the vertex and graph classes, with their methods

In [12]:
import random

class Vertex:
    ''' Class for the vertices of the graph, which represent the words
        Attributes: 
        - value (str): the word itself
        - adjacent{} : dictionary of adjacent vertices (linked by an edge)
            its items are {adjacent_vertex1: weight1, ... 
            adjacent_vertexN: weightN}
            Weights are integers, incremented each time the same link between 
            this vertex and adjacent_vertex_i is found in the text
        - neighbors[] : list of adjacent vertices
        - neighbors_value[] : list of the values (words) of the adjacent vertices
        - neighbors_weights[] : list of the weights of the corresponding edges
        These lists are extracted from the adjacency dictionary once it has 
        been built
            '''
    def __init__(self,value):
        self.value = value  # the word itself
        self.adjacent = {}  # dictionary of adjacent vertices
        self.neighbors = [] # list of linked vertices (Vertex objects)
        self.neighbors_value = [] # list of the words of the linked vertices
        self.neighbors_weights = [] # list of the weights of the linked vertices
           
    def increment_edge(self,vertex):
        """ increment the weight of the edge between self and another vertex """
        self.adjacent[vertex] = self.adjacent.get(vertex,0) + 1
    
    def get_adjacent_nodes(self):
        """ return the adjacent vertices """
        return self.adjacent.keys()
    
    def vertex_probability_map(self):
        """ build the lists neighbors and neighbors_weights
            neighbors[] and neighbors_weights[] are used to draw the next word 
            at random among the neighbors of self, with the corresponding 
            weights; neighbors_value is used by get_vertex_adjacent_data 
            and by the __str__ method of Graph"""
        for (vertex,weight) in self.adjacent.items():
            self.neighbors.append(vertex)
            self.neighbors_value.append(vertex.value)
            self.neighbors_weights.append(weight)
            
    #def get_vertex_adjacent_data(self):
    #    """ return the list of adjacent words and their weights
    #        this function is only used by the __str__ method 
    #        of the Graph class """
    #    return self.neighbors_value, self.neighbors_weights
        
    def next_word(self):
        """ use random.choices to draw among the neighbors, each neighbor 
            having the probability given by neighbors_weights
            (choices draws from a list according to a list of weights and 
            returns a list of length k, here 1, 
            which is why we take element [0]) """
        return random.choices(self.neighbors, weights = self.neighbors_weights, k = 1)[0]
    
    
    def __str__(self):
        """ lets print() be used on an object of this class"""
        return self.value + " " +  ' '.join([node.value for node in self.adjacent.keys()])


class Graph:
    def __init__(self):
        """ the graph is a dictionary of vertices: {word: vertex}"""
        self.vertices = {} 
        
    def get_vertex_values(self):
        """return the set of all words (vertex values) in the graph"""
        return set(self.vertices.keys())
    
    def add_vertex(self, value):
        self.vertices[value] = Vertex(value)
        
    def get_vertex(self, value):
        if value not in self.vertices : 
            self.add_vertex(value)
        return self.vertices[value]
    
    def get_next_word(self,current_vertex):
        return self.vertices[current_vertex.value].next_word()
        
    def generate_probability_mappings(self):
        ''' for each word of the graph, walk through its .adjacent dictionary 
            to build the lists of words and weights'''
        for vertex in self.vertices.values():
            vertex.vertex_probability_map()
            
    def __str__(self):
        """ lets print() be used on an object of this class"""
        p = "*** start of graph ***"  
        for k,v in self.vertices.items():
            p+= "\n--- word ->" + k + " -- next ->" + str(v.neighbors_value) + \
            " -- weight(s) ->" + str(v.neighbors_weights) + "\n"
            p+= "------"
        p += "***  end of graph  ***"
        return p
    
        

### b. building the graph:

1. words extraction
2. graph construction

In [None]:
import string # used to strip punctuation (could also be done by hand with a regex)

def get_words_from_text(text_path):
    """ load the text file and prepare it: 
            collapse repeated whitespace, lowercase, strip punctuation."""
    with open(text_path,'r') as f:
        text = f.read()
        text = ' '.join(text.split()) # collapse repeated whitespace
        text = text.lower() # lowercase everything
        text = text.translate(str.maketrans('','', string.punctuation)) # strip punctuation
        words = text.split() # list of words
        #print(words)
        return words
        
def make_graph(words):
    """ build the graph """
    g = Graph()
    previous_word = None
    # for each word of the text, add it to the graph if it is not there yet,
    # and fetch its vertex
    for word in words:
        word_vertex = g.get_vertex(word)      
        # if there was a previous word, add an edge from it to this word;
        # if the edge already existed, its weight is increased by 1
        if previous_word:
            previous_word.increment_edge(word_vertex)
            
        previous_word = word_vertex
    
        # once all words have been processed, generate the probabilities,
        # i.e. for each word build the list of neighbors and the list of 
        # weights, so that random.choices can draw among them
    g.generate_probability_mappings()
   
    #print(g)
    return g

def compose(g, words, length=50):
    """ compose the text, starting from a random word
        TODO: a one-word prompt could be supported by taking the starting 
        word as a parameter instead of choosing it at random """
    composition = []
    word = g.get_vertex(random.choice(words))
    
    for _ in range(length):
        composition.append(word.value)
        word = g.get_next_word(word)
        
    return composition
        
# -1- get words from text

#words = get_words_from_text("hp1.txt")
words = get_words_from_text("hp.txt")
print(len(words))

#print(words,'\n\n')

# -2- make a graph using those words
    
g = make_graph(words)


   

### c. generate some text

In [None]:
# -3- build a sequence of N words taken from words,
#     following the structure of the graph g

composition = compose(g,words,200)
    
print(' '.join(composition)) # turn the list into a string


Getting better, but still very bad. 

Indeed, we only use word-to-word frequencies; results should improve if we take into account the relations between words over longer sequences.
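
One simple way to look at longer sequences is to condition the next word on the last two words instead of only the last one. The sketch below is an illustration under that assumption, with a hypothetical helper `build_ngram_counts` and a made-up corpus; it is not what the rest of this notebook does.

```python
from collections import Counter, defaultdict

def build_ngram_counts(words, context_size=2):
    """Map each tuple of `context_size` consecutive words to a Counter of next words."""
    table = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        table[context][words[i + context_size]] += 1
    return table

# Hypothetical toy corpus.
words = "i need money and you need time but i need money".split()
one_word = build_ngram_counts(words, context_size=1)
two_words = build_ngram_counts(words, context_size=2)

print(dict(one_word[("need",)]))         # after "need" alone: two candidates
print(dict(two_words[("you", "need")]))  # after "you need": a single candidate
```

A longer context makes the prediction more specific, but each context is seen less often, so the table becomes sparse very quickly on a small corpus.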

In [None]:
# out of curiosity, here is what the graph looks like: 
# for each word of the text, we see a pair of lists ([],[]) 
# the first list contains the words that follow the word in question
# the second contains the number of times each word of the first list 
# followed it. 

print(g)

# we also see that for many words (to be quantified) only one follower was collected, 
# so the model is rather poor.


# 3. So let's build our own GPT (Generative Pre-trained Transformer) 

All of Shakespeare:  
https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


Accounting for relationships over many words is known as "attention".


The Transformer, an architecture built entirely on attention, was introduced in the seminal 2017 paper "Attention is all you need": https://arxiv.org/abs/1706.03762


The following code uses the PyTorch library and comes from Andrej Karpathy's YouTube video: Let's build GPT: from scratch, in code, spelled out. 

In [None]:
!curl https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt > "input.txt" 

In [16]:
with open('input.txt', 'r', encoding='utf-8') as f: 
    text = f.read()

In [None]:
print(len(text))
print(text[:1000])

In [None]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

We will use tokens that are one character long, the simplest possible choice. (Google uses the SentencePiece tokenizer, whose tokens are pieces of words; OpenAI's GPT-2 uses 50,257 distinct tokens, also pieces of words. With such vocabularies the encoded sequences are shorter.) 


In [None]:
# mapping char <-> int
# tokenizer
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s] # string to list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # list of int to string

print(encode("Bonjour tout le monde"))
print(decode(encode("Bonjour tout le monde")))

We will use a PyTorch tensor to store our sequences.

In [None]:
!pip install torch

In [None]:
import torch
# if performance is a problem, the text can be truncated here, 
# when data is defined.
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

In [22]:
# split into training and validation datasets:
n = int(0.9*len(data)) 
# first 90% for training, the rest for validation: 
train_data = data[:n]
val_data = data[n:]

We will not feed the whole text into the transformer at once (too expensive to compute); we proceed chunk by chunk, or block by block.

Let's look, for example, at the very first block we can extract from our text.

In [None]:
block_size = 8 
train_data[:block_size + 1]

In [None]:
# this block contains 8 examples: 
#  in the context of 18, the next token is 47
#  in the context of 18,47, the next token is 56
#  in the context of 18,47,56 the next token is 57
# ... 
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size): 
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context}, then target is: {target}")

Let's generalize to batches of blocks.

In [None]:
torch.manual_seed(1337)
batch_size = 4 # number of sequences processed in parallel
block_size = 8 # context length

# the 4 x 8 array contains 32 examples

def get_batch(split):
    """ generate a small batch of data: inputs x, targets y"""
    data = train_data if split == 'train' else val_data
    
    ix = torch.randint(len(data) - block_size, (batch_size,)) # batch_size (here 4) random start indices
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x,y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target is: {target}")
    
    
# the tensor x is what will be fed into the transformer

In [29]:
# a very simple neural network: 

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module): 
    
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        # idx and targets are both B,T tensors of integers
        logits = self.token_embedding_table(idx) # batch, time, channel B T C
        if targets is None: 
            loss = None
        else: 
            B, T, C = logits.shape
            logits = logits.view(B*T,C) # reshape: F.cross_entropy expects (N, C) logits and (N,) targets
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
    
        return logits, loss # score of what's come next 
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb,yb)
print(logits.shape)
print(loss) # for an untrained model we expect roughly -ln(1/65)
# the model is not trained yet, and being a bigram model it only uses the last token (logits[:, -1, :])
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
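
The `-ln(1/65)` comment in the cell above can be checked with a line of arithmetic. If an untrained model gave every one of the 65 characters the same probability, the cross-entropy would be exactly that value. The snippet assumes a vocabulary of 65 characters, as that comment does.

```python
import math

vocab_size = 65  # assumption: taken from the -ln(1/65) comment above
expected_loss = -math.log(1 / vocab_size)
print(f"{expected_loss:.4f}")  # 4.1744
```

The loss printed by the model is typically a little higher than this, because randomly initialized logits are not perfectly uniform.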

In [27]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32
for steps in range(100):
    xb,yb = get_batch('train')
    
    logits,loss = m(xb,yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
print(loss.item())

In [None]:
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=1500)[0].tolist()))

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

In [32]:
xbow = torch.zeros((B,T,C))  #bow : bag of words
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] #(t,C)
        xbow[b,t] = torch.mean(xprev,0)

In [None]:
x[0]

In [None]:
xbow[0] # running mean of the rows of x[0] shown above (second row = mean of the first two rows, ...)

The mathematical trick in self-attention:

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

We want $\text{xbow}[b,t] = \text{mean}_{i \le t}\; x[b,i]$

In [36]:
xbow = torch.zeros(B,T,C)
for b in range(B):
    for t in range(T): 
        xprev = x[b,:t+1] #(t,C)
        xbow[b,t] = torch.mean(xprev,0)

In [None]:
# toy example
torch.manual_seed(42)
a = torch.ones(3,3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('---')
print('b=')
print(b)
print('---')
print('c=')
print(c)
print('---')

# each row of c holds the column sums of b

In [None]:
# toy example using a lower-triangular matrix of ones, 
# which now sums a growing number of rows of b 
torch.manual_seed(42)
a = torch.tril(torch.ones(3,3))
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('---')
print('b=')
print(b)
print('---')
print('c=')
print(c)
print('---')

# each row of c is now the sum of the rows of b up to and including that row (a cumulative sum)

In [None]:
# toy example using a lower-triangular matrix of ones, 
# which sums a growing number of rows of b; 
# now we also normalize a so that each of its rows sums to 1
torch.manual_seed(42)
a = torch.tril(torch.ones(3,3))
a = a / torch.sum(a,1, keepdim = True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('---')
print('b=')
print(b)
print('---')
print('c=')
print(c)
print('---')

# each row of c is now the average of the rows of b up to and including that row (a running mean)

In [None]:
# back to the main construction
wei = torch.tril(torch.ones(T,T)) # wei : short for weights
wei = wei / wei.sum(1,keepdim=True)
wei

In [None]:
xbow2 = wei @ x # wei is (T, T), x is (B, T, C); PyTorch broadcasts to (B, T, T) @ (B, T, C) -> (B, T, C)
torch.allclose(xbow,xbow2)

In [None]:
print(xbow[0],"\n",xbow2[0])


In [None]:
# third version: use Softmax
tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T))
#wei = wei.masked_fill(tril == 0, float('-inf'))
tril

In [None]:
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei

In [None]:
wei = F.softmax(wei, dim=-1)
wei

In [None]:
# third version: use Softmax
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

xbow3 = wei @ x
torch.allclose(xbow,xbow3)

In [None]:
xbow3
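
Why does masking with `-inf` followed by a softmax give the same averaging weights as the normalized triangular matrix? Softmax exponentiates each score: `exp(-inf)` is 0, so masked (future) positions get zero weight, and the remaining equal scores share the probability evenly. The sketch below redoes this in plain Python, without PyTorch, for a hypothetical context length of 4.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

T = 4
NEG_INF = float("-inf")
# Row t keeps a score of 0 for positions 0..t and masks the future with -inf.
scores = [[0.0 if j <= t else NEG_INF for j in range(T)] for t in range(T)]
weights = [softmax(row) for row in scores]
for row in weights:
    print([round(w, 3) for w in row])
```

The advantage of the softmax form is that the zeros can later be replaced by data-dependent scores, while the mask still guarantees that no weight goes to future tokens.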

In [None]:
# version 4: self-attention
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
# we no longer want uniform weights:
# we want data-dependent information from the previous tokens
# each node will emit 2 vectors: 
# query: what am I looking for?  
# key: what do I contain?
# the affinity between tokens in the sequence is 
#  the dot product of my query with the keys of the other tokens in the sequence, 
#    restricted to the past
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)

wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape
# the video gives a good summary around the 1h09 mark

In [None]:
wei[0]

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This way, when the inputs Q and K have unit variance, `wei` has unit variance too, and Softmax stays diffuse instead of saturating. Illustration below.

In [53]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [None]:
print(k.var(), q.var(), wei.var())


In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

In [None]:
class LayerNorm1d: # (used to be BatchNorm1d)
  
  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)
  
  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # mean over the features of each sample
    xvar = x.var(1, keepdim=True) # variance over the features of each sample
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

In [None]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

In [None]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), with hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# small GPT-style model (the class keeps the name of the earlier bigram model)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings, then transformer blocks, then a linear head producing the logits
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

#### This is starting to look decent, with about 2 minutes of compute on my Mac and roughly 200,000 parameters. 

LLMs have on the order of 100 billion parameters, that is, on the order of a million times more.
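
The parameter count printed by the training cell can be reproduced by hand from the hyperparameters. The arithmetic below follows the layers defined in the code above and assumes a vocabulary of 65 characters, the value suggested by the `-ln(1/65)` comment earlier in the notebook.

```python
# Hyperparameters copied from the training cell above.
n_embd, n_head, n_layer, block_size = 64, 4, 4, 32
vocab_size = 65  # assumption: 65 distinct characters in the Shakespeare text

head_size = n_embd // n_head
attention = n_head * 3 * n_embd * head_size   # key, query, value (no bias)
projection = n_embd * n_embd + n_embd         # self.proj, with bias
feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)                # ln1 and ln2: weight + bias each
per_block = attention + projection + feed_forward + layer_norms

embeddings = vocab_size * n_embd + block_size * n_embd   # token + position tables
final = 2 * n_embd + (n_embd * vocab_size + vocab_size)  # ln_f + lm_head
total = embeddings + n_layer * per_block + final
print(total)  # 209729
```

Most of the parameters sit in the feed-forward layers of the four blocks. Against a model with 100 billion parameters, the ratio is close to half a million.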

## Sources:


- bigrams: 12 Beginner Python Projects - Coding Course. YouTube channel freeCodeCamp.org: https://www.youtube.com/watch?v=8ext9G7xspg
- transformer: Let's build GPT: from scratch, in code, spelled out. YouTube channel Andrej Karpathy: https://www.youtube.com/watch?v=kCc8FmEb1nY