# Introduction to LLMs (in practice, a small language model)

## Generating text

Generative AI is the current buzz. GANs (Generative Adversarial Networks), for example, were the hype a few years ago (circa 2016) and produced impressive results thanks to growing compute resources (see [thispersondoesnotexist.com](https://thispersondoesnotexist.com/)).

For text, RNNs (Recurrent Neural Networks) came first, followed more recently by Transformers, which made LLMs (Large Language Models) possible.

Starting from the very basics, let's see how an algorithm can generate text.

There will be 5 steps : 

1. random letters 
2. n-grams
3. random words 
4. words based on their sequence frequencies 
5. a small LM built with a Transformer
 

*Jm Torres, August 2023, torrejm@fr.ibm.com*

# 1. Random letters 

Unsurprisingly, the result is very bad, but it is very easy to do, so let's do it.

Given a list of 26 characters, we can generate a sequence of characters, and even group them into words:

1. random characters
2. random characters grouped into words of random length
3. same as above, but with characters drawn according to their frequency in a given language
4. to make things a little better, a realistic distribution for the word lengths

In [None]:
# 1. draw letters at random
from random import randint
size = 200 # output size  
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] 

for i in range(size):
    print(chars[randint(0,len(chars)-1)], end ="")

In [None]:
# 2. build words of random length
#
from random import randint
size = 20 # size of the text (number of words)


chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] 
text = ""
for i in range(size): 
    l = randint(1,10) # max word length = 10 here.
    for i in range(l):
        text += chars[randint(0,len(chars)-1)]
    text += " "

print(text)

We can do a little better by drawing letters according to their frequency of occurrence in a given language:

https://fr.wikipedia.org/wiki/Fr%C3%A9quence_d%27apparition_des_lettres

In [None]:
# 3. characters drawn with a realistic frequency
from random import choices
size = 200 # number of characters
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] # list of characters
freq = [71,11,32,37,121,11,12,11,66,3,3,50, 26, 64, 50, 25, 6, 61,65,59, 45, 11, 2,4,5,2]


# produce a string of characters
print(''.join(choices(chars, weights = freq, k = size)))

In [None]:
# the characters can be grouped into words (of random lengths)
from random import choices, randint  # randint draws the word lengths below
size = 20 # number of words
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] 
freq = [71,11,32,37,121,11,12,11,66,3,3,50, 26, 64, 50, 25, 6, 61,65,59, 45, 11, 2,4,5,2 ]

text = ""
for i in range(size): 
    l = randint(1,10) 
    for i in range(l):
        text += choices(chars, freq)[0]
    text += " "

print(text)

In [None]:
# measure the distribution of word lengths in a text (the larger the text, the better)

import string

long = {}

with open('ibm.txt','r') as f:
    text = f.read()
    text = ' '.join(text.split()) # collapse runs of whitespace
    text = text.lower() 
    text = text.translate(str.maketrans('','', string.punctuation)) # suppresses punctuation
    mots = text.split() # make a list of words 

for mot in mots:
    long[len(mot)] = long.get(len(mot),0) + 1
    
long_mot = []
freq_mot = []

print("  length quantity")
print("-------- --------")
for k,v in long.items():
    if k <= 25:  # skip tokens longer than the longest French word (25 letters)
        print(f"{k:8.0f}  {v:7.0f}")
        long_mot.append(k)
        freq_mot.append(v)

lf = list(zip(long_mot, freq_mot))
ll,ff = [[i for i, j in sorted(lf)],
       [j for i, j in sorted(lf)]]

print(ll)
print(ff)

import matplotlib.pyplot as plt
plt.plot(ll, ff)
plt.show()

In [None]:
# 4. words whose lengths follow the measured distribution (ll and ff come from the previous cell)
from random import choices
size = 20 # number of words
chars = [c for c in 'abcdefghijklmnopqrstuvwxyz'] 
freq = [71,11,32,37,121,11,12,11,66,3,3,50, 26, 64, 50, 25, 6, 61,65,59, 45, 11, 2,4,5,2]
text = ""
for i in range(size): 
    l = choices(ll,ff)[0]
    for i in range(l):
        text += choices(chars, freq)[0]
    text += " "

print(text)

Letter frequencies (relative weights for a to z):

English:

`freq = [82,15,28,42,127,22,20,61,70,2,8,40, 24, 67, 75, 20, 9, 60,63,90,27, 10, 24,2,20,7,]`


French:

`freq = [71,11,32,37,121,11,12,11,66,3,3,50, 26, 64, 50, 25, 6, 61,65,59, 45, 11, 2,4,5,2]`

In any case, this is still far from looking like real text.


# 2. n-gram

We come back to letters, but do better this time by introducing the notion of context (a distant synonym of a prompt in this case).

(source https://github.com/alxndrTL/villes)
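To make the notion of context concrete, here is a minimal sketch of how a name can be cut into overlapping n-grams. The helper `ngrams` is hypothetical (it is not part of the notebook); it pads the name with `.` markers in the same spirit as the cells that follow.

```python
def ngrams(name, n):
    """Return the overlapping n-grams of `name`, padded with '.' start/end markers."""
    pad = "." * max(n - 1, 1)          # one '.' for unigrams and bigrams, two for trigrams
    padded = pad + name + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for n in (1, 2, 3):
    print(n, ngrams("paris", n))
```

In each n-gram, the first n-1 characters play the role of the context and the last character is the letter to predict.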

In [None]:
import random
import torch

In [None]:
# data loading (the file is assumed to be UTF-8 encoded)

with open('villes.txt', encoding='utf-8') as fichier:
    donnees = fichier.read()
villes = donnees.replace('\n', ',').split(',')

# data preparation

# add the '.' token at the start and end of each name (start and end signal)
for i, ville in enumerate(villes):
    villes[i] = '.' + ville + '.'

# build the vocabulary
vocabulaire = []

for ville in villes:
    for c in ville:
        if c not in vocabulaire:
            vocabulaire.append(c)

vocabulaire = sorted(vocabulaire)
# move '.' to index 0 by swapping it with the character currently there
# (with this data, index 0 is ' ' and index 3 is '.')
i_point = vocabulaire.index('.')
vocabulaire[0], vocabulaire[i_point] = vocabulaire[i_point], vocabulaire[0]

# char <-> int conversion tables
char_to_int = {}
int_to_char = {}

for i, c in enumerate(vocabulaire):
    char_to_int[c] = i
    int_to_char[i] = c

print(vocabulaire)

# Unigram or 1-gram

This time there is a tiny bit of "learning": we count the frequency of each letter across the 36,000 commune names.

Then we generate letters from these frequencies.

![alex1gram](1gram.png)

In [None]:
# dataset creation

# data matrix, of size (M, 1) for the unigram
# it simply holds every letter of every commune name

X = []

for ville in villes:
    for char in ville:
        X.append([char_to_int[char]])

X = torch.asarray(X) # (M, 1)


In [None]:
# unigram model
P = torch.zeros((len(vocabulaire))) # occurrence counts, then probabilities, for each letter

for i in range(X.shape[0]):
    P[X[i]] += 1 # increment the counter of each letter encountered

P = P / P.sum(dim=0, keepdim=True) # turn the counts into probabilities

In [None]:
g = torch.Generator().manual_seed(45)

for _ in range(10): # generate 10 city names
    # keep drawing letters until we hit ".", which marks the end
    nom = "."
    while nom[-1] != "." or len(nom) == 1:
        next_char = int_to_char[torch.multinomial(P, num_samples=1, replacement=True, generator=g).item()]
        nom = nom + next_char
    print(nom[1:-1])

In [None]:
# cost computation
# this time, we go through all the data
# reminder of the formula: cost = -mean(log p_model(next letter | context))
# the mean is taken over every letter of a commune name, and over all commune names

nll = 0
for i in range(X.shape[0]):
    nll += torch.log(P[X[i, 0]]) # log p_model(letter) (empty context here)
-nll/X.shape[0] # negated mean
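To get a feel for this cost, here is a minimal sketch in plain Python (no PyTorch). The toy data and the helper `mean_nll` are assumptions made for illustration only. It shows that a model giving the same probability to each of V symbols has a cost of log(V), the baseline that a model estimated by counting should beat.

```python
import math

def mean_nll(probs, data):
    """Average negative log-likelihood of `data` under the distribution `probs`."""
    return -sum(math.log(probs[c]) for c in data) / len(data)

# hypothetical toy data: a few short "names" flattened into a list of characters
data = list("abba.baab.")

# uniform model over the 3 symbols of this toy vocabulary
uniform = {c: 1 / 3 for c in "ab."}

# model estimated by counting, in the spirit of the unigram cell
counted = {c: data.count(c) / len(data) for c in set(data)}

print(mean_nll(uniform, data))  # equals log(3)
print(mean_nll(counted, data))  # lower (better) than the uniform baseline
```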

# Bigrams

![alex2gram](2gram.png)

In [None]:
# dataset creation

# data matrix, of size (M, 2) for the bigram
# letters are stored in pairs
# for example, from paris we build the examples ".p", "pa", "ar", "ri", "is", "s."

X = []

for ville in villes:
    for ch1, ch2 in zip(ville, ville[1:]):
        X.append([char_to_int[ch1], char_to_int[ch2]])

X = torch.asarray(X) # (M, 2)

In [None]:
# bigram model
P = torch.zeros((len(vocabulaire), len(vocabulaire))) # occurrence counts, then probabilities, for each pair

for i in range(X.shape[0]):
    P[X[i, 0], X[i, 1]] += 1 # increment the counter of each pair encountered

P = P / P.sum(dim=1, keepdim=True) # divide each row by its sum to get probabilities
# dimension 0 is the first letter of each pair (the context)
# dimension 1 is the second letter (the predicted letter)

# for example, P[char_to_int['a']] holds 45 numbers: the probability distribution over the letters that follow 'a'

In [None]:
#g = torch.Generator().manual_seed(42+4)

for _ in range(10):
    nom = "."
    while nom[-1] != "." or len(nom) == 1:
        last_char = nom[-1]
        next_char = int_to_char[torch.multinomial(P[char_to_int[last_char]], num_samples=1, replacement=True, generator=g).item()]
        nom = nom + next_char
    print(nom[1:-1])

In [None]:
# cost computation
# reminder of the formula: cost = -mean(log p_model(next letter | context))
# the mean is taken over every letter of a commune name, and over all commune names

nll = 0
for i in range(X.shape[0]):
    nll += torch.log(P[X[i, 0], X[i, 1]]) # log p_model(letter | context) with context = X[i, 0], letter = X[i, 1]
-nll/X.shape[0]

In [None]:
f"Number of parameters: {len(vocabulaire)**2}"  # one probability per (context, letter) pair

# Trigrams

In [None]:
# add one more '.' token at the start and end (names then begin and end with "..")
for i, ville in enumerate(villes):
    villes[i] = '.' + ville + "."

In [None]:
# dataset creation

X = [] # size (M, 3) for the trigram

# for paris, we build "..p", ".pa", "par", "ari", "ris", "is.", "s.."
# hence the need for an extra '.' at the start and at the end

for ville in villes:
    for ch1, ch2, ch3 in zip(ville, ville[1:], ville[2:]):
        X.append([char_to_int[ch1], char_to_int[ch2], char_to_int[ch3]])

X = torch.asarray(X) # (M, 3)

In [None]:
# trigram model
P = torch.zeros((len(vocabulaire), len(vocabulaire), len(vocabulaire))) # one probability for each triple of letters

for i in range(X.shape[0]):
    P[X[i, 0], X[i, 1], X[i, 2]] += 1

P = P / P.sum(dim=2, keepdim=True) # the last dimension is the next letter, the first 2 are the context letters

In [None]:
#g = torch.Generator().manual_seed(4354)

for _ in range(10):
    nom = ".."
    while nom[-1] != "." or len(nom) == 2:
        char_moins_1 = nom[-1]
        char_moins_2 = nom[-2]

        next_char = int_to_char[torch.multinomial(P[char_to_int[char_moins_2], char_to_int[char_moins_1]], num_samples=1, replacement=True, generator=g).item()]
        nom = nom + next_char

    print(nom[2:-1])

In [None]:
# loss
nll = 0
for i in range(X.shape[0]):
    nll += torch.log(P[X[i, 0], X[i, 1], X[i, 2]]) # log p_model(letter | context)
-nll/X.shape[0]

In [None]:
f"Number of parameters: {len(vocabulaire)**3}"  # one probability per (2-letter context, letter) triple

And so on: we could build 4-grams (about 4 million parameters with a 45-character vocabulary) and 5-grams (about 185 million), the table growing by a factor equal to the vocabulary size at each step.
We need something else:
- neural networks
- transformers:

![alexnn](nn.png)


![alex2tf](transformer.png)
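The growth of the n-gram tables can be checked with a few lines of Python. This is a small sketch: `ngram_table_size` is a hypothetical helper, and the vocabulary size of 45 is an assumption taken from the comment in the bigram cell.

```python
def ngram_table_size(vocab_size, n):
    """Number of entries in a full probability table for an n-gram model."""
    return vocab_size ** n

V = 45  # assumed vocabulary size (letters, space, punctuation and the '.' token)
for n in range(1, 6):
    print(f"{n}-gram: {ngram_table_size(V, n):,} entries")
```

Each extra letter of context multiplies the table size by the vocabulary size, while the amount of training data stays the same, so most entries of a large table are never observed.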

# 3. Random words: from a text


We will only get a mix of words from the text, which will probably not mean anything.


In [None]:
# random words
import string
from random import randint

long = 100 # number of words to generate
with open('ibm.txt','r') as f:
    text = f.read()
    text = ' '.join(text.split()) # collapse runs of whitespace
    text = text.lower() # lower case only
    text = text.translate(str.maketrans('','', string.punctuation)) # remove punctuation
    words = text.split() # make a list of words
    # print(words)

t = ""
for i in range(long): 
    t += words[randint(0,len(words)-1)] + " " # randint is inclusive on both ends

print(t)
    

In [None]:
# random words WITH text frequencies
import string
from random import choices

long = 50 # number of words to generate

with open('ibm.txt','r') as f:
    text = f.read()
    text = ' '.join(text.split()) # collapse runs of whitespace
    text = text.lower() # lower case only
    text = text.translate(str.maketrans('','', string.punctuation)) # remove punctuation
    words = text.split() # make a list of words
    # print(words)

    
#words = "bonjour tout le monde comment ca va le monde".split()
# compute the frequency of each word in the text
fr = {}
for word in words:
    fr[word] = fr.get(word,0) + 1 

#print(fr)    

mots = list(fr.keys())
freq = list(fr.values())
t = ""
for i in range(long): 
    t += choices(mots,freq)[0] + " "

print(t)

No grammar, no meaning, just randomness: we need something else.

# 4. More sophisticated : transform the text into a graph describing the associations between words

Generating text from grammatical rules is a hard problem, so other strategies have been designed that try to "learn" the structure of existing text and imitate it.

Let's see a simplistic example.

A text will be transformed into a weighted, directed graph:
- words are the VERTICES of the graph
- EDGES are links between vertices (words):
    - the graph is directed, meaning each edge has a direction ("eat" will be connected to "apple", but not "apple" to "eat").
    - the graph is weighted: each edge carries a numerical value which, in our case, reflects how often one word appears right after another (e.g. the value between "plane" and "fly" will be higher than the one between "elephant" and "dance").

With a large text we can create such a graph. For example, `je vais à la gare, je vais à la maison pour manger des pâtes, il mange des fruits, elle mange des fruits, je chante à la maison, je vais manger des fruits, à la maison je vais dormir` produces the graph:

![le graphe correspondant](graphe.png)


Once the graph is built, it can be used to hop from one word to the next in a random but not uniform manner: the weights of the outgoing edges are used as probabilities for choosing the next word.

We would need a very large text to hope for good results.

Let's do a small one...
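Before the class-based version, the idea can be sketched in a few lines with the standard library. This is only an illustration: the sample sentence is a shortened, hypothetical variant of the French example, and `edges` and `next_word` are names chosen for this sketch, not the notebook's own structures.

```python
import random
from collections import Counter, defaultdict

text = "je vais à la gare je vais à la maison je chante à la maison"
words = text.split()

# edges[w] counts how often each word follows w (directed, weighted edges)
edges = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    edges[current][following] += 1

print(dict(edges["la"]))  # successors of "la" with their weights
print(dict(edges["je"]))  # successors of "je" with their weights

def next_word(word, rng=random):
    """Draw a successor of `word`, using the edge weights as probabilities."""
    followers = edges[word]
    return rng.choices(list(followers), weights=list(followers.values()), k=1)[0]

print(next_word("je"))
```

A dictionary of counters is enough for a quick experiment; the classes defined next hold the same information in `Vertex` objects.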



### a. class definitions (Vertex and Graph) 

In [None]:
import random

class Vertex:
    ''' Class for the vertices (representing the words):
        Attributes:
        - value (str): the word itself
        - adjacent{}: dictionary of adjacent vertices (adjacent means: there is an edge)
            elements of this dict are {vertex_adjacent1:weight1, ...
            vertex_adjacentN:weightN}
            weights are integer values incremented each time the two words (vertices) are read
            in order in the training text
        - neighbors[]: list of adjacent vertices
        - neighbors_value[]: list of adjacent vertices values (words)
        - neighbors_weights[]: list of corresponding edge weights
        these lists are extracted from the adjacency dict when the latter is computed.
    '''
    def __init__(self,value):
        self.value = value  # the word itself
        self.adjacent = {}  # dictionary of adjacent vertices
        self.neighbors = [] # list of connected vertices (Vertex objects)
        self.neighbors_value = [] # list of the words of the connected vertices
        self.neighbors_weights = [] # list of the weights of the corresponding edges
           
    def increment_edge(self,vertex):
        """ increments edge weight between self and another vertex"""
        self.adjacent[vertex] = self.adjacent.get(vertex,0) + 1
    
    def get_adjacent_nodes(self):
        """ returns adjacent vertexes  """
        return self.adjacent.keys()
    
    def vertex_probability_map(self):
        """ compute the lists neighbors[] and neighbors_weights[]
            These lists are used to choose the next word in neighbors[] with the weights
            from neighbors_weights[]
            the list neighbors_value is only used by the __str__ method of class Graph
        """
        # start from empty lists so that calling this method twice does not duplicate entries
        self.neighbors, self.neighbors_value, self.neighbors_weights = [], [], []
        for (vertex,weight) in self.adjacent.items():
            self.neighbors.append(vertex)
            self.neighbors_value.append(vertex.value)
            self.neighbors_weights.append(weight)
            
    #def get_vertex_adjacent_data(self):
    #    """ returns the list of adjacent words and their weights
    #        this function is only used by the __str__ method
    #        of the Graph class """
    #    return self.neighbors_value, self.neighbors_weights
        
    def next_word(self):
        """ uses random.choices to draw with non uniform random.
            random.choices() returns a list of length k, we use k=1 
            we want one word, and we extract the vertex [0] """
        return random.choices(self.neighbors, weights = self.neighbors_weights, k = 1)[0]
    
    def __str__(self):
        """ this is for being able to use print """
        return self.value + " " +  ' '.join([node.value for node in self.adjacent.keys()])


class Graph:
    def __init__(self):
        """ graph is a dict of vertexes : {word:vertex}"""
        self.vertices = {} 
        
    def get_vertex_values(self):
        """ return all vertices of a Graph """
        return set(self.vertices.keys())
    
    def add_vertex(self, value):
        self.vertices[value] = Vertex(value)
        
    def get_vertex(self, value):
        if value not in self.vertices : 
            self.add_vertex(value)
        return self.vertices[value]
    
    def get_next_word(self,current_vertex):
        return self.vertices[current_vertex.value].next_word()
        
    def generate_probability_mappings(self):
        """ for each word/vertex we iterate trough the dict  
            to build the list of adjacent words with weights"""
        for vertex in self.vertices.values():
            vertex.vertex_probability_map()
            
    def __str__(self):
        """ this is to be able to print a Graph object"""
        p = "*** beginning of graph ***"  
        for k,v in self.vertices.items():
            p+= "\n--- word ->" + k + " -- possible next(s) ->" + str(v.neighbors_value) + \
            " -- weight ->" + str(v.neighbors_weights) + "\n"
            p+= "------"
        p += "\n***  end of graph   ***"
        return p
    
    #def __rts__(self):
    #    """ my print function for an object of this class"""
    #    p = "*** beginning of graph *** \n"  
    #    for e in self.get_vertex_values(): 
    #        p+= e + "->>" + str(g.get_vertex(e).get_vertex_adjacent_data()) + "\n"
    #    p += "***  end of graph  ***"
    #    return p
        

### b. graph construction

1. words extraction
2. graph construction

In [None]:
import string # to suppress punctuation 

def get_words_from_text(text_path):
    """ text load & transform: multiple spaces are suppressed, lower cases only, no punctuation."""
    with open(text_path,'r') as f:
        text = f.read()
        text = ' '.join(text.split()) # collapse runs of whitespace
        text = text.lower() # lower case only
        text = text.translate(str.maketrans('','', string.punctuation)) # remove punctuation
        words = text.split() # make a list of words
        return words
        
def make_graph(words):
    """ graph construction """
    g = Graph()
    previous_word = None
    # for each word from the text, if it is not in the dict 
    # we add it, and we get the vertex object
    for word in words:
        word_vertex = g.get_vertex(word)      
        # if there is a previous_word, we add a link if not exist yet 
        # if the edge did exist we increment its value 
        if previous_word:
            previous_word.increment_edge(word_vertex)
            
        previous_word = word_vertex
    
    # all the words have been processed: now generate the probabilities,
    # i.e. for each word, build the list of adjacent words and the list
    # of weights, so that the random draw (random.choices) can be performed
    g.generate_probability_mappings()
   
    return g

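def compose(g, words, length=50):
    """ compose the text, starting from a random word """
    composition = []
    word = g.get_vertex(random.choice(words))
    
    for _ in range(length):
        composition.append(word.value)
        if word.neighbors:
            word = g.get_next_word(word)
        else:
            # dead end (a word seen only at the very end of the text): restart from a random word
            word = g.get_vertex(random.choice(words))
        
    return composition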
        
# -1- get words from text

#words = get_words_from_text("hp1.txt")
words = get_words_from_text("ibm.txt")

#print(words,'\n\n')

# -2- make a graph using those words
    
g = make_graph(words)
   

### c. text generation

In [None]:
# -3- build a sequence of N words taken from words
#     following the structure of the graph g

composition = compose(g,words,200)
    
print(' '.join(composition)) # turn the list into a string


### This is far better, but still not satisfactory: we only use the link from one word to the next, which suggests that we need to use links between more than two words, i.e. sequences of words.



In [None]:
# out of curiosity, here is what the graph looks like:
# for each word of the text, we see a pair of lists ([],[]) 
# the first list contains all the words that follow the word in question
# the second contains the number of times each word of the first list 
# followed it. 

print(g)

# we can also see that for many words (how many exactly remains to be quantified), only a single successor was collected:
# the model is rather poor.
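One possible way to quantify this, sketched here on a hypothetical toy sentence rather than on the notebook's text: `single_successor_ratio` is an illustrative helper that returns the share of words (among those having at least one successor) that were only ever followed by a single distinct word.

```python
from collections import defaultdict

def single_successor_ratio(words):
    """Share of words that are followed by exactly one distinct word in `words`."""
    successors = defaultdict(set)
    for current, following in zip(words, words[1:]):
        successors[current].add(following)
    single = sum(1 for s in successors.values() if len(s) == 1)
    return single / len(successors)

sample = "the cat sat on the mat and the dog sat on the rug".split()
print(f"{single_successor_ratio(sample):.0%} of words have a single successor")
```

Calling the same function on the `words` list loaded from the text file would give the figure for the actual graph.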


# 5. Construction of a small GPT (Generative Pre-trained Transformer) 

The texts of Shakespeare:
https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Sections 1 and 3 take no account of the relationships between elements (tokens, i.e. pieces of text), section 2 only looks at one or two preceding letters, and section 4 only at the relationship between two immediate neighbors.

Here, the idea is precisely to take into account the links across a larger number of consecutive tokens. This is what the notion of attention refers to.

This concept is at the heart of the seminal 2017 article "Attention Is All You Need", which introduced the "Transformer": https://arxiv.org/abs/1706.03762
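As a rough illustration of attention, here is a minimal sketch of causal scaled dot-product attention written in plain Python, without PyTorch. The function names and the toy input are assumptions made for this example; real implementations work on batched tensors and learn separate query, key and value projections.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Scaled dot-product attention where position t only sees positions <= t.

    q, k and v are lists of T vectors (lists of floats) of the same dimension d.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # scores against the current and previous positions only (causal mask)
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # weighted average of the visible value vectors
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out

# hypothetical toy input: 3 tokens of dimension 2, used as queries, keys and values
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = causal_attention(x, x, x)
for row in result:
    print([round(val, 3) for val in row])
```

Each output vector is a mix of the current and earlier token vectors, which is how a token can take into account more than its immediate neighbor.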


Unlike most of the previous examples, the code that follows relies on a specialized library (PyTorch in this case) to build our Transformer. It comes from: `Youtube, Andrej Karpathy : Let's build GPT: from scratch, in code, spelled out.` 
The beginning is approachable but it gets much harder, and I ended up simply running the code and checking that it works.

In [None]:
!curl https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt > "input.txt" 

In [None]:
with open('input.txt', 'r', encoding='utf-8') as f: 
    text = f.read()

In [None]:
print(len(text))
print(text[:1000])

In [None]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

We will use tokens that are 1 character long, the simplest possible choice. (Google uses the SentencePiece tokenizer, whose tokens are pieces of words; OpenAI's GPT-2 uses 50,257 different tokens (pieces of words), which makes the encoded sequences shorter.)


In [None]:
# mapping char <-> int
# trivial character-level tokenizer
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s] # string to list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # list of int to string

print(encode("Hello every body"))
print(decode(encode("Hello every body")))

We will use a PyTorch tensor to store our sequences.

In [None]:
import torch
# if performance is an issue, the text size can be reduced here,
# when data is defined.
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

In [None]:
# split into training and validation datasets:
n = int(0.9*len(data)) 
# 90% for training, the rest for validation:
train_data = data[:n]
val_data = data[n:]

We are not going to feed the whole text into the transformer at once (too expensive to compute); we proceed by chunks, or blocks.

Let's look, for example, at the very first block we can extract from our text.

In [None]:
block_size = 8 
train_data[:block_size + 1]

In [None]:
#  In the block above there are 8 examples :  
#  with the context 18, next token is 47 
#  with the context 18 and 47 next token is 56 
#  with the context 18, 47 and 56  next token is 57
#  with the context 18, 47, 56 and 57, next token is 58 
# ... 
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size): 
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context}, then target is: {target}")

Generalizing, with block batches 

In [None]:
torch.manual_seed(1337)
batch_size = 4 # number of sequences processed in parallel
block_size = 8 # context length

# the 4 * 8 array contains 32 examples

def get_batch(split):
    """ generate a small batch of data: input x, target y"""
    data = train_data if split == 'train' else val_data
    
    ix = torch.randint(len(data) - block_size, (batch_size,)) # returns 4 (batch_size) random indexes
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x,y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target is: {target}")
    
    
# the tensor x is what goes into the transformer

### OK... let's cut it short (but with patience and a bit of courage, Andrej Karpathy's video is well worth watching)


In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000 # 5000 gives better results but takes longer; use 2000 if short on time
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs) with hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# despite its name, this class is the full Transformer model, not a simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings, followed by a stack of Transformer blocks
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

step 100: train loss 2.6567, val loss 2.6669


### not bad after 4 to 5 minutes of computation on a standard laptop (max_iters 2000 to 5000 depending on the time available)
### the model has about 200k parameters ... "real" LLMs are reaching 100 billion parameters
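The order of magnitude of the parameter count can be recovered by hand from the hyperparameters. The sketch below is plain Python; `gpt_param_count` is a hypothetical helper that mirrors the layers defined in the code above, and `vocab_size=65` is an assumption (the number of distinct characters expected in the tiny Shakespeare file).

```python
def gpt_param_count(vocab_size, n_embd, n_head, n_layer, block_size):
    """Count the trainable parameters of the small Transformer defined above."""
    head_size = n_embd // n_head
    # token embedding table + position embedding table
    embeddings = vocab_size * n_embd + block_size * n_embd
    # per block: key, query and value projections (no bias) for each head
    attention = n_head * 3 * n_embd * head_size
    # output projection of the multi-head attention (weights + bias)
    projection = n_embd * n_embd + n_embd
    # feed-forward: two linear layers with biases
    feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # two LayerNorms per block, each with a weight and a bias vector
    layer_norms = 2 * 2 * n_embd
    block = attention + projection + feed_forward + layer_norms
    # final LayerNorm + output linear layer (weights + bias)
    final = 2 * n_embd + (n_embd * vocab_size + vocab_size)
    return embeddings + n_layer * block + final

total = gpt_param_count(vocab_size=65, n_embd=64, n_head=4, n_layer=4, block_size=32)
print(f"{total / 1e6} M parameters")
```

Most of the parameters sit in the four Transformer blocks, and within each block the feed-forward layers account for about two thirds of the total.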


## Sources:


- bigrams: 12 Beginner Python Projects - Coding Course. FreeCodeCamp.org YouTube channel: https://www.youtube.com/watch?v=8ext9G7xspg
- transformer: Let's build GPT: from scratch, in code, spelled out. Andrej Karpathy's YouTube channel: https://www.youtube.com/watch?v=kCc8FmEb1nY
