# ML News

- Chatbot Arena: https://chat.lmsys.org/?arena

- LaMini Series: https://github.com/mbzuai-nlp/LaMini-LM
    
- OpenLLama: https://github.com/openlm-research/open_llama

- WizardVicuna: https://github.com/nlpxucan/WizardLM,  https://huggingface.co/junelee/wizard-vicuna-13b

- MPT-7 Series: https://www.mosaicml.com/blog/mpt-7b

- Open-Assistant Announces Plugins: https://open-assistant.io/chat/06458aa3-d660-763c-8000-fad0bb3cf277


---

# Tokens und Embeddings

## Data

In [75]:
# PREPARE DATA
import torch

# read words from file
words = open('./english_verbs.txt', 'r').read().splitlines()
# alternative dataset: https://github.com/dwyl/english-words
#words = open('./words_alpha.txt', 'r').read().splitlines()

print(words[:10])

print('num words: ', len(words)) 

# Prepare Alphabet
## count characters 
dataset_characters = []
for word in words:
    word_characters = list(word)
    dataset_characters.extend(word_characters)
distinct_characters = sorted(list(set(dataset_characters)))
print('len dictinct characters: ', len(distinct_characters))
print('distinct_characters: ', distinct_characters)

special_characters = ['_'] # changed to blank only for convenience reasons

# ngram characters = distinct characters + start token and end token -> + 2
num_characters = len(distinct_characters) + len(special_characters)
print('Num Characters: ', num_characters)

# create a character to index mapping because it is easier to work with indices when using tensor matrices -> every character gets assigned an index
character_to_index_map = {character:index for index, character in enumerate(distinct_characters)}
print(character_to_index_map)

# add our special characters that symbolize start and end of a word
character_to_index_map['_'] = 26
print(character_to_index_map)

# write characters into the cells to make it look more nicely
index_to_character_map = {index:character for character, index in character_to_index_map.items()}
print(index_to_character_map)

['abide', 'accelerate', 'accept', 'accomplish', 'achieve', 'acquire', 'acted', 'activate', 'adapt', 'add']
num words:  1041
len dictinct characters:  26
distinct_characters:  ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Num Characters:  27
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25}
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, '_': 26}
{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g', 7: 'h', 8: 'i', 9: 'j', 10: 'k', 11: 'l', 12: 'm', 13: 'n', 14: 'o', 15: 'p', 16: 'q', 17: 'r', 18: 's', 19: 't', 20: 'u', 21: 'v', 22: 'w', 23: 'x', 24: 'y',

# Loss Objective

How do we summarize the quality of our language model? -> Maximum Likelihood Approach


In [79]:
# Loss Function / Objective Function: 

## turn the bigram matrix into a probability matrix
# create a bigram matrix with rows = characters, columns = follow up characters
#bigram_matrix = torch.zeros((num_characters,num_characters), dtype=torch.int32) # we want to represent counts -> use integer, -> there are 26 characters in latin alphabet/roman alphabet LOWER CASED + start/end token, we could also figured that out by a character count ...
# model smoothing
bigram_matrix = torch.ones((num_characters,num_characters), dtype=torch.int32) # we want to represent counts -> use integer, -> there are 26 characters in latin alphabet/roman alphabet LOWER CASED + start/end token, we could also figured that out by a character count ...
#print(bigram_matrix)

# BIGRAM MODEL
for word in words: 
    word_characters = ['_'] + list(word) + ['_'] # create a list of all characters we have seen so far
    for character_1, character_2, in zip(word_characters, word_characters[1:]): # zip aligns 2 lists => zip/[a,b,c],[d,e,f]) -> [(a,d), (b,e), (c,f)] -> zip(smile, mile) -> [(s,m), (m,i), ...], zip halts once one list is finished
        index_1 = character_to_index_map[character_1]
        index_2 = character_to_index_map[character_2]
        bigram_matrix[index_1, index_2] += 1       
#print(bigram_matrix)
## compute probs
prob_matrix = bigram_matrix.float()
prob_matrix /= prob_matrix.sum(1, keepdim=True) # divide by row sum
#print(prob_matrix)

# COMPUTE QUALITY METRIC ON WHOLE DATASET 
log_likelihood = 0.0
n = 0
for word in words: 
    word_characters = ['_'] + list(word) + ['_'] # create a list of all characters we have seen so far
    for character_1, character_2, in zip(word_characters, word_characters[1:]): # zip aligns 2 lists => zip/[a,b,c],[d,e,f]) -> [(a,d), (b,e), (c,f)] -> zip(smile, mile) -> [(s,m), (m,i), ...], zip halts once one list is finished        
        index_1 = character_to_index_map[character_1]
        index_2 = character_to_index_map[character_1]
        prob = prob_matrix[index_1, index_2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n +=1
        #print(f'{character_1}{character_2}: {prob:.4f}')
        #print(f'{character_1}{character_2}: {logprob:.4f}')

#print(log_likelihood)
negative_log_likelihood = -log_likelihood
normalized_negative_log_likelihood = negative_log_likelihood/n
print(normalized_negative_log_likelihood) # average neg log likelihood

# -> summarize the quality of this model in a single number -> avg neg log likelihood loss over the training dataset

## -> idea: Likelihood = the product of the probabilities that make up a word
## Maximum Likelihood = the word that is most likely
## Maximum Log Likelihood -> used for convenience because likelihood gets very low when vocab increases and word size gets big ..

### -> just use the log transform on the probability product -> the less negative the number the better = the higher the probability of a word
### the idea is then to compute the log likelihood over the whole corpus -> and then take this as the evaluation  of our model = how well performs our model on our corpus which we assume as representative for our language
### if the model is perfect -> log likelihoood --> 0, if model is very bad --> neg. infinity

### alternative: sometimes negative log likelihood is used because then we want the loss to be good if it is low -> use this as loss function
### normalized negative log likelihood -> just normalize the average log likelihood per prediction in the dataset -> this tells us how sure am i on average on each prediction that i make. 
### -> this average/normalized neg log likelihood over our dataset gives us a quality of our model. the lower the better. 

### the goal in training is to minimize the avg. neg. log likelihood loss over our dataset 
### when using a neural network -> we can minimize this loss by updating the parameters of a neural network

tensor(4.3897)


# Neural Network Models

What is our objective? Predict the character in a sequence

First naive approach -> compress the look up table that we had before from counting statistics into the weights of a neural network -> let a neural network approximate this discrete function.

In [83]:
# TRAINING DATA
# create training set for neural network = all bigrams of our dataset
inputs, outputs = [], [] # (x,y)

for word in words: 
    word_characters = ['_'] + list(word) + ['_'] # create a list of all characters we have seen so far
    for character_1, character_2, in zip(word_characters, word_characters[1:]): # zip aligns 2 lists => zip/[a,b,c],[d,e,f]) -> [(a,d), (b,e), (c,f)] -> zip(smile, mile) -> [(s,m), (m,i), ...], zip halts once one list is finished        
        index_1 = character_to_index_map[character_1]
        index_2 = character_to_index_map[character_2]
        inputs.append(index_1)
        outputs.append(index_2)
    

# turn training data into tensors
inputs = torch.tensor(inputs)
outputs = torch.tensor(outputs)
print(inputs)
print(outputs)

tensor([26,  0,  1,  ..., 14, 14, 12])
tensor([ 0,  1,  8,  ..., 14, 12, 26])


In [85]:
import torch.nn.functional as F

# ONE HOT ENCODING of CHARACTERS for complete input vector
inputs_encoded = F.one_hot(inputs, num_classes=num_characters).float()
print(inputs_encoded[0])
print(inputs_encoded.shape)
outputs_encoded = F.one_hot(outputs, num_classes=num_characters).float()
print(outputs_encoded.shape)

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1.])
torch.Size([6961, 27])
torch.Size([6961, 27])


In [86]:
# SINGLE NEURON/PERCEPTRON

## -> single neuron neural network -> only 1 weight matrix W -> these weights are multiplied by inputs
W = torch.randn((num_characters,1)) # nn.Linear(input_dim, output_dim) > nn.Linear(27, 1)

# multiply input with weights -> vector matrix product (x,27)(27,1) -> (x,1) -> activation for every input
neuron_output = inputs_encoded @ W
print(neuron_output)

tensor([[ 0.1548],
        [-1.4748],
        [ 2.6216],
        ...,
        [ 1.2950],
        [ 1.2950],
        [-0.1914]])


In [87]:
# SINGLE LAYER

## -> single neuron neural network -> only 1 weight matrix W -> these weights are multiplied by inputs
W = torch.randn((num_characters, num_characters))

# multiply input with weights
inputs_encoded @ W #(x,27)(27,27) -> (x,27) # = interpret as distribution over output vocab 

# interpret outputs as log counts of log likelihood -> if output is interpreted as unnormalized probability distribution -> we just need to normalize it -> softmax (= exponentiate + norm)
logits = inputs_encoded @ W # log counts
# SOFTMAX: apply softmax to convert output logits to probabilities
counts = logits.exp() # this is equivalent to the prob matrix we retrieved statistically from counting in ngram models
probs = counts / counts.sum(1, keepdim=True) # normalize rows of the output 
print(probs)
print(probs.shape) 

# we interpret the output per row as the probability distribution over our vocabulary/character vocab
## next step = update W matrix/weight matrix such that loss objective gets minimal

# COMPUTE LOSS using this weight matrix
num_training_examples = 5 #probs.shape[0]
neg_log_likelihoods = torch.zeros(num_training_examples)
# step by step backpropagation
for i in range(num_training_examples):
    # get the i-th bigram
    x = inputs[i].item() 
    y = outputs[i].item()
    print('----------')
    print(f'training example {i}: {index_to_character_map[x]}, {index_to_character_map[y]} ')
    print('input to neural net: ', x)
    print('output from neural net: ', probs[i])
    print('label actual next character: ', y)
    p = probs[i, y]
    print('probability assigned by neural net to correct character: ', p.item())
    logp = torch.log(p)
    print('log likelihood: ', logp.item())
    nll = -logp
    print('negative log likelihood', nll.item())
    neg_log_likelihoods[i] = nll
    
print('----------')
print('average neg log likelihood/loss: ', neg_log_likelihoods.mean().item())

tensor([[0.0474, 0.0125, 0.0121,  ..., 0.0871, 0.1330, 0.0406],
        [0.0161, 0.0609, 0.0333,  ..., 0.0084, 0.0365, 0.0170],
        [0.0059, 0.0305, 0.0247,  ..., 0.0018, 0.0177, 0.1357],
        ...,
        [0.0255, 0.0250, 0.0102,  ..., 0.0170, 0.0333, 0.0170],
        [0.0255, 0.0250, 0.0102,  ..., 0.0170, 0.0333, 0.0170],
        [0.0418, 0.0409, 0.0746,  ..., 0.0078, 0.0677, 0.0192]])
torch.Size([6961, 27])
----------
training example 0: _, a 
input to neural net:  26
output from neural net:  tensor([0.0474, 0.0125, 0.0121, 0.0051, 0.0174, 0.0365, 0.0808, 0.0242, 0.0214,
        0.0336, 0.0229, 0.0166, 0.0086, 0.0370, 0.0185, 0.0222, 0.0059, 0.0262,
        0.0966, 0.0283, 0.0886, 0.0434, 0.0227, 0.0108, 0.0871, 0.1330, 0.0406])
label actual next character:  0
probability assigned by neural net to correct character:  0.04738028347492218
log likelihood:  -3.049549102783203
negative log likelihood 3.049549102783203
----------
training example 1: a, b 
input to neural net:  0
ou

In [None]:
# TRAINING

In [98]:
# inputs
print(inputs)

# outputs
print(outputs)

# SINGLE LAYER NN

## randomly initialize weights
W = torch.randn((num_characters, num_characters), requires_grad=True)

tensor([26,  0,  1,  ..., 14, 14, 12])
tensor([ 0,  1,  8,  ..., 14, 12, 26])


In [111]:
# FORWARD PASS
inputs_enc = F.one_hot(inputs, num_classes=num_characters).float()
# run input through weight layer
logits = inputs_enc @ W # predict log counts
# apply softmax on outputs = normalize to probs
counts = logits.exp() # 
probs = counts / counts.sum(1, keepdims=True)

print(probs.shape)
num_training_examples = probs.shape[0]
loss = -probs[torch.arange(num_training_examples), outputs].log().mean()
print(loss)

torch.Size([6961, 27])
tensor(3.5306, grad_fn=<NegBackward>)


In [112]:
# BACKWARD PASS

#print(W.data)
#print(W.shape)
#print(W.grad)


## initialize gradients -> set gradients to zero
W.grad = None 

## backpropagate the loss
loss.backward()

## print gradients of every weight
#print(W.grad.shape) # every element of W.grad tells us the influence of that weight on the loss function
#print(W.grad) # that means if the gradient is positive -> if you increase a weight with positive gradient -> you will increase the loss -> if you decrease it -> we will decrease the loss!

# UPDATE WEIGHTS
learning_rate = 10
W.data += -learning_rate * W.grad
# iteratively update the forward pass here 

In [116]:
# COMPLETE TRAINING LOOP

# CREATE DATA: create training set for neural network = all bigrams of our dataset
inputs, outputs = [], [] # (x,y)

for word in words: 
    word_characters = ['_'] + list(word) + ['_'] # create a list of all characters we have seen so far
    for character_1, character_2, in zip(word_characters, word_characters[1:]): # zip aligns 2 lists => zip/[a,b,c],[d,e,f]) -> [(a,d), (b,e), (c,f)] -> zip(smile, mile) -> [(s,m), (m,i), ...], zip halts once one list is finished        
        index_1 = character_to_index_map[character_1]
        index_2 = character_to_index_map[character_2]
        inputs.append(index_1)
        outputs.append(index_2)
    
    
# turn training data into tensors
inputs = torch.tensor(inputs)
outputs = torch.tensor(outputs)
#print(inputs)

num_epochs = 50
learning_rate = 100

# CREATE NEURAL NET
W = torch.randn((num_characters, num_characters), requires_grad=True) # same to torch.Linear(input_dim, output_dim)

# RUN TRAINING
print('Num Training Examples: ', inputs.shape[0])

for k in range(num_epochs):
    # forward pass
    inputs_enc = F.one_hot(inputs, num_classes=num_characters).float()
    logits = inputs_enc @ W
    # softmax
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    # compute avg neg log likelihood loss
    num_training_examples = probs.shape[0]
    loss = -probs[torch.arange(num_training_examples), outputs].log().mean() # + 1*(W**2).mean()
    print(loss.item())
    # backward pass
    W.grad = None # set gradients to zero
    loss.backward()
    # update weights
    W.data += -learning_rate * W.grad

Num Training Examples:  6961
3.670109510421753
3.1907551288604736
3.0170533657073975
2.7699875831604004
2.748635768890381
2.668529987335205
2.7009544372558594
2.573179006576538
2.616485118865967
2.558551549911499
2.6176836490631104
2.5132970809936523
2.571725845336914
2.513796806335449
2.580388069152832
2.4856348037719727
2.5498664379119873
2.4892945289611816
2.5588200092315674
2.469641923904419
2.536808490753174
2.4736220836639404
2.5445523262023926
2.459153175354004
2.528049945831299
2.4626822471618652
2.534363269805908
2.4517016410827637
2.5217092037200928
2.454589605331421
2.5267012119293213
2.446089029312134
2.516853094100952
2.448347568511963
2.520723819732666
2.4416754245758057
2.512979030609131
2.4433844089508057
2.5159342288970947
2.4380931854248047
2.509795665740967
2.439347982406616
2.5120184421539307
2.435116767883301
2.5071210861206055
2.436007261276245
2.5087671279907227
2.4325993061065674
2.5048372745513916
2.4332029819488525


In [74]:
# What is the best loss we can assume? -> bigram statistics. 

# Why should we do it using NNs then if we could use our bigram also? -> because this is significantly more scalable and flexible! 

## For bigrams we get a problem already when we want to compute 3-grams, 4-grams .. -> here we can just add more tokens to the input and predict the same matrix. 27*27*27...
## The problem from going away from n-gram models is really that the combinations of follow ups explodes exponentially!

## Another advantage of gradient based learning is implicit smoothing via smooth function approximation using regulization! -> counting statistics have to be smoothed for unlikely cases

In [123]:
# SAMPLING FROM NEURAL NET

num_words_to_sample = 5

for i in range(num_words_to_sample):
    out = []
    index = 26 # start with the blank symbol
    while True:
        x_encoded = F.one_hot(torch.tensor([index]), num_classes=num_characters).float() 
        logits = x_encoded @ W
        # softmax
        counts = logits.exp()
        probs = counts / counts.sum(1, keepdims=True)
        # sampling
        index = torch.multinomial(probs, num_samples=1, replacement=True).item()
        out.append(index_to_character_map[index])
        if index == 26:
            break
    print(''.join(out))

cl_
hit_
dl_
drered_
baigeeroudefow_


# Multi Layer Perceptron

---

![](https://www.researchgate.net/publication/354817375/figure/fig2/AS:1071622807097344@1632506195651/Multi-layer-perceptron-MLP-NN-basic-Architecture.jpg)

---

![](https://miro.medium.com/v2/resize:fit:1200/1*EqKiy4-6tuLSoPP_kub33Q.png)

---

In [48]:
# MULTI LAYER PERCEPTRON

# Idea: Lets go for bigger contexts! -> 4 gram model -> 3 tokens context predict the upcoming 4th token

# Tokens: elements of our fundamental set -> for us this is characters, but could also be words, subwords, ... 
# [a,b,c,d,_] -> [0,1,2,3,26] embedding -> indices

# CREATE DATASET for neural 4 gram model
context_length = 3
X, Y = [], [] 

for word in words[:4]: 
    context = [26] * context_length # _ _ _ word _
    for character in word + '_': 
        index = character_to_index_map[character]
        X.append(context)
        Y.append(index)
        print(''.join(index_to_character_map[i] for i in context), ' -> ', index_to_character_map[index])
        context = context[1:] + [index] # crop first context token and append current token as the new last one of the context
        
X = torch.tensor(X)
Y = torch.tensor(Y)
print(X[:4])
print(Y[:4])

___  ->  a
__a  ->  b
_ab  ->  i
abi  ->  d
bid  ->  e
ide  ->  _
___  ->  a
__a  ->  c
_ac  ->  c
acc  ->  e
cce  ->  l
cel  ->  e
ele  ->  r
ler  ->  a
era  ->  t
rat  ->  e
ate  ->  _
___  ->  a
__a  ->  c
_ac  ->  c
acc  ->  e
cce  ->  p
cep  ->  t
ept  ->  _
___  ->  a
__a  ->  c
_ac  ->  c
acc  ->  o
cco  ->  m
com  ->  p
omp  ->  l
mpl  ->  i
pli  ->  s
lis  ->  h
ish  ->  _
tensor([[26, 26, 26],
        [26, 26,  0],
        [26,  0,  1],
        [ 0,  1,  8]])
tensor([0, 1, 8, 3])


In [131]:
# EMBEDDINGS: Embedding Table -> condensed representation of tokens
emb_dim = 2
emb_layer = torch.randn((27,emb_dim)) 
# get emebbing of index 5
#print(emb_layer[5])
# but how to get it when using one hot encoding? 
character_embedding = F.one_hot(torch.tensor(5), num_classes=num_characters).float() @ emb_layer
#print(character_embedding)
# get embeddings of a list or tensor
character_embeddings = emb_layer[torch.tensor([5,6,7])]
#print(character_embeddings)
# or the embeddings of whole tensors
x_emb = emb_layer[X]
#print(x_emb)
print(x_emb.shape)

torch.Size([35, 3, 2])


In [132]:
# HIDDEN LAYER

# -> next week: MLPs and bigger context sizes. 

# Data Preprocessing

The Pile: https://pile.eleuther.ai/, https://huggingface.co/datasets/EleutherAI/pile

The Pile Paper: https://arxiv.org/pdf/2101.00027.pdf, https://arxiv.org/pdf/2201.07311.pdf

Llama Paper: https://arxiv.org/abs/2302.13971, https://arxiv.org/pdf/2304.14402.pdf, 

Red Pajama: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T, https://www.together.xyz/blog/redpajama, https://github.com/togethercomputer/RedPajama-Data, https://www.together.xyz/blog/redpajama-models-v1

Modellgröße + Datenformel: https://arxiv.org/abs/2304.03208, https://arxiv.org/abs/2203.15556, https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/

Visual Evaluation: https://github.com/nomic-ai/nomic

Quality Estimtimation: Smart Sampling/Statistical Tests + Human Eval, https://arxiv.org/abs/2302.13971, https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf

## Questions
- Wo kommen die Daten her? (etwas philosophischer: Macht die Aufteilung Sinn?)
- Wie groß kann ein darauf trainiertes Modell sein? (nach Chinchilla paper oder anderen Empfehlungen)
- Welche Schritte wurden im preprocessing durchlaufen? Sind die Daten dadurch tatsächlich "qualitativ hochwertig"?
- Wie lassen sie sich visualisieren? (Da wird wohl schon ein interaktives Dashboard mittels Meerkat bereitgestellt. Vielleicht damit mal ein bisschen rumprobieren...)

---