### Deep Learning - Tutorial 1
________________

#### Exercise 1: Learning FizzBuzz

In [1]:
# imports
import numpy as np
import torch

__Modeling__

We model the learning problem as a multi-class classification problem.

The possible outputs 0 / 1 / 2 / 3 correspond respectively to:
* the number itself,
* `fizz` for multiples of 3 (that are not multiples of 5),
* `buzz` for multiples of 5 (that are not multiples of 3),
* `fizzbuzz` for multiples of 15.

In [2]:
# data encoding : ground truth labels 
def fizz_buzz_encode(i):
    res = 0
    if i % 15 == 0: # fizzbuzz
        res = 3
    elif i % 5 == 0 : # buzz
        res = 2
    elif i % 3 == 0: # fizz
        res =  1
    return res

In [3]:
# data display (when we have a prediction 0/1/2/3 and we want to get the human-readable result)
def fizz_buzz (i , prediction):
    return [str(i), "fizz", "buzz", "fizzbuzz"][prediction]

# example
i, pred = 3, 1 # right prediction
print(f"Encoding for {i} is {pred}, corresponding to {fizz_buzz(i,pred)}.")

# more examples
for i in range(1, 20) : 
    print(i, fizz_buzz_encode(i), fizz_buzz (i , fizz_buzz_encode(i) ))


Encoding for 3 is 1, corresponding to fizz.
1 0 1
2 0 2
3 1 fizz
4 0 4
5 2 buzz
6 1 fizz
7 0 7
8 0 8
9 1 fizz
10 2 buzz
11 0 11
12 1 fizz
13 0 13
14 0 14
15 3 fizzbuzz
16 0 16
17 0 17
18 1 fizz
19 0 19


__How should the input data be encoded?__

We encode the integers in binary rather than in base 10.

*Tip: binary encoding with the Python bitwise operators >> and &*
* operator >> (bitwise right shift): shifts the binary representation d places to the right (padding with zeros on the left), cf. the animation at https://realpython.com/python-bitwise-operators/#right-shift 
* operator & (bitwise AND): applies a bit-by-bit AND, e.g. 13 & 1 = 1101 & 0001 = 0001

For example, 13 is written 1101 in binary. Applying the following steps in turn recovers its binary digits, from the least significant bit to the most significant (>> binds more tightly than &, so `13 >> d & 1` is read as `(13 >> d) & 1`):
    * 13 >> 0 & 1 = 1101  & 0001 = 1
    * 13 >> 1 & 1 = 0110  & 0001 = 0
    * 13 >> 2 & 1 = 0011  & 0001 = 1
    * 13 >> 3 & 1 = 0001  & 0001 = 1


In other words: 

> a >> n

gives the same value as

> np.floor(a/2**n)

and

> a & n

applies the AND operation bit by bit to the binary representations of a and n.
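The two identities above can be checked directly in plain Python. This is a minimal sketch; the helper names `shift_matches_floor_division` and `bits_lsb_first` are hypothetical and are not part of the notebook.

```python
def shift_matches_floor_division(a: int, n: int) -> bool:
    """True when a >> n equals the integer floor of a / 2**n."""
    return (a >> n) == a // (2 ** n)

def bits_lsb_first(a: int, num_digits: int) -> list:
    """Binary digits of a, least significant bit first."""
    return [(a >> d) & 1 for d in range(num_digits)]

# The shift identity holds for every pair tested here.
print(all(shift_matches_floor_division(a, n) for a in range(64) for n in range(8)))  # True

# Digits of 13 (1101 in binary), least significant bit first.
print(bits_lsb_first(13, 4))   # [1, 0, 1, 1]

# Bitwise AND with a mask other than 1: 1101 & 0110 = 0100.
print(format(13 & 6, "04b"))   # 0100
```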

In [4]:
number = 13
nb_digits = 4
decomp = [number >> d & 1 for d in range(nb_digits)]
print(decomp)
decomp.reverse()
print(decomp)

print(number & 1)
print(number >> 0)
print(number >> 1)
print(number >> 2)

[1, 0, 1, 1]
[1, 1, 0, 1]
1
13
6
3


In [5]:
# represent integers by their binary encoding

NUM_DIGITS = 10
def binary_encode (i , num_digits=NUM_DIGITS ) :
    return [i >> d & 1 for d in range(num_digits)]

for i in range(10):
    print(i, binary_encode(i)) 

# NB: the binary encoding is listed "in reverse" (least significant bit first), but this has no impact on the modeling or on the training

0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
3 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
4 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
5 [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
6 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
7 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
8 [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
9 [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
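To see that the least-significant-bit-first encoding loses no information as long as the integer fits in `NUM_DIGITS` bits, one can decode it back. The sketch below redefines `binary_encode` so that it is self-contained; `binary_decode` is a hypothetical helper introduced only for this check.

```python
NUM_DIGITS = 10  # same width as in the notebook

def binary_encode(i, num_digits=NUM_DIGITS):
    # least significant bit first, as in the notebook
    return [i >> d & 1 for d in range(num_digits)]

def binary_decode(bits):
    # hypothetical inverse of binary_encode: bit d has weight 2**d
    return sum(bit << d for d, bit in enumerate(bits))

# Every integer from 0 to 2**NUM_DIGITS - 1 survives the round trip.
round_trip_ok = all(binary_decode(binary_encode(i)) == i for i in range(2 ** NUM_DIGITS))
print(round_trip_ok)  # True

# 1024 needs an 11th bit, so its 10-bit encoding collapses to all zeros.
print(binary_encode(2 ** NUM_DIGITS))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

With 10 bits, only the integers 0 to 1023 have distinct encodings, which bounds the range of numbers usable as training examples.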


__Data preparation (train and test)__

In [6]:
start_train, end_train = 101, 1023
start_test, end_test = 1, 101

# train
X_train = torch.FloatTensor ([ binary_encode (i , NUM_DIGITS ) for i in range (start_train, end_train)])
Y_train = torch.LongTensor ([fizz_buzz_encode(i) for i in range(start_train, end_train)]).squeeze()

# test
X_test = torch.FloatTensor ([ binary_encode (i , NUM_DIGITS) for i in range (start_test, end_test) ])
print(X_test[:5])

tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 0., 1., 0., 0., 0., 0., 0., 0., 0.]])


__Model__

Here is a simple neural network architecture with a single hidden layer. You can experiment with other architectures, e.g. a different number of neurons in the hidden layer, or two hidden layers. Watch the layer dimensions!

In [7]:
# number of neurons in the hidden layer
NUM_HIDDEN = 100

# definition of the MLP with 1 hidden layer (ReLU non-linearity)
model = torch.nn.Sequential(
    torch.nn.Linear(NUM_DIGITS, NUM_HIDDEN),
    torch.nn.ReLU(),
    torch.nn.Linear(NUM_HIDDEN, 4)
    )

print(model)

Sequential(
  (0): Linear(in_features=10, out_features=100, bias=True)
  (1): ReLU()
  (2): Linear(in_features=100, out_features=4, bias=True)
)
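As one possible variant with two hidden layers, the sketch below chains the dimensions so that the output size of each `Linear` layer matches the input size of the next. The sizes `NUM_HIDDEN_1` and `NUM_HIDDEN_2` and the name `model_2h` are arbitrary assumptions, not values taken from the notebook.

```python
import torch

NUM_DIGITS = 10      # input width, as in the notebook
NUM_HIDDEN_1 = 100   # assumed size of the first hidden layer
NUM_HIDDEN_2 = 50    # assumed size of the second hidden layer

model_2h = torch.nn.Sequential(
    torch.nn.Linear(NUM_DIGITS, NUM_HIDDEN_1),
    torch.nn.ReLU(),
    torch.nn.Linear(NUM_HIDDEN_1, NUM_HIDDEN_2),  # in_features must equal NUM_HIDDEN_1
    torch.nn.ReLU(),
    torch.nn.Linear(NUM_HIDDEN_2, 4),             # 4 output classes
)

# A dummy batch of 8 inputs checks that the dimensions line up.
dummy_batch = torch.zeros(8, NUM_DIGITS)
out_shape = tuple(model_2h(dummy_batch).shape)
print(out_shape)  # (8, 4)

# Number of trainable parameters (weights and biases).
n_params = sum(p.numel() for p in model_2h.parameters())
print(n_params)   # 6354
```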


__Loss function__ 

What is the difference between CrossEntropyLoss and NLLLoss?

CrossEntropyLoss: according to the PyTorch docs, "The input is expected to contain the unnormalized logits for each class, *which do not need to be positive or sum to 1, in general*. " cf. https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss. It applies a log-softmax to the logits internally before computing the negative log likelihood.

NLLLoss: stands for "negative log likelihood loss". It expects log-probabilities as input (values less than or equal to 0), typically the output of a `LogSoftmax` layer, not raw logits.

In our MLP, no log-softmax is applied to the outputs of the last layer, so we use CrossEntropyLoss here rather than NLLLoss.
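As a quick check of how the two losses relate, the sketch below feeds random logits to `CrossEntropyLoss` and the log-softmax of the same logits to `NLLLoss`. The tensors are made-up illustration data, not outputs of the notebook's model.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(5, 4)               # 5 examples, 4 classes, arbitrary values
targets = torch.tensor([0, 1, 2, 3, 0])  # arbitrary class indices

# CrossEntropyLoss takes the raw logits.
ce = torch.nn.CrossEntropyLoss()(logits, targets)

# NLLLoss takes log-probabilities, obtained here with LogSoftmax.
log_probs = torch.nn.LogSoftmax(dim=1)(logits)
nll = torch.nn.NLLLoss()(log_probs, targets)

print(torch.allclose(ce, nll))       # True
print(bool((log_probs <= 0).all()))  # True: log-probabilities are never positive
```

Feeding raw logits to `NLLLoss` would run without error but would not compute a cross-entropy, which is why the choice of loss has to match the last layer of the model.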

In [8]:
# loss function
loss_fn = torch.nn.CrossEntropyLoss()

In [9]:
# optimizer --> choice of the learning rate 
optimizer = torch.optim.SGD(model.parameters(), lr = 0.05)

__Training__

Two limitations of this first version of the code: 
* we look at the predictions on the test set as training progresses, which is not a realistic (or sound) practice: the test set should only serve for the final evaluation
* it would be better to split the train set into a train set and a validation set, and to track the accuracy on the validation set instead of the test set

In [10]:
BATCH_SIZE = 128

raw_data_test = np.arange(1, 101) # test values

for epoch in range(10000):
    for start in range(0, len(X_train), BATCH_SIZE):
        end = start + BATCH_SIZE
        batchX = X_train[start:end]
        batchY = Y_train[start:end]

        # prediction and loss computation
        y_pred = model(batchX)
        loss = loss_fn(y_pred, batchY)
    
        # reset the gradients to 0 before the backward pass
        optimizer.zero_grad()
    
        # backpropagation
        loss.backward()
        optimizer.step()

    # compute the loss on the whole training set (and display it)
    loss = loss_fn( model(X_train), Y_train)
    if epoch%100 == 0:
        print('epoch {} training loss {}'.format(epoch, round(loss.item(), 3)))

    # display the results during training
    # (this should normally be done on the validation set)
    if(epoch%1000==0):
        Y_test_pred = model(X_test)
        val, idx = torch.max(Y_test_pred,1)
        ii=idx.data.numpy()
        # numbers = np.arange(1, 101)
        output = np.vectorize(fizz_buzz)(raw_data_test, ii)
        print(output)

epoch 0 training loss 1.19
['1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14' '15' '16'
 '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '30'
 '31' '32' '33' '34' '35' '36' '37' '38' '39' '40' '41' '42' '43' '44'
 '45' '46' '47' '48' '49' '50' '51' '52' '53' '54' '55' '56' '57' '58'
 '59' '60' '61' '62' '63' '64' '65' '66' '67' '68' '69' '70' '71' '72'
 '73' '74' '75' '76' '77' '78' '79' '80' '81' '82' '83' '84' '85' '86'
 '87' '88' '89' '90' '91' '92' '93' '94' '95' '96' '97' '98' '99' '100']


epoch 100 training loss 1.134
epoch 200 training loss 1.121
epoch 300 training loss 1.1
epoch 400 training loss 1.07
epoch 500 training loss 1.032
epoch 600 training loss 0.983
epoch 700 training loss 0.932
epoch 800 training loss 0.872
epoch 900 training loss 0.813
epoch 1000 training loss 0.736
['1' '2' 'fizz' '4' 'buzz' '6' '7' '8' '9' 'buzz' '11' '12' '13' '14' '15'
 '16' '17' 'fizz' '19' 'buzz' '21' '22' '23' '24' '25' '26' '27' '28' '29'
 '30' '31' '32' 'buzz' '34' 'buzz' '36' '37' '38' '39' 'buzz' '41' 'fizz'
 '43' '44' 'buzz' '46' '47' '48' '49' 'buzz' '51' '52' '53' '54' 'buzz'
 '56' '57' '58' '59' '60' '61' '62' '63' '64' 'buzz' 'fizz' '67' '68' '69'
 'fizz' '71' '72' '73' '74' '75' '76' '77' 'fizz' '79' '80' '81' '82' '83'
 '84' 'buzz' '86' '87' '88' '89' '90' '91' '92' '93' '94' '95' '96' '97'
 '98' '99' '100']
epoch 1100 training loss 0.661
epoch 1200 training loss 0.592
epoch 1300 training loss 0.522
epoch 1400 training loss 0.438
epoch 1500 training loss 0.376
epoch 1600

__Displaying the results__

In [11]:
# Final output (readable computation)
Y_test_pred = model(X_test)
print(Y_test_pred[:5])
val, idx = torch.max(Y_test_pred,1)
ii=idx.data.numpy()
print(ii[:5])
output = np.vectorize(fizz_buzz)(raw_data_test, ii)
print("============== Final result ============")
print(output)

# Final output (more compact computation of the predictions)
Y_test_pred = model(X_test)
predictions = zip(range(1, 101), list(Y_test_pred.max(1)[1].data.tolist()))
print("============== Final result ============")
print ([fizz_buzz(i, x) for (i, x) in predictions])

tensor([[ 6.1720, -0.4590, -1.1927, -6.0436],
        [ 5.5546, -2.5170,  0.1957, -4.3396],
        [-5.0009, 10.3592, -1.4913, -5.0316],
        [ 9.2224, -5.6584, -1.8103, -1.9594],
        [ 2.8647, -2.4296,  4.5885, -5.6959]], grad_fn=<SliceBackward0>)
[0 0 1 0 2]
['1' '2' 'fizz' '4' 'buzz' 'fizz' '7' '8' 'fizz' 'buzz' '11' '12' '13'
 '14' 'fizzbuzz' '16' '17' 'fizz' 'buzz' 'buzz' '21' '22' '23' 'fizz' '25'
 '26' 'fizz' '28' '29' 'fizz' '31' '32' 'fizz' '34' 'buzz' '36' '37'
 'buzz' 'fizz' 'buzz' '41' 'fizz' '43' '44' 'fizzbuzz' '46' '47' 'fizz'
 'fizz' 'buzz' 'fizz' '52' '53' 'fizz' 'buzz' '56' 'fizz' '58' '59'
 'fizzbuzz' '61' '62' 'fizz' '64' 'buzz' 'fizz' '67' '68' 'fizz' 'fizz'
 '71' 'fizz' '73' '74' 'fizzbuzz' '76' '77' 'fizz' '79' 'buzz' '81' '82'
 '83' '84' 'buzz' '86' '87' '88' '89' 'fizzbuzz' '91' '92' '93' '94'
 'buzz' '96' '97' '98' 'fizz' 'buzz']
['1', '2', 'fizz', '4', 'buzz', 'fizz', '7', '8', 'fizz', 'buzz', '11', '12', '13', '14', 'fizzbuzz', '16', '17', 'fizz', 'b
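The printed logits carry `grad_fn=<SliceBackward0>` because the forward pass was recorded by autograd. For inference only, a common alternative is to run the model inside `torch.no_grad()` and take the `argmax` of the logits. The sketch below uses an untrained stand-in model and placeholder inputs, so its predictions are meaningless; only the mechanics matter.

```python
import torch

torch.manual_seed(0)
# Stand-in with the same layer sizes as the notebook's MLP (untrained here).
model = torch.nn.Sequential(
    torch.nn.Linear(10, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 4),
)
X = torch.zeros(5, 10)  # placeholder inputs, not the real test set

with torch.no_grad():        # no autograd graph is built inside this block
    logits = model(X)
pred = logits.argmax(dim=1)  # predicted class index for each row

print(logits.requires_grad)  # False
print(tuple(pred.shape))     # (5,)
```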

__Computing performance (classification accuracy)__

In [12]:
Y_test = np.array([fizz_buzz_encode(i) for i in range(start_test, end_test)])
print(Y_test[:10])
print("\n\nTest acc: ", np.mean(Y_test == ii))

[0 0 1 0 2 1 0 0 1 2]


Test acc:  0.86
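An accuracy figure is easier to interpret next to a baseline, since the four classes are far from balanced. The sketch below counts the labels for the integers 1 to 100 and computes the accuracy of always predicting class 0. `per_class_accuracy` is a hypothetical helper, and the all-zero prediction vector is a stand-in for real model predictions.

```python
import numpy as np

def fizz_buzz_encode(i):
    # same labels as in the notebook: 0 number, 1 fizz, 2 buzz, 3 fizzbuzz
    if i % 15 == 0:
        return 3
    if i % 5 == 0:
        return 2
    if i % 3 == 0:
        return 1
    return 0

y_true = np.array([fizz_buzz_encode(i) for i in range(1, 101)])

# Number of examples in each class.
class_counts = np.bincount(y_true, minlength=4)
print(class_counts.tolist())   # [53, 27, 14, 6]

# Accuracy of a classifier that always answers "the number itself".
baseline_acc = float(np.mean(y_true == 0))
print(baseline_acc)            # 0.53

def per_class_accuracy(y_true, y_pred, num_classes=4):
    # fraction of correctly predicted examples within each true class
    return [float(np.mean(y_pred[y_true == c] == c)) for c in range(num_classes)]

print(per_class_accuracy(y_true, np.zeros_like(y_true)))  # [1.0, 0.0, 0.0, 0.0]
```

Replacing the all-zero vector with the model's predicted labels shows which classes account for the errors.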


__With a validation set!__

For example, we set aside 100 training examples to form a validation set.

In [13]:
NUM_VAL = 100 

# random selection
p = np.random.permutation(range(len(X_train)))
print(p[:5])

# train / val 
X_val , Y_val  = X_train [p[-NUM_VAL:],:] , Y_train [p[-NUM_VAL:]]
X_train , Y_train = X_train[p[:- NUM_VAL],:] , Y_train [p[:- NUM_VAL]]
print(len(X_val), len(X_train))


[440 873 317 353  41]
100 822
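Because the permutation is random, each run produces a different split. A possible way to make it reproducible is a seeded NumPy generator, sketched below on indices only. The seed value and the names `rng`, `val_idx` and `train_idx` are assumptions, not part of the notebook.

```python
import numpy as np

NUM_VAL = 100
n_examples = 922                     # len(range(101, 1023)), as in the notebook
rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen for illustration

perm = rng.permutation(n_examples)
val_idx, train_idx = perm[-NUM_VAL:], perm[:-NUM_VAL]

print(len(val_idx), len(train_idx))             # 100 822
print(len(np.intersect1d(val_idx, train_idx)))  # 0: the two sets are disjoint
```

Indexing the full tensors with `train_idx` and `val_idx` and storing the results under new names would also avoid shrinking `X_train` again if the cell is executed twice.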


In [14]:
# new training code

BATCH_SIZE = 128
# number of neurons in the hidden layer
NUM_HIDDEN = 100

# definition of the MLP with 1 hidden layer (ReLU non-linearity)
model = torch.nn.Sequential(
    torch.nn.Linear(NUM_DIGITS, NUM_HIDDEN),
    torch.nn.ReLU(),
    torch.nn.Linear(NUM_HIDDEN, 4)
    )

# loss function
loss_fn = torch.nn.CrossEntropyLoss()
# optimizer --> choice of the learning rate 
optimizer = torch.optim.SGD(model.parameters(), lr = 0.05)

for epoch in range(10000):
    for start in range(0, len(X_train), BATCH_SIZE):
        end = start + BATCH_SIZE
        batchX = X_train[start:end]
        batchY = Y_train[start:end]

        # prediction and loss computation
        y_pred = model(batchX)
        loss = loss_fn(y_pred, batchY)
    
        # reset the gradients to 0 before the backward pass
        optimizer.zero_grad()
    
        # backpropagation
        loss.backward()
        optimizer.step()

    # compute the loss on the whole training set (and display it)
    loss = loss_fn( model(X_train), Y_train)
    if epoch%100 == 0:
        print('epoch {} training loss {}'.format(epoch, round(loss.item(), 3)))

    # display the results during training
    # this time on the validation set
    if(epoch%1000==0):
        # train acc
        Y_train_pred = model(X_train)
        pred, idx = torch.max(Y_train_pred,1)
        train_labels = idx.data.numpy()
        print(train_labels[:10])
        print(Y_train.data.numpy()[:10])
        print(" train acc : " , round(np.mean(np.equal(Y_train.data.numpy(), train_labels)),3)) # TODO to fix
        # val acc
        Y_val_pred = model(X_val)
        pred, idx = torch.max(Y_val_pred,1)
        val_labels = idx.data.numpy()
        print(" val acc : " , round(np.mean(np.equal(Y_val.data.numpy(), val_labels )),3)) # TODO to fix

epoch 0 training loss 1.185
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 1]
 train acc :  0.544
 val acc :  0.46


epoch 100 training loss 1.116
epoch 200 training loss 1.102
epoch 300 training loss 1.076
epoch 400 training loss 1.043
epoch 500 training loss 0.988
epoch 600 training loss 0.925
epoch 700 training loss 0.852
epoch 800 training loss 0.77
epoch 900 training loss 0.682
epoch 1000 training loss 0.594
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 1]
 train acc :  0.787
 val acc :  0.69
epoch 1100 training loss 0.519
epoch 1200 training loss 0.454
epoch 1300 training loss 0.401
epoch 1400 training loss 0.355
epoch 1500 training loss 0.315
epoch 1600 training loss 0.279
epoch 1700 training loss 0.25
epoch 1800 training loss 0.226
epoch 1900 training loss 0.205
epoch 2000 training loss 0.187
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 1]
 train acc :  0.955
 val acc :  0.79
epoch 2100 training loss 0.17
epoch 2200 training loss 0.155
epoch 2300 training loss 0.143
epoch 2400 training loss 0.132
epoch 2500 training loss 0.123
epoch 2600 training loss 0.114
epoch 2700 training loss 0.106
epoch 2800 tr

__Performance on the test set__

In [15]:
# Final output
Y_test_pred = model(X_test)
print(Y_test_pred[:5])
val, idx = torch.max(Y_test_pred,1)
ii=idx.data.numpy()
print(ii[:5])
Y_test = np.array([fizz_buzz_encode(i) for i in range(start_test, end_test)])
print(Y_test[:10])
print("\n\nTest acc: ", np.mean(Y_test == ii))

tensor([[ 4.8325, -7.0029,  3.1564, -1.3612],
        [ 6.8665, -5.8519,  0.8981, -2.0299],
        [-3.2836,  6.0388,  0.2730, -2.9217],
        [ 4.0135, -2.7912,  5.5387, -5.3813],
        [-1.0287, -4.6837,  8.5919, -2.7576]], grad_fn=<SliceBackward0>)
[0 0 1 2 2]
[0 0 1 0 2 1 0 0 1 2]


Test acc:  0.96


__Experiment with other hyperparameters!__

* learning rate
* optimizer
* scheduler
* number of training samples
* architecture of the MLP (number of hidden units)
* number of epochs
* ...
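As a starting point for the optimizer and scheduler items, the sketch below swaps SGD for Adam and halves the learning rate every 10 epochs with `StepLR`. The learning rate, `step_size` and `gamma` are arbitrary assumptions, and the mini-batch loop is reduced to a placeholder so that only the schedule itself is shown.

```python
import torch

# Same layer sizes as the notebook's MLP.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 4),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # assumed learning rate
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

lrs = []
for epoch in range(30):
    # ... the mini-batch loop (forward pass, zero_grad, backward) would go here ...
    optimizer.step()   # placeholder: no gradients exist, so no parameter changes
    scheduler.step()   # called once per epoch, after the optimizer steps
    lrs.append(optimizer.param_groups[0]["lr"])

# Learning rate after epochs 1, 10, 20 and 30.
print(lrs[0], lrs[9], lrs[19], lrs[29])
```

Logging the learning rate alongside the training loss makes it easier to tell whether a plateau comes from the schedule or from the model.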