<a href="https://colab.research.google.com/github/ochekroun/labs/blob/master/IFAGE_Cours_12_Generation_de_texte_avec_GPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with GPT-2 and KerasNLP

Adapted from https://keras.io/examples/generative/gpt2_text_generation_with_kerasnlp/

In [None]:
# Install KerasNLP, the Keras extension for natural language processing
!pip install git+https://github.com/keras-team/keras-nlp.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
import os
os.environ["KERAS_BACKEND"] = "torch"

import keras_nlp
import keras
import tensorflow as tf
import time

# Reduces memory usage
keras.mixed_precision.set_global_policy("mixed_float16")

In [None]:
# Use 128-token contexts instead of 1024 to make the model faster
# (at the cost of a more limited context size)
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)

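# Download the smallest GPT-2 model, which has 124.44M parameters
# See https://keras.io/api/keras_nlp/models/gpt2/gpt2_causal_lm/
# Pass the preprocessor explicitly; otherwise the model loads its default
# preprocessor and the 128-token setting above is never used
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en",
    # "gpt2_medium_en",  # larger alternative
    preprocessor=preprocessor,
)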

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/preprocessor.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/task.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/preprocessor.json...


In [None]:
gpt2_lm.generate("My trip to Switzerland was", max_length=128)

"My trip to Switzerland was very interesting, but I'm sure you've noticed the same thing. I was there to see the world and see what was going on around them. I didn't have the time to see the world, so I didn't get to see the people I was looking to see and the people who I wanted to see.\n\nThe tour was a little more difficult. I had some time to think about the world before I even started. I had some time to think about my life before I actually got here. I had to be able to see what was going on and see what was happening. I had"

In [None]:
gpt2_lm.generate("Ce restaurant italien est", max_length=128)

'Ce restaurant italien estancias en la vida.\n\nLa vez que se pueda en una ciabana, que esta una ciabana en una ciabana en una ciabana en una ciabana,\n\nCe ciabana estancias estancias en la vida.\n\nLa vida estancias estancias esse una ciabana.\n\nPuerto de una ciabana en una ciabana en un'

In [None]:
gpt2_lm.generate("", max_length=64)

'\nThe first of the two-part documentary series, "The New York Times Best Seller, " examines the impact of the housing crisis on the American economy. In this first episode, "The Times\' David Sirota and I look into the impact of the housing crisis on the American economy and the role the'

In [None]:
gpt2_lm.generate("List of countries and their capitals:Russia: Moscow, Switzerland: Bern, Finland:", max_length=32)

'List of countries and their capitals:Russia: Moscow, Switzerland: Bern, Finland: Helsinki, Germany: Berlin, Hong Kong, New Zealand: Wellington,'

## Transfer learning (fine-tuning) of GPT-2

In [None]:
import tensorflow_datasets as tfds

math_qa = tfds.load("math_qa", split="train")

In [None]:
for document in math_qa:
    print(document['Problem'])
    break

tf.Tensor(b'pascal has 96 miles remaining to complete his cycling trip . if he reduced his current speed by 4 miles per hour , the remainder of the trip would take him 16 hours longer than it would if he increased his speed by 50 % . what is his current speed w ?', shape=(), dtype=string)


In our case, we are performing next-word prediction with a language model, so we
only need the 'Problem' feature of each MathQA record.
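To make "next-word prediction" concrete, here is a minimal sketch of how a single tokenized problem can be turned into a training example: the labels are the inputs shifted by one position, and padding positions get a weight of 0 so they do not count in the loss or accuracy. The function name `make_causal_lm_example`, the integer token ids, and `pad_id=0` are hypothetical and only serve the illustration; in the notebook this step is handled inside the KerasNLP preprocessor, whose exact internals are not shown here.

```python
def make_causal_lm_example(token_ids, sequence_length, pad_id=0):
    """Build (inputs, labels, weights) for next-token prediction (toy version)."""
    # Keep one extra token so inputs and labels both have sequence_length entries.
    ids = list(token_ids)[: sequence_length + 1]
    n_real = len(ids)
    # Pad on the right up to sequence_length + 1 tokens.
    ids = ids + [pad_id] * (sequence_length + 1 - n_real)
    inputs = ids[:-1]   # what the model sees
    labels = ids[1:]    # the token that follows each input position
    # Weight 1 where the label is a real token, 0 where it is padding.
    weights = [1 if i + 1 < n_real else 0 for i in range(sequence_length)]
    return inputs, labels, weights


# Hypothetical token ids standing in for a short tokenized problem.
inputs, labels, weights = make_causal_lm_example([11, 12, 13, 14], sequence_length=6)
print(inputs)   # [11, 12, 13, 14, 0, 0]
print(labels)   # [12, 13, 14, 0, 0, 0]
print(weights)  # [1, 1, 1, 0, 0, 0]
```

Because every position supplies its own target, one text of n tokens yields up to n - 1 prediction tasks, which is why no separate label column is required.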

In [None]:
train_ds = (
    math_qa.map(lambda document: document['Problem'])
    .batch(4)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

In [None]:
# Train on only the first 100 batches (400 problems) to keep the run short
train_ds = train_ds.take(100)
num_epochs = 1

# Optional linearly decaying learning rate
# (disabled here: Adam below uses its default learning rate)
#learning_rate = keras.optimizers.schedules.PolynomialDecay(
#    5e-5,
#    decay_steps=train_ds.cardinality() * num_epochs,
#    end_learning_rate=0.0,
#)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

100/100 ━━━━━━━━━━━━━━━━━━━━ 143s 1s/step - accuracy: 0.2601 - loss: 0.1687


<keras.src.callbacks.history.History at 0x7ecc446faa70>
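The training cell keeps a linearly decaying learning rate commented out. As a rough sketch of what that schedule would compute, the function below reproduces a polynomial decay with power 1 in plain Python. The name `linear_decay` is hypothetical, the defaults mirror the commented-out values (5e-5 down to 0.0), and `decay_steps=100` assumes the 100 batches and single epoch used above.

```python
def linear_decay(step, initial_lr=5e-5, decay_steps=100, end_lr=0.0):
    """Learning rate after `step` optimizer updates under a linear decay."""
    # Clamp so the rate stays at end_lr once decay_steps is reached.
    step = min(step, decay_steps)
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left + end_lr


for step in (0, 25, 50, 100, 150):
    print(step, linear_decay(step))
```

With these values the rate starts at 5e-5, is halved by step 50, and reaches 0.0 at step 100. That starting value is well below Adam's default of 1e-3, which the cell above actually uses.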

In [None]:
gpt2_lm.generate("", max_length=64)

'a train is running at 60 kmph with its speed of 60 kmph . what is the least amount the train will be required to pass a man who is running at 60 mph ?'

In [None]:
print(gpt2_lm.generate("If a circle", max_length=128))

If a circle is cut by 50 cm , the radius of a circle is reduced by 5 km . what is the area of the triangle ?


## Choosing the next token: top-k

In [None]:
gpt2_lm.compile(sampler=keras_nlp.samplers.TopKSampler(k=5))
print(gpt2_lm.generate("I like basketball", max_length=128))

I like basketball and is a team of the NBA team is going to win 5 points . if a team is playing at 5 - inch and the team is playing at 5 - inch , what is the greatest possible probability that the team can play at 5 - inch ?


In [None]:
gpt2_lm.compile(sampler=keras_nlp.samplers.TopKSampler(k=5))
print(gpt2_lm.generate("I like basketball", max_length=128))

I like basketball at a certain team , the team will have a winning streak of 5 . the team will be the same team will be the team that will be played at 5 p and the team will be the team that is playing in 8 p . if the team is playing at 4 p , and the team is playing p to win 5 p , what is the team that has the team playing p ?


In [None]:
gpt2_lm.compile(sampler=keras_nlp.samplers.TopKSampler(k=1))
print(gpt2_lm.generate("I like basketball", max_length=128))

I like basketball is the least possible positive integer that is divisible by 5 ^ 2 + 5 ^ 2 + 5 ^ 2 + 5 ^ 2 + 5 ^ 2 + 5 ^ 2 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5


In [None]:
gpt2_lm.compile(sampler=keras_nlp.samplers.TopKSampler(k=1))
print(gpt2_lm.generate("I like basketball", max_length=128))

I like basketball is the least possible positive integer that is divisible by 5 ^ 2 + 5 ^ 2 + 5 ^ 2 + 5 ^ 2 + 5 ^ 2 + 5 ^ 2 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5
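The outputs above show two behaviours: with k=5 the same prompt gives different continuations on each run, while with k=1 both runs are identical and fall into a repetitive loop. A minimal sketch of top-k sampling in plain Python illustrates why. The function `top_k_sample` and the toy logits are hypothetical and are not the KerasNLP implementation; the sketch only assumes that the sampler keeps the k highest-scoring tokens and draws from a softmax over them.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Pick a token index among the k highest logits, proportionally to softmax."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # toy scores for a 5-token vocabulary
rng = random.Random(0)

greedy = [top_k_sample(logits, 1, rng) for _ in range(5)]
sampled = [top_k_sample(logits, 3, rng) for _ in range(5)]
print(greedy)   # k=1 always returns the highest-scoring token, index 0
print(sampled)  # k=3 returns a mix of indices 0, 2 and 1
```

With k=1 only one candidate survives, so the choice is deterministic (greedy decoding). Once a greedy model enters a cycle such as "+ 5 + 5" it has no way to leave it. Larger values of k trade that determinism for variety.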
