# Tokenization

In [1]:
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer()
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)
new_texts = ['bob ate pears', 'fred ate pears']
print(tokenizer.texts_to_sequences(new_texts))
print(tokenizer.word_index)

[[3, 1, 5], [6, 1, 5]]
{'ate': 1, 'apples': 2, 'bob': 3, 'and': 4, 'pears': 5, 'fred': 6}


## Tokenizer parameters

The Tokenizer object can be initialized with a number of optional parameters. By default, the Tokenizer lowercases the text, filters out punctuation (along with tab and newline characters), and splits on whitespace. You can specify custom filtering with the filters parameter, which takes a string; every character in that string is stripped from the text before it is split into words.
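As a rough illustration of what the filters parameter does, the pure-Python sketch below mimics the preprocessing step: lowercase the text, replace each filter character with a space, then split on whitespace. The names `simple_word_sequence` and `DEFAULT_FILTERS` are hypothetical and not part of Keras, and the filter string is written out on the assumption that it matches the Keras default.

```python
# Hypothetical constant: assumed to mirror the Keras default filter string.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True):
    """Lowercase, replace each filter character with a space, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.translate(table).split()

# Default filtering strips the comma
print(simple_word_sequence('Bob ate apples, and pears'))
# ['bob', 'ate', 'apples', 'and', 'pears']

# An empty filter string keeps punctuation attached to the word
print(simple_word_sequence('fred ate apples!', filters=''))
# ['fred', 'ate', 'apples!']
```

With an empty filter string, 'apples!' and 'apples' would count as two different vocabulary words, which is usually why the default filtering is left on.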

When a new text contains words not in the corpus vocabulary, those words are known as out-of-vocabulary (OOV) words. By default, the texts_to_sequences method silently drops all OOV words. However, if we want to represent each OOV word with a special vocabulary token (e.g. 'OOV'), we can initialize the Tokenizer with the oov_token parameter. The OOV token is then assigned ID 1, and every other word's ID shifts up by one.

In [3]:
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(
  oov_token='OOV')
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)
print(tokenizer.texts_to_sequences(['bob ate bacon']))
print(tokenizer.word_index)

[[4, 2, 1]]
{'OOV': 1, 'ate': 2, 'apples': 3, 'bob': 4, 'and': 5, 'pears': 6, 'fred': 7}
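The ID assignment and OOV handling shown in the output above can be approximated in plain Python. This is only a sketch of the observed behavior, not the Keras implementation; `build_word_index` and `to_sequence` are hypothetical helper names, and the input is assumed to be already split into lowercase words.

```python
from collections import Counter

def build_word_index(tokenized_texts, oov_token=None):
    """Assign IDs by descending frequency; ties keep first-seen order."""
    counts = Counter(word for words in tokenized_texts for word in words)
    vocab = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        vocab = [oov_token] + vocab  # the OOV token takes ID 1
    return {word: i for i, word in enumerate(vocab, start=1)}

def to_sequence(words, word_index, oov_token=None):
    """Map words to IDs; unknown words become the OOV ID or are dropped."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    ids = []
    for word in words:
        if word in word_index:
            ids.append(word_index[word])
        elif oov_id is not None:
            ids.append(oov_id)
    return ids

corpus = [['bob', 'ate', 'apples', 'and', 'pears'], ['fred', 'ate', 'apples']]

# With an OOV token, 'bacon' is kept as ID 1
index = build_word_index(corpus, oov_token='OOV')
print(to_sequence(['bob', 'ate', 'bacon'], index, oov_token='OOV'))  # [4, 2, 1]

# Without one, 'bacon' is dropped and the sequence gets shorter
print(to_sequence(['bob', 'ate', 'bacon'], build_word_index(corpus)))  # [3, 1]
```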


The num_words parameter lets us cap the number of vocabulary words used when converting text to sequences. Only words whose ID is below num_words are kept, so setting num_words=100 keeps the 99 most frequent words (ID 0 is reserved and never assigned to a word) and filters out the rest. Note that word_index still contains the full vocabulary; the cap is applied only when texts are converted. This can be useful when the text corpus is large and you need to limit the vocabulary size to increase training speed or prevent overfitting on infrequent words.

In [4]:
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=2)
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)

# with num_words=2, only word IDs below 2 are kept, i.e. just ID 1
# the most common word, 'ate', maps to ID 1; every other word
# (including 'apples', ID 2) is filtered out
# for the sentence 'bob ate pears', only 'ate' will be kept,
# so the only value in the token sequence will be 1
print(tokenizer.texts_to_sequences(['bob ate pears']))

[[1]]
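The cutoff can be pictured as a simple filter over word IDs. The sketch below assumes the rule that only IDs strictly below num_words survive, which is consistent with the `[[1]]` output above; `apply_num_words` is a hypothetical helper, not a Keras function.

```python
def apply_num_words(sequence, num_words=None):
    """Keep only IDs strictly below num_words (ID 0 is never assigned)."""
    if num_words is None:
        return list(sequence)
    return [word_id for word_id in sequence if word_id < num_words]

# 'bob ate pears' maps to [3, 1, 5] with the full vocabulary
print(apply_num_words([3, 1, 5], num_words=2))  # [1]
print(apply_num_words([3, 1, 5], num_words=4))  # [3, 1]
```

In practice this means that keeping the top N words requires passing num_words=N+1.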


# Embeddings

## 1. Introduction

* **Embeddings**: Vector representations of words or sentences that capture their semantic relationships. Embeddings are used in many NLP tasks, notably text classification, processing, and generation.

---

## 2. Using the Tokenizer object (TensorFlow/Keras)

The main tool used is the `tf.keras.preprocessing.text.Tokenizer` class. It automates several key steps:

* **Indexing**: Each vocabulary word is mapped to a unique integer ID, assigned by frequency (the most frequent words get the lowest IDs).
* **fit_on_texts**: This method scans the corpus to build the internal dictionary (the vocabulary).
* **texts_to_sequences**: This method converts a list of texts into lists of integers, replacing each word with its corresponding ID.

---

## 3. Parameters and Handling of Special Cases

The `Tokenizer` offers options to fine-tune how the data is processed:

* **Filtering**: By default, punctuation is removed and the text is lowercased to normalize it.
* **OOV (Out-Of-Vocabulary)**: When a new text contains a word absent from the fitted vocabulary, that word is dropped by default. The `oov_token` parameter replaces such unknown words with a special token (e.g. "OOV") so that the sentence structure is preserved.
* **num_words**: This parameter restricts sequences to the most frequent words: only IDs below `num_words` are kept, i.e. the top `num_words - 1` words. This helps reduce model complexity and avoid overfitting on rare words.

---

# Implementation: The `tokenize_text_corpus` function

Below is the completed code, along with a technical explanation of the transformation process.



The exercises

In [None]:
import tensorflow as tf

def get_window_indices(sequence, target_index, half_window_size):
    left_incl = max(0, target_index - half_window_size)
    right_excl = min(len(sequence), target_index + half_window_size + 1)
    return [left_incl, right_excl]

def get_target_and_size(sequence, target_index, window_size):
    target_word = sequence[target_index]
    half_window_size = window_size // 2
    return target_word, half_window_size

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Return the target word and the bounds [left_incl, right_excl) of its context window
    def get_target_and_context(self, sequence, target_index, window_size):
        target_word, half_window_size = get_target_and_size(
            sequence, target_index, window_size
        )
        left_incl, right_excl = get_window_indices(
            sequence, target_index, half_window_size)
        return target_word, left_incl, right_excl
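To see how the window is clipped at the edges of a sequence, here is a small standalone sketch. `window_bounds` is a hypothetical function that folds the two helpers above into one, and the sequence values are made up.

```python
def window_bounds(sequence, target_index, window_size):
    """Return (left_incl, right_excl) for the window centered on target_index."""
    half = window_size // 2
    left_incl = max(0, target_index - half)
    right_excl = min(len(sequence), target_index + half + 1)
    return left_incl, right_excl

sequence = [3, 1, 5, 2, 6]
print(window_bounds(sequence, 0, 5))  # (0, 3): no room on the left
print(window_bounds(sequence, 2, 5))  # (0, 5): full window
print(window_bounds(sequence, 4, 5))  # (2, 5): no room on the right
```

The right bound is exclusive, so `sequence[left_incl:right_excl]` yields the target word together with its context words.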

skip-gram

In [6]:

import tensorflow as tf

class EmbeddingModel(object):
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    def tokenize_text_corpus(self, texts):
        self.tokenizer.fit_on_texts(texts)
        sequences = self.tokenizer.texts_to_sequences(texts)
        return sequences

    def get_target_and_context(self, sequence, target_index, window_size):
        target_word = sequence[target_index]
        half_window_size = window_size // 2
        left_incl = max(0, target_index - half_window_size)
        right_excl = min(len(sequence), target_index + half_window_size + 1)
        return target_word, left_incl, right_excl
    
    def create_target_context_pairs(self, texts, window_size):
        pairs = []
        # 1. Convert the texts into integer sequences
        sequences = self.tokenize_text_corpus(texts)

        for sequence in sequences:
            for i in range(len(sequence)):
                # 2. Get the window bounds for each target word
                target_word, left_incl, right_excl = self.get_target_and_context(
                    sequence, i, window_size)

                # 3. Build the (target, context) pairs
                for j in range(left_incl, right_excl):
                    # Never pair the target word with itself
                    if j != i:
                        pairs.append((target_word, sequence[j]))
        return pairs
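For a concrete picture of the pairs this produces, the sketch below applies the same window logic to a single, already tokenized sequence. `skipgram_pairs` is a hypothetical standalone function and the IDs are made up; it skips the tokenizer so that it runs without TensorFlow.

```python
def skipgram_pairs(sequence, window_size):
    """Build (target, context) pairs for one sequence of word IDs."""
    half = window_size // 2
    pairs = []
    for i, target in enumerate(sequence):
        left = max(0, i - half)
        right = min(len(sequence), i + half + 1)
        # Pair the target with every word in its window except itself
        pairs.extend((target, sequence[j]) for j in range(left, right) if j != i)
    return pairs

print(skipgram_pairs([3, 1, 5, 2], window_size=3))
# [(3, 1), (1, 3), (1, 5), (5, 1), (5, 2), (2, 5)]
```

Words at the ends of the sequence contribute fewer pairs because their windows are clipped.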

## Embedding lookup

When training the embedding model, the "forward" run consists of variable initialization/retrieval followed by embedding lookup for the current iteration's training batch. Embedding lookup refers to retrieving the embedding vectors for each word ID in the training batch. Since the embedding matrix's rows are each unique embedding vectors, we perform the lookup simply by retrieving the rows corresponding to the training batch's word IDs.

The function we use to retrieve the embedding vectors is tf.nn.embedding_lookup. It takes two required arguments: the embedding matrix variable and the vocabulary IDs to look up.
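Conceptually, the lookup is just row selection. The sketch below imitates it with nested Python lists rather than tensors; this local `embedding_lookup` is a stand-in written for illustration, not the TensorFlow function, and the matrix values are made up.

```python
def embedding_lookup(embedding_matrix, word_ids):
    """Return the rows of embedding_matrix selected by word_ids, in order."""
    return [embedding_matrix[word_id] for word_id in word_ids]

# Toy matrix: vocab_size=4 rows, embedding_dim=3 columns
embedding_matrix = [
    [0.0, 0.1, 0.2],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]

batch = embedding_lookup(embedding_matrix, [2, 0, 2])
print(batch)  # [[2.0, 2.1, 2.2], [0.0, 0.1, 0.2], [2.0, 2.1, 2.2]]
```

A batch of three word IDs yields three vectors of length embedding_dim, and a repeated ID simply returns the same row again.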

In [None]:
class EmbeddingModel(object):
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)
        self.embedding_matrix = None

    def forward(self, target_ids):
        # 1. Build the initializer (get_initializer is a helper not defined in this notebook excerpt)
        initializer = get_initializer(self.embedding_dim, self.vocab_size)
        
        # 2. Create or retrieve the embedding matrix (a TensorFlow variable)
        self.embedding_matrix = tf.compat.v1.get_variable(
            'embedding_matrix', 
            initializer=initializer
        )
        
        # 3. Look up the embedding vectors for the target IDs
        embeddings = tf.nn.embedding_lookup(self.embedding_matrix, target_ids)
        
        return embeddings

To calculate the loss for our candidate sampling algorithm, we need to create weight and bias variables. The weight variable will have shape [self.vocab_size, self.embedding_dim], while the bias variable will have shape [self.vocab_size].

We'll initialize the values for both the weight and bias variables to all zeros.

In [None]:
import tensorflow as tf

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Get bias and weights for calculating loss
    def get_bias_weights(self):
        weights_initializer = tf.zeros([self.vocab_size, self.embedding_dim])
        bias_initializer = tf.zeros([self.vocab_size])
        weights = tf.compat.v1.get_variable('weights',
            initializer=weights_initializer)
        bias = tf.compat.v1.get_variable('bias',
            initializer=bias_initializer)
        return weights, bias
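To make the two shapes concrete, here is a plain-Python sketch with toy sizes. It assumes the variables act as a per-word linear scoring layer (one weight row and one bias value per vocabulary word), which is how NCE-style candidate sampling losses typically use them; `word_scores` is a hypothetical helper and the embedding values are made up.

```python
vocab_size, embedding_dim = 4, 3

# Zero-initialized parameters with the shapes described above
weights = [[0.0] * embedding_dim for _ in range(vocab_size)]  # [vocab_size, embedding_dim]
bias = [0.0] * vocab_size                                     # [vocab_size]

def word_scores(embedding, weights, bias):
    """Score every vocabulary word: dot(weights[w], embedding) + bias[w]."""
    return [sum(w_k * e_k for w_k, e_k in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]

print(word_scores([0.5, 1.0, 2.0], weights, bias))  # [0.0, 0.0, 0.0, 0.0]
```

With all-zero parameters every word starts with the same score, so the model has no initial preference among vocabulary words.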
        