<a href="https://colab.research.google.com/github/gyxcit/nlp_course1/blob/main/pw1_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a target="_blank" href="https://colab.research.google.com/github/PaulLerner/aivancity_nlp/blob/main/pw1_embedding.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Part 0 - Installation and imports

Hit `Ctrl+S` to save a copy of the Colab notebook to your drive

Run on Google Colab GPU:
- Connect
- Change runtime type
- GPU

![image.png](https://paullerner.github.io/aivancity_nlp/_static/colab_gpu.png)

In [63]:
%pip install datasets



In [64]:
import torch

In [65]:
assert torch.cuda.is_available(), "Connect to GPU and try again (ask teacher for help)"

# Part 1 - Training Skipgram with negative sampling

(aka word2vec)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). [Distributed Representations of Words and Phrases and their Compositionality](https://papers.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html). Advances in Neural Information Processing Systems, 26.

Eisenstein, J. (2018). [Natural Language Processing](https://csunibo.github.io/natural-language-processing/books/eisenstein-natural-language-processing.pdf) Chapter 14.5

Not to be confused with Softmax Skipgram



## Training data

In [66]:
from datasets import load_dataset, DatasetDict

texts = load_dataset('wikitext', 'wikitext-103-raw-v1')['train'].shuffle(seed=1111).select(range(10000))["text"]
len(texts)

10000

### Tokenization

Using [`re`](https://docs.python.org/3/library/re.html), tokenize the text into a `List[str]` that keeps only words, with no punctuation or whitespace. We will also lowercase the text so that our model is *case-insensitive*.

If you don't know anything about regex, a good place to start is https://regex101.com/
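
As a quick illustration of what a word-only pattern keeps and drops, the sketch below applies `re.findall` with `\w+` to a made-up sentence written in the same style as the dataset (the sentence and variable names are illustrative assumptions). Note that `\w` also matches digits and underscores, and that an isolated `'s` is reduced to the token `s`.

```python
import re

# Hypothetical sample sentence, WikiText style (spaces around punctuation)
sample = "Oswiu 's presence at the Synod of Whitby in 664 , with Colmán ."

# \w+ matches runs of letters, digits and underscores; punctuation and spaces are skipped
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
# → ['oswiu', 's', 'presence', 'at', 'the', 'synod', 'of', 'whitby', 'in', '664', 'with', 'colmán']
```

On Python 3 strings `\w` is Unicode-aware, so accented words such as "colmán" stay in one piece.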

In [67]:
import re

In [68]:
texts[0]

" Following the death of Finan , bishop of Lindisfarne , Alhfrith of Deira , in collusion with Wilfred of York , Agilbert of Wessex and others , were determined to persuade Oswiu to rule in favour of the Roman rite of Christianity within the kingdoms over which he had imperium . The case was debated in Oswiu 's presence at the Synod of Whitby in 664 , with Colmán , Hild and Cedd defending the Celtic rite and the tradition inherited from Aidan , and Wilifred speaking for the Roman position . The Roman cause prevailed and the former division of ecclesiastical authorities was set aside . Those who could not accept it , including Colmán , departed elsewhere . \n"

In [69]:
type(texts)

list

In [70]:
def tokenize_text(text: str) -> list[str]:
    # Convert text to lowercase

    text = text.lower()
    # Find all words using regex (alphanumeric sequences)
    tokens = re.findall(r'\b\w+\b', text)
    return tokens

In [71]:
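def one_list_to_govern_them_all(liste: list) -> list[str]:
    """Flatten a list of token lists into a single list of tokens.

    Expects tokenized texts: on raw strings, `extend` would add single characters.
    """
    to_unique = []
    for tokens in liste:
        to_unique.extend(tokens)
    return to_unique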

In [72]:
#unique list
unique_texts=one_list_to_govern_them_all(texts)

In [73]:
texts=[tokenize_text(x) for x in texts]

In [74]:
texts

[['following',
  'the',
  'death',
  'of',
  'finan',
  'bishop',
  'of',
  'lindisfarne',
  'alhfrith',
  'of',
  'deira',
  'in',
  'collusion',
  'with',
  'wilfred',
  'of',
  'york',
  'agilbert',
  'of',
  'wessex',
  'and',
  'others',
  'were',
  'determined',
  'to',
  'persuade',
  'oswiu',
  'to',
  'rule',
  'in',
  'favour',
  'of',
  'the',
  'roman',
  'rite',
  'of',
  'christianity',
  'within',
  'the',
  'kingdoms',
  'over',
  'which',
  'he',
  'had',
  'imperium',
  'the',
  'case',
  'was',
  'debated',
  'in',
  'oswiu',
  's',
  'presence',
  'at',
  'the',
  'synod',
  'of',
  'whitby',
  'in',
  '664',
  'with',
  'colmán',
  'hild',
  'and',
  'cedd',
  'defending',
  'the',
  'celtic',
  'rite',
  'and',
  'the',
  'tradition',
  'inherited',
  'from',
  'aidan',
  'and',
  'wilifred',
  'speaking',
  'for',
  'the',
  'roman',
  'position',
  'the',
  'roman',
  'cause',
  'prevailed',
  'and',
  'the',
  'former',
  'division',
  'of',
  'ecclesiastical',

### Vocabulary



Using `collections.Counter`, construct the vocabulary by tokenizing every text in the dataset and keeping only the $V$ most frequent words. Let's set $V=1000$ to start

Then use this vocabulary to vectorize examples, assigning integer identifiers to words (from $0$ to $V-1$)
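
The two steps above can be sketched on a toy corpus. The texts, the value `toy_V = 2`, and all variable names below are made up for illustration; only `Counter.update` and `Counter.most_common` come from the standard library.

```python
from collections import Counter

# Hypothetical toy corpus: two already-tokenized texts
toy_texts = [["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "sat"]]
toy_V = 2  # keep only the 2 most frequent words

counts = Counter()
for tokens in toy_texts:
    counts.update(tokens)  # in-place update, no new Counter per text

# most_common returns (word, count) pairs sorted by decreasing count
toy_vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common(toy_V))}

# Out-of-vocabulary words are simply dropped when vectorizing
ids = [toy_vocab[w] for w in toy_texts[0] if w in toy_vocab]
print(toy_vocab, ids)
# → {'the': 0, 'sat': 1} [0, 1, 0]
```

The most frequent word gets identifier 0, so identifiers double as frequency ranks.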

In [75]:
from collections import Counter
V = 1000

In [76]:
def build_vocabulary(lists: list[list[str]], V: int) -> tuple[Counter, dict[str, int]]:
    """Counts words over all tokenized texts and maps the V most frequent ones to ids 0..V-1."""
    # Merge the word counts of every tokenized text
    word_counts = Counter()
    for lst in lists:
        word_counts.update(lst)
    most_common_words = [word for word, _ in word_counts.most_common(V)]
    print(f"most common words:\n{most_common_words}")
    vocab = {word: idx for idx, word in enumerate(most_common_words)}
    return word_counts, vocab

In [77]:
vocab

{'the': 0,
 'of': 1,
 'and': 2,
 'in': 3,
 'to': 4,
 'a': 5,
 'was': 6,
 'on': 7,
 's': 8,
 'as': 9,
 'for': 10,
 'with': 11,
 'that': 12,
 'by': 13,
 'is': 14,
 'he': 15,
 'his': 16,
 'at': 17,
 'it': 18,
 'from': 19,
 'were': 20,
 'an': 21,
 'had': 22,
 'which': 23,
 'be': 24,
 'but': 25,
 'this': 26,
 'are': 27,
 'first': 28,
 'not': 29,
 'after': 30,
 'also': 31,
 'one': 32,
 'their': 33,
 'two': 34,
 'her': 35,
 'its': 36,
 'they': 37,
 'or': 38,
 'who': 39,
 'when': 40,
 'have': 41,
 'new': 42,
 'has': 43,
 'she': 44,
 'been': 45,
 'would': 46,
 'other': 47,
 'i': 48,
 'during': 49,
 'time': 50,
 'all': 51,
 'more': 52,
 'into': 53,
 '1': 54,
 'game': 55,
 'over': 56,
 'most': 57,
 'him': 58,
 'while': 59,
 'only': 60,
 'than': 61,
 'between': 62,
 'later': 63,
 'up': 64,
 'out': 65,
 '2': 66,
 'three': 67,
 'about': 68,
 'before': 69,
 'film': 70,
 'there': 71,
 'such': 72,
 'some': 73,
 'may': 74,
 'made': 75,
 '000': 76,
 'world': 77,
 'year': 78,
 'where': 79,
 'series': 80,


In [78]:
occurrences,vocab = build_vocabulary(texts,V)
occurrences

most common words:
['the', 'of', 'and', 'in', 'to', 'a', 'was', 'on', 's', 'as', 'for', 'with', 'that', 'by', 'is', 'he', 'his', 'at', 'it', 'from', 'were', 'an', 'had', 'which', 'be', 'but', 'this', 'are', 'first', 'not', 'after', 'also', 'one', 'their', 'two', 'her', 'its', 'they', 'or', 'who', 'when', 'have', 'new', 'has', 'she', 'been', 'would', 'other', 'i', 'during', 'time', 'all', 'more', 'into', '1', 'game', 'over', 'most', 'him', 'while', 'only', 'than', 'between', 'later', 'up', 'out', '2', 'three', 'about', 'before', 'film', 'there', 'such', 'some', 'may', 'made', '000', 'world', 'year', 'where', 'series', 'through', 'second', '3', 'season', 'years', 'no', 'them', 'm', 'used', 'these', 'became', 'state', 'war', 'however', 'being', 'music', 'can', 'several', 'against', 'then', 'many', 'city', 'album', '5', 'number', 'including', 'both', 'song', 'four', 'north', '4', 'part', 'team', 'south', 'did', 'united', 'well', 'because', 'early', 'following', 'under', 'episode', 'day', 

Counter({'following': 324,
         'the': 35377,
         'death': 176,
         'of': 14994,
         'finan': 1,
         'bishop': 23,
         'lindisfarne': 1,
         'alhfrith': 1,
         'deira': 1,
         'in': 12074,
         'collusion': 3,
         'with': 3738,
         'wilfred': 5,
         'york': 186,
         'agilbert': 1,
         'wessex': 1,
         'and': 13832,
         'others': 101,
         'were': 1868,
         'determined': 31,
         'to': 11022,
         'persuade': 8,
         'oswiu': 2,
         'rule': 41,
         'favour': 8,
         'roman': 44,
         'rite': 7,
         'christianity': 15,
         'within': 181,
         'kingdoms': 4,
         'over': 601,
         'which': 1525,
         'he': 2713,
         'had': 1615,
         'imperium': 7,
         'case': 96,
         'was': 5883,
         'debated': 2,
         's': 4221,
         'presence': 38,
         'at': 2479,
         'synod': 3,
         'whitby': 3,
         '664'

In [79]:
# expected results
occurrences.most_common(10)

[('the', 35377),
 ('of', 14994),
 ('and', 13832),
 ('in', 12074),
 ('to', 11022),
 ('a', 9897),
 ('was', 5883),
 ('on', 4257),
 ('s', 4221),
 ('as', 3989)]

In [80]:
# expected results
len(occurrences)

39030

In [81]:
def vectorize_texts(token_text, vocab) -> list[list[int]]:
    """Convert one tokenized text into a one-element list holding its sequence of vocabulary ids (out-of-vocabulary words are dropped)."""
    return [[vocab[word] for word in token_text if word in vocab]]

In [82]:
vectorized_texts = vectorize_texts(texts[0], vocab)

In [83]:
vectorized_texts

[[120,
  0,
  251,
  1,
  1,
  1,
  3,
  11,
  1,
  235,
  1,
  2,
  509,
  20,
  4,
  4,
  3,
  1,
  0,
  1,
  241,
  0,
  56,
  23,
  15,
  22,
  0,
  537,
  6,
  3,
  8,
  17,
  0,
  1,
  3,
  11,
  2,
  0,
  2,
  0,
  19,
  2,
  10,
  0,
  457,
  0,
  2,
  0,
  299,
  278,
  1,
  6,
  248,
  236,
  39,
  145,
  29,
  18,
  106]]

### Self-supervision: input and targets from raw text

Using `window_size=2`, iterate through the tokenized data and collect the indices of each target word $w$ and of its positive context words $c_+$.

After debugging, apply this function to every text in the dataset.
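
To make the windowing concrete, here is a minimal sketch on a four-token sequence with a window of 1. The helper name `window_pairs` and the toy ids are assumptions for illustration.

```python
def window_pairs(ids, window_size=2):
    """Return (target, context) pairs for one sequence of token ids."""
    pairs = []
    for i, target in enumerate(ids):
        # Clip the window at the sequence boundaries
        left = max(i - window_size, 0)
        right = min(i + window_size + 1, len(ids))
        for j in range(left, right):
            if j != i:  # a word is not its own context
                pairs.append((target, ids[j]))
    return pairs

toy_ids = [10, 11, 12, 13]
pairs = window_pairs(toy_ids, window_size=1)
print(pairs)
# → [(10, 11), (11, 10), (11, 12), (12, 11), (12, 13), (13, 12)]
```

Tokens at the edges of a text produce fewer pairs than interior ones. Also note that if out-of-vocabulary words are removed before pairing, a window can span words that were not adjacent in the original text.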

In [84]:
word_contexts = []
window_size=2

In [85]:
def collect_context_pairs(vectorized_texts: list[list[int]], window_size: int = 2) -> list[tuple[int, int]]:
    """Collects (target, context) pairs from vectorized texts using a given window size."""
    context_pairs = []
    for text in vectorized_texts:
        for i, target in enumerate(text):
            left = max(i - window_size, 0)
            right = min(i + window_size + 1, len(text))

            for j in range(left, right):
                if i != j:  # Do not use the target word itself as context
                    context_pairs.append((target, text[j]))

    return context_pairs

In [137]:
word_contexts = collect_context_pairs(vectorized_texts, window_size=2)

In [138]:
word_contexts[:10]

[(120, 0),
 (120, 251),
 (0, 120),
 (0, 251),
 (0, 1),
 (251, 120),
 (251, 0),
 (251, 1),
 (251, 1),
 (1, 0)]

In [139]:
len(word_contexts)

230

In [140]:
len(vectorized_texts)

1

In [154]:
def collect_context_pairs_for_all(lists,vocabulary,window_size_number):
  collect=[]
  #print(f'len :: {len(lists)}')
  for i in range(len(lists)):
    vectorized_texts = vectorize_texts(lists[i], vocabulary)
    word_contexts = collect_context_pairs(vectorized_texts, window_size=window_size_number)
    #print(f'word  {word_contexts}')
    collect.extend(word_contexts)
    #print(len(collect))
  return collect

In [155]:
words_contexts=collect_context_pairs_for_all(lists=texts,vocabulary=vocab,window_size_number=2)

In [156]:
len(words_contexts)

1202312

### DataLoader

`torch.utils.data.DataLoader` provides a convenient interface for batching and randomly sampling data.

Use `batch_size=1024` to wrap the data generated above,

and `drop_last=True` so that every batch has the same size.
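
The effect of `drop_last=True` is easy to check on a toy dataset. The sketch below uses 10 made-up pairs and a batch size of 4; the names `toy_dataset` and `toy_loader` are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 (target, context) pairs
toy_targets = torch.arange(10)
toy_contexts = torch.arange(10, 20)
toy_dataset = TensorDataset(toy_targets, toy_contexts)

# 10 examples with batch_size=4: two full batches, the remaining 2 examples are dropped
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)
shapes = [(w.shape, c.shape) for w, c in toy_loader]
print(len(toy_loader), shapes)
```

With `shuffle=True`, the examples that get dropped differ from one epoch to the next, so no pair is excluded permanently.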


In [108]:
import torch
from torch.utils.data import DataLoader, TensorDataset

In [165]:
# Convert the (w, c+) pairs into PyTorch tensors
def create_dataloader(context_pairs, batch_size=1024, drop_last=True):
    """Wraps context pairs into a PyTorch DataLoader for batching."""
    targets, contexts = zip(*context_pairs)  # Split targets and contexts

    # Convert to tensors
    targets_tensor = torch.tensor(targets, dtype=torch.long)
    contexts_tensor = torch.tensor(contexts, dtype=torch.long)

    # Build the PyTorch dataset
    dataset = TensorDataset(targets_tensor, contexts_tensor)

    # Build the DataLoader
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=drop_last)
    #print(f'data loader ::{dataloader}')
    return dataloader

In [169]:
# Parameters
BATCH_SIZE = 1024

# Build the DataLoader
dataloader = create_dataloader(words_contexts, batch_size=BATCH_SIZE, drop_last=True)

# Print the batches as a sanity check
for batch in dataloader:
    print(f'{batch}')
    target_batch, context_batch = batch
    print("Target batch shape:", target_batch.shape)
    print("Context batch shape:", context_batch.shape)
    print("Example batch (targets, contexts):", list(zip(target_batch[:5].tolist(), context_batch[:5].tolist())))

data loader ::<torch.utils.data.dataloader.DataLoader object at 0x7e44ea3a6850>
[tensor([ 14,  14,  10,  ...,   3,  10, 322]), tensor([ 1, 55,  5,  ..., 76, 17,  6])]
Target batch shape: torch.Size([1024])
Context batch shape: torch.Size([1024])
Example batch (targets, contexts): [(14, 1), (14, 55), (10, 5), (860, 2), (18, 3)]
[tensor([268,  13,  63,  ...,   0,   0, 808]), tensor([753,  10,  12,  ...,   3, 316,   8])]
Target batch shape: torch.Size([1024])
Context batch shape: torch.Size([1024])
Example batch (targets, contexts): [(268, 753), (13, 10), (63, 12), (121, 906), (6, 119)]
[tensor([699, 719,  49,  ...,  24,  12, 793]), tensor([ 12,   0,   2,  ...,  89, 934,  67])]
Target batch shape: torch.Size([1024])
Context batch shape: torch.Size([1024])
Example batch (targets, contexts): [(699, 12), (719, 0), (49, 2), (67, 65), (525, 9)]
[tensor([ 92,  12,  10,  ...,   9,  16, 102]), tensor([767,  21,  51,  ...,  72,   2, 526])]
Target batch shape: torch.Size([1024])
Context batch shape

In [170]:
batch_size = 1024

In [171]:
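# Use the pairs collected over the whole dataset: `word_contexts` only holds the 230 pairs
# of the first text, which is less than one batch and would leave the loader empty
loader = torch.utils.data.DataLoader(words_contexts, batch_size=batch_size, shuffle=True, drop_last=True)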

In [175]:
for words, contexts in dataloader:
    print("Batch of words (targets):", words.shape)
    print("Batch of contexts:", contexts.shape)
    print("Example (w, c+) pairs:", list(zip(words[:5].tolist(), contexts[:5].tolist())))
    break  # Stop after the first batch


Batch of words (targets): torch.Size([1024])
Batch of contexts: torch.Size([1024])
Example (w, c+) pairs: [(5, 4), (46, 258), (998, 734), (4, 3), (17, 263)]


### Negative examples

Using `torch.randint`, draw $k=10$ random negative examples from the vocabulary for each target word $w$.
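
`torch.randint` draws negatives uniformly over the vocabulary. Mikolov et al. (2013) instead sample negatives from the unigram distribution raised to the power 3/4. The sketch below shows one possible way to do that with `torch.multinomial`; it is an alternative to the uniform sampling asked for here, and the counts are made up.

```python
import torch

# Hypothetical word counts for a 5-word vocabulary (index = word id)
toy_counts = torch.tensor([100.0, 50.0, 10.0, 5.0, 1.0])

# Unigram distribution smoothed with the 3/4 exponent, then normalized
weights = toy_counts ** 0.75
probs = weights / weights.sum()

batch_size, k = 4, 10
# Draw batch_size * k ids; replacement=True allows the same id to appear several times
negatives = torch.multinomial(probs, num_samples=batch_size * k, replacement=True)
negatives = negatives.view(batch_size, k)  # one row of k negatives per target word
print(negatives.shape)
```

The exponent flattens the distribution, so rare words are drawn more often than their raw frequency would suggest while frequent words still dominate.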


In [181]:
K_size=10
Vocab_size = len(vocab)


In [178]:
def generate_negative_samples(vocab_size: int, batch_size: int, k: int) -> torch.Tensor:
    """
    Generates k negative samples for each target word in the batch.

    Args:
    - vocab_size (int): Vocabulary size.
    - batch_size (int): Number of examples in the batch.
    - k (int): Number of negative examples to draw per target word.

    Returns:
    - neg_samples (torch.Tensor): Tensor of shape (batch_size, k) holding the indices of the negative words.
    """
    neg_samples = torch.randint(low=0, high=vocab_size, size=(batch_size, k))
    return neg_samples

In [182]:
for words, contexts in dataloader:
    neg_samples = generate_negative_samples(Vocab_size, words.shape[0], K_size)

    print("Target words batch:", words.shape)
    print("Positive context batch:", contexts.shape)
    print("Negative samples batch:", neg_samples.shape)
    print("Example negative samples for first word in batch:", neg_samples[0].tolist())
    break  # Show a single batch

Target words batch: torch.Size([1024])
Positive context batch: torch.Size([1024])
Negative samples batch: torch.Size([1024, 10])
Example negative samples for first word in batch: [248, 931, 611, 386, 257, 743, 591, 483, 536, 180]


Concatenate the indices of the negative and positive examples, but keep track of the labels (`1.0` for positive, `0.0` for negative).

In [185]:
def prepare_training_data(dataloader, vocab_size, k=10):
    """
    Prepares the training data by concatenating positive and negative samples
    and assigning the corresponding labels.

    Args:
    - dataloader: DataLoader yielding (words, contexts) batches.
    - vocab_size (int): Vocabulary size.
    - k (int): Number of negative examples per target word.

    Returns:
    - targets (torch.Tensor): Target word indices.
    - contexts (torch.Tensor): Concatenation of positive and negative examples.
    - labels (torch.Tensor): Matching labels (1.0 for positive, 0.0 for negative).
    """
    all_targets = []
    all_contexts = []
    all_labels = []

    for words, contexts in dataloader:
        batch_size = words.shape[0]

        # Draw the negative examples
        neg_samples = generate_negative_samples(vocab_size, batch_size, k)

        # Concatenate positive and negative examples
        context_combined = torch.cat([contexts.unsqueeze(1), neg_samples], dim=1)  # Shape (batch_size, k+1)
        labels = torch.cat([torch.ones(batch_size, 1), torch.zeros(batch_size, k)], dim=1)  # Shape (batch_size, k+1)

        # Store the batch
        all_targets.append(words)  # Shape (batch_size,)
        all_contexts.append(context_combined)  # Shape (batch_size, k+1)
        all_labels.append(labels)  # Shape (batch_size, k+1)

    # Concatenate all batches into single tensors
    targets_tensor = torch.cat(all_targets)
    contexts_tensor = torch.cat(all_contexts)
    labels_tensor = torch.cat(all_labels)

    return targets_tensor, contexts_tensor, labels_tensor

In [186]:
K = 10  # Number of negative examples per word
targets, contexts, labels = prepare_training_data(dataloader, Vocab_size, k=K_size)

In [187]:
labels.shape

torch.Size([1202176, 11])

In [188]:
contexts.shape

torch.Size([1202176, 11])

## Model


![diagram](https://paullerner.github.io/aivancity_nlp/_static/skipgram.png)

Using `nn.Embedding`, initialize the two word embedding matrices, $W$ (word) and $C$ (context), of shape $(V, d)$. Let's set $d=100$ for now.

In [189]:
d=100

In [190]:
import torch.nn as nn

# Model parameters
V = len(vocab)  # Vocabulary size
d = 100  # Embedding dimension

# Initialize the embeddings
embedding_W = nn.Embedding(num_embeddings=V, embedding_dim=d)
embedding_C = nn.Embedding(num_embeddings=V, embedding_dim=d)

# Check the dimensions
print("Word embedding matrix W shape:", embedding_W.weight.shape)
print("Context embedding matrix C shape:", embedding_C.weight.shape)

Word embedding matrix W shape: torch.Size([1000, 100])
Context embedding matrix C shape: torch.Size([1000, 100])


You can then call these embeddings on the token indices computed above to get the corresponding vectors.

In [192]:
embedding_W(words).shape

torch.Size([1024, 100])

In [193]:
embedding_C(contexts).shape

torch.Size([1202176, 11, 100])

## Probability of similarity (sigmoid)
$$P(+ | w,c) = \sigma(c \cdot w) = \frac{1}{1+\exp(-c \cdot w)}$$


Compute the similarity $c \cdot w$. No need to compute Sigmoid as it is integrated in the loss function (see below).
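
The batched dot product can be checked on small made-up shapes. The sketch below is one possible formulation using `torch.einsum`; the sizes and the names `toy_V`, `toy_d`, `W`, `C` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_V, toy_d, batch_size, k = 20, 8, 4, 3  # hypothetical small sizes

W = nn.Embedding(toy_V, toy_d)
C = nn.Embedding(toy_V, toy_d)

words = torch.randint(0, toy_V, (batch_size,))
positives = torch.randint(0, toy_V, (batch_size,))
negatives = torch.randint(0, toy_V, (batch_size, k))

# Positive context in column 0, negatives in columns 1..k
contexts = torch.cat([positives.unsqueeze(1), negatives], dim=1)  # (batch_size, k+1)

w = W(words)      # (batch_size, d)
c = C(contexts)   # (batch_size, k+1, d)

# One dot product per (word, context) pair
similarities = torch.einsum("bkd,bd->bk", c, w)  # (batch_size, k+1)
print(similarities.shape)
```

Spelling out the index pattern makes shape errors fail loudly: if `contexts` were a 1-D tensor of positives only, `einsum` would raise an error, whereas a `matmul`-based version can silently broadcast to `(batch_size, batch_size)`.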

In [199]:
def compute_similarity(targets, contexts, embedding_W, embedding_C):
    """
    Compute the similarity c ⋅ w between word and context embeddings.

    Args:
    - targets (torch.Tensor): Target word indices (shape: [batch_size])
    - contexts (torch.Tensor): Context word indices (shape: [batch_size, k+1])
    - embedding_W (nn.Embedding): Word embedding matrix (V, d)
    - embedding_C (nn.Embedding): Context embedding matrix (V, d)

    Returns:
    - similarities (torch.Tensor): Dot product between c and w (shape: [batch_size, k+1])
    """
    # Look up the target word embeddings (shape: [batch_size, d])
    w_vectors = embedding_W(targets)  # (batch_size, d)

    # Look up the context word embeddings (shape: [batch_size, k+1, d])
    c_vectors = embedding_C(contexts)  # (batch_size, k+1, d)

    # Reshape w_vectors to (batch_size, d, 1)
    w_vectors = w_vectors.unsqueeze(2)  # (batch_size, d, 1)

    # Dot product for each (w, c) pair -> shape: (batch_size, k+1)
    similarities = torch.matmul(c_vectors, w_vectors).squeeze(2)

    return similarities

In [200]:

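# Test on one training batch.
# Note: `contexts` from the dataloader holds only the positive context (shape [batch_size]),
# so the matmul broadcasts to [batch_size, batch_size] instead of [batch_size, k+1].
for words, contexts in dataloader:
    similarities = compute_similarity(words, contexts, embedding_W, embedding_C)

    print("Similarity tensor shape:", similarities.shape)  # intended: (batch_size, k+1)
    print("Example similarities:", similarities[:5])
    break  # Test on a single batch only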

Similarity tensor shape: torch.Size([1024, 1024])
Example similarities: tensor([[ -4.4987, -12.2394,   4.2754,  ..., -13.9737, -14.6826,   7.6650],
        [-20.0720,  -6.8195,  11.2739,  ...,   7.8222,   0.1693, -13.4752],
        [  5.1556,  -8.1852,  -4.3052,  ..., -12.6521,   8.0265,  10.4137],
        [ -6.2136,   9.3837,  -9.5842,  ..., -16.6773,  -7.9885,   8.3415],
        [  3.8548,  15.6787,   2.3566,  ...,   8.8896,  -9.3696, -12.1108]],
       grad_fn=<SliceBackward0>)


In [201]:
similarities.shape

torch.Size([1024, 1024])



## Loss: Binary Cross Entropy

Compute the loss using `nn.BCEWithLogitsLoss`

$$- \left(\log\sigma(c_+ \cdot w) + \sum_{i=1}^{k} \log\sigma(-c_i \cdot w)\right) $$

You will need to reshape similarities and labels
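
To see that `nn.BCEWithLogitsLoss` matches the formula above, the following sketch compares the library loss with a hand-written version on random logits. The sizes are made up, and `reduction="sum"` is used so that the two values line up exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, k = 4, 3  # hypothetical small sizes
similarities = torch.randn(batch_size, k + 1)  # column 0: positive, columns 1..k: negatives
labels = torch.cat([torch.ones(batch_size, 1), torch.zeros(batch_size, k)], dim=1)

# Library loss on flattened logits and labels, summed over all pairs
bce = nn.BCEWithLogitsLoss(reduction="sum")
loss_lib = bce(similarities.reshape(-1), labels.reshape(-1))

# Same quantity written out from the formula, using 1 - sigmoid(x) = sigmoid(-x)
pos_term = F.logsigmoid(similarities[:, 0])
neg_term = F.logsigmoid(-similarities[:, 1:]).sum(dim=1)
loss_manual = -(pos_term + neg_term).sum()

print(loss_lib.item(), loss_manual.item())
```

With the default `reduction="mean"`, the same sum is divided by `batch_size * (k + 1)`, which only rescales the gradient.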

In [202]:
def compute_loss(similarities, labels):
    """
    Compute the Binary Cross Entropy Loss using BCEWithLogitsLoss.

    Args:
    - similarities (torch.Tensor): Dot product between c and w (shape: [batch_size, k+1])
    - labels (torch.Tensor): Labels (1.0 for positive, 0.0 for negative) (shape: [batch_size, k+1])

    Returns:
    - loss (torch.Tensor): Scalar loss value.
    """
    # Instantiate the BCEWithLogitsLoss criterion
    criterion = nn.BCEWithLogitsLoss()

    # Flatten both tensors so that they have the same shape
    similarities = similarities.view(-1)  # (batch_size * (k+1),)
    labels = labels.view(-1)  # (batch_size * (k+1),)

    # Compute the loss
    loss = criterion(similarities, labels)
    return loss

In [204]:
for words, contexts in dataloader:
    print("Words shape:", words.shape)
    print("Contexts shape:", contexts.shape)  # Check the shape

    # Make sure `contexts` is 2-D
    if contexts.dim() == 1:
        contexts = contexts.unsqueeze(1)  # Add a dimension if needed

    batch_size = words.shape[0]
    # `contexts` has k+1 columns (1 positive + k negatives).
    # Note: the dataloader yields positives only, so here k = 0 and the loss covers positive pairs only
    k = contexts.shape[1] - 1

    labels = torch.cat([torch.ones(batch_size, 1), torch.zeros(batch_size, k)], dim=1).to(contexts.device)

    # Compute the similarities
    similarities = compute_similarity(words, contexts, embedding_W, embedding_C)

    # Compute the loss
    loss = compute_loss(similarities, labels)

    print("Loss:", loss.item())
    break  # Test on a single batch

Words shape: torch.Size([1024])
Contexts shape: torch.Size([1024])
Loss: 4.290567874908447


In [205]:
similarities.reshape(-1).shape

torch.Size([1024])

## Training loop

Now that you have checked that everything works, pack everything above into an `nn.Module` and a training loop.

Ensure that everything is on GPU by calling `.cuda()` or passing `device="cuda"` on init

In [206]:
import torch
import torch.nn as nn
import torch.optim as optim

In [214]:
class Skipgram(nn.Module):
    def __init__(self, vocab_size, embedding_dim, device="cuda"):
        super(Skipgram, self).__init__()
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")

        # Word embedding matrix W
        self.W = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

        # Context embedding matrix C
        self.C = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

        # Move model to GPU if available
        self.to(self.device)

    def forward(self, words, contexts):
        """
        Computes similarities (dot product between word and context embeddings)
        """
        # Get word embeddings (batch_size, d)
        word_vectors = self.W(words.to(self.device))  # (batch_size, d)

        # Ensure `contexts` has the correct shape: (batch_size, k+1)
        if contexts.dim() == 1:
            contexts = contexts.unsqueeze(1)

        # Get context embeddings (batch_size, k+1, d)
        context_vectors = self.C(contexts.to(self.device))  # (batch_size, k+1, d)

        # Compute dot product (batch_size, k+1)
        similarities = torch.matmul(context_vectors, word_vectors.unsqueeze(2)).squeeze(2)

        return similarities  # Correct shape: (batch_size, k+1)

In [215]:
# Model Parameters
V = len(vocab)   # Vocabulary size
d = 100          # Embedding dimension
K = 10           # Negative samples
BATCH_SIZE = 1024
EPOCHS = 5
LR = 0.001       # Learning rate

In [217]:
# Initialize model
model = Skipgram(V, d).cuda()

# Loss function & Optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

criterion = criterion.cuda()  # Ensure loss runs on GPU

Run tensorboard before training. Refresh during training.

In [218]:
def train_model(model, dataloader, epochs):
    for epoch in range(epochs):
        total_loss = 0.0

        for words, contexts in dataloader:
            batch_size = words.shape[0]

            # The dataloader yields positive contexts only, shape (batch_size,).
            # Draw K negatives per target and put the positive first: (batch_size, K+1)
            negatives = generate_negative_samples(V, batch_size, K)
            contexts = torch.cat([contexts.unsqueeze(1), negatives], dim=1)

            # Generate labels (1 for positive, 0 for negatives)
            labels = torch.cat([torch.ones(batch_size, 1), torch.zeros(batch_size, K)], dim=1).to(model.device)

            # Forward pass
            similarities = model(words, contexts)

            # Sanity check: `similarities` must match `labels`
            assert similarities.shape == labels.shape, f"Shape mismatch: {similarities.shape} vs {labels.shape}"

            # Compute loss
            loss = criterion(similarities, labels)

            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        # Print loss after every epoch
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {total_loss / len(dataloader):.4f}")


In [219]:
train_model(model, dataloader, EPOCHS)

In [None]:
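# Assumption: TensorBoard logging through torch.utils.tensorboard (writes to ./runs by default)
from torch.utils.tensorboard import SummaryWriter

skipgram = Skipgram(V, d).cuda()
loss_fct = nn.BCEWithLogitsLoss()
writer = SummaryWriter()

optimizer = torch.optim.AdamW(skipgram.parameters(), lr=0.0001)

batch_size = 1024
k=10
# `words_contexts` holds the pairs collected over the whole dataset
loader = torch.utils.data.DataLoader(words_contexts, batch_size=batch_size, shuffle=True, drop_last=True)
# Same labels for every batch: column 0 is the positive, columns 1..k the negatives
labels = torch.cat([torch.ones(batch_size, 1), torch.zeros(batch_size, k)], dim=1).cuda()

steps = 0
for epoch in range(10):
    for words, contexts in loader:
        words = words.cuda()
        contexts = contexts.cuda()

        # k random negatives per target word, drawn uniformly from the vocabulary
        negatives = torch.randint(0, V, (batch_size, k), device=contexts.device)

        # (batch_size, k+1) logits: positive context first, then the negatives
        similarities = skipgram(words, torch.cat([contexts.unsqueeze(1), negatives], dim=1))

        loss = loss_fct(similarities.reshape(-1), labels.reshape(-1))
        writer.add_scalar("Loss/train", loss.item(), steps)
        steps+=1
        optimizer.zero_grad()  # clear gradients left over from the previous step
        loss.backward()
        optimizer.step()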


# Part 2 - Analogies and Intrinsic Evaluation of Embeddings

## Data and imports

We'll manipulate numpy arrays instead of torch tensors from here

In [None]:
%pip install gensim

In [None]:
import torch
import numpy as np

In [None]:
import gensim.downloader

In [None]:
# take the embeddings from the model you just trained
#embeddings = skipgram.W.weight.detach().cpu().numpy()

In [None]:
# load from a model you previously trained
#embeddings = torch.load("embeddings.bin").numpy()

In [None]:
# load a model someone else trained, probably better than yours (more data, larger vocabulary, etc.)
keyedvectors = gensim.downloader.load("word2vec-google-news-300")

embeddings = keyedvectors.vectors
vocabulary = {word: index for index, word in enumerate(keyedvectors.index_to_key)}
i2token = {index: word for index, word in enumerate(keyedvectors.index_to_key)}
V = len(vocabulary)

## Qualitative/playing around

### visualization

We'll reduce embeddings to 2D using [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) to be able to plot them

What can you see from the plot? Take a look at near-synonyms like "large" and "massive", or "interesting" and "fascinating".

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [None]:
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)


Using `plt.scatter`, plot the 2D embeddings of the words in `words`. Also label each point with its word using `plt.annotate`.
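
A minimal plotting sketch, assuming random 2D coordinates in place of `embeddings_2d` and a made-up word-to-index mapping (`toy_words`, `toy_index` and `toy_2d` are illustrative stand-ins, not the notebook's variables):

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for `embeddings_2d` and the word-to-index mapping
rng = np.random.default_rng(0)
toy_words = ["large", "massive", "huge", "july", "june"]
toy_index = {word: i for i, word in enumerate(toy_words)}
toy_2d = rng.normal(size=(len(toy_words), 2))

fig, ax = plt.subplots(figsize=(6, 6))
for word in toy_words:
    x, y = toy_2d[toy_index[word]]
    ax.scatter(x, y, color="tab:blue")
    # Offset the label by a few points so it does not sit on top of the marker
    ax.annotate(word, (x, y), xytext=(3, 3), textcoords="offset points")
```

With the real data, the row of `embeddings_2d` for a word would be looked up through the `vocabulary` mapping instead of `toy_index`.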

In [None]:
words = ['movie', 'book', 'mysterious', 'story', 'fascinating', 'good', 'interesting', 'large', 'massive', 'huge', "woman","man","he","she",'july',
 'december',
 'february',
 'november',
 'october',
 'january',
 'april',
 'june']


### Nearest neighbors


Now that we have word vectors, we need a way to quantify the similarity between individual words, according to these vectors. One such metric is cosine-similarity. We will be using this to find words that are "close" and "far" from one another.

We can think of n-dimensional vectors as points in n-dimensional space. From this perspective, the [L1](http://mathworld.wolfram.com/L1-Norm.html) and [L2](http://mathworld.wolfram.com/L2-Norm.html) distances quantify the amount of space "we must travel" to get from one point to the other. Another approach is to examine the angle between two vectors. From trigonometry we know that:

$$p \cdot q = ||p|| \cdot ||q|| \cos(\Theta)$$

Instead of computing the actual angle, we can express the similarity as $s = \cos(\Theta)$. Formally, the [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) $s$ between two vectors $p$ and $q$ is defined as:

$$s = \frac{p \cdot q}{||p|| \cdot ||q||}, \textrm{ where } s \in [-1, 1] $$


Compute the 10 nearest neighbors of the words from the above `words` list
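
One possible way to compute nearest neighbors with NumPy is sketched below on a tiny made-up embedding matrix. The function name `nearest_neighbors` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def nearest_neighbors(query, emb, k=10):
    """Indices and cosine similarities of the k rows of `emb` closest to `query`."""
    # Normalizing turns the dot product into cosine similarity
    emb_norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = emb_norm @ query_norm
    top = np.argsort(-sims)[:k]  # sort by decreasing similarity
    return top, sims[top]

# Hypothetical toy embeddings: rows 0 and 1 point in nearly the same direction
toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
idx, sims = nearest_neighbors(toy_emb[0], toy_emb, k=3)
print(idx, sims)
```

When the query is itself a row of the matrix, it comes back first with similarity 1, so ask for `k + 1` neighbors and skip the first one.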

#### Bias

What word appears in the top-10 closest to "woman" and "she"? Compare to "man" and "he".

### Analogies

#### Semantic analogies
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : t" (read: man is to grandfather as woman is to t), what is t?

Writing an analogy as $x:y :: z:t$ ($x$ is to $y$ as $z$ is to?), solve the following analogies using:
$$\max_{t\in V} \cos(t, y-x+z)$$

You'll see that it does not work perfectly. Instead of keeping only the single best $t$, keep the top 5 (as in the nearest neighbors above).

What's the difference between the first analogies `[("France", "Paris", "Germany"), ("France", "Sarkozy", "Germany")]` and the others?

Try your own examples
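
The analogy formula can be sketched with NumPy on hand-made 2D vectors. Everything below (the helper `solve_analogy`, the toy tokens and their coordinates) is an illustrative assumption; in particular, excluding the three query words from the candidates is a design choice, not something the formula requires.

```python
import numpy as np

def solve_analogy(x, y, z, emb, token2id, id2token, topk=5):
    """Rank candidates t by cos(t, y - x + z), excluding the three query words."""
    target = emb[token2id[y]] - emb[token2id[x]] + emb[token2id[z]]
    emb_norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb_norm @ (target / np.linalg.norm(target))
    # Assumption: query words are excluded, otherwise they tend to rank first
    for word in (x, y, z):
        sims[token2id[word]] = -np.inf
    best = np.argsort(-sims)[:topk]
    return [(id2token[int(i)], float(sims[i])) for i in best]

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_tokens = ["man", "woman", "king", "queen", "apple"]
toy_emb = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 1.0], [1.0, -1.0], [-1.0, 0.1]])
token2id = {t: i for i, t in enumerate(toy_tokens)}
id2token = dict(enumerate(toy_tokens))

result = solve_analogy("man", "king", "woman", toy_emb, token2id, id2token, topk=2)
print(result)
```

Here `king - man + woman` lands exactly on the toy vector for "queen", which is why it ranks first; real embeddings are far noisier.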

In [None]:
for x, y, z in [("France", "Paris", "Germany"), ("France", "Sarkozy", "Germany"), ("France", "French", "Germany"), ("dog","puppy","cat"), ("man", "grandfather", "woman"), ("man", "king", "woman"), ("apple", "tree", "grape")]:
    raise NotImplementedError("TODO")

#### Biases

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Using the same code as above, notice how "gynecologist" ranks 2nd for "man is to doctor as woman is to?" and "nurse" ranks 3rd.

Why is that? Try to find other problematic analogies/biases in the embeddings

In [None]:
for x, y, z in [("man", "doctor", "woman")]:
    raise NotImplementedError("TODO")

#### Syntactic analogies

Word embeddings capture both semantic and syntactic analogies! Use the same code as above but for the following word list.

Notice how it works even with irregular verbs like "eat"


Try your own examples

In [None]:
for x, y, z in [
    # derivational morphology: suffixation
    ("speak", "speaker", "sing"),
    ("short", "shortly", "rapid"),
    # derivational morphology: prefixation
    ("like", "unlike", "able"),
    # inflectional morphology: verbs
    ("walking", "walked", "work"),
    ("walking", "walked", "eating"),
    ("walking", "walked", "going"),
    # inflectional morphology: nouns
    ("shoe", "shoes", "table"),
]:
    raise NotImplementedError("TODO")

## Quantitative evaluation using academic benchmarks

### Correlation of similarity with human judgements

#### data

In [None]:
!wget https://staff.fnwi.uva.nl/e.bruni/resources/MEN.zip

In [None]:
!unzip MEN

In [None]:
import pandas as pd

In [None]:
dataset = pd.read_csv("MEN/MEN_dataset_natural_form_full", delimiter=" ",names=["a","b","label"])

In [None]:
dataset

#### evaluation

Compute the cosine similarity of words `a` and `b` for every pair in the dataset, except (of course) those containing a word that is absent from our `vocabulary`.

Compute the Spearman correlation between the cosine similarities you computed and the labels, using `scipy.stats.spearmanr`.
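
As a small worked example of the call, the sketch below runs `scipy.stats.spearmanr` on five made-up pairs of scores (the numbers are invented and are not taken from MEN):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical scores for five word pairs: human ratings and model cosine similarities
human = np.array([50.0, 40.0, 30.0, 20.0, 10.0])
cosine = np.array([0.9, 0.7, 0.8, 0.2, 0.1])

# Spearman compares rankings, so the different scales of the two scores do not matter
rho, pvalue = spearmanr(cosine, human)
print(rho)
```

Only two adjacent ranks are swapped between the two lists, which gives $\rho = 1 - \frac{6 \cdot 2}{5 \cdot 24} = 0.9$.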

In [None]:
import scipy.stats

Compare this value to what is reported in Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing Systems, 27. https://proceedings.neurips.cc/paper/2014/hash/feab05aa91085b7a8012516bc3533958-Abstract.html


Inspect this correlation visually: plot the cosine similarities against the labels

### Accuracy of analogies

#### Data

In [None]:
!wget https://www.fit.vut.cz/person/imikolov/public/rnnlm/word-test.v1.txt

In [None]:
with open("word-test.v1.txt","rt") as file:
    lines = file.read().strip().split("\n")

In [None]:
lines.pop(0)

In [None]:
dataset = []
subset_ok=False
for i, line in enumerate(lines):
    # We keep a single subset to save computation time.
    # As a bonus, compute the results on the entire dataset, or for each subset
    if line[0]==":":
        if line ==": family":
            subset_ok=True
        if line == ": gram1-adjective-to-adverb":
            break
        continue
    if subset_ok:
        x,y,z,t = line.split(" ")
        ok = True
        for token in [x,y,z,t]:
            if token not in vocabulary:
                ok = False
                break
        if ok:
            dataset.append((x,y,z,t))

In [None]:
len(dataset)

In [None]:
for i in np.random.choice(len(dataset), 10):
    x, y, z, t = dataset[i]
    print(f"{x}:{y} :: {z}:{t}")

#### evaluation

Using the same code as above, solve the analogy for $x:y :: z:t$ ($x$ is to $y$ as $z$ is to?) using:
$$\max_{t\in V} \cos(t, y-x+z)$$

For all samples of the dataset

Compute the accuracy (check that you retrieve the correct $t$)

In [None]:
accuracy

Compare this value to what is reported in Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space (arXiv:1301.3781). arXiv. https://doi.org/10.48550/arXiv.1301.3781

Remember we used only the "family" subset, not all semantic analogies.


# Bonus Part

- Try different hyperparameters in Part 1:
  - Vocabulary size $V$
  - `window_size`
  - Number of negative examples $k$
  - embedding dimension $d$

- Try models other than Skipgram in Part 2:
  - [fastText](https://fasttext.cc/docs/en/english-vectors.html)
  - [GloVe](https://nlp.stanford.edu/projects/glove/) (also available via `gensim` used above)
  - CBOW