# Word2Vec / Skip-gram Practice (Full Execution from Notebook)

This notebook runs the **entire Word2Vec pipeline** by calling a single function from the modular `w2v` package.  
Each step is briefly explained below.

In [1]:
import sys, os

# Project root; adjust this path to your local checkout
base_dir = "C:/Users/tomas/Desktop/PLN/PRACTICA 2/Practica2_PLN"
# Make the project root and the w2v folder importable
sys.path.insert(0, base_dir)
sys.path.insert(0, os.path.join(base_dir, 'w2v'))
print('Base directory:', base_dir)

Base directory: C:/Users/tomas/Desktop/PLN/PRACTICA 2/Practica2_PLN


## Running the full pipeline
The `run_program()` function performs the following steps:
1. Corpus loading and tokenization  
2. Vocabulary building  
3. (center, context) pair generation  
4. Model training  
5. Analysis of nearest neighbors and analogies
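
Steps 1 to 3 can be sketched in plain Python as follows. This is a minimal illustration, not the package's actual code: the helper names `tokenize`, `build_vocab`, and `generate_pairs` are hypothetical, and whitespace tokenization with lowercasing is an assumption. The `min_count` and `window_size` arguments are meant to mirror the parameters of the same name passed to the pipeline.

```python
from collections import Counter

def tokenize(lines):
    """Lowercase each non-empty line and split on whitespace (naive tokenizer, assumed)."""
    return [line.lower().split() for line in lines if line.strip()]

def build_vocab(sentences, min_count=1):
    """Assign an integer id to every word that occurs at least `min_count` times."""
    counts = Counter(word for sent in sentences for word in sent)
    kept = [w for w, c in counts.most_common() if c >= min_count]
    vocab = {w: i for i, w in enumerate(kept)}
    inv_vocab = {i: w for w, i in vocab.items()}
    return vocab, inv_vocab, counts

def generate_pairs(sentences, vocab, window_size=2):
    """Emit (center_id, context_id) for every word within `window_size` of the center."""
    pairs = []
    for sent in sentences:
        ids = [vocab[w] for w in sent if w in vocab]
        for i, center in enumerate(ids):
            lo = max(0, i - window_size)
            hi = min(len(ids), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # a word is never its own context
                    pairs.append((center, ids[j]))
    return pairs

sentences = tokenize(["el perro ladra en la casa", "el gato maúlla en la silla"])
vocab, inv_vocab, counts = build_vocab(sentences, min_count=1)
pairs = generate_pairs(sentences, vocab, window_size=2)
print(len(vocab), len(pairs))  # 9 36
```

With a window of 2, a six-word sentence yields 18 pairs, because words near the sentence boundary have fewer neighbors, so the two sentences above give 36 pairs over a 9-word vocabulary.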

In [2]:
from word2vec.main import run_program

corpus_path = os.path.join(base_dir, 'resources', 'dataset_word2vec.txt')
# Fall back to a tiny sample corpus if the dataset file is missing
if not os.path.exists(corpus_path):
    sample = [
        "París es la capital de Francia",
        "Madrid es la capital de España",
        "el perro ladra en la casa",
        "el gato maúlla en la silla",
        "el coche del conductor está en la calle"
    ]
    os.makedirs(os.path.dirname(corpus_path), exist_ok=True)
    with open(corpus_path, 'w', encoding='utf-8') as f:
        f.write("\n".join(sample))
    print("Sample corpus created at:", corpus_path)

# Run the full pipeline with the hyperparameters below
model, vocab, inv_vocab, pairs = run_program(
    corpus_path=corpus_path,
    window_size=2,
    embedding_dim=50,
    learning_rate=0.05,
    epochs=100,
    min_count=1
)

**********************************************************************
WORD2VEC - RUNNING THE PROGRAM WITH SKIP-GRAM
**********************************************************************

 Step 1) Load and tokenize the corpus

 Step 2) Build the vocabulary
Vocabulary size: 356
Top 10 words: [('la', 131), ('el', 129), ('es', 37), ('está', 27), ('en', 26), ('un', 20), ('por', 15), ('una', 15), ('al', 14), ('gato', 11)]

 Step 3) Generate the (center, context) pairs
Total pairs: 2440

 Step 4) Train the Skip-Gram model
Epoch 10/100: Loss: 4.1203
Epoch 20/100: Loss: 3.2734
Epoch 30/100: Loss: 2.8909
Epoch 40/100: Loss: 2.7477
Epoch 50/100: Loss: 2.6983
Epoch 60/100: Loss: 2.6871
Epoch 70/100: Loss: 2.6779
Epoch 80/100: Loss: 2.6686
Epoch 90/100: Loss: 2.6606
Epoch 100/100: Loss: 2.6566

**********************************************************************
EMBEDDING ANALYSIS
******************************************************************