# training word2vec on the new podcast dataset

this notebook is my first attempt at training a word2vec model on the new podcast dataset alex showed me. it uses gensim for the word2vec model and my own preprocessing file to aggregate the podcast content. 

## setup

we start off by importing all modules necessary for the task. gensim's fast word2vec training relies on its cython-compiled routines, which also let it run on multiple cores; the number of threads is set with the `workers` parameter (see later).

In [1]:
import gensim # for model
import logging          
import cython # for performance (multicore)
import preprocessing # from same directory
import random

to ensure that we'll actually profit from cython, we must check if we're running the fast version of the model:

In [2]:
print(gensim.models.word2vec.FAST_VERSION) # -1 means the slow pure-python fallback; 0 or 1 means the compiled (cython) routines are in use

1


next, we format logs for tracking training progress. `logging.basicConfig()` sets up the base configuration for logging. `level=logging.INFO` sets the minimum logging level to `INFO`: `DEBUG` messages are dropped, while `INFO`, `WARNING`, `ERROR`, and `CRITICAL` messages are shown.

In [3]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) # set logging configuration
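
to see the level threshold in action, here's a minimal sketch using only the standard library. the logger name `demo` and the list-collecting handler are made up for illustration; they aren't part of this notebook's setup.

```python
import logging

class ListHandler(logging.Handler):
    """collects formatted log messages in a list (illustration only)."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))

# hypothetical logger, kept separate from the root logger configured above
demo_logger = logging.getLogger("demo")
demo_logger.setLevel(logging.INFO)   # same threshold as in basicConfig
demo_logger.propagate = False        # keep the demo out of the notebook's own log output

handler = ListHandler()
handler.setFormatter(logging.Formatter("%(levelname)s : %(message)s"))
demo_logger.addHandler(handler)

demo_logger.debug("below the threshold, dropped")
demo_logger.info("at the threshold, kept")
demo_logger.warning("above the threshold, kept")

print(handler.messages)
```

only the `INFO` and `WARNING` messages end up in the list; the `DEBUG` one is filtered out before it reaches the handler.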

next, we use the `random` module to write a random subset of the large podcast datafile to `output.jsonl`, keeping each line with probability 0.0001.

In [7]:
import random

random.seed(42)  # fixed seed so the same subset is drawn on every run
with open('speaker_data.jsonl', 'r') as infile, open('output.jsonl', 'w') as outfile:
    buffer = []
    for row in infile:
        if random.random() < 0.0001:  # keep each line with probability 0.0001
            buffer.append(row)
        if len(buffer) >= 1000:  # flush to disk in chunks of 1000 lines
            outfile.writelines(buffer)
            buffer = []
    if buffer:  # write whatever is left over
        outfile.writelines(buffer)
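
one thing worth knowing about this kind of sampling: since every line is kept independently, the size of the subset is itself random and only roughly equal to probability times number of lines. a small in-memory sketch of the same idea, where the function name, the fake rows and the keep probability are made up for illustration:

```python
import random

def bernoulli_sample(lines, keep_prob, seed=42):
    """keep each line independently with probability keep_prob."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [line for line in lines if rng.random() < keep_prob]

# hypothetical stand-in for the rows of the jsonl file
rows = [f'{{"id": {i}}}\n' for i in range(100_000)]

subset = bernoulli_sample(rows, keep_prob=0.01)
print(len(subset))  # close to 100_000 * 0.01 = 1000, but usually not exactly

# the same seed reproduces the same subset
same_again = bernoulli_sample(rows, keep_prob=0.01)
print(subset == same_again)
```

if an exact sample size were needed, `random.sample` or reservoir sampling would be the alternatives, at the cost of either holding all lines in memory or a slightly more involved loop.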

we then preprocess the data according to the parsing method in the preprocessing module, and out of interest check its length (number of sentences).

In [15]:
sentences = preprocessing.process_json_file_sentences_subsample('speaker_data.jsonl', 0.1)
print(len(sentences))

20767134


## creating the model

now that we have our data, we can initialize a word2vec model using gensim. you could initialize, build vocabulary and train in one step using `model = gensim.models.Word2Vec(sentences, parameters)`, but decomposing it into multiple steps is good for understanding. 

this next command simply initializes the model. gensim's word2vec model has an extensive parameter list. here are some of the simpler parameters, with their default values in parentheses: 

- `sentences` (`None`): list of tokenized sentences to train on
- `vector_size` (`100`): number of dimensions in the word vectors
- `window` (`5`): max context window size (how many words before/after)
- `min_count` (`5`): ignores words that appear fewer times than this
- `sg` (`0`): training model: `0` = CBOW, `1` = skip-gram
- `workers` (`3`): number of CPU threads for training
- `epochs` (`5`): number of training iterations
- `hs` (`0`): `0` = negative sampling, `1` = hierarchical softmax
- `negative` (`5`): number of negative samples (ignored if `hs=1`)
- `sample` (`1e-3`): threshold for downsampling high-frequency words.

and some of the more advanced ones: 

- `alpha` (`0.025`): initial learning rate.
- `min_alpha` (`0.0001`): minimum learning rate.
- `cbow_mean` (`1`): `1` = use mean of context word vectors (CBOW), `0` = sum instead
- `compute_loss` (`False`): `True` = compute and store training loss
- `max_vocab_size` (`None`): limits RAM usage during vocabulary building
- `seed` (`1`): random seed for reproducibility
- `batch_words` (`10000`): batch size for training (affects speed/memory)
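
the `sample` parameter is probably the least intuitive one in these lists. as a rough sketch, assuming gensim follows the original word2vec C implementation here, a word that makes up a fraction `f` of all tokens in the corpus is kept with probability `sqrt(sample / f) + sample / f`, capped at 1. the function name and the example frequencies below are made up for illustration.

```python
import math

def keep_probability(word_fraction, sample=1e-3):
    """probability of keeping one occurrence of a word during training.

    word_fraction: share of all corpus tokens taken up by this word.
    formula as in the original word2vec C code (assumed to match gensim).
    """
    if sample <= 0:
        return 1.0  # downsampling disabled
    ratio = sample / word_fraction
    return min(1.0, math.sqrt(ratio) + ratio)

# hypothetical word frequencies: 5%, 1% and 0.1% of all tokens
for fraction in (0.05, 0.01, 0.001):
    print(fraction, round(keep_probability(fraction), 3))
```

under this formula, a very frequent word (5% of all tokens) keeps only about one in six of its occurrences, while anything at or below 0.1% of the corpus is never dropped with the default `sample=1e-3`.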

In [18]:
model = gensim.models.Word2Vec(
    vector_size=100,
    workers=6
)

2025-03-06 08:32:36,009 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=0, vector_size=100, alpha=0.025>', 'datetime': '2025-03-06T08:32:36.008991', 'gensim': '4.3.3', 'python': '3.11.2 (v3.11.2:878ead1ac1, Feb  7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-15.2-arm64-arm-64bit', 'event': 'created'}


next, we wanna build the model's vocabulary. this step prepares the model for training by scanning the input corpus and building an internal vocabulary dictionary of words. 

In [19]:
model.build_vocab(sentences)  # builds vocabulary

2025-03-06 08:32:39,647 : INFO : collecting all words and their counts
2025-03-06 08:32:39,649 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-03-06 08:32:39,683 : INFO : PROGRESS: at sentence #10000, processed 131549 words, keeping 8603 word types
2025-03-06 08:32:39,703 : INFO : PROGRESS: at sentence #20000, processed 237089 words, keeping 12309 word types
2025-03-06 08:32:39,722 : INFO : PROGRESS: at sentence #30000, processed 345728 words, keeping 15145 word types
2025-03-06 08:32:39,744 : INFO : PROGRESS: at sentence #40000, processed 467330 words, keeping 17898 word types
2025-03-06 08:32:39,764 : INFO : PROGRESS: at sentence #50000, processed 567853 words, keeping 20220 word types
2025-03-06 08:32:39,786 : INFO : PROGRESS: at sentence #60000, processed 687064 words, keeping 22591 word types
2025-03-06 08:32:39,805 : INFO : PROGRESS: at sentence #70000, processed 783983 words, keeping 24008 word types
2025-03-06 08:32:39,827 : INFO : PROGRESS: at s
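
conceptually, the vocabulary scan boils down to counting tokens and dropping the rare ones (that's where `min_count` comes in). a minimal sketch of that idea with `collections.Counter`; this is not gensim's actual implementation, and the toy corpus and the function name are made up:

```python
from collections import Counter

def build_toy_vocab(tokenized_sentences, min_count=5):
    """count every token and keep only those seen at least min_count times."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# hypothetical corpus: each sentence is already a list of tokens
toy_sentences = [
    ["the", "podcast", "was", "great"],
    ["the", "host", "was", "funny"],
    ["the", "guest", "was", "late"],
]

vocab = build_toy_vocab(toy_sentences, min_count=2)
print(vocab)
```

with `min_count=2`, only `the` and `was` survive; every other word appears once and is dropped.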

we are ready to train the model. again, there are some parameters we can play with: 

- `corpus_iterable` (required unless `corpus_file` is given): iterable of tokenized sentences (each sentence is a list of words). this argument was called `sentences` before gensim 4; below it is passed positionally
- `total_examples` (required unless `total_words` is given): total number of sentences in the corpus. typically set to `model.corpus_count`
- `epochs` (required): number of iterations (epochs) over the corpus. typically set to `model.epochs`, which defaults to `5`
- `total_words` (optional): total number of words in the corpus. defaults to `None`, and is generally not necessary if you use `total_examples`
- `queue_factor` (`2`): multiplier for the size of the job queue (number of workers times `queue_factor`). controls how many batches are preloaded into memory
- `compute_loss` (`False`): if `True`, it computes and stores the training loss, which can be read with `model.get_latest_training_loss()`
- `callbacks` (optional): sequence of callbacks (subclasses of `gensim.models.callbacks.CallbackAny2Vec`) for tracking the training progress

note that `workers`, `min_count`, and `seed` are set when initializing the model, not in `train()`, and `reset_weights` is not a `train()` argument in gensim 4.


In [20]:
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

2025-03-06 08:36:00,848 : INFO : Word2Vec lifecycle event {'msg': 'training model with 6 workers on 115470 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-06T08:36:00.848168', 'gensim': '4.3.3', 'python': '3.11.2 (v3.11.2:878ead1ac1, Feb  7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-15.2-arm64-arm-64bit', 'event': 'train'}
2025-03-06 08:36:01,937 : INFO : EPOCH 0 - PROGRESS: at 0.81% examples, 1246433 words/s, in_qsize 10, out_qsize 1
2025-03-06 08:36:02,938 : INFO : EPOCH 0 - PROGRESS: at 1.55% examples, 1196813 words/s, in_qsize 12, out_qsize 1
2025-03-06 08:36:03,948 : INFO : EPOCH 0 - PROGRESS: at 2.48% examples, 1253315 words/s, in_qsize 11, out_qsize 0
2025-03-06 08:36:04,954 : INFO : EPOCH 0 - PROGRESS: at 3.40% examples, 1293028 words/s, in_qsize 11, out_qsize 0
2025-03-06 08:36:05,971 : INFO : EPOCH 0 - PROGRESS: at 4.22% examples, 1282198 words/s, in_qsize 11, out_qsize 0
20

(773121647, 1098769470)
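
the tuple at the end is the return value of `train()`. as far as i understand it, the first number counts the words actually used for training (after dropping out-of-vocabulary words and downsampling frequent ones), and the second is the raw word count summed over all epochs. a quick sanity check on those two numbers, assuming the default of 5 epochs was used:

```python
# values returned by train() in the cell above
effective_words, raw_words = 773121647, 1098769470
epochs = 5  # assumption: model.epochs was left at its default

kept_share = effective_words / raw_words
words_per_epoch = raw_words // epochs

print(f"{kept_share:.1%} of raw words were used for training")
print(f"{words_per_epoch} raw words per epoch")
```

so roughly 70% of the raw tokens made it into training; the rest were either too rare for the vocabulary or dropped by the `sample` downsampling.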

next, we can save the model for later use:

In [21]:
model.save("word2vec_podcast_model")

2025-03-06 08:48:59,472 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'word2vec_podcast_model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2025-03-06T08:48:59.472741', 'gensim': '4.3.3', 'python': '3.11.2 (v3.11.2:878ead1ac1, Feb  7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-15.2-arm64-arm-64bit', 'event': 'saving'}
2025-03-06 08:48:59,482 : INFO : storing np array 'vectors' to word2vec_podcast_model.wv.vectors.npy
2025-03-06 08:48:59,529 : INFO : storing np array 'syn1neg' to word2vec_podcast_model.syn1neg.npy
2025-03-06 08:48:59,554 : INFO : not storing attribute cum_table
2025-03-06 08:49:00,739 : INFO : saved word2vec_podcast_model


finally, extract the word vectors (results in a `KeyedVectors` instance): 

In [22]:
word_vectors = model.wv

## playing around

time to see what this can do. 

In [46]:
vector = word_vectors['computer'] # get a numpy vector of a word
sims = word_vectors.most_similar('mit', topn=10)
print(sims)

[('nyu', 0.8722857236862183), ('harvard', 0.8512818217277527), ('yale', 0.844366192817688), ('caltech', 0.8434088230133057), ('stanford', 0.8279602527618408), ('princeton', 0.8225548267364502), ('uc', 0.8170067071914673), ('university', 0.7867748141288757), ('mellon', 0.7812279462814331), ('berkeley', 0.7789393663406372)]
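
the scores next to each word are cosine similarities between word vectors. a minimal pure-python sketch of that ranking, using made-up 3-dimensional vectors instead of the real 100-dimensional ones (the words, numbers and function names here are illustrative, not taken from the model or from gensim):

```python
import math

def cosine_similarity(u, v):
    """cosine of the angle between two vectors: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(vectors, query, topn=2):
    """rank all other words by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# hypothetical toy embeddings
toy_vectors = {
    "mit": [0.9, 0.1, 0.0],
    "stanford": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

ranking = toy_most_similar(toy_vectors, "mit")
print(ranking)
```

vectors pointing in nearly the same direction score close to 1, unrelated ones close to 0, which is why the other universities cluster at the top of the real output.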


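this evaluates the word vectors on google's word analogy test set (`questions-words.txt`). the first element of the returned tuple is the overall accuracy: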

In [24]:
word_vectors.evaluate_word_analogies('questions-words.txt')

2025-03-06 08:49:26,527 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2025-03-06 08:49:29,302 : INFO : capital-common-countries: 17.1% (79/462)
2025-03-06 08:49:37,013 : INFO : capital-world: 6.5% (116/1780)
2025-03-06 08:49:39,331 : INFO : currency: 1.3% (7/548)
2025-03-06 08:49:49,781 : INFO : city-in-state: 11.3% (271/2394)
2025-03-06 08:49:51,732 : INFO : family: 66.0% (305/462)
2025-03-06 08:49:56,083 : INFO : gram1-adjective-to-adverb: 10.9% (108/992)
2025-03-06 08:49:59,207 : INFO : gram2-opposite: 17.2% (130/756)
2025-03-06 08:50:04,856 : INFO : gram3-comparative: 75.5% (1005/1332)
2025-03-06 08:50:09,146 : INFO : gram4-superlative: 55.1% (618/1122)
2025-03-06 08:50:13,735 : INFO : gram5-present-participle: 73.0% (771/1056)
2025-03-06 08:50:19,327 : INFO : gram6-nationality-adjective: 11.4% (148/1299)
2025-03-06 08:50:26,290 : INFO : gram7-past-tense: 57.0% (889/1560)
2025-03-06 08:50:32,086 : INFO : gram8-plural: 46.2% (616/1332)
2

(0.34682117131224555,
 [{'section': 'capital-common-countries',
   'correct': [('ATHENS', 'GREECE', 'BANGKOK', 'THAILAND'),
    ('ATHENS', 'GREECE', 'BERLIN', 'GERMANY'),
    ('ATHENS', 'GREECE', 'PARIS', 'FRANCE'),
    ('ATHENS', 'GREECE', 'TOKYO', 'JAPAN'),
    ('BAGHDAD', 'IRAQ', 'BERLIN', 'GERMANY'),
    ('BAGHDAD', 'IRAQ', 'KABUL', 'AFGHANISTAN'),
    ('BAGHDAD', 'IRAQ', 'PARIS', 'FRANCE'),
    ('BAGHDAD', 'IRAQ', 'TEHRAN', 'IRAN'),
    ('BAGHDAD', 'IRAQ', 'TOKYO', 'JAPAN'),
    ('BANGKOK', 'THAILAND', 'BERLIN', 'GERMANY'),
    ('BANGKOK', 'THAILAND', 'OSLO', 'NORWAY'),
    ('BANGKOK', 'THAILAND', 'PARIS', 'FRANCE'),
    ('BANGKOK', 'THAILAND', 'TOKYO', 'JAPAN'),
    ('BANGKOK', 'THAILAND', 'ATHENS', 'GREECE'),
    ('BEIJING', 'CHINA', 'BERLIN', 'GERMANY'),
    ('BEIJING', 'CHINA', 'OSLO', 'NORWAY'),
    ('BEIJING', 'CHINA', 'OTTAWA', 'CANADA'),
    ('BEIJING', 'CHINA', 'PARIS', 'FRANCE'),
    ('BEIJING', 'CHINA', 'TOKYO', 'JAPAN'),
    ('BEIJING', 'CHINA', 'BANGKOK', 'THAILAND'),