## Pretrained embeddings

Word embeddings are generally computed from word-occurrence statistics (observations about which words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the Word2vec algorithm (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender.

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.
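
To make "word co-occurrence statistics" concrete, here is a minimal sketch that counts how often two words appear near each other in a toy corpus. The corpus, the function name `cooccurrence_counts`, and the window size are assumptions made for illustration. GloVe itself goes further and fits vectors to statistics of this kind; this sketch covers only the counting step.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Pair each word with the neighbours up to `window` positions to its right,
            # and record the pair in both directions so the counts are symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

# Toy corpus, made up for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("sat", "on")])   # pair seen once in each sentence
print(counts[("cat", "dog")])  # never seen together
```

Words that share many neighbours (here "cat" and "dog") end up with similar rows in the resulting count table, which is the signal an embedding method compresses into dense vectors.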

One of the most widely used sets of pretrained word embeddings is GloVe, which can be downloaded from https://nlp.stanford.edu/projects/glove/.

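The widely used glove.6B.zip release was precomputed from 2014 English Wikipedia. It is an 822 MB zip file containing 100-dimensional embedding vectors for 400,000 words (or non-word tokens). The cells below download a different, newer archive, glove.2024.wikigiga.50d.zip (about 287 MB), which contains 50-dimensional vectors.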

In [1]:
!wget https://nlp.stanford.edu/data/wordvecs/glove.2024.wikigiga.50d.zip

--2025-08-18 08:18:16--  https://nlp.stanford.edu/data/wordvecs/glove.2024.wikigiga.50d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.2024.wikigiga.50d.zip [following]
--2025-08-18 08:18:17--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.2024.wikigiga.50d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 301036094 (287M) [application/zip]
Saving to: ‘glove.2024.wikigiga.50d.zip’


2025-08-18 08:19:10 (5.45 MB/s) - ‘glove.2024.wikigiga.50d.zip’ saved [301036094/301036094]



In [2]:
!mkdir glove
!unzip glove.2024.wikigiga.50d.zip -d glove/

Archive:  glove.2024.wikigiga.50d.zip
  inflating: glove/wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined.txt  


In [1]:
!head -20 glove/wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined.txt

the -0.383396 -0.480674 -0.27414099999999997 0.132697 0.06488499999999997 -0.092489 -0.0029909999999999937 0.168918 0.42769 0.032062000000000035 0.600652 0.0045550000000000035 0.221568 -0.106817 0.339955 -0.24816500000000002 -0.06553399999999998 0.16503099999999998 -0.098915 -0.26706 -0.351612 -0.595836 -0.748328 -0.262191 -0.400183 0.306477 0.13247 0.149532 -0.462982 0.436406 -0.328406 -0.11229700000000001 -0.318158 -0.586177 -0.24174600000000002 -0.38285899999999995 -0.433786 5.955204999999999 0.137239 -0.064516 0.456522 -0.019239000000000006 0.260624 -0.293577 0.19897499999999999 0.277653 0.164552 0.40649 -0.227907 -0.400316
, 0.151841 0.07100100000000001 -0.306629 -0.24564200000000003 0.163437 0.343526 -0.635576 -0.08061000000000007 -0.281144 0.01024799999999998 -0.100773 0.041249999999999995 -0.284503 0.45056999999999997 -0.44594599999999995 -0.058714999999999996 -0.06081500000000001 0.006295999999999996 -0.35699400000000003 -0.10028200000000001 0.270828 0.40327899999999994 0.3603
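
Each line of the file shown above holds a token followed by its vector components, separated by spaces. A minimal sketch of parsing one such line follows. The helper name `parse_glove_line` and the shortened sample line are assumptions for illustration, and the sketch assumes the token itself contains no spaces.

```python
def parse_glove_line(line):
    """Split one line of a GloVe text file into (token, vector)."""
    parts = line.rstrip("\n").split(" ")
    token = parts[0]                        # first field is the word or symbol
    vector = [float(x) for x in parts[1:]]  # remaining fields are the components
    return token, vector

# Hypothetical 4-dimensional line; lines in the real file carry 50 values.
sample = "the -0.383396 -0.480674 -0.274141 0.132697\n"
token, vector = parse_glove_line(sample)
print(token, len(vector), vector[:2])
```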

### Exploring the embeddings

In [4]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)

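## Note

After installing gensim, restart the session so that the package versions installed alongside it take effect, then run only the cells below. The download, unzip, and install steps above do not need to be repeated.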

In [3]:
import pandas as pd
import numpy as np
import gensim

In [4]:
pd.__version__

'2.2.2'

In [5]:
np.__version__

'1.26.4'

In [6]:
gensim.__version__

'4.3.3'

In [7]:
from gensim.models import KeyedVectors

# Path to the unzipped GloVe vectors (plain text, one token per line).
word2vec_output_file = "glove/wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined.txt"

In [8]:
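# GloVe text files lack the "<vocab size> <dimensions>" header line of the word2vec format,
# so no_header=True asks gensim to infer both values from the file.
pretrained_w2v_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False, no_header=True)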

In [9]:
len(pretrained_w2v_model)

1291147
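
Pretrained vectors like these are typically used by copying them into the weight matrix of an embedding layer, one row per word in the task vocabulary. The sketch below shows that step with plain Python lists. The function name `build_embedding_matrix`, the toy vocabulary, and the 3-dimensional vectors are assumptions for illustration. With the model loaded above, `word in pretrained_w2v_model` and `pretrained_w2v_model[word]` could play the role of the dictionary lookup.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Return rows where row i holds the pretrained vector for the word with index i."""
    # Row 0 stays all zeros, following the common convention of reserving index 0 for padding.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = list(vec)  # copy the pretrained vector into its row
        # Words missing from the pretrained vocabulary keep an all-zero row.
    return matrix

# Hypothetical vocabulary and 3-dimensional vectors, made up for illustration.
word_index = {"cat": 1, "dog": 2, "zzyzx": 3}
vectors = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(matrix)
```

In practice the result would usually be converted to a NumPy array before being handed to a deep learning framework as initial embedding weights.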

In [10]:
pretrained_w2v_model.most_similar('bangalore')

[('hyderabad', 0.8623518943786621),
 ('pune', 0.8492687940597534),
 ('chennai', 0.8478153347969055),
 ('ahmedabad', 0.8424375057220459),
 ('delhi', 0.8024435639381409),
 ('mumbai', 0.8013363480567932),
 ('kolkata', 0.8008414506912231),
 ('lahore', 0.7752841114997864),
 ('kanpur', 0.7704088687896729),
 ('calcutta', 0.7690011858940125)]
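
The scores returned by `most_similar` are cosine similarities between word vectors. The following sketch reproduces the idea on a few made-up 2-dimensional vectors. The helper names `cosine_similarity` and `nearest` and the toy values are assumptions for illustration, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-dimensional vectors; the real ones have 50 dimensions.
toy_vectors = {
    "bangalore": [1.0, 0.1],
    "hyderabad": [0.9, 0.2],
    "paris": [0.1, 1.0],
}
neighbours = nearest("bangalore", toy_vectors, topn=2)
print(neighbours)
```

Because cosine similarity depends only on direction, two words score highly when their vectors point the same way, regardless of vector length.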

In [11]:
pretrained_w2v_model.most_similar('dhoni')

[('dravid', 0.9094754457473755),
 ('ganguly', 0.8888667225837708),
 ('yuvraj', 0.8874509930610657),
 ('kumble', 0.8816227316856384),
 ('harbhajan', 0.8652089834213257),
 ('sehwag', 0.856694221496582),
 ('laxman', 0.8446274399757385),
 ('raina', 0.8368719220161438),
 ('tendulkar', 0.8359326720237732),
 ('gambhir', 0.8311445713043213)]

In [12]:
pretrained_w2v_model.most_similar('google')

[('yahoo', 0.8946718573570251),
 ('microsoft', 0.8838525414466858),
 ('yahoo!', 0.8728740215301514),
 ('aol', 0.8662694692611694),
 ('facebook', 0.8487240076065063),
 ('online', 0.833015501499176),
 ('web', 0.8253262639045715),
 ('internet', 0.823790431022644),
 ('msn', 0.8227612376213074),
 ('myspace', 0.8194539546966553)]

In [13]:
pretrained_w2v_model.most_similar('hp')

[('packard', 0.8151170611381531),
 ('ibm', 0.8041609525680542),
 ('hewlett', 0.7981472015380859),
 ('compaq', 0.7965227961540222),
 ('xerox', 0.7596661448478699),
 ('intel', 0.7398403882980347),
 ('ge', 0.7182276248931885),
 ('dell', 0.7047104239463806),
 ('cisco', 0.6985154747962952),
 ('motors', 0.6971229910850525)]

In [14]:
pretrained_w2v_model.most_similar('wikipedia')

[('edits', 0.7263020873069763),
 ('website', 0.7162041664123535),
 ('dictionary', 0.7095853686332703),
 ('bilingual', 0.7037280797958374),
 ('blog', 0.7022649645805359),
 ('translations', 0.7017492651939392),
 ('text', 0.6987094283103943),
 ('encyclopedia', 0.6901077628135681),
 ('blogs', 0.6876814961433411),
 ('websites', 0.6855925917625427)]

In [15]:
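def analogy(a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Vectors for c and b count positively and the vector for a counts negatively.
    result = pretrained_w2v_model.most_similar(positive=[c, b], negative=[a])
    return result[0][0]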

In [16]:
analogy('india', 'indian', 'japan')

'japanese'

In [17]:
analogy('india', 'delhi', 'france')

'paris'

In [18]:
analogy('india', 'dhoni', 'england')

'collingwood'
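
The `analogy` helper relies on vector arithmetic: subtracting the vector for `a` from the vector for `b` and adding the vector for `c` gives a point whose nearest neighbour is taken as the answer. The sketch below shows this on made-up 3-dimensional vectors. The function names and toy values are assumptions for illustration, and the code is a simplified stand-in for what gensim does internally.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by ranking words against b - a + c."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    target = normalize([bv - av + cv for av, bv, cv in zip(unit[a], unit[b], unit[c])])
    best_word, best_score = None, -2.0
    for word, vec in unit.items():
        if word in (a, b, c):
            continue  # the query words themselves are excluded from the answer
        score = sum(x * y for x, y in zip(target, vec))  # cosine, since both are unit length
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Made-up vectors: the second component loosely encodes "is a capital city".
toy_vectors = {
    "india": [1.0, 0.0, 0.2],
    "delhi": [1.0, 1.0, 0.2],
    "france": [0.0, 0.0, 1.0],
    "paris": [0.0, 1.0, 1.0],
    "tokyo": [-1.0, 1.0, 0.0],
}
print(toy_analogy("india", "delhi", "france", toy_vectors))
```

Here the difference between "delhi" and "india" isolates the capital-city direction, and adding it to "france" lands closest to "paris".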