<a href="https://colab.research.google.com/github/LorenzoBellomo/InformationRetrieval/blob/main/notebooks/2_TFIDFandEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text processing with vectors
In this lecture we focus on techniques that model text as vectors of floating-point numbers. This makes it easy to process text and to compute similarities between words, sentences, and documents.

In [2]:
!pip install scikit-learn
!pip install nltk



In [3]:
from nltk.tokenize import word_tokenize
import nltk
import numpy as np
import json

nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
!wget https://raw.githubusercontent.com/giusprencunipi/IR-Master/main/data/5articles.json

--2026-03-01 11:09:14--  https://raw.githubusercontent.com/giusprencunipi/IR-Master/main/data/5articles.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12566 (12K) [text/plain]
Saving to: ‘5articles.json’


2026-03-01 11:09:15 (67.6 MB/s) - ‘5articles.json’ saved [12566/12566]



Let's load this JSON file, which contains 5 articles, each with a main text, a title, a publication date, and a news source.

In [5]:
with open("5articles.json", "r") as f:
    articles = json.load(f)

articles

[{'title': 'American Airlines orders 60 Overture supersonic jets',
  'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show.",
  'date': '2022-08-18',
  'source': 'The New York Times'},
 {'title': "Conte: 'Chelsea are not in the race to sign Sanchez'",
  'maintext': 'Antonio Conte. Pic: PA\nHead coach Antonio Conte does not think Chelsea are in the race to sign Arsenal forward Alexis Sanchez.\nSanchez is out of contract this summer and seemed certain to join Manchester City this month.\nBut the Premier League leaders on Monday evening decided to end thei

## Simple bag-of-words vectorizers (Count and TF-IDF)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer # Just counts the occurrences of terms
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Let's run a simple test and apply CountVectorizer and TfidfVectorizer to the titles (5 documents in total).

In [7]:
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
titles = [a["title"] for a in articles]
tfidf_vectors = tfidf_vectorizer.fit_transform(titles)

In [None]:
tfidf_vectors

In [8]:
unique_tokens = {}
for title in titles:
  tokens = title.split()
  for token in tokens:
    if token not in unique_tokens:
      unique_tokens[token] = 1
    else:
      unique_tokens[token] = unique_tokens[token] + 1

len(unique_tokens)

42

In [9]:
unique_tokens = {}
for title in titles:
  tokens = title.split()
  for token in tokens:
      token = token.lower()
      prev_count = unique_tokens.get(token, 0)
      unique_tokens[token] = prev_count + 1

len(unique_tokens)
unique_tokens

{'american': 1,
 'airlines': 1,
 'orders': 1,
 '60': 1,
 'overture': 1,
 'supersonic': 1,
 'jets': 1,
 'conte:': 1,
 "'chelsea": 1,
 'are': 1,
 'not': 1,
 'in': 1,
 'the': 1,
 'race': 1,
 'to': 2,
 'sign': 1,
 "sanchez'": 1,
 'gunman': 1,
 'opens': 1,
 'fire': 1,
 'on': 1,
 'car': 1,
 'just': 1,
 'metres': 1,
 'from': 1,
 'scene': 1,
 'of': 1,
 'hamid': 1,
 'sanambar': 1,
 'murder': 1,
 "'one-punch": 1,
 "killer's": 1,
 'sentence': 1,
 'will': 1,
 'make': 1,
 'others': 1,
 'think': 1,
 "twice'": 1,
 'leclerc': 1,
 'dedicates': 1,
 'win': 1,
 'hubert': 1}

In [10]:
len(list(tfidf_vectorizer.get_feature_names_out()))

43

In [11]:
list_of_features = list(tfidf_vectorizer.get_feature_names_out())
[w for w in list_of_features if w not in unique_tokens.keys()]

['chelsea', 'conte', 'killer', 'one', 'punch', 'sanchez', 'twice']

In [12]:
unique_tokens["conte:"]

1
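
The mismatch between the two vocabularies comes from tokenization: `str.split()` only splits on whitespace and keeps punctuation attached, while scikit-learn's text vectorizers by default extract tokens with the regular expression `(?u)\b\w\w+\b` (runs of two or more word characters). The sketch below reproduces this with the standard `re` module on one of the titles; it illustrates the default token pattern only, not the vectorizer's full pipeline.

```python
import re

title = "Conte: 'Chelsea are not in the race to sign Sanchez'"

# Whitespace splitting keeps punctuation attached to the tokens
whitespace_tokens = title.lower().split()

# Default token pattern used by scikit-learn's text vectorizers
sklearn_like_tokens = re.findall(r"(?u)\b\w\w+\b", title.lower())

print(whitespace_tokens)
print(sklearn_like_tokens)
```

This is why "conte" is a TF-IDF feature while our dictionary contains "conte:" instead.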

Let's now report the TF-IDF scores of some words, adding a dedicated row ("\_\_Document Frequency\_\_") with the number of documents in which each token appears.

In [13]:
import pandas as pd
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), index=titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.loc['__Document Frequency__'] = (tfidf_df > 0).sum()
tfidf_df[['airlines', 'chelsea', 'car', 'murder', 'think', 'one','the', 'to']].sort_index().round(decimals=2)

| | airlines | chelsea | car | murder | think | one | the | to |
|---|---|---|---|---|---|---|---|---|
| 'One-punch killer's sentence will make others think twice' | 0.0 | 0.0 | 0.0 | 0.0 | 0.33 | 0.33 | 0.0 | 0.0 |
| American Airlines orders 60 Overture supersonic jets | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Conte: 'Chelsea are not in the race to sign Sanchez' | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.32 | 0.26 |
| Gunman opens fire on car just metres from scene of Hamid Sanambar murder | 0.0 | 0.0 | 0.28 | 0.28 | 0.0 | 0.0 | 0.0 | 0.0 |
| Leclerc dedicates win to Hubert | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.37 |
| __Document Frequency__ | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |


Let's define a function that reports the top n words in the collection by count (CountVectorizer) and by TF-IDF score.

In [14]:
def get_top_n_words(documents, tfidf_vectorizer, count_vectorizer, top_n = 10):
  # Fit both vectorizers on the same documents (the results are sparse matrices)
  tfidf_vectors, count_vectors = tfidf_vectorizer.fit_transform(documents), count_vectorizer.fit_transform(documents)
  feature_names_tfidf, feature_names_count = tfidf_vectorizer.get_feature_names_out(), count_vectorizer.get_feature_names_out()
  # Sort the stored (document, term) entries by score, highest first.
  # Note: the slice [:-(top_n):-1] yields top_n - 1 entries, and the same term can appear once per document.
  top_indices_tfidf, top_indices_count = np.argsort(tfidf_vectors.data)[:-(top_n):-1], np.argsort(count_vectors.data)[:-(top_n):-1]
  print("TFIDF       -        COUNT")
  for tfidx, cidx in zip(top_indices_tfidf, top_indices_count):
    # .indices maps each stored entry back to its column, i.e. its term
    print("{} ({}) - {} ({})".format(feature_names_tfidf[tfidf_vectors.indices[tfidx]], round(tfidf_vectors.data[tfidx]*100)/100, feature_names_count[count_vectors.indices[cidx]], count_vectors.data[cidx]))

Running it on the titles does not make much sense, so let's run it on a bigger corpus (the main texts). The document count is the same (5), but we can expect a much larger number of tokens.

In [15]:
maintexts = [a["maintext"] for a in articles]
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer, top_n=12)

TFIDF       -        COUNT
the (0.5) - the (49)
the (0.41) - the (26)
the (0.41) - to (25)
the (0.38) - the (22)
to (0.34) - that (22)
of (0.29) - of (21)
area (0.24) - to (20)
concorde (0.24) - his (20)
his (0.24) - and (19)
in (0.23) - the (19)
to (0.23) - was (17)


Stopwords get extremely high scores. This is because the total document count is extremely low (5), so the IDF factor of the formula cannot scale their scores down enough. In this case, we can simply remove all the stopwords from the collection using the built-in "stop_words" parameter.

In [16]:
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words="english")
count_vectorizer = CountVectorizer(input='content', stop_words="english")
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

TFIDF       -        COUNT
area (0.34) - reilly (12)
reilly (0.33) - luke (11)
hubert (0.3) - hall (11)
leclerc (0.3) - ellis (11)
hall (0.3) - said (9)
luke (0.3) - mr (8)
ellis (0.3) - brien (7)
concorde (0.3) - night (6)
chelsea (0.25) - area (6)
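
To see why five documents are too few for IDF to suppress stopwords, it helps to compute the IDF factor by hand. The sketch below assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and compares a term that appears in every document with a term that appears in only one; the helper name `smooth_idf` is made up for this illustration.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF, following scikit-learn's default (smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

for n_docs in (5, 500):
    everywhere = smooth_idf(n_docs, n_docs)  # e.g. a stopword present in all documents
    rare = smooth_idf(n_docs, 1)             # a term present in a single document
    print(f"n={n_docs}: idf(everywhere)={everywhere:.2f}, idf(rare)={rare:.2f}")
```

With 5 documents a rare term gets only about twice the IDF of a term found everywhere, so a stopword with a term frequency of 49 still dominates; with 500 documents the gap is considerably larger.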


In [17]:
!wget https://raw.githubusercontent.com/giusprencunipi/IR-Master/main/data/500news.json

--2026-03-01 11:11:32--  https://raw.githubusercontent.com/giusprencunipi/IR-Master/main/data/500news.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 147867 (144K) [text/plain]
Saving to: ‘500news.json’


2026-03-01 11:11:33 (5.33 MB/s) - ‘500news.json’ saved [147867/147867]



In [18]:
with open("500news.json", "r") as f:
    news = json.load(f)
news[0]["maintext"]

'Victims are civilians: the attacker took his own life'

In [19]:
maintexts = [a["maintext"] for a in news]
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

TFIDF       -        COUNT
regionals (0.98) - the (12)
episodeth (0.77) - the (11)
research (0.73) - the (11)
adjourned (0.67) - the (10)
california (0.67) - the (10)
38 (0.66) - the (9)
progress (0.65) - the (9)
happened (0.61) - the (9)
february (0.61) - the (9)


In [20]:
tfidf_vectorizer.fit_transform(maintexts)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 10383 stored elements and shape (500, 3476)>

Further below, we will also read the book "Alice in Wonderland" and run the Count and TF-IDF vectorizers on it.

## Making queries with TF-IDF
For queries, we also need to "vectorize" the query. Let's try with "cars".

In [21]:
query = "cars"
maintexts = [a["maintext"] for a in articles]
tfidf_vectors = tfidf_vectorizer.fit_transform(maintexts) # here we rerun the vectorizer for the maintexts of the articles
query_vector = tfidf_vectorizer.transform([query]) # here we create the vector for the query ("cars")
cosine_similarities = cosine_similarity(query_vector, tfidf_vectors).flatten() # compute all cosine similarities
print(cosine_similarities)
top_indices = np.argsort(cosine_similarities)[::-1][:3] # sort them decreasingly and limit to the top 3 most similar
print("Top 3 matching documents with \"{}\":".format(query))
for index in top_indices:
    print(f"\nScore: {cosine_similarities[index]:.4f} - {maintexts[index][:200]}...")

[0.         0.         0.         0.         0.04565353]
Top 3 matching documents with "cars":

Score: 0.0457 - Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romping to victory at the Belgian Grand Prix.
Less than 24 hours after Leclerc's French motor racing contempor...

Score: 0.0000 - Luke O'Reilly with his mother Janet O'Brien Luke O'Reilly Jack Hall Ellis The Metro One Bar in Tallaght, where Hall Ellis had earlier accused Luke O'Reilly of talking to his girlfriend
The mother of a...

Score: 0.0000 - Hamid Sanambar
Gardai are hunting for a gunman who opened fire on a car in north Dublin - just metres from where Hamid Sanambar was gunned down last week.
Emergency services were alerted to reports of...


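Why does only the Charles Leclerc article (which corresponds to maintexts[4]) get a nonzero score, while the article about the gunman who opened fire on a car gets 0? Let's check which forms of the word actually appear in maintexts[4].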

In [22]:
print("Car" in maintexts[4]) # the capitalized "Car" does not appear in the maintext
print(" car " in maintexts[4]) # neither does the singular " car "
print("Cars" in maintexts[4]) # nor the capitalized plural "Cars"
print(" cars " in maintexts[4]) # only the lowercase plural " cars " is present, matching the query

False
False
False
True
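
The query "cars" only matches documents that contain the exact token "cars"; the article about the gunman mentions a "car", which is a different term for the vectorizer. One common remedy is to stem both the documents and the query before vectorizing. The sketch below uses NLTK's `PorterStemmer` on three made-up toy sentences (not the articles above); the helper `stem_text` is hypothetical and only meant to show the idea.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def stem_text(text):
    """Lowercase, keep alphanumeric tokens, and replace each token by its stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens)

# Toy documents, invented for this example
toy_docs = [
    "A gunman opened fire on a car",
    "The two cars collided on lap two",
    "The airline ordered new jets",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform([stem_text(d) for d in toy_docs])

# The query goes through the same preprocessing as the documents
query_vector = vectorizer.transform([stem_text("cars")])
scores = cosine_similarity(query_vector, doc_vectors).flatten()
print(scores)
```

After stemming, both "car" and "cars" map to the same term, so the first two toy documents match the query and the third does not.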


In [23]:
!wget https://raw.githubusercontent.com/giusprencunipi/IR-Master/main/data/alice.txt

--2026-03-01 11:12:49--  https://raw.githubusercontent.com/giusprencunipi/IR-Master/main/data/alice.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 151255 (148K) [text/plain]
Saving to: ‘alice.txt’


2026-03-01 11:12:49 (7.29 MB/s) - ‘alice.txt’ saved [151255/151255]



In [24]:
with open("alice.txt", 'r') as alice_file:
  alice = alice_file.read().lower()
sentences = [a for a in alice.split('\n') if a]
print(sentences[:10])
print(len(sentences))
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
get_top_n_words(sentences, tfidf_vectorizer, count_vectorizer)

["\ufeff\ufeff*** start of the project gutenberg ebook alice's adventures in", 'wonderland ***', '[illustration]', 'alice’s adventures in wonderland', 'by lewis carroll', 'the millennium fulcrum edition 3.0', 'contents', ' chapter i.     down the rabbit-hole', ' chapter ii.    the pool of tears', ' chapter iii.   a caucus-race and a long tale']
2498
TFIDF       -        COUNT
cheered (1.0) - you (5)
illustration (1.0) - you (5)
wonderland (1.0) - not (5)
off (1.0) - not (5)
too (1.0) - you (5)
sighing (1.0) - the (5)
think (1.0) - mouse (5)
yourself (1.0) - you (5)
chapter (1.0) - the (4)


In [25]:
def run_query(tfidf_matrix, tfidf_vectorizer, documents, query, top_n=3):
  query_vector = tfidf_vectorizer.transform([query])
  cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten() # compute all cosine similarities
  top_indices = np.argsort(cosine_similarities)[::-1][:top_n]
  print("Top {} matching documents with \"{}\":".format(top_n, query))
  for index in top_indices:
      print(f"\nScore: {cosine_similarities[index]:.4f} - {documents[index][:200]}...")

In [26]:
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words="english")
tfidf_vectors = tfidf_vectorizer.fit_transform(sentences)
run_query(tfidf_vectors, tfidf_vectorizer, sentences, "alice rabbit")

Top 3 matching documents with "alice rabbit":

Score: 0.5061 - alice....

Score: 0.4960 - down the rabbit-hole...

Score: 0.4470 - “we must burn the house down!” said the rabbit’s voice; and alice...


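Note that the top results are very short fragments: the book was split on line breaks rather than into sentences, and very short lines that contain a query term get a high cosine score. This shows the importance of proper preprocessing (see lecture 1).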

## BM25, another bag-of-words metric

In [27]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [28]:
tokenized_corpus = [doc.split() for doc in maintexts]
bm25 = BM25Okapi(tokenized_corpus)

In [29]:
print("BM25 score of \"car\"\n")
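# Caution: get_scores expects a list of query tokens. A bare string is iterated
# character by character ('c', 'a', 'r'), which explains why every document below
# gets a nonzero score; use bm25.get_scores(["car"]) to score the token "car" itself.
scores = bm25.get_scores("car")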
for title, score in zip(titles, scores):
  print(title, " - ", score)

BM25 score of "car"

American Airlines orders 60 Overture supersonic jets  -  0.36056091095119297
Conte: 'Chelsea are not in the race to sign Sanchez'  -  0.49688407493580367
Gunman opens fire on car just metres from scene of Hamid Sanambar murder  -  0.4793952947855894
'One-punch killer's sentence will make others think twice'  -  0.4842349898476344
Leclerc dedicates win to Hubert  -  0.5021737251539523
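
To make the scoring explicit, here is a minimal from-scratch sketch of one common BM25 formulation, run on made-up toy documents. The function name `bm25_scores` and the values k1=1.5 and b=0.75 are assumptions for this illustration; rank_bm25's `BM25Okapi` uses a slightly different IDF term, so absolute values will differ from the library's output.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against a list of query tokens."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(t for d in tokenized_docs for t in set(d))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "a gunman opened fire on a car",
    "the airline ordered new jets",
    "the car was parked near the scene",
]
tokenized = [d.split() for d in toy_docs]

token_scores = bm25_scores(["car"], tokenized)     # query as a list of tokens
char_scores = bm25_scores(list("car"), tokenized)  # what a bare string turns into
print(token_scores)
print(char_scores)
```

With the tokenized query, only the two documents containing "car" get a positive score. With the character list, the score comes solely from the article "a" in the first document.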


## Word Embeddings and word2vec
Let's now move to more advanced vectorization techniques. These techniques use Machine Learning to learn the patterns in which words tend to co-occur.

In [30]:
!pip install gensim
import gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [31]:
alice_tokens = []
for sentence in nltk.sent_tokenize(alice):
  sentence_tokens = []
  for w in word_tokenize(sentence):
    sentence_tokens.append(w.lower())
  alice_tokens.append(sentence_tokens)
alice_tokens[500]

['“',
 'yes',
 ',',
 'but',
 'some',
 'crumbs',
 'must',
 'have',
 'got',
 'in',
 'as',
 'well',
 ',',
 '”',
 'the',
 'hatter',
 'grumbled',
 ':',
 '“',
 'you',
 'shouldn',
 '’',
 't',
 'have',
 'put',
 'it',
 'in',
 'with',
 'the',
 'bread-knife.',
 '”',
 'the',
 'march',
 'hare',
 'took',
 'the',
 'watch',
 'and',
 'looked',
 'at',
 'it',
 'gloomily',
 ':',
 'then',
 'he',
 'dipped',
 'it',
 'into',
 'his',
 'cup',
 'of',
 'tea',
 ',',
 'and',
 'looked',
 'at',
 'it',
 'again',
 ':',
 'but',
 'he',
 'could',
 'think',
 'of',
 'nothing',
 'better',
 'to',
 'say',
 'than',
 'his',
 'first',
 'remark',
 ',',
 '“',
 'it',
 'was',
 'the',
 '_best_',
 'butter',
 ',',
 'you',
 'know.',
 '”',
 'alice',
 'had',
 'been',
 'looking',
 'over',
 'his',
 'shoulder',
 'with',
 'some',
 'curiosity',
 '.']

In [32]:
len(alice_tokens)

981

The two models for Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram.
CBOW aims to predict the i-th token from a window that specifies its context. Skip-Gram instead performs the opposite task (it predicts the context from the current word).

In [33]:
# CBOW model
cbow_model = gensim.models.Word2Vec(alice_tokens, min_count=1,
                                vector_size=100, window=5)
# Skip Gram model
skipgram_model = gensim.models.Word2Vec(alice_tokens, min_count=1, vector_size=100,
                                window=5, sg=1)
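
To make the difference between the two objectives concrete, the sketch below builds the training examples each model would see for a short token list. The helper `training_pairs` is hypothetical; real implementations such as gensim add further details (for example negative sampling) that are omitted here.

```python
def training_pairs(tokens, window=2):
    """Build (context, target) examples for CBOW and (target, context word) pairs for Skip-Gram."""
    cbow_pairs, skipgram_pairs = [], []
    for i, target in enumerate(tokens):
        # Words up to `window` positions to the left and right of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow_pairs.append((context, target))                 # context -> word
        skipgram_pairs.extend((target, c) for c in context)  # word -> each context word
    return cbow_pairs, skipgram_pairs

tokens = ["alice", "was", "beginning", "to", "get", "very", "tired"]
cbow_pairs, skipgram_pairs = training_pairs(tokens, window=2)

print(cbow_pairs[2])       # (['alice', 'was', 'to', 'get'], 'beginning')
print(skipgram_pairs[:4])
```

CBOW gets one example per position, with the whole window as input, while Skip-Gram gets one example per (word, context word) pair.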

In [34]:
print("Cosine similarity between 'alice' " + "and 'wonderland' - CBOW : ",
      cbow_model.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' " + "and 'wonderland' - SkipGram : ",
      skipgram_model.wv.similarity('alice', 'wonderland'))

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.98197067
Cosine similarity between 'alice' and 'wonderland' - SkipGram :  0.6709046


In [35]:
print("Cosine similarity between 'alice' " + "and 'gloomily' - CBOW : ",
      cbow_model.wv.similarity('alice', 'gloomily'))
print("Cosine similarity between 'alice' " + "and 'gloomily' - SkipGram : ",
      skipgram_model.wv.similarity('alice', 'gloomily'))

Cosine similarity between 'alice' and 'gloomily' - CBOW :  0.9539487
Cosine similarity between 'alice' and 'gloomily' - SkipGram :  0.88177806


In [36]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [37]:
word2vec_precomputed_model = gensim.downloader.load('word2vec-google-news-300')



In [38]:
word2vec_precomputed_model.most_similar('sport')

[('sports', 0.6914728283882141),
 ('Snooki_wannabes', 0.5916634798049927),
 ('painkillers_throat_lozenges', 0.5643172264099121),
 ('racing', 0.5616023540496826),
 ('sporting', 0.559779703617096),
 ('athletics', 0.5516576766967773),
 ('alpine_ski_racing', 0.5514240264892578),
 ('Pole_vaulting', 0.5459784269332886),
 ('motorsport', 0.5384281277656555),
 ('boxing', 0.5330564379692078)]

In [39]:
# get the 5 words most similar to "alice"
cbow_model.wv.most_similar('alice', topn=5)

[('the', 0.999674379825592),
 (':', 0.9996700286865234),
 ('and', 0.9996694326400757),
 ('that', 0.9996629357337952),
 ('“', 0.999659538269043)]

Now let's see how to handle phrases with word2vec. This is not the recommended solution, as "full-phrase" models like doc2vec have been shown to outperform word2vec.
We can handle phrases as lists of word2vec vectors and combine them with simple mathematical operations (e.g., sum, average, subtraction).

In [40]:
query_phrase = "sport in italy"
#sum the vectors of the individual words
query_vector_sum = np.zeros(300)
for word in query_phrase.split():
  query_vector_sum += word2vec_precomputed_model.get_vector(word)

In [41]:
print("Cosine similarity with 'football' - Google News (SUM) : ",
      cosine_similarity([query_vector_sum], [word2vec_precomputed_model.get_vector("football")])[0][0])
print("Cosine similarity with 'hockey' - Google News (SUM) : ",
      cosine_similarity([query_vector_sum], [word2vec_precomputed_model.get_vector('hockey')])[0][0])
print("Cosine similarity with 'politics' - Google News (SUM) : ",
      cosine_similarity([query_vector_sum], [word2vec_precomputed_model.get_vector('politics')])[0][0])

Cosine similarity with 'football' - Google News (SUM) :  0.43777203627070355
Cosine similarity with 'hockey' - Google News (SUM) :  0.41549963160203685
Cosine similarity with 'politics' - Google News (SUM) :  0.24661070953269673
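
A side note on the choice between sum and average: because cosine similarity ignores vector length, the two give the same similarity scores. The sketch below checks this with random stand-in vectors (not real word2vec embeddings), so it runs without downloading the model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(3, 300))  # stand-ins for the 3 word vectors of a phrase
candidate = rng.normal(size=300)          # stand-in for the vector of a candidate word

phrase_sum = word_vectors.sum(axis=0)
phrase_mean = word_vectors.mean(axis=0)   # the sum divided by 3

print(cosine(phrase_sum, candidate), cosine(phrase_mean, candidate))
```

The two values coincide up to floating-point error. The choice matters only for measures that depend on vector length, such as Euclidean distance or the dot product.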


## Other types of embeddings (Entity and Graph embeddings)

We can apply the same idea to entity embeddings, here using Wikipedia2Vec, which learns embeddings of words and entities from Wikipedia.

In [None]:
!pip install wikipedia2vec
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/enwiki_20180420_100d_part.txt

In [None]:
from wikipedia2vec import Wikipedia2Vec
wiki2vec = Wikipedia2Vec.load_text("enwiki_20180420_100d_part.txt")

In [None]:
wiki2vec.most_similar(wiki2vec.get_word('the'), 5)

In [None]:
wiki2vec.most_similar(wiki2vec.get_word('biology'), 5)

Embeddings can also be learned for the nodes of a graph.

In [None]:
!pip install networkx node2vec
import networkx as nx
from node2vec import Node2Vec

Node2Vec samples random walks over the graph: here each walk has a length of 30, and 200 walks are started from each node.

In [None]:
G = nx.fast_gnp_random_graph(n=100, p=0.5)
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)
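
The walks play the role of sentences: each walk is a sequence of node ids that is then fed to a word2vec-style model. The sketch below generates uniform random walks on a tiny hand-written graph; it is a simplification, since node2vec biases its walks with the return and in-out parameters p and q, which are left out here. The helper `random_walk` and the toy adjacency list are made up for this illustration.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Uniform random walk visiting at most `walk_length` nodes."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy undirected graph as an adjacency list
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(42)

# Two walks of length 5 starting from every node
walks = [random_walk(adjacency, node, walk_length=5, rng=rng)
         for node in adjacency for _ in range(2)]

# Node ids become string "words", one "sentence" per walk
sentences = [[str(n) for n in walk] for walk in walks]
print(sentences[0])
```

Nodes that often appear close together in these walks end up with similar embeddings, just as words that co-occur do in word2vec.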

In [None]:
model = node2vec.fit(window=10, min_count=1, batch_words=4)

In [None]:
model.wv.save_word2vec_format("embeddings_node2vec.txt")

In [None]:
embeddings = {str(node): model.wv[str(node)] for node in G.nodes()}

In [None]:
embeddings["0"]