<a href="https://colab.research.google.com/github/LorenzoBellomo/InformationRetrieval/blob/main/notebooks/2_TFIDFandEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text processing with vectors
In this lecture we focus on techniques that model text as vectors of floating-point numbers. This allows us to easily process text and compute similarities between words, sentences, and documents.

In [None]:
!pip install scikit-learn
!pip install nltk

In [None]:
from nltk.tokenize import word_tokenize
import nltk
import numpy as np
import json

nltk.download('stopwords')
nltk.download('punkt_tab')

In [None]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json

Let's load this JSON file containing 5 articles, each comprising a main text, a title, a publication date, and a news source.

In [None]:
with open("5articles.json", "r") as f:
    articles = json.load(f)

articles

## Simple bag-of-words vectorizers (Count and TF-IDF)

In [16]:
from sklearn.feature_extraction.text import CountVectorizer # Just counts the occurrences of terms
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Let's run a simple test and apply CountVectorizer and TfidfVectorizer to the titles (5 documents in total).

In [18]:
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
titles = [a["title"] for a in articles]
tfidf_vectors = tfidf_vectorizer.fit_transform(titles)

In [None]:
tfidf_vectors

In [None]:
unique_tokens = {}
for title in titles:
  tokens = title.split()
  for token in tokens:
    if token not in unique_tokens:
      unique_tokens[token] = 1
    else:
      unique_tokens[token] = unique_tokens[token] + 1

len(unique_tokens)

In [None]:
unique_tokens = {}
for title in titles:
  tokens = title.split()
  for token in tokens:
      token = token.lower()
      prev_count = unique_tokens.get(token, 0)
      unique_tokens[token] = prev_count + 1

len(unique_tokens)
unique_tokens

In [None]:
len(list(tfidf_vectorizer.get_feature_names_out()))

In [None]:
list_of_features = list(tfidf_vectorizer.get_feature_names_out())
[w for w in list_of_features if w not in unique_tokens.keys()]

In [None]:
unique_tokens["conte:"]
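A mismatch like the one above most likely comes from tokenization: `str.split()` keeps punctuation attached to words, whereas scikit-learn's vectorizers by default lowercase the text and extract tokens with the pattern `(?u)\b\w\w+\b` (runs of two or more word characters). The following is a minimal sketch of that difference; the title string and the helper names are made up for illustration and are not taken from the dataset.

```python
import re

# Hypothetical title, used only to illustrate the two tokenization strategies
example_title = "Conte: we think Chelsea can win"

def split_tokens(text):
    # Whitespace tokenization: punctuation stays attached to the word
    return text.lower().split()

def regex_tokens(text):
    # Same pattern as scikit-learn's default token_pattern
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

by_split = split_tokens(example_title)
by_regex = regex_tokens(example_title)
print(by_split)
print(by_regex)
```

With whitespace splitting the first token keeps its colon, so it can never match the punctuation-free feature produced by the vectorizer.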

Let's now report the TF-IDF scores of some words, adding a dedicated row ("\_\_Document Frequency\_\_") with the number of documents in which each token appears.

In [None]:
import pandas as pd
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), index=titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.loc['__Document Frequency__'] = (tfidf_df > 0).sum()
tfidf_df[['airlines', 'chelsea', 'car', 'murder', 'think', 'one','the', 'to']].sort_index().round(decimals=2)

Let's define a function that reports the top n words in the collection by count (CountVectorizer) and by TF-IDF score.

In [44]:
def get_top_n_words(documents, tfidf_vectorizer, count_vectorizer, top_n=10):
  # Fit both vectorizers on the same documents
  tfidf_vectors = tfidf_vectorizer.fit_transform(documents)
  count_vectors = count_vectorizer.fit_transform(documents)
  feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
  feature_names_count = count_vectorizer.get_feature_names_out()
  # Positions of the top_n largest stored entries (one entry per document-term pair)
  top_indices_tfidf = np.argsort(tfidf_vectors.data)[::-1][:top_n]
  top_indices_count = np.argsort(count_vectors.data)[::-1][:top_n]
  print("TFIDF       -        COUNT")
  for tfidx, cidx in zip(top_indices_tfidf, top_indices_count):
    print("{} ({}) - {} ({})".format(feature_names_tfidf[tfidf_vectors.indices[tfidx]], round(tfidf_vectors.data[tfidx]*100)/100, feature_names_count[count_vectors.indices[cidx]], count_vectors.data[cidx]))

Running it on the titles does not make much sense, so let's run it on a bigger corpus (the main texts). The document count is the same (5), but we can expect a larger number of tokens.

In [None]:
maintexts = [a["maintext"] for a in articles]
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer, top_n=12)

Stopwords get an extremely high score. This is because the total document count is extremely low (5), which makes it impossible for the IDF factor of the formula to properly scale down their scores. In this case, we can simply remove all the stopwords from the collection using the built-in "stop_words" parameter.

In [None]:
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words="english")
count_vectorizer = CountVectorizer(input='content', stop_words="english")
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)
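To make the effect of the collection size on IDF concrete, here is a minimal sketch of the smoothed IDF that scikit-learn's TfidfVectorizer uses by default (`smooth_idf=True`), namely idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper name `smoothed_idf` and the document counts are chosen for illustration only.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# A stopword appears in every document, a rare word in only one
idf_stopword_small = smoothed_idf(5, 5)
idf_rare_small = smoothed_idf(5, 1)
idf_stopword_large = smoothed_idf(500, 500)
idf_rare_large = smoothed_idf(500, 1)

print(f"5 documents:   stopword {idf_stopword_small:.2f}, rare word {idf_rare_small:.2f}")
print(f"500 documents: stopword {idf_stopword_large:.2f}, rare word {idf_rare_large:.2f}")
```

With 5 documents a rare word is weighted only about twice as much as a stopword, so the much higher term frequency of stopwords dominates. With 500 documents the gap grows to more than six times.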

In [None]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/500news.json

In [None]:
with open("500news.json", "r") as f:
    news = json.load(f)
news[0]["maintext"]

In [None]:
maintexts = [a["maintext"] for a in news]
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
get_top_n_words(maintexts, tfidf_vectorizer, count_vectorizer)

In [None]:
tfidf_vectorizer.fit_transform(maintexts)

Later in this notebook we will also read the book "Alice in Wonderland" and run the Count and TF-IDF vectorizers on it.

## Making queries with TF-IDF
To run a query, we also need to "vectorize" the query itself. Let's try with "cars".

In [None]:
query = "cars"
maintexts = [a["maintext"] for a in articles]
tfidf_vectors = tfidf_vectorizer.fit_transform(maintexts) # here we rerun the vectorizer for the maintexts of the articles
query_vector = tfidf_vectorizer.transform([query]) # here we create the vector for the query ("cars")
cosine_similarities = cosine_similarity(query_vector, tfidf_vectors).flatten() # compute all cosine similarities
print(cosine_similarities)
top_indices = np.argsort(cosine_similarities)[::-1][:3] # sort them decreasingly and limit to the top 3 most similar
print("Top 3 matching documents with \"{}\":".format(query))
for index in top_indices:
    print(f"\nScore: {cosine_similarities[index]:.4f} - {maintexts[index][:200]}...")

Why do we get a score of 0 for the "Charles Leclerc" article (which corresponds to maintexts[1])?

In [None]:
print("Car" in maintexts[4]) # as we see, only "Car", with uppercase C, is present in the maintext
print(" car " in maintexts[4])
print("Cars" in maintexts[4]) # the plural form, with uppercase C
print(" cars " in maintexts[4])

In [None]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/alice.txt

In [None]:
with open("alice.txt", 'r') as alice_file:
  alice = alice_file.read().lower()
sentences = [a for a in alice.split('\n') if a]
print(sentences[:10])
print(len(sentences))
tfidf_vectorizer = TfidfVectorizer(input='content')
count_vectorizer = CountVectorizer(input='content')
get_top_n_words(sentences, tfidf_vectorizer, count_vectorizer)

In [88]:
def run_query(tfidf_matrix, tfidf_vectorizer, documents, query, top_n=3):
  query_vector = tfidf_vectorizer.transform([query])
  cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten() # compute all cosine similarities
  top_indices = np.argsort(cosine_similarities)[::-1][:top_n]
  print("Top {} matching documents with \"{}\":".format(top_n, query))
  for index in top_indices:
      print(f"\nScore: {cosine_similarities[index]:.4f} - {documents[index][:200]}...")

In [None]:
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words="english")
tfidf_vectors = tfidf_vectorizer.fit_transform(sentences)
run_query(tfidf_vectors, tfidf_vectorizer, sentences, "alice rabbit")

This is to show the importance of running proper preprocessing algorithms. Remember lecture 1.
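As a rough illustration of why preprocessing matters for bag-of-words retrieval: matching is done on exact tokens, so the query "cars" shares no feature with a document that only contains "car". The sketch below uses a deliberately crude, hypothetical `naive_stem` helper as a stand-in for a real stemmer or lemmatizer, and a made-up document.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def naive_stem(token):
    # Crude stand-in for a real stemmer: drop a trailing "s" from longer tokens
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

# Hypothetical document and query, for illustration only
doc = "Car sales fell as the new car struggled."
query = "cars"

raw_match = set(tokenize(query)) & set(tokenize(doc))
stemmed_match = {naive_stem(t) for t in tokenize(query)} & {naive_stem(t) for t in tokenize(doc)}

print("Shared tokens without stemming:", raw_match)
print("Shared tokens with stemming:", stemmed_match)
```

Without normalization the two texts share no token, so their cosine similarity would be 0. After mapping both to a common root they overlap on "car".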

## BM25, another bag-of-words metric

In [None]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi

In [92]:
tokenized_corpus = [doc.split() for doc in maintexts]
bm25 = BM25Okapi(tokenized_corpus)

In [None]:
print("BM25 score of \"car\"\n")
scores = bm25.get_scores(["car"]) # the query must be a list of tokens, like the documents in the corpus
for title, score in zip(titles, scores):
  print(title, " - ", score)
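To show what a BM25 score is made of, here is a minimal from-scratch sketch of the Okapi BM25 formula. It assumes the common IDF variant ln(1 + (N - df + 0.5) / (df + 0.5)) and the parameters k1 = 1.5 and b = 0.75, so the numbers are not guaranteed to match those of `rank_bm25`. The function name and the toy corpus are illustrative.

```python
import math
from collections import Counter

def bm25_scores(tokenized_corpus, query_tokens, k1=1.5, b=0.75):
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    term_freqs = [Counter(doc) for doc in tokenized_corpus]
    scores = []
    for doc, tf in zip(tokenized_corpus, term_freqs):
        score = 0.0
        for term in query_tokens:
            # Document frequency: number of documents containing the term
            df = sum(1 for counts in term_freqs if term in counts)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            f = tf[term]
            # Term frequency saturates (k1) and is normalized by document length (b)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * f * (k1 + 1) / (f + length_norm)
        scores.append(score)
    return scores

# Toy corpus, for illustration only
toy_corpus = [
    ["the", "car", "is", "fast"],
    ["the", "match", "was", "long"],
    ["car", "car", "race"],
]
toy_scores = bm25_scores(toy_corpus, ["car"])
print(toy_scores)
```

The document without the query term scores 0, and the short document that mentions "car" twice scores higher than the longer one that mentions it once.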

## Word Embeddings and word2vec
Let's now move to more advanced vectorization techniques. These techniques use Machine Learning to learn the patterns in which words tend to co-occur.

In [None]:
!pip install gensim
import gensim

In [None]:
alice_tokens = []
for sentence in nltk.sent_tokenize(alice):
  sentence_tokens = []
  for w in word_tokenize(sentence):
    sentence_tokens.append(w.lower())
  alice_tokens.append(sentence_tokens)
alice_tokens[500]

In [None]:
len(alice_tokens)

The two models for Word2Vec are CBOW (Continuous Bag of Words Model) and Skip-Gram.
CBOW aims to predict the i-th token from a window that specifies its context. Skip-Gram instead performs the opposite task (it predicts the context from the current word).

In [101]:
# CBOW model
cbow_model = gensim.models.Word2Vec(alice_tokens, min_count=1,
                                vector_size=100, window=5)
# Skip Gram model
skipgram_model = gensim.models.Word2Vec(alice_tokens, min_count=1, vector_size=100,
                                window=5, sg=1)
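The difference between the two objectives can be sketched by listing the training examples each one would extract from a sentence. This is a simplified illustration, not gensim's actual implementation (which, among other things, samples the window size and uses negative sampling); the helper names and the toy sentence are assumptions.

```python
def context_window(tokens, i, window):
    # Tokens at most `window` positions to the left and right of position i
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window=2):
    # CBOW: (context words) -> target word
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    # Skip-Gram: target word -> one context word per example
    return [(tokens[i], c) for i in range(len(tokens)) for c in context_window(tokens, i, window)]

toy_sentence = ["alice", "was", "very", "tired"]
cbow_pairs = cbow_examples(toy_sentence, window=1)
skipgram_pairs = skipgram_examples(toy_sentence, window=1)
print(cbow_pairs)
print(skipgram_pairs)
```

CBOW produces one example per position with the whole context as input, while Skip-Gram produces one example per (word, context word) pair.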

In [None]:
print("Cosine similarity between 'alice' " + "and 'wonderland' - CBOW : ",
      cbow_model.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' " + "and 'wonderland' - SkipGram : ",
      skipgram_model.wv.similarity('alice', 'wonderland'))

In [None]:
print("Cosine similarity between 'alice' " + "and 'gloomily' - CBOW : ",
      cbow_model.wv.similarity('alice', 'gloomily'))
print("Cosine similarity between 'alice' " + "and 'gloomily' - SkipGram : ",
      skipgram_model.wv.similarity('alice', 'gloomily'))

In [None]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

In [None]:
word2vec_precomputed_model = gensim.downloader.load('word2vec-google-news-300')

In [None]:
word2vec_precomputed_model.most_similar('sport')

In [None]:
# get the 5 words most similar to "alice"
cbow_model.wv.most_similar('alice', topn=5)

Now let's see how to handle phrases with word2vec. This is not the suggested solution, as "full-phrase" models like doc2vec have been shown to outperform word2vec.
We can handle phrases as lists of word2vec vectors and perform some mathematical operations on them (e.g., sum, average, subtract).

In [125]:
query_phrase = "sport in italy"
#sum the vectors of the individual words
query_vector_sum = np.zeros(300)
for word in query_phrase.split():
  query_vector_sum += word2vec_precomputed_model.get_vector(word)

In [None]:
print("Cosine similarity with 'football' - Google News (SUM) : ",
      cosine_similarity([query_vector_sum], [word2vec_precomputed_model.get_vector("football")])[0][0])
print("Cosine similarity with 'hockey' - Google News (SUM) : ",
      cosine_similarity([query_vector_sum], [word2vec_precomputed_model.get_vector('hockey')])[0][0])
print("Cosine similarity with 'politics' - Google News (SUM) : ",
      cosine_similarity([query_vector_sum], [word2vec_precomputed_model.get_vector('politics')])[0][0])
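Two practical details of this approach can be sketched with a toy example: words missing from the vocabulary have to be skipped (otherwise the lookup fails), and summing or averaging the word vectors gives the same cosine similarity, because cosine ignores vector length. The 3-dimensional vectors and the helper names below are invented for illustration and have nothing to do with the Google News model.

```python
import math

# Hypothetical 3-dimensional embeddings, for illustration only
toy_vectors = {
    "sport": [0.9, 0.1, 0.0],
    "italy": [0.2, 0.8, 0.1],
    "football": [0.8, 0.3, 0.0],
}

def phrase_vector(phrase, vectors, average=True):
    # Keep only the words that have a vector ("in" is skipped in this toy vocabulary)
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        raise ValueError("no word of the phrase is in the vocabulary")
    summed = [sum(component) for component in zip(*known)]
    return [x / len(known) for x in summed] if average else summed

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

avg_vec = phrase_vector("sport in italy", toy_vectors)
sum_vec = phrase_vector("sport in italy", toy_vectors, average=False)
sim_avg = cosine(avg_vec, toy_vectors["football"])
sim_sum = cosine(sum_vec, toy_vectors["football"])
print(sim_avg, sim_sum)
```

The two similarities coincide (up to floating-point error), so the choice between sum and average only matters for measures that depend on vector length, such as Euclidean distance.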

## Other types of embeddings (Entity and Graph embeddings)

We can also apply this concept to entity embeddings, using Wikipedia as a backend.

In [None]:
!pip install wikipedia2vec
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/enwiki_20180420_100d_part.txt

In [None]:
from wikipedia2vec import Wikipedia2Vec
wiki2vec = Wikipedia2Vec.load_text("enwiki_20180420_100d_part.txt")

In [None]:
wiki2vec.most_similar(wiki2vec.get_word('the'), 5)

In [None]:
wiki2vec.most_similar(wiki2vec.get_word('biology'), 5)

We can also compute embeddings for graphs.

In [None]:
!pip install networkx node2vec
import networkx as nx
from node2vec import Node2Vec

Here the random walks have a length of 30, and 200 walks are generated starting from each node.

In [None]:
G = nx.fast_gnp_random_graph(n=100, p=0.5)
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)
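To give an idea of what the random walks look like, here is a minimal sketch that generates uniform random walks on a small graph. It is a simplification: node2vec biases each step with its return and in-out parameters (p and q), whereas this sketch picks the next neighbor uniformly at random. The function name and the toy adjacency lists are assumptions.

```python
import random

def uniform_random_walks(adjacency, walk_length, num_walks, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        # One walk per starting node in each round
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            # Nodes are stored as strings, so each walk reads like a "sentence" of tokens
            walks.append([str(node) for node in walk])
    return walks

# Toy undirected graph as adjacency lists, for illustration only
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
toy_walks = uniform_random_walks(toy_graph, walk_length=5, num_walks=2)
print(toy_walks[:3])
```

The walks play the role of sentences: feeding them to a Word2Vec model yields one vector per node, with nodes that co-occur in walks ending up close to each other.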

In [None]:
model = node2vec.fit(window=10, min_count=1, batch_words=4)

In [None]:
model.wv.save_word2vec_format("embeddings_node2vec.txt")

In [None]:
embeddings = {str(node): model.wv[str(node)] for node in G.nodes()}

In [None]:
embeddings["0"]