# Comparing TF-IDF and Word2vec Embeddings: Joke Retrieval

A basic information retrieval system that illustrates TF-IDF and Word2Vec embeddings.

The goal of the project is to retrieve the jokes that are most relevant to a query.

Data from https://www.countryliving.com/life/a27452412/best-dad-jokes/ 

1. Import libraries
2. Load data
3. Preprocessing
4. TF-IDF embeddings and their use for retrieval
5. Word2Vec embeddings and their use for retrieval
6. Comparison
7. Exercises
8. Review questions
9. Conclusion

## 1. Import libraries

```
conda create --name nlp2 python=3.11
conda info --envs
conda activate nlp2
conda install nltk
conda install scikit-learn
conda install gensim
```

In [56]:
import nltk                         # the Natural Language Toolkit, an open-source NLP library
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import string                                      # for string.punctuation
import numpy as np                                 # for np.zeros in the Word2Vec averaging function

# cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
# tf-idf
# TfidfVectorizer converts the preprocessed jokes into a matrix of TF-IDF features;
# cosine_similarity (imported above) scores each joke against a query.
from sklearn.feature_extraction.text import TfidfVectorizer

# word2vec
import gensim.downloader as api

In [57]:
# Download external resources
nltk.download('stopwords')          # stopwords are common words that carry less meaning than keywords, usually removed from text
nltk.download('wordnet')            # wordnet is a lexical database of English words, used for text analysis
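nltk.download('punkt')              # punkt is a pre-trained model that helps tokenize words, used for text analysis
# Note: newer NLTK releases (3.9 and later) look for 'punkt_tab' instead; if word_tokenize raises a LookupError, also run nltk.download('punkt_tab')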

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\petra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\petra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\petra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 2. Load the data

In [58]:

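# Dataset of jokes for testing
# Open the file and read its lines into a list
with open('best_dad_jokes.txt', 'r', encoding="utf-8") as filehandle:
    # strip only the trailing newline; row[:-1] would cut a character off a last line that has no newline
    jokes = [row.rstrip('\n') for row in filehandle]
    # alternative that also drops the surrounding quotes:
    # jokes = [row.rstrip()[1:-1] for row in filehandle]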



# Print the first 5 jokes, one per line
for joke in jokes[:5]:
    print(joke)


"I'm afraid for the calendar. Its days are numbered."
"My wife said I should do lunges to stay in shape. That would be a big step forward."
"Why do fathers take an extra pair of socks when they go golfing?" "In case they get a hole in one!"
"Singing in the shower is fun until you get soap in your mouth. Then it's a soap opera."
"What do a tick and the Eiffel Tower have in common?" "They're both Paris sites."


## 3. Preprocessing

The same preprocessing pipeline will be used for both TF-IDF embeddings and for Word2vec embeddings.

Defining the preprocessing pipeline:
1. Tokenize
2. Transform to lower case
3. Remove stop words
4. Remove unwanted characters
5. Lemmatize (or stem)


In [59]:
#A class for the preprocessing pipeline which can be reused & adapted for several documents

class PreprocessingPipeline:
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()    # lemmatization is the process of converting a word to its dictionary form
        self.stemming = PorterStemmer()          # stemming is the process of reducing a word to its root form (used by token_stemmer)
        self.punctuation = string.punctuation

    
    #Converting text into tokens
    def tokenize(self, text):
        return word_tokenize(text)
    
    #Converting the tokens to lowercase
    def case_fold(self, token):
        return token.lower()
    
    #Removing stop-words
    def remove_stop_words(self, token):
        if token is not None and token not in self.stop_words:
            return token
        return None

    #Removing unwanted characters
    def remove_unwanted_characters(self, token):
        if token is not None and not token.isalpha():
            return None
        return token
    
    #Lemmatizing tokens
    def lemmatize(self, token):
        lemmatized_token = self.lemmatizer.lemmatize(token)
        return lemmatized_token

    #Stemming tokens (an alternative to lemmatizing)
    def token_stemmer(self, token):
        stemmed_token = self.stemming.stem(token)
        return stemmed_token

    #Preprocessing text by applying all steps from above
    def preprocess_text(self, text):
        """Returns a list of preprocessed tokens from the input text."""
        tokens = self.tokenize(text)
        preprocessed_tokens = []
        for token in tokens:
            token = self.case_fold(token)
            token = self.remove_stop_words(token)
            token = self.remove_unwanted_characters(token)
            
            if token is not None:
                token = self.lemmatize(token)
                #token = self.token_stemmer(token)
                preprocessed_tokens.append(token)
        
        return preprocessed_tokens

In [60]:
# Preprocess the jokes with the pipeline 

preprocessor = PreprocessingPipeline()
preprocessed_jokes = [preprocessor.preprocess_text(joke) for joke in jokes]


# Print the first 5 jokes and their preprocessed tokens
for joke, tokens in zip(jokes[:5], preprocessed_jokes[:5]):
    print(joke, "\n ----> ", tokens)

"I'm afraid for the calendar. Its days are numbered." 
 ---->  ['afraid', 'calendar', 'day', 'numbered']
"My wife said I should do lunges to stay in shape. That would be a big step forward." 
 ---->  ['wife', 'said', 'lunge', 'stay', 'shape', 'would', 'big', 'step', 'forward']
"Why do fathers take an extra pair of socks when they go golfing?" "In case they get a hole in one!" 
 ---->  ['father', 'take', 'extra', 'pair', 'sock', 'go', 'golfing', 'case', 'get', 'hole', 'one']
"Singing in the shower is fun until you get soap in your mouth. Then it's a soap opera." 
 ---->  ['singing', 'shower', 'fun', 'get', 'soap', 'mouth', 'soap', 'opera']
"What do a tick and the Eiffel Tower have in common?" "They're both Paris sites." 
 ---->  ['tick', 'eiffel', 'tower', 'common', 'paris', 'site']


## 4. TF-IDF embedding

We will first embed the jokes and the query in a TF-IDF space. In this space, each token (word) is a dimension. 


TF-IDF (Term Frequency-Inverse Document Frequency) scores how important a word is to a document relative to a collection of documents. It has two components. The term frequency (TF) measures how often a term appears in a document; the inverse document frequency (IDF) measures how rare the term is across the whole collection. Their product gives high weights to terms that are frequent within a document but rare across the collection, which emphasizes the terms that distinguish that document from the others. The formulas below use the following notation:
- t: term
- d: document
- D: collection of documents (corpus)

Term frequency
$$
\text{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}
$$

Inverse document frequency
$$
\text{IDF}(t, D) = \log\left(\frac{\text{total number of documents in } D}{\text{number of documents in } D \text{ containing term } t}\right)
$$

TF-IDF
$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$
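
The three formulas can be checked by hand on a toy corpus. The following is a minimal sketch in plain Python; the three tokenized documents and the helper names `tf`, `idf` and `tf_idf` are made up for illustration and are not part of the joke dataset or of scikit-learn.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-tokenized documents
docs = [
    ["soap", "shower", "soap", "opera"],
    ["shower", "song"],
    ["math", "teacher"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    """Product of the two components, as in the formula above."""
    return tf(term, doc) * idf(term, corpus)

# Weights of three terms of the first document
for term in ["soap", "shower", "opera"]:
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

In the first document, "soap" gets the highest weight because it is both repeated and found in no other document, while "shower" gets the lowest because it also occurs in the second document.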



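We will use the TF-IDF vectorizer from the scikit-learn library to convert the preprocessed jokes into a matrix of TF-IDF features. With its default settings, `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF, $\ln\frac{1 + N}{1 + \text{df}(t)} + 1$ (where $N$ is the number of documents and $\text{df}(t)$ is the number of documents containing $t$), and scales each document vector to unit length, so its values differ slightly from the textbook formulas above.

We will then use **cosine similarity** to score each joke against a query.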

$$
\text{Cosine Similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}
$$
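
As an illustration of the formula, here is a minimal sketch in plain Python. The function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for this example; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors given as lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

print(cosine([1, 2, 0], [2, 4, 0]))  # same direction, different length
print(cosine([1, 0, 0], [0, 3, 0]))  # orthogonal vectors
print(cosine([0, 0, 0], [1, 1, 1]))  # zero vector
```

The zero-vector case matters for retrieval: a query whose words are all missing from the vocabulary is embedded as a zero vector and should not match any joke.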


In [61]:

def dummy_fun(doc):
    """A dummy function that just returns the input document directly. 
    This is used to bypass the tokenization and pre-processing steps."""
    return doc


# Create the vectorizer
# It takes a list of token lists (the preprocessed jokes) and returns a matrix of TF-IDF features
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,       # the jokes are already tokenized
    preprocessor=dummy_fun,    # ... and already preprocessed
    token_pattern=None)        # unused, because a tokenizer is supplied

# Fit the vectorizer on the jokes
tfidf_vectorizer.fit(preprocessed_jokes)                          # this can take up to a minute for the large dataset

# Transform the jokes
tfidf_jokes = tfidf_vectorizer.transform(preprocessed_jokes)

# Print the shape of the matrix
print("Number of jokes:", tfidf_jokes.shape[0])    
print("Embedding dimensions (number of different (non-stop) words):", tfidf_jokes.shape[1])    

# the first number is the number of jokes, the second number is the number of unique words in the jokes 

Number of jokes: 148
Embedding dimensions (number of different (non-stop) words): 569


In [62]:
# Print the words in the vocabulary
print(tfidf_vectorizer.get_feature_names_out())

# These are the dimensions of our tf-idf embedding

['accidentally' 'account' 'act' 'addicted' 'affect' 'afraid' 'ago' 'agree'
 'ahead' 'along' 'alphabet' 'always' 'amazon' 'answer' 'anything'
 'apparent' 'apparently' 'apple' 'april' 'area' 'argument' 'arm' 'asked'
 'astronaut' 'atom' 'award' 'away' 'baby' 'back' 'backflip' 'bagel'
 'banana' 'bank' 'bar' 'bark' 'bartender' 'bathroom' 'bay' 'beach'
 'become' 'becomes' 'bee' 'beef' 'beer' 'belt' 'bent' 'best' 'bicycle'
 'big' 'billy' 'blockbuster' 'boat' 'body' 'boogie' 'book' 'born' 'bowtie'
 'box' 'boxing' 'brace' 'bring' 'brown' 'brush' 'build' 'butter' 'buy'
 'ca' 'calendar' 'call' 'called' 'canned' 'capital' 'car' 'card' 'carded'
 'case' 'cashier' 'cat' 'catch' 'cent' 'cheese' 'cheeseburger' 'cheesy'
 'chemistry' 'chicken' 'child' 'chip' 'chocolate' 'circus' 'classic'
 'claus' 'clean' 'climb' 'closed' 'closet' 'clothes' 'cloud' 'clove'
 'coffee' 'common' 'company' 'computer' 'concentrate' 'concert'
 'construction' 'contest' 'corduroy' 'corn' 'corner' 'cost' 'could'
 'count' 'country'

Let's look at the TF-IDF values of the first joke.

In [63]:
# Print the first joke and its TF-IDF representation
print(jokes[0])
print(tfidf_jokes[0])
# print the keywords
print(tfidf_vectorizer.get_feature_names_out()[tfidf_jokes[0].nonzero()[1]])


"I'm afraid for the calendar. Its days are numbered."
  (0, 333)	0.5094398629437035
  (0, 121)	0.4705455112225615
  (0, 67)	0.5094398629437035
  (0, 5)	0.5094398629437035
['numbered' 'day' 'calendar' 'afraid']


Note: because the dataset is very small (few jokes, each of them short), many tokens have the same term and document frequencies, so TF-IDF gives them the same weight.
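
The two distinct values printed above (about 0.509 and 0.471) can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed IDF and L2 normalization. It also assumes document frequencies that the notebook output does not show: that "afraid", "calendar" and "numbered" each occur in one joke and "day" in two.

```python
import math

n_docs = 148  # number of jokes, taken from the shape printed earlier

# Assumed document frequencies (inferred, not shown in the notebook output)
df = {"afraid": 1, "calendar": 1, "numbered": 1, "day": 2}

# Smoothed IDF used by TfidfVectorizer with default settings
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

# Each token occurs once in the joke, so the raw weight is 1 * idf;
# the document vector is then scaled to unit (L2) length.
norm = math.sqrt(sum(w * w for w in idf.values()))
weights = {term: w / norm for term, w in idf.items()}

for term, w in weights.items():
    print(f"{term:10s} {w:.4f}")
```

Under these assumptions the three tokens with equal document frequency receive identical weights, and only "day", which is slightly more common, is weighted lower.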

### 4.1 Retrieving relevant jokes based on TF-IDF embedding

In [64]:
# query = "a cat and a dog"    # alternative query to try
query = "I need a joke about students and teachers"

# transform the query: apply the same preprocessing and vectorization as for the jokes
tfidf_query = tfidf_vectorizer.transform([preprocessor.preprocess_text(query)])

# tf-idf scores for the query
print(tfidf_query)
# print the keywords
print(tfidf_vectorizer.get_feature_names_out()[tfidf_query.nonzero()[1]])

# compute the similarity between the query and every joke
similarities = cosine_similarity(tfidf_query, tfidf_jokes)
          # cosine_similarity returns a matrix of shape (1, number of jokes): one row for the query, one column per joke

# print the 5 most relevant (similar) jokes
# sort the similarities
sorted_similarities = similarities.argsort()[0][::-1]  # argsort returns the indices that would sort the array
# print the most similar jokes
for i in range(5):
    print(similarities[0][sorted_similarities[i]] , jokes[sorted_similarities[i]])    


  (0, 498)	0.8393828598192675
  (0, 258)	0.5435406283265565
['teacher' 'joke']
0.3822465320495546 "Where do math teachers go on vacation?" "Times Square."
0.30800626713050494 "When does a joke become a dad joke? When it becomes apparent."
0.21290086669346053 "Why don't eggs tell jokes? They'd crack each other up."
0.1953847899498888 "I was going to tell a time-traveling joke, but you guys didn't like it."
0.18610540215205892 "I have a joke about chemistry, but I don't think it will get a reaction."


## 5. Word2Vec embedding

Word2Vec is a word embedding technique that represents words as dense vectors in a continuous vector space. Developed by researchers at Google, Word2Vec is trained on large text corpora to learn distributed representations of words based on their context. The key idea behind Word2Vec is that words with similar meanings tend to occur in similar contexts, and therefore should have similar vector representations. There are two main architectures for training Word2Vec embeddings: Continuous Bag of Words (CBOW) and Skip-gram.

In the CBOW architecture, the model predicts the target word based on its context, which consists of surrounding words in a fixed window. The input to the model is the context words, and the output is the target word. Conversely, in the Skip-gram architecture, the model predicts surrounding context words given a target word. Both architectures use shallow neural networks with a single hidden layer to learn word embeddings.
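
To make the difference between the two architectures concrete, the sketch below builds the training examples each one would see for a short token list. It only shows how examples are formed from a fixed window; the function names are hypothetical, and real Word2Vec training adds further steps (for example subsampling of frequent words and negative sampling) that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: each context word is predicted from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target) examples: the target is predicted from its context."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["singing", "shower", "fun", "soap", "mouth"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per (target, context) pair, whereas CBOW produces one example per position with all context words grouped together.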

Word2Vec embeddings capture semantic relationships between words, enabling mathematical operations such as vector addition and subtraction to capture analogies and relationships between words. For example, the vector representation of "king" minus "man" plus "woman" might result in a vector close to the vector representation of "queen." This property makes Word2Vec embeddings useful for various NLP tasks, including sentiment analysis, machine translation, and named entity recognition. Moreover, because Word2Vec embeddings capture semantic information in a dense vector space, they often outperform traditional sparse representations like one-hot encoding in terms of efficiency and effectiveness for downstream NLP tasks.
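
The analogy arithmetic can be illustrated with a toy example. The two-dimensional vectors below are invented for this sketch (one axis loosely standing for "royalty", the other for "gender"); real Word2Vec vectors have hundreds of dimensions and no individually interpretable axes. The helper names `cosine` and `analogy` are hypothetical as well.

```python
import math

# Hypothetical 2-D "embeddings": (royalty, gender)
vectors = {
    "king":   [0.9,  0.9],
    "queen":  [0.9, -0.9],
    "man":    [0.1,  0.9],
    "woman":  [0.1, -0.9],
    "prince": [0.7,  0.8],
}

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c, vectors):
    """Word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", vectors))  # prints "queen"
```

Excluding the input words is a deliberate design choice: the composed vector usually stays close to one of its inputs, which would otherwise be returned as the top match.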

We will use the `gensim` library to download and query a pre-trained model.

In [65]:
# import gensim.downloader as api

# api.info() returns information about the corpora and models available in gensim-data
info = api.info()

# list the available pretrained models
print("Available pretrained embeddings:", info['models'].keys())
print("Info about glove-twitter-25:", info['models']["glove-twitter-25"])

# pretty-print the JSON with the information about the word2vec-google-news-300 model
import json
print(json.dumps(info['models']["word2vec-google-news-300"], indent=2))


Available pretrained embeddings: dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])
Info about glove-twitter-25: {'num_records': 1193514, 'file_size': 109885004, 'base_dataset': 'Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)', 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-25/__init__.py', 'license': 'http://opendatacommons.org/licenses/pddl/', 'parameters': {'dimension': 25}, 'description': 'Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/).', 'preprocessing': 'Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter

In [66]:
# Download pre-trained Word2Vec model
model = api.load("word2vec-google-news-300")   # large download (well over 1 GB), so the first load takes a while

# Get embedding for a specific word
embedding = model["apple"]
print("Embedding for 'apple':", embedding)

# Find similar words
similar_words = model.most_similar("apple")
print("Similar words to 'apple':", similar_words)

Embedding for 'apple': [-0.06445312 -0.16015625 -0.01208496  0.13476562 -0.22949219  0.16210938
  0.3046875  -0.1796875  -0.12109375  0.25390625 -0.01428223 -0.06396484
 -0.08056641 -0.05688477 -0.19628906  0.2890625  -0.05151367  0.14257812
 -0.10498047 -0.04736328 -0.34765625  0.35742188  0.265625    0.00188446
 -0.01586914  0.00195312 -0.35546875  0.22167969  0.05761719  0.15917969
  0.08691406 -0.0267334  -0.04785156  0.23925781 -0.05981445  0.0378418
  0.17382812 -0.41796875  0.2890625   0.32617188  0.02429199 -0.01647949
 -0.06494141 -0.08886719  0.07666016 -0.15136719  0.05249023 -0.04199219
 -0.05419922  0.00108337 -0.20117188  0.12304688  0.09228516  0.10449219
 -0.00408936 -0.04199219  0.01409912 -0.02111816 -0.13476562 -0.24316406
  0.16015625 -0.06689453 -0.08984375 -0.07177734 -0.00595093 -0.00482178
 -0.00089264 -0.30664062 -0.0625      0.07958984 -0.00909424 -0.04492188
  0.09960938 -0.33398438 -0.3984375   0.05541992 -0.06689453 -0.04467773
  0.11767578 -0.13964844 -0.2

In [67]:
# Get embedding for a specific word
embedding = model["cat"]

# Find similar words
similar_words = model.most_similar("cat")

print("Embedding for 'cat':", embedding)
print("Similar words to 'cat':", similar_words)

Embedding for 'cat': [ 0.0123291   0.20410156 -0.28515625  0.21679688  0.11816406  0.08300781
  0.04980469 -0.00952148  0.22070312 -0.12597656  0.08056641 -0.5859375
 -0.00445557 -0.296875   -0.01312256 -0.08349609  0.05053711  0.15136719
 -0.44921875 -0.0135498   0.21484375 -0.14746094  0.22460938 -0.125
 -0.09716797  0.24902344 -0.2890625   0.36523438  0.41210938 -0.0859375
 -0.07861328 -0.19726562 -0.09082031 -0.14160156 -0.10253906  0.13085938
 -0.00346375  0.07226562  0.04418945  0.34570312  0.07470703 -0.11230469
  0.06738281  0.11230469  0.01977539 -0.12353516  0.20996094 -0.07226562
 -0.02783203  0.05541992 -0.33398438  0.08544922  0.34375     0.13964844
  0.04931641 -0.13476562  0.16308594 -0.37304688  0.39648438  0.10693359
  0.22167969  0.21289062 -0.08984375  0.20703125  0.08935547 -0.08251953
  0.05957031  0.10205078 -0.19238281 -0.09082031  0.4921875   0.03955078
 -0.07080078 -0.0019989  -0.23046875  0.25585938  0.08984375 -0.10644531
  0.00105286 -0.05883789  0.05102539 

### Exercise: king - man + woman = ?
Test how well Word2Vec embeddings capture analogies: compute the vector for "king - man + woman" and inspect the words nearest to it.

Also test: Paris is to France as Rome is to ________?

In [68]:
# Your code goes here


In [69]:
# Solution
# Experiment: king - man + woman = ?

# Compose the analogy vector
embedding = model["king"] - model["man"] + model["woman"]
# Find the words closest to it
# (querying with a raw vector does not exclude the input words, so 'king' itself ranks first)
similar_words = model.most_similar([embedding])
print(similar_words)

[('king', 0.8449392318725586), ('queen', 0.7300518155097961), ('monarch', 0.645466148853302), ('princess', 0.6156251430511475), ('crown_prince', 0.5818676948547363), ('prince', 0.5777117609977722), ('kings', 0.5613664388656616), ('sultan', 0.5376776456832886), ('Queen_Consort', 0.5344247221946716), ('queens', 0.5289887189865112)]


In [70]:
# Paris is to France as Rome is to ?  ->  France - Paris + Rome
embedding = model["France"] - model["Paris"] + model["Rome"]

# Find the words closest to the composed vector
similar_words = model.most_similar([embedding])
print(similar_words)

[('Italy', 0.7115296125411987), ('Rome', 0.7092384696006775), ('France', 0.590425431728363), ('Sicily', 0.5600441694259644), ('Italians', 0.5599856376647949), ('Flaminio_Stadium', 0.5327231884002686), ('Bambino_Gesu_Hospital', 0.505158007144928), ('Italian', 0.4975103735923767), ('Spain', 0.4952991306781769), ('Antonio_Martino', 0.4828406870365143)]


In [71]:
def word2vec_vectorizer(words, model):
    """Returns the average word embedding for the words in the input list. 
    Words that are not in the model's vocabulary are ignored;
    if none of the words is in the vocabulary, a zero vector is returned."""
    embedding = []
    for word in words:
        if word in model.key_to_index:
            embedding.append(model[word])
  
    if embedding:
        embedding = sum(embedding) / len(embedding)
        return embedding
    return np.zeros(model.vector_size)

# word2vec_vectorizer(["caffft", "dofffg"], model)


In [72]:

# Generate embeddings for jokes

# For each joke, its embedding is the average embedding of all words in the joke 
joke_embeddings = []
for joke in preprocessed_jokes:
    joke_embedding = word2vec_vectorizer(joke, model)
    joke_embeddings.append(joke_embedding)  



### 5.1 Retrieve jokes based on query and Word2Vec embedding

In [73]:
query = "I need a joke about students and teachers"

# Generate embedding for the query
query_embedding = word2vec_vectorizer(preprocessor.preprocess_text(query), model)

# Calculate cosine similarity between query and jokes
similarities = cosine_similarity([query_embedding], joke_embeddings)


# print the 5 most similar jokes
# sort the similarities
sorted_similarities = similarities.argsort()[0][::-1]  # argsort returns the indices that would sort the array
# print the most similar jokes
for i in range(5):
   # print(jokes[sorted_similarities[i]])
    print(similarities[0][sorted_similarities[i]] , jokes[sorted_similarities[i]])    


0.6684581 "Where do math teachers go on vacation?" "Times Square."
0.5728766 "Why did the math book look so sad? Because of all of its problems!"
0.5301609 "I like telling Dad jokes. Sometimes he laughs!"
0.5207011 "I have a joke about chemistry, but I don't think it will get a reaction."
0.51812315 "I was going to tell a time-traveling joke, but you guys didn't like it."


## 6. Comparison
Qualitatively compare the relevance of the jokes retrieved with the TF-IDF embeddings and with the Word2Vec embeddings.

In [75]:
# Aggregated code, comparing the two methods

query = "snowman"

# Preprocess the query
preprocessed_query = preprocessor.preprocess_text(query)

# Embed the query into the two spaces
tfidf_query = tfidf_vectorizer.transform([preprocessed_query])
word2vec_query = word2vec_vectorizer(preprocessed_query, model)

# Compute the cosine similarities
tfidf_similarities = cosine_similarity(tfidf_query, tfidf_jokes)
word2vec_similarities = cosine_similarity([word2vec_query], joke_embeddings)


# Get the top 5 most similar jokes for each method
sorted_tfidf_similarities = tfidf_similarities.argsort()[0][::-1]
sorted_word2vec_similarities = word2vec_similarities.argsort()[0][::-1]

# Print the top 5 most similar jokes for each method
print("Query:", query)
print("TF-IDF:")
for i in range(5):
  print("{:.3f}".format(tfidf_similarities[0][sorted_tfidf_similarities[i]]), jokes[sorted_tfidf_similarities[i]])

print("\nWord2Vec:")
for i in range(5):
  print("{:.3f}".format(word2vec_similarities[0][sorted_word2vec_similarities[i]]), jokes[sorted_word2vec_similarities[i]])


Query: snowman
TF-IDF:
0.478 "What do you call it when a snowman throws a tantrum?" "A meltdown."
0.000 "If you see a crime happen at the Apple store, what does it make you?" "An iWitness.
0.000 "I have a joke about chemistry, but I don't think it will get a reaction."
0.000 "Why didn't the skeleton climb the mountain?" "It didn't have the guts."
0.000 "What time did the man go to the dentist? Tooth hurt-y."

Word2Vec:
0.566 "What do you call it when a snowman throws a tantrum?" "A meltdown."
0.535 "How does a penguin build its house? Igloos it together."
0.486 "How much does it cost Santa to park his sleigh?" "Nothing, it's on the house."
0.385 "How do you get a good price on a sled?" "You have toboggan."
0.380 "How can you tell if a tree is a dogwood tree?" "By its bark."
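
To complement the qualitative comparison, one simple option is to measure how much the two top-5 lists overlap. The sketch below is an illustration only: the function `top_k_overlap` and the example index lists are hypothetical, not results from the notebook.

```python
def top_k_overlap(ranking_a, ranking_b, k=5):
    """Jaccard overlap of the top-k items of two rankings (1.0 means the same set)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 1.0
    return len(top_a & top_b) / len(top_a | top_b)

# Hypothetical joke indices, best match first
tfidf_ranking = [12, 3, 57, 8, 40]
word2vec_ranking = [12, 99, 3, 21, 64]
print(top_k_overlap(tfidf_ranking, word2vec_ranking, k=5))  # 2 shared of 8 distinct -> 0.25
```

In the notebook, the two `argsort` results could be passed in directly. Note that jokes with a TF-IDF similarity of 0.000 are ordered arbitrarily, so for queries such as "snowman" only the entries with a non-zero score are meaningful.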


## 7. Exercises
1. Change the dataset to Grimm's fairy tales: https://www.kaggle.com/datasets/tschomacker/grimms-fairy-tales or another text dataset that interests you.
2. Cluster the jokes, find the medoids, find keywords for each cluster.
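
As a hint for the second exercise: the medoid of a cluster is the member with the smallest total distance to all other members, so unlike a centroid it is always an actual joke. A minimal sketch, assuming cosine distance and using made-up two-dimensional vectors and a made-up cluster (the names `cosine_distance` and `medoid` are hypothetical):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

def medoid(indices, vectors):
    """Index of the cluster member with the smallest total distance to the other members."""
    return min(
        indices,
        key=lambda i: sum(cosine_distance(vectors[i], vectors[j]) for j in indices),
    )

# Hypothetical embeddings; the first three form one cluster
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
cluster = [0, 1, 2]
print(medoid(cluster, vectors))  # prints 1, the member "in the middle" of the cluster
```

The cluster assignments themselves would come from a clustering algorithm of your choice applied to the joke embeddings.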


## 8. Review Questions
1. What does TF-IDF stand for, and what does it represent in natural language processing?
2. Explain the intuition behind TF-IDF.
3. What are the advantages of using TF-IDF over simple word frequency for text representation?
4. How are TF-IDF scores calculated for individual terms in a document?
5. How does TF-IDF handle stopwords and rare terms in a document collection?
6. Explain the term weighting scheme used in TF-IDF and how it affects the importance of terms in documents.
7. What are some limitations or challenges associated with using TF-IDF?
8. What is Word2Vec?
9. Describe the Skip-gram architecture.
10. Discuss the training objective of Word2Vec and how it learns to capture semantic similarities between words.
11. What are some advantages of using Word2Vec embeddings over traditional one-hot encodings or bag-of-words representations?
12. Explain the notion of cosine similarity in word embeddings.
13. How can Word2Vec embeddings be evaluated and validated for their quality and effectiveness?
14. Discuss the transfer learning capabilities of Word2Vec embeddings and their applications in downstream NLP tasks.
15. Compare and contrast text representation techniques, such as bag-of-words, TF-IDF and word embeddings.


## 9. Conclusion
This notebook demonstrates a basic information retrieval system built on TF-IDF and Word2Vec embeddings. TF-IDF treats each word as an independent dimension, so it can only match jokes that share exact terms with the query. Pre-trained Word2Vec embeddings, learned from very large corpora through self-supervised training, capture semantic relationships between words and can therefore retrieve related jokes even when no terms are shared, as the "snowman" query shows.

Beyond information retrieval, these embeddings are useful for tasks such as text classification and topic detection.