Authored by: Aryan Mistry

# Embeddings and Semantic Search

Representing text as numerical vectors (*embeddings*) allows us to compute similarities between documents and queries. In this lab you will build a simple retrieval system based on TF‑IDF vectors and cosine similarity, then read reference code for using dense embeddings via the `sentence-transformers` library. [9]

## Part 1 – TF‑IDF Vectors

TF‑IDF (Term Frequency–Inverse Document Frequency) is a classic technique for turning a collection of documents into numerical vectors. The *term frequency* (TF) of a word reflects how often it appears in a document, while the *inverse document frequency* (IDF) down‑weights words that appear in many documents. The TF‑IDF weight for a word *t* in document *d* is defined as:

$$\text{TF–IDF}(t, d) = \text{TF}(t, d) \times \log \frac{N}{1 + n(t)}$$

where **N** is the total number of documents and **n(t)** is the number of documents containing the word. Common words such as "the" or "and" have low IDF and therefore contribute less to the vector. Once each document is represented as a vector, we can compute similarities using the cosine of the angle between vectors.
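
To make the formula concrete, the sketch below evaluates it literally on a tiny corpus. The corpus, the `tf_idf` helper, the use of raw counts for TF and the natural logarithm are all assumptions made for illustration; they are not part of the lab code.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenised and lower-cased.
toy_docs = [
    ["language", "models", "process", "language"],
    ["attention", "models"],
    ["retrieval", "baseline"],
    ["word", "embeddings"],
]

def tf_idf(term, doc, docs):
    """TF-IDF(t, d) = TF(t, d) * log(N / (1 + n(t))), with TF as a raw count."""
    tf = Counter(doc)[term]
    n_t = sum(1 for d in docs if term in d)  # number of documents containing the term
    return tf * math.log(len(docs) / (1 + n_t))

rare = tf_idf("language", toy_docs[0], toy_docs)   # 2 * log(4 / 2)
common = tf_idf("models", toy_docs[0], toy_docs)   # 1 * log(4 / 3)
print(f"language: {rare:.3f}, models: {common:.3f}")
# prints: language: 1.386, models: 0.288
```

Within the same document, "language" outweighs "models" for two reasons: it occurs twice (higher TF) and it appears in only one document (higher IDF).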

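In the following code we build a simple TF‑IDF vectoriser, define a search function and then explore the highest‑weighted terms in each document. [7] Note that scikit-learn's `TfidfVectorizer` departs from the formula above by default: it uses a smoothed IDF, $\ln\frac{1 + N}{1 + n(t)} + 1$, and scales each document vector to unit length, so the printed weights will not equal the raw formula values.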

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# A small corpus of documents
documents = [
    'Large language models are revolutionizing natural language processing.',
    'Transformers use attention mechanisms to model relationships between words.',
    'TF–IDF vectors provide a baseline for information retrieval.',
    'Word embeddings capture semantic information about terms.',
]

# Build the vectoriser and transform the corpus
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

# Function to search the corpus given a query
def search(query: str, top_k: int = 2):
    """Compute similarity between the query and each document and return the top_k documents."""
    query_vec = vectorizer.transform([query])
    sims = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_indices = sims.argsort()[::-1][:top_k]
    return [(documents[i], sims[i]) for i in top_indices]

# Inspect the most significant terms in each document
feature_names = vectorizer.get_feature_names_out()
for idx, doc in enumerate(documents):
    vector = tfidf_matrix[idx].toarray().flatten()
    top_terms_idx = vector.argsort()[::-1][:5]
    top_terms = [(feature_names[i], vector[i]) for i in top_terms_idx]
    print(f'Document {idx} top terms:', top_terms)

# Example search
results = search('attention models')
for doc, score in results:
    print(f'Result: {doc} (score: {score:.3f})')


Document 0 top terms: [('language', np.float64(0.6666666666666666)), ('natural', np.float64(0.3333333333333333)), ('models', np.float64(0.3333333333333333)), ('processing', np.float64(0.3333333333333333)), ('large', np.float64(0.3333333333333333))]
Document 1 top terms: [('words', np.float64(0.3779644730092272)), ('transformers', np.float64(0.3779644730092272)), ('use', np.float64(0.3779644730092272)), ('mechanisms', np.float64(0.3779644730092272)), ('relationships', np.float64(0.3779644730092272))]
Document 2 top terms: [('vectors', np.float64(0.388614292631317)), ('tf', np.float64(0.388614292631317)), ('baseline', np.float64(0.388614292631317)), ('provide', np.float64(0.388614292631317)), ('retrieval', np.float64(0.388614292631317))]
Document 3 top terms: [('word', np.float64(0.4217647821447532)), ('semantic', np.float64(0.4217647821447532)), ('terms', np.float64(0.4217647821447532)), ('capture', np.float64(0.4217647821447532)), ('embeddings', np.float64(0.4217647821447532))]
Result:

### Discussion

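In the example above we vectorised a handful of short documents using TF‑IDF and explored the top‑weighted terms for each. The function `search` computes the cosine similarity between a query and every document and returns the top hits. Note how the weights reflect both factors: "language" scores highest in document 0 because it occurs twice there, while a term that appears in only one document (e.g. "transformers") has a higher IDF than one shared across documents (e.g. "information", which occurs in documents 2 and 3), making it more influential in similarity calculations.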

However, TF‑IDF does not capture deeper semantic relationships or synonyms. For instance, a query containing "neural network" would not match a document about "deep learning" because the words differ. To address this, dense vector representations trained with neural networks can embed related words close together in a high‑dimensional space. The next section introduces this idea. [9]
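
The vocabulary mismatch described above can be reproduced in a few lines. This is a minimal sketch using raw word counts rather than the lab's `TfidfVectorizer`; the helper `bag_of_words_cosine` and the example strings are hypothetical. IDF weighting would change the size of non-zero scores, but it cannot create overlap where no words are shared.

```python
import math
from collections import Counter

def bag_of_words_cosine(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two strings."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "neural network"
no_overlap = bag_of_words_cosine(query, "deep learning")              # related topic, no shared words
some_overlap = bag_of_words_cosine(query, "neural network training")  # two shared words

print(no_overlap, some_overlap)
```

The first score is exactly zero even though the two phrases are closely related in meaning, which is the gap that dense embeddings aim to close.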

## Part 2 – Dense Embeddings (Reference)

Dense embeddings map sentences or documents into a continuous vector space such that semantically similar texts have vectors that are close together. Modern models like those provided by the `sentence-transformers` library (based on transformer architectures) produce embeddings that capture meaning beyond individual word counts. These embeddings can be indexed with libraries such as Facebook AI Similarity Search (FAISS) to enable efficient nearest‑neighbour search across millions of vectors.
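
As a point of comparison, the simplest form of nearest‑neighbour search is exhaustive: compare the query with every stored vector and keep the closest ones. The sketch below shows that baseline in plain Python. The function names and the two‑dimensional vectors are hypothetical stand‑ins for real embeddings; the cost grows linearly with the size of the collection, which is why dedicated index libraries are used at scale.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def brute_force_search(query, vectors, k):
    """Return (squared_distance, index) pairs for the k vectors nearest to the query."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

# Hypothetical 2-dimensional "embeddings", for illustration only.
vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = brute_force_search([1.0, 0.0], vectors, k=2)

for dist, idx in nearest:
    print(f"index={idx} squared_distance={dist:.2f}")
```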

To use dense embeddings you typically need to:
1. **Install dependencies:** `pip install sentence-transformers faiss-cpu` (or `faiss-gpu` if you have a GPU).
2. **Load a pre‑trained model:** e.g. `'all-MiniLM-L6-v2'` is a compact model for sentence embeddings.
3. **Encode your documents:** Pass each sentence or paragraph through the model to obtain a fixed‑length vector (384 dimensions for `all-MiniLM-L6-v2`).
4. **Index the vectors:** Build a FAISS index to support fast similarity search.
5. **Query the index:** Encode the query and retrieve the most similar documents along with their scores (for a distance‑based index such as `IndexFlatL2`, lower values mean closer matches).

Below is an example illustrating these steps (the code is presented as reference and will not run in this environment without the required libraries). [9]

In [4]:
!pip install sentence-transformers faiss-cpu
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the corpus into dense vectors
dense_vectors = model.encode(documents, convert_to_tensor=False, show_progress_bar=False)
dense_vectors = np.array(dense_vectors).astype('float32')

# Build a FAISS index (exact search; IndexFlatL2 reports squared Euclidean distances)
dim = dense_vectors.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(dense_vectors)

# Encode the query and perform search
query_embedding = model.encode(['attention in transformers'], convert_to_tensor=False)
query_embedding = np.array(query_embedding).astype('float32')
distances, indices = index.search(query_embedding, k=2)
for dist, idx in zip(distances[0], indices[0]):
    print(f'Result: {documents[idx]} (distance={dist:.4f})')


Result: Transformers use attention mechanisms to model relationships between words. (distance=0.4766)
Result: Large language models are revolutionizing natural language processing. (distance=1.4741)
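
The distances printed above are squared Euclidean distances, so they are not directly comparable to the cosine scores from Part 1. If the embeddings are unit‑length, the two are related by $\|a - b\|^2 = 2 - 2\cos(a, b)$. The sketch below applies that identity. The helper name is hypothetical, and unit‑length embeddings are an assumption worth checking in your own environment (for example, by inspecting the vector norms).

```python
import math

def cosine_from_squared_l2(squared_distance):
    """For unit-length vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - squared_distance / 2.0

# Check the identity on two hypothetical unit vectors 60 degrees apart.
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
d2 = sum((x - y) ** 2 for x, y in zip(a, b))
recovered = cosine_from_squared_l2(d2)
print(round(recovered, 6))  # cosine of 60 degrees

# Apply the conversion to the two distances printed above (assumes unit-length embeddings).
converted = [cosine_from_squared_l2(d) for d in (0.4766, 1.4741)]
print([round(c, 2) for c in converted])
```

Under that assumption, the two results correspond to cosine similarities of roughly 0.76 and 0.26.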


## Exercises

1. **Explore different corpora:** Replace the `documents` list with sentences from a news article, a technical report or your own writing. Observe how TF‑IDF weights and top terms change.
2. **Implement a document ranking:** Modify the `search` function to return the top 3 documents instead of 2. For each query, explain why certain documents rank higher or lower.
3. **Compare TF‑IDF with dense embeddings:** Install `sentence-transformers` and `faiss-cpu` in your own environment. Encode the same documents with a sentence embedding model and perform a similarity search. Compare the results with the TF‑IDF system. Which queries benefit most from dense embeddings?
4. **Synonym handling:** Create queries that use synonyms (e.g. 'artificial intelligence' vs 'AI'). Evaluate how TF‑IDF and dense embeddings handle these synonyms differently.
5. **Extend to multi‑paragraph documents:** Group multiple sentences into longer documents (e.g. full paragraphs). Recompute TF‑IDF vectors and observe how term weights distribute across larger contexts. What challenges arise when documents vary greatly in length?


## References

Foundational LLMs & Transformers

1. Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NIPS 2017).
2. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
4. OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
5. Touvron, H., et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.

Generative AI & Sampling

6. Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS 2014.
7. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
8. Neal, R. M. (1993). Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, University of Toronto.

Retrieval-Augmented Generation (RAG) & Knowledge Grounding

9. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP. NeurIPS 2020.
10. deepset ai (2023). Haystack: Open-Source Framework for Search and RAG Applications. https://haystack.deepset.ai
11. LangChain (2023). LangChain Documentation and Cookbook. https://python.langchain.com

Evaluation & Safety

12. Papineni, K., et al. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. ACL 2002.
13. Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop 2004.
14. OpenAI (2024). Evaluating Model Outputs: Faithfulness and Grounding. OpenAI Docs.
15. Guardrails AI (2024). Open-Source Guardrails Framework. https://github.com/shreyar/guardrails

Prompt Engineering & Instruction Tuning

16. White, J. (2023). The Prompting Guide. https://www.promptingguide.ai
17. Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.

Agents & Tool Use

18. Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
19. LangChain (2024). LangChain Agents and Tools Documentation.
20. Microsoft (2023). Semantic Kernel Developer Guide. https://learn.microsoft.com/en-us/semantic-kernel/
21. Google DeepMind (2024). Gemini Technical Report. arXiv:2312.11805.

State, Memory & Orchestration

22. LangGraph (2024). Stateful Agent Orchestration Framework. https://langchain-langgraph.vercel.app
23. Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442.

Pedagogical and Course Design References

24. fast.ai (2023). fast.ai Deep Learning Course Notebooks. https://course.fast.ai
25. Ng, A. (2023). DeepLearning.AI Short Courses on Generative AI.
26. MIT 6.S191, Stanford CS324, UC Berkeley CS294-158. (2022–2024). Course Materials and Public Notebooks for ML and LLMs.