# Lab: Building a Vector Database with FAISS for Nearest Neighbor Search

In this hands-on lab, you will learn how to **build a high-performance vector database** using **FAISS** (Facebook AI Similarity Search) for nearest neighbor search. The lab focuses on a real-world scenario: a **job portal system** that matches candidates to job opportunities based on **semantic meaning** rather than traditional keyword-based search.

The key idea is to generate **embeddings** (dense vector representations) of both job descriptions and candidate resumes using Hugging Face pre-trained models like **DistilBERT**. These embeddings capture the semantic context of the text, allowing for better job matching even when the wording varies.



**DistilBERT** is a smaller, faster, and more efficient version of the popular **BERT** (Bidirectional Encoder Representations from Transformers) model. It was created using a technique called **knowledge distillation**, where a smaller model (the student) learns from a larger pre-trained model (the teacher, in this case, BERT).

While **BERT** is a powerful transformer model known for its deep understanding of context in text, **DistilBERT** retains much of BERT's performance but with fewer parameters, making it **lighter** and **faster**.








# 1. Install dependencies
1.   To install the necessary dependencies, run the following command:



In [4]:
!pip install faiss-cpu transformers numpy



#### **Expected Output:**
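After running the ```!pip install faiss-cpu transformers numpy``` command, you will see installation logs for each library. If the installation succeeds, the output confirms that all three libraries are installed. The lab also imports PyTorch (```torch```), which is preinstalled on Google Colab; in other environments, install it separately.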


This means:

* **FAISS** is ready to use for similarity search tasks.

* **Transformers** is installed and can be used for NLP tasks.

* **NumPy** is ready for numerical computations.

2. To build the vector search system, we need specific libraries like **FAISS, Hugging Face Transformers, and NumPy.** FAISS is used for vector search, Hugging Face provides pre-trained models for generating embeddings, and NumPy is essential for handling arrays and performing numerical operations.


These libraries form the backbone of the system. They allow us to **generate embeddings**, **index them**, and search for similar items based on vector distance, which is the core functionality of this system.


In [5]:
import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModel
import torch  # For model inference

#### **Expected Output:**

The import statements produce no output. With these libraries imported, you can now perform tasks such as:

* **FAISS:** Use it for building search indexes from high-dimensional data (e.g., embeddings from transformers) and then search for the most similar items in that data.

* **NumPy:** Handle numerical data efficiently for operations like matrix manipulations and embeddings.

* **Transformers:** Use state-of-the-art NLP models (like BERT, GPT, etc.) for various NLP tasks. Tokenize text, get embeddings, and run inference with pre-trained models.

* **PyTorch:** Perform inference (and possibly training) on neural networks. It is used by many Hugging Face models for computation.

* **Tokenization:** Convert text into token IDs using the AutoTokenizer.

* **Model Inference:** Use the pre-trained DistilBERT model to get embeddings (vector representations) of the text.

# 2. Load Pre-trained Hugging Face Model for Embeddings

Transformers like **DistilBERT** are pre-trained models capable of converting text into fixed-size embeddings (dense vectors). These embeddings capture the **semantic meaning** of the text, allowing for more accurate matching beyond simple keyword comparison.


1. This step is essential for converting the job descriptions and resumes into **numerical vectors** that can be compared for similarity. The embeddings generated here are stored and indexed using **FAISS to allow for fast similarity searches.**


In [6]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/distilbert-base-nli-mean-tokens")
model = AutoModel.from_pretrained("sentence-transformers/distilbert-base-nli-mean-tokens")


### **Explanation**

* **AutoTokenizer:** Converts raw text into tokens.

* **AutoModel:** Loads the pre-trained model that will generate embeddings for the input text.


2. The embeddings generated in this step are used for similarity search in the FAISS database. They **provide a consistent numerical representation** of job descriptions and resumes, making it possible to search for similar text.


In [7]:
def get_embeddings(texts):
    # Tokenize the input texts
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # Generate embeddings using the pre-trained model
    with torch.no_grad():  # Disable gradient calculation for inference
        embeddings = model(**inputs).last_hidden_state.mean(dim=1)

    return embeddings.numpy()

### **Explanation**

The **embedding generation function** is responsible for taking input text (job descriptions or resumes) and converting it into a **high-dimensional vector.** This process involves **tokenizing the text** (splitting it into meaningful units) and feeding it into the pre-trained model to generate the vector.


* **Tokenization:** Converts raw text into tokens that the model can process.


* **Model Inference:** The model generates embeddings (numerical representations) for each token, which are then averaged to create a fixed-size vector for each sentence.

* **```with torch.no_grad():```** This PyTorch context manager temporarily disables gradient computation. No gradients are stored for the operations inside the block, which reduces memory usage and speeds up computation when gradients are not needed, such as during model inference or evaluation.
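
The averaging step described above treats every token position equally, including the padding positions that ```padding=True``` adds to shorter texts in a batch. A common variant weights the average by the attention mask so that padding is ignored. The sketch below illustrates that idea on toy NumPy arrays; ```masked_mean_pool``` is a hypothetical helper that is not part of the lab's code, and the arrays stand in for the model's ```last_hidden_state``` and the tokenizer's ```attention_mask```.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                     # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                     # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, hidden size 2.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]],  # last position is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(tokens, mask)  # padding excluded
plain = tokens.mean(axis=1)              # padding included
print(pooled)
print(plain)
```

In this toy batch the second sentence pools to ```[3, 3]``` with the mask and ```[5, 5]``` without it, which shows how padding can shift an unmasked average.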

# 3. Generate Embeddings for Job Descriptions and Resumes

1. This step applies the embedding generation function to a set of job descriptions and candidate resumes to convert them into vectors. These vectors will then be used in the FAISS index for fast retrieval during similarity searches.

In [8]:
# Job descriptions and candidate resumes
job_descriptions = [
    "Data Scientist position with expertise in Python, Machine Learning, and Data Analysis.",
    "Software Engineer with experience in Java, cloud technologies, and software development.",
    "Marketing Manager with experience in digital marketing, content strategy, and team leadership."
]


candidate_resumes = [
    "Experienced Data Scientist proficient in Python, Machine Learning, and Data Analysis.",
    "Software Engineer with strong experience in Java, cloud platforms, and agile development.",
    "Digital Marketing professional skilled in content creation, SEO, and social media strategies."
]


#### **Explanation:**

In the above code, you have two lists:

```job_descriptions```: Contains strings that describe job requirements or job descriptions.

```candidate_resumes```: Contains strings that describe the candidates' resumes or their skill sets and experiences.

This cell only defines the data, so it produces no output. The goal of the following steps is to **compare each candidate's resume with each job description** to see how well the candidate fits the requirements of the job.

In practical terms, the comparison uses a similarity or distance metric (such as cosine similarity or Euclidean distance) on the embeddings. In this lab, FAISS compares the embeddings using L2 (Euclidean) distance rather than matching keywords.
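
For reference, cosine similarity between two sets of vectors can be sketched in NumPy as follows. This is an illustration on made-up two-dimensional vectors, not the lab's method; ```cosine_similarity_matrix``` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)  # scale rows to unit length
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T                               # (len(a), len(b))

# Toy vectors standing in for job and resume embeddings
jobs = np.array([[1.0, 0.0], [0.0, 1.0]])
resumes = np.array([[2.0, 0.0], [1.0, 1.0]])

sims = cosine_similarity_matrix(resumes, jobs)
print(sims)  # rows: resumes, columns: jobs
```

Values close to 1 mean the vectors point in nearly the same direction, regardless of their length.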

2. This step creates the **actual dataset of vectors** that will be indexed in FAISS. It ensures that each job description and resume has a **corresponding embedding,** which is crucial for querying and finding the best matches later.


In [9]:
# Generate embeddings for job descriptions and candidate resumes
job_embeddings = get_embeddings(job_descriptions)
resume_embeddings = get_embeddings(candidate_resumes)


# Print the shape of the generated embeddings
print("Generated Embeddings Shape for Job Descriptions:", job_embeddings.shape)
print("Generated Embeddings Shape for Candidate Resumes:", resume_embeddings.shape)

Generated Embeddings Shape for Job Descriptions: (3, 768)
Generated Embeddings Shape for Candidate Resumes: (3, 768)


#### **Explanation:**

In the above code, you are **generating embeddings for job descriptions** and candidate resumes. Embeddings are vector representations of text that capture semantic meaning. These embeddings are typically generated using models like BERT, Sentence-BERT, or other pre-trained transformer models.

* ```get_embeddings()```: This function, defined in Section 2, generates embeddings for the provided input texts (job descriptions and candidate resumes) and returns them as a NumPy array with one row per text.

* ```job_descriptions```: A list of job descriptions you want to generate embeddings for.

* ```candidate_resumes```: A list of candidate resumes you want to generate embeddings for.

# 4. Build the FAISS Index
This step **stores all the embeddings** of the job descriptions in a format that makes it possible to **quickly search** for the most similar job descriptions when a candidate uploads a resume.



1. Create a FAISS index to store and search embeddings for job descriptions.

In [10]:

# FAISS: Create the index for job descriptions embeddings
dimension = job_embeddings.shape[1]  # The dimension of the embeddings
index = faiss.IndexFlatL2(dimension)  # Initialize FAISS index with L2 distance (Euclidean distance)


# Add job embeddings to the FAISS index
index.add(job_embeddings)


print("Job embeddings have been added to the FAISS index.")

Job embeddings have been added to the FAISS index.


### **Explanation**:

FAISS is used to index the embeddings, which allows for **fast nearest neighbor search**. The IndexFlatL2 index is created for exact searches, but more advanced indexing methods can be used for larger datasets to speed up searches.


* **IndexFlatL2:** A simple FAISS index that performs exact (brute-force) search using L2 (Euclidean) distance. The distances it reports are squared L2 distances.


* **index.add():** Adds the job description embeddings to the FAISS index, making them searchable.
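
To make the exact search concrete, the sketch below reproduces in plain NumPy the kind of computation a flat L2 index performs: it compares the query with every stored vector and keeps the ```k``` smallest squared distances. It is an illustration on toy data; ```brute_force_l2_search``` is a hypothetical helper, not a FAISS function.

```python
import numpy as np

def brute_force_l2_search(database, queries, k):
    """Return (squared distances, indices) of the k nearest database rows per query."""
    diffs = queries[:, None, :] - database[None, :, :]  # (n_queries, n_db, dim)
    sq_dists = (diffs ** 2).sum(axis=2)                  # (n_queries, n_db)
    order = np.argsort(sq_dists, axis=1)[:, :k]          # closest first
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy 2-D vectors standing in for 768-dimensional embeddings
database = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]], dtype="float32")
queries = np.array([[0.5, 0.0]], dtype="float32")

toy_distances, toy_indices = brute_force_l2_search(database, queries, k=2)
print(toy_indices)    # nearest database rows for the query
print(toy_distances)  # their squared L2 distances
```

The cost of this approach grows linearly with the number of stored vectors, which is why approximate indexes become attractive for large datasets.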

# 5. Perform Nearest Neighbor Search

The **nearest neighbor search** is the process of finding the job descriptions most similar to a candidate's resume, based on their embeddings. FAISS compares the query (the resume) with the indexed job descriptions and returns the closest matches.

This step ensures that when a new resume is uploaded or a job description is queried, the **system can quickly find the most relevant job descriptions or candidate resumes based on semantic similarity.**

1. Now, perform a nearest neighbor search to match a candidate’s resume with the most relevant job description.

2. Define the Query (Candidate Resume): Generate an embedding for the query (the resume).


In [11]:
# Now, generate embedding for a query resume
query_resume = ["Experienced Data Scientist proficient in Python, Machine Learning, and Data Analysis."]
query_embedding = get_embeddings(query_resume)


# Print the shape of the generated query embedding
print("Query Embedding Shape:", query_embedding.shape)

Query Embedding Shape: (1, 768)


### **Explanation**:

The code above takes a resume text, generates an embedding for it with the ```get_embeddings()``` function, and prints the shape of the result, which gives the dimensions of the resume's vector representation.

* The variable **```query_resume```** stores a list with one string: "Experienced Data Scientist proficient in Python, Machine Learning, and Data Analysis."

* The call **```get_embeddings(query_resume)```** generates an embedding (a numerical vector) for the query resume using the pre-trained model.


* ```print("Query Embedding Shape:", query_embedding.shape)``` prints the shape of the generated embedding. The ```.shape``` attribute returns the dimensions of the NumPy array that holds the embedding; ```(1, 768)``` means one vector with 768 components.

3. Search for Nearest Neighbors: Use FAISS to search for the top 3 closest matches to the query.


In [12]:
k = 3  # Number of nearest neighbors to retrieve
distances, indices = index.search(query_embedding, k)


# Display the results
print(f"Query Resume: {query_resume[0]}")
for i in range(k):
    print(f"Job {i+1}: {job_descriptions[indices[0][i]]} (Distance: {distances[0][i]:.4f})")

Query Resume: Experienced Data Scientist proficient in Python, Machine Learning, and Data Analysis.
Job 1: Data Scientist position with expertise in Python, Machine Learning, and Data Analysis. (Distance: 52.4818)
Job 2: Software Engineer with experience in Java, cloud technologies, and software development. (Distance: 130.7507)
Job 3: Marketing Manager with experience in digital marketing, content strategy, and team leadership. (Distance: 186.1873)


#### **Explanation**

The code prints the top k job descriptions, helping you understand which jobs are the most relevant to the given resume.

* **index.search():** Finds the nearest neighbors of the query (candidate resume) in the FAISS index.


* **distances:** The squared L2 distance from the query to each returned job description (lower values indicate greater similarity).


* **indices:** The indices of the most similar job descriptions from the original dataset.
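
Both returned arrays have one row per query and ```k``` columns. When an index holds fewer than ```k``` vectors, FAISS fills the unused slots with the index ```-1```, so it can be worth filtering those out before looking up the original texts. The sketch below shows one way to do that; ```format_matches``` is a hypothetical helper, the arrays are hard-coded stand-ins for the output of ```index.search()```, and the distance used for the empty slot is an assumed placeholder.

```python
import numpy as np

def format_matches(distances, indices, items):
    """Pair each returned index with its item and distance, skipping -1 placeholders."""
    matches = []
    for dist, idx in zip(distances[0], indices[0]):  # row 0 = first (only) query
        if idx == -1:  # no result in this slot
            continue
        matches.append((items[int(idx)], float(dist)))
    return matches

items = ["Data Scientist", "Software Engineer", "Marketing Manager"]

# Stand-in search output for k = 4 against an index of 3 vectors;
# the last distance is an assumed placeholder for the empty slot.
placeholder = np.finfo("float32").max
sample_distances = np.array([[52.4818, 130.7507, 186.1873, placeholder]], dtype="float32")
sample_indices = np.array([[0, 1, 2, -1]], dtype="int64")

for title, dist in format_matches(sample_distances, sample_indices, items):
    print(f"{title}: {dist:.4f}")
```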

# 6. Experiment with Advanced Index Types (Optional)

1. For larger datasets, **advanced FAISS indexes** like **Inverted File Index (IVF)** or **HNSW (Hierarchical Navigable Small World)** can be used to speed up searches by approximating the nearest neighbors, rather than searching exhaustively.




In [13]:
# Create the coarse quantizer for IVF (Inverted File Index)
quantizer = faiss.IndexFlatL2(dimension)  # This will use L2 distance (Euclidean distance)


# Create the IVF index with the quantizer, dimension, and nlist
nlist = 3  # Reduced the number of clusters
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist)



#### **Explanation**

This step enhances the system’s ability to **handle large volumes** of data, ensuring that even with millions of job descriptions and resumes, the system remains responsive and efficient.

* **Coarse quantizer:** In approximate nearest neighbor search, the coarse quantizer assigns each vector to a cluster, where each cluster is represented by a centroid. At query time the search scans only the clusters whose centroids are closest to the query, rather than checking every vector in the dataset. Here a flat L2 index over the centroids serves as the quantizer.

* **IVF index:** Organizes vectors into ```nlist``` clusters (here ```nlist = 3```) for faster search. With only three job embeddings this is purely illustrative, and FAISS will typically warn that there are too few training points for the number of clusters.

* **```faiss.IndexIVFFlat(quantizer, dimension, nlist)```** creates an Inverted File Index that stores the vectors uncompressed ("flat") inside each cluster, using the coarse quantizer to decide which cluster each embedding belongs to.
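
The clustering idea behind an inverted file index can be sketched in NumPy. This is a conceptual illustration, not FAISS's implementation: the centroids are fixed by hand rather than learned with k-means, and ```build_inverted_lists``` and ```ivf_search``` are hypothetical helper names. The ```nprobe``` argument plays the role of the number of clusters visited per query.

```python
import numpy as np

def sq_l2(a, b):
    """Squared L2 distances between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def build_inverted_lists(vectors, centroids):
    """Map each centroid id to the ids of the vectors assigned to it."""
    assignments = sq_l2(vectors, centroids).argmin(axis=1)
    return {c: np.flatnonzero(assignments == c) for c in range(len(centroids))}

def ivf_search(query, vectors, centroids, inverted_lists, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    centroid_order = np.argsort(sq_l2(query[None, :], centroids)[0])[:nprobe]
    candidates = np.concatenate([inverted_lists[int(c)] for c in centroid_order])
    dists = sq_l2(query[None, :], vectors[candidates])[0]
    best = np.argsort(dists)[:k]
    return dists[best], candidates[best]

# Toy data: two well-separated groups of 2-D vectors
vectors = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])  # hand-picked, not trained
lists = build_inverted_lists(vectors, centroids)

ivf_dists, ivf_ids = ivf_search(np.array([9.0, 10.0]), vectors, centroids, lists, k=1, nprobe=1)
print(ivf_ids, ivf_dists)
```

With ```nprobe=1``` only two of the four vectors are compared against the query. Raising ```nprobe``` scans more clusters, which costs time but reduces the chance of missing a true neighbor that sits in an adjacent cluster.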

2. Train the IVF index so that it learns the cluster centroids, then add the job description embeddings to it. An IVF index must be trained before vectors can be added.

In [14]:

# Train the index with the job description embeddings
index_ivf.train(job_embeddings)  # Training the index with embeddings


# Add embeddings to the IVF index
index_ivf.add(job_embeddings)  # Adding job embeddings to the IVF index

#### **Explanation**

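Clustering speeds up search and retrieval when a query embedding is compared with a large number of job description embeddings. The resulting index is queried the same way as the flat index, with ```index_ivf.search(query_embedding, k)```; its ```nprobe``` attribute (default 1) sets how many clusters are scanned per query, trading speed against recall.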

* ```index_ivf.train(job_embeddings)``` trains the IVF index using the job description embeddings to create clusters for efficient search.

* ```index_ivf.add(job_embeddings)``` adds the job description embeddings into the trained IVF index for searching.

# 7. Evaluate the Performance

Evaluating search performance here means measuring latency: how long the system takes to return the nearest neighbors. **Timing the search** gives a first indication of how the index will behave at scale. It does not measure the relevance of the results, which has to be assessed separately.

This step ensures that the job portal’s vector search system remains fast and responsive as more data is added, and helps **fine-tune indexing strategies** and **search algorithms** for optimal performance.

Finally, measure the search time to evaluate the performance of the FAISS index and similarity search:


In [15]:
import time

# Measure search time for a query
start_time = time.time()
distances, indices = index.search(query_embedding, k)
end_time = time.time()


print(f"Search Time: {end_time - start_time:.4f} seconds")


Search Time: 0.0002 seconds


### **Expected Output**

The output will be the time it took to perform the search in seconds. For example, in our case, the output shows:

Search Time: 0.0002 seconds

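This indicates that the search for the nearest neighbors took about 0.0002 seconds. With only three indexed vectors, this figure mostly reflects call overhead; the exact value varies by machine and run, and meaningful comparisons between index types require a much larger index.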

* ```start_time = time.time()``` stores the current time before the search operation begins. This helps to track the time taken for the search.

* ```index.search(query_embedding, k)``` is a method from the FAISS library used to search for the k nearest neighbors of the given query in the index.
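
A single timed call can be noisy, especially for operations that finish in well under a millisecond. One option is to repeat the call and average, using ```time.perf_counter()```, which is intended for measuring short durations. The sketch below is a minimal illustration; ```average_latency``` is a hypothetical helper, and the workload is a stand-in for ```index.search(query_embedding, k)```.

```python
import time

def average_latency(fn, repeats=100):
    """Return the mean wall-clock time of fn() in seconds over several runs."""
    fn()  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Stand-in workload; in the lab this could be
# lambda: index.search(query_embedding, k)
latency = average_latency(lambda: sum(range(1000)), repeats=50)
print(f"Average latency: {latency * 1e6:.1f} microseconds")
```

Averaging over many runs smooths out one-off effects such as cache warm-up or scheduler interruptions.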



## **Other Use Case Examples:**

IT professionals, especially those working in fields like cloud computing, machine learning, or data engineering, often need to build systems that can perform fast, scalable searches on large datasets. A **vector database** like FAISS helps in storing and searching high-dimensional vector representations (embeddings) of data such as text, images, or documents, making it ideal for use cases such as:

* **Cloud Resource Management & Cost Optimization:** FAISS can analyze cloud resource utilization data to identify inefficiencies and optimize costs. By embedding resource metrics, IT professionals can search for similar patterns and allocate resources more efficiently.

* **Security Incident Detection:** FAISS helps security engineers by embedding cloud security logs, enabling quick searches for similar incidents. This speeds up threat detection and improves response time for security breaches.

* **AI-Powered Cloud Service Recommendation:** FAISS can match cloud services with project requirements by embedding service documentation, allowing IT professionals to automatically recommend the most relevant services for specific use cases.

* **Multi-Cloud Deployment Optimization:** FAISS can optimize deployments in multi-cloud environments by searching past deployment strategies for cost-effective and efficient solutions, automating decision-making.
