### What is Faiss DB?

Faiss (Facebook AI Similarity Search) is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors. It is widely used for tasks like nearest neighbor search, vector similarity search, and large-scale clustering. Faiss is optimized for high performance and can handle large datasets efficiently.

#### Key Features:
- **Efficient Similarity Search**: Faiss is designed to perform fast nearest neighbor searches on dense vector data.
- **Scalability**: It can handle datasets with millions or even billions of vectors.
- **GPU Support**: Faiss supports GPU acceleration for faster computation.
- **Indexing Options**: Provides multiple indexing methods (e.g., flat, IVF, HNSW) to balance speed and memory usage.
- **Clustering**: Includes tools for clustering large datasets.
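
To make the "flat" index option concrete, here is a minimal pure-Python sketch of exhaustive nearest neighbor search. It is not Faiss code: the names `l2_distance` and `flat_search` and the toy vectors are assumptions made for illustration. Faiss performs the same kind of comparison with optimized native code.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(vectors, query, k=1):
    """Exhaustively compare `query` with every stored vector.

    Returns a list of (index, distance) pairs, closest first.
    """
    scored = [(i, l2_distance(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical values).
vectors = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
]
nearest = flat_search(vectors, query=[1.0, 0.0, 0.0], k=2)
print(nearest)  # indices 1 and 2 are the two closest vectors
```

A flat search is exact but touches every vector, which is why approximate indexes such as IVF and HNSW trade some accuracy for speed on large collections.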

#### Applications:
- Semantic search
- Recommendation systems
- Image and video similarity search
- Large-scale clustering
- Natural language processing (e.g., text embeddings)

#### Installation:
You can install the CPU build of Faiss with pip:
```bash
pip install faiss-cpu
# Or for GPU support (wheel availability depends on your platform and CUDA version)
pip install faiss-gpu
```

### Faiss vs Chroma DB

Both **Faiss** and **Chroma DB** are tools used for managing and querying vector embeddings, but they are designed for different use cases and have distinct features. Below is a comparison:

---

#### **Faiss (Facebook AI Similarity Search)**
- **Purpose**: Faiss is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors.
- **Key Features**:
  - Optimized for **nearest neighbor search** and **vector similarity search**.
  - Supports **GPU acceleration** for high-performance computation.
  - Provides multiple **indexing methods** (e.g., Flat, IVF, HNSW) to balance speed and memory usage.
  - Designed for **large-scale datasets** with millions or billions of vectors.
  - Focuses on **low-level operations** for vector search.
- **Use Cases**:
  - Semantic search
  - Recommendation systems
  - Image and video similarity search
  - Large-scale clustering
- **Strengths**:
  - Extremely fast and scalable.
  - Highly customizable for advanced users.
- **Limitations**:
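  - Has no database layer: an index can be written to disk (for example with `faiss.write_index`), but there is no built-in store for documents or metadata.
  - Metadata, ID mapping, and persistence of the original documents must be managed by the caller.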
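
One common way to handle metadata by hand is to keep a side table keyed by each vector's position in the index. The sketch below uses only plain Python; `metadata_by_id`, `attach_metadata`, `filter_by_source`, and the sample hits are hypothetical names and values, not part of Faiss.

```python
# Hypothetical side table: position in the index -> metadata record.
metadata_by_id = {
    0: {"source": "speech.txt", "chunk": 0},
    1: {"source": "speech.txt", "chunk": 1},
    2: {"source": "notes.txt", "chunk": 0},
}

def attach_metadata(hits, table):
    """Join (id, distance) search hits with their metadata records."""
    return [{"id": i, "distance": d, **table[i]} for i, d in hits]

def filter_by_source(records, source):
    """Post-filter joined results on a metadata field."""
    return [r for r in records if r["source"] == source]

# Pretend these (id, distance) pairs came back from a vector search.
hits = [(2, 0.12), (0, 0.35), (1, 0.80)]
records = attach_metadata(hits, metadata_by_id)
speech_only = filter_by_source(records, "speech.txt")
print(speech_only)
```

Wrappers such as LangChain's `FAISS` class take care of this bookkeeping, which is why the notebook cells further down can return `Document` objects with a `metadata` attribute.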

---

#### **Chroma DB**
- **Purpose**: Chroma DB is a vector database designed to store, manage, and query embeddings along with their associated metadata.
- **Key Features**:
  - Provides a **persistent database** for storing embeddings and metadata.
  - Supports **metadata filtering** for advanced queries.
  - Built-in support for **LangChain** and other machine learning workflows.
  - Easy to use with a high-level API for managing embeddings.
  - Focuses on **end-to-end workflows** for vector search.
- **Use Cases**:
  - Storing embeddings for text, images, or other data.
  - Semantic search with metadata filtering.
  - Applications requiring persistent storage of embeddings.
- **Strengths**:
  - Simple to set up and use.
  - Built-in persistence and metadata management.
- **Limitations**:
  - May not be as fast or scalable as Faiss for extremely large datasets.
  - Lacks GPU acceleration for high-performance computation.

---

#### **Comparison Table**

| Feature                  | Faiss                          | Chroma DB                     |
|--------------------------|---------------------------------|-------------------------------|
| **Primary Use**          | Similarity search and clustering | Persistent vector database    |
| **Scalability**          | Handles millions/billions of vectors | Suitable for medium-scale datasets |
| **Performance**          | Extremely fast, GPU support    | Slower compared to Faiss      |
| **Persistence**          | Index files only; no document store | Built-in persistence          |
| **Metadata Support**     | Requires manual handling       | Built-in metadata management  |
| **Ease of Use**          | Requires advanced knowledge    | Easy to use with high-level API |
| **Integration**          | Low-level library              | Integrated with LangChain and ML workflows |
| **Indexing Options**     | Flat, IVF, HNSW, etc.          | Limited indexing options      |

---

#### **When to Use Faiss**
- You need **high-performance similarity search** for large-scale datasets.
- You are comfortable managing metadata and persistence manually.
- You require **GPU acceleration** for faster computation.

#### **When to Use Chroma DB**
- You need a **persistent vector database** with metadata support.
- You want an easy-to-use solution integrated with **LangChain** or other ML workflows.
- Your dataset is medium-sized and does not require extreme scalability.

---

Both tools are powerful in their respective domains, and the choice depends on your specific use case and requirements.

In [1]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

In [3]:
loader = TextLoader("speech.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
split_docs = text_splitter.split_documents(documents)
split_docs

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…'),
 Document(metadata={'source': 'speech.txt'}, page_content='It will be all the easier for us to conduct ourselve

In [5]:
# Use OllamaEmbeddings to create embeddings for the documents
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
embeddings

OllamaEmbeddings(model='nomic-embed-text', base_url=None, client_kwargs={}, mirostat=None, mirostat_eta=None, mirostat_tau=None, num_ctx=None, num_gpu=None, keep_alive=None, num_thread=None, repeat_last_n=None, repeat_penalty=None, temperature=None, stop=None, tfs_z=None, top_k=None, top_p=None)

In [6]:
# Embed the chunks and store them in a FAISS vector store
from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(split_docs, embeddings)
db

<langchain_community.vectorstores.faiss.FAISS at 0x21d24cc8740>

In [7]:
# Query the vector store for the most similar text document to the query text
query_text = "What is peace?"

# similarity_search returns a list of Document objects,
# each with page_content and metadata attributes.

# k is the number of documents to return; the default is 4, and here we ask only for the closest one.
# By default, LangChain's FAISS wrapper ranks results by Euclidean (L2) distance, not cosine similarity.
query_result = db.similarity_search(query_text, k=1)
query_result[0].page_content

'The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…'

In [None]:
# Query the vector store for the most similar text document to the query text
query_text = "What is our rights?"

query_result = db.similarity_search(query_text, k=1)
query_result[0].page_content

'It is a distressing and oppressive duty, gentlemen of the Congress, which I have performed in thus addressing you. There are, it may be, many months of fiery trial and sacrifice ahead of us. It is a fearful thing to lead this great peaceful people into war, into the most terrible and disastrous of all wars, civilization itself seeming to be in the balance. But the right is more precious than peace, and we shall fight for the things which we have always carried nearest our hearts—for democracy, for the right of those who submit to authority to have a voice in their own governments, for the rights and liberties of small nations, for a universal dominion of right by such a concert of free peoples as shall bring peace and safety to all nations and make the world itself at last free.'

In [11]:
query_text = "How does the speaker describe the desired outcome of the war?"
query_result = db.similarity_search(query_text, k=1)
query_result[0].page_content

'It will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people or with the desire to bring any injury or disadvantage upon them, but only in armed opposition to an irresponsible government which has thrown aside all considerations of humanity and of right and is running amuck. We are, let me say again, the sincere friends of the German people, and shall desire nothing so much as the early reestablishment of intimate relations of mutual advantage between us—however hard it may be for them, for the time being, to believe that this is spoken from our hearts.'

### Explanation: Retrieval with FAISS as a Retriever

In LangChain, a **retriever** is an abstraction that allows you to retrieve the most relevant documents for a given query. By converting a **vector store** (like FAISS) into a retriever, you can seamlessly integrate it with other LangChain components, such as chains or pipelines, which are designed to work with retrievers.

#### Why Convert FAISS into a Retriever?
- **Ease of Use**: Many LangChain methods and workflows are designed to work with retrievers, making it easier to integrate FAISS into your pipeline.
- **Simplified Querying**: The retriever interface abstracts away the complexity of querying the vector store, allowing you to focus on the results.
- **Compatibility**: Retrievers are compatible with LangChain's higher-level methods, enabling you to use FAISS in tasks like question answering, summarization, or chatbots.

#### How It Works:
1. **Convert FAISS to a Retriever**:
   - Use the `as_retriever()` method to convert the FAISS vector store into a retriever object.
   - Set the number of documents to return at this point with `search_kwargs={"k": ...}`; the default is 4.

2. **Retrieve Relevant Documents**:
   - Use the `invoke()` method of the retriever to find the most relevant documents for a given query.
   - Whether a `k` passed directly to `invoke()` is honored can depend on the LangChain version, so `search_kwargs` is the safer place to set it.

3. **Access Retrieved Documents**:
   - The retrieved documents are returned as a list of `Document` objects, which contain attributes like `page_content` (the text of the document) and `metadata`.
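
The idea behind the retriever abstraction can be sketched without LangChain. The toy class below is hypothetical (`ToyRetriever` and `keyword_search` are illustrative names), and it swaps in a simple keyword-overlap search in place of vector search so that it runs on its own. The point is only the shape of the interface: a query goes in, a ranked list of documents comes out.

```python
class ToyRetriever:
    """Hypothetical stand-in for a retriever: one method, query in, documents out."""

    def __init__(self, search_fn, k=4):
        self.search_fn = search_fn  # any callable (query, k) -> list of documents
        self.k = k                  # fixed when the retriever is built

    def invoke(self, query):
        return self.search_fn(query, self.k)

def keyword_search(query, k):
    """Toy search: rank documents by how many query words they contain."""
    corpus = [
        "peace must be planted upon political liberty",
        "we desire no conquest, no dominion",
        "the right is more precious than peace",
    ]
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(words & set(doc.split())))
    return ranked[:k]

retriever_sketch = ToyRetriever(keyword_search, k=1)
docs_sketch = retriever_sketch.invoke("political liberty")
print(docs_sketch)
```

Because callers depend only on `invoke`, the search behind it can be replaced (keyword search here, FAISS in the notebook) without changing the code that consumes the results.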


#### Key Points:
- **`db.as_retriever(search_kwargs={"k": 1})`**: Converts the FAISS vector store into a retriever object that returns the top `k` documents.
- **`retriever.invoke(query_text)`**: Retrieves the most similar documents for the query text.
- **`docs[0].page_content`**: Accesses the content of the first retrieved document.

#### Benefits:
- **Integration**: Easily integrate FAISS with LangChain workflows.
- **Flexibility**: Use the retriever in tasks like question answering, summarization, or semantic search.
- **Abstraction**: Simplifies the process of querying the vector store.

This approach makes it easier to use FAISS in end-to-end machine learning pipelines and leverages LangChain's capabilities for building intelligent applications.

In [14]:
# retriever
# Convert the FAISS vector store into a retriever object.
# k is set through search_kwargs; a k passed to invoke() may be ignored in some LangChain versions.
retriever = db.as_retriever(search_kwargs={"k": 1})

# Use the retriever to find the most relevant document for the given query
docs = retriever.invoke(query_text)

# Show the content of the first retrieved document
docs[0].page_content

'It will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people or with the desire to bring any injury or disadvantage upon them, but only in armed opposition to an irresponsible government which has thrown aside all considerations of humanity and of right and is running amuck. We are, let me say again, the sincere friends of the German people, and shall desire nothing so much as the early reestablishment of intimate relations of mutual advantage between us—however hard it may be for them, for the time being, to believe that this is spoken from our hearts.'

### Similarity Search with Score

**Similarity Search** is a process of finding the most relevant documents or items in a dataset based on their similarity to a given query. In the context of vector databases like FAISS or Chroma DB, similarity is typically measured using mathematical metrics such as cosine similarity, Euclidean distance, or dot product.

When performing a similarity search, the results often include a **score** that indicates how similar each retrieved document is to the query. This score helps rank the results, with higher scores (or lower distances, depending on the metric) indicating greater similarity.

---

#### How Similarity Search Works:
1. **Query Vector**: The query text or item is converted into a dense vector representation using an embedding model.
2. **Similarity Metric**: The similarity between the query vector and the vectors in the database is computed using a metric like:
   - **Cosine Similarity**: Measures the cosine of the angle between two vectors.
   - **Euclidean Distance**: Measures the straight-line distance between two vectors.
   - **Dot Product**: Sums the element-wise products of two vectors; for unit-length vectors it equals cosine similarity.
3. **Ranking**: The results are ranked based on their similarity scores.
4. **Top-k Results**: The top `k` most similar items are returned, along with their scores.
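
The steps above can be sketched in plain Python. The helper names (`dot`, `norm`, `cosine_similarity`, `euclidean_distance`, `top_k`) and the two-dimensional toy vectors are assumptions made for illustration; in practice the vectors would come from an embedding model.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, items, k, score_fn, higher_is_better):
    """Score every item against the query and return the k best (index, score) pairs."""
    scored = [(i, score_fn(query, v)) for i, v in enumerate(items)]
    scored.sort(key=lambda p: p[1], reverse=higher_is_better)
    return scored[:k]

query = [1.0, 0.0]
items = [[2.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

by_cosine = top_k(query, items, 2, cosine_similarity, higher_is_better=True)
by_l2 = top_k(query, items, 2, euclidean_distance, higher_is_better=False)
by_dot = top_k(query, items, 2, dot, higher_is_better=True)

print([i for i, _ in by_cosine])  # item order under cosine similarity
print([i for i, _ in by_l2])      # item order under Euclidean distance
print([i for i, _ in by_dot])     # item order under dot product
```

With these unnormalized toy vectors, cosine similarity and dot product rank item 0 first, while Euclidean distance ranks item 1 first. The metrics can disagree unless vectors are normalized, and the sort direction differs between similarities and distances.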



#### Explanation of the Code:
- **`similarity_search_with_score`**: Retrieves the top `k` most similar documents, each paired with a score.
- **`doc.page_content`**: Contains the text content of the retrieved document.
- **`score`**: With LangChain's default FAISS settings this is an L2-based distance between the query and the document, so a **lower** score means a closer match.
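
When comparing a distance score with a cosine similarity, the following identity can help. It is a sketch under two assumptions that the notebook does not verify: both embeddings $a$ and $b$ are normalized to unit length, and the reported score is a squared L2 distance, as Faiss's flat L2 index reports.

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\,a \cdot b = 2 - 2\cos\theta
\quad\Longrightarrow\quad
\cos\theta = 1 - \tfrac{1}{2}\lVert a - b \rVert^2
$$

Under those assumptions a score of 0 corresponds to vectors pointing in the same direction, and a score of 2 corresponds to orthogonal vectors.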



---

#### Use Cases:
- **Semantic Search**: Retrieve documents or items that are semantically similar to a query.
- **Recommendation Systems**: Suggest items (e.g., products, movies) similar to a user's preferences.
- **Question Answering**: Find the most relevant context for answering a query.
- **Clustering**: Group similar items based on their embeddings.

---

#### Key Points:
- The **score** helps rank the results, making it easier to identify the most relevant items.
- The choice of similarity metric (e.g., cosine similarity, Euclidean distance) depends on the use case and the nature of the data.
- Similarity search with scores is a powerful tool for building intelligent applications like search engines, chatbots, and recommendation systems.


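In [None]:
# Illustrative output format only: hypothetical texts and scores, not results from speech.txt.
# These example scores read as similarities (higher is better); the FAISS run below reports distances.
Result 1:
Content: "Peace is a state of harmony and absence of conflict."
Score: 0.95
---
Result 2:
Content: "The speaker emphasizes the importance of peace in society."
Score: 0.89
---
Result 3:
Content: "Achieving peace requires mutual understanding and cooperation."
Score: 0.87
---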

In [16]:
# Perform a similarity search with scores
query_text = "What is peace?"
results = db.similarity_search_with_score(query_text, k=3)

# Display the results
for i, (doc, score) in enumerate(results):
    print(f"Result {i+1}:")
    print(f"Content: {doc.page_content}")
    print(f"Score: {score}")
    print("---")

Result 1:
Content: The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.

Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.

…
Score: 0.8123457431793213
---
Result 2:
Content: It is a distressing and oppressive duty, gentlemen of the Congress, which I have performed in thus addressing

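#### Searching by Embedding Vector

`similarity_search_by_vector` takes a precomputed query embedding and returns documents without scores; `similarity_search_with_score_by_vector` returns `(document, score)` pairs instead.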

In [17]:
embedding_vector = embeddings.embed_query(query_text)
embedding_vector

[0.080493905,
 0.031007947,
 -0.13425358,
 -0.012883559,
 0.078819826,
 0.03272236,
 -0.042165823,
 -0.02717044,
 0.035894085,
 -0.012842001,
 -0.00031560895,
 -0.034261215,
 0.02525579,
 -0.017779302,
 -0.008960651,
 -0.047003407,
 -0.0053089377,
 -0.04213668,
 -0.004509077,
 -0.0040011466,
 -0.04388008,
 0.0072962493,
 -0.01755423,
 -0.017147146,
 0.13861056,
 0.05559193,
 0.021880273,
 -0.011439422,
 -0.029114421,
 -0.022841526,
 0.025394883,
 -0.043792747,
 0.014084642,
 0.059528068,
 0.0008247177,
 -0.019048797,
 0.076762944,
 0.026414575,
 0.06463296,
 -0.021938562,
 -0.025682133,
 -0.024936039,
 0.027952494,
 0.003661403,
 0.051455032,
 -0.03161116,
 -0.054588597,
 0.016689241,
 0.03536898,
 -0.045995113,
 0.013263655,
 -0.061693612,
 -0.0016022736,
 -0.017221382,
 0.020633299,
 0.06825542,
 0.028618403,
 -0.001630879,
 -0.04312915,
 -0.014909995,
 0.03590043,
 0.026402442,
 0.01995912,
 0.097270794,
 0.08051038,
 -0.02272046,
 -0.07603095,
 -0.007812331,
 0.032093666,
 -0.02085

In [19]:
# Returns Document objects only; use similarity_search_with_score_by_vector for (doc, score) pairs
docs_by_vector = db.similarity_search_by_vector(embedding_vector)
docs_by_vector

[Document(id='334d8623-4be7-46de-afdf-a4658f2e3759', metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…'),
 Document(id='7864e2e3-13e0-4064-89d9-2e2c27916260', metadata={'sour

#### Saving and Loading the Vector Store

In [22]:
# Save the FAISS index and its document store to the local folder "faiss_index"
db.save_local("faiss_index")

In [28]:
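# load_local unpickles the saved document store, so allow_dangerous_deserialization=True
# must be set explicitly. Only load index folders that you created or trust.
new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
docs = new_db.similarity_search(query_text, k=1)
docs[0].page_content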

'The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…'
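
The save-and-load round trip can be sketched with a toy store to show what persistence has to cover: both the vectors and the texts they belong to. This sketch swaps in a plain JSON file, which is not the format `save_local` uses; `save_store`, `load_store`, and the sample data are hypothetical.

```python
import json
import os
import tempfile

def save_store(path, vectors, texts):
    """Write vectors and their texts to one JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"vectors": vectors, "texts": texts}, f)

def load_store(path):
    """Read back the vectors and texts written by save_store."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["vectors"], data["texts"]

# Toy data (hypothetical values).
vectors = [[0.1, 0.2], [0.3, 0.4]]
texts = ["first chunk", "second chunk"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_store.json")
    save_store(path, vectors, texts)
    loaded_vectors, loaded_texts = load_store(path)

print(loaded_vectors == vectors, loaded_texts == texts)
```

A data-only format like JSON cannot run code when it is loaded, which is the concern behind the `allow_dangerous_deserialization` flag used with pickle-based files.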