## Faiss (Facebook AI Similarity Search) 
Faiss is a library for efficient similarity search and clustering of dense vectors. It’s particularly useful when dealing with large datasets, as it can index millions or even billions of vectors and retrieve the nearest neighbors quickly.



## Key Components:
1. Indexing with Faiss: Stores document embeddings in an index built for fast lookup.
2. Similarity Search: Finds the document embeddings closest to a query embedding using Faiss.
3. Embedding Model: Converts text into embeddings (providers include OpenAI, Google, Ollama, and HuggingFace).
4. LangChain Integration: LangChain's FAISS vector store plugs into document retrieval and LLM applications.

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
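# load_dotenv() has already placed the key in os.environ; fail early with a clear message if it is missing
assert os.getenv('GOOGLE_API_KEY'), "GOOGLE_API_KEY is not set; add it to your .env file"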

In [3]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embedding_model = GoogleGenerativeAIEmbeddings(model='models/embedding-001')
embedding_model

GoogleGenerativeAIEmbeddings(client=<google.ai.generativelanguage_v1beta.services.generative_service.client.GenerativeServiceClient object at 0x00000225C499B1A0>, model='models/embedding-001', task_type=None, google_api_key=None, credentials=None, client_options=None, transport=None, request_options=None)

In [16]:
documents = [
    "Langchain is a framework to simplify building LLM applications.",
    "Faiss is a library for efficient similarity search in dense vectors.",
    "Embeddings are numerical representations of text data.",
    "Langchain supports document loaders, vector stores, and chains."
]

In [17]:
# Convert these documents into embeddings
doc_embeddings = embedding_model.embed_documents(documents)
doc_embeddings

[[0.054663535207509995,
  -0.0007159478846006095,
  -0.030806895345449448,
  ...
  (output truncated)

In [18]:
# Setting up Faiss Index

import faiss
import numpy as np 

embedding_dimension = len(doc_embeddings[0])


# Initialize a Faiss index (IndexFlatL2 does exact, brute-force search by L2 distance)
index = faiss.IndexFlatL2(embedding_dimension)
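
For intuition, a flat L2 index performs an exhaustive search: it compares the query against every stored vector and ranks them by squared Euclidean distance. The NumPy sketch below mimics that behaviour on made-up 2-D vectors. `flat_l2_search` is a hypothetical helper written for illustration; it is not part of Faiss and makes no attempt to match its performance.

```python
import numpy as np

def flat_l2_search(stored, queries, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance.

    Illustrative stand-in for what a flat L2 index computes; not Faiss code.
    """
    stored = np.asarray(stored, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    # Pairwise squared distances, shape (num_queries, num_stored)
    diffs = queries[:, None, :] - stored[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Positions of the k smallest distances per query, closest first
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Made-up 2-D vectors purely for demonstration
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [[0.9, 0.1]], k=2)
print(toy_indices)    # positions of the nearest vectors, closest first
print(toy_distances)  # squared distances (no square root applied)
```

Because every stored vector is visited, results are exact, and the cost of each query grows linearly with the number of vectors.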

In [19]:
embedding_array = np.array(doc_embeddings, dtype="float32")
embedding_array

array([[ 0.05466354, -0.00071595, -0.0308069 , ...,  0.007485  ,
        -0.03760041,  0.05135082],
       [ 0.07780488, -0.00249262, -0.04513231, ...,  0.02161311,
        -0.00949594, -0.00609586],
       [ 0.01646773, -0.01763307, -0.0475939 , ...,  0.04220385,
        -0.00375746,  0.02770498],
       [ 0.06113728,  0.00062664, -0.04865748, ...,  0.01786266,
        -0.03548918,  0.02543955]], dtype=float32)

In [20]:
index.add(embedding_array)

In [21]:
# Search the index

query = "What is Langchain?"
query_embedding = embedding_model.embed_query(query)

query_embedding_np = np.array([query_embedding]).astype("float32")




In [22]:
# Search for the nearest documents (k=2 means returning the top 2 closest results)
k = 2

distances, indices = index.search(query_embedding_np, k)

In [23]:
# Retrieve the matching documents

results = [documents[i] for i in indices[0]]
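
`index.search` returns two arrays of shape `(number_of_queries, k)`. When the index holds fewer than `k` vectors, Faiss fills the missing slots with the label `-1`, and the list comprehension above would then silently pick `documents[-1]`. A defensive sketch follows, using hand-written arrays in place of real search output; `lookup_results` is a hypothetical helper.

```python
import numpy as np

def lookup_results(documents, distances, indices):
    """Pair each valid hit of the first query with its document and distance."""
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:
            # -1 marks an empty slot (fewer than k vectors were available)
            continue
        hits.append((documents[int(idx)], float(dist)))
    return hits

sample_docs = ["doc A", "doc B"]
# Hand-written arrays standing in for the output of index.search with k=3
sample_distances = np.array([[0.12, 0.57, np.finfo("float32").max]], dtype="float32")
sample_indices = np.array([[1, 0, -1]], dtype="int64")

sample_hits = lookup_results(sample_docs, sample_distances, sample_indices)
for text, dist in sample_hits:
    print(f"{dist:.2f}  {text}")
```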

In [24]:
# Print the results
for i, result in enumerate(results):
    print(f"Result {i+1}: {result}")

Result 1: Langchain is a framework to simplify building LLM applications.
Result 2: Langchain supports document loaders, vector stores, and chains.


In [31]:
documents

['Langchain is a framework to simplify building LLM applications.',
 'Faiss is a library for efficient similarity search in dense vectors.',
 'Embeddings are numerical representations of text data.',
 'Langchain supports document loaders, vector stores, and chains.']

In [36]:
# Integrating Faiss with LangChain's vector stores

from langchain_community.vectorstores import FAISS  # newer releases moved this out of langchain.vectorstores
from langchain.chains import RetrievalQA
from langchain_google_genai import ChatGoogleGenerativeAI

In [37]:

# Initialize Faiss vector store with embeddings
faiss_store = FAISS.from_texts(documents, embedding_model)
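
Conceptually, a vector store like this bundles three things: an embedding function, the stored vectors, and a mapping from vector positions back to the original texts. The toy class below sketches that idea with NumPy only. `TinyVectorStore` and `toy_embed` are hypothetical names invented for illustration; they do not reflect LangChain's actual implementation, and letter counts are a deliberately crude substitute for a real embedding model.

```python
import numpy as np

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: normalized letter counts."""
    counts = np.zeros(26, dtype="float32")
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = np.linalg.norm(counts)
    return counts / norm if norm > 0 else counts

class TinyVectorStore:
    """Toy vector store: keeps texts next to their vectors and searches by L2 distance."""

    def __init__(self, texts, vectors, embed_fn):
        self.texts = list(texts)
        self.vectors = np.asarray(vectors, dtype="float32")
        self.embed_fn = embed_fn

    @classmethod
    def from_texts(cls, texts, embed_fn):
        # Embed every text once and store the vectors alongside the texts
        return cls(texts, [embed_fn(t) for t in texts], embed_fn)

    def similarity_search(self, query, k=2):
        query_vec = self.embed_fn(query)
        sq_dists = ((self.vectors - query_vec) ** 2).sum(axis=1)
        nearest = np.argsort(sq_dists, kind="stable")[:k]
        return [self.texts[i] for i in nearest]

toy_store = TinyVectorStore.from_texts(
    ["faiss indexes vectors", "langchain builds chains", "embeddings encode text"],
    toy_embed,
)
toy_top = toy_store.similarity_search("langchain builds chains", k=2)
print(toy_top)
```

Querying with a text that is already in the store returns that text first, since its distance to itself is zero.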

In [42]:
# Define a query and retrieve relevant documents
query = "What is Langchain?"
docs = faiss_store.similarity_search(query)

llm = ChatGoogleGenerativeAI(model='gemini-1.5-flash')
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=faiss_store.as_retriever())


# Get the answer to the query (invoke replaces the deprecated run method)
answer = qa_chain.invoke({"query": query})["result"]
print(answer)






Langchain is a framework that simplifies building applications using large language models (LLMs). 
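
The `"stuff"` chain type places all retrieved documents into a single prompt alongside the question. A rough sketch of that prompt assembly is shown below. `build_stuff_prompt` and the template wording are assumptions made for illustration, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, retrieved_docs):
    """Concatenate retrieved documents into one context block ahead of the question."""
    context = "\n\n".join(retrieved_docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "What is Langchain?",
    [
        "Langchain is a framework to simplify building LLM applications.",
        "Langchain supports document loaders, vector stores, and chains.",
    ],
)
print(prompt)
```

This approach is simple, but every retrieved document must fit inside the model's context window together with the question.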

