# Document Embedding and Retrieval with FAISS and Sentence Transformers

## Description
This notebook shows how to generate embeddings for a small set of documents with a pre-trained Sentence Transformer model, store them in a FAISS vector store through LangChain, and run a similarity search. It walks through embedding the text, building a searchable vector index, and retrieving the documents most relevant to a user query (here, a question about what FAISS does).

### Install required libraries if not already installed

In [9]:
! pip install sentence-transformers langchain langchain-community faiss-cpu

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
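The warning above comes from the `tokenizers` library and is harmless for this notebook. One way to silence it, following the hint in the message itself, is to set the environment variable before any tokenizer is loaded. A minimal sketch, assuming it runs in the first cell of the notebook:

```python
import os

# Must run before sentence-transformers / tokenizers is first used in this process.
# "false" disables tokenizer parallelism, which avoids the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```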

### Import necessary libraries

In [10]:
from sentence_transformers import SentenceTransformer
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

### Load an open-source embedding model

In [13]:
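# Used directly to encode the documents and the query into vectors
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# LangChain wrapper around the same model, required by the FAISS vector store
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")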

### Sample documents

In [15]:
docs = [
    "Qualcomm specializes in AI and 5G innovations.",
    "RAG enhances LLM capabilities.",
    "FAISS enables efficient vector search.",
    "Large Language Models (LLMs) power modern AI applications."
]

### Wrap the texts in LangChain `Document` objects

In [None]:
documents = [Document(page_content=text) for text in docs]
documents

[Document(page_content='Qualcomm specializes in AI and 5G innovations.'),
 Document(page_content='RAG enhances LLM capabilities.'),
 Document(page_content='FAISS enables efficient vector search.'),
 Document(page_content='Large Language Models (LLMs) power modern AI applications.')]

### Compute embeddings for the documents

In [18]:
vectors = embedding_model.encode([doc.page_content for doc in documents])
vectors

array([[-8.8474870e-02, -6.6863187e-02,  3.9061911e-02, ...,
        -7.1872070e-02,  6.4849868e-02,  7.2620027e-02],
       [-1.2495761e-02, -5.8157807e-03,  4.1664895e-02, ...,
        -8.8449091e-02, -1.6910434e-02,  1.2876995e-05],
       [-6.2785096e-02,  1.1287932e-02, -1.0999277e-03, ...,
        -2.0743957e-02,  4.5821484e-02, -3.7647974e-02],
       [-5.0568869e-03, -6.1584532e-02,  3.4379099e-02, ...,
         3.3113774e-02, -6.4779609e-02, -4.1570116e-02]], dtype=float32)
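Each row of the array is the embedding of one document (for `all-MiniLM-L6-v2`, a 384-dimensional vector). Similarity between two texts is then measured between their vectors. The sketch below shows cosine similarity in plain Python on made-up 3-dimensional vectors; the vectors and the `cosine_similarity` helper are illustrative assumptions, not values produced by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values).
doc_a = [1.0, 0.0, 1.0]
doc_b = [1.0, 0.0, 0.9]   # points in nearly the same direction as doc_a
doc_c = [0.0, 1.0, 0.0]   # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Vectors pointing in nearly the same direction score close to 1, while orthogonal vectors score 0.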

### Combine text and embeddings manually and build the FAISS index

In [21]:
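# Pair each text with its precomputed embedding
text_embeddings = [(doc.page_content, emb) for doc, emb in zip(documents, vectors)]
# Build the FAISS index from the pairs; embedding_function is kept for embedding later text queries
vector_db = FAISS.from_embeddings(text_embeddings, embedding=embedding_function)
vector_db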

<langchain_community.vectorstores.faiss.FAISS at 0x31a498150>
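`FAISS.from_embeddings` takes an iterable of `(text, embedding)` pairs, and a FAISS index requires every vector to have the same dimension. A small sanity check before building the index might look like the following; `check_pairs` is a hypothetical helper, and the two-dimensional vectors are placeholders for the real embeddings.

```python
def check_pairs(pairs):
    """Return the shared embedding dimension, or raise if the pairs are inconsistent."""
    dims = {len(vec) for _, vec in pairs}
    if len(dims) != 1:
        raise ValueError(f"expected one embedding dimension, found {sorted(dims)}")
    for text, _ in pairs:
        if not isinstance(text, str) or not text:
            raise ValueError("each pair needs a non-empty text")
    return dims.pop()

# Placeholder pairs (hypothetical 2-D vectors, not model output).
toy_pairs = [
    ("FAISS enables efficient vector search.", [0.1, 0.9]),
    ("RAG enhances LLM capabilities.", [0.8, 0.2]),
]
dim = check_pairs(toy_pairs)
print(dim)
```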

### Perform retrieval

In [23]:
query = "What does FAISS do?"
query_embedding = embedding_model.encode(query)
# Return the 2 documents whose embeddings are closest to the query embedding
retrieved_docs = vector_db.similarity_search_by_vector(query_embedding, k=2)
retrieved_docs
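Conceptually, `similarity_search_by_vector` compares the query vector with the stored vectors and returns the `k` closest documents. The sketch below reproduces that idea with a brute-force search in plain Python. It assumes Euclidean (L2) distance, which appears to be the default distance strategy of LangChain's FAISS wrapper; treat that as an assumption. The `top_k_l2` helper and the 2-dimensional vectors are hypothetical stand-ins for the real index and embeddings.

```python
import heapq

def top_k_l2(query, pairs, k=2):
    """Return the k texts whose vectors have the smallest L2 distance to query."""
    def sq_dist(vec):
        # Squared distance gives the same ranking as the true L2 distance.
        return sum((q - v) ** 2 for q, v in zip(query, vec))
    # nsmallest returns results ordered from nearest to farthest.
    nearest = heapq.nsmallest(k, pairs, key=lambda pair: sq_dist(pair[1]))
    return [text for text, _ in nearest]

# Hypothetical 2-D embeddings; real ones come from the embedding model.
toy_pairs = [
    ("Qualcomm specializes in AI and 5G innovations.", [0.9, 0.1]),
    ("RAG enhances LLM capabilities.", [0.4, 0.6]),
    ("FAISS enables efficient vector search.", [0.1, 0.9]),
]
toy_query = [0.0, 1.0]  # stands in for the encoded query

results = top_k_l2(toy_query, toy_pairs, k=2)
print(results)
```

A real FAISS index avoids the Python loop and can use approximate structures for large collections, but the ranking principle is the same.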