### Embeddings, Vector Databases, and Search

* Convert text into embedding vectors as the initial step in a text processing pipeline.
* Store embedding vectors in a vector database or index to avoid recomputation and speed up retrieval.
* Use the stored vectors to search for relevant documents based on a specific query.
  * Convert the query to an embedding and search for the most similar stored embeddings.
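
The three steps above can be sketched end to end with a toy stand-in for the embedding model. The `embed` function and `VOCAB` list below are hypothetical: a bag-of-words counter over a fixed vocabulary, not a real language model. The sketch only illustrates the embed, store, and search flow.

```python
import math

# Hypothetical fixed vocabulary; a real model produces dense learned vectors instead.
VOCAB = ["car", "phone", "solar", "new", "unveils"]

def embed(text):
    """Toy embedding: count vocabulary words, then scale to unit length."""
    words = text.lower().split()
    counts = [float(words.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts

def cosine(a, b):
    """Dot product; equals cosine similarity because the vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

documents = [
    "Maker unveils new car",
    "New phone with solar charging",
    "Solar fuel research",
]

# Steps 1 and 2: embed every document once and keep the vectors around.
store = [(doc, embed(doc)) for doc in documents]

def search(query, k=1):
    """Step 3: embed the query and rank the stored vectors by similarity."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

best_match = search("a new car")
```

Here `best_match` holds the car headline, because it shares the most vocabulary words with the query.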


In [None]:
%pip install faiss-cpu -U
%pip install sentence-transformers

## Step 1: Reading data

- Use the topic-labeled news data collected by NewsCatcher, which indexes news articles and releases them for open-source use.
- The dataset is available for download on [Kaggle](https://www.kaggle.com/kotartemiy/topic-labeled-news-dataset).


In [None]:
!curl -L -o ./archive.zip https://www.kaggle.com/api/v1/datasets/download/kotartemiy/topic-labeled-news-dataset

In [None]:
!du -sch archive.zip

In [None]:
!unzip archive.zip

In [None]:
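import csv

MAX_ROWS = 100  # keep the demo small

texts = []
# newline='' is the setting the csv module recommends for files passed to csv.reader
with open('labelled_newscatcher_dataset.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file, delimiter=';')
    # The first row is the header; read it separately so it is not stored as data
    header = next(csv_reader)

    # Read the remaining rows, stopping after MAX_ROWS
    for row in csv_reader:
        texts.append(row)
        if len(texts) == MAX_ROWS:
            break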

In [None]:
texts[0:10]

In [None]:
[row[4] for row in texts][:10]  # column 4 holds the article title
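
Selecting the title by position (column 4) depends on the column order of the file. As a sketch of an alternative, `csv.DictReader` reads each field by name. The header names and the two sample rows below are assumptions made for illustration; check them against the `header` row of the downloaded file.

```python
import csv
import io

# Hypothetical sample in the same semicolon-delimited layout (assumed header names).
sample = io.StringIO(
    "topic;link;domain;published_date;title;lang\n"
    "SCIENCE;https://example.com/a;example.com;2020-08-06 13:59:45;First headline;en\n"
    "TECHNOLOGY;https://example.com/b;example.com;2020-08-12 15:14:19;Second headline;en\n"
)

reader = csv.DictReader(sample, delimiter=";")
sample_titles = [row["title"] for row in reader]
```

With the real file, the `io.StringIO` object would be replaced by the open file handle.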

## Vector Library: FAISS
- Vector libraries work well for small, static datasets but lack full database functionality, such as CRUD (Create, Read, Update, Delete) operations.
- Many index types are effectively immutable once built: appending vectors is sometimes possible, but updating or deleting individual entries is limited and often requires rebuilding the index.
- Vector libraries are easy to use, lightweight, and fast, making them practical for quick similarity searches.
- Examples of vector libraries include [FAISS](https://faiss.ai/), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), [ANNOY](https://github.com/spotify/annoy), and [HNSW](https://arxiv.org/abs/1603.09320).
- FAISS supports similarity search with L2 (Euclidean) distance and inner product; cosine similarity is obtained by normalizing the vectors to unit length and using the inner product. More information is available on their [GitHub](https://github.com/facebookresearch/faiss/wiki/Getting-started#searching) and in their [blog post](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/).
- For a comparison between vector libraries and databases, see this [blog post](https://weaviate.io/blog/vector-library-vs-vector-database#feature-comparison---library-versus-database).
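
For unit-length vectors, squared L2 distance and cosine similarity are tied together by `d^2 = 2 - 2 * cos`, so ranking by smallest L2 distance and ranking by largest inner product give the same order. The sketch below checks that identity on two made-up vectors; the helper names are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_sim = inner_product(a, b)   # cosine similarity, since both are unit length
dist_sq = squared_l2(a, b)      # squared Euclidean distance
identity_holds = math.isclose(dist_sq, 2.0 - 2.0 * cos_sim)
```

This is why normalizing vectors and using an inner-product index amounts to a cosine-similarity search.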


The overall workflow of FAISS is captured in the diagram below. 
<img src="https://miro.medium.com/v2/resize:fit:1400/0*ouf0eyQskPeGWIGm" width=700>
Source: [How to use FAISS to build your first similarity search by Asna Shafiq](https://medium.com/loopio-tech/how-to-use-faiss-to-build-your-first-similarity-search-bf0f708aa772).


## Step 2: Vectorize text into embedding vectors
We use the `all-MiniLM-L6-v2` model from the [Sentence-Transformers](https://www.sbert.net/) library to turn each title into an embedding vector. The library hosts some of the most popular transformer models on the [Hugging Face Model Hub](https://huggingface.co/sentence-transformers).

Here, the model is loaded through LangChain's `HuggingFaceEmbeddings` wrapper rather than by calling `SentenceTransformer("all-MiniLM-L6-v2")` directly.


In [4]:
titles = [row[4] for row in texts]  # column 4 holds the article title
titles[0:10]

In [5]:
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    cache_folder="./",
    model_kwargs={'device': 'cpu'}  # Use 'cuda' if you want to use GPU
)




## Step 3: Saving embedding vectors to a FAISS index
Below, we wrap each title in a LangChain `Document` and build a FAISS vector store from the documents. `FAISS.from_documents` embeds every document with the model loaded above and adds the resulting vectors to the index.


In [6]:
from langchain_core.documents import Document

docs = [Document(page_content=text) for text in titles]
docs[0]

Document(metadata={}, page_content="A closer look at water-splitting's solar fuel potential")

In [7]:
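from langchain_community.vectorstores import FAISS
# Caution: "METRIC_COSINE" does not appear to be one of LangChain's DistanceStrategy
# values, so the store likely falls back to an L2 index. The scores returned in
# Step 4 behave like L2 distances (lower means closer).
db = FAISS.from_documents(docs, embeddings, distance_strategy="METRIC_COSINE")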

## Step 4: Search for relevant documents
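We vectorize the query text and search for the stored vectors closest to it. `similarity_search` handles both steps in one call, and `similarity_search_with_score` also returns a score for each hit. In the output below, a lower score means a closer match.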


In [8]:
query = "The Maserati company is unveiling a new car"
# Use a separate name so the list of 100 documents in `docs` is not overwritten
results = db.similarity_search(query, k=2)
results

[Document(metadata={}, page_content='Maserati unveils Trofeo super sedans'),
 Document(metadata={}, page_content='Xiaomi patents a phone with a detachable display')]

In [9]:
docs_and_dist = db.similarity_search_with_score(query, k=2)
docs_and_dist

[(Document(metadata={}, page_content='Maserati unveils Trofeo super sedans'),
  0.89076364),
 (Document(metadata={}, page_content='Xiaomi patents a phone with a detachable display'),
  1.5669622)]
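
The scores above grow as the match gets worse, which is how a distance behaves rather than a similarity. Assuming they are squared L2 distances between unit-length embeddings (an assumption about this setup, not something the output states), they can be converted to cosine similarity with `cos = 1 - d^2 / 2`. The helper name below is hypothetical.

```python
def squared_l2_to_cosine(dist_sq):
    """Convert a squared L2 distance to cosine similarity.

    Valid only when both vectors have unit length (assumed here).
    """
    return 1.0 - dist_sq / 2.0

scores = [0.89076364, 1.5669622]  # values copied from the output above
cosines = [round(squared_l2_to_cosine(s), 4) for s in scores]
```

Under that assumption, the Maserati headline has a cosine similarity of about 0.55 to the query and the Xiaomi headline about 0.22.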

## Alternative Method

In [16]:
from langchain_community.vectorstores import FAISS
import numpy as np

text_embeddings = embeddings.embed_documents(titles)

len(text_embeddings)

100

In [17]:
len(text_embeddings[0])

384

In [20]:
# Normalize the embeddings
normalized_embeddings = [embedding / np.linalg.norm(embedding) for embedding in text_embeddings]

len(normalized_embeddings)

100

In [23]:

sum(normalized_embeddings[0]**2)

1.0
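
The check above prints exactly `1.0`, but floating-point rounding can leave the sum slightly off for other vectors. A tolerance-based check is a safer sketch. The vector and the `l2_normalize` helper here are made up for illustration.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

unit = l2_normalize([0.1, 0.2, 0.3, 0.4])
# Compare with a tolerance instead of testing for exact equality with 1.0
is_unit_length = math.isclose(sum(x * x for x in unit), 1.0, rel_tol=1e-9)
```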

In [26]:
import faiss
dimension = len(normalized_embeddings[0])
index = faiss.IndexFlatIP(dimension)  # Here, IP stands for Inner Product



In [27]:
index.add(np.array(normalized_embeddings).astype('float32'))
index


<faiss.swigfaiss.IndexFlatIP; proxy of <Swig Object of type 'faiss::IndexFlatIP *' at 0x336710b40> >
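
A flat inner-product index compares the query against every stored vector and keeps the `k` highest scores. The pure-Python sketch below mimics that behavior on made-up 2-D vectors; it is not FAISS code, and `flat_ip_search` is a hypothetical helper. In FAISS itself, `index.search(queries, k)` returns a pair of arrays (scores and row ids) with one row per query.

```python
import heapq

def flat_ip_search(stored, query, k):
    """Exhaustive search: score every vector, return the top-k (scores, ids)."""
    scored = ((sum(q * x for q, x in zip(query, vec)), i)
              for i, vec in enumerate(stored))
    top = heapq.nlargest(k, scored)
    return [s for s, _ in top], [i for _, i in top]

stored_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
labels = ["east", "north", "north-east"]

scores, ids = flat_ip_search(stored_vectors, [0.8, 0.6], k=2)
# Row ids map back to the original items, just as index positions map back to titles
matches = [labels[i] for i in ids]
```

The returned ids are positions in insertion order, so keeping the titles in a parallel list is enough to recover the text for each hit.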

In [34]:
# Build the vector store from the title strings (not the raw CSV rows in `texts`)
db = FAISS.from_texts(
    texts=titles,
    embedding=embeddings,
)

In [36]:
# When searching
# Shown for reference: the store embeds the query text itself,
# so this normalized vector is not passed to the search call below
query_embedding = embeddings.embed_query(query)
query_embedding = query_embedding / np.linalg.norm(query_embedding)  # Normalize query vector
docs_and_dist = db.similarity_search_with_score(query, k=2)

# The score is an L2 distance: lower means a closer match
for doc, dist in docs_and_dist:
    print(f"Content: {doc.page_content}")
    print(f"Distance: {dist}")

Content: Maserati unveils Trofeo super sedans
Distance: 0.8907636404037476
Content: Xiaomi patents a phone with a detachable display
Distance: 1.5669622421264648
