This notebook implements a simple vector search system.

The steps are:

1. Loading and preprocessing a CSV dataset.
2. Vectorizing the text data using a Hugging Face embedding model.
3. Storing the vectorized data in Chroma (chromadb), an open-source vector database.
4. Performing similarity searches on the vectorized data.

In [None]:
# Installing required libraries: chroma, langchain, etc.
!pip install chromadb
!pip install langchain sentence-transformers langchain_chroma
!pip install langchain_community
!pip install huggingface_hub

In [None]:
import pandas as pd
from langchain_chroma import Chroma  # Chroma vector store wrapper
from langchain_community.embeddings import HuggingFaceEmbeddings  # wraps sentence-transformers models
from langchain_core.documents import Document  # stores each row's text as page_content


### 1. Loading and preprocessing the dataset

*   We only use the first 10,000 rows of the dataset for now, for simplicity and speed.
*   In the next block, we convert the dataframe column to a list and then to LangChain `Document` objects (each row's text is stored as `page_content`).



In [None]:
df_total = pd.read_csv("cleaned_TEP-fa.csv")
df_total = df_total[["English"]]  # keep only the English sentences
df = df_total.head(10000)  # first 10,000 rows

In [None]:
# Assuming the sentences are stored in a column called 'English'
documents = df['English'].tolist()

# Create a list of Document objects
document_objects = [Document(page_content=doc) for doc in documents]

### 2. & 3. Vectorizing the text data with a Hugging Face embedding model and storing it in Chroma

Now we need an embedding model. Here we use the sentence-transformers model "all-mpnet-base-v2" from Hugging Face, which maps sentences and paragraphs to a 768-dimensional dense vector space.
- More about this model: [here](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- SentenceTransformers documentation: [here](https://www.sbert.net/)

In [None]:
model_name = "sentence-transformers/all-mpnet-base-v2"
embed_model = HuggingFaceEmbeddings(model_name=model_name)

# Create a vector database and add the vectorized documents
vector_db = Chroma.from_documents(document_objects, embed_model)

### 4. Performing similarity searches on the vectorized data


*   We define a function `search_similar` that takes a `query` and a number `k`, and returns the `k` documents most similar to the query.
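
Conceptually, a similarity search compares the query's embedding with every stored embedding and keeps the closest ones. The following is a minimal sketch of that idea, not Chroma's actual implementation: it uses made-up 3-dimensional vectors and exact cosine similarity, whereas the vector store works on the 768-dimensional embeddings with its own index and distance metric. The names `cosine_similarity`, `top_k`, `toy_store`, and `toy_query` are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; the values are invented for illustration.
toy_store = [
    ("can i read you a story", [0.9, 0.1, 0.0]),
    ("you wrote me a play", [0.7, 0.3, 0.1]),
    ("pass me the salt", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]

for text, score in top_k(toy_query, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```

The two sentences whose toy vectors point in roughly the same direction as the query are ranked first, and the unrelated one is left out. `vector_db.similarity_search` plays the same role in the cell below, with the embedding model producing the vectors.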



In [None]:
# Function to search for similar sentences based on a query
def search_similar(query, k=10):
    similar_docs = vector_db.similarity_search(query, k=k)
    return similar_docs

In [None]:
# Example query
query = "movies are an excellent story telling medium"
results = search_similar(query)

In [None]:
# Print the results
for result in results:
    print(result.page_content)  # Access the content of each Document

people want human interest stories
people want human interest stories
can i read you a story
can i read you a story
you wrote me a play
you wrote me a play
bringing with it the experiences
bringing with it the experiences
about the supernatural i know
about the supernatural i know


**Additional information:** the TEP dataset used in this notebook contains multiple duplicate entries, hence the repeated documents in the results.
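
One way to avoid the repeated results, which this notebook does not do, would be to drop repeated sentences before building the `Document` objects. The sketch below shows an order-preserving deduplication in plain Python; `deduplicate` is a hypothetical helper, and the sample rows are copied from the output above.

```python
def deduplicate(sentences):
    """Drop repeated sentences, keeping the first occurrence of each."""
    # dict keys keep insertion order, so this preserves the original ordering.
    return list(dict.fromkeys(sentences))


# Sample rows taken from the search output, where each sentence appears twice.
sample = [
    "people want human interest stories",
    "people want human interest stories",
    "can i read you a story",
    "can i read you a story",
]

unique_sentences = deduplicate(sample)
print(unique_sentences)
```

With pandas, calling `df.drop_duplicates(subset="English")` before converting the column to a list should have the same effect on the dataframe used here.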