### In this notebook we use Chroma to build a vector database from the cosmology arXiv dataset prepared earlier

In [5]:
import pandas as pd
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

In [2]:
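# If `id` is read as a number, leading zeros are lost (705.2176 instead of 0705.2176);
# passing dtype={'id': str} to read_csv avoids this.
df_cosmo = pd.read_csv('arxiv_astro-ph_data_cosmo.csv')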
df_cosmo.head()

| | id | title | abstract | categories | cat_text | prepared_text |
|---|---|---|---|---|---|---|
| 0 | 705.2176 | Gravitational particle production in braneworl... | Gravitational particle production in time vari... | hep-ph astro-ph.CO gr-qc | High Energy Physics - Phenomenology, Cosmology... | Gravitational particle production in braneworl... |
| 1 | 705.2299 | Time evolution of T_{\mu\nu} and the cosmologi... | We study the cosmic time evolution of an effec... | hep-ph astro-ph.CO gr-qc | High Energy Physics - Phenomenology, Cosmology... | Time evolution of T_{\mu\nu} and the cosmologi... |
| 2 | 705.3289 | Helium abundance in galaxy clusters and Sunyae... | It has long been suggested that helium nuclei ... | astro-ph astro-ph.CO astro-ph.HE astro-ph.IM | Astrophysics, Cosmology and Nongalactic Astrop... | Helium abundance in galaxy clusters and Sunyae... |
| 3 | 705.4139 | Our Peculiar Motion Away from the Local Void | The peculiar velocity of the Local Group of ga... | astro-ph astro-ph.CO | Astrophysics, Cosmology and Nongalactic Astrop... | Our Peculiar Motion Away from the Local Void \... |
| 4 | 707.1351 | Inverse approach to Einstein's equations for f... | We expand previous work on an inverse approach... | gr-qc astro-ph.CO | General Relativity and Quantum Cosmology, Cosm... | Inverse approach to Einstein's equations for f... |


In [3]:
df_cosmo.shape

(66103, 6)

So we will create the vector database from these ~66k documents.

In [4]:
# Create a DataFrameLoader
loader = DataFrameLoader(df_cosmo, page_content_column='prepared_text')
arxiv_documents = loader.load()

arxiv_documents[0]

Document(page_content='Gravitational particle production in braneworld cosmology \n Gravitational particle production in time variable metric of an expanding universe is efficient only when the Hubble parameter $H$ is not too small in comparison with the particle mass. In standard cosmology, the huge value of the Planck mass $M_{Pl}$ makes the mechanism phenomenologically irrelevant. On the other hand, in braneworld cosmology the expansion rate of the early universe can be much faster and many weakly interacting particles can be abundantly created. Cosmological implications are discussed.', metadata={'id': '0705.2176', 'title': 'Gravitational particle production in braneworld cosmology', 'abstract': 'Gravitational particle production in time variable metric of an expanding universe is efficient only when the Hubble parameter $H$ is not too small in comparison with the particle mass. In standard cosmology, the huge value of the Planck mass $M_{Pl}$ makes the mechanism phenomenologically

### Refer to the MTEB leaderboard for the best-performing embedding models: https://huggingface.co/spaces/mteb/leaderboard

#### Here we optimize for speed and will revisit the choice later. Embedding model used: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [6]:
# Get the embedding model
# Alternatives considered: "BAAI/bge-small-en-v1.5", "sentence-transformers/all-mpnet-base-v2"
model_name = "sentence-transformers/all-MiniLM-L6-v2"
# bge-base-en-v1.5 and bge-small took too long to embed all ~66k cosmology docs
model_kwargs = {"device": "cpu"}  # Running on a local machine, so we use the CPU

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)


**Note:** What is the optimal chunking strategy here?

In [9]:
# Split the documents into smaller chunks

# chunk_size and chunk_overlap are measured in characters.
# Kept small initially, since these are just abstracts, not full paper text.
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

chunked_docs = splitter.split_documents(arxiv_documents)

In [10]:
chunked_docs[0]

Document(page_content='Gravitational particle production in braneworld cosmology', metadata={'id': '0705.2176', 'title': 'Gravitational particle production in braneworld cosmology', 'abstract': 'Gravitational particle production in time variable metric of an expanding universe is efficient only when the Hubble parameter $H$ is not too small in comparison with the particle mass. In standard cosmology, the huge value of the Planck mass $M_{Pl}$ makes the mechanism phenomenologically irrelevant. On the other hand, in braneworld cosmology the expansion rate of the early universe can be much faster and many weakly interacting particles can be abundantly created. Cosmological implications are discussed.', 'categories': 'hep-ph astro-ph.CO gr-qc', 'cat_text': 'High Energy Physics - Phenomenology, Cosmology and Nongalactic Astrophysics, General Relativity and Quantum Cosmology'})
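
The first chunk above holds only the title, presumably because `prepared_text` joins title and abstract with `\n` (one of the splitter's separators) and the two together exceed 200 characters. To get a rough feel for how `chunk_size` changes the number of chunks, here is a minimal sketch. It uses a plain fixed-width character window, which is a simplified stand-in and not what `RecursiveCharacterTextSplitter` does; `window_chunks`, `count_chunks`, and the sample texts are hypothetical names introduced for illustration.

```python
def window_chunks(text, chunk_size, overlap):
    """Split text into fixed-width character windows with the given overlap.

    Simplified stand-in for illustration only; it ignores separators,
    unlike RecursiveCharacterTextSplitter.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def count_chunks(texts, chunk_size, overlap):
    """Total number of windows produced over a collection of texts."""
    return sum(len(window_chunks(t, chunk_size, overlap)) for t in texts)


# Made-up stand-ins for abstracts; in the notebook one could pass
# df_cosmo['prepared_text'] instead (assumption, not run here).
sample_texts = ["x" * 500, "y" * 150]
for size in (200, 500, 1000):
    print(size, count_chunks(sample_texts, size, overlap=20))
```

Fewer, larger chunks mean fewer vectors to embed and store, at the cost of coarser retrieval granularity; the right trade-off for abstracts is still the open question raised in the note above.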

**Note:** Try this with FAISS and other vector stores as well. How does the choice affect performance?

In [11]:
# Create the vector DB with Chroma and persist it for future use.
# This took about 40 minutes on a 2023 MacBook Pro (M2 Pro).

vectordb = Chroma.from_documents(documents=chunked_docs, embedding=embeddings, persist_directory="arxiv_cosmo_chroma_db")
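
Since the single `from_documents` call runs for a long time, one option is to add the chunks in batches so that progress is visible along the way. The following is a minimal sketch of the batching step only; `batched` and the batch size are hypothetical choices, and the commented usage assumes the `add_documents` method of the LangChain vector store and has not been run against this dataset.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical usage with the objects from this notebook (not run here):
# vectordb = Chroma(embedding_function=embeddings,
#                   persist_directory="arxiv_cosmo_chroma_db")
# for i, batch in enumerate(batched(chunked_docs, 5000)):
#     vectordb.add_documents(batch)
#     print(f"finished batch {i}")

# Small self-contained demonstration of the helper.
demo = list(batched(list(range(5)), 2))
print(demo)
```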

In [12]:
# Explicitly flush to disk; newer Chroma versions write to persist_directory automatically
vectordb.persist()
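
Once persisted, the store is used by embedding a query and returning the chunks whose vectors lie closest to it. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity. Both are assumptions for illustration: the real embeddings come from the model above, the distance metric depends on how the Chroma collection is configured, and in practice one would call the store's `similarity_search` method rather than this brute-force loop. `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k stored vectors most similar to the query (brute force)."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up vectors standing in for embedded chunks.
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
result = top_k([1.0, 0.05, 0.0], toy_vectors, k=2)
print(result)
```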