This example notebook runs a simple ETL pipeline that loads example.csv into a vector database.

In [1]:
import pandas as pd

df = pd.read_csv("example.csv")
df.head()

   id  title                 summary
0   1  Study on Diabetes     A randomized controlled trial to study the eff...
1   2  Cancer Immunotherapy  Examining T-cell response in patients treated ...
2   3  Parkinson's Disease   Early detection using gait analysis and neural...
3   4  COVID-19 Vaccine      Study on vaccine efficacy against emerging SAR...
4   5  Alzheimer's Trial     Testing the efficacy of amyloid-beta targeting...


1. Load the data with LangChain. CSVLoader turns each CSV row into one document.

In [3]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import CSVLoader

# Load CSV
loader = CSVLoader(file_path="example.csv")
docs = loader.load()

docs[:3]  # Show first 3 documents

[Document(metadata={'source': 'example.csv', 'row': 0}, page_content='id: 1\ntitle: Study on Diabetes\nsummary: A randomized controlled trial to study the effects of Metformin on blood sugar regulation.'),
 Document(metadata={'source': 'example.csv', 'row': 1}, page_content='id: 2\ntitle: Cancer Immunotherapy\nsummary: Examining T-cell response in patients treated with checkpoint inhibitors.'),
 Document(metadata={'source': 'example.csv', 'row': 2}, page_content="id: 3\ntitle: Parkinson's Disease\nsummary: Early detection using gait analysis and neural biomarkers.")]
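
To make the shape of this output concrete, here is a minimal sketch that builds the same kind of records with only the standard library. It is not LangChain's implementation: `rows_to_documents` and the plain-dict layout are hypothetical names chosen for illustration, and the sample text simply repeats the first two rows shown above.

```python
import csv
import io


def rows_to_documents(csv_text, source="example.csv"):
    """Build one record per CSV row, mirroring the output shape shown above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # Each column becomes a "name: value" line in the page content.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": page_content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


# Sample rows copied from the notebook output above.
sample_csv = (
    "id,title,summary\n"
    "1,Study on Diabetes,A randomized controlled trial to study the effects "
    "of Metformin on blood sugar regulation.\n"
    "2,Cancer Immunotherapy,Examining T-cell response in patients treated "
    "with checkpoint inhibitors.\n"
)

docs_sketch = rows_to_documents(sample_csv)
print(docs_sketch[0]["page_content"])
print(docs_sketch[0]["metadata"])
```

Every column lands in the page content, so the embedding step later sees the id and title alongside the summary.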

2. Embed the documents with an embedding model (a free one here; in production we would use a more sophisticated model).

In [None]:
# Create the embedding model (the documents are embedded in the next step)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


3. Embed the documents and store the resulting vectors in a vector index (FAISS).

In [7]:
# Embed every document and build the FAISS vector store
db = FAISS.from_documents(docs, embedding_model)

# Save the index to the local "faiss_index" directory
db.save_local("faiss_index")
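
For intuition about what the index is used for, the sketch below does a brute-force nearest-neighbor search by Euclidean (L2) distance, which is, as far as I know, the default metric of the LangChain FAISS wrapper. It is a stand-in for the idea, not FAISS itself: the two-dimensional `toy_vectors` are made up, whereas real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors, k=2):
    """Return the positions of the k stored vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]


# Made-up toy vectors standing in for document embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

print(nearest([1.0, 0.05], toy_vectors))  # prints [0, 2]
```

A flat index returns the same results as this loop, only much faster; the returned positions map back to the stored documents.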

We can now run the lambda function, which performs retrieval-augmented generation (RAG) using the saved FAISS index as its retrieval source.
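
The lambda function itself is not part of this notebook, so the following is only a sketch of the retrieval half under stated assumptions. `retrieve` and `build_prompt` are hypothetical helpers, the prompt layout is arbitrary, and the `allow_dangerous_deserialization` flag is required by recent LangChain releases (the index metadata is pickled) but may not be accepted by older ones.

```python
def build_prompt(question, contexts):
    """Join retrieved passages and the question into a single prompt string."""
    context_block = "\n\n".join(contexts)
    return f"Context:\n{context_block}\n\nQuestion: {question}"


def retrieve(question, index_dir="faiss_index", k=3):
    """Load the saved index and return the k most similar page contents."""
    # Imported lazily so build_prompt stays usable without these packages.
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS

    # The same embedding model used to build the index must be used to query it.
    embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    # Assumption: a recent LangChain release; drop the flag if yours rejects it.
    db = FAISS.load_local(
        index_dir, embedding_model, allow_dangerous_deserialization=True
    )
    return [doc.page_content for doc in db.similarity_search(question, k=k)]
```

With the index from step 3 on disk, `build_prompt(q, retrieve(q))` yields the augmented prompt that would then be passed to whichever language model the function calls.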