<a href="https://colab.research.google.com/github/zainasaadeddin/palrag/blob/main/palrag_Ingestion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [54]:
%pip install langchain chromadb sentence_transformers



In [55]:
from langchain.vectorstores.chroma import Chroma
from langchain_core.documents import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.csv_loader import CSVLoader

In [63]:
def ingestion(file_csv_path: str, chunk_size: int = 1000, chunk_overlap: int = 150,
              model_name: str = 'BAAI/bge-base-en-v1.5') -> Chroma:
    """
    Load a CSV file, split its rows into chunks, embed each chunk, and store the results in a Chroma vector store.

    Args:
        file_csv_path (str): Path to the CSV file.
        chunk_size (int, optional): Maximum chunk size, in characters. Defaults to 1000.
        chunk_overlap (int, optional): Maximum number of characters shared by consecutive chunks. Defaults to 150.
        model_name (str, optional): Name of the Hugging Face embedding model. Defaults to 'BAAI/bge-base-en-v1.5'.

    Returns:
        Chroma: The vector store holding the embedded chunks.
    """

    # Load the CSV: each row becomes one Document, and the value of its
    # 'content' column is recorded as that Document's 'source' metadata
    loader = CSVLoader(file_path=file_csv_path, source_column='content')
    loaded_docs: list[Document] = loader.load()

    # Split documents into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    all_splits: list[Document] = text_splitter.split_documents(loaded_docs)

    # Load the Hugging Face embedding model
    hfe = HuggingFaceEmbeddings(model_name=model_name)

    # Embed the chunks and store them in a Chroma vector store
    db: Chroma = Chroma.from_documents(all_splits, hfe)

    return db
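
The `chunk_size` and `chunk_overlap` arguments above control how long each chunk may be and how much text neighbouring chunks share. The sketch below illustrates that idea with a plain sliding window over characters. It is a simplification, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, whereas this version cuts at fixed offsets. The helper name `chunk_with_overlap` and the sample string are hypothetical and exist only for this illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap.

    Simplified illustration only: separators are ignored.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample input: the 26 lowercase letters
sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the role `chunk_overlap` plays: a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.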



In [65]:
chromadb = ingestion('data_docs.csv')

In [66]:
query = "what is Definition of the term refugee"
docs = chromadb.similarity_search(query)
print(docs[0].page_content)

Have agreed as follows: CHAPTER I GENERAL PROVISIONS Article 1. - Definition of the term "refugee" A. For the purposes of the present Convention, the term "refugee,, shall apply to any person who: (1) Has been considered a refugee under the Arrangements of 12 May 1926 and 30 June 1928 or under the Conventions of 28 October 1933 and 10 February 1938, the Protocol of 14 September 1939 or the Constitution of the International Refugee Organization; Decisions of non-eligibility taken by the International Refugee Organization during the period of its activities shall not prevent the status of refugee being accorded to persons who fulfil the conditions of paragraph 2 of this section; (2) As a result of events occurring before I January 1951 and owing to well-founded fear of being persecuted for reasons of race, religion, nationality, membership of a particular social group or political opinion, is outside the country of his nationality and is unable, or owing to such fear, is unwilling to
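
The query above returns the stored chunk whose embedding lies closest to the embedding of the query text. As a rough sketch of that ranking step, the code below scores a few hand-written three-dimensional vectors against a query vector and returns the best matches. Everything here is an assumption made for illustration: the vectors stand in for real model embeddings, the chunk labels are invented, `top_k` is a hypothetical helper, and cosine similarity is chosen for simplicity; the distance metric and index that Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [text for _, text in scored[:k]]


# Hypothetical toy "vector store": (chunk label, hand-written embedding)
toy_store = [
    ("chunk about refugee definitions", [0.9, 0.1, 0.0]),
    ("chunk about travel documents", [0.2, 0.8, 0.1]),
    ("chunk about final clauses", [0.0, 0.2, 0.9]),
]
toy_query_vec = [0.8, 0.2, 0.1]  # stand-in for the embedded query

ranked = top_k(toy_query_vec, toy_store, k=2)
print(ranked[0])
# chunk about refugee definitions
```

Taking `docs[0]` in the notebook corresponds to `ranked[0]` here: the single best-scoring chunk out of the list that the search returns.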
