In [2]:
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

Link to the vector store integrations

https://docs.langchain.com/oss/python/integrations/vectorstores

In [9]:
# Data Ingestion
loader = TextLoader("speech.txt", encoding="utf-8")  # read as UTF-8 so dashes and quotes are not garbled
documents = loader.load()

In [10]:
# Data transformation: split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(documents)

Created a chunk of size 670, which is longer than the specified 500
Created a chunk of size 984, which is longer than the specified 500
Created a chunk of size 791, which is longer than the specified 500
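
The warnings above are consistent with how separator-based splitting behaves: CharacterTextSplitter splits on a single separator (a blank line by default) and merges the resulting pieces up to chunk_size, so a paragraph longer than 500 characters is kept whole. Below is a minimal pure-Python sketch of that idea, not the library's implementation; split_on_separator and the sample text are hypothetical names and data introduced for illustration.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is emitted unchanged, because this
    strategy never cuts inside a piece.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = piece  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# Hypothetical text: two short paragraphs and one 700-character paragraph.
sample = "\n\n".join(["a" * 200, "b" * 200, "c" * 700])
chunks = split_on_separator(sample, chunk_size=500)
sizes = [len(c) for c in chunks]
print(sizes)
```

The two short paragraphs are merged into one chunk, while the long paragraph stays oversized. If a hard size limit matters, RecursiveCharacterTextSplitter from langchain_text_splitters is worth trying, since it falls back to finer separators when a piece is too long.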


In [11]:
# Embedding (requires a local Ollama server with this model pulled)
embeddings = OllamaEmbeddings(model="gemma:2b")

In [12]:
# Vector store (in memory)
db = Chroma.from_documents(splits, embeddings)

In [13]:
# Querying (for testing)
query = "What does the speaker believe is the main reason the United States should enter the war?"

result = db.similarity_search(query)
result[0].page_content

'It will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people or with the desire to bring any injury or disadvantage upon them, but only in armed opposition to an irresponsible government which has thrown aside all considerations of humanity and of right and is running amuck. We are, let me say again, the sincere friends of the German people, and shall desire nothing so much as the early reestablishment of intimate relations of mutual advantage between us—however hard it may be for them, for the time being, to believe that this is spoken from our hearts.'
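
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that ranking step in plain Python, under stated assumptions: the three-dimensional vectors, the chunk labels, and the helper names squared_l2 and top_k are all hypothetical, and squared Euclidean distance is used only as an example metric (the metric a real Chroma collection uses depends on its configuration).

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, stored, k=4):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
stored = [
    ("chunk about reasons for war", [0.9, 0.1, 0.0]),
    ("chunk about the German people", [0.7, 0.3, 0.1]),
    ("chunk about finance", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]

ranked = top_k(query_vec, stored, k=2)
for text, distance in ranked:
    print(f"{distance:.2f}  {text}")
```

In the notebook, the number of results can be controlled the same way by passing k to similarity_search, for example db.similarity_search(query, k=2).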

In [14]:
# Save to disk
vectordb = Chroma.from_documents(documents=splits, embedding=embeddings, persist_directory="./chroma_db")

In [15]:
# Load from disk
db2 = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)


In [17]:
query = "What does the speaker believe is the main reason the United States should enter the war?"

result = db2.similarity_search(query)
print(result[0].page_content)

It will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people or with the desire to bring any injury or disadvantage upon them, but only in armed opposition to an irresponsible government which has thrown aside all considerations of humanity and of right and is running amuck. We are, let me say again, the sincere friends of the German people, and shall desire nothing so much as the early reestablishment of intimate relations of mutual advantage between us—however hard it may be for them, for the time being, to believe that this is spoken from our hearts.


# To plug into a RAG pipeline (retriever)

In [18]:
# Retriever option
retriever = vectordb.as_retriever()
retriever.invoke(query)[0].page_content

'It will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people or with the desire to bring any injury or disadvantage upon them, but only in armed opposition to an irresponsible government which has thrown aside all considerations of humanity and of right and is running amuck. We are, let me say again, the sincere friends of the German people, and shall desire nothing so much as the early reestablishment of intimate relations of mutual advantage between us—however hard it may be for them, for the time being, to believe that this is spoken from our hearts.'

To test manually -> db.similarity_search(query)

For RAG with LangChain -> retriever = db.as_retriever(), then retriever.invoke(query)
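
The relationship between the two options can be pictured as a thin wrapper: the retriever holds a reference to the store and exposes its search behind a uniform invoke(query) call, which is the interface a chain expects. The following is a toy sketch of that pattern only; ToyStore and ToyRetriever are hypothetical classes that rank by shared words, and they do not reproduce LangChain's or Chroma's actual code.

```python
class ToyRetriever:
    """Hypothetical wrapper exposing a store's search as invoke(query)."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query):
        # Delegate to the underlying store with the configured k.
        return self.store.similarity_search(query, k=self.k)


class ToyStore:
    """Hypothetical stand-in for a vector store; ranks texts by shared words."""

    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=4):
        words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def as_retriever(self, k=4):
        return ToyRetriever(self, k)


store = ToyStore([
    "the war must be entered",
    "friends of the German people",
    "budget notes",
])
query = "why enter the war"

manual = store.similarity_search(query, k=1)           # testing by hand
via_retriever = store.as_retriever(k=1).invoke(query)  # pipeline-style call
print(manual == via_retriever)
```

Both paths return the same result here because the retriever only delegates. In LangChain itself, the number of results is configured differently from this toy, via as_retriever(search_kwargs={"k": 1}).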