## Ingesting PDF

In [1]:
%pip install --quiet unstructured langchain
%pip install --quiet "unstructured[all-docs]"


Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opentelemetry-proto 1.27.0 requires protobuf<5.0,>=3.19, but you have protobuf 5.28.2 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install langchain-community


Note: you may need to restart the kernel to use updated packages.


In [3]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

In [4]:
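from pathlib import Path

local_path = "monte_cristo_1-merged.pdf"

# Load the local PDF if the file exists (a non-empty path string is always truthy,
# so check the file itself rather than the string)
if Path(local_path).is_file():
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
else:
    print("Upload a PDF file")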



In [5]:
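# Preview the start of the loaded text (with the loader's default mode,
# data[0] holds the whole PDF as a single document, not just the first page)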
data[0].page_content

'The Project Gutenberg eBook of Le comte de Monte-Cristo, Tome I\n\nThis ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.\n\nTitle: Le comte de Monte-Cristo, Tome I\n\nAuthor: Alexandre Dumas\n\nAuguste Maquet\n\nRelease date: March 15, 2006 [eBook #17989] Most recently updated: July 27, 2021\n\nLanguage: French\n\nCredits: Chuck Greif and www.ebooksgratuits.com\n\n*** START OF THE PROJECT GUTENBERG EBOOK LE COMTE DE MONTE-CRISTO, TOME I ***\n\nL E C O M T E D E\n\nM O N T E - C R I S T O\n\nAlexandre Dumas\n\nTome I (1845-1846)\n\nTable des matières\n\nI—Marseille.—L’arrivée. II—Le père et le fils. III—
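The preview shows that the text opens with the Project Gutenberg license header, which would otherwise be chunked and embedded along with the novel. A minimal sketch of trimming it, assuming the start marker appears in the extracted text exactly as in the preview above (`strip_gutenberg_header` and `sample` are hypothetical names, not part of the notebook):

```python
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"


def strip_gutenberg_header(text: str) -> str:
    """Return the text after the start-marker line, or the text unchanged if there is no marker."""
    start = text.find(START_MARKER)
    if start == -1:
        return text
    # Skip to the end of the marker line, then drop leading blank lines.
    line_end = text.find("\n", start)
    if line_end == -1:
        return ""
    return text[line_end + 1:].lstrip()


# Shortened stand-in for data[0].page_content
sample = (
    "The Project Gutenberg eBook of Le comte de Monte-Cristo, Tome I\n\n"
    "Language: French\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK LE COMTE DE MONTE-CRISTO, TOME I ***\n\n"
    "L E C O M T E D E\n\nM O N T E - C R I S T O\n"
)
cleaned = strip_gutenberg_header(sample)
print(cleaned.splitlines()[0])
```

In the notebook this could be applied to `data[0].page_content` before splitting. The matching end-of-book marker is not handled here.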

## Vector Embeddings

In [7]:
# Embedding model, used for the vector store (uncomment if not pulled yet)
#!ollama pull nomic-embed-text

# Chat model, used later for query rewriting and answering
!ollama pull aya

pulling manifest

In [8]:
!ollama list

NAME                        ID              SIZE      MODIFIED          
aya:latest                  7ef8c4942023    4.8 GB    3 seconds ago        
stablelm2:latest            714a6116cffa    982 MB    39 minutes ago       
nomic-embed-text:latest     0a109f422b47    274 MB    About an hour ago    
llama3.1:latest             62757c860e01    4.7 GB    2 months ago         
dolphincoder:latest         677555f1f316    4.2 GB    3 months ago         
llama2-uncensored:latest    44040b922233    3.8 GB    9 months ago         


In [9]:
%pip install --quiet chromadb
%pip install --quiet langchain-text-splitters

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
grpcio-status 1.66.2 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 4.25.5 which is incompatible.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [10]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [11]:
# Split and chunk 
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
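To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-width chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper that only shows how consecutive chunks share characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(toy_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.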

In [12]:
# Add to vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag",
)

OllamaEmbeddings: 100%|██████████| 397/397 [02:13<00:00,  2.97it/s]
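Each of the 397 chunks is now stored next to its embedding vector. At query time the store embeds the question and returns the chunks whose vectors are closest to it. A toy sketch of that idea, assuming cosine similarity and made-up 3-dimensional vectors (the real store may use a different distance metric and an approximate index; `toy_index`, `cosine_similarity`, and `top_k` are illustrative names only):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings
toy_index = [
    ("chunk about the Château d'If", [0.9, 0.1, 0.0]),
    ("chunk about Marseille harbour", [0.1, 0.9, 0.1]),
    ("chunk about the treasure", [0.7, 0.0, 0.6]),
]
best = top_k([1.0, 0.0, 0.2], toy_index, k=2)
print(best)
# → ["chunk about the Château d'If", 'chunk about the treasure']
```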


## Retrieval

In [13]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [14]:
# LLM from Ollama
local_model = "aya"
llm = ChatOllama(model=local_model)

In [15]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant that speaks French fluently. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
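The prompt asks for alternative questions separated by newlines, so the retriever can split the model's reply into lines, search once per line, and merge the results. A rough sketch of that flow with stub functions in place of the LLM and the vector store (`parse_variants`, `multi_query_retrieve`, `fake_generate`, and `fake_search` are hypothetical; this is not `MultiQueryRetriever`'s actual code):

```python
from typing import Callable


def parse_variants(llm_output: str) -> list[str]:
    """Split the model's reply into one question per non-empty line."""
    return [line.strip() for line in llm_output.splitlines() if line.strip()]


def multi_query_retrieve(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Search once per generated variant and return the unique union of the hits."""
    seen = set()
    merged = []
    for variant in parse_variants(generate(question)):
        for doc in search(variant):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def fake_generate(question: str) -> str:
    # Stand-in for the LLM call that rewrites the question
    return "Qui est le héros ?\n\nQuel est le protagoniste ?\n"


def fake_search(query: str) -> list[str]:
    # Stand-in for the similarity search in the vector store
    hits = {
        "Qui est le héros ?": ["chunk A", "chunk B"],
        "Quel est le protagoniste ?": ["chunk B", "chunk C"],
    }
    return hits.get(query, [])


merged_docs = multi_query_retrieve("Qui est le personnage principal ?", fake_generate, fake_search)
print(merged_docs)
# → ['chunk A', 'chunk B', 'chunk C']
```

Each variant triggers its own embedding call, which is why a single `chain.invoke` prints several `OllamaEmbeddings` progress bars.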

In [16]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [17]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
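The chain reads left to right: the question goes to the retriever (as `context`) and passes through unchanged (as `question`), both are filled into the prompt, the prompt goes to the LLM, and the reply is returned as a plain string. The same data flow in plain Python, with stubs instead of the real retriever and model (`run_chain`, `fake_retrieve`, and `fake_llm` are hypothetical; joining the chunks with blank lines is an assumption of this sketch, not necessarily how the real chain formats its context):

```python
from typing import Callable

TEMPLATE = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""


def run_chain(
    question: str,
    retrieve: Callable[[str], list[str]],
    llm: Callable[[str], str],
) -> str:
    """Retrieve context, fill the prompt, call the model, return its text."""
    context = "\n\n".join(retrieve(question))
    prompt_text = TEMPLATE.format(context=context, question=question)
    return str(llm(prompt_text))


seen_prompts = []


def fake_retrieve(question: str) -> list[str]:
    # Stand-in for the multi-query retriever
    return ["chunk one", "chunk two"]


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the chat model; records the prompt it was given
    seen_prompts.append(prompt_text)
    return "stub answer"


answer = run_chain("Qui est Dantès ?", fake_retrieve, fake_llm)
print(seen_prompts[0])
print(answer)
```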

In [None]:
#chain.invoke(input(""))

In [18]:
# The answer is not greatly affected by the misspelling ("summaire") in the prompt below
chain.invoke("Pouvez-vous me donner un summaire de ce livre?")

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.55it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 66.38it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 67.34it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 65.48it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 55.71it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 72.05it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 68.96it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 70.66it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 63.45it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 59.86it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 50.00it/s]


'Voici un suméraire du livre "Le Comte de Monte-Cristo" d\'Alexandre Dumas:\n\nLe livre suit l\'histoire de Edmond Dantès, un jeune marin prometteur qui est trahi par ceux de ses amis et de sa famille qui sont jaloux de son succès et de son bonheur. Il est arrêté à tort et incarcéré dans le château d\'If, une prison isolée sur une île au large de Marseille. Pendant son emprisonnement, il rencontre l\'abbé Faria, un autre prisonnier qui lui apprend plusieurs sujets tels que les mathématiques, la physique, l\'histoire et les langues. Ils forment une amitié forte et l\'abbé Faria révèle à Dantès qu\'il a un trésor caché sur l\'île.\n\nAprès 14 ans d\'emprisonnement, Dantès s\'évade du château d\'If avec l\'aide de l\'abbé Faria et trouve le trésor caché. Il utilise sa richesse pour prendre une nouvelle identité et se faire appeler le Comte de Monte-Cristo. Il retourne à Marseille et commence à mettre en œuvre son plan de vengeance contre ceux qui l\'ont trahi.\n\nDantès découvre que plusi

In [19]:
chain.invoke("Qui est le personnage principal dans ce livre?")

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  8.89it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 66.15it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 68.02it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 61.93it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 52.12it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 42.77it/s]


'Danglars'

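The answer "Danglars" is interesting: he is one of the book's main antagonists, but Edmond Dantès, the protagonist, is the more natural answer in my opinion. Since the chain answers only from the retrieved context, the result depends on which passages the retriever returned for this question.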
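One way to investigate an answer like this is to look at the retrieved passages themselves. A small sketch of a preview helper, assuming each retrieved item exposes a `page_content` string as the loaded documents do (`preview_docs` is a hypothetical helper, and `fake_docs` are made-up stand-ins for real retrieval results):

```python
from types import SimpleNamespace


def preview_docs(docs, width: int = 80) -> list[str]:
    """Return one numbered, whitespace-normalised snippet per document."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        text = " ".join(doc.page_content.split())
        lines.append(f"[{i}] {text[:width]}")
    return lines


# Made-up stand-ins for retrieved documents
fake_docs = [
    SimpleNamespace(page_content="Danglars  était\nle comptable du Pharaon."),
    SimpleNamespace(page_content="Edmond Dantès, second du navire."),
]
snippets = preview_docs(fake_docs, width=30)
for line in snippets:
    print(line)
```

In the notebook, something like `preview_docs(retriever.invoke("Qui est le personnage principal dans ce livre?"))` should show which chunks the model actually saw, assuming the installed LangChain version supports `invoke` on retrievers.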

In [None]:
# Delete all collections in the db
#vector_db.delete_collection()