## Ingesting PDF

In [1]:
%pip install -q unstructured langchain langchain-community
%pip install -q "unstructured[all-docs]"

Note: you may need to restart the kernel to use updated packages.


In [1]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

In [2]:
local_path = "../pdf_files/paper_11.pdf"

# Load the local PDF file
if local_path:
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
else:
    print("Upload a PDF file")



In [3]:
# Preview first page
data[0].page_content

'CatBoost: gradient boosting with categorical features support\n\nAnna Veronika Dorogush, Vasily Ershov, Andrey Gulin Yandex\n\nAbstract\n\nIn this paper we present CatBoost, a new open-sourced gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of learning algorithm and a CPU implementation of scoring algorithm, which are significantly faster than other gradient boosting libraries on ensembles of similar sizes.\n\n1 Introduction\n\nGradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For a number of years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and c endencies: web search, recommendation systems, weather forecasting, and many others . It is backe

## Vector Embeddings

In [1]:
!ollama pull nomic-embed-text


In [12]:
!ollama pull mistral


In [13]:
!ollama list

NAME                   	ID          	SIZE  	MODIFIED           
mistral:latest         	f974a74358d6	4.1 GB	About a minute ago	
nomic-embed-text:latest	0a109f422b47	274 MB	15 minutes ago    	


In [5]:
%pip install -q chromadb
%pip install -q langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


In [3]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [4]:
# Split and chunk 
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
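
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking on a toy string. It is a simplified fixed-window splitter, not the algorithm `RecursiveCharacterTextSplitter` uses (that class first tries to split on separators such as paragraph breaks and spaces, so its chunk boundaries will differ). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input (illustrative only): 26 characters, windows of 10 with 3 shared.
sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
```

With `chunk_size=7500` and `chunk_overlap=100` as in the cell above, each chunk can repeat up to 100 characters of its predecessor, so a sentence that straddles a boundary is less likely to be lost to retrieval.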

In [5]:
# Embed the chunks and add them to the vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag",
)

OllamaEmbeddings: 100%|██████████| 4/4 [00:06<00:00,  1.57s/it]
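
As a rough picture of what the vector store does at query time, the sketch below ranks a few hand-made 3-dimensional vectors by cosine similarity to a query vector. Everything here is an assumption for illustration: the vectors are invented (real `nomic-embed-text` embeddings have far more dimensions), `top_k` is a hypothetical helper, and cosine similarity is only one common choice; the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Invented toy "embeddings" (not produced by any real model).
toy_store = [
    ("categorical features", [0.9, 0.1, 0.0]),
    ("GPU training speed", [0.1, 0.9, 0.1]),
    ("weather forecasting", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]
nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```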


## Retrieval

In [6]:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [7]:
# LLM from Ollama
local_model = "mistral"
llm = ChatOllama(model=local_model)

In [8]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

In [9]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
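
The multi-query retriever set up above asks the LLM for several rephrasings of the question (one per line, as `QUERY_PROMPT` requests), runs a search for each, and merges the hits without duplicates. The sketch below mimics that flow in plain Python. The helper names, the keyword-based `toy_retrieve`, and the hard-coded "LLM output" are all hypothetical stand-ins, not LangChain internals.

```python
def parse_query_variants(llm_output: str) -> list[str]:
    """Split newline-separated LLM output into non-empty query strings."""
    return [line.strip() for line in llm_output.splitlines() if line.strip()]


def unique_union(result_lists) -> list[str]:
    """Merge several result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def toy_retrieve(query: str) -> list[str]:
    """Hypothetical stand-in for a vector search: plain keyword lookup."""
    index = {
        "categorical": ["chunk-1", "chunk-2"],
        "overfitting": ["chunk-2", "chunk-3"],
    }
    return [doc for key, docs in index.items() if key in query.lower() for doc in docs]


# Pretend the LLM returned two rephrasings separated by newlines.
fake_llm_output = (
    "How does CatBoost treat categorical features?\n"
    "\n"
    "How does CatBoost limit overfitting?\n"
)
variants = parse_query_variants(fake_llm_output)
merged_docs = unique_union(toy_retrieve(q) for q in variants)
print(merged_docs)
```

Each rephrasing can surface chunks the original wording would miss, which is the limitation of distance-based search that `QUERY_PROMPT` mentions.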

In [10]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
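
For readers new to the `|` syntax, the data flow of the chain above can be approximated with ordinary function calls: the question is sent to the retriever and also passed through unchanged, both values fill the prompt template, and the filled prompt goes to the model. This is a sketch with hypothetical stand-ins (`run_chain`, `toy_retriever`, `toy_format`, `echo_model`); it does not call LangChain or Ollama.

```python
TEMPLATE = (
    "Answer the question based ONLY on the following context:\n"
    "{context}\n"
    "Question: {question}\n"
)


def run_chain(question, retrieve, format_prompt, generate):
    """Fan the question out to the retriever, fill the prompt, call the model."""
    inputs = {"context": retrieve(question), "question": question}
    prompt_text = format_prompt(**inputs)
    return generate(prompt_text)


def toy_retriever(question: str) -> list[str]:
    # Stand-in retriever: always returns the same single snippet.
    return ["CatBoost handles categorical features."]


def toy_format(context: list[str], question: str) -> str:
    # Join retrieved snippets and substitute them into the template.
    return TEMPLATE.format(context="\n".join(context), question=question)


def echo_model(prompt_text: str) -> str:
    # Stand-in "LLM" that returns the prompt it received, for inspection.
    return prompt_text


filled_prompt = run_chain("What is CatBoost?", toy_retriever, toy_format, echo_model)
print(filled_prompt)
```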

In [14]:
chain.invoke(input("Ask a question about the PDF: "))

OllamaEmbeddings: 100%|██████████| 1/1 [00:01<00:00,  1.90s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 13.75it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  8.38it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 14.84it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  6.24it/s]


' This text appears to be describing a machine learning algorithm called CatBoost, which is used for dealing with categorical features in data analysis and machine learning tasks. Categorical features are discrete values that cannot be directly used in binary decision trees due to their non-comparability. The article explains two common methods of handling such features: one-hot encoding and replacing the category with the average label value of examples from the same category.\n   CatBoost uses an efficient strategy for overcoming overfitting when dealing with categorical features by performing a random permutation of the dataset, computing average label values for each example with the same category placed before it in the permutation, and adding a prior value to reduce noise from low-frequency categories. It also discusses the concept of feature combinations, where any combination of several categorical features could be considered as a new one, allowing more detailed analysis while

In [15]:
chain.invoke("How is CatBoost better than the other methods?")

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  7.67it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 12.69it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  4.75it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 11.93it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  7.89it/s]


" CatBoost has several advantages over other methods, especially when dealing with categorical features and ensemble learning. Here are some key points that make it stand out:\n\n1. Efficient handling of categorical features: CatBoost uses a technique called Ordered Target Encoding (OTE) which can handle both numerical and ordinal categorical features without the need for manual one-hot encoding or label binning. This reduces the dimensionality of the data, making it computationally efficient.\n\n2. Overfitting reduction: CatBoost uses a random permutation of the dataset to compute average label values for each category, which reduces overfitting and allows using the whole dataset for training. It also provides an option to add a prior value (average label value or a priori probability) to help reduce noise from low-frequency categories.\n\n3. Ensemble learning: CatBoost is an ensemble learning method that combines weak learners (decision trees) into a strong learner. It uses a novel s

In [None]:
# Delete the "local-rag" collection backing this vector store
vector_db.delete_collection()