<a href="https://colab.research.google.com/github/keyom-ai/rag/blob/main/RAG_using_PDF_file.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This notebook demonstrates a RAG (retrieval-augmented generation) use case: answering questions over an external document, in this example Amazon's 2023 annual report

This section installs the required libraries:
- [langchain](https://www.langchain.com/): mainly used for orchestrating and chaining prompts
- duckdb: open-source OLAP database, mainly used for data analysis
- unstructured: provided by [unstructured.io](https://unstructured.io) for working with unstructured data (such as PDFs, documents, images, audio, video, etc.)
- [chromadb](https://docs.trychroma.com/getting-started): open-source vector database
- openai: OpenAI's Python client library, no further explanation needed :)
- tiktoken: OpenAI's [tokenizer](https://github.com/openai/tiktoken), used to count and encode tokens for OpenAI models

---



After these libraries are installed in Colab, the runtime must be restarted. Follow the prompt when Colab asks you to restart it.

---



In [2]:
%pip install langchain duckdb unstructured chromadb openai tiktoken
%pip install "unstructured[pdf]"


Collecting langchain
  Downloading langchain-0.0.344-py3-none-any.whl (1.9 MB)
Collecting unstructured
  Downloading unstructured-0.11.2-py3-none-any.whl (1.7 MB)
Collecting chromadb
  Downloading chromadb-0.4.18-py3-none-any.whl (502 kB)
Collecting openai
  Downloading openai-1.3.7-py3-none-any.whl (221 kB)
Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)

This cell sets the environment variable OPENAI_API_KEY. Replace `sk-abcdefghijklmnopqrstuvwxyz0123456789` with your own OpenAI API key, which you can create on the [API key page](https://platform.openai.com/api-keys).

**A note:** You may need to [add](https://platform.openai.com/account/billing/overview) $10 (minimum) to your API balance if you start getting quota or rate-limit errors (HTTP 429). An error code 401 usually means the API key itself is missing or invalid.

---



In [1]:
%env OPENAI_API_KEY={'sk-abcdefghijklmnopqrstuvwxyz0123456789'}

env: OPENAI_API_KEY=sk-abcdefghijklmnopqrstuvwxyz0123456789
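Because `%env` echoes the value into the cell output, a key typed directly into the notebook can end up saved with it. As a hedged alternative, the sketch below reads the key from the environment and only prompts (with hidden input) when it is missing. `resolve_api_key` and `mask` are hypothetical helper names, not part of any library.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI API key from `env`, prompting only if it is missing.

    Hypothetical helper for illustration; not part of openai or langchain.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so it never appears in cell output.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key


def mask(key, visible=4):
    """Shorten a key so it can be displayed without revealing it."""
    return key[:visible] + "..." if len(key) > visible else "..."
```

In a notebook this could be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key()`, printing only `mask(...)` if you want to confirm that a key was picked up.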


Here we load AMZN's annual report (Form 10-K). You can replace it with any other document you prefer.

The file is stored in the GitHub repository, and you need to make it available wherever you run this code. In Google Colab, upload it manually unless you are cloning the repository.

In [3]:
from langchain.document_loaders.unstructured import UnstructuredFileLoader

loader = UnstructuredFileLoader('amzn-10k-2023.pdf')

documents = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


This step splits the PDF document into chunks so that each chunk can be fed to the embeddings model and then stored in the vector database.

In [4]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
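To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch in plain Python. It is not LangChain's algorithm (`CharacterTextSplitter` splits on a separator and then merges the pieces); `chunk_text` is a hypothetical helper that only illustrates how the two parameters interact.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of stand-in text
no_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(len(no_overlap), len(with_overlap))  # 3 5
```

With `chunk_overlap=0`, as in the cell above, a sentence that straddles a chunk boundary is cut in two. A small overlap repeats the boundary text in both chunks, at the cost of storing and embedding more chunks.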



Here we tell langchain to use the OpenAI Embeddings model. Many other embeddings models exist, such as [Amazon Titan Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/embeddings.html), [Cohere Embeddings](https://docs.cohere.com/docs/multilingual-language-models), and many more via [HuggingFace](https://huggingface.co/models?other=embeddings).

**A note:** You may need to [add](https://platform.openai.com/account/billing/overview) $10 (minimum) to your API balance if you start getting quota or rate-limit errors (HTTP 429). An error code 401 usually means the API key itself is missing or invalid.

In [5]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

Here we embed each chunk and store the resulting embeddings in the Chroma vector database.

In [6]:
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
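Conceptually, a vector database keeps one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query's vector. The toy sketch below shows that idea with the standard library only. Everything in it is an assumption made for illustration: `toy_embed` uses plain word counts instead of a real embeddings model, and `ToyVectorStore` is a hypothetical class, not Chroma's implementation.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase word counts (NOT a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs and ranks them against a query."""

    def __init__(self, texts):
        self.entries = [(t, toy_embed(t)) for t in texts]

    def similarity_search(self, query, k=1):
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "net sales increased during the year",
    "risk factors include competition",
    "operating income and operating expenses",
])
print(store.similarity_search("what were net sales", k=1))
# ['net sales increased during the year']
```

A real embeddings model places texts with similar meaning close together even when they share no words, which word counts cannot do; the ranking step is the part this sketch is meant to show.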

Here we use VectorDBQA, the part of langchain that supports our question-and-answer mode. Note that later langchain releases deprecate VectorDBQA in favor of RetrievalQA.

In [7]:
from langchain.chains import VectorDBQA
from langchain.chat_models import ChatOpenAI

qa = VectorDBQA.from_chain_type(llm=ChatOpenAI(), chain_type="stuff", vectorstore=db, k=1)
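The `chain_type="stuff"` setting means the retrieved chunks are placed ("stuffed") into a single prompt together with the question, and `k` controls how many chunks are retrieved. The sketch below shows the idea only: `build_stuff_prompt` is a hypothetical helper, and the template wording is an assumption, not langchain's actual prompt text.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate retrieved chunks into one prompt (illustrative template only)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder chunks standing in for what the vector store would return.
chunks = ["chunk A text", "chunk B text"]
prompt_k1 = build_stuff_prompt("What is the document about", chunks[:1])
prompt_k2 = build_stuff_prompt("What is the document about", chunks[:2])
```

With `k=1` only the single closest chunk reaches the model, so a figure that lives in a different chunk cannot be answered. Raising `k` sends more context, at the cost of more tokens per request.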



This is where the actual fun starts: asking questions about the document.

In [8]:
query = "What is the document about"
qa.run(query)

'The document is an Annual Report on Form 10-K, which provides information about Amazon.com, Inc. and its subsidiaries. It includes financial information, business strategies, risk factors, and other relevant information about the company.'

In [9]:
query = "What is amount of sales from the document"
qa.run(query)

'According to the document, the amount of net sales for the year ended December 31, 2021, is $469,822 million.'

In [10]:
query = "What is the profit in 2021 from the document?"
qa.run(query)

'The document does not provide information about the profit in 2021. It only mentions the operating income for 2021, which was $24.9 billion. Profit is a different financial metric that takes into account various other factors such as taxes, interest, and non-operating expenses.'

In [11]:
query = "What is the COGS in 2021 from the document?"
qa.run(query)

'The document does not provide the specific figure for COGS (Cost of Goods Sold) in 2021.'