# Introduction

In this project, we will develop an application called "Ask the PDF", which utilizes [LangChain](https://www.langchain.com/) and the [OpenAI API](https://openai.com/blog/openai-api) to provide answers based on the content of a PDF document. This application harnesses the capabilities of OpenAI to respond to questions using the data available in the PDF. Its biggest advantage is that it won't [hallucinate](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)) if it doesn't know the answer. Instead, it relies solely on the information contained within the document to formulate its responses.

More documentation on how to ask questions related to the contents of documents can be found [here](https://python.langchain.com/docs/use_cases/question_answering/).

# Configuration

In this section, we will install the necesarry libraries and download an example PDF.

In [1]:
!pip install PyPDF2 -q
!pip install langchain -q
!pip install faiss-cpu -q
!pip install openai -q
!pip install python-dotenv -q
!pip install sentence_transformers -q

In [2]:
import requests

# Download example PDF
URL = "https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_SPM.pdf"
doc_to_download = requests.get(URL)
# Save example PDF locally
pdf_file = open("IPCC_AR6_SYR_SPM.pdf", "wb")
pdf_file.write(doc_to_download.content)

5552060

# Read PDF File

In this section, we will load the downloaded PDF and read its content.

In [3]:
from PyPDF2 import PdfReader

In [4]:
# Load PDF file
pdf_file_obj = open('IPCC_AR6_SYR_SPM.pdf', 'rb')
pdf_reader = PdfReader(pdf_file_obj)

In [5]:
# Store contents of PDF file in new variable 'text'
text = ""
for page in pdf_reader.pages:
  text += page.extract_text()

In [6]:
# Print some of the data stored in 'text'
text[8000:10000]

' the IPCC Sixth Assessment Report (AR6) summarises the state of knowledge of climate change, \nits widespread impacts and risks, and climate change mitigation and adaptation. It integrates the main findings of the Sixth \nAssessment Report (AR6) based on contributions from the three Working Groups1, and the three Special Reports2. The summary \nfor Policymakers (SPM) is structured in three parts: SPM.A Current Status and Trends, SPM.B Future Climate Change, Risks, and \nLong-Term Responses, and SPM.C Responses in the Near Term3. \nThis report recognizes the interdependence of climate, ecosystems and biodiversity, and human societies; the value of diverse \nforms of knowledge; and the close linkages between climate change adaptation, mitigation, ecosystem health, human well-being \nand sustainable development, and reflects the increasing diversity of actors involved in climate action. \nBased on scientific understanding, key findings can be formulated as statements of fact or associate

# Create Chunks

In this section, we will split the content of the document (already stored in the variable "text") into chunks of equal size. These chunks will serve as the context that we provide to the OpenAI API for answering our questions. This segmentation is necessary because OpenAI has a limitation on the input size and cannot process the entire document simultaneously. Furthermore, supplying the entire document to OpenAI for each question would significantly escalate our costs.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [8]:
# Create text splitter object
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 800,
    chunk_overlap = 100,
    length_function = len
)

In [9]:
# Split text into chunks
chunks = text_splitter.split_text(text)

In [10]:
# Number of resulting chunks
len(chunks)

205

In [11]:
# Print a random chunk
chunks[13]

'respectively by 31 January 2021, 1 September 2021 and 11 October 2021.\n2 The three Special Reports are: Global Warming of 1.5°C (2018): an IPCC Special Report on the impacts of global warming of 1.5°C above pre-industrial \nlevels and related global greenhouse gas emission pathways, in the context of strengthening the global response to the threat of climate change, sustainable \ndevelopment, and efforts to eradicate poverty (SR1.5); Climate Change and Land (2019): an IPCC Special Report on climate change, desertification, land \ndegradation, sustainable land management, food security, and greenhouse gas fluxes in terrestrial ecosystems (SRCCL); and The Ocean and Cryosphere in'

# Create Embeddings

In this section, we will generate embeddings from the chunks of text previously created. [Embeddings](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture) serve as vector representations of data, encompassing words, phrases, or documents as numbers. They play a crucial role in NLP tasks, enabling us to input information into the model. Models cannot directly process human-readable text; rather, they rely on embeddings to interpret and work with the data effectively.

In [12]:
from langchain.embeddings import HuggingFaceEmbeddings

Instead of using OpenAI to creathe these embeddings, we will be using an open-source alternative from LangChain.

Thee following are two pre-trained models from the ["sentence-transformers"](https://huggingface.co/sentence-transformers) library, which is a library focused on sentence and text embeddings.

*   'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' - of size 420 MB
*   'sentence-transformers/paraphrase-multilingual-mpnet-base-v2' - of size 970 MB

We will be using the lighter option.



In [13]:
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

In [14]:
# Example: Creating embeddings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
sentence_embeddings = model.encode("Example sentence.")

In [15]:
len(sentence_embeddings)

384

Next, we will create embeddings for the whole document. We will be using [FAISS](https://faiss.ai/index.html) (short for Facebook AI Similarity Search), a library that provides efficient algorithms to quickly search and cluster embedding vectors.

First, we will create these embeddings, and then we will search the document for sections with content related to the question being asked. This will be the context provided to OpenAI to answer the question.

In [16]:
# Creating embeddings for the whole document
from langchain.vectorstores import FAISS

In [17]:
knowledge_base = FAISS.from_texts(chunks, embedding_model)

In [18]:
# Search for similar text in the document
question = "How will continued emissions affect climate change?"
context = knowledge_base.similarity_search(question, top_k=3)

In [19]:
context

[Document(page_content='sectors and communities ( high confidence ). {3.4, 4.2, 4.4, 4.5, 4.7, 4.8 } (Figure SPM.6 )\nC.1.3 Continued emissions will further affect all major climate system components, and many changes will be irreversible on \ncentennial to millennial time scales and become larger with increasing global warming. Without urgent, effective, and \nequitable mitigation and adaptation actions, climate change increasingly threatens ecosystems, biodiversity, and the \nlivelihoods, health and well-being of current and future generations.  (high confidence ) {3.1.3, 3.3.3, 3.4.1, Figure 3.4, \n4.1, 4.2, 4.3, 4.4 } (Figure SPM.1, Figure SPM.6 )25'),
 Document(page_content='reductions.  Targeted reductions of air pollutant emissions lead to more rapid improvements in air quality within years \ncompared to reductions in GHG emissions only, but in the long term, further improvements are projected in scenarios \nthat combine efforts to reduce air pollutants as well as GHG emissions3

# Ask the Document

In this section, we will provide OpenAI with context information and a question to answer based on this context. We will be using the [ChatOpenAI](https://python.langchain.com/docs/integrations/chat/openai) library from LangChain to make these calls.

In [20]:
# Load OpenAI API Key

import os
os.environ['OPENAI_API_KEY'] = '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'

# from dotenv import load_dotenv
# load_dotenv()

In [21]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

In [22]:
llm = ChatOpenAI(model_name='gpt-3.5-turbo')
chain = load_qa_chain(llm, chain_type='stuff')

In [23]:
# Search for similar text in the document
question = "How will continued emissions affect climate change?"
docs = knowledge_base.similarity_search(question, top_k=3)

# Use similar text to give ChatGPT some context
answer = chain.run(input_documents=docs, question=question)
print(f'Answer: {answer}')

Answer: Continued emissions will further affect all major climate system components and lead to irreversible changes on centennial to millennial time scales. These changes will become larger with increasing global warming. Without urgent mitigation and adaptation actions, climate change will increasingly threaten ecosystems, biodiversity, livelihoods, and the health and well-being of current and future generations. Continued emissions will also intensify the global water cycle, including its variability, global monsoon precipitation, and extreme weather events. Additionally, economic damages have already been detected in climate-exposed sectors, such as agriculture, forestry, fishery, energy, and tourism. Adverse effects on human health, livelihoods, and infrastructure, especially in urban areas, have also been observed due to climate change.


Next, let's try to ask a question that goes beyond the information provided in the document.

In [24]:
# Search for similar text in the document
question_out_of_context = "How many oscars did Titanic win?"
docs = knowledge_base.similarity_search(question_out_of_context, top_k=3)

# Use similar text to give ChatGPT some context
answer = chain.run(input_documents=docs, question=question_out_of_context)
print(f'Answer: {answer}')

Answer: I don't know the answer to that question.


# Review Cost

In this section, we will review the costs of the API call.

In [25]:
from langchain.callbacks import get_openai_callback

In [26]:
with get_openai_callback() as cb:
    answer = chain.run(input_documents=docs, question=question)
    print(cb)

Tokens Used: 1000
	Prompt Tokens: 814
	Completion Tokens: 186
Successful Requests: 1
Total Cost (USD): $0.0015929999999999998
