# PDF QUERY 
input: pdf
- Save pdf as a temp txt file
  - save each page as a separate txt file for metadata handling
  
- apply embeddings
  - set number of tokens, tokens in each page, overlapping
  
- use a vectorstore to store embeddings
- query using QnA
input: question


In [75]:
# setup envrionment
import sys
sys.path.insert(0, '..')
import os
from constants import keys

# llm
from langchain import OpenAI

# langchain pdf loader and vectorstore
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
# OpenAI embeddings
from langchain.embeddings.openai import OpenAIEmbeddings

# we can use a retreival question and answer api to query the document
from langchain.chains import RetrievalQA
# set API KEYS here
os.environ['openai_api_key'] = keys['openai']

# constants
OPENAI_MODEL =  "gpt-3.5-turbo" #"text-ada-001" #


In [48]:
# use pypdf to load data, split into pages. This will only load text data from each page 
loader = PyPDFLoader("../data/ocbc_net_zero_report.pdf")
splitter = CharacterTextSplitter(chunk_size = 500, chunk_overlap = 100)
pages = loader.load_and_split(text_splitter = splitter)

In [61]:
pages

[Document(page_content='Partnering  \nClients towards  \na Net Zero  \nASEAN and \nGreater China', metadata={'source': '../data/ocbc_net_zero_report.pdf', 'page': 0}),
 Document(page_content='Disclaimer\nThe information and opinions presented in this paper are for information only. This does not constitute an offer or solicitation to buy \nor sell or subscribe for any security or financial instrument or to enter into any transaction or to participate in any particular trading or \ninvestment strategy. OCBC makes no representation or warranty whatsoever as to the quality, accuracy, adequacy, timeliness or \ncompleteness of such information or opinions, and is under no obligation to update or correct any such information or opinions. Any \nopinion or view expressed in any third-party resource or article used in this paper are those of the third-party, and not of OCBC or \nour affiliates, directors or employees. OCBC reserves the right to amend the information and opinions presented in th

In [69]:
for content in pages:
    pagelen = len((content.page_content).split(' '))
    if pagelen > 500:
        print(f"pg: {content.metadata['page']} -- {pagelen}" )

pg: 4 -- 540
pg: 14 -- 555
pg: 17 -- 583
pg: 19 -- 519
pg: 22 -- 601
pg: 24 -- 523
pg: 29 -- 615
pg: 31 -- 511
pg: 33 -- 539
pg: 39 -- 505
pg: 41 -- 526
pg: 60 -- 675
pg: 67 -- 591
pg: 70 -- 572


In [60]:
# now we load all the pages into a vectorstore. 
# faiss was developed by Facebook. Could explore other vectorstores like Chroma 
faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())

In [77]:
# initialize openai model 

llm = OpenAI(model_name=OPENAI_MODEL)  
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=faiss_index.as_retriever())

In [53]:
query = "give me a concise summary of this document. suggest 2 points phrased as potential questions a reader may have for this document."
result = qa.run(query)
print(result)



1. What are the main principles behind Net Zero targets?
2. What are the specific steps that are required in order to achieve Net Zero targets?


In [100]:
result

'\n\n1. What are the main principles behind Net Zero targets?\n2. What are the specific steps that are required in order to achieve Net Zero targets?'

### Extracting more metadata from pdf -- page summary

In [76]:
# can loop through pages to get a concise summary of each page
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import AnalyzeDocumentChain
llm = OpenAI(model_name=OPENAI_MODEL, n=2)  
qa_chain = load_qa_chain(llm, chain_type="map_reduce")
qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)
content_to_summarise = pages[17].page_content
qa_document_chain.run(input_document=content_to_summarise, question="give me a concise summary of this document. suggest 2 key points raised in the document.")
### something fishy going on with OpenAI api right now 




"The document outlines OCBC's approach to reducing greenhouse gas emissions in their portfolio by focusing on specific sub-sectors, evaluating their scope of emissions, and prioritizing emissions reduction in sectors with the biggest impact. Two key points raised are the importance of evaluating Scope 3 emissions and the limitations of setting targets based on data availability."

In [None]:
content_to_summarise

In [None]:
pages[0].metadata['section'] = 'title'

### local model loading

In [None]:
from transformers import AutoTokenizer, LlamaForCausalLM
from torch import cuda, bfloat16

tokenizer = AutoTokenizer.from_pretrained("JosephusCheung/Guanaco")
model = LlamaForCausalLM.from_pretrained("JosephusCheung/Guanaco")

device =  f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
model.to(device)

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    device=device,
    # we pass model parameters here too
    # stopping_criteria=stopping_criteria,  # without this model will ramble
    temperature=0.1,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    top_p=0.15,  # select from top tokens whose probability add up to 15%
    top_k=0,  # select from top 0 tokens (because zero, relies on top_p)
    max_new_tokens=64,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # penalizes repetition in tokens generated 
)

In [70]:

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

template = """Answer the question based on the context below. If the
question cannot be answered using the information provided answer
with "It is unclear".

Context:  {context}

Question: Give a concise summary of this page. Use only 1 sentence.

Answer: """


# template for an instruction with no input
prompt = PromptTemplate(
    input_variables=["context"],
    template=template
)

# llm = HuggingFacePipeline(pipeline=generate_text)
llm = OpenAI(model_name=OPENAI_MODEL, n=2)

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [49]:
print(llm_chain.predict(context = content_to_summarise).lstrip()) # need to tweak the prompt

The OCBC portfolio includes a large proportion of our corporate and commercial banking lending. We seek to address a large proportion of our portfolio; 42% of our corporate and commercial banking banking lending is captured within the scope of our targets. Within each sector, we have focused our targets on specific parts of the sector value chains based on the following considerations:
• In each sector, what is the sub-sectors that are the most critical to decarbonise? Within each sector, we have identified the sub-sectors responsible for the majority of the emissions in that sector. For example, we focused on electricity generation in the Power sector and not on transmission grids, as the bulk of emissions in the Power sector arise from the generation of electricity6. By decarbonising the power generation sub-sector, a vast majority of emissions in the overall Power sector will be removed;
• What do the sector-specific reference pathways seek to measure and address? Within a sector, r

In [94]:
# Evaluation
from langchain.evaluation.qa import QAEvalChain

qa_true = [
    {'query': 'What are the 6 sectors covered in the OCBC Net Zero report?', 
     'answer': 'The six sectors covered in the OCBC Net Zero report are: Shipping, Real Estate, Power, Steel, Aviation, and Oil & Gas.'},
    {'query': 'What reference pathway(s) was/were used in deriving alignment delta in Real Estate Sector within this report?', 
    'answer': 'The Real Estate sector used the Carbon Risk Real Estate Monitor (CRREM) reference pathway to derive alignment delta within this report'},
]

other_questions = [
    {'query': 'Which company has OCBC partnered with to develop a structured loan programme to finance solar power developments?',
     'answer': 'Sembcorp Industries'},
    {}
    
]


predictions = qa.apply(qa_true)


In [95]:
predictions

[{'query': 'What are the 6 sectors covered in the OCBC Net Zero report?',
  'answer': 'The six sectors covered in the OCBC Net Zero report are: Shipping, Real Estate, Power, Steel, Aviation, and Oil & Gas.',
  'result': 'The six sectors covered in the OCBC Net Zero report are Power, Oil and Gas, Real Estate, Steel, Aviation, and Shipping.'},
 {'query': 'What reference pathway(s) was/were used in deriving alignment delta in Real Estate Sector within this report?',
  'answer': 'The Real Estate sector used the Carbon Risk Real Estate Monitor (CRREM) reference pathway to derive alignment delta within this report',
  'result': 'The CRREM reference pathway was used in deriving alignment delta in Real Estate Sector within this report.'}]

In [98]:
# Start your eval chain
eval_chain = QAEvalChain.from_llm(llm)

evaluations = []
for qaset in qa_true:
# Have it grade itself. The code below helps the eval_chain know where the different parts are
    graded_outputs = eval_chain.evaluate([qaset],
                                     predictions,
                                     question_key="query",
                                     prediction_key="result",
                                     answer_key='answer')
    evaluations.append(graded_outputs)

In [99]:
evaluations # need to frame the answers correctly. gpt3.5 is better but evaluation is not accurate :(

[[{'text': 'CORRECT'}], [{'text': 'INCORRECT'}]]