# PDF QUERY 
Input: a PDF file
- Save the PDF as a temporary txt file
  - save each page as a separate txt file for metadata handling

- Apply embeddings
  - set the number of tokens, the tokens in each page, and the overlap

- Use a vectorstore to store the embeddings
- Query the vectorstore using question answering (QnA)

Input: a question
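
The "number of tokens" and "overlap" settings in the outline above can be illustrated with a small sketch. It splits a token list into fixed-size windows that share a few tokens with their neighbours. The name `chunk_tokens`, its parameters, and the use of whitespace-separated words as stand-in tokens are assumptions for illustration only; the cells below rely on `load_and_split()` for the actual splitting.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split `tokens` into windows of `chunk_size` that share `overlap` tokens.

    Hypothetical helper for illustration; not part of langchain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# words stand in for tokens here; a real tokenizer would count differently
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

A larger overlap keeps more shared context between neighbouring chunks, at the cost of embedding more text overall.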


In [None]:
# set up environment
import sys
sys.path.insert(0, '..')
import os
from constants import keys

# llm
from langchain import OpenAI

# langchain pdf loader and vectorstore
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
# OpenAI embeddings
from langchain.embeddings.openai import OpenAIEmbeddings

# we can use a retrieval question-and-answer API to query the document
from langchain.chains import RetrievalQA
# set API keys here (the OpenAI client reads the upper-case OPENAI_API_KEY variable)
os.environ['OPENAI_API_KEY'] = keys['openai']

# constants
OPENAI_MODEL = "text-ada-001" # "gpt-3.5-turbo"


In [None]:
# use pypdf to load data, split into pages. This will only load text data from each page 
loader = PyPDFLoader("../data/ocbc_net_zero_report.pdf")
pages = loader.load_and_split()

In [None]:
# now we load all the pages into a vectorstore. 
# FAISS was developed by Facebook. Could explore other vectorstores, but this will do for now
faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
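
What the vectorstore does at query time can be sketched without any external library: embed the query, score it against every stored vector, and return the closest entries. The toy below uses hand-written 2-dimensional vectors and cosine similarity. Both are assumptions for illustration; real embeddings have many more dimensions, and the FAISS index may use a different distance metric. `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k stored vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# made-up vectors standing in for page embeddings
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```

This brute-force scan is linear in the number of stored vectors, which is fine for a single report.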

In [None]:
# initialize openai model 

llm = OpenAI(model_name=OPENAI_MODEL, n=2)  
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=faiss_index.as_retriever())
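
The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. A rough sketch of that idea follows. The prompt wording and the name `build_stuff_prompt` are hypothetical and are not the template langchain actually uses.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all retrieved texts into one prompt (hypothetical wording)."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# placeholder strings standing in for retrieved page texts
retrieved = ["text of retrieved page A", "text of retrieved page B"]
print(build_stuff_prompt(retrieved, "What does the report cover?"))
```

Because everything goes into one prompt, this approach only works while the retrieved text fits in the model's context window; `"map_reduce"`, used later in this notebook, is one alternative for longer inputs.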

In [None]:
query = "give me a concise summary of this document. suggest 2 points phrased as potential questions a reader may have for this document."
result = qa.run(query)
print(result)

In [None]:
pages[0].metadata['section'] = 'title'

### Extracting more metadata from pdf -- page summary

In [None]:
# can loop through pages to get a concise summary of each page
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import AnalyzeDocumentChain
llm = OpenAI(model_name=OPENAI_MODEL, n=2)  
qa_chain = load_qa_chain(llm, chain_type="map_reduce")
qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)
content_to_summarise = pages[17].page_content
qa_document_chain.run(input_document=content_to_summarise, question="give me a concise summary of this document. suggest 2 key points raised in the document.")
### something fishy going on with OpenAI api right now 
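
The loop mentioned in the comment above (one concise summary per page, kept as metadata) could look roughly like the sketch below. To keep it runnable without an API key, it uses a stand-in `Page` class and a placeholder summariser that returns the first sentence. `Page`, `first_sentence`, and `add_page_summaries` are all hypothetical names; in the notebook the summariser would be a call to an LLM chain instead.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Minimal stand-in for a loaded page: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def first_sentence(text):
    """Placeholder summariser: return the text up to the first full stop."""
    head, sep, _ = text.partition(".")
    return (head + sep).strip()


def add_page_summaries(pages, summarise):
    """Store a summary of each page under metadata['summary']."""
    for page in pages:
        page.metadata["summary"] = summarise(page.page_content)
    return pages


# placeholder page texts for illustration
demo_pages = [Page("First point. More detail follows."), Page("No full stop here")]
for page in add_page_summaries(demo_pages, first_sentence):
    print(page.metadata["summary"])
```

Passing the summariser in as a function makes it easy to swap the placeholder for a real model call later.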


In [None]:
content_to_summarise

### local model loading

In [None]:
import transformers  # needed for transformers.pipeline in the next cell
from transformers import AutoTokenizer, LlamaForCausalLM
from torch import cuda, bfloat16

tokenizer = AutoTokenizer.from_pretrained("JosephusCheung/Guanaco")
model = LlamaForCausalLM.from_pretrained("JosephusCheung/Guanaco")

device =  f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
model.to(device)

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    device=device,
    # we pass model parameters here too
    # stopping_criteria=stopping_criteria,  # without this model will ramble
    temperature=0.1,  # 'randomness' of outputs; must be > 0, lower is more deterministic
    top_p=0.15,  # select from top tokens whose probabilities add up to 15%
    top_k=0,  # 0 disables top-k filtering, so sampling relies on top_p
    max_new_tokens=64,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # penalizes repetition in tokens generated 
)
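
The `top_p` comment above describes nucleus filtering: keep the most probable tokens until their cumulative probability reaches the threshold. A simplified sketch over a hand-made distribution is shown below. The function name `nucleus` and the example probabilities are assumptions for illustration; the transformers implementation operates on logits and may differ in detail.

```python
def nucleus(probs, top_p):
    """Return the smallest set of most-probable tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break  # enough probability mass collected
    return kept


# made-up next-token distribution
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
print(nucleus(probs, top_p=0.15))
print(nucleus(probs, top_p=0.7))
```

With a threshold as low as 0.15, a single confident token is often enough to fill the nucleus, which makes generation close to greedy.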

In [48]:

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

template = """Answer the question based on the context below. If the
question cannot be answered using the information provided answer
with "It is unclear".

Context:  {context}

Question: Give a concise summary of this page. Use only 1 sentence.

Answer: """


# prompt template with a single input variable: the page text passed in as context
prompt = PromptTemplate(
    input_variables=["context"],
    template=template
)

# llm = HuggingFacePipeline(pipeline=generate_text)
llm = OpenAI(model_name=OPENAI_MODEL, n=2)

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [49]:
print(llm_chain.predict(context=content_to_summarise).lstrip())  # need to tweak the prompt

The OCBC portfolio includes a large proportion of our corporate and commercial banking lending. We seek to address a large proportion of our portfolio; 42% of our corporate and commercial banking banking lending is captured within the scope of our targets. Within each sector, we have focused our targets on specific parts of the sector value chains based on the following considerations:
• In each sector, what is the sub-sectors that are the most critical to decarbonise? Within each sector, we have identified the sub-sectors responsible for the majority of the emissions in that sector. For example, we focused on electricity generation in the Power sector and not on transmission grids, as the bulk of emissions in the Power sector arise from the generation of electricity6. By decarbonising the power generation sub-sector, a vast majority of emissions in the overall Power sector will be removed;
• What do the sector-specific reference pathways seek to measure and address? Within a sector, r