<a href="https://colab.research.google.com/github/nrimsky/qa/blob/main/pdf_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
%%capture
!pip install InstructorEmbedding sentence-transformers pypdf faiss-cpu langchain openai

In [9]:
# Standard library
import os

# Third-party
import torch
from InstructorEmbedding import INSTRUCTOR
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.base import Embeddings
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.vectorstores import FAISS


In [None]:
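from getpass import getpass

# getpass hides the key as it is typed, so it is not echoed into the cell output.
os.environ['OPENAI_API_KEY'] = getpass("Paste OpenAI API Key: ")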

In [11]:
model = INSTRUCTOR('hkunlp/instructor-xl')

load INSTRUCTOR_Transformer
max_seq_length  512


In [14]:
device = "cuda" if torch.cuda.is_available() else "cpu"

print("Using device", device)

INDEX_TEXT = "Represent this section of a document for retrieval given a question about the document:"
RETRIEVAL_TEXT = "Represent this question about a document for retrieving relevant sections of the document:"


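def encode_instructor(instruction, sentences):
    # INSTRUCTOR expects [instruction, text] pairs. Run on the selected device
    # and return plain lists of floats rather than a NumPy array.
    pairs = [[instruction, sentence] for sentence in sentences]
    return model.encode(pairs, device=device).tolist()

def get_pdf_documents(url):
    # Load the PDF and split it into chunks that can be embedded.
    loader = PyPDFLoader(url)
    return loader.load_and_split()

class InstructorEmbeddings(Embeddings):
    """LangChain embeddings wrapper around the INSTRUCTOR model."""

    def embed_documents(self, texts):
        # Document chunks are embedded with the indexing instruction.
        return encode_instructor(INDEX_TEXT, texts)

    def embed_query(self, text):
        # Questions are embedded with the retrieval instruction.
        return encode_instructor(RETRIEVAL_TEXT, [text])[0]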

def cli_ask_questions(url):
    documents = get_pdf_documents(url)
    embeddings = InstructorEmbeddings()
    vectorstore = FAISS.from_documents(documents, embeddings)

    chain_type_kwargs = {
        "prompt": ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(
                "You are a helpful assistant that answers questions about documents given some snippets from the document."
            ),
            HumanMessagePromptTemplate.from_template(
                "Some relevant snippets:\n\n"
                "{context}\n\n"
                "Question: {question}\n"
                "Answer:"
            )
        ])
    }

    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model_name='gpt-3.5-turbo'),  # alternatively 'gpt-4'
        chain_type="stuff",
        retriever=vectorstore.as_retriever(),
        chain_type_kwargs=chain_type_kwargs
    )

    while True:
        question = input("Enter your question about the PDF (or 'quit' to stop): ").strip()
        if question.lower() == 'quit':
            break
        if not question:
            continue  # skip empty input instead of sending it to the model
        try:
            print(qa.run(question))
        except Exception as e:
            print("An error occurred while processing your question.")
            print(e)


Using device cuda
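
The retrieval step in the cell above (embed the question, then look up the closest document chunks) can be illustrated with a toy sketch. This is not the notebook's implementation: `toy_embed`, `cosine`, and `top_k` are hypothetical stand-ins for the INSTRUCTOR model and the FAISS index, the bag-of-words "embedding" ignores the instruction string, and cosine similarity is assumed only for illustration (the FAISS store may rank by a different distance).

```python
import math
from collections import Counter

def toy_embed(instruction, text):
    # Hypothetical stand-in for model.encode([[instruction, text]]).
    # A real instruction-tuned embedder conditions on the instruction;
    # this toy version just counts words and ignores it.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(question, chunks, k=2):
    # Embed the question once, score every chunk, keep the k best.
    query_vec = toy_embed("query instruction", question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_vec, toy_embed("index instruction", chunk)),
        reverse=True,
    )
    return ranked[:k]

chunks = [
    "governance audits assess organizational procedures",
    "model audits focus on training data",
    "the weather was sunny",
]
best = top_k("what do model audits focus on", chunks, k=1)
print(best)
```

In the notebook itself, the two instruction strings (`INDEX_TEXT` and `RETRIEVAL_TEXT`) are what let the same model embed chunks and questions differently. The toy version drops that distinction to stay dependency-free.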


In [15]:
url = input("Enter the URL of the PDF you want to ask questions about: ")
cli_ask_questions(url)

Enter the URL of the PDF you want to ask questions about: https://arxiv.org/pdf/2302.08500.pdf
Enter your question about the PDF (or 'quit' to stop): What are some ways we could audit language models?
The article proposes a three-layered approach to auditing large language models (LLMs). The first layer involves governance audits, which assess the organizational procedures, incentive structures, and management systems of technology providers working on LLMs. The second layer involves model audits, which focus on the design and development of LLMs themselves, including the training data, algorithms, and model specifications. The third layer involves application audits, which assess how LLMs are being used in downstream applications and whether they are being used in ways that are ethical, legal, and technically robust. The goal of this approach is to ensure that LLMs are designed and deployed in ways that are responsible and aligned with social and ethical values.


KeyboardInterrupt: ignored
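
To see roughly what the model receives for a question like the one above, here is a minimal sketch of how a "stuff"-style chain fills the human message template: the retrieved snippets are concatenated into `{context}` and the user's question goes into `{question}`. The helper `stuff_prompt` is hypothetical, and the blank-line separator between snippets is an assumption about the chain's default behavior, not something this notebook confirms.

```python
def stuff_prompt(snippets, question):
    # Assumption: snippets are joined with a blank line between them.
    context = "\n\n".join(snippets)
    return (
        "Some relevant snippets:\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = stuff_prompt(
    ["Snippet A.", "Snippet B."],
    "What are some ways we could audit language models?",
)
print(prompt)
```

Because every retrieved snippet is placed in a single prompt, the number and size of chunks returned by the retriever bound how much of the PDF the model can see for any one question.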