<a href="https://colab.research.google.com/github/mswaringen/Personal-RAG/blob/main/Personal_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Personal RAG

Upload a PDF, ask questions about it, and get answers, all through a web UI!

## Install Packages

In [1]:
!pip install pypdf
!pip install langchain
!pip install faiss-gpu
!pip install openai
!pip install sentence-transformers
!pip install langchain-openai
!pip install gradio



## Configure the OpenAI API

In [2]:
import os
import getpass

# First look for the API key in Colab Secrets; fall back to a prompt.
try:
    from google.colab import userdata
    api_key = userdata.get('OPENAI_API_KEY')
except Exception:  # secret missing, access not granted, or not running in Colab
    api_key = None

if not api_key:
    api_key = getpass.getpass('OpenAI API Key:')

os.environ['OPENAI_API_KEY'] = api_key
print("OpenAI API key configured")

OpenAI API key configured


## Upload Document

In [3]:
!wget https://www.annreports.com/microsoft/microsoft-ar-2022.pdf

--2024-02-20 14:58:34--  https://www.annreports.com/microsoft/microsoft-ar-2022.pdf
Resolving www.annreports.com (www.annreports.com)... 85.10.143.41, 2a01:7c8:bb0e:c1:5054:ff:fe01:f98d
Connecting to www.annreports.com (www.annreports.com)|85.10.143.41|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 877980 (857K) [application/pdf]
Saving to: ‘microsoft-ar-2022.pdf.2’


2024-02-20 14:58:36 (1.14 MB/s) - ‘microsoft-ar-2022.pdf.2’ saved [877980/877980]



In [4]:
from langchain_community.document_loaders import PyPDFLoader

pdf_path = "microsoft-ar-2022.pdf"
pdf_loader = PyPDFLoader(file_path=pdf_path)
pages = pdf_loader.load()

## Text Splitting

[Five Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb)

In [5]:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

text_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)
documents = text_splitter.split_documents(pages)

print(f'You have {len(documents)} document(s) in your data')



You have 332 document(s) in your data


In [6]:
documents[10]

Document(page_content='part i item 1 our products include operating systems, cross - device productivity and collaboration applications, server applications, business solution applications, desktop and server management tools, software development tools, and video games. we also design and sell devices, including pcs, tablets, gaming and entertainment consoles, other intelligent devices, and related accessories. the ambitions that drive us to achieve our vision, our research and development efforts focus on three interconnected ambitions : • reinvent productivity and business processes. • build the intelligent cloud and intelligent edge platform. • create more personal computing. reinvent productivity and business processes at microsoft, we provide technology and resources to help our customers create a secure hybrid work environment. our family of products plays a key role in the ways the world works, learns, and connects. our growth depends on securely delivering continuous innovatio
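
To make the two splitter parameters concrete, here is a minimal sketch of fixed-size token windows. It is not the library's implementation: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for the tokens that the sentence-transformers tokenizer would produce.

```python
def chunk_tokens(tokens, tokens_per_chunk=256, chunk_overlap=0):
    """Split a token list into consecutive windows of at most tokens_per_chunk tokens."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace "tokens" are an assumption made for illustration only.
words = ("word " * 600).split()
chunks = chunk_tokens(words, tokens_per_chunk=256, chunk_overlap=0)
print([len(c) for c in chunks])  # [256, 256, 88]
```

With `chunk_overlap=0` the windows do not share any tokens, so a sentence that straddles a boundary is cut in two. A small overlap trades some duplicated text for a better chance that each chunk contains a complete statement.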

## Vector Embeddings

In [7]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.from_documents(documents, embeddings)

## Similarity Search

In [8]:
query = "What was the total revenue?"
query_embedding = embeddings.embed_query(query) # create vector embedding of query

docs = db.similarity_search_by_vector(query_embedding)
docs_page_content = " ".join([d.page_content for d in docs]) # extract and combine results into one doc

print(docs_page_content)

part ii item 8 item 8. financial st atements and supplement ary da ta income statements ( in millions, except per share amounts ) year ended june 30, 2022 2021 2020 revenue : product $ 72, 732 $ 71, 074 $ 68, 041 service and other 125, 538 97, 014 74, 974 total revenue 198, 270 168, 088 143, 015 cost of revenue : product 19, 064 18, 219 16, 017 service and other 43, 586 34, 013 30, 061 total cost of revenue 62, 650 52, 232 46, 078 gross margin 135, 620 115, 856 96, 937 research and development 24, 512 20, 716 19, 269 sales and marketing 21, 825 20, 117 19, 598 general and administrative 5, 900 5, 107 5, 111 operating income 83, 383 69, 916 52, 959 other income, net 333 1, 186 77 income before income taxes 83, 716 71, 102 53, 036 provision for income taxes 10, 978 9, 831 8, 755 net income $ 72, 738 $ 61, 271 $ 44, 281 earnings per share : basic $ 9. part ii item 8 revenue and costs are generally directly attributed to our segments. however, due to the integrated structure of our busines
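
The search step can be pictured as a brute-force nearest-neighbour lookup: compare the query vector with every stored chunk vector and keep the closest few. The sketch below does this with Euclidean distance in pure Python. The tiny 3-dimensional vectors and the `top_k` helper are made up for illustration; a real FAISS index does the same job far more efficiently on the 384-dimensional vectors that `all-MiniLM-L6-v2` produces.

```python
import math


def top_k(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy vectors, invented for this example.
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]

nearest = top_k(query_vec, doc_vecs, k=2)
print([i for i, _ in nearest])  # [0, 2]
```

The indices returned here play the role of the `docs` list above: they identify which chunks are passed on to the language model.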

## Answer Generation

In [9]:
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

prompt = PromptTemplate(
    input_variables=["question", "docs"],
    template="""
      This bot engages in discussions on a wide range of topics, including cultural, philosophical, and political matters. It analyzes provided articles to inform its responses. Please adhere to the truth. If no resources are available, share your personal opinion.

      Question to be answered: {question}

      Referenced articles for analysis: {docs}

      Instructions for the bot:
      1. Extract and use only factual information from the specified documents.
      2. Highlight key phrases and evidence from the articles to support your answers.
      3. If the articles do not sufficiently cover the topic to provide an informed response, please state, "I don't have enough information to answer this question."

      Remember, the goal is to provide well-informed, accurate, and thoughtful responses based on the available resources. If personal opinion is necessary due to a lack of information, it should be clearly identified as such.
      """,
    )

chain = LLMChain(llm=llm, prompt=prompt)

# LLMChain only needs the prompt's input variables
response = chain.run(question=query, docs=docs_page_content)
response_text = str(response)

print(response_text)

Based on the provided articles, the total revenue for the year ended June 30, 2022, was $198,270 million. This information is stated in the financial statements and supplementary data section of the articles. The breakdown of the revenue is as follows:

- Product revenue: $72,732 million
- Service and other revenue: $125,538 million

It is important to note that the revenue figures are in millions.
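
The cell above joins the retrieved chunks with a single space before handing them to the prompt. One possible variation, sketched here under the assumption that numbered sources make an answer easier to trace, is to label each chunk and cap the total length. `build_context` and `max_chars` are hypothetical names, not part of LangChain.

```python
def build_context(chunks, max_chars=12000):
    """Join text chunks into one numbered context string, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for number, text in enumerate(chunks, start=1):
        entry = f"[{number}] {text.strip()}"
        if used + len(entry) > max_chars:
            break  # adding this chunk would exceed the budget
        parts.append(entry)
        used += len(entry) + 2  # account for the blank line between entries
    return "\n\n".join(parts)


context = build_context([
    "total revenue 198, 270",
    "revenue and costs are attributed to segments",
])
print(context)
```

The resulting string could be passed as the `docs` variable in place of `docs_page_content`.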


## Evaluation

In [10]:
prompt_eval = PromptTemplate(
        input_variables=["answer", "docs"],
        template="""
          Your task is to assess whether the provided response accurately and faithfully reflects the context of a given question or statement.

          Evaluate the following response: {answer}
          Reference article for evaluation: {docs}

          Instructions for the evaluation:
          1. Start your evaluation with a clear "Yes" or "No" to indicate if the response is faithful to the context provided by the reference article.
          2. Provide a detailed reason for your judgment. Mention specific aspects of the response and the article that support your evaluation. Highlight any direct correlations, discrepancies, or notable omissions in the response compared to the factual content of the article.
          3. If the response incorporates elements not found in the article but remains relevant and truthful to the broader topic, please acknowledge this as a factor in your assessment.

          Your evaluation should focus on the accuracy, relevance, and completeness of the response in relation to the information presented in the referenced article. This ensures a thorough and reasoned assessment of the response's faithfulness to the context.
          """,
    )

eval_chain = LLMChain(llm=llm, prompt=prompt_eval)

evals = eval_chain.run(answer=response_text, docs=docs_page_content)
eval_text = str(evals)

print(eval_text)

Yes, the response accurately and faithfully reflects the context of the given question or statement.

The response provides the total revenue for the year ended June 30, 2022, as $198,270 million, which is consistent with the information stated in the financial statements and supplementary data section of the articles. The breakdown of the revenue into product revenue ($72,732 million) and service and other revenue ($125,538 million) is also accurately reflected in the response.

Furthermore, the response correctly mentions that the revenue figures are in millions, which is stated in the article.

Overall, the response accurately presents the revenue figures and their breakdown as provided in the referenced article.
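
Because the evaluation prompt asks the model to open with "Yes" or "No", the free-text verdict can be reduced to a flag for logging or filtering. The helper below is a hypothetical sketch. It returns `None` when the model does not follow the requested format, which can happen.

```python
import re


def parse_verdict(eval_text):
    """Map an evaluation that starts with Yes/No to True/False; return None otherwise."""
    match = re.match(r"\s*(yes|no)\b", eval_text, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


print(parse_verdict("Yes, the response accurately reflects the context."))  # True
print(parse_verdict("No. The figure is not in the article."))               # False
print(parse_verdict("The response is mostly faithful."))                    # None
```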


## Output Results

In [11]:
from pprint import pprint

# Print question, answer, and evaluations
print("\n\n> Question:")
pprint(query)
print("\n> Answer:")
pprint(response)
print("\n> Eval:")
pprint(evals)

# Print the relevant sources used for the answer
print("----------------------------------SOURCE DOCUMENTS---------------------------")
for document in docs:
    print("\n> " + document.metadata["source"])
    pprint(document.page_content[:1000])
print("----------------------------------SOURCE DOCUMENTS---------------------------")




> Question:
'What was the total revenue?'

> Answer:
('Based on the provided articles, the total revenue for the year ended June '
 '30, 2022, was $198,270 million. This information is stated in the financial '
 'statements and supplementary data section of the articles. The breakdown of '
 'the revenue is as follows:\n'
 '\n'
 '- Product revenue: $72,732 million\n'
 '- Service and other revenue: $125,538 million\n'
 '\n'
 'It is important to note that the revenue figures are in millions.')

> Eval:
('Yes, the response accurately and faithfully reflects the context of the '
 'given question or statement.\n'
 '\n'
 'The response provides the total revenue for the year ended June 30, 2022, as '
 '$198,270 million, which is consistent with the information stated in the '
 'financial statements and supplementary data section of the articles. The '
 'breakdown of the revenue into product revenue ($72,732 million) and service '
 'and other revenue ($125,538 million) is also accurately refl

## Web UI

In [12]:
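def process_pdf(pdf_file):
    """Load a PDF, split it into 256-token chunks, and index the chunks in FAISS."""
    # Gradio may pass a file path or a temp-file object, depending on its version.
    pdf_path = getattr(pdf_file, "name", pdf_file)
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    text_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)
    texts = text_splitter.split_documents(pages)
    db = FAISS.from_documents(texts, embeddings)

    return db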


def get_response_from_query(db, query):

    # embed query, find k nearest docs, combine docs
    query_embedding = embeddings.embed_query(query)
    docs = db.similarity_search_by_vector(query_embedding)
    docs_page_content = " ".join([d.page_content for d in docs])

    llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k",temperature=0)

    # generate LLM answer based on similar docs
    prompt = PromptTemplate(
        input_variables=["question", "docs"],
        template="""
      This bot engages in discussions on a wide range of topics, including cultural, philosophical, and political matters. It analyzes provided articles to inform its responses. Please adhere to the truth. If no resources are available, share your personal opinion.

      Question to be answered: {question}

      Referenced articles for analysis: {docs}

      Instructions for the bot:
      1. Extract and use only factual information from the specified documents.
      2. Highlight key phrases and evidence from the articles to support your answers.
      3. If the articles do not sufficiently cover the topic to provide an informed response, please state, "I don't have enough information to answer this question."

      Remember, the goal is to provide well-informed, accurate, and thoughtful responses based on the available resources. If personal opinion is necessary due to a lack of information, it should be clearly identified as such.
      """,
    )

    chain = LLMChain(llm=llm, prompt=prompt)
    # LLMChain only needs the prompt's input variables
    response = chain.run(question=query, docs=docs_page_content)
    r_text = str(response)

    # use LLM to evaluate answer
    prompt_eval = PromptTemplate(
        input_variables=["answer", "docs"],
        template="""
          Your task is to assess whether the provided response accurately and faithfully reflects the context of a given question or statement.

          Evaluate the following response: {answer}
          Reference article for evaluation: {docs}

          Instructions for the evaluation:
          1. Start your evaluation with a clear "Yes" or "No" to indicate if the response is faithful to the context provided by the reference article.
          2. Provide a detailed reason for your judgment. Mention specific aspects of the response and the article that support your evaluation. Highlight any direct correlations, discrepancies, or notable omissions in the response compared to the factual content of the article.
          3. If the response incorporates elements not found in the article but remains relevant and truthful to the broader topic, please acknowledge this as a factor in your assessment.

          Your evaluation should focus on the accuracy, relevance, and completeness of the response in relation to the information presented in the referenced article. This ensures a thorough and reasoned assessment of the response's faithfulness to the context.
          """,
    )

    chain_part_2 = LLMChain(llm=llm, prompt=prompt_eval)
    evals = chain_part_2.run(answer=r_text, docs=docs_page_content)

    return response,docs,evals
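
`get_response_from_query` returns the raw list of retrieved documents, which a text box will most likely display as a Python list representation. A small formatter can make that output easier to read. This is a sketch under assumptions: `format_sources` is a hypothetical helper, and `FakeDoc` only mimics the `page_content` and `metadata` attributes of a LangChain `Document` so that the example runs on its own. The file name and page number are invented.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in with the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(docs, max_chars=300):
    """Render each document as 'source (page N): excerpt', separated by blank lines."""
    lines = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")
        excerpt = doc.page_content[:max_chars]
        lines.append(f"{source} (page {page}): {excerpt}")
    return "\n\n".join(lines)


sample_docs = [FakeDoc("total revenue 198, 270", {"source": "report.pdf", "page": 57})]
print(format_sources(sample_docs))  # report.pdf (page 57): total revenue 198, 270
```

Returning `format_sources(docs)` instead of `docs` would give the Source box one readable entry per retrieved chunk.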



In [14]:
import gradio as gr

def greet(pdf_file, query):
    db = process_pdf(pdf_file)
    answer,sources,evals = get_response_from_query(db,query)
    return answer,sources,evals


demo = gr.Interface(fn=greet,
                    title="Personal-RAG",
                    inputs=[gr.components.File(label="Upload PDF"), "text"],
                    outputs=[gr.components.Textbox(lines=3, label="Response"),
                             gr.components.Textbox(lines=3, label="Source"),
                             gr.components.Textbox(lines=3, label="Evaluation")],
                   )

demo.launch(share=True, debug=True)




Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://be65cc2c4ee72821b5.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://be65cc2c4ee72821b5.gradio.live
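
`greet` rebuilds the FAISS index on every question, even when the PDF has not changed. A possible refinement is to cache the index by file path with `functools.lru_cache`. The sketch below assumes the upload path is a string that stays the same between questions, which may not hold for every Gradio version. `build_index` is a hypothetical stand-in for `process_pdf` so that the example runs without any PDF.

```python
from functools import lru_cache

build_calls = []  # records how often the expensive step actually runs


def build_index(pdf_path):
    """Stand-in for process_pdf; a real version would load, split, and embed the PDF."""
    build_calls.append(pdf_path)
    return {"path": pdf_path}


@lru_cache(maxsize=4)
def cached_index(pdf_path):
    """Build the index once per distinct path and reuse it afterwards."""
    return build_index(pdf_path)


cached_index("report.pdf")
cached_index("report.pdf")  # served from the cache
cached_index("other.pdf")
print(len(build_calls))  # 2
```

`maxsize=4` keeps at most four indexes in memory; the right value depends on how large the documents are and how much RAM the runtime has.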


