## RAG: Retrieval-Augmented Generation
- Large language models (LLMs) have a limited context size.
- TLDR
- Not all context is relevant to a given question
- Query -> Search -> Results -> (LLM) -> Answer
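
The pipeline in the last bullet can be sketched in plain Python. This is a minimal illustration, not the implementation used later in this notebook: `search` is a naive word-overlap ranker, `fake_llm` is a hypothetical stand-in for a real model call, and the passages are made up.

```python
from typing import Callable, List


def search(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by how many query words they share (toy keyword search)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.lower().split())), p) for p in passages]
    # Keep only passages with at least one shared word, best first.
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [p for _, p in hits[:k]]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only reports the prompt size."""
    return f"(model output for a prompt of {len(prompt)} characters)"


def answer(query: str, passages: List[str], llm: Callable[[str], str] = fake_llm) -> str:
    """Query -> Search -> Results -> (LLM) -> Answer."""
    results = search(query, passages)
    prompt = f"Question: {query}\n---\nContext: {' '.join(results)}"
    return llm(prompt)


# Made-up passages for illustration only.
passages = [
    "bug classification predicts whether a change is buggy",
    "the weather was sunny on release day",
    "a classification model is trained on labeled changes",
]
top = search("bug classification", passages)
print(top)
print(answer("bug classification", passages))
```

Only the passages that share words with the query reach the prompt, which keeps the context small and on topic.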

In [1]:
! pip3 install -qU markdownify python-dotenv langchain-upstage langchain-community langchain-text-splitters rank_bm25

In [2]:

%load_ext dotenv
%dotenv
# Expects UPSTAGE_API_KEY to be defined in a local .env file.

In [3]:
import warnings

warnings.filterwarnings("ignore")

In [4]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [5]:
from IPython.display import display, HTML

display(HTML(docs[0].page_content[:1000]))

In [6]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most accurate answer based on the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [7]:
chain.invoke({"question": "What is bug classficiation?", "Context": docs})

'Sure, here is the code to extract the text from the PDF files and store it in a PostgreSQL database:\n```python\nimport os\nimport pdfplumber\nimport psycopg2\n\n# Set up connection to PostgreSQL database\nconn = psycopg2.connect(\n    host="your_host",\n    database="your_database",\n    user="your_user",\n    password="your_password"\n)\n\n# Create a cursor object to execute SQL queries\ncur = conn.cursor()\n\n# Define the table to store the extracted text\ncur.execute("""\n    CREATE TABLE IF NOT EXISTS pdf_text (\n        file_name TEXT,\n        text_content TEXT\n    )\n""")\n\n# Define the directory containing the PDF files\npdf_dir = "/path/to/pdf/directory"\n\n# Loop through each PDF file in the directory\nfor file_name in os.listdir(pdf_dir):\n    if file_name.endswith(".pdf"):\n        # Open the PDF file using pdfplumber\n        with pdfplumber.open(os.path.join(pdf_dir, file_name)) as pdf:\n            # Extract all the text in the PDF file\n            text_content = ""

In [8]:
from langchain_community.retrievers import BM25Retriever
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = text_splitter.split_documents(docs)

retriever = BM25Retriever.from_documents(splits)
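
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only a sketch under stated assumptions: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, which the hypothetical `split_with_overlap` helper below does not attempt.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.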

In [9]:
retriever.invoke("What is bug classficiation?")

[Document(page_content="two changes were made. The<br>function bar was renamed to foo and println has<br>argument “ report.str ” instead of “ report. ” As a result,<br>the annotate output shows lines 1 and 4 as having<br>been most recently modified in revision 2 by “ ejw .”<br>. Revision 3 shows a change, the actual bug fix,<br>changing line 3 from “==” to “ != .”</p><br><p id='98' style='font-size:18px'>The SZZ algorithm then identifies the bug-introducing<br>change associated with the bug fix in revision 3. It starts by<br>computing the delta between revisions 3 and 2, yielding</p><p id='101' style='font-size:16px'>line 3. SZZ then uses the SCM annotate data to determine<br>the initial origin of line 3 at revision 2. This is revision 1, the<br>bug-introducing change.</p><br><p id='102' style='font-size:16px'>One assumption of the presentation so far is that a bug is<br>repaired in a single bug-fix change. What happens when a<br>bug is repaired across multiple commits? There are two<b

In [10]:
query = "What is bug classficiation?"
context_docs = retriever.invoke(query)
chain.invoke({"question": query, "Context": context_docs})

'The information is not present in the context.'
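
To see how a keyword retriever ranks chunks, the sketch below implements Okapi BM25 scoring from scratch with whitespace tokenization. It is illustrative only: the `bm25_scores` helper, the parameter values `k1=1.5` and `b=0.75`, and the toy corpus are assumptions, and `rank_bm25` may differ in details such as IDF smoothing. The point to notice is that a query token found in no chunk, such as a misspelled word, contributes nothing to any score.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for t in tokenized for tok in set(t))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # unseen or misspelled tokens add nothing
            # Smoothed IDF (the "+ 1" inside the log keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up corpus for illustration only.
docs = [
    "bug classification predicts buggy changes",
    "the annotate output shows modified lines",
    "a bug is repaired in a bug fix change",
]
good = bm25_scores("bug classification", docs)
typo = bm25_scores("bug classficiation", docs)  # misspelled token matches nothing
print(good)
print(typo)
```

In this toy corpus the correctly spelled query ranks the first document highest, while the misspelled query is decided by the single term `bug` and ranks the third document first.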

In [11]:
query = "What is bug classficiation?"
context_docs = retriever.invoke("bug")
chain.invoke({"question": query, "Context": context_docs})

'Bug classification refers to the process of predicting whether there is a bug in any of the lines that were changed in one file in one SCM commit transaction. It differs from previous bug prediction work that focuses on finding prediction or regression models to identify fault-prone or buggy modules, files, and functions. Instead, bug classification predicts the presence of bugs in specific code changes. It uses bug-introducing changes, which contain the exact commit/line changes that injected a bug, to label changes as buggy or clean. Additionally, it utilizes features from the source code, such as variable names, method calls, operators, constants, and comment text, to train the classification models.'

# Exercise
Keyword search does not seem to be the best fit for LLM queries: the full question retrieved no useful context, while the single keyword "bug" did. What are some alternatives?