## RAG: Retrieval-Augmented Generation
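- Large language models (LLMs) have a limited context size, so a whole document collection rarely fits in one prompt.
- Not all context is relevant to a given question.
- TL;DR: search for the relevant pieces first, then pass only those to the LLM.
- Query -> Search -> Results -> (LLM) -> Answer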
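The flow in the last bullet can be sketched in a few lines of plain Python. This is only a toy illustration: `DOCUMENTS`, `search`, `build_prompt`, and `fake_llm` are hypothetical stand-ins invented here, not the components used later in this notebook.

```python
# Toy end-to-end flow: Query -> Search -> Results -> (LLM) -> Answer.
# All names and data below are hypothetical stand-ins for illustration.

DOCUMENTS = [
    "Spring and autumn are popular seasons to visit Korea.",
    "BM25 is a keyword-based ranking function.",
    "Embeddings map text to dense vectors.",
]


def search(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many lowercase query words they share."""
    query_words = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(query_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]


def build_prompt(query: str, results: list[str]) -> str:
    """Put only the retrieved results into the prompt, not every document."""
    context = "\n".join(results)
    return f"Question: {query}\n---\nContext: {context}"


def fake_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; it simply echoes the context back."""
    return prompt.split("Context: ", 1)[1]


query = "best seasons to visit Korea"
results = search(query, DOCUMENTS)
answer = fake_llm(build_prompt(query, results))
print(answer)
```

The point of the sketch is the shape of the pipeline: the LLM only ever sees what the search step returned.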

## Keyword vs. Semantic Search
![Vector](https://blog.dataiku.com/hs-fs/hubfs/dftt%202.webp?width=1346&height=632&name=dftt%202.webp)

Source: https://blog.dataiku.com/semantic-search-an-overlooked-nlp-superpower

![Emb_search](figures/emb_search.png)

Source: https://sreent.medium.com/llms-embeddings-and-vector-search-d4bd9362df56
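To make the contrast concrete, here is a minimal sketch comparing word overlap with cosine similarity between vectors. The three-dimensional vectors in `TOY_EMBEDDINGS` are made up by hand for illustration; real embeddings come from a model and have far more dimensions.

```python
import math


def keyword_overlap(query: str, doc: str) -> int:
    """Count the distinct lowercase words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors (an assumption for illustration, not model output).
TOY_EMBEDDINGS = {
    "cheap car": [0.9, 0.1, 0.0],
    "affordable automobile": [0.85, 0.15, 0.05],
    "banana bread recipe": [0.0, 0.2, 0.95],
}

query = "cheap car"
for doc in ("affordable automobile", "banana bread recipe"):
    overlap = keyword_overlap(query, doc)
    similarity = cosine_similarity(TOY_EMBEDDINGS[query], TOY_EMBEDDINGS[doc])
    print(f"{doc!r}: shared words={overlap}, cosine={similarity:.3f}")
```

Both documents share zero words with the query, so keyword search cannot tell them apart, while the vector comparison ranks the paraphrase far above the unrelated text.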

In [11]:
! pip3 install -qU markdownify langchain langchain-community langchain-chroma langchain-upstage rank_bm25

In [12]:

%load_ext dotenv
%dotenv
# UPSTAGE_API_KEY



In [4]:
import warnings

warnings.filterwarnings("ignore")

In [None]:
from langchain_upstage import UpstageEmbeddings

embeddings_model = UpstageEmbeddings()
embeddings = embeddings_model.embed_documents(
    [
        "What is the best season to visit Korea?",
    ])

len(embeddings), len(embeddings[0])

In [None]:
# RAG steps: 1. load the document, 2. split it into chunks, 3. embed and index the chunks, 4. retrieve


In [6]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 2. Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))
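What `chunk_size` and `chunk_overlap` mean can be seen with a much simpler splitter. The function below, `split_with_overlap`, is a hypothetical helper written for this note. It cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and spaces, so treat it as a rough sketch of the idea only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.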

In [None]:
from langchain_chroma import Chroma

# 3. Embed & indexing
vectorstore = Chroma.from_documents(documents=splits, embedding=UpstageEmbeddings())
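Step 4 (retrieve) on a vector index boils down to "find the stored vectors closest to the query vector". The sketch below shows that idea with a tiny in-memory index. `ToyVectorIndex`, its chunk texts, and the two-dimensional vectors are all invented for illustration and say nothing about how Chroma works internally.

```python
import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class ToyVectorIndex:
    """Brute-force index: keeps every vector and scans them all on search."""

    vectors: list[list[float]] = field(default_factory=list)
    texts: list[str] = field(default_factory=list)

    def add(self, vector: list[float], text: str) -> None:
        self.vectors.append(vector)
        self.texts.append(text)

    def search(self, query_vector: list[float], k: int = 2) -> list[tuple[float, str]]:
        """Return the k (score, text) pairs most similar to the query vector."""
        scored = [(cosine(query_vector, v), t) for v, t in zip(self.vectors, self.texts)]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


index = ToyVectorIndex()
# Hand-made vectors standing in for real chunk embeddings (an assumption).
index.add([0.9, 0.1], "chunk about bug prediction")
index.add([0.2, 0.8], "chunk about SCM annotate output")
index.add([0.7, 0.3], "chunk about classifiers")

top = index.search([1.0, 0.0], k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

With the `vectorstore` built above, `vectorstore.as_retriever()` followed by `.invoke(query)`, or `vectorstore.similarity_search(query, k=...)`, should play the same role, with the query embedded by the same `UpstageEmbeddings` model.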

In [9]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most accurate answer based on the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [10]:
chain.invoke({"question": "What is bug classficiation?", "Context": docs})

'Bug classification is a technique used in software development to predict the presence of latent software bugs in file-level changes by using machine learning classification algorithms. The technique involves two steps: training and classification. The change classification algorithms learn from a training set, which is a collection of changes that are known to belong to an existing class (buggy or clean changes). Features are extracted from the changes, and the classification algorithm learns which features are the most useful for discriminating among the various classes. The trained classifier can then classify new changes as buggy or clean. The goal of bug classification is to use a machine learning classifier to predict bugs in changes. The change classification technique is programming language independent since it uses a bag-of-words method for generating features from the source code. The projects that were analyzed span many popular current programming languages, including C/C

In [25]:
from langchain_community.retrievers import BM25Retriever

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = text_splitter.split_documents(docs)

retriever = BM25Retriever.from_documents(splits)

In [26]:
retriever.invoke("What is bug classficiation?")

[Document(page_content="two changes were made. The<br>function bar was renamed to foo and println has<br>argument “ report.str ” instead of “ report. ” As a result,<br>the annotate output shows lines 1 and 4 as having<br>been most recently modified in revision 2 by “ ejw .”<br>. Revision 3 shows a change, the actual bug fix,<br>changing line 3 from “==” to “ != .”</p><br><p id='98' style='font-size:18px'>The SZZ algorithm then identifies the bug-introducing<br>change associated with the bug fix in revision 3. It starts by<br>computing the delta between revisions 3 and 2, yielding</p><p id='101' style='font-size:16px'>line 3. SZZ then uses the SCM annotate data to determine<br>the initial origin of line 3 at revision 2. This is revision 1, the<br>bug-introducing change.</p><br><p id='102' style='font-size:16px'>One assumption of the presentation so far is that a bug is<br>repaired in a single bug-fix change. What happens when a<br>bug is repaired across multiple commits? There are two<b

In [27]:
query = "What is bug classficiation?"
context_docs = retriever.invoke(query)
chain.invoke({"question": query, "Context": context_docs})

'The information is not present in the context.'

In [28]:
query = "What is bug classficiation?"
context_docs = retriever.invoke("bug")
chain.invoke({"question": query, "Context": context_docs})

'Change classification is a process that classifies changes in source code. It is different from previous bug prediction work because it focuses on predicting whether there is a bug in any of the lines that were changed in one file in one SCM commit transaction, rather than making bug predictions at the module, file, or method level. Change classification uses bug-introducing changes, which contain the exact commit/line changes that introduced a bug, and uses features from the source code, such as variables, method calls, operators, constants, and comment text, to train the change classification models.'
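One plausible reason the full question failed while the single word "bug" worked: keyword ranking only rewards exact token matches. The sketch below is a simplified BM25 written for this note (whitespace tokenization, a Lucene-style non-negative IDF), so it is not the exact scoring used by `rank_bm25`, and `DOCS` is made-up toy data. It shows that a misspelled token with punctuation attached, such as `classficiation?`, matches nothing.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a simplified BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in counts:
                continue  # no exact match, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = counts[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


# Made-up toy documents, not text from the paper.
DOCS = [
    "bug classification predicts whether a change is buggy or clean",
    "the function bar was renamed to foo",
    "SZZ identifies the bug-introducing change",
]

typo_scores = bm25_scores("classficiation?", DOCS)
fixed_scores = bm25_scores("classification", DOCS)
print("misspelled:", typo_scores)
print("corrected: ", fixed_scores)
```

The misspelled query scores zero everywhere, so the ranking is left to incidental words like "What" and "is". An embedding-based search is usually more forgiving here, since nearby spellings and paraphrases tend to land close together in vector space.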

# Exercise
It seems keyword search is not the best for LLM queries. What are some alternatives?