## RAG: Retrieval-Augmented Generation
- Large language models (LLMs) have a limited context size.
- TLDR
- Not all context is relevant to a given question.
- Query -> Search -> Results -> (LLM) -> Answer

## Keyword vs. Semantic Search
![Vector](https://blog.dataiku.com/hs-fs/hubfs/dftt%202.webp?width=1346&height=632&name=dftt%202.webp)

from https://blog.dataiku.com/semantic-search-an-overlooked-nlp-superpower

![Emb_search](figures/emb_search.png)

from https://sreent.medium.com/llms-embeddings-and-vector-search-d4bd9362df56
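The figures above contrast matching on shared words with matching on meaning. As a minimal sketch of the keyword side only, the snippet below scores documents by token overlap. The tokenizer, the `keyword_score` helper, and the sample sentences are all illustrative assumptions, not part of any library. It shows that a paraphrased query sharing no words with the documents scores zero, which is the gap semantic search is meant to close.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(query: str, document: str) -> int:
    """Count the distinct query tokens that also appear in the document."""
    return len(tokenize(query) & tokenize(document))


# Made-up sample documents for illustration.
docs = [
    "Spring is the best season to visit Korea.",
    "Autumn foliage makes October a great time for a trip to Seoul.",
]

exact = "best season to visit Korea"
paraphrase = "when should I travel"

print([keyword_score(exact, d) for d in docs])       # overlap with each document
print([keyword_score(paraphrase, d) for d in docs])  # no shared words at all
```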

In [1]:
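# langchain-chroma, langchain-text-splitters and python-dotenv are used in later cells
! pip3 install -qU markdownify langchain-upstage rank_bm25 langchain-chroma langchain-text-splitters python-dotenv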

In [2]:

%load_ext dotenv
%dotenv
# Loads UPSTAGE_API_KEY from a local .env file into the environment

In [3]:
import warnings

warnings.filterwarnings("ignore")

In [4]:
from langchain_upstage import UpstageEmbeddings

embeddings_model = UpstageEmbeddings()
embeddings = embeddings_model.embed_documents(
    [
        "What is the best season to visit Korea?",
    ])

len(embeddings), len(embeddings[0])

(1, 4096)
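Once text is turned into vectors like the 4096-dimensional one above, one common way to compare them is cosine similarity. The following is a minimal sketch with made-up 3-dimensional vectors standing in for real embeddings. `cosine_similarity` and `top_k` are hypothetical helpers, and the vector store used later may rely on a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors (assumed values, not real embeddings).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.9, 0.1, 0.8],    # nearly the same direction as the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [-1.0, 0.0, -1.0],  # opposite direction
]

print(top_k(query_vec, doc_vecs, k=2))
```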

In [5]:
# RAG steps: 1. load doc (done), 2. split into chunks, 3. embed and index, 4. retrieve


In [6]:
from langchain_upstage import UpstageLayoutAnalysisLoader


# The PDF must exist at this path; otherwise the loader raises FileNotFoundError.
loader = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = loader.load()  # or loader.lazy_load()

FileNotFoundError: File not found: kim-tse-2008.pdf

In [None]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 99
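To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window chunker. It is a simplified stand-in and not how `RecursiveCharacterTextSplitter` works internally, since that splitter also tries to break on structural separators. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.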


In [None]:
from langchain_chroma import Chroma

# 3. Embed and index
vectorstore = Chroma.from_documents(documents=splits, embedding=UpstageEmbeddings())

In [None]:
# 4. Retrieve
retriever = vectorstore.as_retriever()
result_docs = retriever.invoke("What is Bug Classification?")
print(len(result_docs))
print(result_docs[0].page_content[:100])

4
promise for reducing the time required to find<br>software bugs and reducing the time that bugs stay


In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most correct answer from the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [None]:
chain.invoke({"question": "What is bug classification?", "Context": result_docs})

'Bug classification is a technique used in software development to predict the presence of latent software bugs in file-level software changes. It involves using machine learning classification algorithms to assign each change made into one of the two classes: clean changes or buggy changes. The change classification technique involves two steps: training and classification. The change classification algorithms learn from a training set, that is, a collection of changes that are known to belong to an existing class, that is, the changes are labeled with the known class. The trained classifier can classify changes as buggy or clean, with a certain level of accuracy.'
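The chain above receives `result_docs` as a list of document objects, so the prompt is most likely filled with their string representation, metadata included. A minimal sketch of passing only the text instead is shown below. `format_docs` is a hypothetical helper and `FakeDoc` is a stand-in that assumes nothing beyond a `page_content` attribute, so the snippet runs without an API key.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only `page_content` is assumed."""
    page_content: str


def format_docs(docs, separator: str = "\n\n") -> str:
    """Join the text of each retrieved document into one context string."""
    return separator.join(doc.page_content for doc in docs)


sample_docs = [FakeDoc("First chunk."), FakeDoc("Second chunk.")]
context = format_docs(sample_docs)
print(context)
```

In the notebook this could be used as `chain.invoke({"question": ..., "Context": format_docs(result_docs)})`, which keeps the prompt free of object reprs and metadata.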

# Exercise: Hybrid
Keyword search can outperform semantic search for exact terms such as names or identifiers. Design a system that runs both keyword and semantic search, then combines the results. Use the combined results as context for your RAG.
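One possible way to combine the two result lists is reciprocal rank fusion (RRF). The sketch below assumes each retriever returns a ranked list of document IDs. The IDs are made up, `reciprocal_rank_fusion` is a hypothetical helper, and `k=60` is a commonly used default rather than a tuned value. Treat it as a hint for one approach, not the intended solution.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up rankings from a keyword retriever and a semantic retriever.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Because RRF uses only ranks, it avoids having to put keyword scores and embedding similarities on a common scale.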