# Run First

In [2]:
from IPython.display import Markdown

def display_md(content):
  display(Markdown(content))

# Naive Indexer Architecture
In this course, we index a small number of HTML files. In production, however, you're likely to run into factors that significantly increase the complexity of accurate indexing:
- PDFs contain much of the world's unstructured data
- Parsing PDFs with vision can require layout understanding, which is not a generally solved problem
- PDFs often contain tables, graphics, footnotes, equations, etc. that require special handling
- Many business cases require indexing highly heterogeneous document layouts

Because accurate indexing is the first stage of your inference pipeline, it is often one of the most consequential engineering problems to get right.

## loaders.py
Look at `./workshop_code/indexer_components/loaders.py`

The function here is pretty simple. It downloads a file at a specified URI, then saves it to a cache so it doesn't need to be downloaded on each subsequent run in the notebook. The preprocessor will consume this file in the next step. 
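
As a rough illustration of the download-then-cache idea described above, here is a minimal sketch. It is not the workshop's `DocLoader`: the function name `load_html_cached`, the `.cache` directory, and the hashed file naming are all assumptions made for this example.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def _download_text(uri):
    """Fetch a URI and decode the body as UTF-8 text."""
    with urlopen(uri) as response:
        return response.read().decode("utf-8", errors="replace")


def load_html_cached(uri, cache_dir=".cache", fetch=_download_text):
    """Return the document at `uri`, downloading it only on the first call.

    Hypothetical helper: the cache file name is the SHA-256 hash of the URI,
    so any URI maps to a safe, stable file name.
    """
    file_name = hashlib.sha256(uri.encode("utf-8")).hexdigest() + ".html"
    cache_path = Path(cache_dir) / file_name

    if cache_path.exists():
        # Cache hit: skip the network entirely.
        return cache_path.read_text(encoding="utf-8")

    # Cache miss: download, store, and return the content.
    html = fetch(uri)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(html, encoding="utf-8")
    return html
```

The `fetch` parameter exists only so the download step can be swapped out, for example in a test that should not touch the network.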

### Task: Read the Code
This part isn't interesting, so just look at the code and understand what it does. If you have any questions, let one of us know.

Note that the proper way to do caching is by using the HTTP response's `ETag`, `Last-Modified`, and `Cache-Control` headers, but I didn't do that here. If you want extra credit, feel free to send me a pull request with the corrected code.
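
For anyone attempting the extra credit, the core of header-based revalidation can be sketched as follows. The helper names and the shape of the `cached_meta` dictionary are assumptions for this example; the header names and the `304 Not Modified` status come from standard HTTP caching.

```python
def conditional_headers(cached_meta):
    """Build revalidation headers from metadata saved with a cached response.

    `cached_meta` is a hypothetical dict that may hold the `etag` and
    `last_modified` values copied from the earlier response's headers.
    """
    headers = {}
    if cached_meta.get("etag"):
        # Ask the server to send a body only if the ETag has changed.
        headers["If-None-Match"] = cached_meta["etag"]
    if cached_meta.get("last_modified"):
        # Fallback validator based on the Last-Modified timestamp.
        headers["If-Modified-Since"] = cached_meta["last_modified"]
    return headers


def cache_is_still_valid(status_code):
    """A 304 Not Modified reply means the cached copy can be reused."""
    return status_code == 304
```

A loader would send these headers with its request and rewrite the cache only when the server answers with a fresh `200` response. Honoring `Cache-Control` (for example `max-age`) would be a separate check made before any request is sent.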

## preprocessors.py
Look at `./workshop_code/indexer_components/preprocessors.py`

To reduce inference costs, we strip away all of the HTML syntax except the human-readable body text of the documents. This is a design decision. In real life, you may, for example, preserve more of the HTML markup to retain the context of how the document is structured.
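
To make the idea of stripping markup concrete, here is a dependency-free sketch built on the standard library's `html.parser` rather than Beautiful Soup, which the tasks below use. The class name, the helper name, and the set of skipped tags are assumptions for this example, not the workshop's preprocessor.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects human-readable text and ignores non-content elements."""

    # Hypothetical choice of elements whose text should never be kept.
    SKIPPED_TAGS = {"head", "script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document, one fragment per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

Note how much structure this throws away: headings, list nesting, and emphasis all collapse into plain lines, which is exactly the trade-off described above.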

### Easy Task: Configure Beautiful Soup for A Simply Structured Blog Post

The unprocessed HTML blog post looks like this:

In [2]:
from workshop_code.indexer_components.loaders       import DocLoader

blog_post_uri = "https://lilianweng.github.io/posts/2023-06-23-agent/"
doc_content = DocLoader.load_html(blog_post_uri)
display_md(doc_content[3400:3800])

tle" content="LLM Powered Autonomous Agents" />
<meta property="og:description" content="Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be frame

The processed HTML blog post should look like this:

In [3]:
from cheat_code.indexer_components.loaders       import DocLoader
from cheat_code.indexer_components.preprocessors import GithubBlogpostPreprocessor

blog_post_uri = "https://lilianweng.github.io/posts/2023-06-23-agent/"
blog_post_html = DocLoader.load_html(blog_post_uri)
preprocessor = GithubBlogpostPreprocessor()
cleaned_doc_content = preprocessor.get_text(blog_post_html)

display_md(cleaned_doc_content[0:1000])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory

Short-term memory: I 

Complete the method `GithubBlogpostPreprocessor.get_text()` in `./workshop_code/indexer_components/preprocessors.py` such that the `cleaned_text` looks like the above output.

In [3]:
from workshop_code.indexer_components.loaders       import DocLoader
from workshop_code.indexer_components.preprocessors import GithubBlogpostPreprocessor

blog_post_uri = "https://lilianweng.github.io/posts/2023-06-23-agent/"
blog_post_html = DocLoader.load_html(blog_post_uri)
preprocessor = GithubBlogpostPreprocessor()
cleaned_doc_content = preprocessor.get_text(blog_post_html)

display_md(cleaned_doc_content[0:1000])


      LLM Powered Autonomous Agents
    Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory

Short-term memory: I wou

### Medium Task: Copy or Write a Preprocessor for the RAG Survey Paper
The HTML structure of the arXiv paper is more complex than that of the blog post. You can try implementing some of `ArxivHtmlPaperPreprocessor` to see for yourself, but I suggest just copying the cheat code.

Here is the working implementation:

In [5]:
from cheat_code.indexer_components.loaders       import DocLoader
from cheat_code.indexer_components.preprocessors import ArxivHtmlPaperPreprocessor

rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
rag_survey_paper_html = DocLoader.load_html(rag_survey_paper_uri)
preprocessor = ArxivHtmlPaperPreprocessor()
cleaned_doc_content = preprocessor.get_text(rag_survey_paper_html)

display_md(cleaned_doc_content[0:2000])

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University


Yun Xiong: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University


Xinyu Gao: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University


Kangxiang Jia: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University


Jinliu Pan: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University


Yuxi Bi: College of Design and Innovation, Tongji University


Yi Dai: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University


Jiawei Sun: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University


Meng Wang: College of Design and Innovation, Tongji University


Haofen Wang: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University College of Design and Innovation, Tongji University



Abstract

Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the generation, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG synergistically merges LLMs’ intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the tripartite foundation of RAG frameworks, which includes the retrieval, the generation and the augmentation techniques. The paper highlights the s

And here is the implementation for you to complete:

In [4]:
from workshop_code.indexer_components.loaders       import DocLoader
from workshop_code.indexer_components.preprocessors import ArxivHtmlPaperPreprocessor

rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
rag_survey_paper_html = DocLoader.load_html(rag_survey_paper_uri)
preprocessor = ArxivHtmlPaperPreprocessor()
cleaned_doc_content = preprocessor.get_text(rag_survey_paper_html)

display_md(cleaned_doc_content[0:2000])

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University


Yun Xiong: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University


Xinyu Gao: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University


Kangxiang Jia: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University


Jinliu Pan: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University


Yuxi Bi: College of Design and Innovation, Tongji University


Yi Dai: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University


Jiawei Sun: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University


Meng Wang: College of Design and Innovation, Tongji University


Haofen Wang: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University College of Design and Innovation, Tongji University




Abstract
Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the generation, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG synergistically merges LLMs’ intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the tripartite foundation of RAG frameworks, which includes the retrieval, the generation and the augmentation techniques. The paper highlights the s

### Hard Task, not recommended for today: Write a Preprocessor for the PDF Version of the RAG Survey Paper
In production applications, you're likely to need to run inference on PDFs. Today this is often a non-trivial task. The most popular open-source OCR engine is Tesseract. However, Tesseract often underperforms computer vision-based services from vendors like Google Cloud and AWS.

## text_splitters.py
Because LLM context windows are limited, semantic indexing strategies rely on text splitting. In this tutorial, we use the most naive strategy, character text splitting. To find inspiration or source code for more strategies, I look at LlamaIndex and LangChain. However, in some production situations, it will make sense to write a text splitter specific to your needs.
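
As a point of reference, a fixed-size splitter with overlap can be sketched in a few lines. This is an illustrative sliding window over characters, not necessarily how the workshop's `SimpleCharacterTextSplitter` measures its chunks; the function name is an assumption, and the parameter names simply mirror the `CHUNK_SIZE` and `OVERLAP_SIZE` constants used in the cells below.

```python
def split_by_characters(text, chunk_size, overlap_size):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap_size` characters, so a sentence cut at
    a chunk boundary still appears partly in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


print(split_by_characters("abcdefghij", chunk_size=4, overlap_size=1))
# ['abcd', 'defg', 'ghij']
```

The overlap is a cheap hedge against splitting in the middle of an idea. Smarter splitters cut on sentence or section boundaries instead.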

### Text Splitting Task #1: examine and copy the code for the text splitter
Here is the working implementation:

In [7]:
from cheat_code.indexer_components.loaders        import DocLoader
from cheat_code.indexer_components.preprocessors  import ArxivHtmlPaperPreprocessor
from cheat_code.indexer_components.text_splitters import SimpleCharacterTextSplitter

CHUNK_SIZE = 250
OVERLAP_SIZE = 25
rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
preprocessor = ArxivHtmlPaperPreprocessor()
text_splitter = SimpleCharacterTextSplitter(CHUNK_SIZE, OVERLAP_SIZE)

rag_survey_paper_html = DocLoader.load_html(rag_survey_paper_uri)
cleaned_doc_content = preprocessor.get_text(rag_survey_paper_html)
text_splits = text_splitter.split_text(cleaned_doc_content)

display_md(text_splits[3])

approaches in their respective contexts, and speculate on upcoming trends and innovations. Our contributions are as follows: In this survey, we present a thorough and systematic review of the state-of-the-art RAG methods, delineating its evolution through paradigms including naive RAG, advanced RAG, and modular RAG. This review contextualizes the broader scope of RAG research within the landscape of LLMs. We identify and discuss the central technologies integral to the RAG process, specifically focusing on the aspects of “Retrieval”, “Generation” and “Augmentation”, and delve into their synergies, elucidating how these components intricately collaborate to form a cohesive and effective RAG framework. We have summarized the current assessment methods of RAG, covering 26 tasks, nearly 50 datasets, outlining the evaluation objectives and metrics, as well as the current evaluation benchmarks and tools. Additionally, we anticipate future directions for RAG, emphasizing potential enhancements to tackle current challenges. The paper unfolds as follows: SectionIIintroduces the main concept and current paradigms of RAG. The following three sections explore core components—“Retrieval”, “Generation” and “Augmentation”, respectively. SectionIIIfocuses on optimization methods in retrieval,including indexing, query and embedding optimization. SectionIVconcentrates on post-retrieval process and LLM fine-tuning in generation. SectionVanalyzes the three augmentation processes. SectionVIfocuses on RAG’s downstream tasks and evaluation system. SectionVIImainly discusses the challenges that RAG currently faces and its future development directions. At last, the paper concludes in SectionVIII. IIOverview of RAG A typical application of RAG is illustrated in Figure2. Here, a user poses a question to ChatGPT about a recent, widely discussed news. Given ChatGPT’s reliance on pre-training data, it initially lacks the capacity to provide updates on recent developments. 
RAG bridges this information gap by sourcing and incorporating knowledge

Copy the code from `cheat_code/indexer_components/text_splitters.py` to `workshop_code/indexer_components/text_splitters.py` so that the code below works:

In [5]:
from workshop_code.indexer_components.loaders        import DocLoader
from workshop_code.indexer_components.preprocessors  import ArxivHtmlPaperPreprocessor
from workshop_code.indexer_components.text_splitters import SimpleCharacterTextSplitter

CHUNK_SIZE = 250
OVERLAP_SIZE = 25
rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
preprocessor = ArxivHtmlPaperPreprocessor()
text_splitter = SimpleCharacterTextSplitter(CHUNK_SIZE, OVERLAP_SIZE)

rag_survey_paper_html = DocLoader.load_html(rag_survey_paper_uri)
cleaned_doc_content = preprocessor.get_text(rag_survey_paper_html)
text_splits = text_splitter.split_text(cleaned_doc_content)

display_md(text_splits[3])

techniques, assess the strengths and weaknesses of various approaches in their respective contexts, and speculate on upcoming trends and innovations. Our contributions are as follows: • In this survey, we present a thorough and systematic review of the state-of-the-art RAG methods, delineating its evolution through paradigms including naive RAG, advanced RAG, and modular RAG. This review contextualizes the broader scope of RAG research within the landscape of LLMs. • We identify and discuss the central technologies integral to the RAG process, specifically focusing on the aspects of “Retrieval”, “Generation” and “Augmentation”, and delve into their synergies, elucidating how these components intricately collaborate to form a cohesive and effective RAG framework. • We have summarized the current assessment methods of RAG, covering 26 tasks, nearly 50 datasets, outlining the evaluation objectives and metrics, as well as the current evaluation benchmarks and tools. Additionally, we anticipate future directions for RAG, emphasizing potential enhancements to tackle current challenges. The paper unfolds as follows: Section II introduces the main concept and current paradigms of RAG. The following three sections explore core components—“Retrieval”, “Generation” and “Augmentation”, respectively. Section III focuses on optimization methods in retrieval,including indexing, query and embedding optimization. Section IV concentrates on post-retrieval process and LLM fine-tuning in generation. Section V analyzes the three augmentation processes. Section VI focuses on RAG’s downstream tasks and evaluation system. Section VII mainly discusses the challenges that RAG currently faces and its future development directions. At last, the paper concludes in Section VIII. A typical application of RAG is illustrated in Figure 2. Here, a user poses a question to ChatGPT about a recent, widely discussed news. Given ChatGPT’s reliance on pre-training

### Text Splitting Task #2: look at alternative text splitters
Make a mental note of the other text splitters available here:
- [LangChain: Text Splitters](https://python.langchain.com/v0.2/docs/how_to/#text-splitters)
- [LlamaIndex: Text Splitters](https://medium.com/@bavalpreetsinghh/llamaindex-chunking-strategies-for-large-language-models-part-1-ded1218cfd30)

## Embeddings: vectorizers.py
Embedding text is currently the most common way to prepare human-readable passages so that they can be compared to each other for relatedness. OpenAI's embedding models currently rank among the highest performing, so we use theirs.

OpenAI's text embedding models take up to 8191 tokens as input and convert them to a vector of dimension 1536 for `text-embedding-3-small` or 3072 for `text-embedding-3-large`.
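
Relatedness between two embeddings is commonly scored with cosine similarity, the cosine of the angle between the two vectors. The sketch below is a plain-Python illustration of that measure; the function name is an assumption, and production code would normally use a vectorized library or let the vector database compute the score.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")

    return dot / (norm_a * norm_b)
```

Scores near 1 mean the vectors point in nearly the same direction, scores near 0 mean they are close to orthogonal, and negative scores mean they point in opposing directions.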

### Embeddings Task #1: Use OpenAI's Embeddings API
The embeddings code for your naive RAG pipeline should behave like this:

In [9]:
from cheat_code.common_components.vectorizers import Vectorizer

example_text_splits = ["Mary had a", "little lamb"]
vectorizer = Vectorizer()
embeddings_of_example_splits = vectorizer.vectorize_text_splits(example_text_splits)

rows = len(embeddings_of_example_splits)
columns = len(embeddings_of_example_splits[0])

print(f"Dimensions: {rows}x{columns}")
print(embeddings_of_example_splits[0:10])

Dimensions: 2x1536
[[0.02242061123251915, 0.019083090126514435, -0.022541731595993042, 0.045567940920591354, -0.04056165739893913, -0.011593852192163467, 0.02302621118724346, 0.01791226491332054, -0.003990223165601492, -0.025394774973392487, 0.03854299709200859, 0.0031995801255106926, 0.03248700872063637, 0.05652255192399025, 0.008478382602334023, 0.005574873182922602, 0.033752039074897766, -0.015772482380270958, -0.0118428198620677, 0.03924280032515526, 0.04879780113697052, 0.024048998951911926, 0.04877088591456413, -0.014278672635555267, 0.016741441562771797, -0.01806030049920082, 0.005097122862935066, -0.04443749040365219, 0.017966097220778465, 0.00755652692168951, 0.0397811084985733, -0.027359606698155403, 0.048071082681417465, -0.009225287474691868, -0.0047775013372302055, 0.045944757759571075, 0.03741254657506943, 0.017024053260684013, -0.0015947434585541487, 0.01055760495364666, -0.012616640888154507, 0.01612238399684429, 0.043064799159765244, -0.00729410070925951, 0.00685672368

Implement `vectorize_text_splits()` in `workshop_code/common_components/vectorizers.py` by referencing the [OpenAI embedding API's documentation](https://platform.openai.com/docs/api-reference/embeddings).

In [5]:
from workshop_code.common_components.vectorizers import Vectorizer

example_text_splits = ["Mary had a", "little lamb"]
vectorizer = Vectorizer()
embeddings_of_example_splits = vectorizer.vectorize_text_splits(example_text_splits)

rows = len(embeddings_of_example_splits)
columns = len(embeddings_of_example_splits[0])

# print(f"Dimensions: {rows}x{columns}")
# print(embeddings_of_example_splits[0:10])
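
If you get stuck, the general shape of an embeddings call looks roughly like the sketch below. It assumes the v1-style `openai` Python client, where `client.embeddings.create(...)` returns a response whose `data` items carry an `index` and an `embedding`. The function name `embed_text_splits` is hypothetical and is not the workshop's `Vectorizer` method.

```python
def embed_text_splits(client, text_splits, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per input string, in order.

    `client` is assumed to be an `openai.OpenAI()` instance or any object
    that exposes the same `embeddings.create` interface.
    """
    if not text_splits:
        return []

    # One request can carry a whole batch of strings.
    response = client.embeddings.create(model=model, input=text_splits)

    # Sort by each item's `index` so output order matches input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage (assumes the `openai` package is installed and OPENAI_API_KEY is set):
#   from openai import OpenAI
#   vectors = embed_text_splits(OpenAI(), ["Mary had a", "little lamb"])
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object, without spending API credits.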

## Vector Database: wcs_client_adapter.py
In production settings, you're likely to store your vectors in a database. In this tutorial, we are using Weaviate, which is abstracted away by `wcs_client_adapter.py`. However, there are many options available. Pinecone has been the most popular startup vector database provider, but popular existing players such as Postgres (via the pgvector extension) also offer vector storage.
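
To demystify what a vector database does at its core, here is a toy in-memory store that keeps vectors next to their text and answers nearest-neighbor queries with a linear scan. The class and its `add` and `query` methods are assumptions for this example. Only the `count_entries` name is borrowed from the adapter calls shown later in this notebook, and nothing here reflects how Weaviate is implemented.

```python
import heapq
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


class InMemoryVectorStore:
    """Toy vector store: exact search over every stored entry."""

    def __init__(self):
        self._entries = []  # list of (unit_vector, text) pairs

    def add(self, vector, text):
        self._entries.append((_normalize(vector), text))

    def count_entries(self):
        return len(self._entries)

    def query(self, vector, top_k=3):
        """Return up to `top_k` (score, text) pairs, most similar first."""
        unit = _normalize(vector)
        scored = (
            (sum(q * v for q, v in zip(unit, stored)), text)
            for stored, text in self._entries
        )
        return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
```

A linear scan is fine for a few hundred chunks. Dedicated vector databases typically replace it with an approximate nearest-neighbor index so queries stay fast at much larger scales.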

### Vector DB Task #1: Understand the indexer code
Open `indexers.py` in `./workshop_code/`. Look over how the `WcsClientAdapter` is used, and look at how its methods are implemented. If something doesn't make sense, ask a question.
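
While reading, it may help to keep the overall data flow in mind: load, preprocess, split, vectorize, store. The sketch below wires those five stages together as plain callables. The function name and its parameters are assumptions for this example and do not mirror the actual `NaiveIndexer` class.

```python
def index_document(uri, load, preprocess, split, vectorize, store):
    """Run one document through the indexing stages and return the chunk count.

    Every stage is passed in as a callable (all hypothetical here):
      load(uri) -> html, preprocess(html) -> text, split(text) -> chunks,
      vectorize(chunks) -> vectors, store(chunk, vector) -> None
    """
    html = load(uri)
    text = preprocess(html)
    chunks = split(text)
    vectors = vectorize(chunks)

    if len(vectors) != len(chunks):
        raise ValueError("expected exactly one vector per text chunk")

    # Persist each chunk next to its embedding so it can be retrieved later.
    for chunk, vector in zip(chunks, vectors):
        store(chunk, vector)
    return len(chunks)
```

Because each stage sits behind a narrow interface, any one of them (for example, the text splitter) can be replaced without touching the others.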

## The Complete Indexer: indexers.py
### Indexer Task: test that your indexer works
Your indexer should give output like the cheat_code version below:

In [13]:
from cheat_code.indexers import NaiveIndexer
from cheat_code.common_components.vectorizers import Vectorizer

rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
vectorizer = Vectorizer()
indexer = NaiveIndexer(vectorizer)
indexer.index(rag_survey_paper_uri)
num_db_entries = indexer._wcs_client_adapter.count_entries()
print(f"Number of text chunks in Weaviate: {num_db_entries}")

Number of text chunks in Weaviate: 35


Run your indexer below to see if it works the same way. If it doesn't, something is broken.

In [5]:
from workshop_code.indexers import NaiveIndexer
from workshop_code.common_components.vectorizers import Vectorizer

rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
vectorizer = Vectorizer()
indexer = NaiveIndexer(vectorizer)
indexer.index(rag_survey_paper_uri)
num_db_entries = indexer._wcs_client_adapter.count_entries()
print(f"Number of text chunks in Weaviate: {num_db_entries}")

Number of text chunks in Weaviate: 270
