# Your First RAG Application

In this notebook, we'll walk you through each of the components that are involved in a simple RAG application.

We won't be leveraging any fancy tools, just the OpenAI Python SDK, Numpy, and some classic Python.

> NOTE: This was done with Python 3.11.4.

> NOTE: There might be [compatibility issues](https://github.com/wandb/wandb/issues/7683) if you're on an NVIDIA driver newer than 552.44. As an interim solution, you can roll back your driver to 552.44.

## Table of Contents:

- Task 1: Imports and Utilities
- Task 2: Documents
- Task 3: Embeddings and Vectors
- Task 4: Prompts
- Task 5: Retrieval Augmented Generation
  - 🚧 Activity #1: Augment RAG

Let's look at a rather complicated looking visual representation of a basic RAG application.

<img src="https://i.imgur.com/vD8b016.png" />

## Task 1: Imports and Utilities

We're just doing some imports and enabling `async` to work within the Jupyter environment here, nothing too crazy!

In [1]:
!pip install -qU numpy matplotlib plotly pandas scipy scikit-learn openai python-dotenv




In [2]:
from aimakerspace.text_utils import TextFileLoader, CharacterTextSplitter
from aimakerspace.vectordatabase import VectorDatabase
import asyncio

In [3]:
import nest_asyncio
nest_asyncio.apply()

In [11]:
from IPython.display import display, Markdown

def pretty_print(message: str) -> None:
    display(Markdown(message))

## Task 2: Documents

We'll be concerning ourselves with this part of the flow in the following section:

<img src="https://i.imgur.com/jTm9gjk.png" />

### Loading Source Documents

So, first things first, we need some documents to work with.

While we could work directly with the `.txt` files (or whatever file-types you wanted to extend this to) we can instead do some batch processing of those documents at the beginning in order to store them in a more machine compatible format.

In this case, we're going to parse our text file into a single document in memory.

Let's look at the relevant bits of the `TextFileLoader` class:

```python
def load_file(self):
        with open(self.path, "r", encoding=self.encoding) as f:
            self.documents.append(f.read())
```

We're simply loading the document using the built-in `open` function, and storing its contents in our `self.documents` list.


In [4]:
text_loader = TextFileLoader("data/PMarcaBlogs.txt")
documents = text_loader.load_documents()
len(documents)

1

In [7]:
print(documents[0][:100])


The Pmarca Blog Archives
(select posts from 2007-2009)
Marc Andreessen
copyright: Andreessen Horow


### Splitting Text Into Chunks

As we can see, there is one massive document.

We'll want to chunk the document into smaller parts so it's easier to pass the most relevant snippets to the LLM.

There is no fixed way to split/chunk documents - and you'll need to rely on some intuition as well as knowing your data *very* well in order to build the most robust system.

For this toy example, we'll just split blindly on length.

>There's an opportunity to clear up some terminology here. For this course we will stick to the following:
>
>- "source documents" : The `.txt`, `.pdf`, `.html`, ..., files that make up the files and information we start with in its raw format
>- "document(s)" : single (or more) text object(s)
>- "corpus" : the combination of all of our documents

As you can imagine (though it's not specifically true in this toy example), the idea of splitting documents is to break them into manageably sized chunks that retain the most relevant local context.
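
To make "splitting blindly on length" concrete, here is a minimal sketch of a fixed-size character splitter. The function name `split_by_length` and the `chunk_size`/`chunk_overlap` values are assumptions chosen for illustration; `CharacterTextSplitter` may differ in its defaults and details.

```python
from typing import List


def split_by_length(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_size must be greater than chunk_overlap")
    step = chunk_size - chunk_overlap  # how far the window slides each time
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 5  # 50 characters of stand-in text
chunks = split_by_length(sample, chunk_size=20, chunk_overlap=5)
print(len(chunks), [len(c) for c in chunks])
```

Because the windows ignore sentence and word boundaries, chunks produced this way can begin or end mid-word. The overlap is one cheap way to keep some shared context between neighbouring chunks.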

In [8]:
text_splitter = CharacterTextSplitter()
split_documents = text_splitter.split_texts(documents)
len(split_documents)

373

Let's take a look at some of the documents we've managed to split.

In [22]:
pretty_print(split_documents[0])


The Pmarca Blog Archives
(select posts from 2007-2009)
Marc Andreessen
copyright: Andreessen Horowitz
cover design: Jessica Hagy
produced using: Pressbooks
Contents
THE PMARCA GUIDE TO STARTUPS
Part 1: Why not to do a startup 2
Part 2: When the VCs say "no" 10
Part 3: "But I don't know any VCs!" 18
Part 4: The only thing that matters 25
Part 5: The Moby Dick theory of big companies 33
Part 6: How much funding is too little? Too much? 41
Part 7: Why a startup's initial business plan doesn't
matter that much
49
THE PMARCA GUIDE TO HIRING
Part 8: Hiring, managing, promoting, and Dring
executives
54
Part 9: How to hire a professional CEO 68
How to hire the best people you've ever worked
with
69
THE PMARCA GUIDE TO BIG COMPANIES
Part 1: Turnaround! 82
Part 2: Retaining great people 86
THE PMARCA GUIDE TO CAREER, PRODUCTIVITY,
AND SOME OTHER THINGS
Introduction 97
Part 1: Opportunity 99
Part 2: Skills and education 107
Part 3: Where to go and why 120
The Pmarca Guide to Personal Productivi

## Task 3: Embeddings and Vectors

Next, we have to convert our corpus into a "machine readable" format as we explored in the Embedding Primer notebook.

Today, we're going to talk about the actual process of creating, and then storing, these embeddings, and how we can leverage that to intelligently add context to our queries.

### OpenAI API Key

In order to access OpenAI's APIs, we'll need to provide our OpenAI API Key!

You can work through the folder "OpenAI API Key Setup" for more information on this process if you don't already have an API Key!

In [10]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

### Vector Database

Let's set up our vector database to hold all our documents and their embeddings!

While this is all baked into 1 call - we can look at some of the code that powers this process to get a better understanding:

Let's look at our `VectorDatabase().__init__()`:

```python
def __init__(self, embedding_model: EmbeddingModel = None):
        self.vectors = defaultdict(np.array)
        self.embedding_model = embedding_model or EmbeddingModel()
```

As you can see - our vectors are merely stored as a dictionary of `np.array` objects.

Secondly, our `VectorDatabase()` has a default `EmbeddingModel()` which is a wrapper for OpenAI's `text-embedding-3-small` model.

> **Quick Info About `text-embedding-3-small`**:
> - It has a context window of **8191** tokens
> - It returns vectors with dimension **1536**

#### ❓Question #1:

The default embedding dimension of `text-embedding-3-small` is 1536, as noted above. 

1. Is there any way to modify this dimension?
2. What technique does OpenAI use to achieve this?

> NOTE: Check out this [API documentation](https://platform.openai.com/docs/api-reference/embeddings/create) for the answer to question #1, and [this documentation](https://platform.openai.com/docs/guides/embeddings/use-cases) for an answer to question #2!

**Answer**
1. Yes. Pass the `dimensions` parameter in the body of the embeddings request to set the number of dimensions returned ([docs](https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-dimensions)).
2. OpenAI uses [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) to train embeddings that can be shortened (i.e. have some numbers removed from the end of the sequence) without losing their [concept-representing properties](https://platform.openai.com/docs/guides/embeddings/use-cases#:~:text=Using%20larger%20embeddings,is%20shown%20below.).
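
As a rough illustration of the second answer, shortening can be mimicked locally by keeping a prefix of the vector and rescaling it to unit length. The sketch below uses a random vector as a stand-in for a real embedding, so it shows only the mechanics, not the quality of real shortened embeddings; `shorten_embedding` is a hypothetical helper, not part of `aimakerspace` or the OpenAI SDK.

```python
import numpy as np


def shorten_embedding(embedding: np.ndarray, dimensions: int) -> np.ndarray:
    """Keep the first `dimensions` values and rescale the result to unit length."""
    truncated = embedding[:dimensions]
    norm = np.linalg.norm(truncated)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return truncated / norm


# Stand-in for a 1536-dimensional embedding; a real one would come from the API.
rng = np.random.default_rng(seed=0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

short = shorten_embedding(full, dimensions=256)
print(short.shape, round(float(np.linalg.norm(short)), 6))
```

In practice, passing `dimensions` in the request (answer 1) is the simpler route, since the API then returns vectors of the requested size directly.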

We can call the `async_get_embeddings` method of our `EmbeddingModel()` on a list of `str` and receive a list of embeddings back, each one a list of `float`!

```python
async def async_get_embeddings(self, list_of_text: List[str]) -> List[List[float]]:
        return await aget_embeddings(
            list_of_text=list_of_text, engine=self.embeddings_model_name
        )
```

We cast those to `np.array` when we build our `VectorDatabase()`:

```python
async def abuild_from_list(self, list_of_text: List[str]) -> "VectorDatabase":
        embeddings = await self.embedding_model.async_get_embeddings(list_of_text)
        for text, embedding in zip(list_of_text, embeddings):
            self.insert(text, np.array(embedding))
        return self
```

And that's all we need to do!

In [23]:
vector_db = VectorDatabase()
vector_db = asyncio.run(vector_db.abuild_from_list(split_documents))

#### ❓Question #2:

What are the benefits of using an `async` approach to collecting our embeddings?

> NOTE: Determining the core difference between `async` and `sync` will be useful! If you get stuck - ask ChatGPT!

**Answer**

With `async`, execution is not blocked while we wait for an embeddings request to complete, so requests for several batches of document chunks can be in flight at the same time instead of running one after another.
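
The difference is easiest to see with a simulated request. In this sketch `fake_embed` is a hypothetical stand-in that sleeps instead of calling the API, so the timings only illustrate the pattern and say nothing about real OpenAI latency.

```python
import asyncio
import time
from typing import List


async def fake_embed(batch: List[str]) -> List[int]:
    """Stand-in for one embeddings request: wait, then return a dummy value per text."""
    await asyncio.sleep(0.2)  # simulated network latency
    return [len(text) for text in batch]


async def embed_sequentially(batches: List[List[str]]) -> List[List[int]]:
    # Each request starts only after the previous one has finished.
    return [await fake_embed(batch) for batch in batches]


async def embed_concurrently(batches: List[List[str]]) -> List[List[int]]:
    # All requests are in flight at once; gather returns results in input order.
    return await asyncio.gather(*(fake_embed(batch) for batch in batches))


batches = [["a", "bb"], ["ccc"], ["dddd", "eeeee"]]

start = time.perf_counter()
sequential = asyncio.run(embed_sequentially(batches))
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
concurrent = asyncio.run(embed_concurrently(batches))
concurrent_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

Inside Jupyter, `asyncio.run` only works here because of the `nest_asyncio.apply()` call made in Task 1.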

So, to review what we've done so far in natural language:

1. We load source documents
2. We split those source documents into smaller chunks (documents)
3. We send each of those documents to the `text-embedding-3-small` OpenAI API endpoint
4. We store each of the text representations with the vector representations as keys/values in a dictionary

### Semantic Similarity

The next step is to be able to query our `VectorDatabase()` with a `str` and have it return the vectors and text from our corpus that are most relevant.

We're going to use the following process to achieve this in our toy example:

1. We need to embed our query with the same `EmbeddingModel()` as we used to construct our `VectorDatabase()`
2. We loop through every vector in our `VectorDatabase()` and use a similarity measure to score how closely related each one is to the query
3. We return a list of the top `k` closest vectors, with their text representations

There's some very heavy optimization that can be done at each of these steps - but let's just focus on the basic pattern in this notebook.

> We are using [cosine similarity](https://www.engati.com/glossary/cosine-similarity) as our similarity measure in this example - but there are many other measures you could use - like [these](https://flavien-vidal.medium.com/similarity-distances-for-natural-language-processing-16f63cd5ba55)

> We are using a rather inefficient way of calculating relative distance between the query vector and all other vectors - there are more advanced approaches that are much more efficient, like [ANN](https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6)
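
The three steps above can be sketched with plain NumPy. This is a minimal illustration under assumptions: the function names and the toy two-dimensional vectors are made up, and the query is passed in already embedded, whereas `search_by_text` embeds it first.

```python
from typing import Dict, List, Tuple

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def search(query: np.ndarray, vectors: Dict[str, np.ndarray], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best (text, score) pairs."""
    scores = [(text, cosine_similarity(query, vector)) for text, vector in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings" keyed by their text, mirroring the dictionary layout above.
toy_vectors = {
    "hiring executives": np.array([1.0, 0.1]),
    "raising funding": np.array([0.1, 1.0]),
    "managing executives": np.array([0.9, 0.3]),
}
results = search(np.array([1.0, 0.0]), toy_vectors, k=2)
print(results)
```

Scoring every vector costs time proportional to the size of the corpus for each query, which is fine for a few hundred chunks and is the part that approximate nearest neighbour indexes replace.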

In [25]:
vector_db.search_by_text("What is the Michael Eisner Memorial Weak Executive Problem?", k=3)

[('ordingly.\nSeventh, when hiring the executive to run your former specialty, be\ncareful you don’t hire someone weak on purpose.\nThis sounds silly, but you wouldn’t believe how oaen it happens.\nThe CEO who used to be a product manager who has a weak\nproduct management executive. The CEO who used to be in\nsales who has a weak sales executive. The CEO who used to be\nin marketing who has a weak marketing executive.\nI call this the “Michael Eisner Memorial Weak Executive Problem” — aaer the CEO of Disney who had previously been a brilliant TV network executive. When he bought ABC at Disney, it\npromptly fell to fourth place. His response? “If I had an extra\ntwo days a week, I could turn around ABC myself.” Well, guess\nwhat, he didn’t have an extra two days a week.\nA CEO — or a startup founder — oaen has a hard time letting\ngo of the function that brought him to the party. The result: you\nhire someone weak into the executive role for that function so\nthat you can continue to b

In [35]:
for text, score in vector_db.search_by_text("What is the Michael Eisner Memorial Weak Executive Problem?", k=3):
    pretty_print(text)
    print(score)

ordingly.
Seventh, when hiring the executive to run your former specialty, be
careful you don’t hire someone weak on purpose.
This sounds silly, but you wouldn’t believe how oaen it happens.
The CEO who used to be a product manager who has a weak
product management executive. The CEO who used to be in
sales who has a weak sales executive. The CEO who used to be
in marketing who has a weak marketing executive.
I call this the “Michael Eisner Memorial Weak Executive Problem” — aaer the CEO of Disney who had previously been a brilliant TV network executive. When he bought ABC at Disney, it
promptly fell to fourth place. His response? “If I had an extra
two days a week, I could turn around ABC myself.” Well, guess
what, he didn’t have an extra two days a week.
A CEO — or a startup founder — oaen has a hard time letting
go of the function that brought him to the party. The result: you
hire someone weak into the executive role for that function so
that you can continue to be “the man” — cons

0.6539043027545371


m. They have areas where they are truly deXcient in judgment or skill set. That’s just life. Almost nobody is brilliant
at everything. When hiring and when Hring executives, you
must therefore focus on strength rather than lack of weakness. Everybody has severe weaknesses even if you can’t see
them yet. When managing, it’s oaen useful to micromanage and
to provide remedial training around these weaknesses. Doing so
may make the diWerence between an executive succeeding or
failing.
For example, you might have a brilliant engineering executive
who generates excellent team loyalty, has terriXc product judgment and makes the trains run on time. This same executive
may be very poor at relating to the other functions in the company. She may generate far more than her share of cross-functional conYicts, cut herself oW from critical information, and
signiXcantly impede your ability to sell and market eWectively.
Your alternatives are:
(a) Macro-manage and give her an annual or quarterly object

0.5036325125269061


ed?
In reality — as opposed to Marc’s warped view of reality — it will
be extremely helpful for Marc [if he were actually the CEO,
which he is not] to meet with the new head of engineering daily
when she comes on board and review all of her thinking and
decisions. This level of micromanagement will accelerate her
training and improve her long-term eWectiveness. It will make
her seem smarter to the rest of the organization which will build
credibility and conXdence while she comes up to speed. Micromanaging new executives is generally a good idea for a limited
period of time.
However, that is not the only time that it makes sense to micro66 The Pmarca Blog Archives
manage executives. It turns out that just about every executive
in the world has a few things that are seriously wrong with
them. They have areas where they are truly deXcient in judgment or skill set. That’s just life. Almost nobody is brilliant
at everything. When hiring and when Hring executives, you
must therefore focus o

0.48143700816214524


## Task 4: Prompts

In the following section, we'll be looking at the role of prompts - and how they help us to guide our application in the right direction.

In this notebook, we're going to rely on the idea of "zero-shot in-context learning".

This is a lot of words to say: "We will ask it to perform our desired task in the prompt, and provide no examples."

### XYZRolePrompt

Before we do that, let's stop and think a bit about how OpenAI's chat models work.

We know they have roles - as is indicated in the following API [documentation](https://platform.openai.com/docs/api-reference/chat/create#chat/create-messages)

There are three roles, and they function as follows (taken directly from [OpenAI](https://platform.openai.com/docs/guides/gpt/chat-completions-api)):

- `{"role" : "system"}` : The system message helps set the behavior of the assistant. For example, you can modify the personality of the assistant or provide specific instructions about how it should behave throughout the conversation. However note that the system message is optional and the model’s behavior without a system message is likely to be similar to using a generic message such as "You are a helpful assistant."
- `{"role" : "user"}` : The user messages provide requests or comments for the assistant to respond to.
- `{"role" : "assistant"}` : Assistant messages store previous assistant responses, but can also be written by you to give examples of desired behavior.

The main idea is this:

1. You start with a system message that outlines how the LLM should respond, what kind of behaviours you can expect from it, and more
2. Then, you can provide a few examples in the form of "assistant"/"user" pairs
3. Then, you prompt the model with the true "user" message.

In this example, we'll be forgoing the 2nd step for simplicity's sake.
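
For reference, here is a sketch of what the full three-step message list looks like, including the example pairs we skip in this notebook. `build_messages` and its argument names are hypothetical; only the `role`/`content` dictionary shape comes from the API.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(system: str, examples: List[Dict[str, str]], query: str) -> List[Message]:
    """Assemble the system message, optional example turns, then the real user message."""
    messages: List[Message] = [{"role": "system", "content": system}]
    for example in examples:
        # Each example is a worked question/answer pair shown to the model.
        messages.append({"role": "user", "content": example["question"]})
        messages.append({"role": "assistant", "content": example["answer"]})
    messages.append({"role": "user", "content": query})
    return messages


messages = build_messages(
    system="You answer in one short sentence.",
    examples=[{"question": "What is 2 + 2?", "answer": "4."}],
    query="What is 3 + 3?",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

Passing an empty `examples` list gives the zero-shot form used in the rest of this notebook: one system message followed by one user message.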

#### Utility Functions

You'll notice that we're using some utility functions from the `aimakerspace` module - let's take a peek at these and see what they're doing!

##### XYZRolePrompt

Here we have our `system`, `user`, and `assistant` role prompts.

Let's take a peek at what they look like:

```python
import re


class BasePrompt:
    def __init__(self, prompt):
        """
        Initializes the BasePrompt object with a prompt template.

        :param prompt: A string that can contain placeholders within curly braces
        """
        self.prompt = prompt
        self._pattern = re.compile(r"\{([^}]+)\}")

    def format_prompt(self, **kwargs):
        """
        Formats the prompt string using the keyword arguments provided.

        :param kwargs: The values to substitute into the prompt string
        :return: The formatted prompt string
        """
        matches = self._pattern.findall(self.prompt)
        return self.prompt.format(**{match: kwargs.get(match, "") for match in matches})

    def get_input_variables(self):
        """
        Gets the list of input variable names from the prompt string.

        :return: List of input variable names
        """
        return self._pattern.findall(self.prompt)
```
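
One behaviour of `format_prompt` is easy to miss: a placeholder with no matching keyword argument is filled with an empty string instead of raising an error. The standalone sketch below reproduces that logic with the same regular expression; the template text and the `audience` placeholder are made up for illustration.

```python
import re

# Same pattern as `BasePrompt._pattern`: capture whatever sits between curly braces.
pattern = re.compile(r"\{([^}]+)\}")

template = "You are an expert in {expertise}. Answer for a {audience} audience."
names = pattern.findall(template)

# Mirror `format_prompt`: any placeholder without a supplied value becomes "".
supplied = {"expertise": "Python"}
filled = template.format(**{name: supplied.get(name, "") for name in names})

print(names)
print(filled)
```

Since a misspelled keyword silently produces a gap in the prompt, it can be worth comparing your arguments against `get_input_variables()` before sending a message.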

Then we have our `RolePrompt` which laser focuses us on the role pattern found in most API endpoints for LLMs.

```python
class RolePrompt(BasePrompt):
    def __init__(self, prompt, role: str):
        """
        Initializes the RolePrompt object with a prompt template and a role.

        :param prompt: A string that can contain placeholders within curly braces
        :param role: The role for the message ('system', 'user', or 'assistant')
        """
        super().__init__(prompt)
        self.role = role

    def create_message(self, **kwargs):
        """
        Creates a message dictionary with a role and a formatted message.

        :param kwargs: The values to substitute into the prompt string
        :return: Dictionary containing the role and the formatted message
        """
        return {"role": self.role, "content": self.format_prompt(**kwargs)}
```

We'll look at how the `SystemRolePrompt` is constructed to get a better idea of how that extension works:

```python
class SystemRolePrompt(RolePrompt):
    def __init__(self, prompt: str):
        super().__init__(prompt, "system")
```

That pattern is repeated for our `UserRolePrompt` and our `AssistantRolePrompt` as well.

##### ChatOpenAI

Next we have our model, which is wrapped in an interface analogous to that of libraries like LangChain and LlamaIndex.

Let's take a peek at how that is constructed:

```python
class ChatOpenAI:
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self.model_name = model_name
        self.openai_api_key = os.getenv("OPENAI_API_KEY")
        if self.openai_api_key is None:
            raise ValueError("OPENAI_API_KEY is not set")

    def run(self, messages, text_only: bool = True):
        if not isinstance(messages, list):
            raise ValueError("messages must be a list")

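        # openai>=1.0 removed `openai.ChatCompletion`; requests go through a client object
        client = openai.OpenAI(api_key=self.openai_api_key)
        response = client.chat.completions.create(
            model=self.model_name, messages=messages
        )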

        if text_only:
            return response.choices[0].message.content

        return response
```

#### ❓ Question #3:

When calling the OpenAI API - are there any ways we can achieve more reproducible outputs?

> NOTE: Check out [this section](https://platform.openai.com/docs/guides/text-generation/) of the OpenAI documentation for the answer!

**Answer**

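Set the `temperature` parameter to a low value, e.g. `0`. Passing a fixed `seed` helps as well, although outputs then remain mostly, not strictly, deterministic.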
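
A minimal sketch of how those settings could be collected for a request. `build_chat_request` is a hypothetical helper, and the default `seed` value is arbitrary; `temperature` and `seed` are parameters of the chat completions endpoint. The API call itself is left out so the sketch runs without an API key.

```python
from typing import Any, Dict, List


def build_chat_request(
    messages: List[Dict[str, str]],
    model: str = "gpt-4o-mini",
    temperature: float = 0.0,
    seed: int = 42,
) -> Dict[str, Any]:
    """Collect keyword arguments intended for `client.chat.completions.create(**request)`."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,  # low values reduce sampling randomness
        "seed": seed,  # best-effort determinism across repeated calls
    }


request = build_chat_request([{"role": "user", "content": "Say hello."}])
print(request)
```

Note that `ChatOpenAI.run` as shown above forwards only `model` and `messages`, so using these settings there would mean extending that method.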

### Creating and Prompting OpenAI's `gpt-4o-mini`!

Let's tie all these together and use it to prompt `gpt-4o-mini`!

In [36]:
from aimakerspace.openai_utils.prompts import (
    UserRolePrompt,
    SystemRolePrompt,
    AssistantRolePrompt,
)

from aimakerspace.openai_utils.chatmodel import ChatOpenAI

chat_openai = ChatOpenAI()
user_prompt_template = "{content}"
user_role_prompt = UserRolePrompt(user_prompt_template)
system_prompt_template = (
    "You are an expert in {expertise}, you always answer in a kind way."
)
system_role_prompt = SystemRolePrompt(system_prompt_template)

messages = [
    system_role_prompt.create_message(expertise="Python"),
    user_role_prompt.create_message(
        content="What is the best way to write a loop?"
    ),
]

response = chat_openai.run(messages)

In [37]:
pretty_print(response)

The best way to write a loop in Python depends on the specific task you're trying to accomplish. However, I can share some general tips and examples to help you create clear and efficient loops.

### 1. **Using `for` loops**
`for` loops are often the best choice when you know in advance the number of iterations or when you're working with a collection (like a list, tuple, or string).

```python
# Example: Iterating through a list
fruits = ['apple', 'banana', 'cherry']
for fruit in fruits:
    print(fruit)
```

### 2. **Using `while` loops**
`while` loops are more suitable when the number of iterations is not known beforehand, and you want to continue looping until a certain condition is met.

```python
# Example: Using a while loop to count down
count = 5
while count > 0:
    print(count)
    count -= 1
```

### 3. **Using `enumerate()` in `for` loops**
When you need both the index and the value from a list, using `enumerate()` can be very helpful.

```python
# Example: Getting index and value
colors = ['red', 'green', 'blue']
for index, color in enumerate(colors):
    print(f"Index {index}: Color {color}")
```

### 4. **List comprehensions**
For simple loops that generate lists, list comprehensions can be a more concise and readable option.

```python
# Example: Creating a list of squares
squares = [x**2 for x in range(10)]
print(squares)
```

### 5. **Using `break` and `continue`**
You can control the loop's behavior using `break` to exit the loop early or `continue` to skip to the next iteration.

```python
# Example: Using break and continue
for number in range(10):
    if number == 5:
        break  # Exit the loop when number is 5
    if number % 2 == 0:
        continue  # Skip even numbers
    print(number)  # Will print only odd numbers less than 5
```

### Additional Tips:
- **Keep it Simple**: Aim for clarity. If a loop is getting complex, consider breaking it into functions.
- **Avoid Infinite Loops**: Always ensure that your `while` conditions will eventually be false to prevent infinite loops.
- **Iterate Efficiently**: Choose the right type of loop based on the scenario (e.g., prefer `for` loops for iterations over a known range).

Feel free to ask more specific questions about loops or any other Python-related topics! I'm here to help.

## Task 5: Retrieval Augmented Generation

Now we can create a RAG prompt - which will help our system behave in a way that makes sense!

There is much you could do here, many tweaks and improvements to be made!

In [38]:
RAG_PROMPT_TEMPLATE = """ \
Use the provided context to answer the user's query.

You may not answer the user's query unless there is specific context in the following text.

If you do not know the answer, or cannot answer, please respond with "I don't know".
"""

rag_prompt = SystemRolePrompt(RAG_PROMPT_TEMPLATE)

USER_PROMPT_TEMPLATE = """ \
Context:
{context}

User Query:
{user_query}
"""


user_prompt = UserRolePrompt(USER_PROMPT_TEMPLATE)

class RetrievalAugmentedQAPipeline:
    def __init__(self, llm: ChatOpenAI, vector_db_retriever: VectorDatabase) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever

    def run_pipeline(self, user_query: str) -> dict:
        context_list = self.vector_db_retriever.search_by_text(user_query, k=4)

        context_prompt = ""
        for context in context_list:
            context_prompt += context[0] + "\n"

        formatted_system_prompt = rag_prompt.create_message()

        formatted_user_prompt = user_prompt.create_message(user_query=user_query, context=context_prompt)

        return {"response" : self.llm.run([formatted_system_prompt, formatted_user_prompt]), "context" : context_list}

#### ❓ Question #4:

What prompting strategies could you use to make the LLM have a more thoughtful, detailed response?

What is that strategy called?

> NOTE: You can look through the Week 1 Day 1 "Prompting OpenAI Like A Developer" material for an answer to this question!

**Answer**

Chain-of-thought prompting usually leads the model to give a more thoughtful, detailed response. For example, add an instruction such as `"Think through your response step by step."` to the prompt.
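
As a sketch, the instruction can be added to the system prompt used above. `COT_RAG_PROMPT_TEMPLATE` is a hypothetical variant of `RAG_PROMPT_TEMPLATE`, and the exact wording of the added sentence is only one option.

```python
# Hypothetical variant of RAG_PROMPT_TEMPLATE with a chain-of-thought instruction added.
COT_RAG_PROMPT_TEMPLATE = """\
Use the provided context to answer the user's query.

You may not answer the user's query unless there is specific context in the following text.

Think through your response step by step, then give your final answer.

If you do not know the answer, or cannot answer, please respond with "I don't know".
"""

system_message = {"role": "system", "content": COT_RAG_PROMPT_TEMPLATE}
print(system_message["content"])
```

In the pipeline, this string would be passed to `SystemRolePrompt` in place of `RAG_PROMPT_TEMPLATE`.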

In [39]:
retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db,
    llm=chat_openai
)

In [40]:
retrieval_augmented_qa_pipeline.run_pipeline("What is the 'Michael Eisner Memorial Weak Executive Problem'?")

{'response': "The 'Michael Eisner Memorial Weak Executive Problem' refers to the tendency of a CEO or startup founder, who has expertise in a particular function (like product management, sales, or marketing), to hire a weak executive in that same area. This often happens because the leader wants to maintain control and continue being seen as the authority in that function, resulting in the appointment of someone less capable to allow them to still play a prominent role. The example given in the context is of Michael Eisner, the former CEO of Disney, who faced challenges after acquiring ABC and blamed a lack of additional time rather than addressing any fundamental issues with the executives he appointed.",
 'context': [('ordingly.\nSeventh, when hiring the executive to run your former specialty, be\ncareful you don’t hire someone weak on purpose.\nThis sounds silly, but you wouldn’t believe how oaen it happens.\nThe CEO who used to be a product manager who has a weak\nproduct manageme

### 🏗️ Activity #1:

Enhance your RAG application in some way! 

Suggestions are: 

- Allow it to work with PDF files
- Implement a new distance metric
- Add metadata support to the vector database

While these are suggestions, you should feel free to make whatever augmentations you desire! 

> NOTE: These additions might require you to work within the `aimakerspace` library - that's expected!
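
As a starting point for the distance-metric suggestion, here is a sketch of a Euclidean alternative to cosine similarity. Both function names are hypothetical. Since the search results above are ranked highest score first, the distance is converted into a score where larger still means closer. How `search_by_text` accepts a custom measure is not shown in this notebook, so check the library's signature before wiring it in.

```python
import numpy as np


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two vectors; 0.0 means identical."""
    return float(np.linalg.norm(a - b))


def euclidean_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Map distance into (0, 1] so that larger still means closer."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


query = np.array([1.0, 0.0])
near = np.array([0.9, 0.1])
far = np.array([-1.0, 0.0])

print(euclidean_similarity(query, near), euclidean_similarity(query, far))
```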

In [41]:
### YOUR CODE HERE
!pip install -qU PyPDF2




In [42]:
import os
import PyPDF2

class PDFFileLoader:
    def __init__(self, path: str):
        self.documents = []
        self.path = path

    def load(self):
        if os.path.isdir(self.path):
            self.load_directory()
        elif os.path.isfile(self.path) and self.path.endswith(".pdf"):
            self.load_file()
        else:
            raise ValueError(
                "Provided path is neither a valid directory nor a .pdf file."
            )

    def load_file(self):
        with open(self.path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            text = ""
            for page in reader.pages:
                text += page.extract_text()
            self.documents.append(text)

    def load_directory(self):
        for root, _, files in os.walk(self.path):
            for file in files:
                if file.endswith(".pdf"):
                    with open(os.path.join(root, file), "rb") as f:
                        reader = PyPDF2.PdfReader(f)
                        text = ""
                        for page in reader.pages:
                            text += page.extract_text()
                        self.documents.append(text)

    def load_documents(self):
        self.load()
        return self.documents

In [46]:
pdf_loader = PDFFileLoader("data/pdfs/")
documents = pdf_loader.load_documents()
len(documents)

2

In [103]:
print(documents[1][1000:])

.................4
Part V: Designing Special and Fun Gardens ........................................5
Part VI: The Part of Tens .......................................................................5
Icons Used in This Book..................................................................................5
Where to Go from Here....................................................................................6
Part I: Preparing Yourself (And Your Garden) 
for Planting ..................................................................7
Chapter 1: Getting Ready for Gardening  . . . . . . . . . . . . . . . . . . . . . . . . . .9
Playing the Name Game.................................................................................10
“Hello, my name is . . .”: Getting used to plant nomenclature ........10
Anatomy 101: Naming plant parts......................................................11
Bringing in Beauty with Flowers (and Foliage) ..........................................12
Amazing

In [128]:
from typing import List

class PriorityTextSplitter:
    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
    ):
        assert (
            chunk_size > chunk_overlap
        ), "Chunk size must be greater than chunk overlap"

        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split(self, text: str) -> List[str]:
        chunks = []
        start = 0

        while start < len(text):
            end = min(start + self.chunk_size, len(text))
            chunk = text[start:end]

            if end < len(text):
                # Attempt to find the best split point based on priority
                split_point = self.find_split_point(chunk)
                chunks.append(text[start:start + split_point])
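                # Step past the split point, minus the overlap. If the split point lies
                # inside the overlap window, skip the overlap so we always move forward.
                step = split_point - self.chunk_overlap
                start += step if step > 0 else split_point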
            else:
                # No need to find a split point if we're at the end of the text
                chunks.append(chunk)
                break

        return chunks

    def find_split_point(self, chunk: str) -> int:
        # Priority 1: Double new line
        split_point = chunk.rfind('\n\n') #('Part [IVX]+:')
        if split_point != -1:
            return split_point + 2  # Include the double new line in the split

        # Priority 2: Single new line
        split_point = chunk.rfind('\n') #('Chapter [0-9]+:')
        if split_point != -1:
            return split_point + 1  # Include the new line in the split

        # Priority 3: Space
        split_point = chunk.rfind(' ')
        if split_point != -1:
            return split_point + 1  # Include the space in the split

        # Priority 4: Any character (fallback)
        return len(chunk)

    def split_texts(self, texts: List[str]) -> List[str]:
        chunks = []
        for text in texts:
            chunks.extend(self.split(text))
        return chunks
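
To see the priority order at work, here is a small standalone sketch that mirrors the logic of `find_split_point` on a few toy strings. The helper name `pick_split_point` and the sample strings are made up for illustration and are not part of the notebook's pipeline.

```python
def pick_split_point(chunk: str) -> int:
    """Return the cut index using the same priority order as find_split_point."""
    for separator in ("\n\n", "\n", " "):
        position = chunk.rfind(separator)
        if position != -1:
            # Cut just after the separator so it stays with the left-hand piece
            return position + len(separator)
    # No whitespace at all: cut at the end of the chunk
    return len(chunk)


# Hypothetical sample chunks, one per priority level
samples = {
    "paragraph break": "Soil prep.\n\nWatering tips for shrubs",
    "line break": "Soil prep.\nWatering tips for shrubs",
    "space only": "Watering tips for shrubs",
    "no whitespace": "Wateringtipsforshrubs",
}

cut_points = {}
for label, sample in samples.items():
    cut = pick_split_point(sample)
    cut_points[label] = cut
    print(f"{label:>15}: cut at {cut} -> {sample[:cut]!r}")
```

Note that in the first sample the paragraph break wins even though spaces occur later in the chunk. A high-priority separator near the start of a chunk can therefore produce a short chunk.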

In [129]:
text_splitter = PriorityTextSplitter()
split_documents = text_splitter.split_texts(documents)
len(split_documents)

1226

In [130]:
for t in split_documents[800:803]:
    pretty_print(t)

 off all the burlap and twine. In the past, some
landscapers and gardeners simply opened up the bottom of the bundle and
put the plant in the hole, leaving the remaining burlap and string to rot over
time in the hole. But this tack is no longer advisable — today’s burlap and
twine may contain plastic, which doesn’t biodegrade. Take it all off!
Caring for Your Shrubs
A well-cared-for shrub is beautiful and healthy, and it remains so. Frequent
attention to the plant’s needs is crucial the first year or two, less so as the
years go by and the plant becomes an established part of your yard. The fol-
lowing sections give you the basic information that you need to take very
good care of your bushes.241 Chapter 11: Reaching New Heights with Trees and Shrubs17_037492 ch11.qxp  12/26/06  9:05 PM  Page 241Watering
When first planted and indeed throughout the entire first growing season,
water your new shrubs deeply and often. Deliver the water directly to the


37492 ch11.qxp  12/26/06  9:05 PM  Page 241Watering
When first planted and indeed throughout the entire first growing season,
water your new shrubs deeply and often. Deliver the water directly to the
root area (a hose trickling slowly into the basin you created on planting day
is perfect). Twice-a-week watering may be necessary through the spring and
summer months — slow down and stop at the beginning of fall, sending the
plants into winter with one last good soaking.
A young plant can’t tolerate dry spells and drought because its roots are still
developing and may not dive very deep into the ground. An older, established
plant can withstand drought better but shows signs of distress by dropping
petals, unopened buds, and dried, curled, or yellowing leaves — don’t let the
situation come to that. (Dramatic cycles of soaking and drying out are also
stressful for a plant and weaken it, making the shrub more vulnerable to
pests and disease. Neglect becomes a downward spiral.)


ation come to that. (Dramatic cycles of soaking and drying out are also
stressful for a plant and weaken it, making the shrub more vulnerable to
pests and disease. Neglect becomes a downward spiral.)
Lay down a layer of mulch, at least an inch or two thick, in spring or summer,
all across the shrub’s root zone. This mulch helps retain soil moisture. It also
keeps encroaching weeds or lawn grass, competitors for soil moisture, at bay.
Keep mulch 1 inch away from stems.
Fertilizing your shrubs
For newly planted shrubs, some people have found starter solutions useful.
They consist of water soluble fertilizers (usually high in phosphorous) and
vitamins and hormones that stimulate new root growth. Regular fertilizing
can begin the second year
/H12012To boost overall plant health and vigor
/H12012To help green up the leaves and encourage thicker foliage
/H12012To promote more buds and thus more flowers
Apply a general-purpose garden fertilizer, diluted according to the label 


In [131]:
vector_db = VectorDatabase()
vector_db = asyncio.run(vector_db.abuild_from_list(split_documents))
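
The internals of `VectorDatabase` are not shown in this section. As a rough mental model only, assuming it keeps one embedding per chunk and ranks chunks by cosine similarity to the query embedding, the build-then-search flow might be sketched as follows. `toy_embed`, `ToyVectorStore`, and the letter-count "embedding" are hypothetical stand-ins so the example runs without an API key. The real pipeline uses an embedding model instead.

```python
import math
from typing import Dict, List, Tuple


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: counts of the letters a-z."""
    counts = [0.0] * 26
    for char in text.lower():
        if "a" <= char <= "z":
            counts[ord(char) - ord("a")] += 1.0
    return counts


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Illustrative in-memory store: maps each chunk to its vector."""

    def __init__(self) -> None:
        self.vectors: Dict[str, List[float]] = {}

    def build_from_list(self, chunks: List[str]) -> "ToyVectorStore":
        for chunk in chunks:
            self.vectors[chunk] = toy_embed(chunk)
        return self

    def search(self, query: str, k: int) -> List[Tuple[str, float]]:
        # Score every stored chunk against the query, highest similarity first
        query_vector = toy_embed(query)
        scored = [
            (chunk, cosine_similarity(query_vector, vector))
            for chunk, vector in self.vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore().build_from_list([
    "Water new shrubs deeply and often.",
    "Plant oak trees early in the fall.",
    "Mulch helps retain soil moisture.",
])
results = store.search("plant oak trees early in the fall", k=2)
for chunk, score in results:
    print(f"{score:.3f}  {chunk}")
```

Letter counts ignore word order and meaning, so this toy only illustrates the mechanics of storing vectors and ranking by similarity, not the quality of retrieval you get from a learned embedding.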

In [132]:
retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db,
    llm=chat_openai
)

In [133]:
message = retrieval_augmented_qa_pipeline.run_pipeline("What is the best place for planting an oak in my garden and how should I care for it?")

In [134]:
pretty_print(message['response'])

To plant an oak in your garden, first consider its mature size as oaks can grow quite large. Ensure you've chosen a location that provides enough horizontal space for the oak to spread without being cramped. It’s best to plant the tree early in the fall, allowing it time to become established before the ground freezes.

Regarding care, oaks, like other trees, require attention throughout their life. During the first year, focus on deep, soaking waterings, particularly during dry spells. After the initial planting period, continue seasonal care and monitoring for any potential diseases or damage. Regular maintenance will help ensure the oak remains healthy and grows beautifully.

In [135]:
# The first element of each context item is the retrieved chunk text
for t in message['context']:
    pretty_print(t[0])

f you want dense shade, go for an oak 
or maple.
If your summers are long and hot, you don’t want a new young tree to just
dry out and die, even despite extra water and attention from you. Your ideal
tree should have small rather than large leaves (this includes needles!).
Leaves with less surface area conserve moisture better. The tree should 
also have deep roots that can travel to where sustaining moisture is. Good
drought-resilient trees include Eastern red cedar, live oak, hickory, Kentucky
coffee tree, honeylocust, persimmon, bur oak, pin oak, gingko, laurel, pines,
mesquite, Aleppo pine, blue Atlas cedar, and jacaranda.
Getting Treed! Planting Trees
Trees take up space. Okay, maybe the sky’s the limit with vertical space, 
but you have to account for the horizontal space — how big around, how
broad, how spreading — that a tree requires. A tree in a crowded or cramped
setting, or one that has outgrown its spot, runs the risk of becoming not only


e tree can settle in. A tree’s a big plant, after all, and it needs all the
time it can get to adjust to its new home.
Fall planting is the second choice. Fall may be the first preference for garden-
ers in parts of the country having a very early hot spring and summer. Fall is
also ideal for transplanting if you live in an area where the weather is cool
and damp. Do be sure to plant early in the fall so the trees have a chance to
become established before very cold weather moves in and the ground
begins to freeze.223 Chapter 11: Reaching New Heights with Trees and Shrubs17_037492 ch11.qxp  12/26/06  9:05 PM  Page 223Finding a suitable location
When trying to decide where to put a tree, first find out your chosen tree’s
mature size. Figure out the standard dimensions of the exact variety you’ve
chosen. Varieties vary. For example, the handsome English oak (Quercus
robur) in its “plain old species” form can reach 100 feet tall, whereas a culti-


p and let the rest rot in the planting hole over time, but
this idea has been discredited. Also, some modern burlap and string have
plastic woven in, which never breaks down and can constrict growth.
Taking Care of Your Tree
Trees, sadly, are often cared for only when first planted and then they’re left
to fend for themselves in landscapes after they’ve matured. People usually
end this isolation only if the trees become diseased or damaged. Yes, you
should take special care of your tree when it’s newly planted to help it get
established, but you should also make a habit of caring for it season by
season, year by year, as it matures and grows. A healthy tree is a happy tree,
and trees reflect the care you put into them in their beauty. Healthy trees
often make the most beautiful trees, after all.
Giving trees a tall drink of water
The most important times to water a tree are when it’s newly planted —
indeed, throughout its first year — and during dry spells. Deep, soaking


o home — reputable local nurseries have plenty of
stock that ought to thrive for you. If your chosen site has any special or
unique conditions (boggy soil, say), tell the salesperson so he or she can
assist you in making an appropriate choice.
Beware of shedders: Trees that dump leaves, seedpods, nuts, or cotton end
up being a lot of extra work unless you’re prepared to handle it. Examples
include poplar, mulberry, cottonwood, willow, sweetgum, eucalyptus, horse
chestnut, and black walnut.222 Part III: Stretching Your Garden Beyond Its Boundaries 17_037492 ch11.qxp  12/26/06  9:05 PM  Page 222Consider shade density. If you prefer dappled shade (and hope to grow lawn
underneath the tree), choose a tree with a lighter canopy. Examples include
ash, birch, honeylocust, and linden. If you want dense shade, go for an oak 
or maple.
If your summers are long and hot, you don’t want a new young tree to just
dry out and die, even despite extra water and attention from you. Your ideal
