In [1]:
import dotenv
import os
import pandas as pd
import sys

from datetime import datetime
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.messages import HumanMessage, SystemMessage
from pathlib import Path
from tqdm.notebook import tqdm

# Set `PYTHONPATH` to include the project's home directory. 
PROJECT_HOME = Path(os.environ.get('PROJECT_HOME', Path.cwd() / '..')).resolve()
sys.path.append(str(PROJECT_HOME))

from app.databases.vector import VectorDB
from app.indexing.text.context_aware import ContextAwareIndexing
from app.indexing.metadata import DocumentMetadata
from app.models.inference.openai_model import ChatOpenAI
from app.server.llm import LLMAgent, LLMEventType

In [2]:
# Load and set environment
dotenv.load_dotenv()
os.environ['USER_AGENT'] = 'myagent'
os.environ['LLM_MODEL_ID'] = 'anthropic.claude-3-5-sonnet-20240620-v1:0'

----

# The Problem

Here we're initializing a vector DB with the NYC CPC (City Planning Commission) Reports. The reports can be found [here](https://a030-cpc.nyc.gov/html/cpc/index.aspx). Since there are a lot of reports, going all the way back to 1938, we use only the reports for Midtown, Manhattan, from 2010 up to November 2024.

You can download the reports yourself, or use [this link](TODO) to download a zip file with the reports that we've used.

In [3]:
# Set the vector DB collection to use and the directory with the files to load
os.environ['DEFAULT_VECTOR_DB_COLLECTION_NAME'] = 'cch_nyc_city_planning'
docs_path = PROJECT_HOME / 'data' / 'nyc-planning'

## Loading the Data

We'll start by loading all the PDFs as-is, without adding any context to the chunks.

In [4]:
# We use a function, because we're going to reuse it in the next section.

async def load_all_pdfs(path_to_load: Path, vector_db: VectorDB) -> None:
    """Load all PDFs in `path_to_load` into `vector_db`"""
    
    # Go over all the files in the data directory
    for pdf_path in tqdm(list(path_to_load.glob('*.pdf'))):
    
        # Load the file.
        # `str(...)` keeps this compatible with loader versions that expect a string path.
        loader = PyPDFLoader(str(pdf_path))
        docs = list(loader.lazy_load())
    
        # Store the file in the vector DB.
        await vector_db.split_and_store_text(docs, metadata=DocumentMetadata(
    
            # In our case the filename is unique, but in other cases you may want to choose a different ID and name.
            source_id=pdf_path.stem,
            source_name=pdf_path.name,
            
            # Note that `modified_at` is the time the file was last modified according to your local file system.
            # In real life you may want to use the date of the report. We won't go into it here.
            modified_at=datetime.fromtimestamp(os.path.getmtime(pdf_path)),
        ))

In [5]:
# Drop the existing collection, if necessary.
vector_db = VectorDB(drop_old=True)
await load_all_pdfs(docs_path, vector_db)

Connecting to Chroma database at chromadb:8000


  0%|          | 0/110 [00:00<?, ?it/s]

## Chatting with our Bot
----

To see the problem, ask about something that's buried deep in the documents and has no identifying context nearby.

For example, you can try asking: 
> Can you give a list of environmental reviews and the associated projects?

The `Environmental Reviews` are usually in the middle of the document and usually don't contain information about the project name or address. At most, the model can infer the project ID from the file name, which is stored in the metadata.
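
To make the failure mode concrete, here is a toy sketch. The report text is made up, and the paragraph-based split is only a stand-in for whatever splitting `VectorDB` actually does. It shows that a chunk taken from the middle of a document can describe the environmental review without ever naming the project.

```python
# Hypothetical report text, for illustration only (not taken from a real CPC report).
toy_report = "\n\n".join([
    "Project: Example Tower at 1 Sample Street (hypothetical).",
    "The applicant seeks a special permit for additional floor area.",
    "ENVIRONMENTAL REVIEW: The application was reviewed pursuant to SEQRA.",
])

# Naive split: one chunk per paragraph, with no shared context between chunks.
chunks = toy_report.split("\n\n")

# The chunk a retriever would likely return for a question about environmental reviews.
review_chunk = next(c for c in chunks if "ENVIRONMENTAL REVIEW" in c)

# The project name lives in a different chunk, so the retrieved text alone cannot answer
# "which project is this review for?".
print("Example Tower" in review_chunk)  # prints: False
```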

In [6]:
async def chat_with_rag_bot(user_config: dict, print_debug_info: bool = False) -> None:
    """Facilitates a conversation with our RAG ChatBot.

    :param `user_config`: Mainly used to store the conversation ID and keep history of previous conversations.
    :param `print_debug_info`: If `True`, will print debug messages, such as the retrieval question and results. Defaults to `False`.
    """
    async with LLMAgent() as llm_agent:
        while True:
            # Make a pretty separator for the next human message
            HumanMessage('').pretty_print()
            
            # Get user question.
            message = input('Write a question to the LLM. Write `q` or `quit` to exit:')
            
            # System output
            SystemMessage('').pretty_print()
    
            # Check if user wants to quit
            if message.lower() in ('q', 'quit'):
                print('Quitting for now. You can continue the conversation by rerunning this cell')
                break
    
            # Get ChatBot's response.
            async for chat_msg in llm_agent.astream_events(message, user_config):
                if chat_msg.type == LLMEventType.CHAT_CHUNK:
                    print(chat_msg.content, end='')
                elif print_debug_info:
                    print(chat_msg.to_dict())
    
            # New line.
            print()

In [7]:
# To start a new conversation (delete history), change the `thread_id` to a new value and rerun the cell.
user_config = {"configurable": {"thread_id": "user_20_000"}}

await chat_with_rag_bot(user_config, print_debug_info=False)

Connecting to Chroma database at chromadb:8000




Write a question to the LLM. Write `q` or `quit` to exit: Can you give a list of environmental reviews and the associated projects?








Certainly! To answer your question about environmental reviews and their associated projects, I'll need to search our company's internal information. Let me use the Internal Company Info Retriever to find this information for you.

Based on the information retrieved, I can provide you with a list of environmental reviews and their associated projects. Here are the sources I'll be using:

[1] - 100049 | 100049.pdf | 2024-11-13T10:02:19.399237 | (0: This application (C 100049 ZSM), in conjunction with the applications for the related actions (C 100047 ZMM, N 100048 ZRM, C 100050 ZSM, and C 100237 PQM) was reviewed pursuant to the New York State Environmental Quality Review Act (SEQRA), and the SEQRA regulations)

[2] - 210415 | 210415.pdf | 2024-11-13T09:57:08.701722 | (0: The application (C 210415 ZSM), in conjunction with the applications for the related actions (C 210412 ZSM, C 210413 ZSM, C 210414 ZSM, N 210416 ZRM, and C 210417 PPM) was reviewed pursuant to the New York State Enviro

Write a question to the LLM. Write `q` or `quit` to exit: q




Quitting for now. You can continue the conversation by rerunning this cell


----

# Adding Contextual Chunk Headers

To solve the problem, we'll use Contextual Chunk Headers. That is, we'll add context to each chunk stored in the vector DB, so that the LLM can tell which document and project the chunk relates to.

In our case, we'll first ask an LLM to summarize the document in a few sentences, then we'll store this summary alongside each chunk.
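
The idea can be sketched in a few lines. This is only an illustration under stated assumptions, not how `ContextAwareIndexing` is implemented: `summarize_stub` and `add_chunk_headers` are hypothetical helpers, the header layout is made up, and the stub simply returns the first sentence where the real pipeline would call an LLM.

```python
def summarize_stub(text: str) -> str:
    """Stand-in for the LLM summarization call: returns only the first sentence."""
    return text.split(".")[0].strip() + "."


def add_chunk_headers(source_name: str, summary: str, chunks: list[str]) -> list[str]:
    """Prepend the same document-level header to every chunk."""
    header = f"Document: {source_name}\nSummary: {summary}"
    return [f"{header}\n\n{chunk}" for chunk in chunks]


# Hypothetical report, already split into chunks (illustration only).
toy_chunks = [
    "Project: Example Tower at 1 Sample Street (hypothetical). The applicant seeks a special permit",
    "ENVIRONMENTAL REVIEW: The application was reviewed pursuant to SEQRA.",
]

summary = summarize_stub(" ".join(toy_chunks))
contextual_chunks = add_chunk_headers("example.pdf", summary, toy_chunks)

# The environmental review chunk now carries the project name in its header.
print(contextual_chunks[1])
```

Because the header is embedded and stored together with the chunk text, a chunk from the middle of a report can be matched to its project at retrieval time. The trade-off is one extra summarization call per document and slightly longer chunks.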

In [8]:
# Create a different collection for the data with the context.
# This will make it easier to go back and forth between the different collections.
os.environ['DEFAULT_VECTOR_DB_COLLECTION_NAME'] = 'cch_nyc_city_planning_with_context'

In [9]:
# Two things to note:
# 1. We use the `ContextAwareIndexing` strategy. You can look at the code that implements it to see all the details of how it works.
# 2. We use `gpt-3.5-turbo` to summarize the texts. You can of course choose a different model, depending on your use case. For us, we try to use a fast model that's good enough(TM).
vector_db = VectorDB(
    drop_old=True,
    split_strategy=ContextAwareIndexing(chat_model=ChatOpenAI(model_id='gpt-3.5-turbo')),
)

# Use the same function to load the data, the difference is in the strategy used by `vector_db`.
await load_all_pdfs(docs_path, vector_db)

Connecting to Chroma database at chromadb:8000


  0%|          | 0/110 [00:00<?, ?it/s]

## Chatting with our Bot
----

Same as before, but now the LLM should be able to answer your previous question using the additional context stored with each chunk.

Try asking again:
> Can you give a list of environmental reviews and the associated projects?

How is the answer different from the previous time?

In [11]:
# To start a new conversation (delete history), change the `thread_id` to a new value and rerun the cell.
# Note that we use a different `thread_id` from the previous one, to start "fresh".
user_config = {"configurable": {"thread_id": "user_30_000"}}

await chat_with_rag_bot(user_config, print_debug_info=False)

Connecting to Chroma database at chromadb:8000




Write a question to the LLM. Write `q` or `quit` to exit: Can you give a list of environmental reviews and the associated projects?




Certainly! Based on the information retrieved, I can provide you with a list of environmental reviews and their associated projects. Here's the list:

1. CEQR Number: 14DCP188M
   Project: Commercial building development in Manhattan (Green 317 Madison LLC and Green 110 East 42nd LLC)
   Associated Applications: C 150128 ZSM, C 140440 MMM, N 150127 ZRM, C 150129 ZSM, C 150130 ZSM, and C 150130(A) ZSM
   Lead Agency: City Planning Commission

2. CEQR Number: 21DCP057M
   Project: 175 Park Avenue (Commodore Owner, LLC)
   Associated Applications: C 210415 ZSM, C 210412 ZSM, C 210413 ZSM, C 210414 ZSM, N 210416 ZRM, and C 210417 PPM
   Lead Agency: City Planning Commission

3. CEQR Number: 13DCP011M
   Project: East Midtown Rezoning
   Associated Applications: N 130247(A) ZRM, C 130248 ZMM, and N 130247 ZRM
   Lead Agency: City Planning Commission

4. CEQR Number: 16DCP136M
   Project: Theater Subdistrict Fund Text Amendment
   Associated Application: N 160254 ZRM (inferred from the con

Write a question to the LLM. Write `q` or `quit` to exit: q




Quitting for now. You can continue the conversation by rerunning this cell
