# Chat with Your Data


In this example, we'll use the six-step method of retrieval-augmented generation (RAG) to semantically search, and chat over, document files of different types.

1. Library installs

In [25]:
! pip install langchain --user
! pip install numpy --user
! pip install tiktoken --user
! pip install openai --user
! pip install pypdf --user
! pip install chromadb --user


2. Set up the OpenAI API

In [None]:
import os 
import openai 

# get api key from system environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")
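
If the variable is unset, `os.getenv` returns `None`, and the problem tends to surface only later, when a client is constructed or a request is sent. A minimal sketch of a fail-fast check follows; `require_api_key` is a hypothetical helper written for this example, not part of the `openai` package.

```python
import os

def require_api_key(name="OPENAI_API_KEY", environ=None):
    """Return the API key stored under `name`, or raise if it is missing or blank."""
    # `environ` is injectable so the check can be exercised without touching os.environ
    environ = os.environ if environ is None else environ
    key = (environ.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return key
```

With this in place, the assignment above could be written as `openai.api_key = require_api_key()`.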


## Document Loading Step

In [5]:
from langchain.document_loaders import PyPDFLoader

def get_pdf_pages(path):
    return PyPDFLoader(path).load()

# Document is the atomic unit that every loader returns:
# each PDF page becomes one Document object.
# The same process works for 80 different file types, but this example uses PDFs.

print(get_pdf_pages("docs/arxiv.2310.08067.pdf")[0].page_content)

GameGPT: Multi-agent Collaborative Framework for
Game Development
Dake Chen
AutoGame Research
dk@autogame.aiHanbin Wang
X-Institute
wanghanbin@mails.x-institute.edu.cn
Yunhao Huo
University of Southern California
hhuo@usc.eduYuzhao Li
AutoGame Research
ram@autogame.aiHaoyang Zhang
AutoGame Research
17@autogame.ai
Abstract
The large language model (LLM) based agents have demonstrated their capacity
to automate and expedite software development processes. In this paper, we
focus on game development and propose a multi-agent collaborative framework,
dubbed GameGPT, to automate game development. While many studies have
pinpointed hallucination as a primary roadblock for deploying LLMs in production,
we identify another concern: redundancy. Our framework presents a series of
methods to mitigate both concerns. These methods include dual collaboration and
layered approaches with several in-house lexicons, to mitigate the hallucination
and redundancy in the planning, task identification, and i
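
To make the shape of a loaded page concrete, here is a small sketch of a Document-like object: a piece of text plus a metadata dictionary. `SimpleDocument` and `pages_to_documents` are hypothetical stand-ins for illustration, and the `source` and `page` metadata keys are assumptions rather than a statement of what any particular loader records.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def pages_to_documents(pages, source):
    """Wrap each page's text in a SimpleDocument, recording where it came from."""
    return [
        SimpleDocument(page_content=text, metadata={"source": source, "page": i})
        for i, text in enumerate(pages)
    ]

docs = pages_to_documents(["First page text", "Second page text"], "example.pdf")
print(docs[1].metadata)  # {'source': 'example.pdf', 'page': 1}
```

Because every file type is reduced to the same text-plus-metadata shape, the later steps do not need to know which loader produced a given chunk.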

## Document splitting step

In [14]:
# This step matters: chunks should keep semantically complete passages together
# so that they do not end in incomplete sentences.

# LangChain ships many document splitters, including
# CharacterTextSplitter, MarkdownHeaderTextSplitter, TokenTextSplitter, SentenceTransformersTextSplitter,
# RecursiveCharacterTextSplitter, Language, NLTKTextSplitter, SpacyTextSplitter, and more

# for generic text, it's usually best to use RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, # maximum chunk length; units vary by splitter, and here it is characters
    chunk_overlap=0, # how many characters neighboring chunks share to keep context
    separators=["\n\n", "\n", " ", ""] # tried in order: if a piece is still too long, the next separator is used
)

split_docs = recursive_splitter.split_documents(get_pdf_pages("docs/arxiv.2310.08067.pdf"))
print(split_docs[0].page_content)

# Custom metadata can help the model understand where a chunk sits in the document.
# Some splitters, like the markdown splitter, extract that metadata automatically
# from the markdown headers.

markdown_text = """
# This is a markdown header
This is some text
## This is a subheader
This is some more text
"""

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
        ("####", "Header 4"),
        ("#####", "Header 5"),
        ("######", "Header 6")
    ]
)

md_chunks = markdown_splitter.split_text(markdown_text)
print(md_chunks[0].metadata)
print(md_chunks[1].metadata) # the subheader chunk also keeps its parent "Header 1" metadata


GameGPT: Multi-agent Collaborative Framework for
Game Development
Dake Chen
AutoGame Research
dk@autogame.aiHanbin Wang
X-Institute
wanghanbin@mails.x-institute.edu.cn
Yunhao Huo
University of Southern California
hhuo@usc.eduYuzhao Li
AutoGame Research
ram@autogame.aiHaoyang Zhang
AutoGame Research
17@autogame.ai
Abstract
The large language model (LLM) based agents have demonstrated their capacity
to automate and expedite software development processes. In this paper, we
{'Header 1': 'This is a markdown header'}
{'Header 1': 'This is a markdown header', 'Header 2': 'This is a subheader'}
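
The separator fallback described in the comments above can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: `recursive_split` is a hypothetical function, it ignores `chunk_overlap`, and it drops the separator between two chunks.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Pieces are cut on the first separator and greedily merged back together
    while they fit. A piece that is still too long is split again with the
    next separator; the empty separator means "cut by length".
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the running chunk
            continue
        if current:
            chunks.append(current)       # flush the running chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks

sample = "alpha beta gamma\n\ndelta epsilon zeta eta theta"
print(recursive_split(sample, chunk_size=20))
# ['alpha beta gamma', 'delta epsilon zeta', 'eta theta']
```

The first paragraph fits in one chunk, so it is kept whole. The second is too long, so it falls through to the space separator and is split between words rather than mid-word.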


## Vector Embedding Step


In [26]:
from langchain.embeddings.openai import OpenAIEmbeddings
# Embeddings represent text as a vector of numbers that captures its meaning.
# This is useful for comparing the similarity of two pieces of text,
# or for retrieving text that is similar to an input query.

embed = OpenAIEmbeddings()

# the dot product makes it easy to compare the similarity of two pieces of text
import numpy as np
dogs_embed = embed.embed_query("I like dogs")
cats_embed = embed.embed_query("I like cats")
bicycles_embed = embed.embed_query("Bycicles are a form of transportation")

print(np.dot(dogs_embed, cats_embed))
print(np.dot(dogs_embed, bicycles_embed))

# the dog and cat sentences are closer in meaning than the dog and bicycle sentences,
# so the first dot product is higher


0.9171789969688378
0.756001034298859
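
A raw dot product is a fair similarity score only when the vectors have the same length. OpenAI's documentation describes its embeddings as normalized to length 1, in which case the dot product equals cosine similarity. The sketch below uses small made-up vectors (not real embeddings) to show that equivalence; the helper names are our own.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
print(dot(a, b))                          # raw dot product, inflated by vector length
print(cosine_similarity(a, b))            # length-independent similarity
print(dot(normalize(a), normalize(b)))    # same value once both vectors are unit length
```

The last two values agree (0.96, up to floating-point rounding), while the raw dot product of the unnormalized vectors is 24.0.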


## Vector Store Step
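
As a rough mental model, a vector store pairs each chunk's embedding with its text and returns the chunks whose embeddings are closest to a query embedding. The toy version below scans every stored vector, which a real store such as Chroma avoids by using an index. `ToyVectorStore` and its methods are hypothetical names for illustration, and the two-dimensional vectors are made up.

```python
import math

class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs, ranks by cosine similarity."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        """Store one embedding together with the text it represents."""
        self._items.append((list(vector), text))

    def search(self, query, k=1):
        """Return the texts of the k stored vectors most similar to `query`."""
        def cosine(a, b):
            num = sum(x * y for x, y in zip(a, b))
            den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return num / den
        ranked = sorted(self._items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add([1.0, 0.0], "about dogs")
store.add([0.9, 0.1], "about cats")
store.add([0.0, 1.0], "about bicycles")
print(store.search([1.0, 0.0], k=2))  # ['about dogs', 'about cats']
```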

In [27]:
# A vector store holds the embeddings for a large amount of text so they can be searched
import os
import shutil
from langchain.vectorstores import Chroma

path = "docs/chroma/"
# remove existing vectorstore
if os.path.exists(path):
    shutil.rmtree(path)
