# Chat with Your Data


In this example, we'll use a six-step retrieval-augmented generation (RAG) workflow to semantically search and chat over document files of different types.

1. Library installs

In [1]:
! pip install langchain --user
! pip install numpy --user
! pip install tiktoken --user
! pip install openai --user
! pip install pypdf --user
! pip install chromadb --user
! pip install lark --user



2. Set up the OpenAI API

In [2]:
import os 
import openai 

# get api key from system environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")


## Document Loading Step

In [3]:
from langchain.document_loaders import PyPDFLoader

def get_pdf_pages(path):
    return PyPDFLoader(path).load()

# Document is the basic unit that every loader produces;
# each PDF page is loaded as one Document object.
# the same process works for 80 different file types, but this example uses PDFs

print(get_pdf_pages("docs/arxiv.2310.08067.pdf")[0].page_content)

GameGPT: Multi-agent Collaborative Framework for
Game Development
Dake Chen
AutoGame Research
dk@autogame.aiHanbin Wang
X-Institute
wanghanbin@mails.x-institute.edu.cn
Yunhao Huo
University of Southern California
hhuo@usc.eduYuzhao Li
AutoGame Research
ram@autogame.aiHaoyang Zhang
AutoGame Research
17@autogame.ai
Abstract
The large language model (LLM) based agents have demonstrated their capacity
to automate and expedite software development processes. In this paper, we
focus on game development and propose a multi-agent collaborative framework,
dubbed GameGPT, to automate game development. While many studies have
pinpointed hallucination as a primary roadblock for deploying LLMs in production,
we identify another concern: redundancy. Our framework presents a series of
methods to mitigate both concerns. These methods include dual collaboration and
layered approaches with several in-house lexicons, to mitigate the hallucination
and redundancy in the planning, task identification, and i
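
As a rough mental model of what the loader hands back, each page can be pictured as a small object holding text plus metadata. The sketch below is a stand-in, not LangChain's actual class: `SimpleDocument` and `pages_to_documents` are hypothetical names, and the `source` and `page` keys simply mirror the metadata printed later in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(pages, source):
    """Wrap raw page strings, recording the source path and a 0-based page index."""
    return [
        SimpleDocument(page_content=text, metadata={"source": source, "page": i})
        for i, text in enumerate(pages)
    ]


# assumed example input: two pages of plain text from a made-up file name
docs = pages_to_documents(["first page text", "second page text"], "example.pdf")
print(docs[1].metadata)
```

Keeping the metadata next to the text is what later makes filtering by page possible.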

## Document Splitting Step

In [4]:
# this is an important step: semantically complete chunks of text need to stay together
# so that retrieval does not return fragments cut off mid-sentence

# LangChain provides many document splitters, including
# CharacterTextSplitter, MarkdownHeaderTextSplitter, TokenTextSplitter, SentenceTransformersTextSplitter,
# RecursiveCharacterTextSplitter, Language, NLTKTextSplitter, SpacyTextSplitter, and more

# for generic text, RecursiveCharacterTextSplitter is usually the best choice
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, # maximum chunk size; this splitter measures it in characters (other splitters may measure differently)
    chunk_overlap=0, # number of characters adjacent chunks share to preserve context
    separators=["\n\n", "\n", " ", ""] # tried in order: if it fails to split on one separator, it falls back to the next
)

split_docs = recursive_splitter.split_documents(get_pdf_pages("docs/arxiv.2310.08067.pdf"))
print(split_docs[0].page_content)

# Sometimes it's useful to use custom metadata to help the model understand the context of the document
# some splitters, like the markdown splitter, will automatically extract metadata from the document based 
# on the markdown headers

markdown_text = """
# This is a markdown header
This is some text
## This is a subheader
This is some more text
"""

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
        ("####", "Header 4"),
        ("#####", "Header 5"),
        ("######", "Header 6")
    ]
)

print(markdown_splitter.split_text(markdown_text)[0].metadata)
print(markdown_splitter.split_text(markdown_text)[1].metadata) # keeps the heading 1 metadata


GameGPT: Multi-agent Collaborative Framework for
Game Development
Dake Chen
AutoGame Research
dk@autogame.aiHanbin Wang
X-Institute
wanghanbin@mails.x-institute.edu.cn
Yunhao Huo
University of Southern California
hhuo@usc.eduYuzhao Li
AutoGame Research
ram@autogame.aiHaoyang Zhang
AutoGame Research
17@autogame.ai
Abstract
The large language model (LLM) based agents have demonstrated their capacity
to automate and expedite software development processes. In this paper, we
{'Header 1': 'This is a markdown header'}
{'Header 1': 'This is a markdown header', 'Header 2': 'This is a subheader'}
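
The separator fallback can be illustrated without LangChain. The function below is a simplified sketch of the idea only: `recursive_split` is a hypothetical name, it ignores `chunk_overlap`, and it is not the library's implementation.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order; a piece that is still too long is
    split again using the remaining separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # this piece alone is too long: fall back to finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("aaa bbb\n\nccc ddd eee", 8, ["\n\n", "\n", " ", ""])
print(chunks)  # ['aaa bbb', 'ccc ddd', 'eee']
```

The paragraph break is honored first, and only the paragraph that is still too long gets split on spaces.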


## Vector Embedding Step


In [5]:
from langchain.embeddings.openai import OpenAIEmbeddings
# Embeddings are a way to represent text as a vector of numbers
# that semantically represent the text. This is useful for comparing
# the similarity of two pieces of text, or for generating text that
# is similar to the input text.

embed = OpenAIEmbeddings()

# the dot product gives a quick similarity score between two embeddings
# (for unit-length vectors it equals cosine similarity)
import numpy as np
dogs_embed = embed.embed_query("I like dogs")
cats_embed = embed.embed_query("I like cats")
bicycles_embed = embed.embed_query("Bycicles are a form of transportation")

print(np.dot(dogs_embed, cats_embed))
print(np.dot(dogs_embed, bicycles_embed))

# the first pair (dogs/cats) is more similar than the second (dogs/bicycles), so its dot product is higher


0.917245426514217
0.756001034298859
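
A raw dot product depends on vector length as well as direction, so it only works as a similarity score when the vectors are normalized. The sketch below uses made-up two-dimensional vectors (not real embeddings) to show that the dot product of unit-length vectors matches cosine similarity; OpenAI documents its embeddings as normalized to length 1, which is the assumption behind using `np.dot` directly.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by both lengths, so only direction matters."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


# toy vectors chosen for easy arithmetic, not produced by any embedding model
a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot(a, b))                        # raw dot product, inflated by vector length
print(cosine_similarity(a, b))          # direction-only similarity
print(dot(normalize(a), normalize(b)))  # matches the cosine similarity
```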


## Vector Store Step

In [6]:
# Vectorstores
# Vectorstores are a way to store embeddings for a large amount of text
import os
import shutil
from langchain.vectorstores import Chroma

path = "docs/chroma/"
# remove existing vectorstore
if os.path.exists(path):
    shutil.rmtree(path)

db = Chroma.from_documents(
    documents=split_docs,
    embedding=embed,
    persist_directory=path
)

db.persist() # saves the vectorstore to disk

In [7]:
# you can now query semantic similarity of the documents
question = "Overview of GameGPT"
res = db.similarity_search(question, k=1) # k is the number of results to return
print(res[0].page_content)
print('---')
question = "What is task classification?"
res = db.similarity_search(question, k=1)
print(res[0].page_content)

address the limitations of the LLM and the temporal constraints of game development, we integrate
multiple agents with distinct roles into the framework. This integration aims to enhance precision
and scalability. The scalability aspect of GameGPT offers the potential to create games of medium to
large sizes. Moreover, GameGPT operates in a collaborative manner, exhibiting a dual collaboration
approach. Firstly, it involves cooperation between the LLMs and smaller expert models dedicated
---
tasks in game development. In order to perform the classification, a BERT model is employed to
effectively categorize each task. The BERT model has been trained with an in-house dataset. This
dataset contains data entries uniquely tailored to the tasks of game development. The input is a task
drawn from the predetermined list, while the output corresponds to the task’s designated category.
Identifying the argument involves another LLM. The agent provides a template that corresponds to the
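
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below shows that loop with a brute-force scan. `ToyVectorStore` and `toy_embed` are hypothetical, and the word-count "embedding" is only a placeholder with no semantic understanding; Chroma and a real embedding model do this far better.

```python
import math
from collections import Counter


def toy_embed(text):
    """Placeholder embedding: lowercase word counts (assumption for illustration)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    numerator = sum(a[w] * b[w] for w in set(a) & set(b))
    denominator = (
        math.sqrt(sum(v * v for v in a.values()))
        * math.sqrt(sum(v * v for v in b.values()))
    )
    return numerator / denominator if denominator else 0.0


class ToyVectorStore:
    """Stores (text, vector) pairs and ranks them against a query by brute force."""

    def __init__(self, texts):
        self.entries = [(text, toy_embed(text)) for text in texts]

    def similarity_search(self, query, k=1):
        query_vec = toy_embed(query)
        ranked = sorted(
            self.entries, key=lambda entry: cosine(query_vec, entry[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "a BERT model classifies each task",
    "multiple agents with distinct roles",
    "the game engine runs tests",
])
print(store.similarity_search("which model classifies a task", k=1))
```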


## Retrieval Step

In [8]:
# You can enforce diversity in the results by using the max_marginal_relevance_search function
# this database will not be saved to disk

minidb = Chroma.from_texts([
    "Rats are small",
    "Dogs are big",
    "Crocodiles are big",
    "Cats are small",
    "Ford fiestas are a type of car",
], embedding=embed)

print(minidb.similarity_search("big animals", k=2))

print(minidb.max_marginal_relevance_search(
    "big animals",
    k=2, # number of results actually returned
    fetch_k=5, # size of the candidate pool to choose from
))
# MMR first fetches the fetch_k most similar results, then returns the k of them
# that best balance relevance to the query with diversity among the results


[Document(page_content='Dogs are big'), Document(page_content='Crocodiles are big')]
[Document(page_content='Dogs are big'), Document(page_content='Ford fiestas are a type of car')]
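
The selection rule behind maximal marginal relevance can be sketched in a few lines. This is a simplified illustration under stated assumptions: `mmr` is a hypothetical function, the vectors are made up, and the scoring formula with a `lambda_mult` weight of 0.5 follows the common textbook form rather than any guaranteed library default.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, fetch_k, lambda_mult=0.5):
    """Return the indices of k documents picked by maximal marginal relevance."""
    # step 1: candidate pool = the fetch_k documents most similar to the query
    by_relevance = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_relevance[:fetch_k]

    # step 2: greedily pick the candidate that is relevant but not redundant
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# toy vectors: documents 0 and 1 are near-duplicates, document 2 points elsewhere
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.5, -0.8]]

print(mmr(query, docs, k=2, fetch_k=3))                   # [0, 2]
print(mmr(query, docs, k=2, fetch_k=3, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the result is the plain top-k by similarity, which is why the near-duplicate comes back.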


In [9]:
# You can filter manually by metadata
docs = db.similarity_search(
    "Summary of GameGPT",
    k=5,
    filter= {
        "page": 3
    }
)

print(docs[0].metadata, docs[0].page_content)

# You can infer desired metadata from the query itself
from langchain.llms import OpenAI
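import lark  # not used directly; imported to confirm it is installed, since the self-query parser depends on it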
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.schema import AttributeInfo

metadata_fields = [
    AttributeInfo(
        name="page",
        type="integer",
        description="the page of the document"
    )
]

retriever = SelfQueryRetriever.from_llm(
    OpenAI(temperature=0),
    db,
    "A scientific paper about GameGPT",
    metadata_fields,
    verbose=True
)
# a self-query retriever uses the LLM to split the question into a query and a filter:
# the filter applies to the metadata, the query to the text itself
# the query is not the literal question, but keywords extracted from it

docs = retriever.get_relevant_documents("First four pages, tell me about categorization")

for doc in docs:
    print(doc.metadata["page"], doc.page_content)

print('---')

docs = retriever.get_relevant_documents("seventh page")

for doc in docs:
    print(doc.metadata)

{'page': 3, 'source': 'docs/arxiv.2310.08067.pdf'} stages, the game engine testing engineer undertakes the execution of tasks and subsequently produces
a comprehensive result summary.
2.2 Multi-agent Framework
In GameGPT, each agent maintains a private memory system and can access the shared public
discussion to acquire the necessary information for guiding their decision-making process. For agent
iat time step t, this process can be formally represented as follows:
pθi(Oit|Mit, Pt), (1)
4 tasks in game development. In order to perform the classification, a BERT model is employed to
effectively categorize each task. The BERT model has been trained with an in-house dataset. This
dataset contains data entries uniquely tailored to the tasks of game development. The input is a task
drawn from the predetermined list, while the output corresponds to the task’s designated category.
Identifying the argument involves another LLM. The agent provides a template that corresponds to the
3 address t
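
The query/filter split can be pictured as a small structured object that is applied in two stages: filter on metadata first, then search the remaining text. Everything in the sketch below is hypothetical (`StructuredQuery`, `filter_chunks`, the sample chunks), and the decomposition shown is only one plausible reading of the question, not what the LLM is guaranteed to produce.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StructuredQuery:
    """Hypothetical result of splitting a question into search text and a metadata filter."""
    query: str
    metadata_filter: Callable[[dict], bool]


def filter_chunks(chunks, structured):
    """Keep only chunks whose metadata passes the filter; ranking by `query` would follow."""
    return [c for c in chunks if structured.metadata_filter(c["metadata"])]


# one plausible decomposition of "First four pages, tell me about categorization"
structured = StructuredQuery(
    query="categorization",
    metadata_filter=lambda meta: meta["page"] <= 4,
)

# made-up chunks standing in for the vector store contents
chunks = [
    {"metadata": {"page": 3}, "page_content": "placeholder text A"},
    {"metadata": {"page": 7}, "page_content": "placeholder text B"},
    {"metadata": {"page": 4}, "page_content": "placeholder text C"},
]

kept = filter_chunks(chunks, structured)
print([c["metadata"]["page"] for c in kept])  # [3, 4]
```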

In [10]:
# you can also use compression to extract the 
# most important semantic information from a document
# and cram more documents into the same context window
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = OpenAI(temperature=0)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=LLMChainExtractor.from_llm(llm),
    base_retriever=db.as_retriever(search_type="mmr") # optional search_type param (max marginal relevance)
)

print(compression_retriever.get_relevant_documents("What is GameGPT?")[0].page_content)

# LangChain offers many other retrievers,
# but they are out of scope for this example



• To address the hallucination and redundancy concerns of LLMs within game development,
several mitigations including dual collaboration and code decoupling are proposed.
•Empirical results demonstrate the GameGPT’s capability in effective decision-making and
decision-rectifying throughout the game development process.
2 GameGPT
2.1 Overview
The GameGPT framework is designed as a specialized multi-agent system for game development. To
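
The effect of compression is that only the parts of a retrieved chunk relevant to the question are passed on. The sketch below imitates that with simple keyword overlap instead of an LLM, so it is a swapped-in technique for illustration only; `extract_relevant_sentences` and the sample text are hypothetical.

```python
import re


def words(text):
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_relevant_sentences(text, query):
    """Keep only sentences that share a word longer than three characters with the query."""
    query_words = {w for w in words(query) if len(w) > 3}
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if query_words & words(s)]
    return " ".join(kept)


# made-up chunk: the middle sentence is irrelevant to the question
chunk = (
    "GameGPT is a multi-agent framework for game development. "
    "The authors thank their colleagues. "
    "GameGPT mitigates hallucination and redundancy."
)
compressed = extract_relevant_sentences(chunk, "What is GameGPT?")
print(compressed)
```

The shorter result leaves room for more chunks in the same context window, which is the point of the compression retriever.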


## Output Generation Step

In [16]:
# Using the RetrievalQA chain
# RetrievalQA answers each query independently; it does not keep conversation history

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(temperature=0)
chain = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever()
)

print(chain({"query": "What is GameGPT?"})["result"])

# you can also use a prompt template to help with prompt engineering

template = """
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:
"""

chain_prompt = PromptTemplate.from_template(template)

chain = RetrievalQA.from_chain_type(
    llm, 
    retriever=db.as_retriever(),
    return_source_documents=True, # returns the source documents that were used to answer the question
    chain_type_kwargs={
        "prompt": chain_prompt
    }
)

result = chain({"query": "What is GameGPT?"})
print(result["result"])
print(result["source_documents"][0].page_content)

# There are also multiple chain types you can use with this api 

chain_map_reduce = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(),
    chain_type="map_reduce",
)

# map_reduce runs the LLM over each retrieved document separately (map),
# then combines those intermediate answers into the final answer (reduce)

print(chain_map_reduce({"query": "What is GameGPT?"})["result"])


GameGPT is a specialized multi-agent system designed for game development. It integrates multiple agents with distinct roles to address the limitations of language models and the time constraints of game development. GameGPT operates in a collaborative manner, involving cooperation between language models and smaller expert models. It also employs approaches such as instruction tuning, code decoupling, and dual collaboration to address hallucination and redundancy concerns. The framework aims to enhance precision, scalability, and decision-making throughout the game development process.
GameGPT is a specialized multi-agent system for game development that integrates multiple agents with distinct roles to enhance precision and scalability. It employs a combination of approaches, including dual collaboration and code decoupling, to address hallucination and redundancy concerns. Thanks for asking!
• To address the hallucination and redundancy concerns of LLMs within game development,
seve
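
The difference between the default "stuff" chain type and `map_reduce` comes down to how many LLM calls are made and what each one sees. The sketch below makes that visible with a fake model that only records its prompts. `CountingLLM`, `stuff_answer`, `map_reduce_answer`, and the shortened template are all hypothetical stand-ins, not LangChain internals.

```python
class CountingLLM:
    """Hypothetical stand-in for a chat model that records every prompt it receives."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"answer {len(self.prompts)}"


# shortened template for illustration; the placeholders match the notebook's prompt
TEMPLATE = "{context}\nQuestion: {question}\nHelpful Answer:"


def stuff_answer(llm, chunks, question):
    """One call: every retrieved chunk is placed into a single prompt."""
    context = "\n\n".join(chunks)
    return llm(TEMPLATE.format(context=context, question=question))


def map_reduce_answer(llm, chunks, question):
    """One call per chunk (map), then one call over the partial answers (reduce)."""
    partial_answers = [
        llm(TEMPLATE.format(context=chunk, question=question)) for chunk in chunks
    ]
    combined = "\n\n".join(partial_answers)
    return llm(TEMPLATE.format(context=combined, question=question))


chunks = ["chunk one", "chunk two", "chunk three"]

stuff_llm = CountingLLM()
stuff_answer(stuff_llm, chunks, "What is GameGPT?")
print(len(stuff_llm.prompts))  # 1

map_reduce_llm = CountingLLM()
map_reduce_answer(map_reduce_llm, chunks, "What is GameGPT?")
print(len(map_reduce_llm.prompts))  # 4 (three map calls plus one reduce call)
```

Stuffing is cheaper but limited by the context window; map-reduce costs more calls but each prompt stays small.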

## Memory and Conclusion

In [20]:
# To add memory, we use ConversationBufferMemory, which stores the full message history
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True 
)
# then we pass this memory to a ConversationalRetrievalChain
from langchain.chains import ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=db.as_retriever(),
    memory=memory
)

print(qa({"question": "Give me an use case for GameGPT."})["answer"])
print(qa({"question": "Give me another one."})["answer"])


One potential use case for GameGPT is in the development of video games. GameGPT can be used as a specialized multi-agent system to assist game developers in various aspects of the game development process. It can help with decision-making, decision-rectifying, and addressing concerns such as hallucination and redundancy within the game development process. By integrating multiple agents with distinct roles, GameGPT aims to enhance precision and scalability in game development. It can also classify game genres and provide guidance based on the specific genre intended for the game. Overall, GameGPT offers the potential to improve the efficiency and effectiveness of game development.
The provided context does not explicitly mention other potential use cases for GameGPT. Therefore, it is unclear what other applications or domains GameGPT could be used for.
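
A follow-up such as "Give me another one." only makes sense together with the earlier turns, so a conversational chain typically rewrites it into a standalone question before retrieval. The sketch below shows how a buffer of past turns could feed that rewriting prompt. `BufferMemory` and `build_condense_prompt` are hypothetical, and the prompt wording is illustrative rather than the exact text LangChain uses.

```python
class BufferMemory:
    """Hypothetical conversation buffer: keeps every (question, answer) turn in order."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def build_condense_prompt(memory, follow_up):
    """Build a prompt asking a model to rewrite the follow-up as a standalone question."""
    return (
        "Given the conversation below and a follow-up question, "
        "rephrase the follow-up as a standalone question.\n\n"
        f"Chat history:\n{memory.as_text()}\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


memory = BufferMemory()
memory.save("Give me a use case for GameGPT.", "Assisting video game development.")

prompt = build_condense_prompt(memory, "Give me another one.")
print(prompt)
```

The standalone question, not the raw follow-up, is what gets sent to the retriever, so the quality of retrieval on later turns depends on this rewriting step.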
