# Building a Chatbot That Finds Answers Across Multiple Documents

- QA ChatBot
- LangChain
- ChatGPT (gpt-3.5-turbo)
- ChromaDB

In [1]:
!pip install -q langchain openai tiktoken chromadb

## Setting up LangChain

In [8]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader

from dotenv import load_dotenv
load_dotenv()

True

In [9]:
loader = DirectoryLoader('./assets/techcrunch-articles', glob="*.txt", loader_cls=TextLoader)

documents = loader.load()

len(documents)

21

In [10]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

len(texts)

233
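To make `chunk_size=1000` and `chunk_overlap=200` concrete, here is a minimal sketch of overlapping fixed-size windows. It is a simplification, not the `RecursiveCharacterTextSplitter` algorithm: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks vary in length. The helper name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 2,500-character sample document.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.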

In [5]:
texts[2:4]

[Document(page_content='April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023\n\nCalled ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”', metadata={'source': 'assets/techcrunch-articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt'}),
 Document(page_content='Called ChatGPT Business, 

## Create Chroma DB

1. Text -> Embeddings
2. Save the data to the `db` folder
3. Reset the DB object
4. Load the DB from the `db` folder
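The four steps above can be sketched without any external service. This is only an illustration of the embed, save, reset, reload cycle: `toy_embed` is a hypothetical stand-in for an embedding model (it just counts letters), and the JSON file is not how Chroma stores its data.

```python
import json
import tempfile
from pathlib import Path


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: counts of the letters a-z."""
    lowered = text.lower()
    return [float(lowered.count(chr(ord("a") + i))) for i in range(26)]


def save_store(records: list[dict], directory: Path) -> None:
    """Write the records to a JSON file inside the given directory."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "store.json").write_text(json.dumps(records), encoding="utf-8")


def load_store(directory: Path) -> list[dict]:
    """Read the records back from the JSON file."""
    return json.loads((directory / "store.json").read_text(encoding="utf-8"))


sample_texts = ["Pando raised $30 million", "Databricks acquired Okera"]

with tempfile.TemporaryDirectory() as tmp:
    db_dir = Path(tmp) / "db"
    # 1. Text -> embeddings
    store = [{"text": t, "vector": toy_embed(t)} for t in sample_texts]
    # 2. Save to the db folder
    save_store(store, db_dir)
    # 3. Reset the in-memory object
    store = None
    # 4. Load from the db folder
    store = load_store(db_dir)

print(len(store), len(store[0]["vector"]))
```

Resetting and reloading confirms that the data on disk is enough to rebuild the store, which is the point of steps 3 and 4.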

In [11]:
persist_directory = 'db'

embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(
    documents=texts, 
    embedding=embedding,
    persist_directory=persist_directory)

OperationalError: attempt to write a readonly database
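The notebook does not show what caused this error. One common reason for SQLite to report a read-only database is that the target directory is not writable by the current process. Under that assumption, a quick check like the following may help narrow it down; `can_write_to` is a hypothetical helper, not part of Chroma or LangChain.

```python
import os
import tempfile
from pathlib import Path


def can_write_to(directory: str) -> bool:
    """Return True if the directory, or its nearest existing parent when it is missing, is writable."""
    path = Path(directory).resolve()
    while not path.exists():
        path = path.parent  # walk up until we reach something that exists
    return path.is_dir() and os.access(path, os.W_OK | os.X_OK)


# Demonstration on a temporary directory; in the notebook the argument would be 'db'.
with tempfile.TemporaryDirectory() as tmp:
    print(can_write_to(tmp))
    print(can_write_to(os.path.join(tmp, "db")))
```

If the check passes, other causes are possible, for example another process or an earlier kernel session still holding the database files.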

In [8]:
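# Write the index to the `db` folder, then drop the in-memory object
# so the next cell has to reload it from disk.
vectordb.persist()
vectordb = None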

In [9]:
vectordb = Chroma(
    persist_directory=persist_directory, 
    embedding_function=embedding)

In [10]:
retriever = vectordb.as_retriever()

In [11]:
docs = retriever.get_relevant_documents("What is Generative AI?")

for doc in docs:
    print(doc.metadata["source"])

assets/techcrunch-articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
assets/techcrunch-articles/05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
assets/techcrunch-articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
assets/techcrunch-articles/05-03-spawning-lays-out-its-plans-for-letting-creators-opt-out-of-generative-ai-training.txt


## Make a retriever

In [12]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

In [13]:
docs = retriever.get_relevant_documents("What is Generative AI?")

for doc in docs:
    print(doc.metadata["source"])

assets/techcrunch-articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
assets/techcrunch-articles/05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
assets/techcrunch-articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
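The default retriever earlier returned four sources, while `search_kwargs={"k": 3}` returns three. Conceptually, the retriever ranks stored vectors by similarity to the query vector and keeps the top `k`. The sketch below assumes cosine similarity and a brute-force scan; Chroma's actual distance metric and index may differ, and the file names and vectors are invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector: list[float], indexed: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the sources of the k stored vectors most similar to the query."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [source for source, _ in ranked[:k]]


# Hypothetical two-dimensional "embeddings" keyed by source file.
indexed = [
    ("a.txt", [1.0, 0.0]),
    ("b.txt", [0.9, 0.1]),
    ("c.txt", [0.0, 1.0]),
    ("d.txt", [0.5, 0.5]),
]
print(top_k([1.0, 0.0], indexed, k=3))
```

A smaller `k` keeps the prompt shorter and cheaper, at the risk of leaving out a chunk that contained the answer.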


## Make a chain

In [14]:
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(), 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True)
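With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. A rough sketch of that idea follows. The prompt wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, documents: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks.
docs = [
    "Pando raised $30 million in a Series B round.",
    "Iron Pillar and Uncorrelated Ventures led the round.",
]
prompt = build_stuff_prompt("How much money did Pando raise?", docs)
print(prompt)
```

Because every chunk goes into one call, this approach only works while the combined text fits in the model's context window, which is one reason to keep `k` small.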


In [15]:
def process_llm_response(llm_response):
    """Print the answer, then the source file of each retrieved chunk."""
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

## Query

In [16]:
query = "How much money did Pando raise?"
llm_response = qa_chain(query)
process_llm_response(llm_response)


 Pando raised $30 million in its Series B round.


Sources:
assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
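The same file is listed three times because the three retrieved chunks all come from one article. If a shorter source list is preferred, the paths can be deduplicated while keeping their order. This is an optional sketch; `unique_sources` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def unique_sources(sources: list[str]) -> list[str]:
    """Drop repeated paths while preserving first-seen order (dict keys keep insertion order)."""
    return list(dict.fromkeys(sources))


# The three source paths printed above.
retrieved = [
    "assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt",
    "assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt",
    "assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt",
]
for path in unique_sources(retrieved):
    print(path)
```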


In [17]:
llm_response

{'query': 'How much money did Pando raise?',
 'result': ' Pando raised $30 million in its Series B round.',
 'source_documents': [Document(page_content='Signaling that investments in the supply chain sector remain robust, Pando, a startup developing fulfillment management technologies, today announced that it raised $30 million in a Series B round, bringing its total raised to $45 million.\n\nIron Pillar and Uncorrelated Ventures led the round, with participation from existing investors Nexus Venture Partners, Chiratae Ventures and Next47. CEO and founder Nitin Jayakrishnan says that the new capital will be put toward expanding Pando’s global sales, marketing and delivery capabilities.\n\n“We will not expand into new industries or adjacent product areas,” he told TechCrunch in an email interview. “Great talent is the foundation of the business — we will continue to augment our teams at all levels of the organization. Pando is also open to exploring strategic partnerships and acquisitio

In [18]:
query = "Who led the round in Pando?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Iron Pillar and Uncorrelated Ventures.


Sources:
assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
assets/techcrunch-articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


In [19]:
query = "What did Databricks acquire?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Databricks acquired Okera, a data governance platform with a focus on AI.


Sources:
assets/techcrunch-articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
assets/techcrunch-articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
assets/techcrunch-articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt


In [20]:
query = "What is Generative AI?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Generative AI is a type of artificial intelligence that is able to generate new content or experiences based on existing data or rules. It can be incorporated into workflows and applications to automate tasks and improve efficiency. 


Sources:
assets/techcrunch-articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
assets/techcrunch-articles/05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
assets/techcrunch-articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt


In [21]:
query = "Who is CMA?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 CMA stands for Competition and Markets Authority. It is a UK government agency responsible for promoting competition and fair markets for consumers and businesses.


Sources:
assets/techcrunch-articles/05-04-cma-generative-ai-review.txt
assets/techcrunch-articles/05-04-cma-generative-ai-review.txt
assets/techcrunch-articles/05-04-cma-generative-ai-review.txt
