# Building a chatbot that finds answers across multiple documents

> YouTube: [빵형의 개발도상국](https://www.youtube.com/@bbanghyong)

- QA ChatBot
- LangChain
- ChatGPT (gpt-3.5-turbo)
- ChromaDB

> Reference: https://youtu.be/3yPBVii7Ct0

In [1]:
# !pip install -q langchain openai tiktoken chromadb

## Multiple documents

> 21 TechCrunch articles

In [4]:
# https://github.com/kairess/toy-datasets/raw/master/techcrunch-articles.zip => download the archive from this link and extract it manually into an `articles` folder, or uncomment the commands below

#!wget -q https://github.com/kairess/toy-datasets/raw/master/techcrunch-articles.zip
#!unzip -q techcrunch-articles.zip -d articles
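
On systems without `wget` and `unzip` (the Windows-style paths in the outputs further down suggest the notebook was run on one), the same two steps can be done with Python's standard library. This is a minimal sketch; the helper names `download_file` and `extract_zip` are illustrative and not part of the original notebook.

```python
import urllib.request
import zipfile
from pathlib import Path

ZIP_URL = "https://github.com/kairess/toy-datasets/raw/master/techcrunch-articles.zip"


def download_file(url, dest):
    """Download url to dest unless the file already exists (hypothetical helper)."""
    if not Path(dest).exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_zip(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return the member names."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

Calling `extract_zip(download_file(ZIP_URL, "techcrunch-articles.zip"), "articles")` should be equivalent to the two commented-out shell commands.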

## Setting up LangChain

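Create an OpenAI API key at the link below and expose it to the notebook as the `OPENAI_API_KEY` environment variable.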

https://platform.openai.com/account/api-keys

In [1]:
import os

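# The key must already be set in the environment; fail early if it is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")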

In [2]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader

## Load and process multiple documents

In [3]:
# For a single file, use: loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader('./articles/', glob="*.txt", loader_cls=TextLoader, loader_kwargs={'encoding': 'utf-8'})

documents = loader.load()

len(documents)

FileNotFoundError: Directory not found: './articles/'
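
The `FileNotFoundError` above means `./articles/` did not exist when the cell ran, so the archive has to be extracted before loading. To make clear what the loading step produces, here is a standard-library sketch of roughly the same idea: read every `*.txt` file in a folder as UTF-8 and keep the file path as the source. `load_text_files` is a hypothetical helper, not LangChain's implementation, and it returns plain dicts rather than `Document` objects.

```python
from pathlib import Path


def load_text_files(directory, pattern="*.txt"):
    """Read matching files as UTF-8 and return one record per file."""
    root = Path(directory)
    if not root.is_dir():
        # Fail with an explicit message when the folder is missing.
        raise FileNotFoundError(f"Directory not found: {directory!r}")
    records = []
    for path in sorted(root.glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records
```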

## Split texts

In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

len(texts)

233
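
`chunk_size=1000` and `chunk_overlap=200` mean that each chunk holds at most 1,000 characters and shares up to 200 characters with its neighbor, so text near a boundary appears in both chunks. The sketch below shows only that sliding-window idea with a hypothetical `split_with_overlap` function. `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, which this simplified version does not do.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows whose start positions are chunk_size - chunk_overlap apart."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With a 10-character string, `chunk_size=4` and `chunk_overlap=2`, each chunk repeats the last two characters of the previous one.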

In [5]:
texts[2:4]

[Document(page_content='There’s truth to what Jayakrishnan’s expressing about pent-up demand. According to a recent McKinsey survey, supply chain companies had — and have — a strong desire for tools that deliver greater supply chain visibility. Sixty-seven percent of respondents to the survey say that they’ve implemented dashboards for this purpose, while over half say that they’re investing in supply chain visibility services more broadly.\n\nPando aims to meet the need by consolidating supply chain data that resides in multiple silos within and outside of the enterprise, including data on customers, suppliers, logistics service providers, facilities and product SKUs. The platform provides various tools and apps for accomplishing different tasks across freight procurement, trade and transport management, freight audit and payment and document management, as well as dispatch planning and analytics.', metadata={'source': 'articles\\05-03-ai-powered-supply-chain-startup-pando-lands-30m-i

## Create Chroma DB

1. Text -> Embeddings
2. Save the data to the `db` folder
3. Reset the DB object
4. Load the DB from the `db` folder

In [6]:
persist_directory = 'db'

embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory=persist_directory)


In [7]:
# Write the index to the `db` folder, then drop the in-memory object
vectordb.persist()
vectordb = None

In [8]:
# Load the DB from disk
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding)
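
The steps in this section amount to a save-and-reload round trip: embed each text once, write texts and vectors to a folder, discard the in-memory object, and read everything back without re-embedding. The sketch below mimics that flow with a JSON file. `toy_embed`, `save_index`, and `load_index` are hypothetical stand-ins; Chroma's on-disk format and the OpenAI embedding model are entirely different.

```python
import json
from pathlib import Path


def toy_embed(text):
    """Stand-in for a real embedding model: counts of the letters a-z."""
    lowered = text.lower()
    return [lowered.count(chr(ord("a") + i)) for i in range(26)]


def save_index(texts, directory):
    """Embed each text and write texts plus vectors to <directory>/index.json."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    records = [{"text": t, "vector": toy_embed(t)} for t in texts]
    index_file = folder / "index.json"
    index_file.write_text(json.dumps(records), encoding="utf-8")
    return index_file


def load_index(directory):
    """Read the stored records back without re-embedding anything."""
    index_file = Path(directory) / "index.json"
    return json.loads(index_file.read_text(encoding="utf-8"))
```

Persisting this way means the embedding cost is paid once; later sessions only need the load step.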

## Make a retriever

In [9]:
retriever = vectordb.as_retriever()

In [10]:
docs = retriever.get_relevant_documents("What is Generative AI?")

for doc in docs:
    print(doc.metadata["source"])

articles\05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
articles\05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
articles\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
articles\05-03-spawning-lays-out-its-plans-for-letting-creators-opt-out-of-generative-ai-training.txt


### Return k results

In [11]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

In [12]:
docs = retriever.get_relevant_documents("What is Generative AI?")

for doc in docs:
    print(doc.metadata["source"])

articles\05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
articles\05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
articles\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
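
`search_kwargs={"k": 3}` caps the retriever at the three closest chunks, instead of the four returned by the earlier call. Conceptually this is a nearest-neighbor lookup over embedding vectors. The sketch below ranks hand-made vectors by cosine similarity; the `top_k` helper and the toy vectors are illustrative assumptions, and a real vector store such as Chroma may use a different distance metric and an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """indexed is a list of (source, vector) pairs; return the k most similar sources."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [source for source, _ in ranked[:k]]
```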


## Make a chain

In [13]:
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True)
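
`chain_type="stuff"` is the simplest option: all retrieved chunks are placed ("stuffed") into a single prompt together with the question, and the model is called once. The sketch below builds such a prompt by hand. The template wording and the `build_stuff_prompt` name are assumptions for illustration and are not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Join all chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because everything goes into one prompt, the combined length of the `k` chunks has to fit within the model's context window, which is one reason to keep `k` and `chunk_size` modest.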

In [14]:
def process_llm_response(llm_response):
    """Print the answer, then the source file of each retrieved chunk."""
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response['source_documents']:
        print(source.metadata['source'])

## Query

In [15]:
query = "How much money did Pando raise?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

Pando raised $30 million in a Series B round, bringing its total raised to $45 million.


Sources:
articles\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
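
The same file is listed three times because all three retrieved chunks come from one article. If one line per article is preferred, the sources can be de-duplicated while keeping their ranking order. This is a sketch with a hypothetical `unique_sources` helper. It assumes only that each source document exposes a `metadata` dict with a `source` key, as the outputs above show, and it uses a small stand-in class so that it runs without LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document; only the metadata attribute is used."""
    metadata: dict = field(default_factory=dict)


def unique_sources(source_documents):
    """Return each source path once, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops repeats.
    return list(dict.fromkeys(doc.metadata["source"] for doc in source_documents))
```

In the notebook this would be called as `unique_sources(llm_response["source_documents"])`.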


In [16]:
llm_response

{'query': 'How much money did Pando raise?',
 'result': '\n\nPando raised $30 million in a Series B round, bringing its total raised to $45 million.',
 'source_documents': [Document(page_content='Signaling that investments in the supply chain sector remain robust, Pando, a startup developing fulfillment management technologies, today announced that it raised $30 million in a Series B round, bringing its total raised to $45 million.\n\nIron Pillar and Uncorrelated Ventures led the round, with participation from existing investors Nexus Venture Partners, Chiratae Ventures and Next47. CEO and founder Nitin Jayakrishnan says that the new capital will be put toward expanding Pando’s global sales, marketing and delivery capabilities.\n\n“We will not expand into new industries or adjacent product areas,” he told TechCrunch in an email interview. “Great talent is the foundation of the business — we will continue to augment our teams at all levels of the organization. Pando is also open to expl

In [17]:
query = "Who led the round in Pando?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Iron Pillar and Uncorrelated Ventures.


Sources:
articles\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles\05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


In [18]:
query = "What did Databricks acquire?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Databricks acquired Okera, a data governance platform focused on AI.


Sources:
articles\05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles\05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles\05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt


In [19]:
query = "What is Generative AI?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Generative AI is a type of artificial intelligence that uses algorithms and machine learning models to create new content or experiences. It can be incorporated into workflows and external apps to assist with tasks such as generating written or visual content.


Sources:
articles\05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
articles\05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
articles\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt


In [20]:
query = "Who is CMA?"
llm_response = qa_chain(query)
process_llm_response(llm_response)


CMA stands for Competition and Markets Authority. It is a regulatory body in the UK responsible for promoting competition and preventing anti-competitive activities in markets.


Sources:
articles\05-04-cma-generative-ai-review.txt
articles\05-04-cma-generative-ai-review.txt
articles\05-04-cma-generative-ai-review.txt
