# 1. Preparation

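* Install Python 3.8 or later. At the time of writing, the latest version is [3.11](https://www.python.org/downloads/release/python-3111/).
* This example uses JupyterLab as the development environment, so install [JupyterLab](https://jupyter.org/) on your machine.
* [Register an OpenAI account](https://platform.openai.com) and set the `OPENAI_API_KEY` environment variable.
* We use Redis to store the contents of the loaded books, so a Redis service must be deployed. Unlike the plain Redis typically used by web applications, this example needs redis-stack, which bundles the search module that vector queries rely on. The command below exposes Redis on host port 10001 and the RedisInsight UI on host port 13333: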
  ```bash
  docker run -d -p 13333:8001 -p 10001:6379 redis/redis-stack:latest
  ```
* Then install the required Python packages:
  ```bash
  pip install openai
  pip install langchain
  pip install redis
  pip install unstructured
  ```
  
* Install [pandoc](https://github.com/jgm/pandoc/releases), which is needed to load EPUB e-books.
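
Before running the notebook, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of the original notebook: the helper name `missing_prerequisites` is hypothetical, and it only checks the environment variable and the `pandoc` executable, not the Redis service or the Python packages.

```python
import os
import shutil


def missing_prerequisites(env=os.environ, which=shutil.which):
    """Return a list of problems; an empty list means both checks passed.

    `env` and `which` are injectable so the check can be exercised
    without touching the real environment.
    """
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if which("pandoc") is None:
        problems.append("pandoc was not found on PATH")
    return problems


# Report anything that still needs to be set up on this machine.
for problem in missing_prerequisites():
    print("Missing:", problem)
```

If nothing is printed, both checks passed.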

In [1]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.redis import Redis
from langchain.document_loaders import TextLoader
from langchain.docstore.document import Document
from langchain.document_loaders import UnstructuredEPubLoader
import os
from langchain import OpenAI, VectorDBQA
from langchain.agents.agent_toolkits import (
    create_vectorstore_agent,
    VectorStoreToolkit,
    VectorStoreRouterToolkit,
    VectorStoreInfo,
)

# 2. Load the EPUB book contents

In [2]:
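book_dir = 'resources/epub'  # renamed from `dir` to avoid shadowing the built-in
data = {}       # file name -> every element returned by the loader
documents = {}  # file name -> narrative-text elements only
for f in os.listdir(book_dir):
    path = book_dir + '/' + f
    if os.path.isfile(path):
        print(path)
        # mode='elements' returns one Document per element, tagged with a category
        loader = UnstructuredEPubLoader(path, mode='elements')
        data[f] = loader.load()

# Keep only body paragraphs; elements of any other category are dropped.
for book in data.keys():
    documents[book] = []
    for seg in data[book]:
        cat = seg.metadata['category']
        if cat == 'NarrativeText':
            documents[book].append(seg)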

resources/epub/California - Sara Benson.epub
resources/epub/LP_台湾_en.epub
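
The filter above keeps only `NarrativeText` elements. To get a feel for how much is discarded, one option is to tally the categories first. This is a minimal sketch using stand-in objects: `category_counts` is a hypothetical helper, and the sample elements are made up for illustration. In the notebook, `data[book]` would be passed instead.

```python
from collections import Counter
from types import SimpleNamespace


def category_counts(elements):
    """Count elements per metadata['category'] value."""
    return Counter(el.metadata.get("category", "unknown") for el in elements)


# Stand-in elements for illustration only; they mimic the .metadata
# attribute of the Documents returned by the loader.
sample = [
    SimpleNamespace(page_content="Sun Moon Lake", metadata={"category": "Title"}),
    SimpleNamespace(page_content="The lake sits in the hills.", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Boats leave from the pier.", metadata={"category": "NarrativeText"}),
]

counts = category_counts(sample)
print(counts.most_common())
```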


# 3. Store the data in Redis as embedding vectors

In [4]:
redis_url='redis://localhost:10001'
embeddings=OpenAIEmbeddings()
for book in documents.keys():
    rds = Redis.from_documents(documents[book],embeddings,redis_url=redis_url,index_name=book)
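
Storing each passage as an embedding vector is what makes the similarity query in the next section possible: the query is embedded too, and the stored vectors closest to it are returned. The sketch below illustrates the idea in plain Python with made-up 3-dimensional vectors and cosine similarity. It is an assumption for illustration only; the source does not show which distance metric or index settings the Redis index actually uses, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for embeddings of three passages.
stored = {
    "lake": [0.9, 0.1, 0.0],
    "hot spring": [0.6, 0.6, 0.1],
    "night market": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], stored, k=2)
for text, score in results:
    print(f"{text}: {score:.3f}")
```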

# 4. Try a similarity search

In [12]:
rds = Redis.from_existing_index(embeddings, redis_url=redis_url, index_name='LP_台湾_en.epub')
query = '台湾日月潭'  # "Sun Moon Lake, Taiwan"
rds.similarity_search(query)

[Document(page_content='鯉魚潭;\r\nLǐyú Tán), a pretty willow-lined pond with a lush green mountain\r\nbackdrop, you’ll find hot springs', metadata={'source': 'resources/epub/LP_台湾_en.epub', 'page_number': 1, 'category': 'NarrativeText'}),
 Document(page_content='(明月溫泉 Míngyuè Wēnquán;  2661 7678; www.fullmoonspa.net; 1 Lane\r\n85, Wulai St; unlimited time public pools NT$490) One of\r\nthe more stylish hotels along the tourist street, Full Moon has mixed\r\nand nude segregated pools with nice views over the Tongshi River. Its\r\nprivate rooms feature wooden tubs. The hotel also offers rooms for\r\novernight stays from NT$2700. Go for the lower cheaper rooms as the\r\nviews are surprisingly better than higher up.', metadata={'source': 'resources/epub/LP_台湾_en.epub', 'page_number': 1, 'category': 'NarrativeText'}),
 Document(page_content='7 Sun Moon Lake (Click\r\nhere) is the largest body of water in Taiwan and boasts a\r\nwatercolour background ever changing with the season and light. Al

# 5. Build a question-answering bot over the stored travel-book data with LangChain's vectorstore agent

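`VectorStoreRouterToolkit` takes several books as input and switches to the most suitable one for each user question. An alternative is `VectorStoreToolkit(vectorstore_info=vectorstore_info)`. The idea behind that toolkit is to store all the books under a single index, so that the bot draws on relevant content from every book when answering. In addition, if the user asks for sources, the bot extracts the `source` field from the metadata and includes it in the reply.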
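
For the single-index alternative, the per-book lists first have to be combined into one list, with every document still carrying a `source` so the bot can cite it. The following is a minimal sketch of that preparation step only. `merge_books` and the index name `all_books` are hypothetical, and the stand-in objects merely mimic the `.metadata` attribute of the loaded documents.

```python
from types import SimpleNamespace


def merge_books(documents_by_book):
    """Flatten {book: [doc, ...]} into one list, ensuring each doc has a 'source'."""
    merged = []
    for book, docs in documents_by_book.items():
        for doc in docs:
            # Fall back to the file name only if no source is recorded yet.
            doc.metadata.setdefault("source", book)
            merged.append(doc)
    return merged


# Stand-in documents for illustration only; in the notebook this would be `documents`.
sample = {
    "a.epub": [SimpleNamespace(page_content="x", metadata={})],
    "b.epub": [SimpleNamespace(page_content="y", metadata={"source": "resources/epub/b.epub"})],
}
all_docs = merge_books(sample)
print([doc.metadata["source"] for doc in all_docs])

# The merged list could then be stored under one index, mirroring the call in section 3:
# rds_all = Redis.from_documents(all_docs, embeddings, redis_url=redis_url, index_name="all_books")
```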

In [90]:
llm = OpenAI(temperature=0)
rdss = {}
infos=[]
for book in documents.keys():
    rdss[book] = Redis.from_existing_index(embeddings, redis_url=redis_url, index_name=book)
    vectorstore_info = VectorStoreInfo(
        name="hotest_travel_advice_about_"+ book,
        description="the best travel advice about " + book,
        vectorstore=rdss[book]
    )
    infos.append(vectorstore_info)

# VectorStoreRouterToolkit takes several books as input and routes each question to the most suitable one.
toolkit = VectorStoreRouterToolkit(vectorstores=infos, llm=llm)
agent_executor = create_vectorstore_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True)
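
The tool names built above embed the full file name, including spaces and the `.epub` extension. If shorter, more uniform identifiers are preferred, one option is to derive the name from the file stem. This is an optional sketch, not what the notebook does: `tool_name_for` and its default prefix are hypothetical, and the outputs shown in the following sections use the original names.

```python
import os
import re


def tool_name_for(book_filename, prefix="travel_advice_about_"):
    """Build an identifier-like tool name from a book file name."""
    stem, _ext = os.path.splitext(book_filename)
    # Collapse each run of non-word characters (spaces, hyphens) into one underscore.
    slug = re.sub(r"\W+", "_", stem).strip("_")
    return prefix + slug


for name in ["California - Sara Benson.epub", "LP_台湾_en.epub"]:
    print(tool_name_for(name))
```

In Python 3, `\W` treats CJK characters as word characters, so the `台湾` part of the second file name is preserved.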


# 6. Inspect the prompt

In [91]:
print(agent_executor.agent.llm_chain.prompt)

input_variables=['input', 'agent_scratchpad'] output_parser=None partial_variables={} template='You are an agent designed to answer questions about sets of documents.\nYou have access to tools for interacting with the documents, and the inputs to the tools are questions.\nSometimes, you will be asked to provide sources for your questions, in which case you should use the appropriate tool to do so.\nIf the question does not seem relevant to any of the tools provided, just return "I don\'t know" as the answer.\n\n\nhotest_travel_advice_about_California - Sara Benson.epub: Useful for when you need to answer questions about hotest_travel_advice_about_California - Sara Benson.epub. Whenever you need information about the best travel advice about California - Sara Benson.epub you should ALWAYS use this. Input should be a fully formed question.\nhotest_travel_advice_about_LP_台湾_en.epub: Useful for when you need to answer questions about hotest_travel_advice_about_LP_台湾_en.epub. Whenever you n

# 7. Ask a question

In [98]:
# The question, in Chinese: "When is the best time to visit Sun Moon Lake? Please answer in Chinese."
resp = agent_executor.run("日月潭什么时候去旅游比较好，请用中文回答")
print(resp)

> Entering new AgentExecutor chain...
 I should use hotest_travel_advice_about_LP_台湾_en.epub to answer this question
Action: hotest_travel_advice_about_LP_台湾_en.epub
Action Input: 日月潭什么时候去旅游比较好
Observation:  The best time to visit Sun Moon Lake is during autumn and early spring (October to December and March to April). May has seasonal monsoon rains, and typhoons are a problem from June to September, though if there is no typhoon, you can certainly visit.
Thought: I now know the final answer
Final Answer: 日月潭最好的旅游时间是秋季和初春（10月到12月和3月到4月）。五月有季节性的季风雨，6月到9月有台风，但是如果没有台风，你也可以去旅游。

> Finished chain.
日月潭最好的旅游时间是秋季和初春（10月到12月和3月到4月）。五月有季节性的季风雨，6月到9月有台风，但是如果没有台风，你也可以去旅游。
