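## 1. Read local PDFs and split them into chunks
Generate code that does the following:
1. Use langchain to read all PDF files in the local file_folder directory
2. Use langchain to split the PDF text into small chunks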

In [26]:
from IPython.display import clear_output
clear_output(wait=True)

In [28]:
# Package installation

# !pip3 install langchain
# !pip install pinecone-client

# !pip install --upgrade langchain
# !pip install tiktoken

import os
# print("PYTHONPATH:", os.environ.get('PYTHONPATH'))
# print("PATH:", os.environ.get('PATH'))

import sys
print(sys.version)
print(sys.executable)

3.11.3 (main, Apr  7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)]
/opt/homebrew/opt/python@3.11/bin/python3.11


In [30]:
import sys
sys.path.append('/opt/homebrew/lib/python3.11/site-packages')
# print(sys.path)

# %ls -lrt ./files/
# %pwd

In [32]:
from langchain.document_loaders import PyPDFDirectoryLoader

# Use PyPDFDirectoryLoader to read every PDF file in the local directory
file_folder = './files'
loader = PyPDFDirectoryLoader(file_folder)
# docs is a list of Document objects
docs = loader.load()

print(type(docs[0]))
print(docs[0])
print(len(docs))

<class 'langchain.schema.Document'>
page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\nWenxiang Jiao\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\nTencent AI Lab\nAbstract\nThis report provides a preliminary evaluation\nof ChatGPT for machine translation, includ-\ning translation prompt, multilingual transla-\ntion, and translation robustness. We adopt\nthe prompts advised by ChatGPT to trigger\nits translation ability and ﬁnd that the candi-\ndate prompts generally work well and show\nminor performance differences. By evalu-\nating on a number of benchmark test sets1,\nwe ﬁnd that ChatGPT performs competitively\nwith commercial translation products (e.g.,\nGoogle Translate) on high-resource European\nlanguages but lags behind signiﬁcantly on low-\nresource or distant languages. For distant\nlanguages, we explore an interesting strategy\nnamed pivot prompting that asks ChatGPT\nto translate the source sentence into a high-\nresource pivot language before i
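
`PyPDFDirectoryLoader` handles both file discovery and PDF parsing internally. As a rough illustration of the discovery step only, here is a minimal sketch using the standard library. The helper `find_pdfs` is hypothetical (it is not part of langchain), it assumes a recursive search that skips hidden files, and it does not parse any PDF content.

```python
import tempfile
from pathlib import Path


def find_pdfs(folder):
    """Return sorted paths of non-hidden .pdf files under folder (recursive).

    Hypothetical helper for illustration; not a langchain API.
    """
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*.pdf")
        if p.is_file() and not p.name.startswith(".")
    )


# Demo on a throwaway directory so the sketch runs without real PDFs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["a.pdf", "sub/b.pdf", "notes.txt", ".hidden.pdf"]:
        (root / name).write_bytes(b"")
    found = [p.relative_to(root).as_posix() for p in find_pdfs(root)]

print(found)
```

Pointing the same helper at `./files` would list the files the loader is expected to pick up, which can help when `len(docs)` is smaller or larger than expected.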

In [33]:
# Use langchain to split the PDF text into small documents

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
)
docs = text_splitter.split_documents(docs)
docs[0]

Document(page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\nWenxiang Jiao\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\nTencent AI Lab\nAbstract\nThis report provides a preliminary evaluation\nof ChatGPT for machine translation, includ-\ning translation prompt, multilingual transla-\ntion, and translation robustness. We adopt\nthe prompts advised by ChatGPT to trigger\nits translation ability and ﬁnd that the candi-\ndate prompts generally work well and show\nminor performance differences. By evalu-\nating on a number of benchmark test sets1,\nwe ﬁnd that ChatGPT performs competitively\nwith commercial translation products (e.g.,\nGoogle Translate) on high-resource European\nlanguages but lags behind signiﬁcantly on low-\nresource or distant languages. For distant\nlanguages, we explore an interesting strategy\nnamed pivot prompting that asks ChatGPT\nto translate the source sentence into a high-\nresource pivot language before into the target\nlanguage, w
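
To make the meaning of `chunk_size` and `chunk_overlap` concrete, here is a deliberately naive sketch of fixed-window splitting. It is not how `RecursiveCharacterTextSplitter` works internally: that class tries to break on separators such as paragraph and line boundaries before falling back to smaller units. The function `split_with_overlap` is a hypothetical name used only for this illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a
    simplified stand-in, not the langchain splitting algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.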

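## 2. Vectorize the content and store it in a vector database
1. Convert the documents into vectors through OpenAI's embedding API
2. Store the resulting vectors in the Pinecone vector database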

In [20]:
# API configuration for OpenAI and Pinecone
import os
import getpass

# PINECONE_API_KEY = getpass.getpass('Pinecone API Key:')
# PINECONE_ENV = getpass.getpass('Pinecone Environment:')
# os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

PINECONE_API_KEY='xx'
PINECONE_ENV='xx'
os.environ['OPENAI_API_KEY']='xx'
PINECONE_INDEX='xx'
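
Hardcoding keys in a notebook makes them easy to leak. One possible alternative, sketched below, reads a key from the environment first and only prompts when it is missing. The helper `get_secret` is hypothetical, and its `env` and `prompt` parameters exist only so the sketch can be exercised without real credentials.

```python
import os


def get_secret(name, env=None, prompt=None):
    """Return the value of `name` from env, else ask via prompt().

    Hypothetical helper: `env` defaults to os.environ, and `prompt`
    can be getpass.getpass for interactive use.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return value
    if prompt is None:
        raise KeyError(f"{name} is not set and no prompt was provided")
    return prompt(f"{name}: ")


# Non-interactive demo with a fake environment and a fake prompt.
from_env = get_secret("DEMO_KEY", env={"DEMO_KEY": "abc"})
from_prompt = get_secret("DEMO_KEY", env={}, prompt=lambda message: "typed")
print(from_env, from_prompt)
```

In the notebook this could be called as `get_secret('OPENAI_API_KEY', prompt=getpass.getpass)`, which matches the commented-out `getpass` lines above.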

In [34]:
# Convert the documents into vectors through OpenAI's embedding API and store them in Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
import pinecone

embeddings = OpenAIEmbeddings()

# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_ENV  # next to api key in console
)

index_name = PINECONE_INDEX

# Run on the first import only: the index needs to be populated just once
# docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)

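## 3. Search the vector database for content similar to the query, then pass it to GPT to answer

1. Use the similarity_search function to find content similar to the query
2. Use langchain's load_qa_chain function, passing in the query and the retrieved content, to get an answer grounded in the knowledge base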

In [35]:
# Query the vector database for similar documents
# if you already have an index, you can load it like this
docsearch = Pinecone.from_existing_index(index_name, embeddings)

query = "does chatgpt translates better than google translation?"
docs = docsearch.similarity_search(query, k=3)
print(docs)

[Document(page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\nWenxiang Jiao\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\nTencent AI Lab\nAbstract\nThis report provides a preliminary evaluation\nof ChatGPT for machine translation, includ-\ning translation prompt, multilingual transla-\ntion, and translation robustness. We adopt\nthe prompts advised by ChatGPT to trigger\nits translation ability and ﬁnd that the candi-\ndate prompts generally work well and show\nminor performance differences. By evalu-\nating on a number of benchmark test sets1,\nwe ﬁnd that ChatGPT performs competitively\nwith commercial translation products (e.g.,\nGoogle Translate) on high-resource European\nlanguages but lags behind signiﬁcantly on low-\nresource or distant languages. For distant\nlanguages, we explore an interesting strategy\nnamed pivot prompting that asks ChatGPT\nto translate the source sentence into a high-\nresource pivot language before into the target\nlanguage, 
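
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that ranking step in plain Python, assuming cosine similarity as the metric (the metric actually used depends on how the Pinecone index was created). The three-dimensional vectors are hand-made toy values, not real embeddings, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, items, k=3):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "index": (text, vector) pairs with made-up vectors.
items = [
    ("about translation", [1.0, 0.0, 0.0]),
    ("about cooking", [0.0, 1.0, 0.0]),
    ("translation quality", [0.9, 0.1, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], items, k=2)
print([text for _, text in results])  # ['about translation', 'translation quality']
```

A real vector database avoids scoring every stored vector by using an approximate nearest-neighbor index, but the returned ranking has the same meaning.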

In [36]:
from langchain.llms import OpenAI
# Initialize the OpenAI LLM that the QA chain will call
llm = OpenAI(openai_api_key=os.environ['OPENAI_API_KEY'], temperature=0)


In [37]:
from langchain.chains.question_answering import load_qa_chain
chain = load_qa_chain(llm, chain_type="stuff")
chain.run(input_documents=docs, question=query)

' Yes, ChatGPT performs competitively with commercial translation products (e.g., Google Translate) on high-resource European languages but lags behind significantly on low-resource or distant languages.'
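
The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. The sketch below illustrates that idea only. The function `build_stuff_prompt` is hypothetical and the template wording is an assumption for illustration, not necessarily the template langchain uses.

```python
def build_stuff_prompt(contexts, question):
    """Join all context chunks and the question into one prompt string.

    Illustrative only: the wording here is an assumed template.
    """
    context = "\n\n".join(contexts)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What do the chunks say?",
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits within the model's context window, which is one reason to keep `k` and `chunk_size` modest.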