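# Processing Knowledge-Base PDF Documents

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

Use pypdf to load a PDF into an array of documents, where each document contains the page content and metadata with the `page` number.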
```SHELL
pip install pypdf
```
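
As a rough sketch of what this loading step amounts to, the snippet below reads a PDF with pypdf's `PdfReader` and builds one record per page, with the text under `page_content` and the zero-based page index under `metadata`. The helper names `pages_to_records` and `load_pdf_records` are hypothetical, and plain dictionaries stand in for LangChain `Document` objects; `PyPDFLoader` itself may differ in detail.

```python
from typing import Iterable


def pages_to_records(page_texts: Iterable[str], source: str) -> list[dict]:
    """Build one record per page, numbering pages from 0."""
    return [
        {"page_content": text, "metadata": {"source": source, "page": number}}
        for number, text in enumerate(page_texts)
    ]


def load_pdf_records(path: str) -> list[dict]:
    """Extract the text of every page of the PDF at `path` (requires pypdf)."""
    from pypdf import PdfReader  # imported lazily so the helper above has no dependency

    reader = PdfReader(path)
    # Fall back to an empty string for pages where no text is extracted
    return pages_to_records((page.extract_text() or "" for page in reader.pages), path)
```

Calling `load_pdf_records` on a local PDF path would return a list whose length equals the number of pages in the file.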

## Loading a single PDF document

In [1]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "/Users/libing/kk_datasets/pdf/2312.04005.pdf"

loader = PyPDFLoader(file_path)
pages = loader.load_and_split()

pages

[Document(metadata={'source': '/Users/libing/kk_datasets/pdf/2312.04005.pdf', 'page': 0}, page_content='KOALA: Self-Attention Matters in Knowledge Distillation of Latent\nDiffusion Models for Memory-Efficient and Fast Image Synthesis\nYoungwan Lee1,2 Kwanyong Park1 Yoorhim Cho3 Yong-Ju Lee1 Sung Ju Hwang2\n1Electronics and Telecommunications Research Institute (ETRI), South Korea\n2Korea Advanced Institute of Science and Technology (KAIST), South Korea\n3Sookmyung Women’s University, South Korea\nproject page: https://youngwanlee.github.io/KOALA/\nAbstract\nStable diffusion is the mainstay of the text-to-image (T2I)\nsynthesis in the community due to its generation perfor-\nmance and open-source nature. Recently, Stable Diffusion\nXL (SDXL), the successor of stable diffusion, has received a\nlot of attention due to its significant performance improve-\nments with a higher resolution of 1024 × 1024 and a larger\nmodel. However, its increased computation cost and model\nsize require high

In [2]:
# An advantage of this approach is that documents can be retrieved by page number.
pages[0]

Document(metadata={'source': '/Users/libing/kk_datasets/pdf/2312.04005.pdf', 'page': 0}, page_content='KOALA: Self-Attention Matters in Knowledge Distillation of Latent\nDiffusion Models for Memory-Efficient and Fast Image Synthesis\nYoungwan Lee1,2 Kwanyong Park1 Yoorhim Cho3 Yong-Ju Lee1 Sung Ju Hwang2\n1Electronics and Telecommunications Research Institute (ETRI), South Korea\n2Korea Advanced Institute of Science and Technology (KAIST), South Korea\n3Sookmyung Women’s University, South Korea\nproject page: https://youngwanlee.github.io/KOALA/\nAbstract\nStable diffusion is the mainstay of the text-to-image (T2I)\nsynthesis in the community due to its generation perfor-\nmance and open-source nature. Recently, Stable Diffusion\nXL (SDXL), the successor of stable diffusion, has received a\nlot of attention due to its significant performance improve-\nments with a higher resolution of 1024 × 1024 and a larger\nmodel. However, its increased computation cost and model\nsize require highe

In [3]:
docs = ""
for page in pages:
    docs += page.page_content

docs

'KOALA: Self-Attention Matters in Knowledge Distillation of Latent\nDiffusion Models for Memory-Efficient and Fast Image Synthesis\nYoungwan Lee1,2 Kwanyong Park1 Yoorhim Cho3 Yong-Ju Lee1 Sung Ju Hwang2\n1Electronics and Telecommunications Research Institute (ETRI), South Korea\n2Korea Advanced Institute of Science and Technology (KAIST), South Korea\n3Sookmyung Women’s University, South Korea\nproject page: https://youngwanlee.github.io/KOALA/\nAbstract\nStable diffusion is the mainstay of the text-to-image (T2I)\nsynthesis in the community due to its generation perfor-\nmance and open-source nature. Recently, Stable Diffusion\nXL (SDXL), the successor of stable diffusion, has received a\nlot of attention due to its significant performance improve-\nments with a higher resolution of 1024 × 1024 and a larger\nmodel. However, its increased computation cost and model\nsize require higher-end hardware (e.g., bigger VRAM GPU)\nfor end-users, incurring higher costs of operation. To ad-\ndr
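
The loop in cell 3 appends page texts with no separator, so the last word of one page runs straight into the first word of the next. A small sketch of an alternative using `str.join` with a newline separator follows; `join_pages` is a hypothetical helper and the sample strings are made up to stand in for the `page_content` values.

```python
def join_pages(page_texts: list[str], separator: str = "\n") -> str:
    """Concatenate page texts, keeping a separator between consecutive pages."""
    return separator.join(page_texts)


sample = ["end of page one", "start of page two"]
print(join_pages(sample))
```

In the notebook this could be called as `join_pages([page.page_content for page in pages])`.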

Connecting to a large language model

In [4]:
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
import os

load_dotenv()  # the call parentheses are required; a bare `load_dotenv` does not load the .env file

api_key = os.getenv("LOCAL_API_KEY")
base_url = os.getenv("LOCAL_API_BASE")

llm = ChatOpenAI(temperature=0, api_key=api_key, base_url=base_url, max_tokens=8192)
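
If the `.env` file is not loaded or a variable is missing, `os.getenv` quietly returns `None`, and the resulting error can be less obvious than the root cause. A minimal sketch of a guard that fails early is shown here; `require_env` is a hypothetical helper, not part of python-dotenv or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value
```

With this helper, the lookups above could be written as `api_key = require_env("LOCAL_API_KEY")` and `base_url = require_env("LOCAL_API_BASE")`.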

Building the prompt

In [14]:
from langchain_core.prompts import ChatPromptTemplate

template = """You are a knowledge-base expert named ‘娟娟姐’.
{context}
Please summarize the topic and main idea of the document above.
"""

prompt = ChatPromptTemplate.from_template(template)

prompt

ChatPromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template='You are a knowledge-base expert named ‘娟娟姐’.\n{context}\nPlease summarize the topic and main idea of the document above.\n'), additional_kwargs={})])
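
To make the role of the `{context}` placeholder concrete, the sketch below fills a similar template with Python's built-in `str.format`. This only approximates what the prompt template does when the chain runs (the real object also wraps the text in a chat message); `fill_prompt` and the sample template text are hypothetical.

```python
# A shortened, illustrative template with the same placeholder name
template = "You are a knowledge-base expert.\n{context}\nPlease summarize the document above.\n"


def fill_prompt(template: str, **values: str) -> str:
    """Substitute named placeholders such as {context} with concrete text."""
    return template.format(**values)


message = fill_prompt(template, context="PDF page text goes here")
print(message)
```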

Creating a local vector database and retriever

In [15]:
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = "/Users/libing/kk_LLMs/bge-large-zh-v1.5"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

vector_store = Chroma.from_documents(pages, embedding=embeddings)

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.3, "k": 1},
)
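
To illustrate what `score_threshold` and `k` control, here is a dependency-free sketch: score every candidate against the query, drop those below the threshold, and keep the best `k`. It uses plain cosine similarity on toy vectors. The relevance scores Chroma reports may be computed and normalized differently, so treat this only as an illustration of the filtering logic; `top_k_above_threshold` is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_above_threshold(query, candidates, score_threshold=0.3, k=1):
    """Return up to k (score, index) pairs with score >= score_threshold, best first."""
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(candidates)]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:k]


vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k_above_threshold([1.0, 0.1], vectors, score_threshold=0.3, k=2))
```

A threshold that no candidate reaches yields an empty list, which is why a retriever configured this way can return no documents at all.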

Creating the chain

In [16]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

chain = prompt | llm | StrOutputParser()

In [26]:
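# Stream the output, printing only the topics of the first 3 pages

for i, page in enumerate(pages, 1):
    print(f"Progress: {i/len(pages)*100:.2f}%, topic of page {i}: \n")
    for chunk in chain.stream({"context": page.page_content}):
        print(chunk, end="", flush=True)
    print("\n\n")
    if i == 3:
        break


Progress: 2.86%, topic of page 1: 

The overall topic of this paper is the development of a more efficient text-to-image synthesis model called KOALA. The model distills generation capability from SDXL through knowledge distillation and uses the self-attention mechanism to achieve this. The main contribution of the research is identifying the importance of self-attention in knowledge distillation and designing a more efficient U-Net architecture to reduce computation cost and model size. The final result is two models, KOALA-1B and KOALA-700M, which reduce the original SDXL model size by 54% to 69% while maintaining generation quality.


Progress: 5.71%, topic of page 2: 

This text discusses the compression method proposed by BK-SDM, which, during the pre-training stage, lets the compressed U-Net mimic the last-layer features of each stage and the noise predicted by the teacher model. However, when applied to the larger SDXL, this method achieves only a limited compression rate (33%), unlike with SDM-v1.4.


Progress: 8.57%, topic of page 3: 

This text discusses building a more efficient text-to-image synthesis model through knowledge distillation. The study analyzes the denoising U-Net of SDXL, finds that most parameters are concentrated at the lowest feature level, and designs an efficient U-Net that reduces the original SDXL U-Net by 69%. In addition, it identifies four key factors for effectively distilling SDXL as the teacher model in feature-level knowledge distillation. Using these knowledge distillation strategies, the study trains an efficient text-to-image synthesis model called KOALA, which performs better in visual aesthetics and image-text alignment than BK-SDM's knowledge distillation method.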
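
The loop in cell 26 enumerates every page and breaks after the third. An equivalent way to cap the iteration is `itertools.islice`, sketched here with a made-up list of strings (`sample_pages`) in place of the loaded documents and without the LLM call.

```python
from itertools import islice

# Stand-in for the loaded pages; the real list holds Document objects
sample_pages = [f"text of page {n}" for n in range(1, 36)]

processed = []
for i, page in enumerate(islice(sample_pages, 3), 1):
    # Progress is still measured against the full document
    print(f"Progress: {i / len(sample_pages) * 100:.2f}%, page {i}")
    processed.append(page)
```

This keeps the stopping condition in the loop header, so the body no longer needs the `if i == 3: break` check.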




## Loading a directory of PDF documents

In [27]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

file_path = "/Users/libing/kk_datasets/pdf"

loader = PyPDFDirectoryLoader(file_path)
docs = loader.load()

docs

[Document(metadata={'source': '/Users/libing/kk_datasets/pdf/2312.04005.pdf', 'page': 0}, page_content='KOALA: Self-Attention Matters in Knowledge Distillation of Latent\nDiffusion Models for Memory-Efficient and Fast Image Synthesis\nYoungwan Lee1,2 Kwanyong Park1 Yoorhim Cho3 Yong-Ju Lee1 Sung Ju Hwang2\n1Electronics and Telecommunications Research Institute (ETRI), South Korea\n2Korea Advanced Institute of Science and Technology (KAIST), South Korea\n3Sookmyung Women’s University, South Korea\nproject page: https://youngwanlee.github.io/KOALA/\nAbstract\nStable diffusion is the mainstay of the text-to-image (T2I)\nsynthesis in the community due to its generation perfor-\nmance and open-source nature. Recently, Stable Diffusion\nXL (SDXL), the successor of stable diffusion, has received a\nlot of attention due to its significant performance improve-\nments with a higher resolution of 1024 × 1024 and a larger\nmodel. However, its increased computation cost and model\nsize require high
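
A directory loader first has to find the PDF files. The sketch below performs that discovery step with `pathlib`, recursing into subfolders and skipping hidden files. The function name `find_pdfs` is hypothetical, and the exact glob pattern `PyPDFDirectoryLoader` applies may differ.

```python
from pathlib import Path


def find_pdfs(directory: str) -> list[Path]:
    """Return all non-hidden .pdf files under directory, sorted by path."""
    root = Path(directory)
    return sorted(
        path
        for path in root.rglob("*.pdf")
        if path.is_file() and not path.name.startswith(".")
    )
```

Listing the files this way before loading is a quick check that the directory contains the documents you expect.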

Connecting to a locally deployed large language model

In [28]:
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
import os

load_dotenv()

api_key = os.getenv("LOCAL_API_KEY")
base_url = os.getenv("LOCAL_API_BASE")

llm = ChatOpenAI(temperature=0, api_key=api_key, base_url=base_url, max_tokens=8192)


Building the prompt

In [37]:
from langchain_core.prompts import ChatPromptTemplate

template = """You are a knowledge-base expert named ‘娟娟姐’.
Please answer the question based on the following documents:
{context}

Question: {question}

After answering, add a polite closing that thanks the asker for their support."""

prompt = ChatPromptTemplate.from_template(template)

prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='You are a knowledge-base expert named ‘娟娟姐’.\nPlease answer the question based on the following documents:\n{context}\n\nQuestion: {question}\n\nAfter answering, add a polite closing that thanks the asker for their support.'), additional_kwargs={})])

Building the retriever

In [38]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

embedding_model = "/Users/libing/kk_LLMs/bge-large-zh-v1.5"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

vector_store = Chroma.from_documents(docs, embedding=embeddings)

retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 1},
)
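
`search_type="mmr"` refers to maximal marginal relevance, which trades off similarity to the query against redundancy among the results already picked. The following is a simplified, dependency-free sketch of that greedy selection on toy vectors; `mmr_select` and the default `lambda_mult=0.5` are illustrative choices, not LangChain's implementation. Note that the first pick is always the candidate most similar to the query, so with `k=1` the diversity term never comes into play.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(idx):
            relevance = cosine(query, candidates[idx])
            # Highest similarity to anything already chosen (0.0 for the first pick)
            redundancy = max(
                (cosine(candidates[idx], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


vectors = [[1.0, 0.0], [0.99, 0.1], [0.5, 0.85]]
print(mmr_select([1.0, 0.2], vectors, k=2))
```

In this toy data the two near-duplicate vectors are both close to the query, so the second pick goes to the less similar but more distinct third vector; setting `lambda_mult=1.0` disables the diversity term and falls back to plain similarity ranking.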

Creating the chain

In [39]:
from langchain_core.runnables import RunnablePassthrough

chain = (
    {
        "context": retriever,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)
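
In the chain above, the retriever's output (a list of documents) is passed to the prompt as `context` as-is. A common pattern is to join the documents' text into one string first. The sketch below shows such a helper with a minimal stand-in `Doc` class so that it runs without LangChain; `format_docs` and `Doc` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


print(format_docs([Doc("first chunk"), Doc("second chunk")]))
```

With LangChain, a helper like this could be wired in as `"context": retriever | format_docs`, so the prompt receives clean text instead of the string form of `Document` objects.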

In [40]:
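for s in chain.stream("How do these documents describe large models? Which large models are mentioned?"):
    print(s, end="", flush=True)


The large models mentioned in these documents include Qwen-vl, LMRL Gym, and SpatialVLm. Qwen-vl is a cutting-edge vision-language model with a variety of capabilities. LMRL Gym is a language model for multi-turn reinforcement learning benchmarks. SpatialVLm is a method that gives vision-language models spatial reasoning capabilities. Thank you for your question!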

In [42]:
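for s in chain.stream("What specifically do the documents say about ChatGPT?"):
    print(s, end="", flush=True)

The document discusses the limitations of vision-language models (VLMs) in vision-based deductive reasoning. It points out that although LLMs such as GPT-4V perform well at text-based reasoning, they are still far from reaching a comparable level of proficiency in visual deductive reasoning. The research shows that certain standard strategies that are effective when applied to LLMs do not work as well on visual reasoning tasks. In addition, a detailed analysis shows that VLMs struggle with these tasks mainly because they cannot perceive and understand the multiple complex abstract patterns in RPM examples.

Thank you for your question!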

In [43]:
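for s in chain.stream("What legal implications does ChatGPT have?"):
    print(s, end="", flush=True)

Although this document does not discuss the legal implications of ChatGPT, it can be assumed that it may affect areas such as intellectual property, privacy, and data protection. In addition, as AI technology develops, new legal issues may arise, such as questions of liability and accountability. Overall, further research is needed to determine ChatGPT's impact on the law.

Thank you for your question!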