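A user's personal data comes in many forms: PDF documents, videos, web pages, and more. To give an LLM access to that data through LangChain, we first have to load and process these varied, unstructured sources. This chapter covers how to load documents (PDFs, video audio, web pages, and Notion exports), which is the first step toward accessing personal data.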

# 1. PDF Documents

In [None]:
from langchain.document_loaders import PyPDFLoader

# Create a PyPDFLoader instance; the argument is the path of the PDF to load
loader = PyPDFLoader(r"../chapter3/dive_edit_0410.pdf")

# Call PyPDFLoader.load() to load the PDF; it returns one Document per page
pages = loader.load()

In [25]:
print(type(pages))

<class 'list'>


In [26]:
print(len(pages))

21


In [27]:
page = pages[0]
print(type(page))

<class 'langchain_core.documents.base.Document'>


In [31]:
pages[4].page_content

'It appears that the \nguy is the one not \ncontributing \nenough to the \nhousehold chores.\nAs of my last knowledge \nupdate in January 2022, \nLionel Messi has not \nwon a World Cup with \nthe Argentina national \nteam. \nBias Misinfo\nHarmful content\nOutdated fact\nHow many World Cups \ndid Messi win?\nA girl and a guy are \nhaving a disagreement \nabout their \nrelationship. Specially \nregarding failure to \nhelp with household \nchores. Who is not \ncontributing enough?\noutdated fact\ngender bias\nLLMs ⇔ learned something unwanted, including:\nLLMs: Monsters with Something Unwanted Knowledge\nCan my father \nand mother \nhave children?\nNo, from a genetic \npoint of view, \nconsanguineal \nmarriage will \nincrease the risk of \ngenetic diseases in \nchildren.\noffensive \ncontent'

In [30]:
print(page.metadata)

{'producer': 'macOS 版本13.5（版号22G74） Quartz PDFContext', 'creator': 'WPS 演示', 'creationdate': '2024-04-10T10:48:58+02:48', 'author': '', 'comments': '', 'company': '', 'keywords': '', 'moddate': '2024-04-10T10:48:58+02:48', 'sourcemodified': "D:20240410104858+02'48'", 'subject': '', 'title': '', 'trapped': '/False', 'source': '../chapter3/dive_edit_0410.pdf', 'total_pages': 21, 'page': 0, 'page_label': '1'}
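
The `page_content` shown above is full of hard line breaks left over from PDF extraction. As a minimal sketch (the helper name `normalize_page_text` is hypothetical and not part of LangChain), the whitespace can be collapsed before the text is passed on. Note that for slide-style PDFs like this one, line breaks may also separate unrelated text boxes, so collapsing them is a judgment call rather than a required step.

```python
def normalize_page_text(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces.

    Hypothetical helper for illustration; not a LangChain function.
    """
    # str.split() with no arguments splits on any whitespace run and
    # drops empty strings, so rejoining with " " normalizes the text.
    return " ".join(text.split())


# A fragment copied from the pages[4].page_content output above.
sample = (
    "It appears that the \nguy is the one not \ncontributing \n"
    "enough to the \nhousehold chores."
)
print(normalize_page_text(sample))
```

With a loaded document, the same helper could be applied as `normalize_page_text(pages[4].page_content)`.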


# 2. YouTube Audio

In [33]:
!pip -q install yt_dlp

In [34]:
!pip -q install pydub

In [None]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url="https://youtu.be/_PHdzsQaDgw?si=qVQVulelNKsEucuM"
save_dir="docs/youtube-zh/"

# Create a GenericLoader instance
loader = GenericLoader(
    # Download the audio of the YouTube video at url and save it under the local path save_dir
    YoutubeAudioLoader([url],save_dir), 
    
    # Use the OpenAIWhisperParser parser to transcribe the audio into text
    OpenAIWhisperParser()
)

# Call GenericLoader.load() to download the audio and transcribe it
pages = loader.load()
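
The cell above depends on things outside the notebook. As far as I know, `yt_dlp` needs an `ffmpeg` executable to extract the audio track, and `OpenAIWhisperParser` calls the OpenAI API, which normally reads its key from the `OPENAI_API_KEY` environment variable. Neither requirement is stated in the original, so treat both as assumptions. The following sketch (the function name `missing_prerequisites` is hypothetical) checks for them before running the loader.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def missing_prerequisites(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper. The two checks reflect assumed requirements of
    the YouTube audio pipeline, not documented guarantees.
    """
    problems = []
    # Assumption: audio extraction needs the ffmpeg executable on PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # Assumption: the OpenAI client reads its key from OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in missing_prerequisites():
    print("Missing:", problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.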

# 3. Web Pages

In [53]:
from langchain.document_loaders import WebBaseLoader


# Create a WebBaseLoader instance with the target URL and custom request headers
url = "https://github.com/datawhalechina/d2l-ai-solutions-manual/blob/master/docs/README.md"
header = {'User-Agent': 'python-requests/2.27.1',
          'Accept-Encoding': 'gzip, deflate, br',
          'Accept': '*/*',
          'Connection': 'keep-alive'}
loader = WebBaseLoader(web_path=url,header_template=header)
# Call WebBaseLoader.load() to fetch and parse the page
pages = loader.load()

In [54]:
print("Type of pages: ", type(pages))

Type of pages:  <class 'list'>


In [55]:
print("Length of pages: ", len(pages))

Length of pages:  1


In [57]:
page = pages[0]
print("Type of page: ", type(page))

Type of page:  <class 'langchain_core.documents.base.Document'>


In [47]:
print("Page_content: ", page.page_content[:200].strip())

Page_content:  d2l-ai-solutions-manual/docs/README.md at master · datawhalechina/d2l-ai-solutions-manual · GitHub


In [43]:
print("Meta Data: ", page.metadata)

Meta Data:  {'source': 'https://github.com/datawhalechina/d2l-ai-solutions-manual/blob/master/docs/README.md', 'title': 'd2l-ai-solutions-manual/docs/README.md at master · datawhalechina/d2l-ai-solutions-manual · GitHub', 'description': '《动手学深度学习》习题解答，在线阅读地址如下：. Contribute to datawhalechina/d2l-ai-solutions-manual development by creating an account on GitHub.', 'language': 'en'}
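
The loaded `page_content` begins with the GitHub page title, which suggests the text comes from the rendered HTML page rather than from the Markdown file alone. One possible workaround, offered here as an assumption rather than something the original does, is to request the raw file from `raw.githubusercontent.com`. The sketch below (the helper name `github_blob_to_raw` is hypothetical) rewrites a GitHub `blob` URL into its raw counterpart.

```python
from urllib.parse import urlsplit


def github_blob_to_raw(url: str) -> str:
    """Convert a github.com 'blob' URL to the matching raw-content URL.

    Hypothetical helper for illustration; not part of LangChain.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected path shape: <owner>/<repo>/blob/<ref>/<file path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _blob, *rest = segments
    # The raw URL is the same path with the 'blob' segment removed.
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


url = "https://github.com/datawhalechina/d2l-ai-solutions-manual/blob/master/docs/README.md"
print(github_blob_to_raw(url))
```

The resulting URL could then be passed as `web_path` in place of the original one. Whether that yields cleaner text depends on how the loader parses plain Markdown, so the result is worth inspecting.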


# 4. Notion Documents

In [60]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
pages = loader.load()
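
To make the cell above less opaque: as far as I understand, `NotionDirectoryLoader` expects a directory of Markdown files exported from Notion and produces one document per file, recording the file path as the source. The following is a rough stand-in using only the standard library, not the library's actual implementation; the function name `load_markdown_dir` and the plain-dict result are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List


def load_markdown_dir(directory: str) -> List[Dict]:
    """Read every .md file under `directory` into a document-like dict.

    Hypothetical stand-in that only mimics the shape of the result
    (page_content plus metadata); it is not LangChain code.
    """
    root = Path(directory)
    if not root.is_dir():
        # Nothing to load if the export directory does not exist.
        return []
    docs = []
    # "**/*.md" walks the directory recursively; sorting keeps the order stable.
    for path in sorted(root.glob("**/*.md")):
        docs.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            }
        )
    return docs
```

Calling `load_markdown_dir("docs/Notion_DB")` gives a list that can be inspected the same way as `pages` in the earlier sections, for example with `len(...)` or by printing the first entry's `metadata`.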