# Loading Documents

Load a PDF into a sequence of `Document` objects.

Each `Document` is an instance of `langchain_core.documents.Document`.

The following line converts each page of the PDF into its own `Document` object:

```
docs = loader.load()
```

In [6]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

# One Document object per page
print(len(docs))

107


In [13]:
print(docs[0].metadata, "\n\n")
print(docs[0].page_content[:200])
# dir(docs[0])

{'source': '../data/nke-10k-2023.pdf', 'page': 0} 


Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F


# Splitting

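Whether the goal is retrieval or question answering, the page-level documents are first split into smaller chunks.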

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                               chunk_overlap=200,
                                               add_start_index=True)

all_splits = text_splitter.split_documents(docs)
print(len(all_splits))

516


In [34]:
split = all_splits[0]
print(split.metadata, "\n\n",
      len(split.page_content))

{'source': '../data/nke-10k-2023.pdf', 'page': 0, 'start_index': 0} 

 972


In [33]:
print(len(docs[0].page_content), "\t", len(split.page_content), "\n\n")

print(docs[0].page_content[900:972], "\n")
print(split.page_content[900:972])


3646 	 972 


each class) (Trading symbol) (Name of each exchange on which registered) 

each class) (Trading symbol) (Name of each exchange on which registered)
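
To make the three splitter parameters concrete, here is a minimal sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph breaks and newlines, which is why the first chunk above is 972 characters rather than exactly 1000. The helper `split_with_overlap` is hypothetical and simply cuts fixed-size windows, so that `chunk_size` (maximum chunk length), `chunk_overlap` (characters shared by neighbouring chunks), and the recorded `start_index` (offset of the chunk in the original text) are easy to see.

```python
from typing import Dict, List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[Dict]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators entirely.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk begins with the last three characters of the previous one, and slicing the original text at `start_index` recovers the chunk, which is the same property the notebook checks when it compares `docs[0].page_content[900:972]` with `split.page_content[900:972]`.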


# Embeddings

Convert text into vectors, which can then be used for vector search.

In [36]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [37]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 768

[0.047472357749938965, 0.021675849333405495, -0.009018078446388245, 0.005356733687222004, 0.025557702407240868, -0.010230264626443386, -0.008413944393396378, 0.03930392488837242, 0.02157050184905529, -0.024095406755805016]
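
Vector search ranks stored vectors by how similar they are to a query vector. The sketch below assumes cosine similarity as the metric, which is a common choice but not something this notebook specifies, and it uses small made-up vectors instead of the 768-dimensional embeddings above. The names `cosine_similarity`, `top_k`, `toy_vectors`, and `toy_query` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          vectors: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]


# Toy data standing in for chunk embeddings and a query embedding.
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]

ranked = top_k(toy_query, toy_vectors, k=2)
print(ranked)
```

With real data, `toy_vectors` would be replaced by the embeddings of the chunks in `all_splits` and `toy_query` by the embedding of a question; a vector store performs the same ranking with an index so that it scales beyond a linear scan.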
