# Semantic Search Engine

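This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieving data from (vector) databases and other sources so that it can be integrated into LLM workflows. They matter for applications that fetch data to reason over during model inference, such as retrieval-augmented generation (RAG); see our RAG tutorial.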

Here we will build a search engine over a PDF document. This will let us retrieve the passages in the PDF that are most similar to an input query.

In [1]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

In [16]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

104


In [17]:
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

stocklight.com
 
>
Stocks
 
>
United States
 
Nike
 
>
Annual Reports
 
>
2023 Annual Report
Nike Annual Report 2023
Form 10-K (NYSE:NKE)
Published: July 20th, 2023
Brought to you by

{'producer': 'Qt 4.8.7', 'creator': 'wkhtmltopdf 0.12.6.1', 'creationdate': '2024-11-22T21:07:39+00:00', 'title': 'nke-20230531', 'source': './example_data/nke-10k-2023.pdf', 'total_pages': 104, 'page': 0, 'page_label': '1'}


In [18]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

505
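
To see what `chunk_size`, `chunk_overlap`, and `add_start_index` mean, here is a minimal sketch of a fixed-width sliding-window splitter in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line boundaries, so its chunks are often shorter than `chunk_size`. The function name `split_with_overlap` and the sample string are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Naive splitter: each chunk starts (chunk_size - chunk_overlap) characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        # Record where the chunk begins, similar in spirit to add_start_index=True.
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # hypothetical input
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Consecutive chunks share their last and first four characters, which is the role overlap plays in the notebook: a sentence cut at a chunk boundary still appears whole in at least one chunk.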

In [19]:
import getpass
import os

# Set the Alibaba Cloud DashScope API key (used here in place of an OpenAI key)
if not os.environ.get("DASHSCOPE_API_KEY"):
    os.environ["DASHSCOPE_API_KEY"] = getpass.getpass("Enter API key for Alibaba Cloud DashScope: ")

# Import the DashScope embedding model from the community integrations package
from langchain_community.embeddings import DashScopeEmbeddings

# Initialize the embedding model
embeddings = DashScopeEmbeddings(model="text-embedding-v4")

# all_splits is the list of document chunks produced by the text splitter above
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

# The vector length depends on the model version (for example 1024 or 1536)
assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])  # Print the first 10 dimensions


Generated vectors of length 1024

[0.010377250611782074, 0.02415759488940239, 0.011720961891114712, -0.08389433473348618, 0.013305664993822575, 0.051148671656847, -0.005875086411833763, -0.015145965851843357, 0.007350247818976641, 0.08389433473348618]
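
Embedding vectors are usually compared by the angle between them rather than by their raw values. A minimal sketch of cosine similarity follows, using hand-made three-dimensional vectors rather than real model output. The helper `cosine_similarity` below is our own illustrative function, and the toy vectors are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for embeddings
toy_1 = [1.0, 0.0, 1.0]
toy_2 = [1.0, 0.0, 0.0]
toy_3 = [0.0, 1.0, 0.0]

print(cosine_similarity(toy_1, toy_2))  # partially aligned
print(cosine_similarity(toy_1, toy_3))  # orthogonal
```

In the notebook, calling `cosine_similarity(vector_1, vector_2)` would compare the first two chunks in the same way.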


In [20]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [21]:
ids = vector_store.add_documents(documents=all_splits)
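
Conceptually, a vector store keeps one vector per document and answers a query by embedding it and ranking the stored vectors by similarity. The sketch below illustrates that idea with a brute-force search. It is a toy under stated assumptions, not the implementation of `InMemoryVectorStore`: `ToyVectorStore`, `toy_embed`, and the four-word vocabulary are hypothetical, and the keyword-count "embedding" merely stands in for a real model.

```python
import math

VOCAB = ["dog", "cat", "loyal", "independent"]  # hypothetical toy vocabulary


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: counts vocabulary substrings."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (vector, text) pairs

    def add_texts(self, texts):
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query, k=1):
        query_vector = self.embed(query)
        scored = [(cosine(query_vector, vector), text) for vector, text in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        return [text for _, text in scored[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
])
top = store.similarity_search("Which pet is independent?", k=1)
print(top[0])
```

A brute-force scan like this is fine for a few hundred chunks; larger collections generally call for an indexed vector database.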

In [26]:
results = vector_store.similarity_search(
    "耐克有几家经销中心点?"  # Chinese for "How many distribution centers does Nike have?"
)

print(results[0])

page_content='consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES
NUMBER
NIKE Brand factory stores
213 
NIKE Brand in-line stores (including employee-only stores)
74 
Converse stores (including factory stores)
82 
TOTAL
369
 
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.' metadata={'producer': 'Qt 4.8.7', 'creator': 'wkhtmltopdf 0.12.6.1', 'creationdate': '2024-11-22T21:07:39+00:00', 'title': 'nke-20230531', 'source': './example_data/nke-10k-2023.pdf', 'total_pages': 104, 'page': 6, 'page_label': '7', 'start_index': 3117}


In [23]:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our," "NIKE" and
the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is the largest
seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores and sales through our
digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales representatives in nearly all countries around' metadata={'producer': 'Qt 4.8.7', 'creator': 'wkhtmltopdf 0.12.6.1', 'creatio
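
Top-level `await` works in a notebook, but a plain Python script needs an event loop, for example via `asyncio.run`. The sketch below shows that pattern and how several queries could be issued concurrently. `fake_asimilarity_search` is a hypothetical stand-in so the example runs without a vector store; in real code the awaited call would be `vector_store.asimilarity_search(query)`.

```python
import asyncio


async def fake_asimilarity_search(query: str) -> str:
    """Hypothetical stand-in for an async vector-store query."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"result for: {query}"


async def main() -> list[str]:
    queries = [
        "When was Nike incorporated?",
        "What was Nike's revenue in 2023?",
    ]
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_asimilarity_search(q) for q in queries))


answers = asyncio.run(main())
for answer in answers:
    print(answer)
```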

In [24]:
# Note that providers implement different scores. InMemoryVectorStore
# returns cosine similarity, so a higher score means a closer match.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.6823624701000965

page_content='NIKE, INC.
CONSOLIDATED STATEMENTS OF INCOME
YEAR ENDED MAY 31,
(In millions, except per share data)
2023
2022
2021
Revenues
$
51,217
 
$
46,710
 
$
44,538
 
Cost of sales
28,925
 
25,231
 
24,576
 
Gross profit
22,292
 
21,479
 
19,962
 
Demand creation expense
4,060
 
3,850
 
3,114
 
Operating overhead expense
12,317
 
10,954
 
9,911
 
Total selling and administrative expense
16,377
 
14,804
 
13,025
 
Interest expense (income), net
(
6
)
205
 
262
 
Other (income) expense, net
(
280
)
(
181
)
14
 
Income before income taxes
6,201
 
6,651
 
6,661
 
Income tax expense
1,131
 
605
 
934
 
NET INCOME
$
5,070
 
$
6,046
 
$
5,727
 
Earnings per common share:
Basic
$
3.27
 
$
3.83
 
$
3.64
 
Diluted
$
3.23
 
$
3.75
 
$
3.56
 
Weighted average common shares outstanding:
Basic
1,551.6
 
1,578.8
 
1,573.0
 
Diluted
1,569.8
 
1,610.8
 
1,609.4
 
The accompanying Notes to the Consolidated Financial Statements are an integral part of this statement.
2023 
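
Scores are not directly comparable across vector stores: some report a similarity (higher is closer) and others a distance (lower is closer), so it is worth checking the convention before thresholding on a score. A minimal sketch of how the two relate for unit-length vectors follows; the vectors and helper names are illustrative assumptions.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


query = normalize([1.0, 0.0])
candidates = {
    "near": normalize([0.9, 0.1]),
    "far": normalize([0.1, 0.9]),
}

# Similarity: sort descending. Distance: sort ascending.
by_similarity = sorted(candidates, key=lambda name: dot(query, candidates[name]), reverse=True)
by_distance = sorted(candidates, key=lambda name: math.dist(query, candidates[name]))
print(by_similarity, by_distance)
```

For unit vectors the squared Euclidean distance equals `2 - 2 * similarity`, which is why both orderings agree once the sort direction is flipped.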

In [25]:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to 43.5% for fiscal
2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
•
Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as product mix;
•
Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in the prior
period resulting from lower available inventory supply;
•
Unfavorable changes in net foreign currency exchange rates, including hedges; and
•
Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:
•' metadata={'prod