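The simplest LlamaIndex example: read PDF documents and build an index. There is no need to set up a vector database yourself.
When I used SimpleDirectoryReader to read all the PDF files in a folder, the extracted text came back garbled, so I switched to PDFReader afterwards.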
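
Since garbled extraction is easy to miss until a query returns nonsense, a quick sanity check on the loaded text before indexing can help. The sketch below is only a heuristic and not part of LlamaIndex: `garbled_ratio`, `looks_garbled`, and the 0.1 threshold are hypothetical names and values chosen for illustration. It counts replacement, control, private-use, and unassigned characters, so it will not catch mojibake that decodes to valid but wrong letters.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Fraction of characters that look like extraction debris."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            # Ordinary whitespace control characters are expected in page text.
            continue
        category = unicodedata.category(ch)
        # U+FFFD is the replacement character; Cc = control,
        # Co = private use, Cn = unassigned code point.
        if ch == "\ufffd" or category in ("Cc", "Co", "Cn"):
            bad += 1
    return bad / len(text)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """True when the share of suspicious characters exceeds the threshold (assumed value)."""
    return garbled_ratio(text) > threshold


print(looks_garbled("DeepSeek-V3 Technical Report\nAbstract"))
print(looks_garbled("\ufffd\ufffd\x00\x01ab"))
```

Running such a check on `documents[0].text` right after `load_data()` gives an early signal that a different reader is worth trying.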

In [12]:
import os

from langchain_community.embeddings import DashScopeEmbeddings
from llama_cloud_services import LlamaParse
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.dashscope import DashScopeTextEmbeddingModels
from llama_index.llms.openai_like import OpenAILike
from llama_index.readers.web import SimpleWebPageReader

# Set the LLM globally
Settings.llm = OpenAILike(model='qwen-max', api_base='https://dashscope.aliyuncs.com/compatible-mode/v1',
                          api_key=os.getenv("DASHSCOPE_API_KEY"), is_chat_model=True)
# Set the embedding model globally
Settings.embed_model = DashScopeEmbeddings(model=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V1)


In [28]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.readers.file import PDFReader

# Read the PDF with PDFReader (alternative to SimpleDirectoryReader)
# pdf_reader = PDFReader()
# documents = pdf_reader.load_data("./data/deepseek-v3-1-4.pdf")
documents = SimpleDirectoryReader(input_dir="./data").load_data()
print(documents[0].text[:500])
# Build the index
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
# Query (Chinese): "How many parameters does DeepSeek V3 have?"
response = query_engine.query("deepseek v3 有多少参数？")
response

DeepSeek-V3 Technical Report
DeepSeek-AI
research@deepseek.com
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total
parameters with 37B activated for each token. To achieve efficient inference and cost-effective
training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-
tures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers
an auxiliary-loss-free strategy for load balancing and sets a mul


2025-09-25 15:28:32,187 - INFO - HTTP Request: POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions "HTTP/1.1 200 OK"


Response(response='DeepSeek-V3 拥有总计671亿个参数，每个令牌激活37亿个参数。', source_nodes=[NodeWithScore(node=TextNode(id_='aeb80c7c-5f41-44ab-bcda-a1def5c149d0', embedding=None, metadata={'page_label': '1', 'file_name': 'deepseek-v3-1-4.pdf', 'file_path': '/Users/onepiecekevin/Documents/learn/ai/agent/rag_learn/第四章 LlamaIndex/data/deepseek-v3-1-4.pdf', 'file_type': 'application/pdf', 'file_size': 192218, 'creation_date': '2025-09-25', 'last_modified_date': '2025-03-12'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='77855cf2-2b3b-475f-88b6-02d7f68d4c34', node_type='4', metadata={'page_label': '1', 'file_name': 'deepseek-v3-1-4.pdf', 'file_path': '/Users/onepiecekevin/Documents/learn/ai/agent/rag_learn/第四章 LlamaIndex/data/

Note: the generated answer above reads "DeepSeek-V3 has 67.1 billion parameters in total, with 3.7 billion activated per token." That is a unit slip in the model's Chinese output: the retrieved abstract says 671B total and 37B activated, and 671B corresponds to 6710亿, not 671亿.

In [25]:
reader = SimpleDirectoryReader(input_dir="./data", recursive=False, required_exts=[".pdf"])
documents = reader.load_data()
# Text of the first page
documents[0].text


'DeepSeek-V3 Technical Report\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total\nparameters with 37B activated for each token. To achieve efficient inference and cost-effective\ntraining, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-\ntures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers\nan auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training\nobjective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and\nhigh-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to\nfully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms\nother open-source models and achieves performance comparable to leading closed-source\nmodels. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours\nfo

In [5]:
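# Parse with LlamaParse, which can also handle tables and images
from llama_cloud_services import LlamaParse
from llama_index.core import SimpleDirectoryReader
import os
# LlamaParse reads its key from the LLAMA_CLOUD_API_KEY environment variable.
# Set it outside the notebook instead of hardcoding the secret here.
assert os.getenv("LLAMA_CLOUD_API_KEY"), "LLAMA_CLOUD_API_KEY is not set"
# result_type can be "markdown" or "text"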
parser = LlamaParse(result_type="markdown")
file_extractor = {".pdf": parser}

documents = SimpleDirectoryReader(input_dir="./data", file_extractor=file_extractor, required_exts=[".pdf"]).load_data()
print(documents[0].text)


2025-09-25 16:01:39,447 - INFO - NumExpr defaulting to 8 threads.
2025-09-25 16:01:43,411 - INFO - HTTP Request: POST https://api.cloud.llamaindex.ai/api/parsing/upload "HTTP/1.1 200 OK"


Started parsing the file under job_id c5f1a02d-555a-4c8e-aa9b-de5ce07b311a


2025-09-25 16:01:44,947 - INFO - HTTP Request: GET https://api.cloud.llamaindex.ai/api/parsing/job/c5f1a02d-555a-4c8e-aa9b-de5ce07b311a "HTTP/1.1 200 OK"
2025-09-25 16:01:47,609 - INFO - HTTP Request: GET https://api.cloud.llamaindex.ai/api/parsing/job/c5f1a02d-555a-4c8e-aa9b-de5ce07b311a "HTTP/1.1 200 OK"
2025-09-25 16:01:48,328 - INFO - HTTP Request: GET https://api.cloud.llamaindex.ai/api/parsing/job/c5f1a02d-555a-4c8e-aa9b-de5ce07b311a/result/markdown "HTTP/1.1 200 OK"


arXiv:2412.19437v2 [cs.CL] 18 Feb 2025

# DeepSeek-V3 Technical Report

# DeepSeek-AI

research@deepseek.com

# Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requi

In [2]:
from llama_index.readers.web import SimpleWebPageReader

# Load a web page
documents = SimpleWebPageReader(html_to_text=True).load_data(urls=["https://edu.guangjuke.com/tx/"])
print(documents[0].text)



Login

__

用户登录

![](/api/Ajax/vertify/type/users_login.html)

[ 登录](javascript:void\(0\))

[忘记密码?](/user/Users/retrieve_password.html) [立即注册](/reg)

快捷登录 | [__](/index.php?m=plugins&c=QqLogin&a=login)[__](/index.php?m=plugins&c=WxLogin&a=login)[__](/index.php?m=plugins&c=Wblogin&a=login)

[
![聚客AI学院大模型应用开发微调项目实践课程学习平台](https://oss.guangjuke.com/uploads/allimg/20250224/1-250224200331L9.png)
](https://edu.guangjuke.com)

[首页](https://edu.guangjuke.com)
[全部课程](https://edu.guangjuke.com/shipinkecheng/)
[大模型应用](https://edu.guangjuke.com/tx/)
[精选好文](https://edu.guangjuke.com/haowen/)
[聚客社区](https://edu.guangjuke.com/ask.html)
[学员喜报](https://edu.guangjuke.com/xibao/)
[关于我们](https://www.guangjuke.com/about/)

[登录/注册](https://edu.guangjuke.com/user)

![](/template/pc/static/images/banner-tip-bar.28e923ae.png)

# 锤炼前沿实战精华，独创多领域大模型人才培养方案

Kevin聚客科技联合创始人/技术总监（CTO）

华为高级架构师互联网AI领域专家

互联网后端技术领域15年从业经验，曾任职华为、新一代技术研究院，对Open AI、Azure AI、Google AI等大模型有丰富的实战项目经验。

Aron人工智能研究院研究员

人工智能算法研究员医疗领域AI专家

8年深度

### 4. Text Splitting and Parsing
LlamaIndex splits text into multiple nodes; each node holds one text chunk together with that chunk's metadata.
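
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch in plain Python. It is a simplification, not how `TokenTextSplitter` is implemented: the real splitter uses a tokenizer and prefers to break on separators, so its boundaries will differ. `split_with_overlap` and the `chunk_index` metadata key are hypothetical names used only for illustration.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            # This window already reaches the end of the input.
            break
    return chunks


# Ten dummy tokens stand in for tokenizer output.
tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=2)

# Each "node" pairs a chunk with metadata, mirroring the description above.
nodes = [
    {"text": " ".join(chunk), "metadata": {"chunk_index": i}}
    for i, chunk in enumerate(chunks)
]
for node in nodes:
    print(node)
```

Under this simplified model, `chunk_size=512` with `chunk_overlap=200` advances 312 tokens per chunk, so a large overlap buys context continuity at the cost of more nodes to embed.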

In [3]:
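import json
from pydantic import BaseModel  # pydantic v2, which current LlamaIndex objects are built on

def show_json(data):
    """Pretty-print JSON data (a JSON string, dict, list, or pydantic model)."""
    if isinstance(data, str):
        obj = json.loads(data)
        print(json.dumps(obj, indent=4, ensure_ascii=False))
    elif isinstance(data, (dict, list)):
        print(json.dumps(data, indent=4, ensure_ascii=False))
    elif isinstance(data, BaseModel):
        # Round-trip through model_dump_json() so enums and similar types serialize cleanly
        print(json.dumps(json.loads(data.model_dump_json()), indent=4, ensure_ascii=False))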

def show_list_obj(data):
    """Pretty-print each object in a list."""
    if isinstance(data, list):
        for item in data:
            show_json(item)
    else:
        raise ValueError("Input is not a list")


In [5]:
from llama_index.core import Document
from llama_index.core.node_parser import TokenTextSplitter

node_parser = TokenTextSplitter(chunk_size=512, chunk_overlap=200)
nodes = node_parser.get_nodes_from_documents(documents, show_progress=False)

print(nodes[1].json())
# print(nodes[2].json())


{"id_": "6f914e0f-179d-4976-9dea-af09167f07ae", "embedding": null, "metadata": {"url": "https://edu.guangjuke.com/tx/"}, "excluded_embed_metadata_keys": [], "excluded_llm_metadata_keys": [], "relationships": {"1": {"node_id": "7c8de0e0-e53d-408b-95ed-bd161be8351b", "node_type": "4", "metadata": {"url": "https://edu.guangjuke.com/tx/"}, "hash": "491fa501355a604a7e0505d17f3bea07da28f97f0c192d8b7ad941c7a0ba1ea1", "class_name": "RelatedNodeInfo"}, "2": {"node_id": "f0ff02e2-a035-4e5e-9d02-b71db211cdf1", "node_type": "1", "metadata": {"url": "https://edu.guangjuke.com/tx/"}, "hash": "717c00ec1955e32d1afeebf95ae577b94f9de33ecd0540b65486696aa0dbb4cf", "class_name": "RelatedNodeInfo"}, "3": {"node_id": "a32d32b6-1ad0-43ba-a0bf-14aa322ded13", "node_type": "1", "metadata": {}, "hash": "be74416c434cd27225f2bddd93b43b05843e97e1ee96f00b6760b045695178b0", "class_name": "RelatedNodeInfo"}}, "metadata_template": "{key}: {value}", "metadata_separator": "\n", "text": "\u9524\u70bc\u524d\u6cbf\u5b9e\u621
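
The numeric keys under `relationships` and the `node_type` values in the JSON above are string-valued enums. The sketch below decodes them with the standard library only. The two lookup tables reflect my understanding of `NodeRelationship` and `ObjectType` in `llama_index.core.schema` (the earlier output confirms `SOURCE: '1'`); treat them as an assumption and check your installed version. `describe_relationships` and the short node ids in the sample are hypothetical.

```python
import json

# Assumed enum values; verify against llama_index.core.schema in your version.
NODE_RELATIONSHIP = {"1": "SOURCE", "2": "PREVIOUS", "3": "NEXT", "4": "PARENT", "5": "CHILD"}
OBJECT_TYPE = {"1": "TEXT", "2": "IMAGE", "3": "INDEX", "4": "DOCUMENT"}


def describe_relationships(node_json: str) -> dict:
    """Map a serialized node's relationship keys and node types to readable names."""
    node = json.loads(node_json)
    decoded = {}
    for key, info in node.get("relationships", {}).items():
        decoded[NODE_RELATIONSHIP.get(key, key)] = {
            "node_id": info["node_id"],
            "node_type": OBJECT_TYPE.get(info["node_type"], info["node_type"]),
        }
    return decoded


# Hypothetical sample with the same shape as the output above, using short ids.
sample_node_json = json.dumps({
    "id_": "node-1",
    "relationships": {
        "1": {"node_id": "doc-0", "node_type": "4"},
        "2": {"node_id": "node-0", "node_type": "1"},
        "3": {"node_id": "node-2", "node_type": "1"},
    },
})

decoded = describe_relationships(sample_node_json)
for name, info in decoded.items():
    print(f"{name:8s} -> {info['node_type']:8s} {info['node_id']}")
```

Read this way, the node printed above points back to its source document and to the previous and next chunks, which is what lets a retriever expand context around a hit.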

In [13]:
# Vector retrieval
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import TokenTextSplitter, SentenceSplitter
from llama_index.embeddings.dashscope import DashScopeTextEmbeddingModels
from langchain_community.embeddings import DashScopeEmbeddings
from llama_index.llms.openai_like import OpenAILike

# Set the embedding model
embed_model = DashScopeEmbeddings(
    model=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V1
)

documents = SimpleDirectoryReader(input_dir="./data", required_exts=[".pdf"]).load_data()
node_parser = TokenTextSplitter(chunk_size=512, chunk_overlap=200)

# Split the documents into nodes
nodes = node_parser.get_nodes_from_documents(documents)

# Build the index with the chosen embedding model
index = VectorStoreIndex(nodes, embed_model=embed_model)

# Persist to disk
# index.storage_context.persist(persist_dir="./doc_emb")
# Get a retriever
vector_retriever = index.as_retriever(similarity_top_k=2)

# Query (Chinese): "How good is DeepSeek V3 at math?"
results = vector_retriever.retrieve("deepseek v3 数学能力怎么样")
print(results[0].text)

llm = OpenAILike(model='qwen-max', api_base='https://dashscope.aliyuncs.com/compatible-mode/v1',
                 api_key=os.getenv("DASHSCOPE_API_KEY"), is_chat_model=True)
qa_engine = index.as_query_engine(llm=llm)
response = qa_engine.query("deepseek v3数学能力怎么样?")

print(response)

verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its
reasoning performance. Meanwhile, we also maintain control over the output style and
length of DeepSeek-V3.
Summary of Core Evaluation Results
• Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA,
DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9
on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source
models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source
and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3
demonstrates superior performance among open-source models on both SimpleQA and
Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual
knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese
SimpleQA), highlighting its strength in Chinese factual knowledge.
• Code, Math, and Reasoning: (1) DeepSee

2025-09-25 18:26:42,237 - INFO - HTTP Request: POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions "HTTP/1.1 200 OK"


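DeepSeek-V3在数学相关基准测试中表现出色，特别是在非长链思维链（non-long-CoT）的开源和闭源模型中达到了最先进水平。它在特定的基准测试如MATH-500上甚至超过了o1-preview的表现，这表明它具有强大的数学推理能力。
(English: DeepSeek-V3 performs strongly on math benchmarks, reaching state-of-the-art results among non-long-CoT open-source and closed-source models. On specific benchmarks such as MATH-500 it even outperforms o1-preview, which indicates strong mathematical reasoning ability.)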
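
What `similarity_top_k=2` asks for can be sketched without any library: embed the query, score every node vector against it, and keep the two best. The version below uses cosine similarity, which I believe is the default for LlamaIndex's in-memory vector store, but that is an assumption here. The two-dimensional vectors, node ids, and the `top_k` helper are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs with the highest similarity to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for the vectors an embedding model would return.
node_vecs = {
    "node-a": [1.0, 0.0],
    "node-b": [0.7, 0.7],
    "node-c": [0.0, 1.0],
}
results = top_k([1.0, 0.1], node_vecs, k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```

The query engine then passes the text of those top nodes to the LLM as context, which is why the first retrieved chunk printed above shapes the final answer.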
