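## Expert Knowledge Worker

### A question-answering agent that acts as an expert knowledge worker
### For use by employees of 顶云, an insurance technology company
### The agent needs to be accurate, and the solution should be low cost.

This project uses RAG (Retrieval-Augmented Generation) to ground the assistant's answers in the company knowledge base, which is how we aim for high accuracy.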
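
To make the "augmented" part of RAG concrete, here is a minimal sketch of how retrieved text might be combined with a question before it is sent to the model. The function name `build_rag_prompt` and the prompt wording are assumptions made for illustration; they are not part of this notebook.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context with the user's question into one prompt."""
    # Number each chunk so the model (and a human reader) can tell them apart.
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Illustrative chunk, paraphrased from the first knowledge-base document.
example_chunks = ["Rellm is an enterprise reinsurance product developed by Insurellm."]
prompt = build_rag_prompt("Who developed Rellm?", example_chunks)
print(prompt)
```

Only the few chunks judged relevant to the question reach the model, which is what keeps each request small and the answer tied to the knowledge base.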

## Today:

- Part A: Split the documents into CHUNKS
- Part B: Encode the CHUNKS into VECTORS and store them in Chroma
- Part C: Visualize the vectors

In [18]:
# If the imports below fail with a module-not-found error (langchain-huggingface), run this cell
!python -m pip install -U langchain-huggingface



### Part A: Split the documents into chunks

In [1]:
import os
import glob
import tiktoken
import numpy as np
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [6]:
# Cost matters to the company, so we use a low-cost model

MODEL = "gpt-4.1-nano"
db_name = "vector_db"
load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
if openai_api_key:
    print(f"OpenAI API Key exists and begins with {openai_api_key[:8]}")
else:
    print("OpenAI API Key is not set")


OpenAI API Key exists and begins with sk-proj-


In [4]:
# How many characters are there across all documents?

knowledge_base_path = "knowledge-base/**/*.md"
files = glob.glob(knowledge_base_path, recursive=True)
print(f"Found {len(files)} files in the knowledge base")

entire_knowledge_base = ""

for file_path in files:
    with open(file_path, 'r', encoding='utf-8') as f:
        entire_knowledge_base += f.read()
        entire_knowledge_base += "\n\n"

print(f"Total characters in knowledge base: {len(entire_knowledge_base):,}")

Found 76 files in the knowledge base
Total characters in knowledge base: 304,434


In [12]:
# How many tokens are there across all documents?

encoding = tiktoken.get_encoding("o200k_base")
tokens = encoding.encode(entire_knowledge_base)
token_count = len(tokens)
print(f"Total tokens: {token_count:,}")

Total tokens: 63,555
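
The two counts above give a rough sense of scale. The sketch below derives the average characters per token and shows how an input cost could be estimated if the whole knowledge base were sent with every question. The price is a placeholder chosen for illustration, not a quoted rate, and `estimate_input_cost` is a hypothetical helper.

```python
total_chars = 304_434   # character count printed above
total_tokens = 63_555   # token count printed above (o200k_base encoding)

chars_per_token = total_chars / total_tokens
print(f"Average characters per token: {chars_per_token:.2f}")


def estimate_input_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Placeholder price for illustration only; check the provider's current pricing.
PLACEHOLDER_PRICE = 1.0
cost = estimate_input_cost(total_tokens, PLACEHOLDER_PRICE)
print(f"Cost per question at ${PLACEHOLDER_PRICE:.2f} per 1M tokens: ${cost:.4f}")
```

Whatever the real price is, sending roughly 63,000 tokens with every question costs far more than sending a handful of relevant chunks, which is the cost argument for retrieval.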


In [11]:
# Load everything in the knowledge base with LangChain's loaders

folders = glob.glob("knowledge-base/*")

documents = []
for folder in folders:
    # The folder name (products, employees, contracts, company) becomes the document type
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs={'encoding': 'utf-8'})
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

print(f"Loaded {len(documents)} documents")

Loaded 76 documents


In [13]:
documents[0]

Document(metadata={'source': 'knowledge-base/products/Rellm.md', 'doc_type': 'products'}, page_content="# Product Summary\n\n# Rellm: AI-Powered Enterprise Reinsurance Solution\n\n## Summary\n\nRellm is an innovative enterprise reinsurance product developed by Insurellm, designed to transform the way reinsurance companies operate. Harnessing the power of artificial intelligence, Rellm offers an advanced platform that redefines risk management, enhances decision-making processes, and optimizes operational efficiencies within the reinsurance industry. With seamless integrations and robust analytics, Rellm enables insurers to proactively manage their portfolios and respond to market dynamics with agility.\n\n## Features\n\n### AI-Driven Analytics\nRellm utilizes cutting-edge AI algorithms to provide predictive insights into risk exposures, enabling users to forecast trends and make informed decisions. Its real-time data analysis empowers reinsurance professionals with actionable intellige

In [17]:
# Split the documents into chunks with RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

print(f"Split into {len(chunks)} chunks")
print(f"First chunk:\n\n{chunks[0]}")

Split into 532 chunks
First chunk:

page_content='# Product Summary

# Rellm: AI-Powered Enterprise Reinsurance Solution

## Summary

Rellm is an innovative enterprise reinsurance product developed by Insurellm, designed to transform the way reinsurance companies operate. Harnessing the power of artificial intelligence, Rellm offers an advanced platform that redefines risk management, enhances decision-making processes, and optimizes operational efficiencies within the reinsurance industry. With seamless integrations and robust analytics, Rellm enables insurers to proactively manage their portfolios and respond to market dynamics with agility.

## Features' metadata={'source': 'knowledge-base/products/Rellm.md', 'doc_type': 'products'}
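
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break at separators such as blank lines, newlines, and spaces, so its chunks are often shorter than `chunk_size` and its overlaps are not exact. The function `split_with_overlap` is a hypothetical helper written for this sketch.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the price of storing some text twice.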


In [16]:
chunks[100]

Document(metadata={'source': 'knowledge-base/contracts/Contract with National Claims Network for Claimllm.md', 'doc_type': 'contracts'}, page_content="7. **Business Continuity:** Insurellm provides disaster recovery with 4-hour RTO (Recovery Time Objective) and 1-hour RPO (Recovery Point Objective).\n\n---\n\n## Renewal\n\nThis agreement includes a mutual 120-day renewal notice period. National Claims Network receives guaranteed enterprise pricing for renewal equal to or better than new enterprise customers at renewal time. Contract may be extended in 12-month increments with mutual written agreement.\n\n---\n\n## Features\n\nNational Claims Network will receive the complete Claimllm Enterprise suite:\n\n1. **Unlimited Claims Processing:** No volume restrictions, supporting National's processing of 100,000+ claims annually with scalability to 500,000+ claims as business grows.\n\n2. **White-Label Platform:** Complete branding customization including:\n   - Custom domain names (claims.n

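### Part B: Generate vectors and store them in Chroma

In Week 3, you created a Hugging Face account and obtained an HF_TOKEN.

At this point, you may want to add it to your `.env` file and run `load_dotenv(override=True)`.

(In practice this is not strictly required: the embedding model used below is public and can be downloaded without a token.)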

In [None]:
# Choose an embedding model

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
#embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Delete any existing collection so that re-running this cell does not add duplicates
if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vector store created with {vectorstore._collection.count()} documents")
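
Once the chunks are stored as vectors, retrieval comes down to finding the stored vectors closest to a query vector. The sketch below shows that idea with cosine similarity on made-up 3-dimensional vectors. The vectors, labels, and the `top_k` helper are assumptions for illustration; real embeddings have hundreds of dimensions, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the labels of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Made-up vectors standing in for embedded chunks.
toy_store = {
    "products chunk": [0.9, 0.1, 0.0],
    "contracts chunk": [0.1, 0.9, 0.1],
    "employees chunk": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]
print(top_k(query, toy_store, k=2))
# ['products chunk', 'contracts chunk']
```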

In [None]:
# Let's inspect the vectors

collection = vectorstore._collection
count = collection.count()

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vector store holds {count:,} vectors, each with {dimensions:,} dimensions")
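
The output of the cell above is not shown in this notebook, so the following is only a back-of-the-envelope sketch of what the numbers imply. It assumes 532 vectors (the chunk count from Part A), the default output sizes of the two embedding models mentioned above (384 and 3,072 dimensions), and 4 bytes per value (float32). It ignores index and metadata overhead, and `raw_vector_bytes` is a hypothetical helper.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vector values alone (no index, no metadata)."""
    return num_vectors * dimensions * bytes_per_value


num_chunks = 532  # chunk count from Part A (assumed to equal the vector count)
for name, dims in [("all-MiniLM-L6-v2", 384), ("text-embedding-3-large", 3072)]:
    kib = raw_vector_bytes(num_chunks, dims) / 1024
    print(f"{name}: {dims} dims -> about {kib:,.0f} KiB of raw float32 values")
```

At this scale either model fits comfortably in memory, so the choice between them rests on retrieval quality and per-call cost rather than storage.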

### Part C: Visualize!

In [None]:
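# Preprocessing: pull vectors, texts, and metadata out of the collection and assign a color per document type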

result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
metadatas = result['metadatas']
doc_types = [metadata['doc_type'] for metadata in metadatas]
colors = [['blue', 'green', 'red', 'orange'][['products', 'employees', 'contracts', 'company'].index(t)] for t in doc_types]
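
The list-and-`index` lookup above raises a `ValueError` if a `doc_type` is not one of the four expected folder names. A dictionary with a fallback is one possible alternative, sketched below. The fallback color `gray`, the helper name `colors_for`, and the `faq` type in the example are assumptions for illustration.

```python
# Same type-to-color pairs as the cell above, written as a dictionary.
COLOR_BY_TYPE = {
    "products": "blue",
    "employees": "green",
    "contracts": "red",
    "company": "orange",
}
FALLBACK_COLOR = "gray"  # hypothetical choice for unexpected document types


def colors_for(doc_types: list[str]) -> list[str]:
    """Map each document type to its color, using the fallback for unknown types."""
    return [COLOR_BY_TYPE.get(t, FALLBACK_COLOR) for t in doc_types]


print(colors_for(["products", "contracts", "faq"]))
# ['blue', 'red', 'gray']
```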

In [None]:
# Humans find it easier to visualize things in two dimensions!
# Reduce the vectors to 2D with t-SNE
# (t-distributed Stochastic Neighbor Embedding)

tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

# `scene` only applies to 3D plots, so the 2D axis titles are set directly on the layout
fig.update_layout(title='2D Chroma Vector Store Visualization',
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [None]:
# Now let's try 3D!

tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=900,
    height=700,
    margin=dict(r=10, b=10, l=10, t=40)
)

fig.show()