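LangChain makes it easier to work with LLMs.

Development: Build your applications using LangChain's open-source components and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.

Productionization: Use LangSmith to inspect, monitor, and evaluate your applications so that you can continuously optimize and deploy with confidence.

Deployment: Use LangGraph Platform to turn your LangGraph applications into production-ready APIs and Assistants.

(1) Get a chat model and use it: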

In [None]:
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o-mini")
from langchain_core.messages import HumanMessage, SystemMessage
messages = [
    SystemMessage("Translate the following from English into Italian"),
    HumanMessage("hi!"),
]
model.invoke(messages)
# Returns an AIMessage

(2) Create documents:

In [None]:
from langchain_core.documents import Document
documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
# page_content: a string holding the content;
# metadata: a dict of arbitrary metadata;
# id: (optional) a string identifier for the document.

To load a PDF document instead:

In [None]:
from langchain_community.document_loaders import PyPDFLoader
file_path = "../example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()
# print(len(docs))

(3) Split the text:

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
# print(len(all_splits))
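# Split the documents into chunks of 1000 characters, with 200 characters of overlap between chunks. The overlap reduces the chance of separating a statement from important context related to it.
# add_start_index=True stores the character index at which each split starts in the original document as the metadata attribute "start_index".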
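
To make chunk_size, chunk_overlap, and start_index concrete, here is a minimal sketch in plain Python. It uses a simple fixed-size sliding window, which is an assumption made for illustration: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap, recording each start index."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 2500-character sample made of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
for chunk in chunks:
    print(chunk["start_index"], len(chunk["text"]))
```

Each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated as the first 200 characters of the next.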

(4) Embed with OpenAI (convert text into numeric vectors):

In [None]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

(5) Store the vectors:

In [None]:
from langchain_chroma import Chroma
vector_store = Chroma(embedding_function=embeddings)
ids = vector_store.add_documents(documents=all_splits)

(6) Return documents ranked by similarity to a string query:

In [None]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)
print(results[0])
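
As a rough picture of what a similarity search does, the sketch below ranks texts by cosine similarity between vectors. The toy_embed function is a hypothetical stand-in that counts words; a real embedding model such as text-embedding-3-large produces dense vectors that capture meaning, and Chroma may use a different distance metric by default. Nothing here calls LangChain or Chroma.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query, texts, k=1):
    """Return the k texts whose vectors are closest to the query vector."""
    query_vec = toy_embed(query)
    ranked = sorted(
        texts,
        key=lambda t: cosine_similarity(query_vec, toy_embed(t)),
        reverse=True,
    )
    return ranked[:k]


texts = [
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
]
top = toy_similarity_search("Which pets are independent?", texts, k=1)
print(top[0])
```

The query shares three words with the second text and only one with the first, so the second text is ranked first.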

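Below is a simplified RAG implementation (it is not a complete, tested program and is only meant to show the steps involved in building RAG):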

In [None]:
"""Build a RAG application"""
# Choose the chat model (OpenAI)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")

# Choose the embedding model (OpenAI)
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Choose the vector store (Chroma)
from langchain_chroma import Chroma
vector_store = Chroma(embedding_function=embeddings)

# 1. Load the documents
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

# Check that exactly one document was loaded
assert len(docs) == 1
print(f"Total characters: {len(docs[0].page_content)}")

# 2. Split the documents
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(docs)

print(f"Split blog post into {len(all_splits)} sub-documents.")

# 3. Store the text chunks
document_ids = vector_store.add_documents(documents=all_splits)

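# 4. Retrieve and generate
from typing import List, TypedDict
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate

# Assumption: the original cell never defines `prompt`; this is a minimal illustrative RAG prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the context below.\n\n{context}"),
    ("human", "{question}"),
])


class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


def retrieve(state: State):
    # Look up the chunks most similar to the question
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    # Join the retrieved chunks and pass them to the model as context
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}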
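
The retrieve and generate functions each read the state and return a partial update to it. One way to picture how such steps chain together is the plain-Python sketch below. Everything in it is a hypothetical stand-in: toy_retrieve matches on shared words instead of querying a vector store, toy_generate returns a fixed-format string instead of calling an LLM, and run_pipeline is a simple loop rather than any LangChain or LangGraph API.

```python
from typing import List, TypedDict


class ToyState(TypedDict, total=False):
    question: str
    context: List[str]
    answer: str


# Hypothetical mini corpus standing in for the vector store
PASSAGES = [
    "Task decomposition breaks a large task into smaller steps.",
    "Chroma stores embedding vectors for similarity search.",
]


def toy_retrieve(state: ToyState) -> ToyState:
    # Stand-in retriever: keep passages sharing at least one word with the question
    words = set(state["question"].lower().split())
    hits = [p for p in PASSAGES if words & set(p.lower().split())]
    return {"context": hits}


def toy_generate(state: ToyState) -> ToyState:
    # Stand-in for the LLM call: report what the step was given
    count = len(state["context"])
    return {"answer": f"Answered {state['question']!r} using {count} passage(s)."}


def run_pipeline(question: str) -> ToyState:
    # Run the steps in order, merging each partial update into the state
    state: ToyState = {"question": question}
    for step in (toy_retrieve, toy_generate):
        state.update(step(state))
    return state


result = run_pipeline("what is task decomposition")
print(result["answer"])
```

Because each step only returns the keys it produces, steps can be added or reordered without changing the others, as long as the keys they read are already in the state.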

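Now that we have seen how to build RAG with LangChain, let's look at Chroma operations in LangChain:

Reference: https://python.langchain.com/docs/integrations/vectorstores/chroma/

(1) Choose an embedding model and initialize from a client (a persistent client here)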

In [None]:
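# Choose the embedding model
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Initialize from a client; a persistent client is used here
import chromadb
persistent_client = chromadb.PersistentClient()
collection = persistent_client.get_or_create_collection("collection_name")
# Note: adding raw documents through the client embeds them with Chroma's default embedding function, not `embeddings`
collection.add(ids=["1", "2", "3"], documents=["a", "b", "c"])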

vector_store_from_client = Chroma(
    client=persistent_client,
    collection_name="collection_name",
    embedding_function=embeddings,
)

(2) Vector management operations

In [None]:
# 1. Add vectors
from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
    id=2,
)

documents = [
    document_1,
    document_2,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

# 2. Update vectors
updated_document_1 = Document(
    page_content="I had chocolate chip pancakes and fried eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

updated_document_2 = Document(
    page_content="The weather forecast for tomorrow is sunny and warm, with a high of 82 degrees.",
    metadata={"source": "news"},
    id=2,
)

vector_store.update_document(document_id=uuids[0], document=updated_document_1)
# You can also update multiple documents at once
vector_store.update_documents(
    ids=uuids[:2], documents=[updated_document_1, updated_document_2]
)

# 3. Delete vectors
vector_store.delete(ids=[uuids[-1]])

(3) Delete the Chroma database entirely

In [None]:
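import os
import shutil
# Delete the database at db_path (e.g. incident.db); a PersistentClient path is a directory
db_path = "/content/drive/MyDrive/Colab Notebooks/incident.db"
if os.path.isdir(db_path):
    shutil.rmtree(db_path)
else:
    os.remove(db_path)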

(4) Clear all vectors and metadata in the "pdf_pages" collection, while leaving the database itself in place

In [None]:
import chromadb

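# Initialize the Chroma client
# Note: chromadb.Client() is in-memory; for an on-disk database, use chromadb.PersistentClient(path=...) instead
client = chromadb.Client()

# Get an existing collection
collection = client.get_collection(name="pdf_pages")

# Clear all data in the collection (Collection has no reset() method, so delete every record by id)
all_ids = collection.get()["ids"]
if all_ids:
    collection.delete(ids=all_ids)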

(5) Reset the Chroma client (clears everything the client holds)

In [None]:
import chromadb
from chromadb.config import Settings
reports_client_path = r"<path to the .db file>"
reports_client = chromadb.PersistentClient(path=reports_client_path, settings=Settings(allow_reset=True))
reports_client.reset()

(6) List the collections in a Chroma database

In [None]:
import chromadb
from chromadb.config import Settings
reports_client_path = r"D:\good_good_study\111cs_study\Python\LLM-langchain\accident_report.db"
reports_client = chromadb.PersistentClient(path=reports_client_path, settings=Settings(allow_reset=True))
collections = reports_client.list_collections()
print(collections)