# Building Your Own Conversational LLM with LangChain

1. Accurate answers in a specific domain

2. Frequently updated data

3. Explainable, traceable generated content

4. Data privacy protection

In this example, we will use `LangChain`, `OpenAI (LLM)`, and a `vector DB` to build an LLM chat application of our own.

Core technique: __*Retrieval Augmented Generation*__ (RAG)


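## Environment Setup

You can use an official `openai_api_key`.

You can also go through a third-party proxy provider, in which case you also need to set `openai_api_base`.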
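The two options above can be handled by one small helper that reads the settings from environment variables. This is a minimal sketch: `resolve_openai_config` is a hypothetical helper, not part of LangChain or the OpenAI SDK, and the variable name `OPENAI_API_BASE` is an assumption about how you choose to store the proxy URL.

```python
import os


def resolve_openai_config(env=None):
    """Build keyword arguments for the chat model from environment variables.

    Hypothetical helper for illustration only; it is not a LangChain API.
    """
    env = os.environ if env is None else env

    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before creating the model")

    config = {"openai_api_key": api_key}

    # A custom base URL is only needed when going through a proxy provider
    # (assumed variable name: OPENAI_API_BASE).
    api_base = env.get("OPENAI_API_BASE")
    if api_base:
        config["openai_api_base"] = api_base

    return config
```

The returned dictionary uses the same keyword names as the `ChatOpenAI(...)` call in this notebook, so it could be passed along as `ChatOpenAI(model=..., **resolve_openai_config())`.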

In [1]:
! pip install -qU \
    langchain-openai \
    langchain-core \
    langchain-text-splitters \
    langchain-chroma \
    chromadb \
    langchain \
    openai \
    tiktoken

## Create a Chat Model (No RAG)

In [None]:
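import os
from getpass import getpass

from langchain_openai import ChatOpenAI

# Prompt for the key instead of hard-coding it in the notebook
OPENAI_API_KEY = getpass("OpenAI API key: ")

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

chat = ChatOpenAI(
    # Only needed when going through a proxy provider
    openai_api_base="https://************/v1",
    openai_api_key=os.environ['OPENAI_API_KEY'],
    # Use the model name exactly as your provider exposes it
    model="openai/gpt-3.5-turbo"
)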

In [3]:
from langchain_core.messages import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Knock knock."),
    AIMessage(content="Who's there?"),
    HumanMessage(content="Orange"),

]

In [4]:
res = chat.invoke(messages) 

print(res.content)

Orange who?


In [5]:
# Append the model's reply so it becomes part of the conversation history
messages.append(res)

messages.append(HumanMessage(content="guess"))

res = chat.invoke(messages) 

print(res.content)

Orange you glad I guessed correctly?
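The cells above work because the model receives the whole message list on every call, so each reply has to be appended before the next question is asked. The pattern can be sketched without any API call. `chat_turn` and `echo_model` are hypothetical names, and plain tuples stand in for LangChain message objects:

```python
def chat_turn(history, user_text, model):
    """Append the user message, call the model on the full history, store the reply."""
    history.append(("human", user_text))
    reply = model(history)  # the model sees every earlier turn
    history.append(("ai", reply))
    return reply


def echo_model(history):
    """Stand-in for chat.invoke: reports how many messages it received."""
    return f"I have seen {len(history)} messages"


history = [("system", "You are a helpful assistant.")]
print(chat_turn(history, "Knock knock.", echo_model))
print(chat_turn(history, "Orange", echo_model))
```

Because the history grows by two entries per turn, a long conversation will eventually exceed the model's context window and need trimming or summarizing.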


### Addressing the Limitations of LLMs

1. Prone to hallucination

2. Outdated information

3. Lack of deep domain-specific knowledge

In [6]:
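messages = [
    SystemMessage(content="You are a professional knowledge assistant."),
    HumanMessage(content="Do you know the baichuan2 model?"),
]

res = chat.invoke(messages)
print(res.content)

The baichuan2 model may refer to the Baichuan 2 model, but I cannot find any more specific information about it. Please provide more background or context so that I can help you better. If the baichuan2 model is a specific concept, product, or field, please provide more details and I will do my best to answer your question.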


In [7]:
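baichuan2_information = [
    "Baichuan 2 is a large-scale multilingual language model that focuses on training models that perform well across many languages, including but not limited to English. This enables Baichuan 2 to achieve significant performance gains on tasks in a variety of languages.",
    "Baichuan 2 was trained from scratch on a massive training dataset of 2.6 trillion tokens. Compared with earlier models, Baichuan 2 provides richer data resources and can therefore better support multilingual development and applications.",
    "Baichuan 2 not only performs well on general tasks but also shows excellent performance on tasks in specific domains (such as medicine and law). This provides strong support for domain-specific applications."
]

source_knowledge = "\n".join(baichuan2_information)
print(source_knowledge)

Baichuan 2 is a large-scale multilingual language model that focuses on training models that perform well across many languages, including but not limited to English. This enables Baichuan 2 to achieve significant performance gains on tasks in a variety of languages.
Baichuan 2 was trained from scratch on a massive training dataset of 2.6 trillion tokens. Compared with earlier models, Baichuan 2 provides richer data resources and can therefore better support multilingual development and applications.
Baichuan 2 not only performs well on general tasks but also shows excellent performance on tasks in specific domains (such as medicine and law). This provides strong support for domain-specific applications.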


In [8]:
query = "Do you know the baichuan2 model?"

prompt_template = f"""Answer the question based on the following content:

Content:
{source_knowledge}

Query: {query}"""

In [9]:
prompt = HumanMessage(
    content=prompt_template
)

messages.append(prompt)

res = chat.invoke(messages)

print(res.content)

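Yes, Baichuan 2 is a large-scale multilingual language model that focuses on training models that perform well across many languages. It was trained from scratch on a massive training dataset of 2.6 trillion tokens. Compared with earlier models, Baichuan 2 provides richer data resources, so it can better support multilingual development and applications. It performs well on general tasks and on tasks in specific domains (such as medicine and law), providing strong support for domain-specific applications.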
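The manual injection of context shown above can be wrapped in a small reusable function. This is a sketch under assumptions: `build_context_prompt` is a hypothetical helper, and the extra instruction to admit when the content lacks the answer is an addition that the original prompt does not contain.

```python
def build_context_prompt(snippets, query):
    """Join context snippets and the question into one prompt string.

    Hypothetical helper; the "say so" instruction is an assumption
    added for illustration, not part of the notebook's prompt.
    """
    # Drop empty snippets and surrounding whitespace before joining
    context = "\n".join(s.strip() for s in snippets if s.strip())
    return (
        "Answer the question based on the content below. "
        "If the content does not contain the answer, say so.\n\n"
        f"Content:\n{context}\n\n"
        f"Query: {query}"
    )
```

The returned string can be wrapped in `HumanMessage(content=...)` and appended to `messages` exactly as in the cell above.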


## Create a RAG Chat Model

### 1. Load the data

In [10]:
! pip install -qU langchain-chroma pypdf langchain_community

In [11]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("2309.10305v2.pdf")

pages = loader.load_and_split()

In [12]:
pages[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-09-21T00:15:31+00:00', 'author': '', 'keywords': '', 'moddate': '2023-09-21T00:15:31+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '2309.10305v2.pdf', 'total_pages': 28, 'page': 0, 'page_label': '1'}, page_content='Baichuan 2: Open Large-scale Language Models\nAiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan\nDian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai\nGuosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji\nJian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma\nMang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun\nTao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng\nXiaochuan Wang, Xiaoxi Chen, Xi

### 2. Chunk the knowledge: split the document into evenly sized chunks, each one a piece of the original text

In [13]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)

docs = text_splitter.split_documents(pages)

In [14]:
len(docs)

216
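To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class prefers to break on separators such as paragraphs and newlines); `sliding_window_chunks` is a hypothetical function that only illustrates how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Simplified illustration only; not LangChain's splitting algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice.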

### 3. Embed each text chunk with an embedding model and store the vectors in a vector database

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embed_model = OpenAIEmbeddings(
    openai_api_base="https://************/v1",
    openai_api_key=OPENAI_API_KEY,
    model="openai/text-embedding-3-small"
)
vectorstore = Chroma.from_documents(
    documents=docs, 
    embedding=embed_model, 
    collection_name="openai_embed"
)

### 4. Retrieve the k documents most relevant to the question by vector similarity

In [17]:
query = "How large is the baichuan2 vocabulary?"
result = vectorstore.similarity_search(query, k=2)
print(result)

[Document(id='ae1301cd-f41f-4619-b2a9-45c995b66cb2', metadata={'author': '', 'trapped': '/False', 'keywords': '', 'page_label': '2', 'title': '', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'source': '2309.10305v2.pdf', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.25', 'subject': '', 'total_pages': 28, 'page': 1, 'creationdate': '2023-09-21T00:15:31+00:00', 'moddate': '2023-09-21T00:15:31+00:00'}, page_content='languages, such as Chinese.\nIn this technical report, we introduce Baichuan\n2, a series of large-scale multilingual language\nmodels. Baichuan 2 has two separate models,\nBaichuan 2-7B with 7 billion parameters and\nBaichuan 2-13B with 13 billion parameters. Both\nmodels were trained on 2.6 trillion tokens, which\nto our knowledge is the largest to date, more than\ndouble that of Baichuan 1 (Baichuan, 2023b,a).\nWith such a massive amount of training data,'), Document(id='f9f8491b-90e2-4ba5-a840-184
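Conceptually, a similarity search scores the query vector against every stored vector and returns the best matches. The sketch below uses cosine similarity on made-up three-dimensional vectors. This is an assumption for illustration: the metric Chroma actually applies depends on how the collection is configured, real embeddings have far more dimensions, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=2):
    """Return the k (score, text) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index with made-up vectors, not real embeddings
toy_index = [
    ("vocabulary size", [0.9, 0.1, 0.0]),
    ("training tokens", [0.1, 0.9, 0.0]),
    ("safety alignment", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print([text for _, text in results])
```

A vector database does the same ranking with an index structure so that it does not have to compare the query against every stored vector.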

### 5. Combine the original `query` with the retrieved text and pass it to the language model to get the final answer

In [18]:
def augment_prompt(query: str):
  # Retrieve the top-3 text chunks
  results = vectorstore.similarity_search(query, k=3)
  source_knowledge = "\n".join([x.page_content for x in results])
  # Build the prompt
  augmented_prompt = f"""Using the contexts below, answer the query.

  contexts:
  {source_knowledge}

  query: {query}"""
  return augmented_prompt

print(augment_prompt(query))

Using the contexts below, answer the query.

  contexts:
  languages, such as Chinese.
In this technical report, we introduce Baichuan
2, a series of large-scale multilingual language
models. Baichuan 2 has two separate models,
Baichuan 2-7B with 7 billion parameters and
Baichuan 2-13B with 13 billion parameters. Both
models were trained on 2.6 trillion tokens, which
to our knowledge is the largest to date, more than
double that of Baichuan 1 (Baichuan, 2023b,a).
With such a massive amount of training data,
adequate training of each word embedding. We
have taken both these aspects into account. We
have expanded the vocabulary size from 64,000
in Baichuan 1 to 125,696, aiming to strike a
balance between computational efficiency and
model performance.
Tokenizer V ocab Size Compression Rate ↓
LLaMA 2 32,000 1.037
Bloom 250,680 0.501
ChatGLM 2 64,794 0.527
Baichuan 1 64,000 0.570
Baichuan 2 125,696 0.498
Table 2: The vocab size and text compression rate of
Baichuan 2: Open Large-scale Lang

In [19]:
# Build the prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)

messages.append(prompt)

res = chat.invoke(messages)

print(res.content)

Baichuan 2 has a vocabulary size of 125,696 words, as mentioned in the technical report.
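Since each retrieved `Document` carries a `page` entry in its metadata, the answer can be returned together with the pages it was drawn from, which makes the output traceable. The sketch below shows that flow with injected stand-ins so it runs without an API key. `answer_with_sources`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the toy corpus is made up.

```python
def answer_with_sources(query, retrieve, generate, k=3):
    """Retrieve context, build the augmented prompt, and report source pages.

    Hypothetical helper: `retrieve` returns (text, metadata) pairs and
    `generate` maps a prompt string to an answer string.
    """
    hits = retrieve(query, k)
    context = "\n".join(text for text, _ in hits)
    prompt = (
        "Using the contexts below, answer the query.\n\n"
        f"contexts:\n{context}\n\n"
        f"query: {query}"
    )
    pages = sorted({meta["page"] for _, meta in hits})
    return {"answer": generate(prompt), "source_pages": pages}


def fake_retrieve(query, k):
    """Stand-in for a vector store lookup; the corpus below is made up."""
    corpus = [
        ("Chunk about topic A.", {"page": 3}),
        ("Chunk about topic B.", {"page": 1}),
        ("Another chunk about topic A.", {"page": 3}),
    ]
    return corpus[:k]


def fake_generate(prompt):
    """Stand-in for the chat model call."""
    return f"(model reply to a {len(prompt)}-character prompt)"


result = answer_with_sources("What is topic A?", fake_retrieve, fake_generate)
print(result["source_pages"])
```

In this notebook, `retrieve` could wrap `vectorstore.similarity_search` and return `(doc.page_content, doc.metadata)` pairs, while `generate` could wrap `chat.invoke` and return `res.content`.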
