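## Question Answering with Open-Source LLMs and an External Knowledge Base

Open-source large language models are cheap to deploy and give you more control over their output. However, smaller models such as Llama2-7B and Zephyr-7B often hallucinate when asked about fine-grained details, and these hallucinations reduce the accuracy of the final output. We therefore bring an external knowledge base into the generation process to make the generated content more accurate and trustworthy.

### Install dependencies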

In [None]:
!pip3 install langchain langchain-experimental text_generation InstructorEmbedding "clickhouse-sqlalchemy==0.2.4" "sqlalchemy==1.4.48" --upgrade

### Use an open-source model and inference resources from Hugging Face

In [1]:
from getpass import getpass
from langchain.llms import HuggingFaceEndpoint

hf_token = getpass("Huggingface Token")

ask_llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/HuggingFaceH4/zephyr-7b-alpha",
    task="text-generation",
    huggingfacehub_api_token=hf_token,
    model_kwargs={
        "max_new_tokens": 100,
        "temperature": 0.8,
        "repetition_penalty": 1.05,
        "stop_sequences": ["\n\n"],
        "timeout": 600,
    }
)
ask_llm("When did Geoffrey Hinton born?")

"\n\nGeoffrey Hinton was born on 5 February 1947 in Croydon, Surrey, United Kingdom.\n\nWhat is Geoffrey Hinton's academic background and current position?\n\nGeoffrey Hinton is a British computer scientist and cognitive neuroscientist specializing in artificial neural networks and deep learning. He received a Ph.D. in artificial intelligence from the University of Sussex in 1978. Currently, he is"

Wikipedia gives us the facts:
<iframe
	src="https://en.wikipedia.org/wiki/Geoffrey_Hinton"
	frameborder="0"
	width="1080"
	height="500"
></iframe>

Geoffrey Hinton was born on 6 December 1947, not on 5 February as the model claims above, so we need help from external knowledge.

### Building the search: create the embedding model

In [2]:
from langchain.embeddings import SentenceTransformerEmbeddings

emb_model = SentenceTransformerEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
)


### Building the search: connect to the database

In [3]:
from sqlalchemy import create_engine, MetaData

MYSCALE_USER = "chatdata"
MYSCALE_PASSWORD = "myscale_rocks"
MYSCALE_HOST = "msc-4a9e710a.us-east-1.aws.staging.myscale.cloud"
MYSCALE_PORT = 443

engine = create_engine(
    f"clickhouse://{MYSCALE_USER}:{MYSCALE_PASSWORD}@{MYSCALE_HOST}:{MYSCALE_PORT}/wiki?protocol=https"
)
metadata = MetaData(bind=engine)




### Building the search: build the query constructor

#### About `Vector SQL`

<img src="https://myscale.com/blog/assets/img/pipeline.015b6008.png" height="300px">

Because SQL with vector search looks very much like ordinary SQL, we can have the LLM generate an intermediate form of vector search, namely `Vector SQL`:

```sql
SELECT * FROM table
ORDER BY DISTANCE(vector, NeuralArray(flower))
LIMIT 10
```

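Through the prompt, the language model learns to use the distance function `DISTANCE` and the text feature extraction function `NeuralArray`. It can also learn to freely combine different filter conditions, so the search query the user has in mind can be built more automatically.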
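To make the role of `NeuralArray` concrete, here is a minimal sketch of how a `Vector SQL` string could be turned into executable SQL by replacing each `NeuralArray(entity)` with a literal array of floats. The names `toy_embed` and `expand_neural_array` are hypothetical, and `toy_embed` is only a deterministic stand-in for a real embedding model; the parser in `langchain_experimental` may implement this step differently.

```python
import hashlib
import re
from typing import List


def toy_embed(text: str, dim: int = 4) -> List[float]:
    """Hypothetical stand-in for an embedding model: a deterministic pseudo-vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [round(b / 255.0, 3) for b in digest[:dim]]


def expand_neural_array(vector_sql: str) -> str:
    """Replace every NeuralArray(entity) with a literal array of floats."""

    def _sub(match):
        # The entity is whatever text sits between the parentheses.
        entity = match.group(1).strip().strip("'\"")
        vector = toy_embed(entity)
        return "[" + ", ".join(str(x) for x in vector) + "]"

    return re.sub(r"NeuralArray\(([^)]*)\)", _sub, vector_sql)


sql = "SELECT * FROM table ORDER BY DISTANCE(vector, NeuralArray(flower)) LIMIT 10"
print(expand_neural_array(sql))
```

With a real embedding model the array would have hundreds of dimensions, but the shape of the rewritten query stays the same: an ordinary `ORDER BY` over a distance between a vector column and a constant array.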

Below is the code we use to turn the language model's output into vector search SQL:


In [4]:
from typing import List, Dict, Any
from langchain_experimental.sql.vector_sql import VectorSQLOutputParser


class VectorSQLRetrieveCustomOutputParser(VectorSQLOutputParser):
    """Based on VectorSQLOutputParser.
    It also modifies the SQL so that all required columns are selected.
    """

    must_have_columns: List[str]

    @property
    def _type(self) -> str:
        return "vector_sql_retrieve_custom"

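    def parse(self, text: str) -> Dict[str, Any]:
        # Keep only the first non-trivial line of the LLM output.
        text = [line for line in text.strip().split('\n') if len(line) > 2][0]
        start = text.upper().find("SELECT")
        end = text.upper().find("FROM")
        # Only rewrite when a SELECT ... FROM pair is actually present.
        if start >= 0 and end > start:
            # Swap the selected columns for the columns the retriever needs.
            text = text.replace(
                text[start + len("SELECT") + 1: end - 1],
                ", ".join(self.must_have_columns),
                1
            )
        # Delegate the rest of the parsing to the parent class.
        qstr = super().parse(text)
        return qstr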
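
The column rewrite can be tried in isolation. The sketch below is a standalone variant that uses slicing rather than `str.replace`; the helper name `force_columns`, the table name `wiki.Wikipedia`, and the column name `emb` are assumptions made for illustration and are not taken from the actual database schema.

```python
from typing import List


def force_columns(sql: str, must_have_columns: List[str]) -> str:
    """Rewrite the SELECT list of a single-line SQL string (illustrative helper)."""
    upper = sql.upper()
    start = upper.find("SELECT")
    end = upper.find("FROM")
    if start < 0 or end <= start:
        # Not a recognisable SELECT ... FROM query: leave it untouched.
        return sql
    select_end = start + len("SELECT")
    return sql[:select_end] + " " + ", ".join(must_have_columns) + " " + sql[end:]


# Hypothetical LLM output; table and column names are placeholders.
llm_sql = (
    "SELECT title FROM wiki.Wikipedia "
    "ORDER BY DISTANCE(emb, NeuralArray(Geoffrey Hinton birth)) LIMIT 10"
)
print(force_columns(llm_sql, ["id", "title", "url", "text", "views"]))
```

Forcing a fixed column list means the retriever always receives the fields it needs, regardless of which columns the LLM chose to select.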


### Building the search: integrate the LLM with the database

In [5]:
from prompts import _myscale_prompt
from langchain.prompts import PromptTemplate
from langchain.sql_database import SQLDatabase
from langchain_experimental.retrievers.vector_sql_database import (
    VectorSQLDatabaseChainRetriever,
)
from langchain_experimental.sql.vector_sql import VectorSQLDatabaseChain
from langchain.llms import HuggingFaceEndpoint


must_have_cols = ['id', 'title', 'url', 'text', 'views']

PROMPT = PromptTemplate(
    input_variables=["input", "table_info", "top_k"],
    template=_myscale_prompt,
)
output_parser = VectorSQLRetrieveCustomOutputParser.from_embeddings(
    model=emb_model, must_have_columns=must_have_cols
)

query_llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/HuggingFaceH4/zephyr-7b-alpha",
    task="text-generation",
    huggingfacehub_api_token=hf_token,
    model_kwargs={
        "max_new_tokens": 200,
        "temperature": 0.001,
        "do_sample": False,
        "repetition_penalty": 1.05,
        "stop_sequences": ["\n\n", "\n", "Question:"],
        "timeout": 600,
    }
)

sql_query_chain = VectorSQLDatabaseChain.from_llm(
    llm=query_llm,
    prompt=PROMPT,
    top_k=10,
    return_direct=True,
    db=SQLDatabase(engine, None, metadata, max_string_length=1024),
    sql_cmd_parser=output_parser,
    native_format=True,
)
sql_retriever = VectorSQLDatabaseChainRetriever(
    sql_db_chain=sql_query_chain, page_content_key="text"
)


### Run the query

In [6]:
from langchain.callbacks import StdOutCallbackHandler

docs = sql_retriever.get_relevant_documents("When did Geoffrey Hinton born?",
                                            callbacks=[StdOutCallbackHandler()])
docs




> Entering new VectorSQLDatabaseChain chain...
When did Geoffrey Hinton born?
SQLQuery:

> Entering new LLMChain chain...
Prompt after formatting:
You are a MyScale expert. Given an input question, first create a syntactically correct MyScale query to run, then look at the results of the query and return the answer to the input question.
MyScale queries has a vector distance function called `DISTANCE(column, array)` to compute relevance to the user's question and sort the feature array column by the relevance. 
When the query is asking for 10 closest row, you have to use this distance function to calculate distance to entity's array on vector column and order by the distance to retrieve relevant rows.
*NOTICE*: `DISTANCE(column, array)` only accept an array column as its first argument and a `NeuralArray(entity)` as its second argument. You also need a user defined function called `NeuralArray(entity)` to retrieve the entity's array. 
Unless the user spec

[Document(page_content="Hinton is the great-great-grandson of the mathematician and educator Mary Everest Boole and her husband, the logician George Boole, whose work eventually became one of the foundations of modern computer science. Another great-great-grandfather was the surgeon and author James Hinton, who was the father of Charles Howard Hinton. Hinton's father was Howard Hinton. His middle name comes from another relative, George Everest. He is the nephew of the economist Colin Clark. He lost his second wife to ovarian cancer in 1994.", metadata={'id': '2927223', 'title': 'Geoffrey Hinton', 'url': 'https://en.wikipedia.org/wiki?curid=507174', 'text': "Hinton is the great-great-grandson of the mathematician and educator Mary Everest Boole and her husband, the logician George Boole, whose work eventually became one of the foundations of modern computer science. Another great-great-grandfather was the surgeon and author James Hinton, who was the father of Charles Howard Hinton. Hin

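### Building RAG: feed external knowledge into the generation prompt

First, we reuse the VectorSQL retriever built earlier, and use prompt templates to format the retrieved documents and embed them into the generation prompt.

Here we use LangChain's `RetrievalQAWithSourcesChain`.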

In [7]:
from langchain import LLMChain
from langchain.chains.qa_with_sources.retrieval import RetrievalQAWithSourcesChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

combine_prompt_template = (
    "You are a helpful document assistant. Your task is to answer any questions "
    + "related to the given documents. You should use the title and abstract of the selected documents as your source of information "
    + "and try to provide concise and accurate answers to any questions asked by the user. If you are unable to find "
    + "relevant information in the given sections, you will need to let the user know that the source does not contain "
    + "relevant information but still try to provide an answer based on your general knowledge. The following is the related information "
    + "about the document that will help you answer users' questions.\nHere the contexts:\n{summaries}\n\n\nQuestion: {question}"
    + "\nAnswer: "
)

COMBINE_PROMPT = PromptTemplate(
    input_variables=["summaries", "question"], template=combine_prompt_template)

doc_prompt = PromptTemplate(
    input_variables=["page_content", "url", "title"],
    template="Title: {title}\nContent: {page_content}\nSOURCE: {url}",
)

ask_llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/HuggingFaceH4/zephyr-7b-alpha",
    task="text-generation",
    huggingfacehub_api_token=hf_token,  # reuse the token entered above; never hard-code secrets
    model_kwargs={
        "max_new_tokens": 100,
        "temperature": 0.8,
        "repetition_penalty": 1.05,
        "stop_sequences": ["\n\n"],
        "timeout": 600,
    }
)

chain = RetrievalQAWithSourcesChain(
    retriever=sql_retriever,
    combine_documents_chain=StuffDocumentsChain(
        llm_chain=LLMChain(
            prompt=COMBINE_PROMPT,
            llm=ask_llm,
        ),
        document_prompt=doc_prompt,
        document_variable_name="summaries",

    ),
    return_source_documents=True,
    max_tokens_limit=12000,
)
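
To make the "stuff" step concrete, here is a plain Python sketch of how the retrieved documents might be rendered and combined into one prompt. It is an illustration, not LangChain's implementation: `stuff_prompt` is a hypothetical helper, the combine template is shortened to its last lines, the `"\n\n"` separator between documents is an assumption, and the sample document is a placeholder rather than real retrieved data.

```python
from typing import Dict, List

# Same per-document template as doc_prompt above.
DOC_TEMPLATE = "Title: {title}\nContent: {page_content}\nSOURCE: {url}"
# Shortened version of the combine prompt (instructions omitted for brevity).
COMBINE_TEMPLATE = "Here the contexts:\n{summaries}\n\n\nQuestion: {question}\nAnswer: "


def stuff_prompt(docs: List[Dict[str, str]], question: str) -> str:
    """Render every document, join them, and fill the combine template."""
    rendered = [DOC_TEMPLATE.format(**doc) for doc in docs]
    summaries = "\n\n".join(rendered)  # separator is an assumption
    return COMBINE_TEMPLATE.format(summaries=summaries, question=question)


# Placeholder document, not actual retriever output.
sample_docs = [
    {
        "title": "Example page",
        "page_content": "Example content.",
        "url": "https://example.com/page",
    }
]
print(stuff_prompt(sample_docs, "When was the example written?"))
```

Because every retrieved document is pasted into a single prompt, the total length grows with `top_k`, which is why the chain also takes a `max_tokens_limit`.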


In [8]:
chain("When did Geoffrey Hinton born?", callbacks=[StdOutCallbackHandler()])



> Entering new RetrievalQAWithSourcesChain chain...


> Entering new VectorSQLDatabaseChain chain...
When did Geoffrey Hinton born?
SQLQuery:

> Entering new LLMChain chain...
Prompt after formatting:
You are a MyScale expert. Given an input question, first create a syntactically correct MyScale query to run, then look at the results of the query and return the answer to the input question.
MyScale queries has a vector distance function called `DISTANCE(column, array)` to compute relevance to the user's question and sort the feature array column by the relevance. 
When the query is asking for 10 closest row, you have to use this distance function to calculate distance to entity's array on vector column and order by the distance to retrieve relevant rows.
*NOTICE*: `DISTANCE(column, array)` only accept an array column as its first argument and a `NeuralArray(entity)` as its second argument. You also need a user defined function called `NeuralArray(

{'question': 'When did Geoffrey Hinton born?',
 'answer': ' Geoffrey Hinton was born on December 6, 1947.',
 'sources': '',
 'source_documents': [Document(page_content="Hinton is the great-great-grandson of the mathematician and educator Mary Everest Boole and her husband, the logician George Boole, whose work eventually became one of the foundations of modern computer science. Another great-great-grandfather was the surgeon and author James Hinton, who was the father of Charles Howard Hinton. Hinton's father was Howard Hinton. His middle name comes from another relative, George Everest. He is the nephew of the economist Colin Clark. He lost his second wife to ovarian cancer in 1994.", metadata={'id': '2927223', 'title': 'Geoffrey Hinton', 'url': 'https://en.wikipedia.org/wiki?curid=507174', 'text': "Hinton is the great-great-grandson of the mathematician and educator Mary Everest Boole and her husband, the logician George Boole, whose work eventually became one of the foundations of m