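# RAG: Retrieval-Augmented Generation
The document question answering with an external knowledge base (QA based Documents) covered earlier is already a very basic form of RAG. This section extends it with two important features:

1. Bring knowledge from internet search into the context
2. Use a custom LLM

## Web search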

In [1]:
# Web search engine
!pip install duckduckgo_search

Collecting duckduckgo_search
  Downloading duckduckgo_search-6.1.7-py3-none-any.whl.metadata (19 kB)
Collecting pyreqwest-impersonate>=0.4.8 (from duckduckgo_search)
  Downloading pyreqwest_impersonate-0.4.8-cp310-none-win_amd64.whl.metadata (9.9 kB)
Downloading duckduckgo_search-6.1.7-py3-none-any.whl (24 kB)
Downloading pyreqwest_impersonate-0.4.8-cp310-none-win_amd64.whl (2.6 MB)

In [2]:
# Try it out first
from duckduckgo_search import DDGS

with DDGS() as ddgs:
    for r in ddgs.text('床前明月光', region='cn-zh', safesearch='off', timelimit='y', max_results=5):
        print(r)

{'title': '"床前明月光，疑是地上霜。举头望明月，低头思故乡。"全诗 ...', 'href': 'https://www.kekeshici.com/shicimingju/ticai/sixiang/38404.html', 'body': '本网页介绍了唐代诗人李白的名篇《静夜思》，分析了诗句的出处、释义、点评、鉴赏、题解等内容，展示了诗人的思乡之情和诗歌的魅力。如果你想了解李白的《静夜思》，这里有详细的解释和例子。'}
{'title': '李白《静夜思》赏析', 'href': 'https://baijiahao.baidu.com/s?id=1778009756954458586', 'body': '本文分析了李白的代表作《静夜思》，介绍了诗中的对仗、比喻、辞藻等修辞手法，以及诗中的情感和意境。文章还对比了《静夜思》与其他唐代诗词，评价了诗词的艺术和历史价值。'}
{'title': '李白静夜思全文、注释、翻译和赏析_唐代', 'href': 'https://www.yuwenmi.com/shici/shiren/5143436.html', 'body': '李白静夜思全文、注释、翻译和赏析_唐代. 静夜思. 朝代：唐代|作者：李白. 床前明月光，疑是地上霜。 举头望明月，低头思故乡。 译文/注释. 直译. 明亮的月光洒在床前的窗户纸上，好像地上泛起了一层霜。 我禁不住抬起头来，看那天窗外空中的一轮明月，不由得低头沉思，想起远方的家乡。 韵译. 皎洁月光洒满床，恰似朦胧一片霜。 仰首只见月一轮，低头教人倍思乡。 注释. ⑴静夜思：静静的夜里，产生的思绪。 ⑵床：今传五种说法。 一指井台。 已经有学者撰文考证过。 中国教育家协会理事程实将考证结果写成论文发表在刊物上，还和好友创作了《诗意图》。 二指井栏。 从考古发现来看，中国最早的水井是木结构水井。 古代井栏有数米高，成方框形围住井口，防止人跌入井内，这方框形既像四堵墙，又像古代的床。'}
{'title': '床前明月光的，下一句是什么？ - 知乎', 'href': 'https://www.zhihu.com/question/642156101', 'body': '" 床前明月光 "的下一句是"疑是地上霜"，这两句诗出自唐代诗人李白的《 静夜思 》。 这首诗用简洁而又朴素的

In [3]:
def search_web(keywords, region='cn-zh', max_results=3):
    """Search the web and return the result snippets joined by newlines."""
    web_content = ""
    with DDGS() as ddgs:
        ddgs_gen = ddgs.text(keywords=keywords, region=region, safesearch='off',
                            timelimit='y', max_results=max_results)
        # Keep only each result's 'body' (the text snippet)
        for r in ddgs_gen:
            web_content += (r['body'] + '\n')

    return web_content
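
`search_web` keeps only the `body` of each result. If the prompt should also record where a snippet came from, one option is to format the `title`, `href` and `body` fields seen in the output above. This is a minimal sketch: `format_results` and its `max_chars` cap are hypothetical names, not part of `duckduckgo_search`, and the sample data is hand-written.

```python
from typing import Dict, Iterable


def format_results(results: Iterable[Dict[str, str]], max_chars: int = 300) -> str:
    """Render search hits as numbered snippets with their source URL.

    `results` is assumed to be dicts with 'title', 'href' and 'body' keys,
    the shape printed by ddgs.text() above. `max_chars` caps each body.
    """
    lines = []
    for i, r in enumerate(results, start=1):
        body = r.get("body", "")[:max_chars]
        lines.append(f"[{i}] {r.get('title', '')} ({r.get('href', '')})\n{body}")
    return "\n".join(lines)


# Hand-written sample data, not real search output.
sample = [
    {"title": "A", "href": "https://example.com/a", "body": "first snippet"},
    {"title": "B", "href": "https://example.com/b", "body": "second snippet"},
]
print(format_results(sample))
```

Numbering the snippets makes it possible to ask the model to cite them, at the cost of a slightly longer prompt.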

## Custom LLM

- Using ChatGLM

In [5]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv()) # read local .env file

In [6]:
zhipu_api_key = os.environ["ZHIPUAI_API_KEY"]

In [11]:


# Imports
from typing import Any, List, Mapping, Optional, Dict
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from zhipuai import ZhipuAI

# Custom LLM that subclasses langchain.llms.base.LLM
class ZhipuAILLM(LLM):
    # Model name; defaults to glm-3-turbo
    model: str = "glm-3-turbo"
    # Sampling temperature
    temperature: float = 0.1
    # API key (read from the environment above)
    api_key: str = zhipu_api_key
    
    def _call(self, prompt : str, stop: Optional[List[str]] = None,
                run_manager: Optional[CallbackManagerForLLMRun] = None,
                **kwargs: Any):
        client = ZhipuAI(
            api_key = self.api_key
        )

        def gen_glm_params(prompt):
            '''
            Build the `messages` request parameter for the GLM model.

            Args:
                prompt: the user prompt
            '''
            messages = [{"role": "user", "content": prompt}]
            return messages
        
        messages = gen_glm_params(prompt)
        response = client.chat.completions.create(
            model = self.model,
            messages = messages,
            temperature = self.temperature
        )

        if len(response.choices) > 0:
            return response.choices[0].message.content
        return "generate answer error"


    # Property that returns the default parameters
    @property
    def _default_params(self) -> Dict[str, Any]:
        """Get the default parameters for calling the API."""
        normal_params = {
            "temperature": self.temperature,
            }
        return {**normal_params}

    @property
    def _llm_type(self) -> str:
        return "Zhipu"

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {**{"model": self.model}, **self._default_params}
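
`_call` receives a `stop` list, but the code above never applies it. If stop sequences matter, one approach is to cut the returned text at the first occurrence of any of them before returning. The sketch below is plain Python; `truncate_at_stop` is a hypothetical helper, not a `zhipuai` or LangChain function.

```python
from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    """Return `text` cut just before the earliest stop sequence, if any."""
    if not stop:
        return text
    # Positions of every non-empty stop sequence that occurs in the text.
    hits = [pos for pos in (text.find(s) for s in stop if s) if pos != -1]
    return text[:min(hits)] if hits else text


print(truncate_at_stop("Answer: 42\nObservation: ...", ["\nObservation:"]))
```

Inside `_call`, this would wrap `response.choices[0].message.content` before it is returned.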

In [16]:
# iFlytek Spark
from langchain_community.chat_models import ChatSparkLLM

spark_appid = os.environ['spark_appid']
spark_api_secret = os.environ['spark_api_secret']
spark_api_key = os.environ['spark_api_key']

llm = ChatSparkLLM(
    spark_app_id=spark_appid, spark_api_key=spark_api_key, spark_api_secret=spark_api_secret
)

llm.invoke('你好')

AIMessage(content='你好！有什么我可以帮忙的吗？', response_metadata={'token_usage': {'question_tokens': 1, 'prompt_tokens': 1, 'completion_tokens': 7, 'total_tokens': 8}}, id='run-8f675e5c-ad2b-441b-82f7-73a5d63411eb-0')

## Knowledge base loading and embedding

In [13]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('./data/wonderland.txt', encoding='utf-8')
doc = loader.load()
print(f"You have {len(doc)} document")
print(f"You have {len(doc[0].page_content)} characters in that document")

# Split the document into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=400
)
docs = text_splitter.split_documents(doc)

# Total number of characters, used to compute the average chunk length
num_total_characters = sum([len(x.page_content) for x in docs])
print(f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f} characters (smaller pieces)")

# Embed the chunks and store the vectors with the raw text in a vector index (a pseudo database)
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS

embeddings = SentenceTransformerEmbeddings(model_name="D:/code/models/M3E/xrunda/m3e-base")
vector_store = FAISS.from_documents(docs, embeddings)

vector_store

You have 1 document
You have 13637 characters in that document
Now you have 6 documents that have an average of 2,272 characters (smaller pieces)


<langchain_community.vectorstores.faiss.FAISS at 0x1fdd656d1e0>
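
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks will differ from these. `sliding_chunks` is a hypothetical name.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.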

In [14]:
# Chain
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def get_knowledge_based_answer(query,
                               vector_store,
                               VECTOR_SEARCH_TOP_K,
                               history_len,
                               temperature,
                               top_p,
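                               chat_history=None):
    # NOTE: VECTOR_SEARCH_TOP_K, history_len, temperature, top_p and chat_history
    # are accepted but not used by the code below.
    chat_history = chat_history or []

    # Fetch web search snippets for the query; they are inlined into the prompt.
    web_content = search_web(query)

    # The Chinese prompt below reads: "Based on the known information below, answer the
    # question at the end concisely and professionally. If no answer can be derived from it,
    # say 'The question cannot be answered from the known information' or 'Not enough
    # relevant information was provided'; do not add fabricated content to the answer."
    # Its sections are: known web search content / known local knowledge base content / question.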
    prompt_template = f"""基于以下已知信息，简洁和专业的来回答末尾的问题。
                        如果无法从中得到答案，请说 "根据已知信息无法回答该问题" 或 "没有提供足够的相关信息"，不允许在答案中添加编造成分。
                        已知网络检索内容：{web_content}""" + """
                        已知本地知识库内容:
                        {context}
                        问题:
                        {question}"""   
        
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

    #llm.history = chat_history[-history_len:] if history_len > 0 else []

    
    knowledge_chain = RetrievalQA.from_llm(
        llm=llm,
        #retriever=vector_store.as_retriever(search_kwargs={"k": VECTOR_SEARCH_TOP_K}),
        retriever=vector_store.as_retriever(),
        prompt=prompt, 
        verbose=True)    

    knowledge_chain.combine_documents_chain.document_prompt = PromptTemplate(
        input_variables=["page_content"], template="{page_content}")

    knowledge_chain.return_source_documents = True


    print(f"-> web_content: {web_content}, prompt: {prompt}, query: {query}" )
    result = knowledge_chain({"query": query})
    return result
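
Because `web_content` is spliced into the template with an f-string, any `{` or `}` inside a search snippet becomes part of the template and may be read as a placeholder. One way to guard against this, assuming the default f-string-style template format, is to double the braces first. The sketch below uses plain `str.format` to show the effect; `escape_braces` is a hypothetical helper and the snippet is made up.

```python
def escape_braces(text: str) -> str:
    """Double curly braces so format-style templates treat them literally."""
    return text.replace("{", "{{").replace("}", "}}")


web_content = 'JSON looks like {"key": "value"}'  # made-up snippet

# Same construction as above: fixed text + web content, then the placeholders.
template = f"Web: {escape_braces(web_content)}\n" + "Context: {context}\nQuestion: {question}"

print(template.format(context="local docs", question="What is JSON?"))
```

Without the escaping step, formatting this template would fail because `"key"` would be looked up as a template variable.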

In [17]:
query = "Who is the author of the Alice in Wonderland?"
resp = get_knowledge_based_answer(
    query=query,
    vector_store=vector_store,
    VECTOR_SEARCH_TOP_K=6,
    chat_history=[],
    history_len=0,
    temperature=0.1,
    top_p=0.9,
)
print(resp)

-> web_content: He is best known as the author of the children's book Alice's Adventures in Wonderland (1865) and its sequel Through the Looking-Glass (1871)—two of the most popular works of fiction in the English language.
As Charles L. Dodgson, he was the author of a fair number of books on mathematics, none of enduring importance, although Euclid and His Modern Rivals (1879) is of some historical interest.
Lewis Carroll was an English author, poet and mathematician famous for Alice's Adventures in Wonderland and Through the Looking-Glass. He was known for his mastery of wordplay, logic and fantasy. Lewis Carroll's Profile. The Origins of Alice's Adventures in Wonderland.
, prompt: input_variables=['context', 'question'] template='基于以下已知信息，简洁和专业的来回答末尾的问题。\n                        如果无法从中得到答案，请说 "根据已知信息无法回答该问题" 或 "没有提供足够的相关信息"，不允许在答案中添加编造成分。\n                        已知网络检索内容：He is best known as the author of the children\'s book Alice\'s Adventures in Wonderland (1865) and its sequel T

> Finished chain.
{'query': 'Who is the author of the Alice in Wonderland?', 'result': 'Lewis Carroll', 'source_documents': [Document(page_content='Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?” So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wo