## I. Text Summarization

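- Handing an LLM a piece of text and asking it for a summary is one of the most common use cases.
- The most popular application of this kind at the moment is probably ChatPDF.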

### 1. Summarizing Short Text

In [6]:
# Load the API key and other configuration (imported here from the local `models` module)

import sys
sys.path.append("../")
from models import OPENAI_API_KEY

In [7]:

# Summaries Of Short Text

from langchain.llms import OpenAI
from langchain import PromptTemplate

llm = OpenAI(temperature=0, model_name="text-davinci-003", openai_api_key=OPENAI_API_KEY)  # Initialize the LLM

# Create the template
template = """
%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:
{text}
"""

# Create a LangChain PromptTemplate; values can be inserted into it later
prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)
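
As a rough mental model, filling this template amounts to substituting the `{text}` placeholder in the string. The sketch below reproduces that with plain `str.format` and does not use LangChain at all; `render_prompt` is a hypothetical helper name, and the sample sentence is made up for illustration.

```python
# Plain-Python stand-in for filling the prompt template (illustrative only,
# not LangChain's implementation).
TEMPLATE = """
%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:
{text}
"""


def render_prompt(text: str) -> str:
    """Substitute the single `text` placeholder into the template."""
    return TEMPLATE.format(text=text)


# Hypothetical sample input, used only to show the substitution.
rendered = render_prompt("Prototaxites puzzled scientists for 130 years.")
print(rendered)
```

Keeping the instructions and the input in separate, labelled sections of the template makes it easy to reuse the same instructions for any text.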

In [8]:
long_text = """
For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”
"""

In [9]:
print("------- Prompt Begin -------")
# Print the rendered prompt
final_prompt = prompt.format(text=long_text)
print(final_prompt)

print("------- Prompt End -------")

------- Prompt Begin -------

%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:

For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”


------- Prompt End -------


In [10]:
output = llm(final_prompt)
print(output)


For 130 years, people argued about what Prototaxites was. Some thought it was a lichen, some thought it was a fungus, and some thought it was a tree. But no one could agree. It was so big that it was hard to figure out what it was.


### 2. Summarizing Long Text

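- For short text, we can run the summary directly, as shown above.
- Text longer than the LLM's maximum token size cannot be handled this way. For the gpt-3.5 token limits, see https://platform.openai.com/docs/models/gpt-3-5
- LangChain provides an out-of-the-box tool for long text: load_summarize_chain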

In [7]:
# Summaries Of Longer Text

from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

In [13]:
with open('../data/alice_in_wonderland.txt', 'r') as file:
    text = file.read()  # the text is Alice's Adventures in Wonderland

# Print the first 285 characters of the novel
print(text[:285])

Chapter 1 Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' th


In [None]:
!pip install tiktoken # tokenizer dependency used for counting tokens

In [14]:
num_tokens = llm.get_num_tokens(text)

print(f"There are {num_tokens} tokens in your file, file_size: {len(text)}")
# The full text is roughly 26,000 tokens
# Clearly, this much text cannot be sent to the LLM in a single request

There are 25927 tokens in your file, file_size: 104171
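
To make the gap concrete, the arithmetic below compares the reported token count with a context window. It is only a sketch: the 4097-token limit and the 500 tokens reserved for instructions and output are assumptions chosen for illustration, so check the model documentation for the model you actually call.

```python
import math

num_chars = 104171   # file size reported above
num_tokens = 25927   # token count reported above

# ASSUMPTION: context window of the model, in tokens.
CONTEXT_LIMIT = 4097
# ASSUMPTION: tokens set aside for the instructions and the generated summary.
RESERVED_TOKENS = 500

chars_per_token = num_chars / num_tokens
usable_tokens = CONTEXT_LIMIT - RESERVED_TOKENS
fits_in_one_call = num_tokens <= usable_tokens
min_chunks = math.ceil(num_tokens / usable_tokens)

print(f"chars per token: {chars_per_token:.2f}")
print(f"fits in one call: {fits_in_one_call}")
print(f"minimum number of chunks: {min_chunks}")
```

Under these assumptions the text averages about 4 characters per token and needs at least 8 separate requests, which is why the text has to be split first.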


In [15]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=5000, chunk_overlap=350)
# I use RecursiveCharacterTextSplitter here, but other splitters work as well
docs = text_splitter.create_documents([text])

# docs_size = [len(doc.page_content) for doc in docs]

print(f"You now have {len(docs)} docs instead of 1 piece of text")

# Print the character length of each chunk
for doc in docs:
    print(len(doc.page_content))

You now have 18 docs instead of 1 piece of text
30
11173
27
10835
39
9050
43
13760
35
11594
24
13636
25
12438
32
11209
31
10181
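
Several of the lengths above are well beyond `chunk_size=5000`. One plausible explanation is that a passage containing none of the listed separators cannot be split any further and is emitted whole. The sketch below is a simplified recursive splitter written for illustration; it is not LangChain's implementation, it omits overlap, and `recursive_split` is a hypothetical name.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator, recursing into oversized pieces.

    Simplified sketch: no overlap, small neighbours are merged greedily,
    and a piece with no separators left is returned whole.
    """
    if len(text) <= chunk_size or not separators:
        return [text] if text else []

    first, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(first):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what has been collected, then try the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
        elif current and len(current) + len(first) + len(piece) > chunk_size:
            # Adding this piece would overflow the chunk, so start a new one.
            chunks.append(current)
            current = piece
        else:
            current = piece if not current else current + first + piece
    if current:
        chunks.append(current)
    return chunks


# Toy input: short titles followed by bodies that contain no newline.
sample = "Chapter 1\n" + "a" * 50 + "\nChapter 2\n" + "b" * 8
chunks = recursive_split(sample, ["\n\n", "\n"], chunk_size=20)
print([len(c) for c in chunks])  # [9, 50, 18]
```

The 50-character body stays in one piece because it has no newline to split on, mirroring the pattern of a short title followed by an oversized chapter. Adding finer separators such as `" "` would let such a splitter break long passages further.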


In [16]:
# Set up the LangChain chain
# Use chain_type='map_reduce' so that the summaries of multiple documents are merged into one
chain = load_summarize_chain(llm=llm, chain_type='map_reduce')  # pass verbose=True to show the run log

In [17]:
# Use it. This will run through the 18 documents, summarize the chunks, then get a summary of the summaries.
# A typical map-reduce approach: split the article into parts, summarize each part, then merge by summarizing the summaries
output = chain.run(docs)
print(output)
# Try it yourself

 In Alice's Adventures in Wonderland, Alice follows a White Rabbit down a rabbit hole and finds herself in a surreal world. She meets a variety of strange creatures and has a series of bizarre adventures as she attempts to find her way home. Along the way, she encounters a Caterpillar, a Pigeon, a talking pig, and a pepper shaker, and attends a mad tea-party with the Mad Hatter, the March Hare, and the Dormouse. Eventually, she is brought to the court of the King and Queen of Hearts, where the White Rabbit reveals that the Knave of Hearts is the one who stole the tarts. Alice is released and the Knave is brought to trial.
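
The map-reduce idea itself can be sketched without any LLM. In the toy version below, `first_sentence` and `join_summaries` are hypothetical stand-ins for the two model calls (summarize one chunk, then summarize the summaries); this illustrates the control flow only and is not how `load_summarize_chain` is implemented.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for the per-chunk LLM call: keep the first sentence."""
    head = text.strip().split(". ")[0]
    return head if head.endswith(".") else head + "."


def join_summaries(summaries: List[str]) -> str:
    """Hypothetical stand-in for the final LLM call that merges the summaries."""
    return " ".join(summaries)


def map_reduce_summarize(
    chunks: List[str],
    summarize_chunk: Callable[[str], str],
    combine: Callable[[List[str]], str],
) -> str:
    # Map: summarize every chunk independently of the others.
    partial_summaries = [summarize_chunk(chunk) for chunk in chunks]
    # Reduce: merge the partial summaries into a single result.
    return combine(partial_summaries)


# Made-up sample chunks, used only to exercise the flow.
chunks = [
    "Alice follows a rabbit. She falls for a long time.",
    "Alice attends a tea party. The Hatter asks riddles.",
]
result = map_reduce_summarize(chunks, first_sentence, join_summaries)
print(result)  # Alice follows a rabbit. Alice attends a tea party.
```

Because each chunk is summarized independently, the map step needs one request per chunk, and only the much shorter partial summaries have to fit into the final request.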
