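# SFT Data

The SFT data was built mostly automatically with ChatGPT. Using a prompt that includes examples, I first generated many questions about one book in a single request, then generated an answer for each question in a separate request.

Splitting the work this way keeps each request short in tokens and noticeably improves the quality and accuracy of the answers.

The dataset contains nearly 2k question-answer pairs in total, covering three books: 《天龙八部》, 《射雕英雄传》, and 《神雕侠侣》.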

## 1. Generating Questions

```python
import openai
import jsonlines

# Full list of the 15 novels, kept for reference:
# titles = ['书剑恩仇录', '侠客行', '倚天屠龙记', '天龙八部', '射雕英雄传', '白马啸西风', '碧血剑', '神雕侠侣', '笑傲江湖', '越女剑', '连城诀', '雪山飞狐', '飞狐外传', '鸳鸯刀', '鹿鼎记']
titles = ['天龙八部', '射雕英雄传', '神雕侠侣']

for title in titles:
    # Prompt (in Chinese): ask for 60 questions about the book in four categories
    # (characters, plot, relationships, other), one question per numbered line.
    user_content = f"""你好！你能帮我生成60个关于《{title}》的问题吗？有以下四个方面及其例子供你参考：
    1. 人物介绍
        请你介绍一下《{title}》的主角。
    2. 情节
        请你讲一下{title}中有名的比武的情节。
    3. 人物关系
        小龙女和杨过之间是什么关系？他们之间发生了什么故事？
    4. 其他
        《{title}》讲了什么故事？

    请你指明问题对应的书名，按照以下格式生成，并且不要输出额外的任何内容：
    00. xxx
    01. xxx
    02. xxx
    03. xxx
    ...
    58. xxx
    59. xxx"""

    messages = [
        {"role": "system", "content": "你是一位研究金庸武侠小说的专家。"},
        {"role": "user", "content": user_content},
    ]
    response = ''
    while response == '':
        try:
            # Chat models go through ChatCompletion in the legacy (pre-1.0) openai SDK;
            # the response indexing below matches that endpoint.
            result = openai.ChatCompletion.create(
                model="gpt-3.5-turbo", messages=messages, temperature=0.9
            )
            response = result['choices'][0]['message']['content']
            print(response)
        except Exception as e:
            # `result` may be unbound if the request itself failed, so print the error.
            print('error:', e)
            response = ''

    # Parse the response: drop blank lines and the 4-character "NN. " prefix.
    questions = [q.strip()[4:] for q in response.split('\n') if q.strip()]

    # Append the questions to the JSONL file with empty answers.
    with jsonlines.open('questions.jsonl', mode='a') as writer:
        for q in questions:
            writer.write({'Question': q, 'Answer': ''})
```
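The fixed-width slice used for parsing assumes that every non-empty line starts with exactly two digits, a dot, and a space. A minimal sketch of a more tolerant parser is shown below; `parse_questions` and `NUMBERED_LINE` are hypothetical names that do not appear in the original notebook, and the set of accepted separators is an assumption.

```python
import re

# Hypothetical helper (not part of the original notebook): keep only lines
# shaped like "<number><separator> <text>" and return the text part.
NUMBERED_LINE = re.compile(r"^\s*(\d+)\s*[.、．]\s*(.+?)\s*$")

def parse_questions(response: str) -> list[str]:
    questions = []
    for line in response.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:  # skip blank lines and any extra chatter from the model
            questions.append(match.group(2))
    return questions

sample = "00. 《天龙八部》讲了什么故事？\n\n1. 请你介绍一下《天龙八部》的主角。\n好的，以上是问题。"
print(parse_questions(sample))
```

Lines without a leading number are dropped, so a stray closing remark from the model does not end up in `questions.jsonl` as a truncated question.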

## 2. Generating Answers

```python
import openai
import json
import jsonlines

def generate_answer(question):
    """Ask the chat model to answer one question, retrying until it returns text."""
    answer = ''
    # Prompt (in Chinese): answer the question and output nothing but the answer.
    user_content = "请你回答下面这个问题，除了回答本身不需要输出其他任何内容。\n问题："+question+"\n\n你的回答："
    messages = [
        {"role": "system", "content": "你是一位研究金庸武侠小说的专家。"},
        {"role": "user", "content": user_content},
    ]
    while answer == '':
        try:
            # Chat models go through ChatCompletion in the legacy (pre-1.0) openai SDK.
            result = openai.ChatCompletion.create(
                model="gpt-3.5-turbo", messages=messages, temperature=0.8
            )
            answer = result['choices'][0]['message']['content']
        except Exception as e:
            print('error:', e)
            answer = ''
    return answer

with open('questions.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        data = json.loads(line)
        question = data['Question']
        print(question)
        data['Answer'] = generate_answer(question)
        print(data['Answer'])

        # Append each result immediately so progress survives an interruption.
        with jsonlines.open('result.jsonl', 'a') as writer:
            writer.write(data)
```
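The retry loops above repeat immediately and without limit when a request fails. A minimal sketch of a bounded retry with exponential backoff follows; `call_with_retry` and its parameters are hypothetical and not part of the original notebook, and the delay schedule is only an assumption about what suits a rate-limited API.

```python
import time

def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it returns a non-empty string or attempts run out.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between failed attempts and returns '' if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            result = fn()
            if result:
                return result
        except Exception as e:
            print(f"attempt {attempt + 1} failed:", e)
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return ""
```

A request function can then be wrapped as `call_with_retry(lambda: request_once(messages))`, where `request_once` stands for whatever single API call is being retried.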

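Besides the approach above, I also tried collecting some data from other sources such as Zhihu and 金庸网.

A small amount of data was also built with Claude2: I gave it excerpts from Wikipedia and had it generate the question and the answer together.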

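# Pretraining Data

The pretraining data consists of Jin Yong's 15 novels.

# Model Performance Analysis

The analysis below evaluates the model on the full SFT test set plus a small number of manually written questions.

## ROUGE-L Score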

```python
import os
import time
from contextlib import nullcontext

import jieba
import numpy as np
import tiktoken
import torch
from rouge_chinese import Rouge

from model import GLMConfig, MiniGLM

out_dir = 'out'
num_samples = 1 # number of samples to draw
max_new_tokens = 512 # number of tokens generated in each sample
temperature = 1 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
seed = 1234
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
compile = True # use PyTorch 2.0 to compile the model to be faster
exec(open('configurator.py').read()) # overrides from command line or config file
# -----------------------------------------------------------------------------
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
# init from a model saved in a specific directory
ckpt_path = os.path.join(out_dir, 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location=device)
config = GLMConfig(**checkpoint['model_args'])
model = MiniGLM(config)
state_dict = checkpoint['model']
# strip the prefix that torch.compile adds to parameter names
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.eval()
model.to(device)

stop_token = tiktoken.get_encoding('cl100k_base').encode('<|endoftext|>', allowed_special='all')[0]

def generate(prompt):
    """Generate an answer for `prompt` (or for each line of a file given as 'FILE:<path>')."""
    enc = tiktoken.get_encoding("cl100k_base")
    encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
    decode = lambda l: enc.decode(l, errors='replace')

    # the question must end with the separator the model saw during SFT
    if not prompt.endswith('<|endoftext|>'):
        prompt += '<|endoftext|>'

    outputs = []
    # encode the beginning of the prompt
    if prompt.startswith('FILE:'):
        with open(prompt[5:], 'r', encoding='utf-8') as f:
            prompts = [line.strip() for line in f.readlines()]

        for prompt in prompts:
            prompt_ids = encode(prompt)
            x = (torch.tensor(prompt_ids, dtype=torch.long, device=device)[None, ...])

            # run generation
            with torch.no_grad():
                with ctx:
                    for k in range(num_samples):
                        y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k, stop_token=stop_token)
                        print("Prompt:", prompt)
                        output = decode(y[0].tolist())
                        outputs.append(output)
    else:
        prompt_ids = encode(prompt)
        x = (torch.tensor(prompt_ids, dtype=torch.long, device=device)[None, ...])

        # run generation
        with torch.no_grad():
            with ctx:
                for k in range(num_samples):
                    y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k, stop_token=stop_token)
                    output = decode(y[0].tolist())
                    outputs.append(output)
    try:
        # the text after the first separator is the generated answer
        result = outputs[0].split('<|endoftext|>')[1]
        return result
    except Exception as e:
        print('error:', e)
        # always return a string so the caller can tokenize it
        return outputs[0] if outputs else ''

encoder = tiktoken.get_encoding('cl100k_base')

val_data = np.memmap('qa_val.bin', dtype=np.uint32, mode='r')

scorer = Rouge()
jieba.initialize()
scores = []

start_time = time.time()
# each record is read as a 512-token window shaped like question<|endoftext|>answer
for i in range(0, len(val_data), 512):
    qa_token = val_data[i:i + 512]
    qa = encoder.decode(qa_token)

    parts = qa.split('<|endoftext|>')
    question = parts[0]
    answer = parts[1]
    generated_answer = generate(question)
    print(question)
    print(answer)
    print(generated_answer)
    # rouge_chinese's get_scores takes (hyps, refs); the reference answer is passed
    # first here, as in the run that produced the numbers reported below
    score = scorer.get_scores(' '.join(jieba.cut(answer)), ' '.join(jieba.cut(generated_answer)))[0]['rouge-l']['f']
    print(score)
    scores.append(score)
    print('-'*30)

print('time:', time.time() - start_time)
print('total num:', len(scores))
print('average score:', np.mean(scores))
```

Results:

time: 86.22463631629944 (in seconds; this figure includes the time spent computing ROUGE-L)

total num: 174

average score: 0.2775983422370355
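For reference, ROUGE-L is built on the longest common subsequence (LCS) between the reference and the generated token sequences. The sketch below computes a plain F1 variant by hand; `lcs_length` and `rouge_l_f1` are hypothetical names, the example tokens are made up, and library implementations such as `rouge_chinese` may combine precision and recall differently, so the numbers are not guaranteed to match.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, hypothesis):
    """Plain F1 form of ROUGE-L (an illustrative simplification)."""
    if not reference or not hypothesis:
        return 0.0
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hypothesis)  # share of generated tokens that are in the LCS
    recall = lcs / len(reference)      # share of reference tokens that are in the LCS
    return 2 * precision * recall / (precision + recall)

# Made-up, already word-segmented example (jieba would produce such lists).
reference = ["小龙女", "是", "古墓派", "的", "传人"]
hypothesis = ["小龙女", "是", "古墓派", "弟子"]
print(rouge_l_f1(reference, hypothesis))
```

Because the score depends only on shared tokens in the same order, two answers with the same meaning but different wording can still score low.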

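Of course, ROUGE-L does not fully capture the model's quality, because it ignores how well the generated answer matches the reference semantically. I therefore also evaluated the model with sentence embeddings.

## Sentence Embedding Similarity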
```python
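import finetuner

# Build the embedding model once instead of on every call.
# Note: the "en" in the model name marks it as an English model,
# so scores on Chinese text should be read with some caution.
model = finetuner.build_model('jinaai/jina-embedding-l-en-v1')

def cos_sim(data):
    """Return the cosine similarity between the embeddings of data[0] and data[1]."""
    embeddings = finetuner.encode(
        model=model,
        data=data
    )
    return finetuner.cos_sim(embeddings[0], embeddings[1])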
```
Results:

average score: 0.8239782

This score reflects semantic similarity better and is closer to the model's actual performance.
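The similarity score is the cosine of the angle between the two embedding vectors. A minimal, dependency-free sketch of that computation is given below; `cosine_similarity` is a hypothetical helper written for illustration and is not the `finetuner` implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" pointing in the same direction.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The value depends only on direction, not on vector length, which is why it is a common choice for comparing sentence embeddings.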

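## Manual Question Tests

The model's answers below are translated from Chinese as generated, including their mistakes.

Q: Who is 黄蓉?

A: 黄蓉 is 郭靖's wife. She is clever and quick-witted and good at handling all kinds of complicated situations. She is highly skilled in martial arts and has mastered both the 桃花岛 style and the 丐帮 techniques. After meeting 郭靖, the two went through many hardships and trials together, and in the end relied on her wisdom and courage.

Q: Which sect does 小龙女 belong to?

A: 小龙女 is the heroine of 《神雕侠侣》. She is the heir of 古墓派 and has lived in the ancient tomb since childhood, cut off from the outside world. Her temperament is calm. The 古墓派 techniques that 小龙女 mastered let her, under 小龙女's guidance, gradually learn to become the head of 华山派.

Q: Who are the protagonists of 《天龙八部》?

A: There are many protagonists in 《天龙八部》; the protagonists include 郭靖, 黄蓉, 杨康, 穆念慈, 欧阳锋, 洪七公, 丘处机, and others. Each of these characters has a distinctive personality and history, and together they make up the exciting storyline of this classic wuxia novel.

The questions above were picked at random and do not appear in the training set. The answers are imperfect because of the model's limited parameter count (the third one, for example, lists characters from 《射雕英雄传》), but the model still produces fairly relevant and largely correct responses.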

# Model Optimization

## Inference Optimization
The `model.generate()` function is modified as follows:
```python
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None, stop_token=None):
    """
    Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
    the sequence max_new_tokens times, feeding the predictions back into the model each time.
    Most likely you'll want to make sure to be in model.eval() mode of operation for this.
    """
    for _ in range(max_new_tokens):
        # if the sequence context is growing too long we must crop it at block_size
        idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
        # forward the model to get the logits for the index in the sequence
        logits, _ = self(idx_cond)
        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('Inf')
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1)
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)

        # stop as soon as <|endoftext|> is sampled
        # (.item() assumes a batch size of 1)
        if idx_next.item() == stop_token:
            break
    return idx
```
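With this change, generation stops as soon as the model samples `<|endoftext|>` instead of always running for `max_new_tokens` steps, so inference is much faster: with `max_new_tokens=512`, an answer is generated within 1 s on an NVIDIA GeForce RTX 3060 Laptop GPU.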
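Each generated token costs one forward pass, so the number of loop iterations drives the latency. The toy sketch below, which uses no model at all, counts sampling steps with and without a stop token; `toy_generate`, `STOP`, and the scripted token stream are all hypothetical.

```python
STOP = 0  # hypothetical stop-token id for this toy example

def toy_generate(sample_next, max_new_tokens, stop_token=None):
    """Return (tokens, steps), where steps counts calls to sample_next()."""
    tokens = []
    for step in range(1, max_new_tokens + 1):
        token = sample_next()
        tokens.append(token)
        # mirror the early exit in generate(): stop once the stop token appears
        if stop_token is not None and token == stop_token:
            return tokens, step
    return tokens, max_new_tokens

def scripted_sampler():
    """Stand-in for the model: emits two tokens, then STOP, then filler."""
    stream = iter([5, 7, STOP] + [1] * 509)
    return lambda: next(stream)

_, steps_with_stop = toy_generate(scripted_sampler(), 512, stop_token=STOP)
_, steps_without_stop = toy_generate(scripted_sampler(), 512)
print(steps_with_stop, steps_without_stop)
```

In the real model each of those steps is a full forward pass, so short answers benefit the most from the early exit.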

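## Changing the Tokenizer Encoding
During pretraining, I found that the model often generated garbled characters such as '�'.

After some searching, I learned that GPT-2 uses byte-level byte-pair encoding (BPE), under which a Chinese character maps to 1 to 3 tokens, most often 2.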

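As a result, when the dataset is split into chunks, the tokens that belong to a single character can end up in different sequences, which leads the model to generate garbled output.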
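The effect can be reproduced without any tokenizer by cutting a UTF-8 byte string in the middle of a character. The sketch below uses raw bytes as a stand-in for byte-level BPE tokens, which is a simplifying assumption: real tokens group bytes differently, but the failure mode is the same.

```python
text = "金庸"
raw = text.encode("utf-8")  # each of these two characters takes 3 bytes
print(len(raw))

# Hypothetical chunk boundary that falls inside the first character.
left, right = raw[:2], raw[2:]
left_text = left.decode("utf-8", errors="replace")
right_text = right.decode("utf-8", errors="replace")
print(left_text)
print(right_text)

# Decoding the pieces together again recovers the original text.
print((left + right).decode("utf-8"))
```

Neither half decodes cleanly on its own, so each one contains the replacement character '�', while the intact byte sequence decodes without loss.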

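My fix was to switch encodings to `cl100k_base`, the encoding used in the README of the tiktoken library on GitHub. In my tests this encoding mapped the Chinese characters I tried to a single token each, which avoids the problem above.

The code is as follows: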
```python
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
```