
Notes on Post-Training Data Preparation for Large Language Models (LLMs) #95

Open · ninehills opened this issue Jun 26, 2023 · 6 comments

ninehills (Owner) commented Jun 26, 2023

These are my study notes from reading articles on training data while fine-tuning and continuing pre-training of LLMs, with a small amount of hands-on experience mixed in.


0x01 Data Formats

1. Question and Answer Dataset

| Context | Question | Answer |
| --- | --- | --- |
| privacy statement for adb org choice opt out correct update contacting the website the asian development bank has created this privacy statement in order to demonstrate our … | what is the purpose of the privacy statement for adb.org? | it covers various aspects of information gathering, opting out of communications, updating information, and contacting the website. |

2. Text Summarization

| Article | Summary |
| --- | --- |
| nick fairley has signed for the st louis rams on a one year prove yourself deal when fit and focused fairley is one of the nfl’s better defensive linemen but his career has been plagued with concerns about injuries attitude and his weight worth 5 million but with the incentive to make up to 7 5m according to nfl insider ian rapoport the defensive tackle joins arguably the most fearsome front four in football the 27 year old will likely slot in alongside either defensive rookie of the year aaron donald or michael brockers to aid the rams’ frightening pass rush rams coach jeff fisher poses with nick fairley during his unveiling in st louis with defensive ends chris long and robert quinn quarterbacks … | start fairley heads to the nfc west and should play a key part in jeff fisher and gregg williams’ rotating defensive line . the detroit lions did not take up his fifth year deal, and fairley penned a one-year contract worth $5m with incentives to make $2.5m more . the 27-year-old joins one of the most feared front fours in the nfl . when fit and focused he is a disruptive force in the middle . but fears about injuries, attitude and off-field concerns remain . end |

3. Instruction Tuning

| Instruction | Output |
| --- | --- |
| what is the best way to download music from youtube? | the best way to download music from youtube is through a third party website or application many of these websites or applications will let you convert the youtube video into an mp3 file and save the audio to your device however it’s important to use a reliable safe source when downloading music from youtube. |

Some models' instruction-tuning datasets add an Input field, which may be empty. An example with a non-empty Input:

| Instruction | Input | Output |
| --- | --- | --- |
| Translate English to Chinese. | The best way to download music from youtube is through a third party website or application many of these websites or applications will let you convert the youtube video into an mp3 file and save the audio to your device however it’s important to use a reliable safe source when downloading music from youtube. | 从YouTube下载音乐的最佳方法是通过第三方网站或应用程序。其中许多网站或应用程序将允许您将YouTube视频转换为MP3文件并保存音频到您的设备上。但是,重要的是在从YouTube下载音乐时使用可靠且安全的来源。 |
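Serialized as data, the row above corresponds to one record like the following (an alpaca-style illustration; the exact field names depend on your training framework):

```json
{
  "instruction": "Translate English to Chinese.",
  "input": "The best way to download music from youtube is through a third party website or application...",
  "output": "从YouTube下载音乐的最佳方法是通过第三方网站或应用程序。..."
}
```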

4. Human-Bot Conversation

| Conversation | Response |
| --- | --- |
| `<human>:` what are the five most common causes of stress among people? `\n<bot>:` a 2017 study showed that the top causes of stress in america were: money (64%) work (60%) the economy (49%) family responsibilities (47%) personal health problems (46%) `\n<human>:` how would someone cope with stress from financial difficulties and work? `\n<bot>:` | To cope with financial difficulties: * read up on personal finance online. |

The exact prompt format depends on the target model.

5. Continue Pre-Training

Prepare a dataset of long texts for further pre-training of the language model.

| Text |
| --- |
| thomas sheridan anthropologist thomas e sheridan born 5 september 1951 is an anthropologist of sonora mexico and the history and culture of the us south west he is distinguished outreach professor at the university of arizona affiliated with the department of anthropology and the southwest center since 2003 sheridan’s family moved to phoenix arizona at the age of 3 he left the south west after high school attended reed college briefly before returning and graduated from the first incarnation of prescott college in arizona in the 1970s he became interested in northern mexico and travelled there frequently for study spending months in baha kino in 1971 he completed a phd on the yaqui in 1983 he directed the mexican heritage project at the arizona historical society from 1982 1984 and was curator … |

0x20 Data Preparation Pipeline

0x21 Data Ingestion

Import documents from the various data sources (connectors). Raw data in different formats must be converted into a text format that carries metadata.
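As an illustration only (none of this code is from the original post), here is a minimal ingestion sketch for one hypothetical connector that reads local .txt files and emits JSONL records of text plus metadata; all paths and field names are assumptions:

```python
# Hypothetical connector: walk a directory of .txt files and emit
# one JSON record per document, with text plus provenance metadata.
import json
import pathlib

def ingest_dir(src_dir: str, out_path: str, source_name: str) -> None:
    """Normalize raw files into JSONL: {"text": ..., "meta": {...}}."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in pathlib.Path(src_dir).rglob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="ignore")
            record = {"text": text,
                      "meta": {"source": source_name, "file": str(path)}}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

ingest_dir("raw_docs", "corpus.jsonl", source_name="local_txt")
```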

0x22 Data Cleaning

The goal is to remove the low-quality parts of the data, including:

  1. Text cleaning (removing line breaks, lowercasing, deleting URLs, stripping HTML tags, etc.); tools such as justext and trafilatura can help (see the sketch below).
  2. Metadata-based filtering; for example, when OpenAI collected Reddit links, it filtered out posts with fewer than 3 upvotes.

At this stage you can also apply automatic quality checks (BLEU/METEOR/similarity scores/reward models) to weed out low-quality data.
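A hedged sketch of the text-cleaning step: trafilatura's extract() is the library's actual entry point for pulling main content out of HTML, while the normalization regexes are illustrative assumptions:

```python
import re

import trafilatura  # pip install trafilatura

def clean_html(html: str):
    """Extract main content from raw HTML, then normalize the text."""
    text = trafilatura.extract(html)           # drops boilerplate and HTML tags
    if text is None:                           # extraction failed: discard sample
        return None
    text = text.lower()                        # lowercase
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    return re.sub(r"\s+", " ", text).strip()   # collapse newlines and whitespace
```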

0x23 Data Filtering

Filtering differs from cleaning: its goal is to drop text that does not suit the model's training objective. Different filters can be applied, typically the following (a sketch follows the list):

  • Length filtering: LLMs have context-length limits, and given the training objective, samples should be neither too short nor too long.
  • Language filtering: for example, dropping Arabic content, depending on the target model's multilingual goals.
  • Machine-generated text filtering: for example, dropping Google-translated or ChatGPT-generated text.
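A minimal sketch of the length and language filters, assuming the langdetect package; the thresholds and the dropped language are illustrative:

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

MIN_CHARS, MAX_CHARS = 200, 100_000  # illustrative bounds; tune per model

def keep(text: str) -> bool:
    """Return True if the sample passes the length and language filters."""
    if not MIN_CHARS <= len(text) <= MAX_CHARS:  # length filter
        return False
    try:
        lang = detect(text)                      # language filter
    except LangDetectException:                  # undetectable: drop it
        return False
    return lang != "ar"                          # e.g. drop Arabic content
```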

0x24 Data De-duplication

Research shows that duplicated training data substantially degrades model capability, so the training data must be de-duplicated. Libraries such as deduplicate-text-datasets and datasketch can help (see the sketch below).

For fine-tuning datasets, vector retrieval (embedding similarity search) can also be used for de-duplication.
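The datasketch library mentioned above supports exactly this; here is a minimal MinHash + LSH near-duplicate sketch (the shingle size and similarity threshold are assumptions):

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def _minhash(text: str, num_perm: int = 128) -> MinHash:
    """Hash 5-word shingles of a document into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

def dedup(docs: list, threshold: float = 0.8) -> list:
    """Keep each document only if no earlier near-duplicate was seen."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        m = _minhash(doc)
        if not lsh.query(m):        # no document with Jaccard >= threshold yet
            lsh.insert(str(idx), m)
            kept.append(doc)
    return kept
```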

0x25 Data Decontamination

Avoid including test-set data in the training set as far as possible; otherwise model performance will be overestimated.

Contamination comes in two forms: input-output contamination and input contamination.

  • Input-output contamination means both the input and the output of a test example appear in the training data; the model then learns mere text copying rather than a real capability.
  • Input contamination means only the input of a test example appears in the training data; the answer itself is not learned, but zero-shot/few-shot performance is still overestimated.

Here is a worked example of decontamination: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/decontamination.md#decontamination
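The linked document describes the full procedure; the sketch below only illustrates the core idea of n-gram overlap (13-grams follow the choice reported in the GPT-3 paper; test_texts is a hypothetical list of test-set inputs and outputs):

```python
def ngrams(text: str, n: int = 13):
    """Return the set of word n-grams of a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, test_ngrams: set) -> bool:
    """Flag a training document that shares any n-gram with the test set."""
    return not ngrams(train_doc).isdisjoint(test_ngrams)

# Build the test-set n-grams once, then scan the training corpus:
# test_ngrams = set().union(*(ngrams(t) for t in test_texts))
# train_docs = [d for d in train_docs if not is_contaminated(d, test_ngrams)]
```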

0x26 Values Control

Filter out content that violates the laws or ethical norms of the country you operate in, such as racial discrimination or verbal abuse. The specifics should follow local law.

0x27 PII Redaction

Remove legally protected personal information, such as IDs and medical records, from the data. Libraries such as presidio and pii-codex can help; a sketch follows.
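A minimal sketch using presidio's analyzer and anonymizer engines (requires the presidio-analyzer and presidio-anonymizer packages plus a spaCy model; the sample text is made up):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "John Smith's phone number is 212-555-0123"
findings = analyzer.analyze(text=text, language="en")   # detect PII spans
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
print(redacted.text)  # e.g. "<PERSON>'s phone number is <PHONE_NUMBER>"
```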

0x28 Data Transformation

  1. Convert the data into the target format (e.g., JSON).
  2. Match the target model's data requirements, e.g., truncation length, padding tokens, start/end tokens, and so on (see the sketch below).
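A sketch of step 2 using the Hugging Face transformers tokenizer API; the model name and max_length are placeholders for whatever target model you train:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:          # many LLM tokenizers ship without one
    tokenizer.pad_token = tokenizer.eos_token

encoded = tokenizer(
    "some training text ...",
    truncation=True,        # enforce the context-length limit
    max_length=2048,        # placeholder; match your model
    padding="max_length",   # pad to fixed length with the pad token
)
```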

0x30 Data Construction

Preparing instruction-tuning data places very high demands on annotators: even an experienced annotator with a college degree may label fewer than 5 examples per hour.

Three ways to make data construction more efficient:

  1. Reuse datasets: reuse open-source datasets to increase the diversity of fine-tuning tasks and improve model performance.
  2. Generate datasets with GPT-4: leverage GPT-4 to draft an initial dataset that humans only need to review, which raises annotation throughput (see the sketch after this list).
  3. Use a well-designed annotation system: building or adopting an annotation tool with a good user experience improves labeling speed and quality, and techniques such as cross-validation help prevent low-quality data.
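A hedged sketch of approach 2 with the openai Python client; the prompt, model name, and the draft_example helper are all illustrative, and every draft still needs human review:

```python
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_example(topic: str) -> dict:
    """Ask GPT-4 to draft one instruction-tuning example for human review."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Write one instruction-tuning example about "
                       f"{topic} as a JSON object with keys "
                       '"instruction", "input", and "output".',
        }],
    )
    # May need extra parsing if the model wraps the JSON in markdown fences.
    return json.loads(resp.choices[0].message.content)
```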

NLPpupil commented Jul 9, 2023

gududefengzhong commented

I learned a lot. Thank you very much!

dorbodwolf commented

Thanks for sharing!

linchen111 commented

How do you prepare the data for Continue Pre-Training?

ninehills (Owner, Author) commented

> 5. Continue Pre-Training

For continued pre-training, plain unlabeled text is all you need. Taking the continued pre-training setup in https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README_zh.md as an example, format the text as

```json
{"text": "xxxxx"}
```

with one document per line, then specify in the dataset config that the text column should be read:

```json
"columns": {
  "prompt": "text"
}
```

Reference dataset: https://huggingface.co/datasets/olm/olm-wikipedia-20221220

linchen111 commented

> For continued pre-training, plain unlabeled text is all you need. […]

Thanks for the reply. I happen to be using this framework. I just put the text straight into the prompt (i.e., the alpaca Instruction field) and fill the other required fields with arbitrary values, since no loss is computed on them anyway.
