Aggregate the following four open-source instruction-tuning datasets, for a total of 287k examples.
- Magicoder-OSS-Instruct: https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K
- Python code subset of ShareGPT: https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT
- Magicoder-Evol-Instruct: https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K
- Evol-Instruct-Code: https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1

In [None]:
import json
from datasets import load_dataset, concatenate_datasets

ds1 = load_dataset("/Users/baoshui/data/Magicoder-OSS-Instruct-75K")
ds2 = load_dataset("/Users/baoshui/data/Python-Code-23k-ShareGPT")
ds3 = load_dataset("/Users/baoshui/data/Magicoder-Evol-Instruct-110K")
ds4 = load_dataset("/Users/baoshui/data/Evol-Instruct-Code-80k-v1")


Process each of the four datasets above and append the results to a single new JSONL file at /Users/baoshui/data/CodeMaster/code_master.jsonl, normalizing every record to the same format:
```json
{
    "query": "",
    "answer": "",
    "resource": "",
    "lang": ""
}
```
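As a minimal sketch of the unified format, the helper below builds one record and serializes it as a JSONL line. The names `make_record` and `to_jsonl_line` are hypothetical and do not appear in the notebook; the string-only check is an assumption about what the schema is meant to hold.

```python
import json

REQUIRED_KEYS = ("query", "answer", "resource", "lang")


def make_record(query, answer, resource, lang):
    """Build one record in the unified schema (hypothetical helper)."""
    record = {"query": query, "answer": answer, "resource": resource, "lang": lang}
    # Assumption: every field is expected to be a plain string.
    for key in REQUIRED_KEYS:
        if not isinstance(record[key], str):
            raise TypeError(f"{key} must be a string, got {type(record[key]).__name__}")
    return record


def to_jsonl_line(record):
    """Serialize a record as one JSONL line, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False) + "\n"


example = make_record("Reverse a string.", "s[::-1]", "Magicoder-OSS-Instruct-75K", "python")
line = to_jsonl_line(example)
```

Each per-dataset cell could call a helper like this instead of building the dictionary inline, so a missing field fails early rather than landing in the merged file.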

In [None]:
target_jsonl = "/Users/baoshui/data/CodeMaster/code_master.jsonl"

In [None]:
with open("/Users/baoshui/data/Magicoder-OSS-Instruct-75K/data-oss_instruct-decontaminated.jsonl", 'r', encoding='utf-8') as input_file, open(target_jsonl, 'a', encoding='utf-8') as output_file:
    for line in input_file:
        data = json.loads(line)
        new_data = {
            "query": data["problem"],
            "answer": data["solution"],
            "resource": "Magicoder-OSS-Instruct-75K",
            "lang": data["lang"]
        }
        output_file.write(json.dumps(new_data, ensure_ascii=False) + '\n')


In [None]:
with open("/Users/baoshui/data/Python-Code-23k-ShareGPT/Python-Code-23k-ShareGPT.json", 'r', encoding='utf-8') as input_file, open(target_jsonl, 'a', encoding='utf-8') as output_file:
    
    data = json.load(input_file)
    for d in data:
        new_data = {
            "query": d["conversations"][0]["value"],
            "answer": d["conversations"][1]["value"],
            "resource": "Python-Code-23k-ShareGPT",
            "lang": "python"
        }
        output_file.write(json.dumps(new_data, ensure_ascii=False) + '\n')

In [None]:
with open("/Users/baoshui/data/Magicoder-Evol-Instruct-110K/data-evol_instruct-decontaminated.jsonl", 'r', encoding='utf-8') as input_file, open(target_jsonl, 'a', encoding='utf-8') as output_file:
    batch = []
    for line in input_file:
        d = json.loads(line)
        new_data = {
            "query": d["instruction"],
            "answer": d["response"],
            "resource": "Magicoder-Evol-Instruct-110K",
            "lang": "python"
        }
        batch.append(new_data)
        if len(batch) >= 1000:
            for item in batch:
                output_file.write(json.dumps(item, ensure_ascii=False) + '\n')
            batch = []
    if batch:
        for item in batch:
            output_file.write(json.dumps(item, ensure_ascii=False) + '\n')

In [None]:
with open("/Users/baoshui/data/Evol-Instruct-Code-80k-v1/EvolInstruct-Code-80k.json", 'r', encoding='utf-8') as input_file, open(target_jsonl, 'a', encoding='utf-8') as output_file:
    data = json.load(input_file)
    for d in data:
        new_data = {
            "query": d["instruction"],
            "answer": d["output"],
            "resource": "Evol-Instruct-Code-80k-v1",
            "lang": "python"
        }
        output_file.write(json.dumps(new_data, ensure_ascii=False) + '\n')

In [None]:
# Count the records currently in the merged file
def print_count_data():
    with open(target_jsonl, 'r', encoding='utf-8') as output_file:
        return sum(1 for _ in output_file)
    
print_count_data()

That completes the first step: merging the four datasets into 287k examples in total.
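One way to sanity-check the merge is to count records per source dataset. The sketch below assumes only the `resource` field from the unified format; `count_by_resource` is a hypothetical helper, not part of the notebook.

```python
import json
from collections import Counter


def count_by_resource(jsonl_path):
    """Return a Counter mapping each `resource` value to its record count."""
    counts = Counter()
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                counts[json.loads(line)["resource"]] += 1
    return counts


# Usage in the notebook (path defined in an earlier cell):
# count_by_resource(target_jsonl)
```

If one dataset is missing from the result, or one label has far more records than its dataset should contribute, a `resource` value was probably mislabeled during the merge.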

Step two: score every problem with `Qwen-72B-Chat` and keep those scored 4 or 5, which leaves 156k examples.
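The notebook does not show the scoring prompt or the model's reply format, so the following is only a sketch of the filtering logic. It assumes the model's reply contains a digit from 1 to 5; `parse_score`, `filter_by_score`, and the stand-in `fake_scorer` are all hypothetical, and a real run would replace `fake_scorer` with a call to `Qwen-72B-Chat`.

```python
import re

KEEP_SCORES = {4, 5}


def parse_score(reply):
    """Return the first standalone digit 1-5 found in the reply, or None."""
    match = re.search(r"(?<!\d)([1-5])(?!\d)", reply)
    return int(match.group(1)) if match else None


def filter_by_score(records, score_fn, keep=KEEP_SCORES):
    """Keep records whose query receives a score in `keep`.

    `score_fn` is any callable that maps a query string to the model's
    text reply (an assumption; the real scoring call is not shown).
    """
    kept = []
    for record in records:
        score = parse_score(score_fn(record["query"]))
        if score in keep:
            kept.append(record)
    return kept


def fake_scorer(query):
    """Stand-in for the model: long queries get 5, short ones get 2."""
    return "Score: 5" if len(query) > 20 else "Score: 2"


records = [
    {"query": "Implement an LRU cache with O(1) get and put.", "answer": "..."},
    {"query": "Print hi.", "answer": "..."},
]
kept = filter_by_score(records, fake_scorer)
```

Replies without a parsable score are dropped here; whether to drop or retry them is a design choice the source does not specify.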

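Step three: use BERT embeddings and the k-nearest-neighbors algorithm to merge similar single-turn query-response pairs into multi-turn conversations. Each query takes at most 3 neighbors, so a conversation has at most four turns.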

In [None]:
import random
from itertools import islice
from sklearn.neighbors import NearestNeighbors
from transformers import BertTokenizer, BertModel
import torch

source_pair = []
with open(target_jsonl, 'r', encoding='utf-8') as input_file:
    for line in islice(input_file, 100):  # take the first 100 records for testing
        source_pair.append(json.loads(line))
    
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')

# Get BERT embeddings for the queries
def get_bert_embeddings(queries):
    encoded_input = tokenizer(queries, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        output = model(**encoded_input)
    # Sentence-level embedding: mean over all token positions (padding included)
    embeddings = output.last_hidden_state.mean(dim=1)
    return embeddings

# Embed all queries
queries = [pair["query"] for pair in source_pair]
embeddings = get_bert_embeddings(queries)

# Use k-nearest neighbors to find the four closest points for each query (the query itself is normally one of them)
nbrs = NearestNeighbors(n_neighbors=4, algorithm='auto').fit(embeddings)
distances, indices = nbrs.kneighbors(embeddings)

# Build the multi-turn conversation data
multi_turn_conversation = []
used_indices = set()

for i, neighbors in enumerate(indices):
    if i in used_indices:
        continue
    # Drop the query itself and any record already placed in a conversation
    ns = [n for n in neighbors if n != i and n not in used_indices]
    if len(ns) >= 2:
        selected_neighbors = random.sample(ns, 2)
        conversation = [source_pair[i]]
        # Keep the conversation a flat list of query-answer pairs
        conversation.extend(source_pair[n] for n in selected_neighbors)
        multi_turn_conversation.append(conversation)
        used_indices.update(selected_neighbors)

    used_indices.add(i)

print(f"Generated {len(multi_turn_conversation)} multi-turn conversation instances.")
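The cell above samples exactly two neighbors per query, while the step description allows up to three. The sketch below shows one possible variant of the grouping logic that takes up to three unused neighbors, nearest first. It works on a plain neighbor-index table, such as the `indices` array returned by `kneighbors`, so it can be tried without loading a model. `group_neighbors` and the toy table are hypothetical, and unlike the cell above this variant also accepts groups with a single neighbor.

```python
def group_neighbors(indices, max_neighbors=3):
    """Group each unused index with up to `max_neighbors` unused neighbors.

    `indices[i]` lists the nearest neighbors of item i, nearest first,
    and may include i itself. Returns lists of item indices, seed first.
    """
    used = set()
    groups = []
    for i, neighbors in enumerate(indices):
        if i in used:
            continue
        used.add(i)
        # Skip the seed itself and anything already assigned to a group
        picked = [n for n in neighbors if n != i and n not in used][:max_neighbors]
        if picked:
            groups.append([i] + picked)
            used.update(picked)
    return groups


# Toy neighbor table for six items (made-up values for illustration)
toy_indices = [
    [0, 1, 2, 3],
    [1, 0, 2, 3],
    [2, 3, 0, 1],
    [3, 2, 4, 0],
    [4, 5, 3, 2],
    [5, 4, 3, 2],
]
groups = group_neighbors(toy_indices)
print(groups)  # [[0, 1, 2, 3], [4, 5]]
```

Each group of indices can then be mapped back to records, for example `[source_pair[n] for n in group]`, to form one conversation of at most four turns.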