### Using Datasets

If huggingface.co is unreachable, you can edit `constants.py` in the `huggingface_hub` package and set `_HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"`.
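As an alternative that avoids editing installed package files, `huggingface_hub` also reads the endpoint from the `HF_ENDPOINT` environment variable. A minimal sketch, assuming the variable is set before `huggingface_hub` or `datasets` is first imported in the process:

```python
import os

# Must run before the first import of huggingface_hub / datasets,
# because the endpoint is read once, when the package is imported.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

endpoint = os.environ["HF_ENDPOINT"]
print(endpoint)
```

Setting the variable in the shell or in the notebook's first cell has the same effect and survives package upgrades.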

##### Feature 1: Loading data

In [2]:
from datasets import load_dataset, Dataset

Online datasets

In [None]:
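# Loading an online dataset works like loading an online model: only the dataset's repository ID is needed

dataset = load_dataset("madao33/new-title-chinese")

# Load a single task (configuration) from a dataset
dataset = load_dataset("aps/super_glue", "axb")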

Generating train split:   0%|          | 0/5850 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1679 [00:00<?, ? examples/s]

In [None]:
# Load a specific split, a slice of a split, or several slices at once
dataset = load_dataset("madao33/new-title-chinese", split="train")
dataset = load_dataset("madao33/new-title-chinese", split="train[10:100]")
dataset = load_dataset("madao33/new-title-chinese", split="train[:50%]")
dataset = load_dataset("madao33/new-title-chinese", split=["train[:50%]", "train[50%:]"])
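The slice strings above select row ranges of a split, either by absolute index or by percentage. The following is only an illustrative sketch of that mapping in plain Python: `resolve_slice` is a hypothetical helper, not part of `datasets`, and the rounding it applies to percent boundaries is an assumption that may differ from the library's behavior on splits whose size is not evenly divisible.

```python
import re


def resolve_slice(spec: str, num_rows: int) -> range:
    """Hypothetical helper: map 'train[10:100]' or 'train[:50%]' to a row range."""
    match = re.fullmatch(r"\w+\[(\d*%?):(\d*%?)\]", spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")

    def to_index(token: str, default: int) -> int:
        if token == "":
            return default
        if token.endswith("%"):
            # Percent boundaries are scaled to the split size (rounded here by assumption).
            return round(int(token[:-1]) / 100 * num_rows)
        return int(token)

    start = to_index(match.group(1), 0)
    stop = to_index(match.group(2), num_rows)
    return range(start, min(stop, num_rows))


num_rows = 5850  # size of the train split reported in the progress bar
window = resolve_slice("train[10:100]", num_rows)
first_half = resolve_slice("train[:50%]", num_rows)
second_half = resolve_slice("train[50%:]", num_rows)
print(len(window), len(first_half), len(second_half))
```

The two percentage slices meet at the same boundary, so together they cover the split without overlap.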

Local datasets

In [49]:
dataset = load_dataset("csv", data_files="../../datas/ChnSentiCorp_htl_all.csv", split="train")
dataset

Dataset({
    features: ['label', 'review'],
    num_rows: 7766
})

##### Inspecting the dataset

In [21]:
dataset = load_dataset("madao33/new-title-chinese")

Using the latest cached version of the dataset since madao33/new-title-chinese couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at F:\code\python_\other\project\deepLearning\models\datasets\madao33___new-title-chinese\default\0.0.0\be61f6e55257d64aa16e6a5c09ef9451e3f24c40 (last modified on Fri Jun 13 01:17:05 2025).


In [22]:
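dataset["train"][:3]  # slicing returns a dict in which each field is aggregated into a list
dataset["train"].features  # column names and types (the value displayed below)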

{'title': Value(dtype='string', id=None),
 'content': Value(dtype='string', id=None)}
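Slicing a split returns a column-oriented dict rather than a list of rows. A plain-Python sketch of that reshaping, using made-up rows and a hypothetical `slice_rows` helper (neither comes from the dataset or the library):

```python
# Toy rows standing in for dataset records (illustrative values only).
rows = [
    {"title": "t0", "content": "c0"},
    {"title": "t1", "content": "c1"},
    {"title": "t2", "content": "c2"},
    {"title": "t3", "content": "c3"},
]


def slice_rows(rows, start, stop):
    """Hypothetical helper: return the selected rows as a dict of column lists."""
    selected = rows[start:stop]
    return {key: [row[key] for row in selected] for key in rows[0]}


single = rows[0]               # integer index: one record as a dict
batch = slice_rows(rows, 0, 3)  # slice: every field becomes a list
print(single)
print(batch)
```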

##### Feature 2: Splitting a dataset

In [24]:
trainset = dataset["train"]
trainset.train_test_split(0.1) # the first argument is test_size: a float is a fraction of the rows, an int is an absolute row count

DatasetDict({
    train: Dataset({
        features: ['title', 'content'],
        num_rows: 5265
    })
    test: Dataset({
        features: ['title', 'content'],
        num_rows: 585
    })
})

In [27]:
boolq_dataset = load_dataset("super_glue", "boolq")
trainset = boolq_dataset["train"]
trainset.train_test_split(0.1, stratify_by_column="label") # stratified sampling by label

SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/super_glue/revision/3de24cf8022e94f4ee4b9d55a6f539891524d646 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1000)')))"), '(Request ID: 99cc6736-8bac-4c62-9ef6-e3b400befa6b)')
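Since the cell above failed on a network error, here is a rough offline sketch of what stratifying by label means: each class is split separately, so the test set keeps the class proportions of the full data. `stratified_split` is a hypothetical plain-Python helper for illustration, not the algorithm `datasets` uses internally, and the toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=0):
    """Hypothetical helper: split indices so each label keeps its share in the test set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test count
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 80 + [1] * 20  # toy data with an 80/20 class imbalance
train_idx, test_idx = stratified_split(labels, 0.1)
test_positives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)
```

With an unstratified random split, a small test set could by chance contain very few examples of the minority class.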

##### Feature 3: Selecting and filtering data

In [29]:
dataset["train"].select([0, 1]) # select takes any sequence of indices, e.g. range(100)

Dataset({
    features: ['title', 'content'],
    num_rows: 2
})

In [30]:
filter_dataset = dataset["train"].filter(lambda x: "中国" in x["title"])
filter_dataset[:5]

Filter:   0%|          | 0/5850 [00:00<?, ? examples/s]

##### Feature 4: Mapping data

Preprocess the data into a standardized form before it is fed to the model.

In [33]:
def add_prefix(example):
    example["title"] = "prefix: " + example["title"]
    return example

In [36]:
prefix_dataset = dataset.map(add_prefix)
prefix_dataset["train"]["title"]

['prefix: 望海楼美国打“台湾牌”是危险的赌博',
 'prefix: 大力推进高校治理能力建设',
 'prefix: 坚持事业为上选贤任能',
 'prefix: “大朋友”的话儿记心头',
 'prefix: 用好可持续发展这把“金钥匙”',
 'prefix: 跨越雄关，我们走在大路上',
 'prefix: 脱贫奇迹彰显政治优势',
 'prefix: 拱卫亿万人共同的绿色梦想',
 'prefix: 为党育人、为国育才',
 'prefix: 净化网络语言',
 'prefix: 用心叵测的美国政客哪有资格谈“宗教自由”？',
 'prefix: 保卫城市，打响关口战、阵地战、街巷战！',
 'prefix: 兼顾疫情防控与社会运转',
 'prefix: 人民！人民！一切为了人民！',
 'prefix: 以更深层次改革推动构建完整内需体系',
 'prefix: 创新传播，放大主流声音',
 'prefix: 英勇奋斗的民族精神',
 'prefix: 全面推进复工复产要守住安全生产底线',
 'prefix: 为国有企业强“根”铸“魂”',
 'prefix: 建设廉洁之路是共同价值追求',
 'prefix: 让阅读成为一种习惯',
 'prefix: 靠什么纾解“基层压力”',
 'prefix: 确保长江禁渔取得实效',
 'prefix: 致敬战“疫”中的每一个“她”',
 'prefix: 脱贫摘帽是新起点',
 'prefix: 书写践行初心使命的历史新篇章',
 'prefix: 让世界文明百花园群芳竞艳',
 'prefix: 让环保压力层层传递',
 'prefix: 形式主义官僚主义是我们党的大敌、人民的大敌',
 'prefix: 确保全面建成小康社会圆满收官',
 'prefix: 上海，让开放成为一种思维方式',
 'prefix: 用“四个坚持”引领新时代文艺工作',
 'prefix: 聚焦两会，世界探寻中国成功秘诀',
 'prefix: 持之以恒为基层减负',
 'prefix: “港区国安立法”效应显现，乱港分子回头是岸！',
 'prefix: 把高质量发展主题长期坚持下去',
 'prefix: 建设人与自然和谐共生的现代化',
 'prefix: 新建造，挺起发展的脊梁',
 'prefix: 大国

In [40]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
def process_func(examples):
    inputs = tokenizer(examples["content"], max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(examples["title"], max_length=32, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

In [41]:
tokenized_data = dataset.map(process_func, batched=True, remove_columns=dataset["train"].column_names)

Map:   0%|          | 0/5850 [00:00<?, ? examples/s]

Map:   0%|          | 0/1679 [00:00<?, ? examples/s]
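The two `map` calls above differ in what the function receives: without `batched=True` it gets one example (a dict of values), with `batched=True` it gets a dict of lists. The sketch below mimics that contract on plain Python rows; `apply_map` and `add_prefix_batched` are hypothetical names for illustration, and the real `map` additionally handles caching, Arrow storage and column removal.

```python
def apply_map(rows, function, batched=False, batch_size=1000):
    """Hypothetical stand-in for a map step over a list of row dicts."""
    if not batched:
        # Per-example mode: the function sees one row at a time.
        return [function(dict(row)) for row in rows]

    output_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Batched mode: the function sees a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        n_out = len(next(iter(result.values())))
        output_rows.extend({key: result[key][i] for key in result} for i in range(n_out))
    return output_rows


def add_prefix(example):
    example["title"] = "prefix: " + example["title"]
    return example


def add_prefix_batched(examples):
    examples["title"] = ["prefix: " + title for title in examples["title"]]
    return examples


rows = [{"title": "a"}, {"title": "b"}, {"title": "c"}]
per_example = apply_map(rows, add_prefix)
per_batch = apply_map(rows, add_prefix_batched, batched=True, batch_size=2)
print(per_example == per_batch)
```

Batched mode matters for tokenizers, which process a list of texts in one call much faster than one text at a time.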

##### Feature 5: Saving and loading

In [43]:
tokenized_data.save_to_disk("../../datas/datacache")

Saving the dataset (0/1 shards):   0%|          | 0/5850 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1679 [00:00<?, ? examples/s]

In [44]:
from datasets import DatasetDict
tokenized_data = DatasetDict.load_from_disk("../../datas/datacache")

In [53]:
# Create datasets from other in-memory formats
import pandas as pd

# Create from a pandas DataFrame
data = pd.read_csv("../../datas/ChnSentiCorp_htl_all.csv")
dataset = Dataset.from_pandas(data)

# Create from a list of dicts
data = [{"text": "abc"}, {"text": "def"}]
Dataset.from_list(data)

Dataset({
    features: ['text'],
    num_rows: 2
})

# Example

##### step1 Import packages

In [38]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import load_dataset
import torch

##### step2 Load the data

In [39]:
dataset = load_dataset("csv", data_files="../../datas/ChnSentiCorp_htl_all.csv", split="train")

In [40]:
dataset = dataset.filter(lambda x: x["review"] is not None)
dataset = dataset.train_test_split(0.1)

##### step3 Preprocess the data

In [41]:
tokenizer = AutoTokenizer.from_pretrained("./models/rbt3")

In [43]:
dataset["train"][0]

{'label': 1,
 'review': '总体感觉不错,房间很温馨,考虑到丽江温差大,床铺还用了电热毯.服务员态度好,每次进出都会主动打招呼.唯一不足是热水不稳定,空调不够暖和.早餐定在阳光和酒,不知道是否是连锁的,第二天去大理,居然在逛洋人街的时候也看到了阳光和酒的酒吧.不嘈杂还有表演,酒味歌唱都不错.'}

In [50]:
def process_func(examples):
    # Truncate only; padding and tensor conversion are left to
    # DataCollatorWithPadding, which pads each batch when it is loaded
    inputs = tokenizer(
        examples["review"],
        max_length=128,
        truncation=True)
    inputs["labels"] = examples["label"]
    return inputs

In [51]:
tokenized_data = dataset.map(process_func, batched=True, remove_columns=dataset["train"].column_names)

Map:   0%|          | 0/6988 [00:00<?, ? examples/s]

Map:   0%|          | 0/777 [00:00<?, ? examples/s]

In [54]:
from torch.utils.data import DataLoader
trainloader = DataLoader(tokenized_data["train"], batch_size=32, shuffle=True, collate_fn=DataCollatorWithPadding(tokenizer))
validloader = DataLoader(tokenized_data["test"], batch_size=32, shuffle=False, collate_fn=DataCollatorWithPadding(tokenizer))
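The collator passed as `collate_fn` pads every batch to the length of its longest sequence, which is why the examples need not be padded to a fixed length beforehand. A simplified plain-Python sketch of that idea follows; `pad_batch` is a hypothetical helper, the token IDs are made up, and right-side padding with ID 0 is an assumption (the real collator follows the tokenizer's settings and returns tensors).

```python
def pad_batch(features, pad_token_id=0):
    """Hypothetical helper: pad each example to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"])
    return batch


features = [
    {"input_ids": [101, 7, 8, 102], "labels": 1},
    {"input_ids": [101, 9, 102], "labels": 0},
]
batch = pad_batch(features)
print(batch)
```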

##### step4 Load the model

In [56]:
model = AutoModelForSequenceClassification.from_pretrained("./models/rbt3")

if torch.cuda.is_available():
    model = model.cuda()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./models/rbt3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [57]:
model.device

device(type='cuda', index=0)

##### step5 Create the optimizer

In [58]:
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=2e-5)

In [61]:
def train(epoch=3, log_step=20):
    global_step = 0
    for ep in range(1, epoch + 1):
        model.train()
        for batch in trainloader:
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}
            optimizer.zero_grad()
            outputs = model(**batch)
            outputs.loss.backward()
            optimizer.step()
            if global_step % log_step == 0:
                print(f"ep: {ep}, global_step: {global_step}, loss: {outputs.loss.item()}")
            global_step += 1
        acc = evaluate()
        print(f"ep: {ep}, acc: {acc}")


def evaluate():
    model.eval()
    acc_num = 0
    with torch.inference_mode():
        for batch in validloader:
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}
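            # Predicted class = index of the largest logit
            output = model(**batch).logits.argmax(dim=-1)
            acc_num += (output.long()==batch["labels"].long()).float().sum()
        # .item() turns the 0-dim tensor into a plain Python float for printing
        return (acc_num / len(tokenized_data["test"])).item()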
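The accuracy computed in `evaluate` is the fraction of examples whose largest logit sits at the index of the true label. A small plain-Python sketch of the same calculation, with made-up logits and labels and a hypothetical `accuracy` helper:

```python
def accuracy(logits, labels):
    """Hypothetical helper: share of rows whose argmax equals the label."""
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(predicted == label)
    return correct / len(labels)


logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]  # toy model outputs
labels = [1, 0, 0, 0]
acc = accuracy(logits, labels)
print(acc)
```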

In [62]:
train()

ep: 1, global_step: 0, loss: 0.21116232872009277
ep: 1, global_step: 20, loss: 0.20282669365406036
ep: 1, global_step: 40, loss: 0.352164089679718
ep: 1, global_step: 60, loss: 0.21960942447185516
ep: 1, global_step: 80, loss: 0.14668001234531403
ep: 1, global_step: 100, loss: 0.18309013545513153
ep: 1, global_step: 120, loss: 0.378016859292984
ep: 1, global_step: 140, loss: 0.1675594598054886
ep: 1, global_step: 160, loss: 0.2201177477836609
ep: 1, global_step: 180, loss: 0.21774506568908691
ep: 1, global_step: 200, loss: 0.38338080048561096
ep: 1, acc: 0.9111969470977783
ep: 2, global_step: 220, loss: 0.1919189989566803
ep: 2, global_step: 240, loss: 0.3124852180480957
ep: 2, global_step: 260, loss: 0.18380849063396454
ep: 2, global_step: 280, loss: 0.2537326216697693
ep: 2, global_step: 300, loss: 0.3811986744403839
ep: 2, global_step: 320, loss: 0.32174238562583923
ep: 2, global_step: 340, loss: 0.19202063977718353
ep: 2, global_step: 360, loss: 0.18299517035484314
ep: 2, global_st

In [68]:
from transformers import pipeline

model.config.id2label = {0: "差评", 1: "好评"}  # 0: negative review, 1: positive review
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

Device set to use cuda:0


In [69]:
pipe("这地方有点差劲")  # "This place is rather poor"

[{'label': '差评', 'score': 0.8925237655639648}]
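The `score` in the result is a probability: for a single-label classifier, the logits are passed through a softmax and the most probable class is mapped through `id2label`. A plain-Python sketch of that post-processing, where `classify` is a hypothetical helper and the logits are invented rather than taken from the model above:

```python
import math


def classify(logits, id2label):
    """Hypothetical helper: softmax over logits, then pick the most probable label."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


id2label = {0: "差评", 1: "好评"}  # same mapping as in the cell above
result = classify([1.3, -0.8], id2label)  # made-up logits
print(result)
```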