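### (1) An NLP workflow built on Transformers
- Text classification as the running example
    - Step 1: Import the required packages
    - Step 2: Load the dataset                          (Datasets)
    - Step 3: Split the dataset                         (Datasets)
    - Step 4: Preprocess the dataset                    (Tokenizer + Datasets)
    - Step 5: Create the model                          (Model)
    - Step 6: Define the evaluation function            (Evaluate)
    - Step 7: Configure the training arguments          (TrainingArguments)
    - Step 8: Create the trainer                        (Trainer + DataCollator)
    - Step 9: Train, evaluate, and predict on a dataset (Trainer)
    - Step 10: Predict on a single example              (Pipeline)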

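### (2) GPU memory optimization: running BERT-Large in 4 GB of GPU memory
#### A quick breakdown of GPU memory usage
- Model weights
    - 4 bytes × number of parameters (fp32)
- Optimizer states
    - 8 bytes × number of parameters, for the commonly used AdamW optimizer
- Gradients
    - 4 bytes × number of parameters
- Forward activations
    - Depend on sequence length, hidden size, batch size, and several other factors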
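The per-parameter byte counts above can be turned into a rough estimate. The sketch below is illustrative only: the helper name `estimate_static_memory_gb` is hypothetical, it assumes fp32 training with AdamW, and it plugs in the 330M parameter count this document lists for hfl/chinese-macbert-large. Forward activations are left out because they depend on batch size and sequence length.

```python
def estimate_static_memory_gb(num_params: int) -> dict:
    """Rough fp32 + AdamW estimate of the memory that does not depend on batch size."""
    bytes_per_gb = 1024 ** 3
    parts = {
        "weights": 4 * num_params,    # fp32 weights: 4 bytes per parameter
        "optimizer": 8 * num_params,  # AdamW: two fp32 moment buffers per parameter
        "gradients": 4 * num_params,  # fp32 gradients: 4 bytes per parameter
    }
    sizes = {name: value / bytes_per_gb for name, value in parts.items()}
    sizes["total"] = sum(parts.values()) / bytes_per_gb
    return sizes


estimate = estimate_static_memory_gb(330_000_000)  # 330M parameters (assumed input)
for name, gb in estimate.items():
    print(f"{name:>10}: {gb:.2f} GB")
```

Under these assumptions the weights, optimizer states, and gradients already take close to 5 GB before any activation is stored, so shrinking activations alone would not be enough to fit into 4 GB.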


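#### GPU memory optimization strategies
- Model: hfl/chinese-macbert-large, 330M parameters
- BS = batch size, GA = gradient accumulation steps, MaxLength = maximum sequence length in tokens
- Each row prefixed with + is applied on top of the rows above it

Strategy|Optimization target|GPU memory|Training time
--|:--:|:--:|--:
Baseline (BS 32, MaxLength 128)|-|15.2G|64s
+ Gradient Accumulation (BS 1, GA 32)|Forward activations|7.4G|259s
+ Gradient Checkpointing (BS 1, GA 32)|Forward activations|7.2G|422s
+ Adafactor Optimizer (BS 1, GA 32)|Optimizer states|5.0G|406s
+ Freeze Model (BS 1, GA 32)|Forward activations / gradients|3.5G|178s
+ Data Length (BS 1, GA 32, MaxLength 32)|Forward activations|3.4G|126s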
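Gradient accumulation trades time for memory: several small batches are processed one after another and their gradients are summed before a single parameter update. The following is a minimal, framework-free sketch of why this can match one large batch. The toy linear model, the data, and the helper `grad_mse` are all made up for illustration and are not part of the notebook.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One large batch holding all 8 samples.
full_batch_grad = grad_mse(w, xs, ys)

# The same 8 samples as 4 micro-batches of 2, accumulated before one update.
micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size
accumulated_grad = 0.0
for i in range(accumulation_steps):
    lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
    # Scale each micro-batch gradient so that the sum equals the full-batch mean.
    accumulated_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(math.isclose(full_batch_grad, accumulated_grad, rel_tol=1e-9))
```

Only one micro-batch of activations has to be held in memory at a time, which is consistent with the table listing forward activations as the optimization target and with the longer training time.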

### (3) Hands-on walkthrough

#### Step 1: Import the required packages

In [1]:
import torch
from transformers import AutoTokenizer,AutoModelForSequenceClassification,Trainer,TrainingArguments
from datasets import load_dataset
import evaluate
from transformers import DataCollatorWithPadding


#### Step 2: Load the data

In [2]:
dataset=load_dataset('csv',data_files='./ChnSentiCorp_htl_all.csv',split='train')
# Drop rows whose review is empty; otherwise the map() call in Step 4 raises an error
dataset=dataset.filter(lambda x:x['review'] is not None)
dataset

Dataset({
    features: ['label', 'review'],
    num_rows: 7765
})

#### Step 3: Split the data

In [3]:
datasets=dataset.train_test_split(test_size=0.1)
datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'review'],
        num_rows: 6988
    })
    test: Dataset({
        features: ['label', 'review'],
        num_rows: 777
    })
})
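The split sizes above can be checked by hand. The sketch below assumes a scikit-learn-style rounding convention (round the test share up, the train share down); that convention is an assumption made here, chosen because it reproduces the 6988/777 split shown in the output.

```python
import math

num_rows = 7765   # rows left after filtering out empty reviews
test_size = 0.1   # fraction passed to train_test_split

# Assumed rounding convention: ceil for the test split, floor for the train split.
num_test = math.ceil(test_size * num_rows)
num_train = math.floor((1 - test_size) * num_rows)

print(num_train, num_test, num_train + num_test)
```

Because no seed is passed to `train_test_split` in the notebook, the rows that land in each split can differ from run to run even though the sizes stay the same.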

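#### Step 4: Preprocess the data

In [4]:
tokenizer=AutoTokenizer.from_pretrained('rbt3')
def pre_data(data):
    # Tokenize each review, truncating or padding it to exactly 32 tokens
    examples=tokenizer(data['review'],max_length=32,truncation=True,padding="max_length")
    # The model's forward() takes the target under the name 'labels'
    examples['labels']=data['label']
    return examples
# Drop the original columns so that only model inputs remain
tokenizer_datasets=datasets.map(pre_data,batched=True,remove_columns=datasets['train'].column_names)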
tokenizer_datasets

Map: 100%|██████████| 6988/6988 [00:01<00:00, 3688.46 examples/s]
Map: 100%|██████████| 777/777 [00:00<00:00, 3629.77 examples/s]


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 6988
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 777
    })
})
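To make the effect of `max_length=32`, `truncation=True`, and `padding="max_length"` concrete, here is a toy sketch of the same idea on plain lists. The helper `pad_or_truncate` and the pad id of 0 are hypothetical. A real tokenizer also handles special tokens and `token_type_ids`, which this simplification ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a token id list to max_length, then right-pad it to exactly max_length."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


short = pad_or_truncate([101, 2769, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Every example ends up with the same length, so the forward activations scale with the chosen maximum length rather than with the longest review in the batch.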

#### Step 5: Build the model

In [5]:
model=AutoModelForSequenceClassification.from_pretrained('rbt3')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at rbt3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Step 6: Define the evaluation function

In [15]:
acc_metric=evaluate.load('accuracy')
f1_metric=evaluate.load('f1')

In [16]:
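def evaluate_metric(eval_predict):
    # eval_predict is a (logits, labels) pair supplied by the Trainer
    predictions,label=eval_predict
    # Pick the class with the highest logit for each example
    predictions=predictions.argmax(axis=-1)
    acc=acc_metric.compute(predictions=predictions,references=label)
    f1=f1_metric.compute(predictions=predictions,references=label)
    # Merge both results into one dict: {'accuracy': ..., 'f1': ...}
    acc.update(f1)
    return acc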
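For readers who want to see what the two metrics measure, the following is a small hand-written sketch of accuracy and binary F1 on plain Python lists. The functions `accuracy` and `binary_f1` are illustrative stand-ins, not the `evaluate` library's implementation, and the sketch assumes class 1 is the positive class.

```python
def accuracy(predictions, references):
    """Share of examples whose predicted class equals the reference class."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def binary_f1(predictions, references, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


preds = [1, 0, 1, 1]
refs = [1, 0, 0, 1]
scores = {"accuracy": accuracy(preds, refs), "f1": binary_f1(preds, refs)}
print(scores)
```

In this toy case three of four predictions are correct, while F1 is computed from two true positives, one false positive, and no false negatives.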



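#### Step 7: Configure the training arguments

In [17]:
training_args=TrainingArguments(
    output_dir="./checkpoints",      # output directory
    per_device_train_batch_size=16,  # training batch size per device
    gradient_accumulation_steps=32,  # *** gradient accumulation ***
    gradient_checkpointing=True,     # *** gradient checkpointing ***
    optim="adafactor",               # *** Adafactor optimizer ***
    per_device_eval_batch_size=16,   # evaluation batch size per device
    logging_steps=10,                # log every 10 optimizer steps
    evaluation_strategy="epoch",     # evaluate after every epoch (named eval_strategy in newer transformers releases)
    save_strategy="epoch",           # save a checkpoint after every epoch
    save_total_limit=3,              # keep at most 3 checkpoints
    learning_rate=2e-5,              # learning rate
    weight_decay=0.01,               # weight decay
    metric_for_best_model="f1",      # metric used to pick the best checkpoint
    load_best_model_at_end=True      # reload the best checkpoint when training ends
)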
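With these arguments, one optimizer update covers 16 × 32 examples. The arithmetic below is a reconstruction that happens to match the 39 total steps in the training log of Step 9. It assumes a single device and 3 epochs (which I understand to be the `TrainingArguments` default when `num_train_epochs` is not set). The exact rounding inside `Trainer` may differ between transformers versions.

```python
import math

num_train_samples = 6988     # size of the train split from Step 3
per_device_batch_size = 16   # per_device_train_batch_size
accumulation_steps = 32      # gradient_accumulation_steps
num_train_epochs = 3         # assumed default, not set in the notebook

# Number of mini-batches the dataloader yields per epoch (last one may be partial).
batches_per_epoch = math.ceil(num_train_samples / per_device_batch_size)
# One optimizer update happens after every `accumulation_steps` mini-batches.
updates_per_epoch = batches_per_epoch // accumulation_steps
total_updates = updates_per_epoch * num_train_epochs
effective_batch_size = per_device_batch_size * accumulation_steps

print(batches_per_epoch, updates_per_epoch, total_updates, effective_batch_size)
```

Note that this cell uses a per-device batch size of 16, whereas the memory table in section (2) was measured with a batch size of 1.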

#### Step 8: Create the trainer

In [18]:

# *** Freeze the encoder: only the classification head stays trainable ***
for name,param in model.bert.named_parameters():
    param.requires_grad=False

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenizer_datasets['train'],
    eval_dataset=tokenizer_datasets['test'],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # data collator: batches the examples and pads them to a common length
    compute_metrics=evaluate_metric
)
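Freezing `model.bert` leaves only the classification head (`classifier.weight` and `classifier.bias`, the two tensors named in the Step 5 warning) trainable. The count below is a back-of-the-envelope sketch: the hidden size of 768 is an assumption about the rbt3 checkpoint that the notebook does not state, while the two labels match the two logit columns in the prediction output of Step 9.

```python
hidden_size = 768  # assumption: encoder hidden size, not stated in the notebook
num_labels = 2     # matches the two logit columns in the prediction output

# A linear head has one weight per (hidden unit, label) pair plus one bias per label.
head_weights = hidden_size * num_labels
head_biases = num_labels
trainable_params = head_weights + head_biases

print(trainable_params)
```

Gradients and optimizer states are only kept for trainable parameters, which is consistent with the table in section (2) listing gradients among the things this strategy reduces.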


#### Step 9: Train, evaluate, and predict

In [19]:
# Train
trainer.train()

{'loss': 0.7513, 'grad_norm': 6.907197952270508, 'learning_rate': 1.4871794871794874e-05, 'epoch': 0.73}
{'eval_loss': 0.7160607576370239, 'eval_accuracy': 0.40283140283140284, 'eval_f1': 0.2772585669781931, 'eval_runtime': 11.3677, 'eval_samples_per_second': 68.352, 'eval_steps_per_second': 4.31, 'epoch': 0.95}
{'loss': 0.7207, 'grad_norm': 5.634609222412109, 'learning_rate': 9.743589743589744e-06, 'epoch': 1.46}
{'eval_loss': 0.6900193095207214, 'eval_accuracy': 0.5263835263835264, 'eval_f1': 0.5740740740740741, 'eval_runtime': 11.7365, 'eval_samples_per_second': 66.204, 'eval_steps_per_second': 4.175, 'epoch': 1.98}
{'loss': 0.6995, 'grad_norm': 5.7766618728637695, 'learning_rate': 4.615384615384616e-06, 'epoch': 2.2}
{'eval_loss': 0.6832708120346069, 'eval_accuracy': 0.564993564993565, 'eval_f1': 0.6334056399132321, 'eval_runtime': 13.1849, 'eval_samples_per_second': 58.931, 'eval_steps_per_second': 3.716, 'epoch': 2.86}
100%|██████████| 39/39 [05:41<00:00,  8.76s/it]
{'train_runtime': 341.4684, 'train_samples_per_second': 61.394, 'train_steps_per_second': 0.114, 'train_loss': 0.7162004739810259, 'epoch': 2.86}

TrainOutput(global_step=39, training_loss=0.7162004739810259, metrics={'train_runtime': 341.4684, 'train_samples_per_second': 61.394, 'train_steps_per_second': 0.114, 'train_loss': 0.7162004739810259, 'epoch': 2.86})
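The `learning_rate` values in the training log above are consistent with a linear decay from 2e-5 to 0 over 39 steps without warmup, which I understand to be the `Trainer` default schedule. The function below is a hand-written sketch of that schedule, not the library's code, and `linear_decay_lr` is a hypothetical name.

```python
def linear_decay_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Steps at which the log reports a training loss (logging_steps=10).
for step in (10, 20, 30):
    print(step, linear_decay_lr(step, total_steps=39, base_lr=2e-5))
```

Evaluating the function at steps 10, 20, and 30 should land on the logged rates up to floating-point rounding.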

In [20]:
# Evaluate
trainer.evaluate(tokenizer_datasets['test'])

100%|██████████| 49/49 [00:11<00:00,  4.29it/s]


{'eval_loss': 0.6832708120346069,
 'eval_accuracy': 0.564993564993565,
 'eval_f1': 0.6334056399132321,
 'eval_runtime': 11.7526,
 'eval_samples_per_second': 66.113,
 'eval_steps_per_second': 4.169,
 'epoch': 2.86}

In [24]:
# Predict
pre=trainer.predict(tokenizer_datasets['test'])
pre

100%|██████████| 49/49 [00:13<00:00,  3.75it/s]


PredictionOutput(predictions=array([[-0.04740256,  0.0760574 ],
       [ 0.09700981, -0.0429115 ],
       [ 0.23148103,  0.3142735 ],
       ...,
       [-0.06397601,  0.19077998],
       [ 0.39171207,  0.0683006 ],
       [ 0.21678816,  0.1644122 ]], dtype=float32), label_ids=array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0

In [33]:
# Map class ids to readable labels: 0 -> 差评 (negative review), 1 -> 好评 (positive review)
id2_label={0:'差评',1:'好评'}
model.config.id2label=id2_label
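The `predictions` field returned by `trainer.predict` holds raw logits, one row per example. A minimal sketch of turning logits into labels follows. It uses the first three logit rows from the output above and a plain-Python `argmax` helper written for this example.

```python
id2label = {0: "差评", 1: "好评"}  # negative / positive review, as in the notebook

# First three logit rows copied from the PredictionOutput shown above.
logits = [
    [-0.04740256, 0.0760574],
    [0.09700981, -0.0429115],
    [0.23148103, 0.3142735],
]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predicted_ids = [argmax(row) for row in logits]
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_ids, predicted_labels)
```

With NumPy arrays the same decoding is `pre.predictions.argmax(axis=-1)`, as used inside `evaluate_metric`.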

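In [41]:
import pandas as pd
# Note: pre.label_ids holds the reference labels of the test split, so the array below shows ground truth;
# use pre.predictions.argmax(axis=-1) instead to decode the model's predicted classes
pre_result=pd.Series(pre.label_ids).map({0:'差评',1:'好评'}).values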
pre_result

array(['好评', '好评', '好评', '差评', '好评', '好评', '好评', '好评', '差评', '好评', '好评',
       '好评', '差评', '好评', '好评', '好评', '好评', '好评', '差评', '差评', '差评', '差评',
       '好评', '差评', '好评', '差评', '好评', '差评', '好评', '好评', '好评', '好评', '差评',
       '好评', '好评', '好评', '好评', '差评', '好评', '差评', '好评', '好评', '好评', '好评',
       '好评', '好评', '差评', '好评', '好评', '好评', '好评', '好评', '好评', '好评', '好评',
       '差评', '好评', '好评', '好评', '好评', '差评', '差评', '差评', '差评', '好评', '好评',
       '好评', '好评', '差评', '好评', '差评', '好评', '好评', '差评', '好评', '好评', '好评',
       '好评', '好评', '差评', '差评', '差评', '差评', '好评', '好评', '差评', '差评', '好评',
       '好评', '好评', '差评', '好评', '好评', '好评', '差评', '好评', '好评', '差评', '好评',
       '差评', '好评', '差评', '差评', '好评', '好评', '好评', '好评', '好评', '好评', '好评',
       '好评', '好评', '好评', '好评', '好评', '好评', '差评', '好评', '好评', '好评', '好评',
       '好评', '差评', '好评', '差评', '好评', '好评', '差评', '好评', '差评', '好评', '差评',
       '好评', '差评', '好评', '差评', '好评', '差评', '差评', '好评', '差评', '好评', '差评',
       '好评', '好评', '好评', '差评', '好评', '差评', '好评', '差