# 3.5 回归类任务

GPT-2 原本是设计用于生成自然语言的模型，但通过适当的调整和微调，它也可以用于回归任务，例如预测连续值。

使用 GPT-2 进行回归问题的解决，可以将回归问题转化为自回归语言模型任务。GPT-2 原本是设计用于生成自然语言的模型，但通过适当的调整和微调，它也可以用于回归任务，例如预测连续值（如情感评分、价格预测等）。

---

### **1. 使用 GPT-2 做回归的核心思路**

1. **调整输出层**：
   - 默认情况下，GPT-2 的输出是一个词汇表大小的概率分布，用于预测下一个 token。
   - 对于回归问题，可以将模型的最后一层替换为一个线性层，使得输出变为一个标量或多个连续值。
   - gpt2的huggingface实现中，可以简单设置1个分类的分类header，实现回归预测。

2. **损失函数**：
   - 对于回归问题，使用均方误差（MSE）或均绝对误差（MAE）作为损失函数，而不是分类任务中常用的交叉熵。

3. **输入格式**：
   - 输入数据仍然是文本，可以通过特定的模板形式加入上下文信息。

---

### **2. GPT-2 回归任务的实现步骤**

#### **（1）加载基础模型**

从 Hugging Face Transformers 库加载 GPT-2 模型和分词器，并调整其配置以适应回归任务。

```python
from transformers import GPT2Tokenizer, GPT2Model, GPT2Config, AutoModelForSequenceClassification

# 加载分词器
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# 调整模型配置，num_labels=1 表示回归任务
config = GPT2Config.from_pretrained("gpt2", num_labels=1)

# 加载模型，增加回归输出
model = AutoModelForSequenceClassification.from_pretrained("gpt2", config=config)
```

---

### **3. 课程数据集**

本例程使用了蛋白质稳定性分析的数据集，也就是一个蛋白质序列，对应一个float的数值，做回归预测分析。

**蛋白质稳定性分析**是研究蛋白质在不同条件下保持其结构和功能的能力的过程。蛋白质稳定性是生物化学和生物技术领域的重要课题，影响着蛋白质的折叠、功能执行、以及在应用中的可用性（如工业酶、药物开发等）。


In [1]:
import subprocess
import os
# 设置环境变量, autodl一般区域
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

"""
import os

# 设置环境变量, autodl专区 其他idc
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# 打印环境变量以确认设置成功
print(os.environ.get('HF_ENDPOINT'))
"""

"\nimport os\n\n# 设置环境变量, autodl专区 其他idc\nos.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n\n# 打印环境变量以确认设置成功\nprint(os.environ.get('HF_ENDPOINT'))\n"

In [2]:
from transformers import AutoTokenizer, AutoModel
from tokenizers import Tokenizer
from transformers import GPT2LMHeadModel, AutoConfig,GPT2Tokenizer
from transformers import AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding

In [3]:
#set tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("dnagpt/gene_eng_gpt2_v0")
tokenizer.pad_token = tokenizer.eos_token

In [4]:
#set model
model = AutoModelForSequenceClassification.from_pretrained('dnagpt/gene_eng_gpt2_v0', num_labels=1)
model.config.pad_token_id = model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at dnagpt/gene_eng_gpt2_v0 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
from datasets import load_dataset
# 1. load ~11k samples from promoters prediction dataset
dataset = load_dataset("csv", data_files="data/protein_stab.csv")['train'].train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['seq_id', 'seq_type', 'seq', 'label'],
        num_rows: 62079
    })
    test: Dataset({
        features: ['seq_id', 'seq_type', 'seq', 'label'],
        num_rows: 6898
    })
})

In [6]:
dataset["train"][0]

{'seq_id': 'train_prot_32672',
 'seq_type': 'prot',
 'seq': 'FYRLIIFKYPDYIDTYLRLAAIAKEKNNLQLSIEGNGSGGNGSGGNGSGN',
 'label': 0.7599999904632561}

In [7]:
token_len_list = []
for item in dataset["test"]:
    inputs = tokenizer.tokenize(item["seq"])
    token_len_list.append( len(inputs) )

mean_len = sum(token_len_list)/len(token_len_list)
min_len  = min(token_len_list)
max_len = max(token_len_list)

print("datasets ", "mean token lenght", mean_len, "min token length", min_len, "max token length", max_len)

datasets  mean token lenght 17.24006958538707 min token length 12 max token length 35


In [25]:
# 2. tokenize
def tokenize_function(examples):
    return tokenizer(examples['seq'], truncation=True, padding='max_length')

# 3. 对数据集应用分词函数
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 4. 创建一个数据收集器，用于动态填充和遮蔽
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/62079 [00:00<?, ? examples/s]

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/6898 [00:00<?, ? examples/s]

In [26]:
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import mean_squared_error


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    rmse = mean_squared_error(labels, predictions)
    return {"rmse": rmse}

# 设置训练参数
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=20,
    per_device_eval_batch_size=20,
    num_train_epochs=10,
    weight_decay=0.01,
)

# 使用Trainer API进行训练（假设已有train_dataset和eval_dataset）
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

  trainer = Trainer(


In [None]:
# 开始训练
trainer.train()

Epoch,Training Loss,Validation Loss,Rmse
1,0.0446,0.163462,0.163462
2,0.0419,0.1579,0.1579
3,0.0377,0.159724,0.159724
4,0.0317,0.157686,0.157686
5,0.0288,0.157124,0.157124
6,0.0254,0.150852,0.150852
7,0.0223,0.159293,0.159293
8,0.0196,0.154608,0.154608
9,0.0173,0.156104,0.156104


IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



In [None]:
#模型测试
predictions = trainer.predict(tokenized_datasets["test"])
predictions

In [18]:
trainer.evaluate()

{'eval_loss': 0.15949687361717224,
 'eval_rmse': 0.15949687361717224,
 'eval_runtime': 9.1483,
 'eval_samples_per_second': 754.017,
 'eval_steps_per_second': 37.712,
 'epoch': 10.0}

In [23]:
predictions.predictions[0:10].squeeze()

[[ 1.7208484 ]
 [ 0.00225139]
 [ 0.3325616 ]
 [-0.34372616]
 [-0.45505935]
 [-0.06892765]
 [ 0.15099108]
 [ 0.12211376]
 [ 0.3947332 ]
 [ 0.23186803]]


In [24]:
predictions.label_ids[0:10]

array([ 1.69,  0.84,  0.58, -0.15,  0.23,  0.03,  0.15,  0.2 ,  0.51,
        1.1 ], dtype=float32)