# 2.4 Extracting Biological Sequence Features with Genomic Large Models

Extracting feature vectors for text from a GPT-2 model is a common need, especially for text classification, similarity computation, and other downstream tasks. Hugging Face's `transformers` library provides a simple interface for this. The steps and code examples below show how to extract such feature vectors from a GPT-2 model; in this section the "text" is a DNA sequence.

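### Method 1: Using hidden states

GPT-2 is a Transformer-based language model that produces hidden states at every layer, and these hidden states can serve as a feature representation of the text. You can take the last layer's hidden states as the final feature vectors, or average or concatenate the hidden states of several layers.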
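The layer choices described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: to run offline it builds a tiny, randomly initialized GPT-2 from a `GPT2Config` with made-up sizes, and feeds it made-up token ids. With a trained checkpoint the same `output_hidden_states=True` call applies; only the shapes differ.

```python
import torch
from transformers import GPT2Config, GPT2Model

# Tiny randomly initialized GPT-2 (hypothetical sizes) so the sketch runs offline.
config = GPT2Config(vocab_size=64, n_positions=32, n_embd=16, n_layer=3, n_head=2)
model = GPT2Model(config).eval()

# Made-up token ids standing in for a tokenized sequence: shape [1, 6].
input_ids = torch.tensor([[1, 5, 9, 12, 7, 3]])

with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)

# Tuple of n_layer + 1 tensors (embedding output plus one per Transformer block),
# each of shape [batch, sequence_length, n_embd].
all_layers = outputs.hidden_states

last_layer = all_layers[-1]                               # [1, 6, 16]
avg_last_two = torch.stack(all_layers[-2:]).mean(dim=0)   # [1, 6, 16]
concat_last_two = torch.cat(all_layers[-2:], dim=-1)      # [1, 6, 32]

print(len(all_layers), last_layer.shape, avg_last_two.shape, concat_last_two.shape)
```

Averaging keeps the feature size fixed, while concatenation multiplies it by the number of layers used, which matters for the size of any downstream classifier.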


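### Method 2: Using a pooling strategy

Another approach is to pool the hidden states of all tokens into a single sentence-level feature vector. Common pooling methods include:

- **Mean pooling**: average the hidden states over all tokens.
- **Max pooling**: take the maximum of each dimension over all tokens.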
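When several sequences are padded into one batch, both pooling methods should skip the padded positions. The sketch below shows one way to do that with an attention mask. The tensors are toy values chosen for illustration (the `9.0` entries play the role of padding garbage), not real model output.

```python
import torch

# Toy batch: 2 sequences, 4 token positions, hidden size 3 (made-up numbers).
hidden_states = torch.tensor([
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]],
    [[1.0, 0.0, 2.0], [2.0, 1.0, 0.0], [3.0, 2.0, 1.0], [0.0, 5.0, 1.0]],
])
# 1 marks a real token, 0 marks padding; the first sequence has 2 real tokens.
attention_mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])

mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [2, 4, 1]

# Mean pooling: sum over real tokens only, then divide by the number of real tokens.
mean_pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

# Max pooling: set padded positions to -inf so they can never win the max.
masked = hidden_states.masked_fill(mask == 0, float("-inf"))
max_pooled = masked.max(dim=1).values

print(mean_pooled)  # rows: [2.0, 3.0, 4.0] and [1.5, 2.0, 1.0]
print(max_pooled)   # rows: [3.0, 4.0, 5.0] and [3.0, 5.0, 2.0]
```

For a single unpadded sequence, as in the cells that follow, the mask is all ones and this reduces to a plain `torch.mean` / `torch.max` over the token dimension.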

In [43]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load the DNA BPE tokenizer and split a sequence into tokens
tokenizer = AutoTokenizer.from_pretrained('dna_bpe_dict')
tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ....]

# Load the DNA GPT-2 model
model = AutoModel.from_pretrained('dna_gpt2_v0')
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')
print(inputs)

# Inference only, so disable gradient tracking
with torch.no_grad():
    outputs = model(**inputs)  # also passes the attention mask

hidden_states = outputs.last_hidden_state  # shape [1, sequence_length, 768]; same as outputs[0]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape) # expect to be 768

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768

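# embedding with first token
# (GPT-2 attention is causal, so this vector depends only on the first token;
# hidden_states[0][-1], the last token, is the position that has seen the whole sequence)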
embedding_first_token = hidden_states[0][0]
print(embedding_first_token.shape) # expect to be 768

{'input_ids': tensor([[    1,   191,    29,   753,  1241,  2104, 12297,   357,    85,  4395,
         26392,    16]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([768])
torch.Size([768])
torch.Size([768])


In [44]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('dna_wordpiece_dict')
tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ....]

model = AutoModel.from_pretrained('dna_bert_v0')
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')
print(inputs)

# Inference only, so disable gradient tracking
with torch.no_grad():
    outputs = model(**inputs)  # also passes token_type_ids and the attention mask

hidden_states = outputs.last_hidden_state  # shape [1, sequence_length, 768]; same as outputs[0]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape) # expect to be 768

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768

# embedding with first token
embedding_first_token = hidden_states[0][0]
print(embedding_first_token.shape) # expect to be 768

Some weights of BertModel were not initialized from the model checkpoint at dna_bert_v0 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': tensor([[    6,   200, 16057,    10,  1256,  2123, 12294,   366, 13138,  7826,
            82,    25]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([768])
torch.Size([768])
torch.Size([768])


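## Feature extraction and classification

We use the "dnagpt/dna_core_promoter" dataset from Chapter 1 to demonstrate how to extract sequence features with the DNA GPT-2 or DNA BERT model we trained, and then classify the sequences with logistic regression, one of the most basic classification methods.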

In [2]:
import subprocess
import os
# Set environment variables (AutoDL general regions)
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

# Or
"""
import os

# Set environment variables (AutoDL dedicated regions and other IDCs)
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Print the environment variable to confirm it was set
print(os.environ.get('HF_ENDPOINT'))
"""

In [3]:
from datasets import load_dataset
dna_data = load_dataset("dnagpt/dna_core_promoter")
dna_data

Using the latest cached version of the dataset since dnagpt/dna_core_promoter couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/dnagpt___dna_core_promoter/default/0.0.0/809065798bf4928f67397ddba23e4aa9cc5ac3ed (last modified on Fri Dec 27 16:05:19 2024).


DatasetDict({
    train: Dataset({
        features: ['sequence', 'label'],
        num_rows: 59196
    })
})

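Here we do not need to worry about the biological meaning of the data. It is enough to know that `sequence` is the DNA sequence and `label` is the class label, which takes one of two values, 0 or 1.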

In [4]:
dna_data["train"][0]

{'sequence': 'CATGCGGGTCGATATCCTATCTGAATCTCTCAGCCCAAGAGGGAGTCCGCTCATCTATTCGGCAGTACTG',
 'label': 0}

We use the scikit-learn library to build the logistic regression classifier. The first step is feature extraction:

In [5]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import GPT2Tokenizer, GPT2Model
import torch

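# Initialize the GPT-2 model and tokenizer.
# Note: this loads the stock "gpt2" checkpoint; to use the DNA model from the
# earlier cells, load 'dna_bpe_dict' and 'dna_gpt2_v0' instead.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # use eos_token as the padding token
model = GPT2Model.from_pretrained("gpt2")

def get_gpt2_feature(sequence):
    """
    Extract a feature vector with the GPT-2 model.
    :param sequence: DNA sequence (string)
    :return: mean-pooled feature vector (numpy array of shape (1, 768))
    """
    # Tokenize the DNA sequence and convert it to GPT-2 inputs
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the last layer's hidden states and average them over all tokens of the
    # sequence, giving a vector of shape (1, 768) (for batch_size=1)
    feature_vector = outputs.last_hidden_state.mean(dim=1).detach().numpy()
    return feature_vector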


In [6]:
from tqdm import tqdm
# Extract features and labels
X = []
Y = []

# Collect the feature vector and label of every sample
for item in tqdm(dna_data["train"], desc="Processing DNA data"):
    sequence = item["sequence"]
    label = item["label"]
    x_v = get_gpt2_feature(sequence)
    y_v = label
    X.append(x_v)
    Y.append(y_v)


Processing DNA data: 100%|██████████| 59196/59196 [25:16<00:00, 39.04it/s]
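Calling the model once per sequence, as above, is simple but slow. One possible speed-up is to feed several sequences per forward pass and mean-pool over real tokens only. The sketch below illustrates the idea under loud assumptions: `VOCAB`, `encode_batch`, and `extract_features` are hypothetical helpers, the character-level vocabulary stands in for the trained tokenizer, and the model is a tiny randomly initialized GPT-2 so that the example runs offline.

```python
import numpy as np
import torch
from transformers import GPT2Config, GPT2Model

# Hypothetical character-level vocabulary, used only to keep the sketch offline;
# a real pipeline would call the trained tokenizer with padding=True instead.
VOCAB = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4}

def encode_batch(sequences):
    """Right-pad a list of DNA strings; return token ids and attention mask."""
    max_len = max(len(s) for s in sequences)
    ids = torch.zeros((len(sequences), max_len), dtype=torch.long)
    mask = torch.zeros((len(sequences), max_len), dtype=torch.long)
    for row, seq in enumerate(sequences):
        ids[row, :len(seq)] = torch.tensor([VOCAB[ch] for ch in seq])
        mask[row, :len(seq)] = 1
    return ids, mask

def extract_features(model, sequences, batch_size=2):
    """Mean-pool last-layer hidden states over real tokens, one batch at a time."""
    chunks = []
    for start in range(0, len(sequences), batch_size):
        ids, mask = encode_batch(sequences[start:start + batch_size])
        with torch.no_grad():
            hidden = model(input_ids=ids, attention_mask=mask).last_hidden_state
        weights = mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * weights).sum(dim=1) / weights.sum(dim=1)
        chunks.append(pooled.numpy())
    return np.concatenate(chunks, axis=0)

# Tiny randomly initialized model (hypothetical sizes), not a trained checkpoint.
config = GPT2Config(vocab_size=len(VOCAB), n_positions=64, n_embd=16, n_layer=2, n_head=2)
model = GPT2Model(config).eval()

features = extract_features(model, ["ACGTAC", "GGT", "TTAACCGG"], batch_size=2)
print(features.shape)  # one row per sequence, one column per hidden dimension
```

The result is already a 2-D array of shape `(num_sequences, hidden_size)`, so no `squeeze` is needed before passing it to scikit-learn.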


In [11]:
X = np.array(X).squeeze(1)  # drop the singleton axis: (N, 1, 768) -> (N, 768)

In [17]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Create the logistic regression model
model = LogisticRegression(max_iter=200, solver='newton-cg')


In [18]:
# Train the model. One fit() call is enough: the solver itself iterates up to
# max_iter times, and each additional fit() call would retrain from scratch.
model.fit(X_train, y_train)


In [19]:
# Predict on the test set
y_pred = model.predict(X_test)

In [20]:
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 77.48%
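A single accuracy figure does not show how each class fares. As a hedged sketch of a slightly fuller evaluation, the example below standardizes the features, uses a stratified split, and prints per-class precision and recall. The data comes from `make_classification` as a synthetic stand-in for the extracted feature matrix, so the numbers it prints say nothing about the promoter dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (N, 768) feature matrix and its binary labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=32, n_informative=8, random_state=0
)

# stratify keeps the 0/1 label ratio the same in the training and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Standardizing the features usually helps the solver converge.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
clf.fit(X_tr, y_tr)

y_hat = clf.predict(X_te)
print(f"Accuracy: {accuracy_score(y_te, y_hat) * 100:.2f}%")
print(classification_report(y_te, y_hat, digits=3))
```

Naming the classifier `clf` also avoids overwriting the `model` variable that holds the GPT-2 network.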


In [21]:
# Print a few predictions next to the true labels
for i in range(5):
    print(f"True: {y_test[i]}, Predicted: {y_pred[i]}")

True: 0, Predicted: 0
True: 0, Predicted: 1
True: 1, Predicted: 1
True: 0, Predicted: 0
True: 0, Predicted: 0
