# Tutorial on PEFT with DNA Language Models

This notebook is a tutorial on how to use parameter-efficient fine-tuning techniques from the PEFT library to fine-tune a DNA Language Model. This fine-tuned DNA-LM is used to solve a task from the nucleotide benchmark dataset.

### 1. Import relevant libraries

In [84]:
import sklearn
import torch
import transformers 
import peft
import tqdm
import math

### 2. Prepare datasets

We load the ```nucleotide_transformer_downstream_tasks``` dataset, that contains 18 downstream tasks from the Nucleotide Transformer paper. It provides a consistent genomics benchmark with both binary and mulit-class classification tasks.

In [118]:
from datasets import load_dataset

raw_dataset = load_dataset("InstaDeepAI/nucleotide_transformer_downstream_tasks", "H3")

We'll use the "H3" subset of this dataset, which contains a total of 13,468 rows in the training data, and 1497 rows in the test data.

In [125]:
ds

DatasetDict({
    train: Dataset({
        features: ['sequence', 'name', 'label'],
        num_rows: 8241
    })
    validation: Dataset({
        features: ['sequence', 'name', 'label'],
        num_rows: 1455
    })
    test: Dataset({
        features: ['sequence', 'name', 'label'],
        num_rows: 1497
    })
})

In [148]:
y = ds['train']['label']

The dataset consists of three columns, ```sequence```, ```name``` and ```label```. An example is given below.

In [149]:
ds['train'][0]

{'sequence': 'TTAGGTGGTTTATTATTTAATTTTATGCTGATTAATTTATTTACTTTCGTATTCGGTTTTGTACCTTTAGCTATGATCTTAGCTAATTGAAGAGGGTGGTGTGATCTTTAACCATACCTTATTATCTTTCAGCTGCTTACCATTTTCTTATATTGATTTTTAGCGAAAGATTTTTATTCACAAGCTTTTTTTATCCTTAATGCTCGAATACTACAACAAAACAAAAAACATTAAACAGTTTTTAATTTTGTGAACAAACTGAATTACAAGGCCTTACATCTTATTTAGAATATATTAAGAAACAGAGGCCAACATGCCTTCTTAATTATATTGATATGGACCTCTGTCCTTCCTAAAAACGGGTTTTTGTTCGATGAAAAATCACCAGTAGAGCACCATATATGAATTTACAATCATTGTAGGGAAAAGAAAACTTGTTCTGCTTCGCCAATTGATTTCATTTCTTTTTTTCCTTTGTTTTTGTTGTATACTATTAATAT',
 'name': 'iYLR134W_412873|0',
 'label': 0}

We split out dataset into training, test, and validation.

In [150]:
from datasets import Dataset, DatasetDict

train_valid_split = ds['train'].train_test_split(test_size=0.15, seed=42)

train_valid_split = DatasetDict({
    'train': train_valid_split['train'],
    'validation': train_valid_split['test']
})

ds = DatasetDict({
    'train': train_valid_split['train'],
    'validation': train_valid_split['validation'],
    'test': ds['test']
})

Then, we generate our data and labels.

In [165]:
def get_kmers(seq, k=6, stride=1):
    return [seq[i:i + k] for i in range(0, len(seq), stride) if i + k <= len(seq)]

In [170]:
test_sequences = []
train_sequences = []
val_sequences = []

dataset_limit = 20

for i in range(0, len(ds['train'])):
    
    if dataset_limit and i == dataset_limit:
        break
        
    sequence = ds['train'][i]['sequence']
    sequence = "candida_glabrata " + " ".join(get_kmers(sequence))
    sequence = tokenizer(sequence)["input_ids"]
    train_sequences.append(sequence)
    

for i in range(0, len(ds['validation'])):
    if dataset_limit and i == dataset_limit:
        break
    sequence = ds['validation'][i]['sequence']
    sequence = "candida_glabrata " + " ".join(get_kmers(sequence))
    sequence = tokenizer(sequence)["input_ids"]
    val_sequences.append(sequence)
    

for i in range(0, len(ds['test'])):
    if dataset_limit and i == dataset_limit:
        break
    sequence = ds['test'][i]['sequence']
    sequence = "candida_glabrata " + " ".join(get_kmers(sequence))
    sequence = tokenizer(sequence)["input_ids"]
    test_sequences.append(sequence)
    

train_labels = ds['train']['label']
test_labels = ds['test']['label']
val_labels = ds['validation']['label']

if dataset_limit:
    train_labels = train_labels[0:dataset_limit]
    test_labels = test_labels[0:dataset_limit]
    val_labels = val_labels[0:dataset_limit]

Finally, we create a Dataset object for each our sets.

In [171]:
import pandas as pd
from datasets import Dataset

a = {"input_ids": train_sequences, "labels": train_labels}
df = pd.DataFrame.from_dict(a)
train_dataset = Dataset.from_pandas(df)

b = {"input_ids": val_sequences, "labels": val_labels}
df = pd.DataFrame.from_dict(b)
val_dataset = Dataset.from_pandas(df)

c = {"input_ids": test_sequences, "labels": test_labels}
df = pd.DataFrame.from_dict(c)
test_dataset = Dataset.from_pandas(df)

### 3. Load models


We'll use a "species-aware" DNA Language Model, called Species-LM for our task. This can be loaded through HuggingFace.

In [172]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

In [173]:
tokenizer = AutoTokenizer.from_pretrained("gagneurlab/SpeciesLM", revision = "downstream_species_lm")
lm = AutoModelForMaskedLM.from_pretrained("gagneurlab/SpeciesLM", revision = "downstream_species_lm")

In [174]:
print(torch.cuda.is_available())
lm.eval()
lm.to("cuda")
print("Done!")

True
Done!


### 4. Train model

Now, we'll train our DNA Language Model with the training dataset. We'll add a linear layer in the final layer of our language model, and then, train all the parameteres of our model with the training dataset.

In [175]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [176]:
import torch
from torch import nn

class DNA_LM(nn.Module):
    def __init__(self, model, num_labels):
        super(DNA_LM, self).__init__()
        self.model = model
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
        sequence_output = outputs.hidden_states[-1]
        # Use the [CLS] token for classification
        cls_output = sequence_output[:, 0, :]
        logits = self.classifier(cls_output)
        
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.classifier.out_features), labels.view(-1))

        return (loss, logits) if loss is not None else logits

# Number of classes for your classification task
num_labels = 2
classification_model = DNA_LM(lm, num_labels)

In [177]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [178]:
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=30,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=classification_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()


Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss
1,No log,0.656401
2,No log,0.652655
3,No log,0.649918
4,No log,0.646798
5,No log,0.643524
6,No log,0.640057
7,No log,0.636397
8,No log,0.633467
9,No log,0.634695
10,No log,0.638783


TrainOutput(global_step=60, training_loss=0.20707823435465494, metrics={'train_runtime': 18.5682, 'train_samples_per_second': 32.313, 'train_steps_per_second': 3.231, 'total_flos': 0.0, 'train_loss': 0.20707823435465494, 'epoch': 30.0})

### 5. Evaluation

In [179]:
# Generate predictions

predictions = trainer.predict(test_dataset)
logits = predictions.predictions
predicted_labels = logits.argmax(axis=-1)
print(predicted_labels)

[0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 1 1 0 0]


In [180]:
# Step 3: Evaluate the Predictions (Optional)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

accuracy = accuracy_score(test_labels, predicted_labels)
print(f"Accuracy: {accuracy}")

precision, recall, f1, _ = precision_recall_fscore_support(test_labels, predicted_labels, average='weighted')
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")


Accuracy: 0.7
Precision: 0.7141414141414142
Recall: 0.7
F1-Score: 0.7


### <WIP> Embed the sequences

We embed the sequences in the training dataset using our DNA-Language Model. 

We start by creating a function that generates kmers for any given sequence (```get_kmers``` below).

In [91]:
def get_kmers(seq, k=6, stride=1):
    return [seq[i:i + k] for i in range(0, len(seq), stride) if i + k <= len(seq)]

Then, we tokenize our sequences using the tokenizer.

Next, we create a ```torch.Tensor``` matrix from our sequences.

In [18]:
tokenized_sequences = torch.tensor(sequences)

Checking the shape of our tokenized sequences, we get:

In [19]:
tokenized_sequences.shape

torch.Size([5, 498])

We'll generate the embeddings for our tokenized sequences.

In [17]:
embedding = lm(tokenized_sequences.to('cuda'), output_hidden_states=True)["hidden_states"]

In [18]:
type(embedding[12])

torch.Tensor

In [19]:
embeddings = []
embeddings.append(embedding[12])

In [20]:
embeddings = []
batch_size = 64
device = "cuda"
count = 0

for i in tqdm.tqdm(range(math.ceil(tokenized_sequences.shape[0]/batch_size))):
    with torch.inference_mode():
        with torch.autocast(device):
            embedding = lm(tokenized_sequences[i * batch_size:(i+1)*(batch_size)].to(device), output_hidden_states=True)["hidden_states"]
            
            #embedding = torch.stack(embedding[12:], axis=0)
            embeddings.append(embedding.cpu())
            
            # embedding = lm(tokenized_sequences[i*batch_size:(i+1)*(batch_size)].to(device), output_hidden_states=True)["hidden_states"]
            # embedding = torch.stack(embedding[8:], axis=0)
            # embedding = torch.mean(embedding[:,2:-1,:], axis=1)
            # embeddings.append(embedding.cpu())

#embeddings = torch.concat(embeddings, axis=0)

  0%|          | 0/1 [00:00<?, ?it/s]


AttributeError: 'tuple' object has no attribute 'cpu'

In [52]:
tokenized

13

In [45]:
outputs.hidden_states[-1].shape

torch.Size([28, 498, 768])

In [28]:
last_hidden_states = outputs.hidden_states[-1]
last_hidden_states.shape

torch.Size([28, 498, 768])

In [87]:
embeddings.shape

torch.Size([1055, 498, 768])

Now, we'll train our DNA Language Model with the training dataset. We'll add a linear layer in the final layer of our language model, and then, train all the parameteres of our model with the training dataset.

### TODO: Include implementation of PEFT library