# Data Processing

Following the experiments in the `experimentation_data_processing.ipynb` notebook, this notebook processes the data using only the steps that worked. The aim is to continue the work in a better organized Jupyter notebook.

After extracting the relevant information from the VCF files with `bcftools`, we process the data and produce the alternate sequences with the help of `samtools`.
The pre-processed data contains the following information about each variant (from Ensembl Variation build 110):
* Chromosome number
* Position of variant
* Reference Allele
* Alternate Allele

Some of the following code snippets were taken from the [DeepPerVar repository](https://github.com/alfredyewang/DeepPerVar), in order to mimic the way its authors produced their alternate sequences.

In [1]:
# Import the necessary modules
import numpy as np
from Bio import SeqIO
from Bio.Seq import MutableSeq, Seq
import pandas as pd
#import subprocess
import torch
from torch import nn
from torch.utils.data import DataLoader
import torch.optim as optim
# The following custom module contains the code to generate the alternative sequences from the data extracted from the VCF files with bcftools
from process_data import *

In [2]:
chr_data_path = '/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/'
reference_genome_path = '/mnt/sda1/Databases/Reference Genome/GRCh38p14/Ensembl/Homo_sapiens_GRCh38_dna_primary_assembly.fa'
res_path = '/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/res/'

## A little bit of data preprocessing
The pre-processing steps in this section use the scripts written to extract and reshape the data from the VCF files of the Ensembl Variation database, build 110.

In [3]:
chr_21 = generate_bed(chr_data_path, '21', res_path)
chr_21.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chr_n_snps['start'] = chr_n_snps['pos'].astype(int) - 64
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chr_n_snps['end'] = chr_n_snps['pos'].astype(int) + 63
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chr_n_snps['bed'] = chr_n_snps['chr'].astype(str) + ':' + chr_n_snps['start'].astype(str) + '

Unnamed: 0,chr,pos,ref,alt,tsa,id,start,end,bed
0,21,5030088,C,T,SNV,rs1455320509,5030024,5030151,21:5030024-5030151
1,21,5030105,C,A,SNV,rs1173141359,5030041,5030168,21:5030041-5030168
2,21,5030151,T,G,SNV,rs1601770018,5030087,5030214,21:5030087-5030214
3,21,5030154,T,C,SNV,rs1461284410,5030090,5030217,21:5030090-5030217
4,21,5030160,T,A,SNV,rs1601770028,5030096,5030223,21:5030096-5030223
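The warnings above are pandas `SettingWithCopyWarning` messages, which appear when columns are assigned on a slice of another DataFrame. The internals of `generate_bed` are not shown in this notebook, so the following is only a sketch of how the `start`, `end`, and `bed` columns in the table could be derived on an explicit copy. The function name `add_window_columns` and its parameters are hypothetical.

```python
import pandas as pd


def add_window_columns(snps: pd.DataFrame, left: int = 64, right: int = 63) -> pd.DataFrame:
    """Return a copy of `snps` with window columns around each variant.

    Hypothetical helper: it mirrors the `start`, `end` and `bed` columns
    shown in the table, not the actual code of `generate_bed`.
    """
    out = snps.copy()  # explicit copy, so assignments never touch a slice
    pos = out["pos"].astype(int)
    out["start"] = pos - left
    out["end"] = pos + right
    # Region string in `chr:start-end` form
    out["bed"] = (
        out["chr"].astype(str) + ":" + out["start"].astype(str) + "-" + out["end"].astype(str)
    )
    return out


example = pd.DataFrame({"chr": [21], "pos": [5030088], "ref": ["C"], "alt": ["T"]})
windowed = add_window_columns(example)
print(windowed[["start", "end", "bed"]])
```

With these offsets each window spans 64 + 1 + 63 = 128 bases, and the variant sits at 0-based offset 64 within the window.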


In [4]:
# Generate the alternate sequences
chr_21_df = generate_sequences(reference_genome_path, res_path, '21', chr_21)
chr_21_df.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id,start,end,bed,ref_seq,alt_seq
0,21,5030088,C,T,SNV,rs1455320509,5030024,5030151,21:5030024-5030151,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...
1,21,5030105,C,A,SNV,rs1173141359,5030041,5030168,21:5030041-5030168,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...
2,21,5030151,T,G,SNV,rs1601770018,5030087,5030214,21:5030087-5030214,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...
3,21,5030154,T,C,SNV,rs1461284410,5030090,5030217,21:5030090-5030217,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...
4,21,5030160,T,A,SNV,rs1601770028,5030096,5030223,21:5030096-5030223,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...
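`generate_sequences` lives in the custom `process_data` module and its code is not reproduced here. As a rough illustration of the substitution step only, the sketch below assumes that each reference window is already available as a string and that the variant sits at 0-based offset 64. The helper `apply_snv` is hypothetical and handles single-nucleotide variants only.

```python
def apply_snv(ref_seq: str, ref: str, alt: str, offset: int = 64) -> str:
    """Return `ref_seq` with the base at `offset` replaced by `alt`.

    Hypothetical helper for single-nucleotide variants; it raises if the
    window does not carry the expected reference allele at `offset`.
    """
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("only single-nucleotide variants are handled")
    if ref_seq[offset].upper() != ref.upper():
        raise ValueError(
            f"expected {ref!r} at offset {offset}, found {ref_seq[offset]!r}"
        )
    return ref_seq[:offset] + alt + ref_seq[offset + 1:]


# Synthetic 128-base window with a T at the variant position
ref_window = "A" * 64 + "T" + "C" * 63
alt_window = apply_snv(ref_window, ref="T", alt="G")
print(ref_window[64], alt_window[64])
```

Checking the reference allele before substituting catches off-by-one errors in the window coordinates early.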


In [5]:
# Save the data frame with the reference and alternative sequences into a csv file for easier access
#chr_21_df.to_csv("/mnt/sda1/Databases/Ensembl/Variation/110/chromosome_datasets/chromosome_21_sequeces.csv", index = False)

## Import the dataframe containing the alternative and reference sequences

In [3]:
chr_21_df = pd.read_csv('/mnt/sda1/Databases/Ensembl/Variation/110/chromosome_datasets/chromosome_21_sequeces.csv')
chr_21_df.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id,start,end,bed,ref_seq,alt_seq
0,21,5030088,C,T,SNV,rs1455320509,5030024,5030151,21:5030024-5030151,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...
1,21,5030105,C,A,SNV,rs1173141359,5030041,5030168,21:5030041-5030168,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...
2,21,5030151,T,G,SNV,rs1601770018,5030087,5030214,21:5030087-5030214,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...
3,21,5030154,T,C,SNV,rs1461284410,5030090,5030217,21:5030090-5030217,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...
4,21,5030160,T,A,SNV,rs1601770028,5030096,5030223,21:5030096-5030223,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...


In [4]:
# Check that the bases at offset 64 of `ref_seq` and `alt_seq` match the `ref` and `alt` alleles (shown here for row 2)
print(chr_21_df['ref_seq'][2][64], chr_21_df['alt_seq'][2][64])

T G
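The check above inspects a single row. A possible way to extend it to every row is sketched below with pandas string indexing; the column names follow the data frame shown earlier, while the function name `count_allele_mismatches` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def count_allele_mismatches(df: pd.DataFrame, offset: int = 64) -> int:
    """Count rows whose sequences disagree with their `ref`/`alt` alleles."""
    ref_ok = df["ref_seq"].str[offset] == df["ref"]
    alt_ok = df["alt_seq"].str[offset] == df["alt"]
    return int((~(ref_ok & alt_ok)).sum())


# Tiny synthetic frame: the second row has a wrong base in `alt_seq`
flank_l, flank_r = "A" * 64, "C" * 63
toy = pd.DataFrame(
    {
        "ref": ["T", "C"],
        "alt": ["G", "A"],
        "ref_seq": [flank_l + "T" + flank_r, flank_l + "C" + flank_r],
        "alt_seq": [flank_l + "G" + flank_r, flank_l + "C" + flank_r],
    }
)
print(count_allele_mismatches(toy))
```

A result of zero on the real data frame would indicate that every window carries the expected alleles at the variant position.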


In [5]:
chr21_ref_sequences = chr_21_df[['ref_seq']]
chr21_alt_sequences = chr_21_df[['alt_seq']]

In [6]:
# Label the reference sequences
chr21_ref_sequences['y'] = np.zeros(shape = chr21_ref_sequences.shape[0])
chr21_ref_sequences.rename({'ref_seq': 'seq'}, axis =1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chr21_ref_sequences['y'] = np.zeros(shape = chr21_ref_sequences.shape[0])
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chr21_ref_sequences.rename({'ref_seq': 'seq'}, axis =1, inplace=True)


In [7]:
chr21_ref_sequences.head()

Unnamed: 0,seq,y
0,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...,0.0
1,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...,0.0
2,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...,0.0
3,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...,0.0
4,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...,0.0


In [8]:
# Label the alternate sequences
chr21_alt_sequences['y'] = np.ones(shape = chr21_alt_sequences.shape[0])
chr21_alt_sequences.rename({'alt_seq': 'seq'}, axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chr21_alt_sequences['y'] = np.ones(shape = chr21_alt_sequences.shape[0])
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chr21_alt_sequences.rename({'alt_seq': 'seq'}, axis=1, inplace=True)


In [9]:
chr21_alt_sequences.head()

Unnamed: 0,seq,y
0,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...,1.0
1,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...,1.0
2,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...,1.0
3,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...,1.0
4,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...,1.0


In [10]:
chr_21_dataset = pd.concat([chr21_alt_sequences, chr21_ref_sequences], ignore_index=True)
chr_21_dataset.head()                           

Unnamed: 0,seq,y
0,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...,1.0
1,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...,1.0
2,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...,1.0
3,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...,1.0
4,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...,1.0


In [11]:
chr_21_dataset.shape

(15084830, 2)
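The labeling cells above trigger `SettingWithCopyWarning` because columns are added in place to frames selected from `chr_21_df`. One way to avoid this, sketched here under the assumption that only the `ref_seq` and `alt_seq` columns are needed, is to build new frames with `rename` and `assign`, which both return new objects. The function name `build_labeled_dataset` is hypothetical.

```python
import pandas as pd


def build_labeled_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Stack alternate (y=1) and reference (y=0) sequences into one frame."""
    alt = df[["alt_seq"]].rename(columns={"alt_seq": "seq"}).assign(y=1.0)
    ref = df[["ref_seq"]].rename(columns={"ref_seq": "seq"}).assign(y=0.0)
    return pd.concat([alt, ref], ignore_index=True)


toy = pd.DataFrame({"ref_seq": ["ACGT", "TTGA"], "alt_seq": ["ACCT", "TAGA"]})
dataset = build_labeled_dataset(toy)
print(dataset)
```

The result has twice as many rows as the input, which is consistent with the shape above: 15,084,830 rows correspond to 7,542,415 variants, each contributing one alternate and one reference sequence.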

# Tokenization

The goal is to tokenize the sequences with the DNABERT-2 tokenizer.

## Tokenizer from DNABERT-2

In [12]:
# Import tokenizer from the Transformers module
from transformers import AutoTokenizer, AutoModel

In [13]:
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
tokenizer

PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M', vocab_size=4096, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [14]:
samples = chr_21_dataset.sample(n=100000)
samples.head()

Unnamed: 0,seq,y
5481323,AAAGCAAAGAGAATGAGAGATCAGTGGTATGAAGACACAGGGAAGA...,1.0
2810785,GTAATTAAAACAGCATTGTATTGGCATTAAAACAGATACATAGACC...,1.0
1761843,GCATTCTGAGATCCAGGCTGAAGAAACACTGGTTAACAGGGTCATA...,1.0
5876021,ATCCAATTGGCAAAGACAATAAACACATAACTCCTAATGTTGATAA...,1.0
7547806,CCCTGCAACAGTGCCTGGAGCCAGACGTTCACCCCAGATCCTTCTG...,0.0


In [15]:
type(chr_21_dataset.y)

pandas.core.series.Series

In [15]:
sample_sequences = samples.seq.to_list()
sample_sequences[:5]

['AAAGCAAAGAGAATGAGAGATCAGTGGTATGAAGACACAGGGAAGAAGCCACGTGAAAACAAATACTGCCCAAGCCAAAGAGAGTGTGGGGCCACGAGAAGCTGAAACAGGCAGCAGGAGTTCTCCCC',
 'GTAATTAAAACAGCATTGTATTGGCATTAAAACAGATACATAGACCAATGGAACAGAATACAGAGCCAGGAAACAAGTTCACACACCTACAGTGAACTCATTTTTGAAAATGGTGCCCAGAACATATA',
 'GCATTCTGAGATCCAGGCTGAAGAAACACTGGTTAACAGGGTCATATTTTTCTCATGGTAGCAGAGAATAAATAAGAGACCAAACCAAAATGTGTAAGCATGTTTAAAGCTGCTGTGTGAAAAAAGTG',
 'ATCCAATTGGCAAAGACAATAAACACATAACTCCTAATGTTGATAAAAATATAGAGAATGAAACTTTCTCATGTAGACATTCCTTACAAGAATATAAATTGATATAACCTTTTTGGAGGGCAAGTTGG',
 'CCCTGCAACAGTGCCTGGAGCCAGACGTTCACCCCAGATCCTTCTGTGGGGTGAGACTGCAGGTCAGGCCGAGGCGTGTCAGCCAGGGTGGTGTGACTGCACCTCCAGAGCCAGCAGCAACGGCCAGG']

In [18]:
# Tokenize sequences
encoded_inputs = tokenizer(sample_sequences, padding = True, return_tensors="pt")["input_ids"]
print(encoded_inputs)

tensor([[   1,   18,  124,  ...,    3,    3,    3],
        [   1,  461,   39,  ...,    3,    3,    3],
        [   1,  183,   59,  ...,    3,    3,    3],
        ...,
        [   1,  945,  199,  ...,    3,    3,    3],
        [   1,    9,   10,  ...,    3,    3,    3],
        [   1,    5, 1040,  ...,    3,    3,    3]])


In [20]:
encoded_inputs[0]

tensor([   1,   18,  124,  145, 1505,  574, 2418,  683,  200, 2919,  123,   39,
          49,  236,  264,  463,   50, 2168,   89, 3145,  978,   53,  148,  945,
          78,   13,    2,    3,    3,    3,    3,    3,    3,    3,    3,    3,
           3,    3,    3,    3,    3,    3,    3,    3,    3,    3,    3,    3,
           3,    3,    3,    3,    3,    3,    3,    3,    3,    3,    3,    3,
           3,    3,    3,    3,    3,    3,    3,    3,    3,    3,    3,    3,
           3,    3,    3,    3,    3,    3,    3])

In [21]:
tokenizer.decode(encoded_inputs[0])

'[CLS] AAA GCAAA GAGAA TGAGAGA TCAGTG GTATGAA GACACA GGGAA GAAGCCA CGTG AAAA CAAA TACTG CCCAA GCCAAA GAGA GTGTGGG GCCA CGAGAA GCTGAAA CAGG CAGCA GGAGTT CTCC CC [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
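The decoded output shows that a large share of each padded row consists of `[PAD]` tokens, so it can help to know which positions hold real tokens. The sketch below derives a mask from the token ids alone. It assumes that id 3 is the padding token, as the decoded output suggests, and it uses a small toy batch rather than the real `encoded_inputs`.

```python
import torch

PAD_ID = 3  # assumed from the decoded output above, where id 3 maps to [PAD]

# Toy batch imitating the layout of `encoded_inputs`: [CLS] ... [SEP] [PAD]...
input_ids = torch.tensor(
    [
        [1, 18, 124, 145, 2, 3, 3],
        [1, 461, 39, 2, 3, 3, 3],
    ]
)

attention_mask = (input_ids != PAD_ID).long()  # 1 = real token, 0 = padding
token_counts = attention_mask.sum(dim=1)       # counts include [CLS] and [SEP]
print(attention_mask)
print(token_counts)
```

In the notebook itself, `tokenizer.pad_token_id` would be a safer source for the padding id than a hard-coded constant.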

In [19]:
encoded_inputs[0].dtype

torch.int64

### Data splitting

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
# Convert y in samples into a tensor
y_tensor = torch.from_numpy(samples.y.to_numpy())
y_tensor = y_tensor.to(torch.int64)
print(y_tensor, y_tensor.dtype)

tensor([0, 0, 1,  ..., 0, 1, 0]) torch.int64


In [22]:
# Features (X) are `encoded_inputs`; labels (y) are the `y` column of `samples`
seed = 7
np.random.seed(seed)

X_train, X_test, y_train, y_test = train_test_split(encoded_inputs, y_tensor, test_size=0.33, random_state= seed)

In [24]:
len(X_train[0])

71
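Because `samples` is a random draw, the two classes are not guaranteed to be equally represented in the train and test sets. A possible refinement, shown here on toy arrays standing in for `encoded_inputs` and `y_tensor`, is to pass the labels to the `stratify` argument of `train_test_split` so that both splits keep the class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 7
rng = np.random.default_rng(seed)

# Toy stand-ins for `encoded_inputs` (token ids) and `y_tensor` (labels)
X = rng.integers(low=5, high=4096, size=(100, 71))
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=seed, stratify=y
)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```

Whether stratification matters in practice depends on how unbalanced the sample turns out to be; with 100,000 sequences the effect is probably small.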

# Initial model
We are going to build and train a model to differentiate alternate sequences from reference ones.

In [33]:
# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cuda device


In [43]:
# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(51, 30),
            nn.ReLU(),
            nn.Linear(30, 15),
            nn.ReLU(),
            nn.Linear(15, 7),
            nn.ReLU(),
            nn.Linear(7, 1)
        )

    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

NeuralNetwork(
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=51, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=15, bias=True)
    (3): ReLU()
    (4): Linear(in_features=15, out_features=7, bias=True)
    (5): ReLU()
    (6): Linear(in_features=7, out_features=1, bias=True)
  )
)


In [44]:
# Loss function
loss = nn.BCELoss()
# Optimizer
optimizer = optim.Adam(model.parameters(), lr = 0.001)

In [45]:
# Loop parameters
num_epochs = 10
X_train = X_train.to(device)
y_train = y_train.to(device)

In [46]:
print(X_train.dtype, y_train.dtype, X_train.device, y_train.device)

torch.int64 torch.int64 cuda:0 cuda:0


In [47]:
# Training loop
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode

    # Forward pass
    outputs = model(X_train)
    # Store the result under its own name so the `loss` function is not overwritten
    batch_loss = loss(outputs, y_train)

    # Backpropagation and optimization
    optimizer.zero_grad()  # Zero the gradients
    batch_loss.backward()  # Compute gradients
    optimizer.step()  # Update weights

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {batch_loss.item()}')

print('Training finished')

RuntimeError: mat1 and mat2 must have the same dtype
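The error most likely arises because `X_train` holds `int64` token ids while the weights of `nn.Linear` are `float32`. Two further issues would probably surface once the dtype is fixed: the first layer expects 51 features although the padded sequences have 71 tokens, and `nn.BCELoss` expects probabilities and float targets with the same shape as the model output. The sketch below shows one way the loop could run end to end under these assumptions. It swaps in `nn.BCEWithLogitsLoss` (which applies the sigmoid internally) in place of `nn.BCELoss`, uses random toy data, and scales raw token ids by the vocabulary size purely so that the example runs. It is not meant as a sensible input representation.

```python
import torch
from torch import nn, optim

torch.manual_seed(7)

seq_len = 71  # padded token length reported by len(X_train[0])

# Toy stand-ins for the tokenized inputs and labels
X_train = torch.randint(low=0, high=4096, size=(256, seq_len))
y_train = torch.randint(low=0, high=2, size=(256,))

model = nn.Sequential(
    nn.Linear(seq_len, 30),
    nn.ReLU(),
    nn.Linear(30, 15),
    nn.ReLU(),
    nn.Linear(15, 7),
    nn.ReLU(),
    nn.Linear(7, 1),  # raw logit; the sigmoid is applied inside the loss
)

loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# nn.Linear weights are float32, so inputs and targets are cast to match.
# Dividing by the vocabulary size is an illustrative scaling choice only.
X = X_train.float() / 4096.0
y = y_train.float().unsqueeze(1)  # shape (N, 1), same as the model output

losses = []
for epoch in range(10):
    model.train()
    logits = model(X)
    batch_loss = loss_fn(logits, y)

    optimizer.zero_grad()
    batch_loss.backward()
    optimizer.step()

    losses.append(batch_loss.item())
    print(f"Epoch [{epoch + 1}/10], Loss: {losses[-1]:.4f}")
```

The loss function and the per-epoch loss value are kept under separate names (`loss_fn` and `batch_loss`), so the function remains callable in every epoch.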