# Data processing 

After extracting the relevant information from the VCF files with `bcftools`, we process the data and produce the alternate sequences with the help of `samtools`.
The pre-processed data contains the following information about each variant (from Ensembl Variation build 110):
* Chromosome number
* Position of variant
* Reference Allele
* Alternate Allele

Some of the following code snippets were taken from the [DeepPerVar repository](https://github.com/alfredyewang/DeepPerVar), with the objective of mimicking the way its authors produced their alternate sequences.

In [1]:
# Import libraries
import numpy as np
from Bio import SeqIO
from Bio.Seq import MutableSeq, Seq
import pandas as pd
import subprocess

In [2]:
# Chromosome data path
chr_data_path = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/"
res_folder = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/res/"
chr_21_data = pd.read_csv(chr_data_path+'chr21_data.tsv', sep = '\t')
chr_21_data.rename(columns={'#[1]CHROM':'chr', '[2]POS':'pos', '[3]REF':'ref', '[4]ALT':'alt', '[5]TSA':'tsa', '[6]ID':'id'}, inplace=True)
chr_21_data.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id
0,21,5030088,C,T,SNV,rs1455320509
1,21,5030105,C,A,SNV,rs1173141359
2,21,5030151,T,G,SNV,rs1601770018
3,21,5030154,T,C,SNV,rs1461284410
4,21,5030160,T,A,SNV,rs1601770028
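
The rename above lists every `bcftools` header by hand. As a minimal sketch, assuming the headers always follow the `#[1]CHROM`, `[2]POS` pattern seen in this file, the index prefix could be stripped generically. The helper name `clean_bcftools_header` is hypothetical, and it yields `chrom` rather than the `chr` name used in this notebook.

```python
import re

# Hypothetical helper (not part of the notebook or of bcftools):
# removes the optional leading "#" and the "[n]" index from a header field.
_HEADER_PREFIX = re.compile(r"^#?\s*\[\d+\]")


def clean_bcftools_header(column: str) -> str:
    """Return the bare, lower-cased column name, e.g. '#[1]CHROM' -> 'chrom'."""
    return _HEADER_PREFIX.sub("", column.strip()).lower()


# Header names exactly as they appear in chr21_data.tsv
raw_columns = ["#[1]CHROM", "[2]POS", "[3]REF", "[4]ALT", "[5]TSA", "[6]ID"]
clean_columns = [clean_bcftools_header(c) for c in raw_columns]
print(clean_columns)
```

Since `DataFrame.rename` accepts a function for `columns`, the same helper could be passed as `chr_21_data.rename(columns=clean_bcftools_header)`.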


In [3]:
print(chr_21_data.shape, chr_21_data.columns)

(9243118, 6) Index(['chr', 'pos', 'ref', 'alt', 'tsa', 'id'], dtype='object')


# SNVs
We are going to start with the SNVs, as they are the simplest form of variation.

In [4]:
# Keep only the SNVs whose ALT field holds a single one-base alternate allele
chr_21_snps = chr_21_data.loc[(chr_21_data['tsa']=='SNV') & (chr_21_data['alt'].str.len() == 1)]
chr_21_snps.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id
0,21,5030088,C,T,SNV,rs1455320509
1,21,5030105,C,A,SNV,rs1173141359
2,21,5030151,T,G,SNV,rs1601770018
3,21,5030154,T,C,SNV,rs1461284410
4,21,5030160,T,A,SNV,rs1601770028
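
The filter combines two conditions, and the second one is easy to misread. The sketch below restates it on plain tuples, assuming that multi-allelic sites are written as comma-separated ALT values (so their string length is greater than 1). The function name and the example rows are illustrative, not taken from the data.

```python
def is_simple_snv(tsa: str, alt: str) -> bool:
    """True for an SNV whose ALT field holds exactly one single-base allele."""
    return tsa == "SNV" and len(alt) == 1


# Illustrative (tsa, alt) pairs, not real records
rows = [
    ("SNV", "T"),       # kept: one single-base alternate allele
    ("SNV", "A,G"),     # dropped: two alternate alleles, len("A,G") == 3
    ("deletion", "C"),  # dropped: not an SNV
]
kept = [row for row in rows if is_simple_snv(*row)]
print(kept)
```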


In [5]:
print(chr_21_snps.shape,',',min(chr_21_snps['pos']))

(7542415, 6) , 5030088


## Generate sequences in FASTA format with samtools

In [38]:
# Work on an explicit copy so the new columns do not trigger SettingWithCopyWarning
chr_21_snps = chr_21_snps.copy()
# samtools regions are 1-based and inclusive: 64 bases upstream + the variant + 63 downstream = 128 bases
chr_21_snps['start'] = chr_21_snps['pos'].astype(int) - 64
chr_21_snps['end'] = chr_21_snps['pos'].astype(int) + 63
# Despite its name, 'bed' holds samtools region strings (chr:start-end), not BED records
chr_21_snps['bed'] = chr_21_snps['chr'].astype(str) + ':' + chr_21_snps['start'].astype(str) + '-' + chr_21_snps['end'].astype(str)
chr_21_snps['bed'].to_csv('{}/bed_chr21'.format(res_folder), sep='\t', index=False, header=False)
chr_21_snps.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id,start,end,bed
0,21,5030088,C,T,SNV,rs1455320509,5030024,5030151,21:5030024-5030151
1,21,5030105,C,A,SNV,rs1173141359,5030041,5030168,21:5030041-5030168
2,21,5030151,T,G,SNV,rs1601770018,5030087,5030214,21:5030087-5030214
3,21,5030154,T,C,SNV,rs1461284410,5030090,5030217,21:5030090-5030217
4,21,5030160,T,A,SNV,rs1601770028,5030096,5030223,21:5030096-5030223
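
The window arithmetic can be checked on the first variant of the table. This is a small sketch with a hypothetical helper, `window_region`, which returns the region string, the window length, and the 0-based offset of the variant inside the extracted sequence. It does not handle variants closer than 64 bases to the chromosome start, where `start` would fall below 1.

```python
def window_region(chrom, pos, upstream=64, downstream=63):
    """Build a 1-based inclusive region around pos and locate the variant in it."""
    start = pos - upstream
    end = pos + downstream
    region = f"{chrom}:{start}-{end}"
    length = end - start + 1   # inclusive coordinates, hence the +1
    offset = pos - start       # 0-based index of the variant in the window
    return region, length, offset


# First SNV in the table: chromosome 21, position 5030088
print(window_region("21", 5030088))
# ('21:5030024-5030151', 128, 64)
```

The region string matches the `bed` value of row 0, and the offset of 64 is the index used when substituting the alternate allele.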


In [14]:
# samtools must be on PATH for this cell to work; otherwise run the same command in a terminal.
# The reference path contains a space, so it is quoted for the shell.
ref_fasta = "/mnt/sda1/Databases/Reference Genome/GRCh38p14/Ensembl/Homo_sapiens_GRCh38_dna_primary_assembly.fa"
cmd = 'samtools faidx "{}" -r {}/bed_chr21 -o {}/seq_vcf_chr21'.format(ref_fasta, res_folder, res_folder)
exit_code = subprocess.run(cmd, shell=True).returncode

/bin/sh: 1: samtools: not found
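
The `samtools: not found` message means the executable is not visible from the notebook's shell. As a sketch of one way to make this failure explicit, the command can be built as an argument list (which also avoids quoting the space in the reference path) and guarded with `shutil.which`. The helper names `build_faidx_command` and `run_faidx` are hypothetical; only the `-r` and `-o` options already used above are assumed.

```python
import shlex
import shutil
import subprocess


def build_faidx_command(reference, region_file, output):
    """Argument list for samtools faidx; no shell quoting is needed."""
    return ["samtools", "faidx", "-r", region_file, "-o", output, reference]


def run_faidx(cmd):
    """Run the command, failing early with a clear message if samtools is missing."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("samtools is not on PATH; add it before running this cell")
    return subprocess.run(cmd, check=True).returncode


res_folder = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/res/"
cmd = build_faidx_command(
    "/mnt/sda1/Databases/Reference Genome/GRCh38p14/Ensembl/Homo_sapiens_GRCh38_dna_primary_assembly.fa",
    res_folder + "bed_chr21",
    res_folder + "seq_vcf_chr21",
)
# Show the shell-safe form of the command without executing it
print(shlex.join(cmd))
```

Calling `run_faidx(cmd)` would then either execute samtools or raise before any partial output is written.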


## Generate alternate sequences
With the sequences produced in FASTA format, we now modify them by introducing the variant information that we have.
Remember that each sequence spans 64 bases before the variant position and 63 bases after it (128 bases in total). Therefore, the variant is located at **index 64** (0-based) of each sequence.
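
Before substituting, it can be worth confirming that the base found at the variant index really is the reference allele, since an off-by-one in the window would otherwise go unnoticed. The following is a minimal sketch on plain strings with a hypothetical helper, `apply_snv`; the toy sequence is synthetic and only mimics the 128-base layout.

```python
def apply_snv(ref_seq, ref_allele, alt_allele, offset=64):
    """Return ref_seq with the base at offset replaced by alt_allele.

    Raises ValueError if the base at offset is not the expected reference allele.
    """
    observed = ref_seq[offset].upper()
    if observed != ref_allele.upper():
        raise ValueError(
            f"expected {ref_allele!r} at index {offset}, found {observed!r}"
        )
    return ref_seq[:offset] + alt_allele + ref_seq[offset + 1:]


# Synthetic 128-base window: 64 bases, the variant base, then 63 bases
toy_ref = "A" * 64 + "C" + "G" * 63
toy_alt = apply_snv(toy_ref, "C", "T")
print(toy_ref[60:69], toy_alt[60:69])
# AAAACGGGG AAAATGGGG
```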

In [5]:
fasta_seq_path = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/res/"

In [6]:
records = list(SeqIO.parse(fasta_seq_path+'seq_vcf_chr21', "fasta"))

In [41]:
print(records[2].seq[64], chr_21_snps['alt'][2])

T G


In [30]:
ref_seqs = [record.seq for record in records]

In [42]:
example_mut = MutableSeq(ref_seqs[0])
example_mut[64] = chr_21_snps['alt'][0]
example_mut = Seq(example_mut)
print('',ref_seqs[0],'\n',example_mut)

 TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGT 
 TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGTTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGT


In [47]:
chr_21_snps.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id,start,end,bed
0,21,5030088,C,T,SNV,rs1455320509,5030024,5030151,21:5030024-5030151
1,21,5030105,C,A,SNV,rs1173141359,5030041,5030168,21:5030041-5030168
2,21,5030151,T,G,SNV,rs1601770018,5030087,5030214,21:5030087-5030214
3,21,5030154,T,C,SNV,rs1461284410,5030090,5030217,21:5030090-5030217
4,21,5030160,T,A,SNV,rs1601770028,5030096,5030223,21:5030096-5030223


In [44]:
alt_seqs = []
rs_ids = []
i = 0
tmp_seq = MutableSeq(ref_seqs[i])
tmp_seq[64] = chr_21_snps['alt'][i]
alt_seqs.append(Seq(tmp_seq))
rs_ids.append(chr_21_snps['id'][i])

In [46]:
print('',alt_seqs[0][60:69], rs_ids[0], '\n', ref_seqs[0][60:69])

 GATGTTCCT rs1455320509 
 GATGCTCCT


In [60]:
rs_ids = list(chr_21_snps['id'])
rs_ids[0:5]

['rs1455320509',
 'rs1173141359',
 'rs1601770018',
 'rs1461284410',
 'rs1601770028']

In [63]:
alt_alleles = list(chr_21_snps['alt'])
alt_alleles[0:5]

['T', 'A', 'G', 'C', 'A']

In [93]:
alt_seqs = []
# Substitute the alternate allele at the variant index (64) of every reference window
for ref_seq, alt_allele in zip(ref_seqs, alt_alleles):
    mutable = MutableSeq(ref_seq)
    mutable[64] = alt_allele
    alt_seqs.append(str(mutable))

In [97]:
ref_seqs = list(map(str, ref_seqs))
ref_seqs[0:5]

['TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGT',
 'TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTTAAAAGTGAAATTTG',
 'GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTTAAAAGTGAAATTTGAAGGGAGTGCCATGAAGCACTAAATGAGAACAAAATTTAAGAGAAA',
 'CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTTAAAAGTGAAATTTGAAGGGAGTGCCATGAAGCACTAAATGAGAACAAAATTTAAGAGAAAAAT',
 'TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTTAAAAGTGAAATTTGAAGGGAGTGCCATGAAGCACTAAATGAGAACAAAATTTAAGAGAAAAATTAGAGG']

In [94]:
print('',alt_seqs[1][60:69],'\n',ref_seqs[1][60:69])

 CTCCAAAAG 
 CTCCCAAAG


In [95]:
alt_seqs[0:5]

['TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGTTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGT',
 'TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGCTCCTGCCTAAGCCTCCAAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTTAAAAGTGAAATTTG',
 'GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGGTTTAAAAGTGAAATTTGAAGGGAGTGCCATGAAGCACTAAATGAGAACAAAATTTAAGAGAAA',
 'CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTCAAAAGTGAAATTTGAAGGGAGTGCCATGAAGCACTAAATGAGAACAAAATTTAAGAGAAAAAT',
 'TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTTAAAAGAGAAATTTGAAGGGAGTGCCATGAAGCACTAAATGAGAACAAAATTTAAGAGAAAAATTAGAGG']

In [75]:
chr_21_snps.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id,start,end,bed
0,21,5030088,C,T,SNV,rs1455320509,5030024,5030151,21:5030024-5030151
1,21,5030105,C,A,SNV,rs1173141359,5030041,5030168,21:5030041-5030168
2,21,5030151,T,G,SNV,rs1601770018,5030087,5030214,21:5030087-5030214
3,21,5030154,T,C,SNV,rs1461284410,5030090,5030217,21:5030090-5030217
4,21,5030160,T,A,SNV,rs1601770028,5030096,5030223,21:5030096-5030223


In [98]:
chr21_df = chr_21_snps.copy()
chr21_df['ref_seq'] = ref_seqs
chr21_df['alt_seq'] = alt_seqs
chr21_df.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id,start,end,bed,ref_seq,alt_seq
0,21,5030088,C,T,SNV,rs1455320509,5030024,5030151,21:5030024-5030151,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...,TTTGTAGCGATGGGGCCTCACTGTGTTGCCCAGGCTAGATTCAAGC...
1,21,5030105,C,A,SNV,rs1173141359,5030041,5030168,21:5030041-5030168,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...,TCACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGAT...
2,21,5030151,T,G,SNV,rs1601770018,5030087,5030214,21:5030087-5030214,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...,GCTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCC...
3,21,5030154,T,C,SNV,rs1461284410,5030090,5030217,21:5030090-5030217,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...,CCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACC...
4,21,5030160,T,A,SNV,rs1601770028,5030096,5030223,21:5030096-5030223,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...,TAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...
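
The `ref_seq` and `alt_seq` columns are attached by position, so the FASTA records must be in the same order as the rows of the table. A hedged sketch of a sanity check follows, assuming that `samtools faidx` names each output record after the region it was extracted from (for example `21:5030024-5030151`); the helper `find_misaligned` is hypothetical.

```python
def find_misaligned(record_ids, regions):
    """Return the positions where a FASTA record name differs from its region string."""
    record_ids = list(record_ids)
    regions = list(regions)
    if len(record_ids) != len(regions):
        raise ValueError(
            f"{len(record_ids)} records but {len(regions)} regions"
        )
    return [
        i for i, (rid, region) in enumerate(zip(record_ids, regions))
        if rid != region
    ]


# Region strings from the first three rows of the table
regions = ["21:5030024-5030151", "21:5030041-5030168", "21:5030087-5030214"]
print(find_misaligned(regions, regions))
```

In the notebook this could be called as `find_misaligned([r.id for r in records], chr_21_snps['bed'])`, with an empty list meaning that rows and sequences line up.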


In [8]:
with open(fasta_seq_path+"seq_vcf_chr21", "r") as fasta_file:
    # SeqIO.parse is lazy, so the records are consumed while the file is still open
    fasta_sequences = list(SeqIO.parse(fasta_file, "fasta"))

# Test process_data module

In [2]:
from process_data import generate_sequences

In [3]:
data_path = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/"
res_path = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/res/"
chromosome = "17"

In [4]:
chr_n_data = pd.read_csv(data_path+'chr{}_data.tsv'.format(chromosome), sep = '\t')
chr_n_data.rename(columns={"#[1]CHROM":'chr', '[2]POS':'pos', '[3]REF':'ref', '[4]ALT':'alt', '[5]TSA':'tsa', "[6]ID":'id'}, inplace=True)

# Keep only the SNVs whose ALT field holds a single one-base alternate allele
chr_n_snps = chr_n_data.loc[(chr_n_data['tsa']=='SNV') & (chr_n_data['alt'].str.len() == 1)]
chr_n_snps.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id
0,17,60103,G,A,SNV,rs1399099657
1,17,60104,C,T,SNV,rs1363626035
2,17,60168,C,T,SNV,rs1160440358
3,17,60189,A,G,SNV,rs1473889808
4,17,60202,G,A,SNV,rs1415762065


In [5]:

# Work on an explicit copy so the new columns do not trigger SettingWithCopyWarning
chr_n_snps = chr_n_snps.copy()
chr_n_snps['start'] = chr_n_snps['pos'].astype(int) - 64
chr_n_snps['end'] = chr_n_snps['pos'].astype(int) + 63
chr_n_snps['bed'] = chr_n_snps['chr'].astype(str) + ':' + chr_n_snps['start'].astype(str) + '-' + chr_n_snps['end'].astype(str)
chr_n_snps['bed'].to_csv('{}/bed_chr{}'.format(res_path, chromosome), sep='\t', index=False, header=False)

In [6]:
# samtools must be on PATH for this cell to work; otherwise run the same command in a terminal.
# The reference path contains a space, so it is quoted for the shell.
ref_fasta = "/mnt/sda1/Databases/Reference Genome/GRCh38p14/Ensembl/Homo_sapiens_GRCh38_dna_primary_assembly.fa"
cmd = 'samtools faidx "{}" -r {}/bed_chr{} -o {}/seq_vcf_chr{}'.format(ref_fasta, res_path, chromosome, res_path, chromosome)
exit_code = subprocess.run(cmd, shell=True).returncode

/bin/sh: 1: samtools: not found


In [7]:
records = list(SeqIO.parse(res_path+'seq_vcf_chr{}'.format(chromosome), "fasta"))
ref_seqs = [record.seq for record in records]