# Data processing 

After extracting the relevant information from the VCF files with the help of `bcftools`, we process the data and produce the alternate sequences with the help of `samtools`.
The pre-processed data contain the following information about each variant (from Ensembl Variation build 110):
* Chromosome number
* Position of variant
* Reference Allele
* Alternate Allele

Some of the following code snippets were adapted from the [DeepPerVar repository](https://github.com/alfredyewang/DeepPerVar), with the aim of mimicking the way its authors produced their alternate sequences.

In [30]:
# Import libraries
import numpy as np
from Bio import SeqIO
from Bio.Seq import MutableSeq, Seq
import pandas as pd
import subprocess

In [2]:
# Chromosome data path
chr_data_path = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/"
res_folder = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/res/"
chr_21_data = pd.read_csv(chr_data_path+'chr21_data.tsv', sep = '\t')
chr_21_data.rename(columns={'#[1]CHROM':'chr', '[2]POS':'pos', '[3]REF':'ref', '[4]ALT':'alt', '[5]TSA':'tsa', '[6]ID':'id'}, inplace=True)
chr_21_data.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id
0,21,5030088,C,T,SNV,rs1455320509
1,21,5030105,C,A,SNV,rs1173141359
2,21,5030151,T,G,SNV,rs1601770018
3,21,5030154,T,C,SNV,rs1461284410
4,21,5030160,T,A,SNV,rs1601770028
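The column names in the raw file (`#[1]CHROM`, `[2]POS`, ...) carry a leading `#` and a bracketed field index. As a minimal sketch, the same renaming can be done programmatically instead of with a hand-written mapping. The in-memory `toy_tsv` string and the `clean_header` helper are hypothetical and only stand in for the real `chr21_data.tsv`.

```python
import io
import re

import pandas as pd

# Toy stand-in for the pre-processed TSV (two rows copied from the table above).
toy_tsv = (
    "#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT\t[5]TSA\t[6]ID\n"
    "21\t5030088\tC\tT\tSNV\trs1455320509\n"
    "21\t5030105\tC\tA\tSNV\trs1173141359\n"
)


def clean_header(name: str) -> str:
    """Strip the leading '#' and the '[n]' field index, then lowercase the name."""
    return re.sub(r"^#?\[\d+\]", "", name).lower()


toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t")
# 'CHROM' becomes 'chrom'; it is shortened to 'chr' to match the notebook's naming.
toy = toy.rename(columns=clean_header).rename(columns={"chrom": "chr"})
print(list(toy.columns))
```

This produces the same six column names used in the rest of the notebook, without listing each raw header by hand.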


In [3]:
print(chr_21_data.shape, chr_21_data.columns)

(9243118, 6) Index(['chr', 'pos', 'ref', 'alt', 'tsa', 'id'], dtype='object')


# SNVs
We are going to start with the SNVs, as they are the simplest form of variation.

In [4]:
# Keep only the variants that are SNVs and have a single alternate allele
chr_21_snps = chr_21_data.loc[(chr_21_data['tsa']=='SNV') & (chr_21_data['alt'].str.len() == 1)]
chr_21_snps.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id
0,21,5030088,C,T,SNV,rs1455320509
1,21,5030105,C,A,SNV,rs1173141359
2,21,5030151,T,G,SNV,rs1601770018
3,21,5030154,T,C,SNV,rs1461284410
4,21,5030160,T,A,SNV,rs1601770028


In [5]:
print(chr_21_snps.shape,',',min(chr_21_snps['pos']))

(7542415, 6) , 5030088
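To make the filter above concrete, here is a small sketch on a toy table. It assumes that multi-allelic sites list their alternate alleles comma-separated in `alt` (as in the VCF `ALT` field), so a string length of 1 selects single-allele sites. The `toy` data are invented for illustration.

```python
import pandas as pd

# Invented rows: one plain SNV, one multi-allelic SNV, one deletion, one plain SNV.
toy = pd.DataFrame(
    {
        "pos": [100, 101, 102, 103],
        "ref": ["C", "C", "AT", "G"],
        "alt": ["T", "A,G", "A", "C"],
        "tsa": ["SNV", "SNV", "deletion", "SNV"],
    }
)

is_snv = toy["tsa"] == "SNV"
single_alt = toy["alt"].str.len() == 1  # "A,G" has length 3, so it is excluded

# .copy() gives an independent DataFrame, so adding columns later
# does not raise SettingWithCopyWarning.
toy_snps = toy.loc[is_snv & single_alt].copy()
toy_snps["start"] = toy_snps["pos"] - 64

print(toy_snps.index.tolist())
```

Note that the filtered frame keeps the original row labels, so its index is no longer contiguous. This matters later when rows are paired with sequences.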


## Generate sequences in FASTA format with samtools

In [25]:
# Work on an explicit copy so that adding columns does not raise SettingWithCopyWarning
chr_21_snps = chr_21_snps.copy()
# samtools regions are 1-based and inclusive: 64 bases upstream, the variant, 63 bases downstream
chr_21_snps['start'] = chr_21_snps['pos'].astype(int) - 64
chr_21_snps['end'] = chr_21_snps['pos'].astype(int) + 63
# Despite the column and file names, these are samtools region strings (chr:start-end), not BED records
chr_21_snps['bed'] = chr_21_snps['chr'].astype(str) + ':' + chr_21_snps['start'].astype(str) + '-' + chr_21_snps['end'].astype(str)
chr_21_snps['bed'].to_csv('{}/bed_chr21'.format(res_folder), sep='\t', index=False, header=False)
chr_21_snps.head()

Unnamed: 0,chr,pos,ref,alt,tsa,id,start,end,bed
0,21,5030088,C,T,SNV,rs1455320509,5030024,5030151,21:5030024-5030151
1,21,5030105,C,A,SNV,rs1173141359,5030041,5030168,21:5030041-5030168
2,21,5030151,T,G,SNV,rs1601770018,5030087,5030214,21:5030087-5030214
3,21,5030154,T,C,SNV,rs1461284410,5030090,5030217,21:5030090-5030217
4,21,5030160,T,A,SNV,rs1601770028,5030096,5030223,21:5030096-5030223


In [14]:
# Run this directly in a terminal if samtools is not on the PATH seen by the notebook.
# The reference path contains a space, so it is quoted for the shell.
ref_fasta = "/mnt/sda1/Databases/Reference Genome/GRCh38p14/Ensembl/Homo_sapiens_GRCh38_dna_primary_assembly.fa"
cmd = 'samtools faidx "{}" -r {}/bed_chr21 -o {}/seq_vcf_chr21'.format(ref_fasta, res_folder, res_folder)
exit_code = subprocess.run(cmd, shell=True).returncode

/bin/sh: 1: samtools: not found
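The `samtools: not found` message means the shell started by the notebook cannot see the `samtools` executable. A minimal sketch of a more defensive call is shown below: it checks the PATH first and passes the arguments as a list, which avoids quoting problems with paths that contain spaces. The helper `build_faidx_cmd` and the example paths are hypothetical; only the `-r` and `-o` options already used above are assumed.

```python
import shlex
import shutil


def build_faidx_cmd(ref_fasta: str, region_file: str, out_fasta: str) -> list:
    """Build the samtools faidx call as an argument list (no shell involved)."""
    return ["samtools", "faidx", ref_fasta, "-r", region_file, "-o", out_fasta]


# Hypothetical paths; the reference path deliberately contains a space.
cmd = build_faidx_cmd(
    "/data/Reference Genome/ref.fa",
    "/data/res/bed_chr21",
    "/data/res/seq_vcf_chr21",
)

# shutil.which returns None when the executable is not on the PATH.
samtools_available = shutil.which("samtools") is not None

# Shell-safe rendering of the command, useful for copy-pasting into a terminal.
printable = shlex.join(cmd)
print(printable)

# When samtools is available, the list can be run without shell=True:
#     subprocess.run(cmd, check=True)
```

Because the arguments are passed as a list, the space in the reference path needs no manual quoting, and `check=True` would raise an exception on a non-zero exit status instead of failing silently.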


## Generate alternate sequences
With the sequences produced in FASTA format, we now modify them to include the variant information that we have.
Remember that each region spans from `pos - 64` to `pos + 63`, that is, 64 bases before the variant and 63 bases after it (128 bases in total). Because `samtools` regions are 1-based and inclusive, the variant is located at **index 64** (0-based) of each sequence.
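The position of the variant inside each extracted sequence follows directly from the window arithmetic. The sketch below assumes 1-based, inclusive region coordinates (the convention of `samtools faidx` region strings) and uses a hypothetical helper, `snv_window`, applied to the second variant in the table above.

```python
def snv_window(chrom: str, pos: int, upstream: int = 64, downstream: int = 63):
    """Return the region string, its length and the 0-based index of the variant."""
    start = pos - upstream
    end = pos + downstream
    region = f"{chrom}:{start}-{end}"
    length = end - start + 1      # both ends are included
    variant_index = pos - start   # 0-based offset inside the extracted sequence
    return region, length, variant_index


region, length, variant_index = snv_window("21", 5030105)
print(region, length, variant_index)
# prints: 21:5030041-5030168 128 64
```

The region string matches the `bed` column for `rs1173141359`, and the variant base sits at 0-based index 64 of a 128-base sequence.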

In [6]:
fasta_seq_path = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosomes_data/res/"

In [10]:
records = list(SeqIO.parse(fasta_seq_path+'seq_vcf_chr21', "fasta"))

In [22]:
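# Exploratory check: index 65 is one base downstream of the variant (index 64),
# so a match with the alt allele here is coincidental
print(records[1].seq[65], chr_21_snps['alt'][1])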

A A


In [26]:
ref_seqs = [record.seq for record in records]

In [33]:
chr_21_snps['alt'][0]

'T'

In [34]:
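alt_seqs = []
rs_ids = []
# Pair sequences and variants by position: after filtering, the DataFrame index has gaps,
# so label lookups such as chr_21_snps['alt'][i] raise KeyError
for sequence, alt, rs_id in zip(ref_seqs, chr_21_snps['alt'], chr_21_snps['id']):
    tmp_seq = MutableSeq(sequence)
    tmp_seq[64] = alt  # the variant is at 0-based index 64 (the window starts at pos - 64)
    alt_seqs.append(Seq(tmp_seq))
    rs_ids.append(rs_id)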
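Label-based lookups such as `chr_21_snps['alt'][i]` fail once filtering has left gaps in the DataFrame index, because `i` is then treated as a row label rather than a position. The sketch below pairs sequences and variants by position with `zip` and also checks that the reference allele is where the window arithmetic says it should be. It uses plain strings instead of `Seq` objects to stay self-contained; `apply_snv`, `toy_snps` and `toy_refs` are hypothetical.

```python
import pandas as pd

VARIANT_INDEX = 64  # 0-based offset of the variant in a window starting at pos - 64


def apply_snv(ref_seq: str, ref: str, alt: str, index: int = VARIANT_INDEX) -> str:
    """Return ref_seq with the base at `index` replaced by `alt`."""
    found = ref_seq[index]
    if found.upper() != ref.upper():
        raise ValueError(f"expected {ref!r} at index {index}, found {found!r}")
    return ref_seq[:index] + alt + ref_seq[index + 1:]


# Toy variants whose index has a gap (label 1 is missing), as happens after filtering.
toy_snps = pd.DataFrame(
    {"ref": ["C", "T"], "alt": ["T", "G"], "id": ["rsA", "rsB"]},
    index=[0, 2],
)
# Toy 128-base windows with the reference allele placed at index 64.
toy_refs = ["A" * 64 + "C" + "A" * 63, "G" * 64 + "T" + "G" * 63]

# zip walks both collections in order, so row labels are never used for lookup.
alt_seqs = [
    apply_snv(seq, ref, alt)
    for seq, ref, alt in zip(toy_refs, toy_snps["ref"], toy_snps["alt"])
]
rs_ids = toy_snps["id"].tolist()
print(alt_seqs[0][62:67], rs_ids)
```

The reference check is a cheap safeguard: if the window offset or the row order is wrong, it fails on the first mismatching variant instead of silently producing incorrect alternate sequences.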

In [8]:
with open(fasta_seq_path+"seq_vcf_chr21", "r") as fasta_file:
    # SeqIO.parse returns a lazy iterator, so it is materialised as a list while the file is open
    fasta_sequences = list(SeqIO.parse(fasta_file, format = "fasta"))

In [9]:
fasta_sequences[0]

In [27]:
try:
    fasta_sequences = SeqIO.index(fasta_seq_path+"seq_vcf_chr21", "fasta")
except ValueError:
    # SeqIO.index raises ValueError when two records share an ID; the original
    # fallback was left blank, so None is used here as a placeholder
    fasta_sequences = None
if fasta_sequences is not None:
    print(fasta_sequences.keys())