# Generate Alternative Sequences (from VCF files)
This notebook generates alternative sequences from VCF files and explains the process along the way.

To avoid loading the whole data set and exhausting the RAM, for now we work with the following files:
* `homo_sapiens-chr21.vcf`
* `Homo_sapiens.GRCh38.dna.primary_assembly.fa`

These files contain the somatic variations from the **Ensembl database** and the reference genome assembly **GRCh38.p14**, both from Ensembl release 110.

In [1]:
# Import the necessary modules
from Bio import SeqIO
# To be able to modify sequences
from Bio.Seq import MutableSeq, Seq
# Import SeqRecord
from Bio.SeqRecord import SeqRecord
# To import the databases directly
from database_io import load_db
# Import pandas
import pandas as pd
# To parse the VCF files
import vcf #PyVCF
# Import numpy
import numpy as np
# To transform string-converted lists back to actual lists
import ast

# Load and parse the VCF file with the help of PyVCF

## Testing on chromosome 21
We run the first test on chromosome 21 because it is the shortest one. Once the alternative sequences have been created for this chromosome, we will apply the same process to the whole data set and all the chromosomes.

In [3]:
# Paths where the data is stored
ensembl_var_path = "/mnt/sda1/Databases/Ensembl/Variation/110/VCF/homo_sapiens-chr21.vcf"
ensembl_grch38_path = "/mnt/sda1/Databases/Reference Genome/GRCh38p14/Ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa"

In [4]:
# Declare vcf reader
vcf_reader = vcf.Reader(open(ensembl_var_path, mode = "r"))

In [5]:
# Create a list where the records stored in the vcf file are contained
records = []
# Save the records in the list
# for record in vcf_reader:
#    records.append(record)

#--------------------------------------------------------------------------------------#
# The loop above is commented out because it stores every whole `Record` object, which uses notably more RAM.
# The desired attributes are extracted a couple of cells below.

# Extract ONLY one `Record` object to refer to if necessary
sample_record = next(vcf_reader)
print(sample_record)

Record(CHROM=21, POS=5030088, REF=C, ALT=[T])


Each `Record` object contains the following attributes:
* `Record.CHROM`
* `Record.POS`
* `Record.ID`
* `Record.REF`
* `Record.ALT`
* `Record.QUAL`
* `Record.FILTER`
* `Record.INFO`
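
These attributes correspond to the eight fixed columns of a VCF data line. The sketch below is independent of PyVCF and only illustrates that mapping with the standard library. The function names and the example line are assumptions made for illustration: the line is reconstructed from the printed `sample_record`, not copied from the Ensembl file.

```python
# Minimal sketch of how one VCF data line maps to the attributes listed above.
# Not a replacement for PyVCF: headers, samples and escaping are ignored.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info_field):
    """Turn 'KEY=VALUE;FLAG;...' into a dict; bare flags become True."""
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf_line(line):
    """Parse the eight fixed columns of a single tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # VCF positions are 1-based
    record["ALT"] = record["ALT"].split(",")  # one entry per alternative allele
    record["INFO"] = parse_info(record["INFO"])
    return record


# Illustrative line (assumed layout, see the paragraph above)
example_line = "21\t5030088\trs1455320509\tC\tT\t.\t.\tdbSNP_154;TSA=SNV;E_Freq;E_TOPMed;AA=C"
example_record = parse_vcf_line(example_line)
print(example_record["POS"], example_record["REF"], example_record["ALT"])
print(example_record["INFO"])
```

Flags without a value, such as `E_Freq`, become `True`, which mirrors the dictionary that `sample_record.INFO` returns in the next cell.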

In [6]:
sample_record.INFO

{'dbSNP_154': True, 'TSA': 'SNV', 'E_Freq': True, 'E_TOPMed': True, 'AA': 'C'}

In the `INFO`-produced dictionary we can find the following key-value pairs that provide additional information about the variant:
* **dbSNP_154**: Variants (including SNPs and indels) imported from dbSNP
* **ClinVar_202301**: Variants of clinical importance imported from ClinVar
* **HGMD-PUBLIC_20204**: Variants from HGMD-PUBLIC dataset December 2020
* **COSMIC_97**: Somatic mutations found in human cancers from the COSMIC catalogue
* **TSA**: Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project
* **E_Cited**: Cited
* **E_Multiple_observations**: Multiple Observations
* **E_Freq**: Frequency
* **E_TOPMed**: TOPMed
* **E_HapMap**: HapMap 
* **E_Phenotype_or_Disease**: Phenotype or Disease
* **E_ESP**: ESP
* **E_gnomAD**: gnomAD
* **E_1000G**: 1000 Genomes
* **E_ExAC**: ExAC
* **CLIN_risk_factor**: risk factor
* **CLIN_protective**: protective
* **CLIN_confers_sensitivity**: confers sensitivity
* **CLIN_other**: other
* **CLIN_drug_response**: drug response
* **CLIN_uncertain_significance**: uncertain significance
* **CLIN_benign**: benign
* **CLIN_likely_pathogenic**: likely pathogenic
* **CLIN_pathogenic**: pathogenic
* **CLIN_likely_benign**: likely benign
* **CLIN_histocompatibility**: histocompatibility
* **CLIN_not_provided**: not provided
* **CLIN_association**: association
* **MA**: Minor Allele
* **MAF**: Minor Allele Frequency
* **MAC**: Minor Allele Count
* **AA**: Ancestral Allele


Additional information on the properties that start with `"E_"` is available [here](https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status).

For details on the properties that start with `"CLIN_"`, see this [link](https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance).
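
Since the key families are distinguished by their prefix, an `INFO` dictionary can be summarised by grouping its keys. The sketch below assumes only the `"E_"` and `"CLIN_"` prefixes described above. The function name and the layout of the result are illustrative choices, not part of PyVCF or Ensembl.

```python
def summarise_info(info):
    """Split INFO keys into evidence flags, clinical flags and everything else."""
    summary = {"evidence": [], "clinical": [], "other": {}}
    for key, value in info.items():
        if key.startswith("E_"):
            summary["evidence"].append(key[len("E_"):])
        elif key.startswith("CLIN_"):
            summary["clinical"].append(key[len("CLIN_"):])
        else:
            # Source databases, TSA, AA, MA, MAF, MAC keep their values
            summary["other"][key] = value
    return summary


# The INFO dictionary printed for `sample_record` above
sample_info = {"dbSNP_154": True, "TSA": "SNV", "E_Freq": True, "E_TOPMed": True, "AA": "C"}
summary = summarise_info(sample_info)
print(summary)
```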

## Organize data
We arrange the data contained in the `Record` objects into lists so that we can eventually organize all the data in a DataFrame.

In [7]:
# Extract Record's attributes
print(sample_record, dir(sample_record))

Record(CHROM=21, POS=5030088, REF=C, ALT=[T]) ['ALT', 'CHROM', 'FILTER', 'FORMAT', 'ID', 'INFO', 'POS', 'QUAL', 'REF', '__class__', '__cmp__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_compute_coordinates_for_indel', '_compute_coordinates_for_none_alt', '_compute_coordinates_for_snp', '_compute_coordinates_for_sv', '_sample_indexes', '_set_start_and_end', 'aaf', 'add_filter', 'add_format', 'add_info', 'affected_end', 'affected_start', 'alleles', 'call_rate', 'end', 'genotype', 'get_hets', 'get_hom_alts', 'get_hom_refs', 'get_unknowns', 'heterozygosity', 'is_deletion', 'is_filtered', 'is_indel', 'is_monomorphic', 'is_snp', 'is_sv', 'is_sv_precise', 'is_transition', 'nucl_diversity', 'num_c

In [34]:
print(sample_record.is_filtered)

False


In [7]:
# Order Record's attributes in a list
record_test = [sample_record.CHROM, sample_record.ID, sample_record.REF, sample_record.ALT, sample_record.INFO]
records.append(record_test)
record_test

['21',
 'rs1455320509',
 'C',
 [T],
 {'dbSNP_154': True,
  'TSA': 'SNV',
  'E_Freq': True,
  'E_TOPMed': True,
  'AA': 'C'}]

In [8]:
# Extract records' attributes in an ordered list
#for record in vcf_reader:
#    record_attributes = [record.CHROM, record.ID, record.REF, record.ALT, record.INFO]
#    records.append(record_attributes)

#-------------------------------------------------------------------------------------#
# GO TO ATTEMPT 3 SECTION

In [9]:
records[0]

['21',
 'rs1455320509',
 'C',
 [T],
 {'dbSNP_154': True,
  'TSA': 'SNV',
  'E_Freq': True,
  'E_TOPMed': True,
  'AA': 'C'}]

In [8]:
# Run this cell before going to Attempt 3
column_names = ['CHROM', 'ID', 'POS', 'REF', 'ALT', 'TSA', 'AA', 'MAF', 'MA', 'MAC',
                'dbSNP_154', 'ClinVar_202301', 'HGMD-PUBLIC_20204', 'COSMIC_97',
                'E_Cited', 'E_Multiple_observations', 'E_Freq', 'E_TOPMed', 'E_HapMap', 'E_phenotype_or_Disease', 'E_ESP', 'E_gnomAD', 'E_1000G', 'E_EaXC',
                'CLIN_risk_factor', 'CLIN_protective', 'CLIN_confers_sensitivity', 'CLIN_other', 'CLIN_drug_response', 'CLIN_uncertain_significance', 'CLIN_benign',
                'CLIN_likely_pahtogenic', 'CLIN_pathogenic', 'CLIN_likely_benign', 'CLIN_histocompatibility', 'CLIN_not_provided', 'CLIN_association']


## Create the DataFrame that will contain all the data

The data records have the following order `[CHROM, ID, POS, REF, ALT, TSA, INFO]`

### Attempt 3:
Create a dictionary whose keys are the `column_names` and whose values are filled in from each `Record`'s attributes and its `record.INFO` dictionary.

In [9]:
# Extract records' attributes in an ordered list
dict_list = []
for record in vcf_reader:
    record_dict = {key: np.nan for key in column_names}
    record_dict['CHROM'] = record.CHROM
    record_dict['ID'] = record.ID
    record_dict['POS'] = record.POS
    record_dict['REF'] = record.REF
    record_dict['ALT'] = record.ALT
    record_dict.update(record.INFO)
    dict_list.append(record_dict)

In [10]:
chrom21_var = pd.DataFrame(dict_list)

In [11]:
chrom21_var.head()

Unnamed: 0,CHROM,ID,POS,REF,ALT,TSA,AA,MAF,MA,MAC,...,CLIN_benign,CLIN_likely_pahtogenic,CLIN_pathogenic,CLIN_likely_benign,CLIN_histocompatibility,CLIN_not_provided,CLIN_association,E_Phenotype_or_Disease,E_ExAC,CLIN_likely_pathogenic
0,21,rs1173141359,5030105,C,[A],SNV,C,,,,...,,,,,,,,,,
1,21,rs1601770018,5030151,T,[G],SNV,T,,,,...,,,,,,,,,,
2,21,rs1461284410,5030154,T,[C],SNV,T,,,,...,,,,,,,,,,
3,21,rs1601770028,5030160,T,[A],SNV,T,,,,...,,,,,,,,,,
4,21,rs1371194619,5030173,G,[C],SNV,G,,,,...,,,,,,,,,,
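
The header of this output ends with three columns that are not in `column_names`: `E_Phenotype_or_Disease`, `E_ExAC` and `CLIN_likely_pathogenic`. The likely reason is that `record.INFO` uses those spellings while `column_names` lists `E_phenotype_or_Disease`, `E_EaXC` and `CLIN_likely_pahtogenic`, so `update()` adds new keys instead of filling the pre-created ones. One way to avoid this kind of mismatch, sketched here with plain dictionaries standing in for PyVCF records (a simplification for illustration), is to derive the columns from the keys that actually appear.

```python
# Sketch: build rows without a hand-typed list of INFO column names, so that a
# misspelled name cannot create a duplicate column.
BASE_COLUMNS = ["CHROM", "ID", "POS", "REF", "ALT"]


def records_to_rows(records):
    """Return (columns, rows); INFO keys become columns in order of first appearance."""
    columns = list(BASE_COLUMNS)
    seen = set(columns)
    flat = []
    for record in records:
        row = {name: record[name] for name in BASE_COLUMNS}
        row.update(record["INFO"])
        for key in row:
            if key not in seen:
                seen.add(key)
                columns.append(key)
        flat.append(row)
    # Keys that a record does not carry are filled with None
    rows = [{name: row.get(name) for name in columns} for row in flat]
    return columns, rows


# Toy records (hypothetical IDs and positions, not taken from the VCF file)
toy_records = [
    {"CHROM": "21", "ID": "rsA", "POS": 100, "REF": "C", "ALT": ["A"],
     "INFO": {"TSA": "SNV", "E_ExAC": True}},
    {"CHROM": "21", "ID": "rsB", "POS": 200, "REF": "T", "ALT": ["G"],
     "INFO": {"TSA": "SNV", "CLIN_likely_pathogenic": True}},
]
columns, rows = records_to_rows(toy_records)
print(columns)
```

The resulting `rows` and `columns` can be passed to `pd.DataFrame(rows, columns=columns)`. With roughly nine million records this keeps all rows in memory, just as the `dict_list` approach above does.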


Save the Chromosome 21 variation data frame

In [13]:
chrom21_var.to_pickle("/mnt/sda1/Databases/Ensembl/Variation/110/VCF/chromosome_21_variations.pkl", compression= 'gzip')

# Generate alternative sequences
Create the alternative sequences using the information we just extracted into the data frame and the reference genome assembly **GRCh38.p14**.

In [2]:
# Load the pickle file containing the variation data from chromosome 21
#chr21_variaton_csv_path = "/mnt/sda1/Databases/Ensembl/Variation/110/chromosome_21_variations.csv"
chr21_variation_pkl_path = "/mnt/sda1/Databases/Ensembl/Variation/110/VCF/chromosome_21_variations.pkl"
#chr21_variation = pd.read_csv(chr21_variaton_csv_path, index_col=0)
chr21_variation = pd.read_pickle(chr21_variation_pkl_path, compression= "gzip")
chr21_variation.head()

Unnamed: 0,CHROM,ID,POS,REF,ALT,TSA,AA,MAF,MA,MAC,...,CLIN_benign,CLIN_likely_pahtogenic,CLIN_pathogenic,CLIN_likely_benign,CLIN_histocompatibility,CLIN_not_provided,CLIN_association,E_Phenotype_or_Disease,E_ExAC,CLIN_likely_pathogenic
0,21,rs1173141359,5030105,C,[A],SNV,C,,,,...,,,,,,,,,,
1,21,rs1601770018,5030151,T,[G],SNV,T,,,,...,,,,,,,,,,
2,21,rs1461284410,5030154,T,[C],SNV,T,,,,...,,,,,,,,,,
3,21,rs1601770028,5030160,T,[A],SNV,T,,,,...,,,,,,,,,,
4,21,rs1371194619,5030173,G,[C],SNV,G,,,,...,,,,,,,,,,


In [4]:
# Check the size of the chr21_variation Data Frame
chr21_variation.shape

(9243117, 40)

In [6]:
# Check the number of different variant types in chromosome 21
chr21_variation.TSA.value_counts()

TSA
SNV                    8399022
indel                   614835
deletion                169949
insertion                54562
substitution              4518
sequence_alteration        231
Name: count, dtype: int64

In [3]:
chr21_variation.iloc[0]['ALT']

[A]

In [4]:
chr21_variation.columns

Index(['CHROM', 'ID', 'POS', 'REF', 'ALT', 'TSA', 'AA', 'MAF', 'MA', 'MAC',
       'dbSNP_154', 'ClinVar_202301', 'HGMD-PUBLIC_20204', 'COSMIC_97',
       'E_Cited', 'E_Multiple_observations', 'E_Freq', 'E_TOPMed', 'E_HapMap',
       'E_phenotype_or_Disease', 'E_ESP', 'E_gnomAD', 'E_1000G', 'E_EaXC',
       'CLIN_risk_factor', 'CLIN_protective', 'CLIN_confers_sensitivity',
       'CLIN_other', 'CLIN_drug_response', 'CLIN_uncertain_significance',
       'CLIN_benign', 'CLIN_likely_pahtogenic', 'CLIN_pathogenic',
       'CLIN_likely_benign', 'CLIN_histocompatibility', 'CLIN_not_provided',
       'CLIN_association', 'E_Phenotype_or_Disease', 'E_ExAC',
       'CLIN_likely_pathogenic'],
      dtype='object')

In [5]:
#chr21_variation.describe(include='all')

We are going to focus only on the following features for now:
- CHROM
- REF
- ALT
- TSA
- POS
- ID
- MAF (Minor Allele Frequency)

In [10]:
chr21_variation = chr21_variation[['CHROM', 'ID', 'POS', 'REF', 'ALT', 'TSA', 'MAF']]
chr21_variation

Unnamed: 0,CHROM,ID,POS,REF,ALT,TSA,MAF
0,21,rs1173141359,5030105,C,[A],SNV,
1,21,rs1601770018,5030151,T,[G],SNV,
2,21,rs1461284410,5030154,T,[C],SNV,
3,21,rs1601770028,5030160,T,[A],SNV,
4,21,rs1371194619,5030173,G,[C],SNV,
...,...,...,...,...,...,...,...
9243112,21,rs1302396446,46699976,G,[A],SNV,
9243113,21,rs1388937426,46699978,T,"[C, G]",SNV,
9243114,21,rs1347178542,46699979,T,"[A, G]",SNV,
9243115,21,rs1601974205,46699980,A,[G],SNV,


In [11]:
chr21_variation.columns

Index(['CHROM', 'ID', 'POS', 'REF', 'ALT', 'TSA', 'MAF'], dtype='object')

In [12]:
chr21_variation.POS.value_counts()

POS
26626949    25
16845821    24
36370636    23
24846108    23
22320724    22
            ..
21765218     1
21765217     1
21765215     1
21765212     1
46699981     1
Name: count, Length: 8942587, dtype: int64

In [16]:
chr21_variation[chr21_variation.POS==16845821]

Unnamed: 0,CHROM,ID,POS,REF,ALT,TSA,MAF
1870009,21,rs1166882669,16845821,TTTTTAAAAAAAAAAAAAAAAAA,[T],deletion,
1870010,21,rs1186871105,16845821,TTTTTAAAAAAAAAAAAAAAAAAAAA,[T],deletion,
1870011,21,rs1195310663,16845821,TTTTTA,[T],deletion,
1870012,21,rs1216326754,16845821,TTTTTAAAA,[T],deletion,
1870013,21,rs1227579059,16845821,TTTTTAAAAAAAA,[T],deletion,
1870014,21,rs1232894075,16845821,TTTTTAAAAAAA,[T],deletion,
1870015,21,rs1235632360,16845821,TTTTTAAAAAAAAAAAAAAAAAAAAAAA,[T],deletion,
1870016,21,rs1260860153,16845821,TTTTTAAAAAAAAAAAAAAAAAAAA,[T],deletion,
1870017,21,rs1289976524,16845821,TTTTTAAAAAAAAAAAA,[T],deletion,
1870018,21,rs1305118707,16845821,TTTTTAAAAAAAAA,[T],deletion,


In [15]:
chr21_variation[chr21_variation.TSA == 'insertion']

Unnamed: 0,CHROM,ID,POS,REF,ALT,TSA,MAF
449,21,rs1568756826,5032969,A,[ACAGTGAG],insertion,
506,21,rs1339717783,5033367,C,[CT],insertion,
507,21,rs1286929552,5033369,C,"[CA, CT]",insertion,
1431,21,rs1350039135,5043821,T,[TTCCTCCTCTCCTCCTCTCCTCCTCCTCCTCCTCCCTCCT],insertion,
1768,21,rs1442408011,5051295,C,"[CAT, CGT]",insertion,
...,...,...,...,...,...,...,...
9242975,21,rs1569175235,46699870,T,[TG],insertion,
9242987,21,rs1569175255,46699872,A,[AA],insertion,
9242988,21,rs1443541773,46699873,G,[GCT],insertion,
9243039,21,rs796884542,46699905,G,[GGTTAGGGT],insertion,


## Import the reference genome information

In [17]:
# Save reference genome path and the record in variables
reference_genome_path = "/mnt/sda1/Databases/Reference Genome/GRCh38p14/Ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
reference_genome = SeqIO.index(reference_genome_path,"fasta")

In [19]:
# List the sequence identifiers (chromosomes and scaffolds) available in the reference genome index
reference_genome_keys = list(reference_genome.keys())
reference_genome_keys

['1',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '2',
 '20',
 '21',
 '22',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'MT',
 'X',
 'Y',
 'KI270728.1',
 'KI270727.1',
 'KI270442.1',
 'KI270729.1',
 'GL000225.1',
 'KI270743.1',
 'GL000008.2',
 'GL000009.2',
 'KI270747.1',
 'KI270722.1',
 'GL000194.1',
 'KI270742.1',
 'GL000205.2',
 'GL000195.1',
 'KI270736.1',
 'KI270733.1',
 'GL000224.1',
 'GL000219.1',
 'KI270719.1',
 'GL000216.2',
 'KI270712.1',
 'KI270706.1',
 'KI270725.1',
 'KI270744.1',
 'KI270734.1',
 'GL000213.1',
 'GL000220.1',
 'KI270715.1',
 'GL000218.1',
 'KI270749.1',
 'KI270741.1',
 'GL000221.1',
 'KI270716.1',
 'KI270731.1',
 'KI270751.1',
 'KI270750.1',
 'KI270519.1',
 'GL000214.1',
 'KI270708.1',
 'KI270730.1',
 'KI270438.1',
 'KI270737.1',
 'KI270721.1',
 'KI270738.1',
 'KI270748.1',
 'KI270435.1',
 'GL000208.1',
 'KI270538.1',
 'KI270756.1',
 'KI270739.1',
 'KI270757.1',
 'KI270709.1',
 'KI270746.1',
 'KI270753.1',
 'KI270589.1',
 'KI270726.

In [20]:
# Access one record from the reference genome index
reference_genome['21']

SeqRecord(seq=Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNN'), id='21', name='21', description='21 dna:chromosome chromosome:GRCh38:21:1:46709983:1 REF', dbxrefs=[])

In [21]:
ref_items = dict(reference_genome.items())
# Is it worth to save this object in a dataframe?

In [21]:
# Extract only the chromosome 21 information
chr21_record = reference_genome['21']
chr21_record

SeqRecord(seq=Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNN'), id='21', name='21', description='21 dna:chromosome chromosome:GRCh38:21:1:46709983:1 REF', dbxrefs=[])

In [11]:
dir(chr21_record)

['__add__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_per_letter_annotations',
 '_seq',
 '_set_per_letter_annotations',
 '_set_seq',
 'annotations',
 'count',
 'dbxrefs',
 'description',
 'features',
 'format',
 'id',
 'islower',
 'isupper',
 'letter_annotations',
 'lower',
 'name',
 'reverse_complement',
 'seq',
 'translate',
 'upper']

We are interested mostly in the `seq` object. Let's see if the information from the variation data and the reference genome match.

In [12]:
chr21_sequence = chr21_record.seq
chr21_sequence

Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNN')

In [13]:
variation_pos = chr21_variation.iloc[0]['POS']
chr21_variation.iloc[0]

CHROM                                    21
ID                             rs1173141359
POS                                 5030105
REF                                       C
ALT                                     [A]
TSA                                     SNV
AA                                        C
MAF                                     NaN
MA                                      NaN
MAC                                     NaN
dbSNP_154                              True
ClinVar_202301                          NaN
HGMD-PUBLIC_20204                       NaN
COSMIC_97                               NaN
E_Cited                                 NaN
E_Multiple_observations                 NaN
E_Freq                                 True
E_TOPMed                               True
E_HapMap                                NaN
E_phenotype_or_Disease                  NaN
E_ESP                                   NaN
E_gnomAD                                NaN
E_1000G                         

In [14]:
# Check that the info matches: the REF allele in the record and the base at that position in the reference genome have to agree
print("Reference allele in position {} is {} and in the reference genome is {}".format(variation_pos, chr21_variation.iloc[0]['REF'],
                                                                                       chr21_sequence[variation_pos-1]))

Reference allele in position 5030105 is C and in the reference genome is C


In [15]:
# Print subsequence 5 bases before and after the variant position specified in the chr21_variation dataframe
chr21_sequence[variation_pos-6:variation_pos+5]

Seq('CCTCCCAAAGT')

They match!
I also confirmed this in the dbSNP database, using its genome browser.
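
The same check can be wrapped in a small helper. VCF positions are 1-based while Python indexing is 0-based, which is why the cell above reads `chr21_sequence[variation_pos-1]`. The function below is a sketch with an assumed name. It also covers multi-base `REF` alleles, such as the deletions listed earlier, and it works on any object that supports slicing and `str()`, including a Biopython `Seq`.

```python
def ref_matches(sequence, pos, ref):
    """Return True if `ref` occurs in `sequence` starting at 1-based `pos`."""
    start = pos - 1  # convert the 1-based VCF position to a 0-based index
    observed = str(sequence[start:start + len(ref)])
    return observed.upper() == ref.upper()


# Toy sequence (not real chromosome data); the 'C' sits at 1-based position 5
toy_reference = "AAAACGTTTT"
print(ref_matches(toy_reference, 5, "C"))    # single-base REF, as for an SNV
print(ref_matches(toy_reference, 5, "CGT"))  # multi-base REF, as for a deletion
print(ref_matches(toy_reference, 6, "C"))    # wrong position
```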

In [16]:
# Let's start with the SNPs because they are the simplest variants to put into the reference genome
chr21_snps = chr21_variation.copy(deep = True)
chr21_snps = chr21_snps[chr21_snps['TSA']=='SNV'] # TSA stands for Type of Sequence Alteration
chr21_snps.head()

Unnamed: 0,CHROM,ID,POS,REF,ALT,TSA,AA,MAF,MA,MAC,...,CLIN_benign,CLIN_likely_pahtogenic,CLIN_pathogenic,CLIN_likely_benign,CLIN_histocompatibility,CLIN_not_provided,CLIN_association,E_Phenotype_or_Disease,E_ExAC,CLIN_likely_pathogenic
0,21,rs1173141359,5030105,C,[A],SNV,C,,,,...,,,,,,,,,,
1,21,rs1601770018,5030151,T,[G],SNV,T,,,,...,,,,,,,,,,
2,21,rs1461284410,5030154,T,[C],SNV,T,,,,...,,,,,,,,,,
3,21,rs1601770028,5030160,T,[A],SNV,T,,,,...,,,,,,,,,,
4,21,rs1371194619,5030173,G,[C],SNV,G,,,,...,,,,,,,,,,


In [17]:
chr21_snps.tail()

Unnamed: 0,CHROM,ID,POS,REF,ALT,TSA,AA,MAF,MA,MAC,...,CLIN_benign,CLIN_likely_pahtogenic,CLIN_pathogenic,CLIN_likely_benign,CLIN_histocompatibility,CLIN_not_provided,CLIN_association,E_Phenotype_or_Disease,E_ExAC,CLIN_likely_pathogenic
9243112,21,rs1302396446,46699976,G,[A],SNV,,,,,...,,,,,,,,,,
9243113,21,rs1388937426,46699978,T,"[C, G]",SNV,,,,,...,,,,,,,,,,
9243114,21,rs1347178542,46699979,T,"[A, G]",SNV,,,,,...,,,,,,,,,,
9243115,21,rs1601974205,46699980,A,[G],SNV,,,,,...,,,,,,,,,,
9243116,21,rs1601974207,46699981,G,[A],SNV,,,,,...,,,,,,,,,,


In [18]:
str(chr21_snps.iloc[0]['ALT'][0])

'A'

## Simulate mutations in reference genome
To do this we have to turn the sequences into `MutableSeq` objects and replace the reference alleles with the ones specified in the `ALT` column of the `chr21_snps` dataframe.
In this case we want to create sequences that are 128 bases long.
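
The window arithmetic used in the following cells can be made explicit. This sketch assumes the same slice as the notebook, `[POS - 64 : POS + 64]`, and only computes the bounds. The function name is hypothetical.

```python
def window_bounds(pos, flank=64):
    """Return (start, end, offset) for a window of 2 * flank bases.

    `pos` is the 1-based VCF position. `start` and `end` are 0-based,
    half-open slice bounds, and `offset` is the index of the variant
    inside the window.
    """
    start = pos - flank
    end = pos + flank
    offset = (pos - 1) - start
    return start, end, offset


# First SNV of the chromosome 21 data frame
start, end, offset = window_bounds(5030105)
print(start, end, end - start, offset)
```

With this slice the variant is not exactly centred: 63 bases precede it and 64 follow it inside the 128-base window.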

In [19]:
# Test with a 128-base window around the mutation site
# For the first SNP record:
mutable_chr21 = MutableSeq(chr21_record.seq) # Mutable copy of the chromosome 21 sequence
mutable_chr21[chr21_snps.iloc[0]['POS']-1] = str(chr21_snps.iloc[0]['ALT'][0]) # Replace the REF base with the ALT one in the mutable sequence (POS is 1-based)
rs_end_1359 = mutable_chr21[chr21_snps.iloc[0]['POS']-64:chr21_snps.iloc[0]['POS']+64] # Take the 128-base window that contains the variant

In [20]:
len(rs_end_1359)

128

In [21]:
rs_end_1359

MutableSeq('CACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGCTCCTGCC...TGA')

To create the sequences automatically for every record, we have to start from a fresh, unmodified copy of the reference sequence each time: some variants are close to each other, and we don't want subsequences that contain more than one variant.
We also have to create and save them in an interpretable way, so we include the following information about every subsequence:
* Start and end coordinates of the subsequence relative to the reference genome
* Variant position relative to the reference genome
* Type of alteration
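
One possible layout for such a record is sketched below with plain strings and a dataclass. This is an illustration under stated assumptions, not the structure the notebook ends up using: the class name, field names and function name are hypothetical, and string slicing replaces `MutableSeq`. Unlike the loop further down, it emits one entry per `ALT` allele, because some rows carry several (for example `[C, G]`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AltSubsequence:
    chrom: str
    variant_id: str
    tsa: str         # type of sequence alteration, e.g. "SNV"
    mut_pos: int     # 1-based position in the reference genome
    start: int       # 0-based, inclusive slice bound
    end: int         # 0-based, exclusive slice bound
    alt_allele: str
    ref_seq: str
    alt_seq: str


def build_snv_subsequences(sequence, chrom, variant_id, pos, alt_alleles, flank=64, tsa="SNV"):
    """Return one AltSubsequence per ALT allele of a single-base variant."""
    start, end = pos - flank, pos + flank
    ref_seq = str(sequence[start:end])
    offset = (pos - 1) - start  # index of the variant inside the window
    results = []
    for alt in alt_alleles:
        alt = str(alt)
        # Every allele starts from the unmodified reference window
        alt_seq = ref_seq[:offset] + alt + ref_seq[offset + 1:]
        results.append(AltSubsequence(chrom, variant_id, tsa, pos, start, end, alt, ref_seq, alt_seq))
    return results


# Toy data (not real chromosome sequence), with a small flank for readability
toy_reference = "ACGTACGTAC"
subsequences = build_snv_subsequences(toy_reference, "21", "rs_toy", pos=5, alt_alleles=["C", "G"], flank=3)
for item in subsequences:
    print(item.alt_allele, item.ref_seq, item.alt_seq)
```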

In [20]:
# THIS WAS COMMENTED BECAUSE WE WERE DOING A TEST WITH A SUBSET OF THE DATA FIRST
#test_snps = chr21_snps.iloc[0:5]
#test_snps

Unnamed: 0,CHROM,ID,POS,REF,ALT,TSA,AA,MAF,MA,MAC,...,CLIN_benign,CLIN_likely_pahtogenic,CLIN_pathogenic,CLIN_likely_benign,CLIN_histocompatibility,CLIN_not_provided,CLIN_association,E_Phenotype_or_Disease,E_ExAC,CLIN_likely_pathogenic
0,21,rs1173141359,5030105,C,[A],SNV,C,,,,...,,,,,,,,,,
1,21,rs1601770018,5030151,T,[G],SNV,T,,,,...,,,,,,,,,,
2,21,rs1461284410,5030154,T,[C],SNV,T,,,,...,,,,,,,,,,
3,21,rs1601770028,5030160,T,[A],SNV,T,,,,...,,,,,,,,,,
4,21,rs1371194619,5030173,G,[C],SNV,G,,,,...,,,,,,,,,,


In [67]:
# Create altered subsequences automatically (length 128)
alt_sequences = {'CHROM': [], 'ID':[], 'REF_SEQ': [], 'ALT_SEQ': [], 'START_SUBSEQ':[], 'END_SUBSEQ':[], 'MUT_POS':[]}
bases_before_after = 64 # This is the number of bases before and after the mutation site
alt_alleles = list(chr21_snps.ALT)
alt_pos = list(chr21_snps.POS)

In [61]:
str(alt_alleles[0][0])

'A'

In [None]:
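for i in range(len(alt_alleles)):
    # 0-based, half-open slice bounds of the 128-base window
    start = alt_pos[i]-bases_before_after
    end = alt_pos[i]+bases_before_after
    # Copy only the window: copying the whole chromosome on every iteration is very slow
    ref_seq = MutableSeq(chr21_sequence[start:end])
    alt_sequences['REF_SEQ'].append(Seq(ref_seq))
    # POS is 1-based, so the variant sits at index POS-1-start of the window.
    # Only the first ALT allele is used here.
    ref_seq[alt_pos[i]-1-start] = str(alt_alleles[i][0])
    alt_sequences['ALT_SEQ'].append(Seq(ref_seq))
    alt_sequences['START_SUBSEQ'].append(start)
    alt_sequences['END_SUBSEQ'].append(end)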

alt_sequences['CHROM'] = list(chr21_snps.CHROM)
alt_sequences['ID'] = list(chr21_snps.ID)
alt_sequences['MUT_POS'] = list(chr21_snps.POS)

In [64]:
alt_sequences

{'CHROM': ['21', '21', '21', '21', '21'],
 'ID': ['rs1173141359',
  'rs1601770018',
  'rs1461284410',
  'rs1601770028',
  'rs1371194619'],
 'REF_SEQ': [Seq('CACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGCTCCTGCC...TGA'),
  Seq('CTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...AAA'),
  Seq('CTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAG...ATT'),
  Seq('AAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAG...GGC'),
  Seq('TGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTTAAAAGTGAA...CTG')],
 'ALT_SEQ': [Seq('CACTGTGTTGCCCAGGCTAGATTCAAGCTCCTGGACACAAGCGATGCTCCTGCC...TGA'),
  Seq('CTCCTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCC...AAA'),
  Seq('CTGCCTAAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAG...ATT'),
  Seq('AAGCCTCCCAAAGTGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAG...GGC'),
  Seq('TGCTGAGATTACAGGTGTGAGCCACCACGTCCAAGCTAGAGTTTTAAAAGTGAA...CTG')],
 'START_SUBSEQ': [5030041, 5030087, 5030090, 5030096, 5030109],
 'END_SUBSEQ': [5030169, 5030215, 5030218, 5030224, 5030237],
 

In [66]:
len(alt_sequences['REF_SEQ'][1])

128

# Whole dataset

In [3]:
#data_path = "/home/msr/Documents/Databases/Ensembl Variation/homo_sapiens-chr21.vcf"
# Paths where the data is stored
ensembl_var_path = "/mnt/sda1/Databases/Ensembl/Variation/110/homo_sapiens_somatic.vcf"
ensembl_grch38_path = "/mnt/sda1/Databases/Reference Genome/GRCh38p14/Ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa"

# Declare vcf reader
vcf_reader = vcf.Reader(open(ensembl_var_path, mode = "r"))

In [None]:
# Create list with all the records
for record in vcf_reader:
    print(record)