# Analysis of variant annotations for PFBC-causing genes in ABraOM

In this project we analyze the variant annotations for the six Primary Familial Brain Calcification (PFBC) genes in the SABE609 cohort.
The genes are SLC20A2, XPR1, PDGFB, PDGFRB, KIAA1161 (MYORG) and JAM2.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('ABRaOM_60+_SABE_609_exomes_annotated', sep='\t', encoding='latin1')
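
Large annotation tables often mix integers and strings in one column (chromosome names such as 1, 2, X and Y), and pandas may warn about mixed types when loading them. The sketch below shows one way to pin a column type at read time. The in-memory sample and the choice of `Chr` as the mixed column are assumptions for illustration, not a description of the actual file.

```python
import io
import pandas as pd

# Tiny tab-separated stand-in for the real file (hypothetical rows)
sample = io.StringIO(
    "Chr\tStart\tRef\tAlt\n"
    "1\t13116\tT\tG\n"
    "X\t200855\tC\tT\n"
)

# Reading Chr as text keeps '1' and 'X' in one consistent column type
demo = pd.read_csv(sample, sep='\t', dtype={'Chr': str})
print(demo['Chr'].tolist())
```

With the real file, the same `dtype` argument could be added to the `pd.read_csv` call above, provided the column name matches.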



The file with the variant information was obtained from the Downloads page of the Online Archive of Brazilian Mutations (ABraOM, http://abraom.ib.usp.br/).

In [3]:
df.head()

Unnamed: 0,Chr,Start,Ref,Alt,PredictedFunc.refGene,Gene.refGene,PredConsequence.refGene,avsnp147,FILTER,CEGH Filter,HomozygousALT count,Hemizygous count,Allele number,Allele ALT count,Frequencies,Cohort
0,1,13116,T,G,ncRNA_intronic,DDX11L1,,rs62635286,LowQual,FDP,2,0,598,6,0.010033,SABE609
1,1,13244,G,A,ncRNA_exonic,DDX11L1,,,LowQual,FAB,0,0,456,1,0.002193,SABE609
2,1,13248,C,G,ncRNA_exonic,DDX11L1,,,VQSRTrancheSNP99.00to99.90,FAB,0,0,482,2,0.004149,SABE609
3,1,13273,G,C,ncRNA_exonic,DDX11L1,,rs531730856,VQSRTrancheSNP99.00to99.90,WK-LowCall,20,0,600,68,0.113333,SABE609
4,1,13302,C,T,ncRNA_exonic,DDX11L1,,rs180734498,VQSRTrancheSNP99.00to99.90,WK-LowCall,1,0,700,12,0.017143,SABE609


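The table holds one variant per row, described by 16 columns.

Chr: Chromosome on which the variant lies;<br>
Start: Chromosomal position where the variant starts (hg19);<br>
Ref: Reference nucleotide(s) at that position in the reference genome;<br>
Alt: Alternative nucleotide(s) observed at that position;<br>
PredictedFunc.refGene: Region of the RefSeq gene hit by the variant (e.g. UTR3, UTR5, ncRNA_exonic);<br>
Gene.refGene: RefSeq gene at that position;<br>
PredConsequence.refGene: Predicted consequence of the variant on the RefSeq transcripts;<br>
avsnp147: dbSNP rs identifier, when one exists;<br>
FILTER: Variant-call quality filter (e.g. PASS, LowQual, VQSRTrancheSNP99.00to99.90);<br>
CEGH Filter: Additional filter flag (values such as FDP, FAB and WK-LowCall appear in the rows above);<br>
HomozygousALT count: Number of individuals homozygous for the alternative allele;<br>
Hemizygous count: Number of individuals hemizygous for the alternative allele;<br>
Allele number: Total number of alleles called at that position;<br>
Allele ALT count: Number of alternative alleles observed;<br>
Frequencies: Frequency of the variant in the database (Allele ALT count divided by Allele number), taken here as its frequency in the Brazilian population;<br>
Cohort: Cohort to which the record belongs.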
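
As a quick sanity check on how the `Frequencies` column relates to the allele counts, the sketch below recomputes it for the five rows displayed by `df.head()`. The small DataFrame is typed in by hand from that output, and the tolerance of 1e-6 is an assumption that matches the six decimals shown.

```python
import pandas as pd

# Allele counts and frequencies copied from the five rows displayed by df.head()
head_rows = pd.DataFrame({
    'Allele number': [598, 456, 482, 600, 700],
    'Allele ALT count': [6, 1, 2, 68, 12],
    'Frequencies': [0.010033, 0.002193, 0.004149, 0.113333, 0.017143],
})

recomputed = head_rows['Allele ALT count'] / head_rows['Allele number']
# True when every reported frequency agrees with the recomputed one to 6 decimals
frequencies_match = bool(((recomputed - head_rows['Frequencies']).abs() < 1e-6).all())
print(frequencies_match)
```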

In [4]:
len(df['Gene.refGene'].unique())

28127

In [5]:
slc20a2 = df[df['Gene.refGene'].str.contains('SLC20A2')]
slc20a2

Unnamed: 0,Chr,Start,Ref,Alt,PredictedFunc.refGene,Gene.refGene,PredConsequence.refGene,avsnp147,FILTER,CEGH Filter,HomozygousALT count,Hemizygous count,Allele number,Allele ALT count,Frequencies,Cohort
1012090,8,42274108,A,G,UTR3,SLC20A2,"NM_006749:c.*1213T>C,NM_001257181:c.*1213T>C,N...",,PASS,FDP,1,0,196,2,0.010204,SABE609
1012091,8,42274264,G,C,UTR3,SLC20A2,"NM_006749:c.*1057C>G,NM_001257181:c.*1057C>G,N...",rs6841,PASS,FDP,14,0,264,32,0.121212,SABE609
1012092,8,42274303,C,A,UTR3,SLC20A2,"NM_006749:c.*1018G>T,NM_001257181:c.*1018G>T,N...",,LowQual,FDP,1,0,340,2,0.005882,SABE609
1012093,8,42274448,C,A,UTR3,SLC20A2,"NM_006749:c.*873G>T,NM_001257181:c.*873G>T,NM_...",,LowQual,FDP,0,0,542,1,0.001845,SABE609
1012094,8,42274751,C,A,UTR3,SLC20A2,"NM_006749:c.*570G>T,NM_001257181:c.*570G>T,NM_...",rs1803657,PASS,FDP,14,0,252,28,0.111111,SABE609
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1012218,8,42397004,C,T,UTR5,"SLC20A2,SMIM19","NM_006749:c.-67096G>A;NM_001135674:c.-4612C>T,...",rs2923444,PASS,FDP,201,0,658,440,0.668693,SABE609
1012219,8,42397233,T,A,UTR5,"SLC20A2,SMIM19","NM_006749:c.-67325A>T;NM_001135674:c.-4383T>A,...",rs113755761,PASS,FDP,5,0,566,16,0.028269,SABE609
1012220,8,42397291,G,A,UTR5,"SLC20A2,SMIM19","NM_006749:c.-67383C>T;NM_001135674:c.-4325G>A,...",rs145711927,PASS,FDP,1,0,590,6,0.010169,SABE609
1012221,8,42397310,G,T,UTR5,"SLC20A2,SMIM19",NM_006749:c.-67402C>A;NM_001135674:c.-4306G>T,,LowQual,FDP,1,0,612,2,0.003268,SABE609


In [6]:
pfbcGenes = ['SLC20A2', 'XPR1', 'PDGFB', 'PDGFRB', 'KIAA1161', 'JAM2']
# One sub-table per PFBC gene; substring matching also keeps rows such as 'SLC20A2,SMIM19'
pfbcAbraom = {gene: df[df['Gene.refGene'].str.contains(gene)] for gene in pfbcGenes}


In [7]:
pfbcAbraom.keys()

dict_keys(['SLC20A2', 'XPR1', 'PDGFB', 'PDGFRB', 'KIAA1161', 'JAM2'])
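
Selecting genes with `str.contains` matches substrings, so a short symbol could in principle also pick up longer symbols that contain it. A minimal sketch of whole-symbol matching follows. The helper name `rows_for_gene`, the split on commas and semicolons, and the toy table are assumptions made for illustration.

```python
import re
import pandas as pd

def rows_for_gene(frame, gene, column='Gene.refGene'):
    """Return rows whose gene annotation lists `gene` as a whole symbol."""
    def lists_gene(annotation):
        if not isinstance(annotation, str):   # skip missing annotations
            return False
        return gene in re.split(r'[,;]', annotation)
    return frame[frame[column].apply(lists_gene)]

# Toy table: 'JAM2X' is a made-up symbol used only to show the difference
toy = pd.DataFrame({'Gene.refGene': ['JAM2', 'SLC20A2,SMIM19', 'JAM2X', None]})
exact = rows_for_gene(toy, 'JAM2')
print(exact['Gene.refGene'].tolist())
```

Rows annotated with several genes, such as `SLC20A2,SMIM19`, are still returned when either symbol is requested.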

In [8]:
pfbcAbraom['SLC20A2']

Unnamed: 0,Chr,Start,Ref,Alt,PredictedFunc.refGene,Gene.refGene,PredConsequence.refGene,avsnp147,FILTER,CEGH Filter,HomozygousALT count,Hemizygous count,Allele number,Allele ALT count,Frequencies,Cohort
1012090,8,42274108,A,G,UTR3,SLC20A2,"NM_006749:c.*1213T>C,NM_001257181:c.*1213T>C,N...",,PASS,FDP,1,0,196,2,0.010204,SABE609
1012091,8,42274264,G,C,UTR3,SLC20A2,"NM_006749:c.*1057C>G,NM_001257181:c.*1057C>G,N...",rs6841,PASS,FDP,14,0,264,32,0.121212,SABE609
1012092,8,42274303,C,A,UTR3,SLC20A2,"NM_006749:c.*1018G>T,NM_001257181:c.*1018G>T,N...",,LowQual,FDP,1,0,340,2,0.005882,SABE609
1012093,8,42274448,C,A,UTR3,SLC20A2,"NM_006749:c.*873G>T,NM_001257181:c.*873G>T,NM_...",,LowQual,FDP,0,0,542,1,0.001845,SABE609
1012094,8,42274751,C,A,UTR3,SLC20A2,"NM_006749:c.*570G>T,NM_001257181:c.*570G>T,NM_...",rs1803657,PASS,FDP,14,0,252,28,0.111111,SABE609
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1012218,8,42397004,C,T,UTR5,"SLC20A2,SMIM19","NM_006749:c.-67096G>A;NM_001135674:c.-4612C>T,...",rs2923444,PASS,FDP,201,0,658,440,0.668693,SABE609
1012219,8,42397233,T,A,UTR5,"SLC20A2,SMIM19","NM_006749:c.-67325A>T;NM_001135674:c.-4383T>A,...",rs113755761,PASS,FDP,5,0,566,16,0.028269,SABE609
1012220,8,42397291,G,A,UTR5,"SLC20A2,SMIM19","NM_006749:c.-67383C>T;NM_001135674:c.-4325G>A,...",rs145711927,PASS,FDP,1,0,590,6,0.010169,SABE609
1012221,8,42397310,G,T,UTR5,"SLC20A2,SMIM19",NM_006749:c.-67402C>A;NM_001135674:c.-4306G>T,,LowQual,FDP,1,0,612,2,0.003268,SABE609


In [10]:
# Keep only the variants annotated as nonsynonymous SNVs, gene by gene
nsVar = {gene: table[table['PredictedFunc.refGene'].str.contains('nonsynonymous SNV')]
         for gene, table in pfbcAbraom.items()}

In [11]:
for var in nsVar['SLC20A2']['PredConsequence.refGene']:
    print(var)

SLC20A2:NM_001257180:exon8:c.G1438A:p.A480T,SLC20A2:NM_001257181:exon8:c.G1438A:p.A480T,SLC20A2:NM_006749:exon8:c.G1438A:p.A480T
SLC20A2:NM_001257180:exon8:c.C1316T:p.A439V,SLC20A2:NM_001257181:exon8:c.C1316T:p.A439V,SLC20A2:NM_006749:exon8:c.C1316T:p.A439V
SLC20A2:NM_001257180:exon8:c.C1223T:p.S408L,SLC20A2:NM_001257181:exon8:c.C1223T:p.S408L,SLC20A2:NM_006749:exon8:c.C1223T:p.S408L
SLC20A2:NM_001257180:exon8:c.G1198A:p.A400T,SLC20A2:NM_001257181:exon8:c.G1198A:p.A400T,SLC20A2:NM_006749:exon8:c.G1198A:p.A400T
SLC20A2:NM_001257180:exon7:c.G910A:p.G304S,SLC20A2:NM_001257181:exon7:c.G910A:p.G304S,SLC20A2:NM_006749:exon7:c.G910A:p.G304S
SLC20A2:NM_001257180:exon7:c.G761A:p.R254Q,SLC20A2:NM_001257181:exon7:c.G761A:p.R254Q,SLC20A2:NM_006749:exon7:c.G761A:p.R254Q
SLC20A2:NM_001257180:exon5:c.A533G:p.N178S,SLC20A2:NM_001257181:exon5:c.A533G:p.N178S,SLC20A2:NM_006749:exon5:c.A533G:p.N178S
SLC20A2:NM_001257180:exon2:c.A145G:p.I49V,SLC20A2:NM_001257181:exon2:c.A145G:p.I49V,SLC20A2:NM_006749:exon

In [12]:
for var in nsVar['SLC20A2']['PredConsequence.refGene']:
    print(var.find(':', 27))

35
35
35
35
34
34
34
34
33


In [13]:
for var in nsVar['SLC20A2']['PredConsequence.refGene']:
    print(var[27:(var.find(':', 27))])

c.G1438A
c.C1316T
c.C1223T
c.G1198A
c.G910A
c.G761A
c.A533G
c.A145G
c.T44C


In [14]:
varAnnot = dict()
def extractNSVar(dictionary, gene):
    # Collect (cDNA change, protein change, frequency, allele number, FILTER) per variant
    table = dictionary[gene]
    cdsChange = []
    proteinChange = []
    for var in table['PredConsequence.refGene']:
        # Only the first transcript of a comma-separated annotation is used
        first = var.split(',')[0]
        cdsChange.append(first[first.find('c.'):first.find(':p.')])
        proteinChange.append(first[first.find('p.'):])

    varAnnot[gene] = list(zip(cdsChange, proteinChange, table['Frequencies'],
                              table['Allele number'], table['FILTER']))
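
The slicing with `find` works on the first transcript only. As an alternative, a consequence string can be split on its delimiters, which keeps every transcript. This is a sketch that assumes each entry has the five colon-separated fields seen in the printed SLC20A2 annotations; the function name `parse_consequence` is hypothetical.

```python
def parse_consequence(annotation):
    """Split an ANNOVAR-style consequence string into one dict per transcript."""
    records = []
    for entry in annotation.split(','):
        parts = entry.split(':')
        if len(parts) != 5:
            continue  # skip entries that are not gene:transcript:exon:c.:p.
        gene, transcript, exon, cds, protein = parts
        records.append({'gene': gene, 'transcript': transcript, 'exon': exon,
                        'cds': cds, 'protein': protein})
    return records

# First SLC20A2 annotation printed earlier in the notebook
example = ('SLC20A2:NM_001257180:exon8:c.G1438A:p.A480T,'
           'SLC20A2:NM_001257181:exon8:c.G1438A:p.A480T,'
           'SLC20A2:NM_006749:exon8:c.G1438A:p.A480T')
parsed = parse_consequence(example)
print(parsed[0]['cds'], parsed[0]['protein'])
```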

In [15]:
for var in nsVar['PDGFB']['PredConsequence.refGene']:
    if ',' in var:
        print(var[var.find('p.'):var.find(',')])
    else:
        print(var[var.find('p.'):])

p.H228N
p.P219L
p.T212M
p.Q145R
p.C12F


In [16]:
for key in nsVar:
    extractNSVar(nsVar, key)

In [17]:
varAnnot['SLC20A2']

[('c.G1438A', 'p.A480T', 0.00821, 1218, 'PASS'),
 ('c.C1316T', 'p.A439V', 0.0008210000000000001, 1218, 'PASS'),
 ('c.C1223T', 'p.S408L', 0.0008210000000000001, 1218, 'PASS'),
 ('c.G1198A', 'p.A400T', 0.0016420000000000002, 1218, 'PASS'),
 ('c.G910A', 'p.G304S', 0.050082, 1218, 'PASS'),
 ('c.G761A', 'p.R254Q', 0.0008210000000000001, 1218, 'PASS'),
 ('c.A533G', 'p.N178S', 0.0016420000000000002, 1218, 'PASS'),
 ('c.A145G', 'p.I49V', 0.0008210000000000001, 1218, 'PASS'),
 ('c.T44C', 'p.I15T', 0.0008210000000000001, 1218, 'PASS')]

In [18]:
varAnnot

{'SLC20A2': [('c.G1438A', 'p.A480T', 0.00821, 1218, 'PASS'),
  ('c.C1316T', 'p.A439V', 0.0008210000000000001, 1218, 'PASS'),
  ('c.C1223T', 'p.S408L', 0.0008210000000000001, 1218, 'PASS'),
  ('c.G1198A', 'p.A400T', 0.0016420000000000002, 1218, 'PASS'),
  ('c.G910A', 'p.G304S', 0.050082, 1218, 'PASS'),
  ('c.G761A', 'p.R254Q', 0.0008210000000000001, 1218, 'PASS'),
  ('c.A533G', 'p.N178S', 0.0016420000000000002, 1218, 'PASS'),
  ('c.A145G', 'p.I49V', 0.0008210000000000001, 1218, 'PASS'),
  ('c.T44C', 'p.I15T', 0.0008210000000000001, 1218, 'PASS')],
 'XPR1': [('c.G527T', 'p.R176L', 0.0008210000000000001, 1218, 'PASS'),
  ('c.G1310A', 'p.R437Q', 0.0008210000000000001, 1218, 'LowQual'),
  ('c.T1811C', 'p.L604P', 0.0008210000000000001, 1218, 'PASS')],
 'PDGFB': [('c.C682A', 'p.H228N', 0.0008210000000000001, 1218, 'PASS'),
  ('c.C656T', 'p.P219L', 0.0032840000000000005, 1218, 'PASS'),
  ('c.C635T', 'p.T212M', 0.0057469999999999995, 1218, 'PASS'),
  ('c.A434G', 'p.Q145R', 0.000822, 1216, 'PASS

In [19]:
slc20a2 = pd.DataFrame(varAnnot['SLC20A2'])

In [20]:
slc20a2.columns = ['Codon Change', 'Protein Change', 'Frequency', 'Allele Number', 'Filter Pass']

In [21]:
slc20a2['Gene'] = 'SLC20A2'

In [22]:
cols = slc20a2.columns.tolist()
cols = cols[-1:] + cols[:-1]
slc20a2 = slc20a2[cols]
slc20a2

Unnamed: 0,Gene,Codon Change,Protein Change,Frequency,Allele Number,Filter Pass
0,SLC20A2,c.G1438A,p.A480T,0.00821,1218,PASS
1,SLC20A2,c.C1316T,p.A439V,0.000821,1218,PASS
2,SLC20A2,c.C1223T,p.S408L,0.000821,1218,PASS
3,SLC20A2,c.G1198A,p.A400T,0.001642,1218,PASS
4,SLC20A2,c.G910A,p.G304S,0.050082,1218,PASS
5,SLC20A2,c.G761A,p.R254Q,0.000821,1218,PASS
6,SLC20A2,c.A533G,p.N178S,0.001642,1218,PASS
7,SLC20A2,c.A145G,p.I49V,0.000821,1218,PASS
8,SLC20A2,c.T44C,p.I15T,0.000821,1218,PASS


In [23]:
# Build the same table for the remaining genes and stack everything;
# KIAA1161 is left out here and handled separately below
pfbcNSVars = slc20a2
for gene in ['XPR1', 'PDGFB', 'PDGFRB', 'JAM2']:
    geneTable = pd.DataFrame(varAnnot[gene])
    geneTable.columns = ['Codon Change', 'Protein Change', 'Frequency', 'Allele Number', 'Filter Pass']
    geneTable['Gene'] = gene
    cols = geneTable.columns.tolist()
    geneTable = geneTable[cols[-1:] + cols[:-1]]  # move 'Gene' to the front
    pfbcNSVars = pd.concat([pfbcNSVars, geneTable], join = 'outer')

In [24]:
pfbcNSVars

Unnamed: 0,Gene,Codon Change,Protein Change,Frequency,Allele Number,Filter Pass
0,SLC20A2,c.G1438A,p.A480T,0.00821,1218,PASS
1,SLC20A2,c.C1316T,p.A439V,0.000821,1218,PASS
2,SLC20A2,c.C1223T,p.S408L,0.000821,1218,PASS
3,SLC20A2,c.G1198A,p.A400T,0.001642,1218,PASS
4,SLC20A2,c.G910A,p.G304S,0.050082,1218,PASS
5,SLC20A2,c.G761A,p.R254Q,0.000821,1218,PASS
6,SLC20A2,c.A533G,p.N178S,0.001642,1218,PASS
7,SLC20A2,c.A145G,p.I49V,0.000821,1218,PASS
8,SLC20A2,c.T44C,p.I15T,0.000821,1218,PASS
0,XPR1,c.G527T,p.R176L,0.000821,1218,PASS
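
The combined table can also be built in one step from a dictionary shaped like `varAnnot`, by prefixing each tuple with its gene. The sketch below uses a hand-copied subset of the values shown earlier; the name `annotations` is an assumption. Unlike repeated `pd.concat`, it yields a continuous index instead of restarting at 0 for each gene.

```python
import pandas as pd

# A few tuples in the same layout as varAnnot (values copied from the outputs above)
annotations = {
    'SLC20A2': [('c.G1438A', 'p.A480T', 0.00821, 1218, 'PASS')],
    'XPR1': [('c.G527T', 'p.R176L', 0.000821, 1218, 'PASS'),
             ('c.G1310A', 'p.R437Q', 0.000821, 1218, 'LowQual')],
}

columns = ['Gene', 'Codon Change', 'Protein Change', 'Frequency', 'Allele Number', 'Filter Pass']
# One row per variant, with the gene symbol placed in front of each tuple
rows = [(gene, *record) for gene, records in annotations.items() for record in records]
long_table = pd.DataFrame(rows, columns=columns)
print(long_table)
```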


Filtering out the low-quality variants

In [25]:
# Take an explicit copy so the in-place drop below does not trigger SettingWithCopyWarning
pfbcNSVars = pfbcNSVars[pfbcNSVars['Filter Pass'].str.contains('PASS')].copy()


In [26]:
pfbcNSVars.drop('Filter Pass', axis=1, inplace = True)


In [27]:
pfbcNSVars

Unnamed: 0,Gene,Codon Change,Protein Change,Frequency,Allele Number
0,SLC20A2,c.G1438A,p.A480T,0.00821,1218
1,SLC20A2,c.C1316T,p.A439V,0.000821,1218
2,SLC20A2,c.C1223T,p.S408L,0.000821,1218
3,SLC20A2,c.G1198A,p.A400T,0.001642,1218
4,SLC20A2,c.G910A,p.G304S,0.050082,1218
5,SLC20A2,c.G761A,p.R254Q,0.000821,1218
6,SLC20A2,c.A533G,p.N178S,0.001642,1218
7,SLC20A2,c.A145G,p.I49V,0.000821,1218
8,SLC20A2,c.T44C,p.I15T,0.000821,1218
0,XPR1,c.G527T,p.R176L,0.000821,1218


MYORG (KIAA1161) is not well annotated in the ABraOM database. We will need to extract its information from the chromosomal position of the variants.

In [28]:
pfbcAbraom['KIAA1161']

Unnamed: 0,Chr,Start,Ref,Alt,PredictedFunc.refGene,Gene.refGene,PredConsequence.refGene,avsnp147,FILTER,CEGH Filter,HomozygousALT count,Hemizygous count,Allele number,Allele ALT count,Frequencies,Cohort
1081663,9,34368674,A,G,UTR3,KIAA1161,NM_020702:c.*2123T>C,rs7861760,PASS,FDP,10,0,108,22,0.203704,SABE609
1081664,9,34368984,G,T,UTR3,KIAA1161,NM_020702:c.*1813C>A,rs3892388,PASS,FDP,32,0,464,72,0.155172,SABE609
1081665,9,34369027,C,T,UTR3,KIAA1161,NM_020702:c.*1770G>A,,LowQual,FDP,1,0,582,2,0.003436,SABE609
1081666,9,34369062,C,G,UTR3,KIAA1161,NM_020702:c.*1735G>C,rs1134455,PASS,FDP,46,0,528,105,0.198864,SABE609
1081667,9,34369374,G,T,UTR3,KIAA1161,NM_020702:c.*1423C>A,,PASS,FDP,1,0,426,2,0.004695,SABE609
1081668,9,34369683,TGAGA,-,UTR3,KIAA1161,NM_020702:c.*1114_*1110delTCTCA,rs374451075,PASS,FDP,38,0,496,84,0.169355,SABE609
1081669,9,34369718,A,T,UTR3,KIAA1161,NM_020702:c.*1079T>A,,LowQual,FDP,1,0,456,2,0.004386,SABE609
1081670,9,34370002,TC,-,UTR3,KIAA1161,NM_020702:c.*795_*794delGA,rs149223881,PASS,FDP,22,0,432,62,0.143519,SABE609
1081671,9,34370020,T,A,UTR3,KIAA1161,NM_020702:c.*777A>T,rs10758255,PASS,FDP,120,0,422,252,0.597156,SABE609
1081672,9,34370266,G,T,UTR3,KIAA1161,NM_020702:c.*531C>A,,LowQual,FDP,0,0,448,1,0.002232,SABE609


With the variants extracted, we now compare them with the variants found in PFBC patients.<br>
For this we use three clinical variant annotation databases:<br>
* The Human Gene Mutation Database (HGMD) - http://www.hgmd.cf.ac.uk/ac/index.php <br>
* ClinVar (NCBI) - https://www.ncbi.nlm.nih.gov/clinvar/ <br>
* The Leiden Open Variation Database (Coppola Lab) - https://coppolalab.ucla.edu/lovd_pfbc/genes <br>

# Loading the ClinVar database

The ClinVar records were retrieved by searching for the keyword 'Primary Familial Brain Calcification'. <br>
The table with all variants was downloaded as a tab-delimited text file.<br>

In [29]:
clinVar = pd.read_csv('clinvarPfbc.txt', sep = '\t')

In [30]:
clinVar.head()

Unnamed: 0,Name,Gene(s),Condition(s),Clinical significance (Last reviewed),Review status,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),Unnamed: 11
0,NC_000008.10:g.42338721_42916885del578165,CHRNB3|FNTA|SLC20A2|CHRNA6|THAP1|RNF170|HOOK3|...,Idiopathic basal ganglia calcification 1,Likely pathogenic,no assertion criteria provided,8.0,42338721 - 42916885,,,236030,237596,
1,NM_001278074.1(COL5A1):c.514G>T (p.Val172Phe),COL5A1,Connective tissue disorder|Cardiovascular phen...,Conflicting interpretations of pathogenicity(L...,"criteria provided, conflicting interpretations",9.0,137593039,9.0,134701193,180298,178594,
2,"MYORG, ILE655THR",MYORG,"BASAL GANGLIA CALCIFICATION, IDIOPATHIC, 7, AU...","Pathogenic(Last reviewed: Feb 8, 2019)",no assertion criteria provided,,,,,617697,609100,
3,NM_020702.5(MYORG):c.1057_1059GAC[1] (p.Asp354...,MYORG,"BASAL GANGLIA CALCIFICATION, IDIOPATHIC, 7, AU...","Pathogenic(Last reviewed: Feb 8, 2019)",no assertion criteria provided,9.0,34371880 - 34371882,9.0,34371882 - 34371884,617696,609099,
4,NM_020702.5(MYORG):c.1233del (p.Phe411fs),MYORG,"BASAL GANGLIA CALCIFICATION, IDIOPATHIC, 7, AU...","Pathogenic(Last reviewed: Feb 8, 2019)",no assertion criteria provided,9.0,34371709,9.0,34371711,617695,609098,


In [31]:
clinVar['Gene(s)'].unique()

array(['CHRNB3|FNTA|SLC20A2|CHRNA6|THAP1|RNF170|HOOK3|SMIM19', 'COL5A1',
       'MYORG', 'PDGFB', 'PDGFRB', 'SLC20A2', 'SLC20A2|SMIM19', 'SUFU',
       'XPR1'], dtype=object)

In [35]:
#clinVars = dict('SLC20A2': clinVar[clinVar['Gene(s)'].str.contains('SLC20A2')])
clinSLC20A2 = clinVar[clinVar['Gene(s)'].str.contains('SLC20A2')]
clinXPR1 = clinVar[clinVar['Gene(s)'].str.contains('XPR1')]
clinPDGFB = clinVar[clinVar['Gene(s)'].str.contains('PDGFB')]
clinPDGFRB = clinVar[clinVar['Gene(s)'].str.contains('PDGFRB')]
clinMYORG = clinVar[clinVar['Gene(s)'].str.contains('MYORG')]

In [36]:
clinSLC20A2

Unnamed: 0,Name,Gene(s),Condition(s),Clinical significance (Last reviewed),Review status,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),Unnamed: 11
0,NC_000008.10:g.42338721_42916885del578165,CHRNB3|FNTA|SLC20A2|CHRNA6|THAP1|RNF170|HOOK3|...,Idiopathic basal ganglia calcification 1,Likely pathogenic,no assertion criteria provided,8.0,42338721 - 42916885,,,236030,237596,
44,NM_006749.5(SLC20A2):c.935-2A>G,SLC20A2,Idiopathic basal ganglia calcification 1,"Pathogenic(Last reviewed: Dec 17, 2018)","criteria provided, single submitter",8.0,42295097,8.0,42437579,636249,624067,
45,NM_001257180.2(SLC20A2):c.21del (p.Leu7fs),SLC20A2,Idiopathic basal ganglia calcification 1,"Likely pathogenic(Last reviewed: Nov 30, 2018)",no assertion criteria provided,8.0,42329888,8.0,42472370,634898,622734,
46,NM_001257180.2(SLC20A2):c.1795-1G>A,SLC20A2,Idiopathic basal ganglia calcification 1,"Likely pathogenic(Last reviewed: Nov 30, 2018)",no assertion criteria provided,8.0,42275486,8.0,42417968,634897,622737,
47,NM_001257180.2(SLC20A2):c.303del (p.Trp101fs),SLC20A2,Idiopathic basal ganglia calcification 1,"Likely pathogenic(Last reviewed: Nov 30, 2018)",no assertion criteria provided,8.0,42323422,8.0,42465905,634896,622731,
...,...,...,...,...,...,...,...,...,...,...,...,...
116,NM_006749.5(SLC20A2):c.1784C>T (p.Thr595Met),SLC20A2,Idiopathic basal ganglia calcification 1,"Pathogenic(Last reviewed: Feb 12, 2012)",no assertion criteria provided,8.0,42286286,8.0,42428768,29797,38752,
117,NM_001257180.2(SLC20A2):c.1723G>A (p.Glu575Lys),SLC20A2,Idiopathic basal ganglia calcification 1,"Likely pathogenic(Last reviewed: Nov 21, 2017)","criteria provided, single submitter",8.0,42286347,8.0,42428829,29796,38751,
118,NM_001257180.2(SLC20A2):c.1802C>T (p.Ser601Leu),SLC20A2,Idiopathic basal ganglia calcification 1,"Pathogenic(Last reviewed: Jun 27, 2013)",no assertion criteria provided,8.0,42275478,8.0,42417960,29795,38750,
119,NM_001257180.2(SLC20A2):c.1802C>G (p.Ser601Trp),SLC20A2,Idiopathic basal ganglia calcification 1,"Pathogenic(Last reviewed: Jun 27, 2013)",no assertion criteria provided,8.0,42275478,8.0,42417960,29794,38749,


In [198]:
clinVariants = dict()
proteinChange = []
codonChange = []
clinicalSig = []

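def extractClinVar(df):
    # Parses ClinVar 'Name' strings such as 'NM_006749.5(SLC20A2):c.1784C>T (p.Thr595Met)'
    # and fills the global lists codonChange / proteinChange and the clinVariants dictionary.
    # Note: the nested loop revisits every name once per row, so names without a protein
    # change are appended to codonChange once per row (hence the repeats in the returned
    # list); clinVariants and proteinChange receive one entry per matching row.
    for idx in enumerate(df['Name']):
        for s in df['Name']:
            if 'NM' in s:
                if '(' in s[s.find('c.'):]:
                    # Name carries a protein change in parentheses
                    #print(s[s.find('c.'):s.find(' ')])
                    codon = s[s.find('c.'):s.find(' ')]
                    aminoacid = s[s.find('p.'):-1]
                    if idx[1] in s:
                        patho = df.iloc[idx[0]]['Clinical significance (Last reviewed)']
                        clinVariants[codon] = (codon, aminoacid, patho)

                        codonChange.append(s[s.find('c.'):s.find(' ')])
                        proteinChange.append(s[s.find('p.'):-1])
                else:
                    # Name has only a cDNA change (splice-site, UTR or intronic variants)
                    codonChange.append(s[s.find('c.'):])
                    codon = s[s.find('c.'):]
                    if idx[1] in s:
                        patho = df.iloc[idx[0]]['Clinical significance (Last reviewed)']
                        clinVariants[codon] = (codon, patho)
                    
                
    return codonChange, proteinChange, clinVariants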

In [199]:
extractClinVar(clinSLC20A2)

(['c.935-2A>G',
  'c.1795-1G>A',
  'c.-5+443G>A',
  'c.-5+419G>T',
  'c.-5+385T>A',
  'c.-5+308C>A',
  'c.-5+287G>A',
  'c.-389G>A',
  'c.-305G>A',
  'c.-288G>C',
  'c.-265+3dup',
  'c.-265+7dup',
  'c.-89A>G',
  'c.290-13C>A',
  'c.290-5T>C',
  'c.*70C>T',
  'c.*112G>T',
  'c.*113A>T',
  'c.*124G>T',
  'c.*171G>A',
  'c.*263A>C',
  'c.*289A>C',
  'c.*361A>G',
  'c.*380C>T',
  'c.*409A>G',
  'c.*538C>T',
  'c.*570G>T',
  'c.*660T>C',
  'c.*926C>T',
  'c.*1057C>G',
  'c.*1069G>C',
  'c.935-2A>G',
  'c.1795-1G>A',
  'c.-5+443G>A',
  'c.-5+419G>T',
  'c.-5+385T>A',
  'c.-5+308C>A',
  'c.-5+287G>A',
  'c.-389G>A',
  'c.-305G>A',
  'c.-288G>C',
  'c.-265+3dup',
  'c.-265+7dup',
  'c.-89A>G',
  'c.290-13C>A',
  'c.290-5T>C',
  'c.*70C>T',
  'c.*112G>T',
  'c.*113A>T',
  'c.*124G>T',
  'c.*171G>A',
  'c.*263A>C',
  'c.*289A>C',
  'c.*361A>G',
  'c.*380C>T',
  'c.*409A>G',
  'c.*538C>T',
  'c.*570G>T',
  'c.*660T>C',
  'c.*926C>T',
  'c.*1057C>G',
  'c.*1069G>C',
  'c.935-2A>G',
  'c.21del',
 

In [200]:
clinVariants.keys()

dict_keys(['c.935-2A>G', 'c.21del', 'c.1795-1G>A', 'c.303del', 'c.1196A>C', 'c.187G>A', 'c.188G>A', 'c.1520_1521del', 'c.1375G>T', 'c.1723G>T', 'c.136C>T', 'c.-5+443G>A', 'c.-5+419G>T', 'c.-5+385T>A', 'c.-5+308C>A', 'c.-5+287G>A', 'c.-389G>A', 'c.-305G>A', 'c.-288G>C', 'c.-265+3dup', 'c.-265+7dup', 'c.-89A>G', 'c.138G>A', 'c.186A>G', 'c.290-13C>A', 'c.290-5T>C', 'c.345G>A', 'c.553G>C', 'c.678C>T', 'c.761G>A', 'c.787G>A', 'c.834G>A', 'c.846C>T', 'c.849C>T', 'c.883C>T', 'c.909G>A', 'c.910G>A', 'c.933C>T', 'c.1008C>T', 'c.1011C>A', 'c.1090G>A', 'c.1134G>T', 'c.1254C>T', 'c.1377G>T', 'c.1438G>A', 'c.1572C>T', 'c.1576G>T', 'c.1803G>A', 'c.1911C>T', 'c.*70C>T', 'c.*112G>T', 'c.*113A>T', 'c.*124G>T', 'c.*171G>A', 'c.*263A>C', 'c.*289A>C', 'c.*361A>G', 'c.*380C>T', 'c.*409A>G', 'c.*538C>T', 'c.*570G>T', 'c.*660T>C', 'c.*926C>T', 'c.*1057C>G', 'c.*1069G>C', 'c.58T>C', 'c.1812C>T', 'c.509del', 'c.583_584del', 'c.1828_1831del', 'c.1784C>T', 'c.1723G>A', 'c.1802C>T', 'c.1802C>G', 'c.1492G>A'])
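
The same fields can be pulled out of each ClinVar name in a single pass with a regular expression, which avoids the repeated entries produced by the nested loop. This is a sketch: the pattern and the function name `parse_clinvar_name` are assumptions, written against the name formats visible in the table above and not against every format ClinVar may use.

```python
import re

# Transcript-based names: NM_xxx(GENE):c.change, optionally followed by (p.change)
NAME_PATTERN = re.compile(r'^NM_[\d.]+\([^)]+\):(c\.\S+)(?:\s+\((p\.[^)]+)\))?$')

def parse_clinvar_name(name):
    """Return (cDNA change, protein change or None), or None for other name formats."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Names copied from the ClinVar table shown above
names = [
    'NC_000008.10:g.42338721_42916885del578165',
    'NM_006749.5(SLC20A2):c.935-2A>G',
    'NM_001257180.2(SLC20A2):c.21del (p.Leu7fs)',
    'NM_006749.5(SLC20A2):c.1784C>T (p.Thr595Met)',
    'MYORG, ILE655THR',
]
parsed_names = [parse_clinvar_name(n) for n in names]
print(parsed_names)
```

Names that are not transcript-based, such as the genomic deletion and the `MYORG, ILE655THR` entry, come back as `None` and can be reviewed by hand.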

In [56]:
aminoacids = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N', 
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 
     'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}

In [41]:
def reduceVar(proteinChange):
    # Upper-case each protein change and swap three-letter residue codes
    # for one-letter codes, modifying the list in place
    for i, change in enumerate(proteinChange):
        var = str(change).upper()
        for a in aminoacids:
            var = var.replace(a, aminoacids[a])
        proteinChange[i] = var

In [201]:
reduceVar(proteinChange)
proteinChange

['P.L7FS',
 'P.W101FS',
 'P.H399P',
 'P.G63S',
 'P.G63D',
 'P.V507FS',
 'P.E459TER',
 'P.E575TER',
 'P.Q46TER',
 'P.Q46=',
 'P.L62=',
 'P.T115=',
 'P.V185L',
 'P.F226=',
 'P.R254Q',
 'P.V263I',
 'P.K278=',
 'P.D282=',
 'P.S283=',
 'P.L295=',
 'P.A303=',
 'P.G304S',
 'P.Y311=',
 'P.H336=',
 'P.T337=',
 'P.D364N',
 'P.R378=',
 'P.G418=',
 'P.E459D',
 'P.A480T',
 'P.G524=',
 'P.V526L',
 'P.S601=',
 'P.S637=',
 'P.L20=',
 'P.A604=',
 'P.I169_L170INSTER',
 'P.V195FS',
 'P.S610FS',
 'P.T595M',
 'P.E575K',
 'P.S601L',
 'P.S601W',
 'P.G498R']
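
The two sources also write cDNA changes differently: the ABraOM annotations use the form `c.G1438A`, while the ClinVar keys use `c.1438G>A`. If a match at the cDNA level is wanted in addition to the protein level, a conversion along these lines could be used. It is a sketch that handles single-nucleotide substitutions only; the function name `annovar_to_hgvs` is hypothetical, and the comparison is meaningful only when both sources number the same transcript.

```python
import re

def annovar_to_hgvs(cds):
    """Rewrite 'c.G1438A' as 'c.1438G>A'; return the input unchanged for other shapes."""
    match = re.fullmatch(r'c\.([ACGT])(\d+)([ACGT])', cds)
    if match is None:
        return cds
    ref, position, alt = match.groups()
    return f'c.{position}{ref}>{alt}'

# cDNA changes taken from the SLC20A2 rows of the ABraOM table
converted = [annovar_to_hgvs(c) for c in ['c.G1438A', 'c.G910A', 'c.G761A']]
print(converted)
```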

In [211]:
abraomPFBC = dict()
def vartodict(gene, variantlist):
    # Keep the ABraOM variants of one gene whose protein change also appears in variantlist
    geneVars = pfbcNSVars[pfbcNSVars['Gene'] == gene]
    for var in geneVars['Protein Change']:
        if var.upper() in variantlist:
            # Exact match within the gene; str.contains would also match substrings and other genes
            abraomPFBC[var] = geneVars[geneVars['Protein Change'] == var]

In [212]:
vartodict('SLC20A2', proteinChange)

In [213]:
abraomPFBC

{'p.A480T':       Gene Codon Change Protein Change  Frequency  Allele Number
 0  SLC20A2     c.G1438A        p.A480T    0.00821           1218,
 'p.G304S':       Gene Codon Change Protein Change  Frequency  Allele Number
 4  SLC20A2      c.G910A        p.G304S   0.050082           1218,
 'p.R254Q':       Gene Codon Change Protein Change  Frequency  Allele Number
 5  SLC20A2      c.G761A        p.R254Q   0.000821           1218}
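
The matching step can also be written as a join, which carries the ClinVar clinical significance along with each matched variant so it can be inspected later. The sketch below assumes two small hand-made tables. The significance labels are placeholders and not real ClinVar classifications, and the column name `key` is introduced only for the join.

```python
import pandas as pd

abraom = pd.DataFrame({
    'Gene': ['SLC20A2', 'SLC20A2', 'SLC20A2'],
    'Protein Change': ['p.A480T', 'p.G304S', 'p.I49V'],
    'Frequency': [0.00821, 0.050082, 0.000821],
})
# Placeholder labels: NOT real ClinVar classifications
clinvar = pd.DataFrame({
    'Gene': ['SLC20A2', 'SLC20A2'],
    'Protein Change': ['P.A480T', 'P.G304S'],
    'Clinical significance': ['label-1', 'label-2'],
})

# Join on gene plus upper-cased protein change, then discard the helper column
abraom['key'] = abraom['Protein Change'].str.upper()
clinvar = clinvar.rename(columns={'Protein Change': 'key'})
matched = abraom.merge(clinvar, on=['Gene', 'key'], how='inner').drop(columns='key')
print(matched)
```

Keeping the significance column makes it possible to separate entries reported as pathogenic from those reported as benign before any frequencies are summed.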

In [214]:
clinVariants = dict()
proteinChange = []
codonChange = []
clinicalSig = []

extractClinVar(clinPDGFB)

reduceVar(proteinChange)
proteinChange
vartodict('PDGFB', proteinChange)

In [215]:
clinVariants = dict()
proteinChange = []
codonChange = []
clinicalSig = []

extractClinVar(clinPDGFRB)

reduceVar(proteinChange)
proteinChange
vartodict('PDGFRB', proteinChange)

In [216]:
clinVariants = dict()
proteinChange = []
codonChange = []
clinicalSig = []

extractClinVar(clinXPR1)
reduceVar(proteinChange)
proteinChange
vartodict('XPR1', proteinChange)

In [217]:
clinVariants = dict()
proteinChange = []
codonChange = []
clinicalSig = []

extractClinVar(clinMYORG)
reduceVar(proteinChange)
proteinChange
vartodict('MYORG', proteinChange)

In [218]:
abraomPFBC

{'p.A480T':       Gene Codon Change Protein Change  Frequency  Allele Number
 0  SLC20A2     c.G1438A        p.A480T    0.00821           1218,
 'p.G304S':       Gene Codon Change Protein Change  Frequency  Allele Number
 4  SLC20A2      c.G910A        p.G304S   0.050082           1218,
 'p.R254Q':       Gene Codon Change Protein Change  Frequency  Allele Number
 5  SLC20A2      c.G761A        p.R254Q   0.000821           1218,
 'p.G1040V':      Gene Codon Change Protein Change  Frequency  Allele Number
 2  PDGFRB     c.G3119T       p.G1040V   0.004105           1218,
 'p.R502Q':      Gene Codon Change Protein Change  Frequency  Allele Number
 6  PDGFRB     c.G1505A        p.R502Q   0.000821           1218,
 'p.E485K':      Gene Codon Change Protein Change  Frequency  Allele Number
 8  PDGFRB     c.G1453A        p.E485K   0.027915           1218,
 'p.T464M':       Gene Codon Change Protein Change  Frequency  Allele Number
 11  PDGFRB     c.C1391T        p.T464M   0.025452           121

In [219]:
AbraomPFBC = pd.DataFrame()
for key in abraomPFBC:
    AbraomPFBC = pd.concat([AbraomPFBC, abraomPFBC[key]], join = 'outer')
#pfbcNSVars = pd.concat([slc20a2, xpr1], join = 'outer')
AbraomPFBC.index = range(len(AbraomPFBC['Gene']))

In [224]:
AbraomPFBC.to_csv('abraomPFBC.csv')

In [225]:
pfbcNSVars.to_csv('PFBC NS Variants.csv')

In [248]:
slc20a2PathoFreq = 0
pdgfrbPathoFreq = 0
for i,y in enumerate(AbraomPFBC['Gene']):
    if AbraomPFBC['Gene'][i] == 'SLC20A2':
        slc20a2PathoFreq += AbraomPFBC['Frequency'][i]
    else:
        pdgfrbPathoFreq += AbraomPFBC['Frequency'][i]

pfbcMinPrev = slc20a2PathoFreq + pdgfrbPathoFreq

print('The frequency of variants that cause Primary Familial Brain Calcification in the Brazilian population represented in ABraOM is ' +
     str(pfbcMinPrev*100) + '%.')
print('Of this total, ' + str(round(slc20a2PathoFreq*100,4)) + '% carry at least one heterozygous variant in the SLC20A2 gene, while ' + str(round(pdgfrbPathoFreq*100,4)) + '% carry disease-linked variants in the PDGFRB gene.')

The frequency of variants that cause Primary Familial Brain Calcification in the Brazilian population represented in ABraOM is 18.2266%.
Of this total, 5.9113% carry at least one heterozygous variant in the SLC20A2 gene, while 12.3153% carry disease-linked variants in the PDGFRB gene.
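
The per-gene sums can be computed with a `groupby`, which avoids an `else` branch that would silently add any other gene to the PDGFRB total. The sketch below uses only the rows visible in the `abraomPFBC` output, so its PDGFRB sum covers just those three rows. The `carrier_fraction` helper is an addition of this sketch: it converts an allele frequency into the fraction of people carrying at least one copy, assuming Hardy-Weinberg proportions, and applying it to a summed frequency treats all listed variants as one allele class.

```python
import pandas as pd

# Rows copied from the abraomPFBC output (the PDGFRB list there is truncated)
matched = pd.DataFrame({
    'Gene': ['SLC20A2', 'SLC20A2', 'SLC20A2', 'PDGFRB', 'PDGFRB', 'PDGFRB'],
    'Frequency': [0.00821, 0.050082, 0.000821, 0.004105, 0.000821, 0.027915],
})

# Summed allele frequency per gene
per_gene = matched.groupby('Gene')['Frequency'].sum()

def carrier_fraction(allele_frequency):
    """Fraction with at least one copy, assuming Hardy-Weinberg proportions."""
    return 1 - (1 - allele_frequency) ** 2

print(per_gene)
print(carrier_fraction(per_gene['SLC20A2']))
```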


In [243]:
slc20a2PathoFreq = 0
pdgfrbPathoFreq = 0
for i,y in enumerate(AbraomPFBC['Gene']):
    if AbraomPFBC['Gene'][i] == 'SLC20A2':
        slc20a2PathoFreq += AbraomPFBC['Frequency'][i]
    else:
        pdgfrbPathoFreq += AbraomPFBC['Frequency'][i]

In [245]:
print(slc20a2PathoFreq)
print(pdgfrbPathoFreq)

0.059113000000000006
0.12315300000000001


# Trying to extract genetic information from the rs annotations of MYORG

In [135]:
rsList = []
for rs in pfbcAbraom['KIAA1161']['avsnp147'].dropna():
    rsList.append(rs)

In [136]:
rsList

['rs7861760',
 'rs3892388',
 'rs1134455',
 'rs374451075',
 'rs149223881',
 'rs10758255',
 'rs558324774',
 'rs528652691',
 'rs398096380',
 'rs200098063',
 'rs755830209',
 'rs775768418',
 'rs199990841',
 'rs370944350',
 'rs115786753',
 'rs749897076',
 'rs369042834',
 'rs41311426',
 'rs767224436',
 'rs764057941',
 'rs7852399',
 'rs766107041',
 'rs115412086',
 'rs41312814',
 'rs202003652',
 'rs202011793',
 'rs12377',
 'rs780531156',
 'rs541063329',
 'rs74600562',
 'rs4879781',
 'rs4879782',
 'rs2297776',
 'rs202207332',
 'rs10972084',
 'rs191077117']

In [138]:
import myvariant
mv = myvariant.MyVariantInfo()

In [139]:
res = mv.querymany(rsList, scopes='dbsnp.rsid', fields='clinvar')

querying 1-36...done.
Finished.
12 input query terms found dup hits:
	[('rs149223881', 6), ('rs755830209', 2), ('rs775768418', 2), ('rs370944350', 2), ('rs41311426', 2), 
2 input query terms found no hit:
	['rs374451075', 'rs398096380']
Pass "returnall=True" to return complete lists of duplicate or missing query terms.


In [144]:
res

[{'query': 'rs7861760', '_id': 'chr9:g.34368674A>G', '_score': 16.08133},
 {'query': 'rs3892388', '_id': 'chr9:g.34368984G>T', '_score': 16.08159},
 {'query': 'rs1134455', '_id': 'chr9:g.34369062C>G', '_score': 16.081884},
 {'query': 'rs374451075', 'notfound': True},
 {'query': 'rs149223881',
  '_id': 'chr9:g.34370002_34370003TC[7]',
  '_score': 16.082241},
 {'query': 'rs149223881',
  '_id': 'chr9:g.34370002_34370003TC[12]',
  '_score': 16.08212},
 {'query': 'rs149223881',
  '_id': 'chr9:g.34370002_34370003TC[9]',
  '_score': 16.081781},
 {'query': 'rs149223881',
  '_id': 'chr9:g.34370002_34370003TC[8]',
  '_score': 16.08177},
 {'query': 'rs149223881',
  '_id': 'chr9:g.34370002_34370003TC[6]',
  '_score': 16.081665},
 {'query': 'rs149223881',
  '_id': 'chr9:g.34370002_34370003TC[11]',
  '_score': 16.081644},
 {'query': 'rs10758255', '_id': 'chr9:g.34370020T>A', '_score': 16.081594},
 {'query': 'rs558324774', '_id': 'chr9:g.34370542G>A', '_score': 16.081665},
 {'query': 'rs528652691', '

No clinically relevant variant was found in the list of rs identifiers provided.
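
This conclusion can be checked programmatically by keeping only the hits that carry a `clinvar` entry. The sketch below works on a plain list of dictionaries shaped like the output above and needs no network access. The function name `hits_with_clinvar` and the last, made-up hit are assumptions added for contrast.

```python
def hits_with_clinvar(results):
    """Keep only the query results that carry a 'clinvar' annotation."""
    return [hit for hit in results if not hit.get('notfound') and 'clinvar' in hit]

# First entries copied from the output above, plus one made-up hit for contrast
results = [
    {'query': 'rs7861760', '_id': 'chr9:g.34368674A>G', '_score': 16.08133},
    {'query': 'rs374451075', 'notfound': True},
    {'query': 'rs0', '_id': 'hypothetical', 'clinvar': {'note': 'made-up record'}},
]
relevant = hits_with_clinvar(results)
print([hit['query'] for hit in relevant])
```

Applied to the real `res` list, an empty result would agree with the statement above.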