# Genomic Data for Variant Pathogenicity
This notebook reads the VCF file containing ClinVar data and outputs a file with the information needed to run ANNOVAR and, eventually, to reach the table template provided in FH-EARLY for the genomic data.

### To download the ClinVar data:
Go to https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/ (we use clinvar_20260208.vcf.gz together with its index file, clinvar_20260208.vcf.gz.tbi)

In [65]:
# import packages
import pandas as pd
pd.set_option('display.max_columns', None)
from cyvcf2 import VCF # https://github.com/brentp/cyvcf2/tree/main

### Variable choices
1. Genome Interval (ClinVar): consulted https://www-ncbi-nlm-nih-gov.tudelft.idm.oclc.org/clinvar/?term=LDLR%5Bgene%5D to select the range between the first and the last LDLR variant in chromosome 19.
2. Variable description for VCF file: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/README_VCF.txt
3. Variables ANNOVAR (following MetaRNN paper - additional file):
    * chromosome = var.CHROM
    * position = var.POS
    * reference allele = var.REF
    * alternative allele = var.ALT
    * [missing] reference aa of the protein [gloria]
    * [missing] alternative aa of the protein [gloria]
    * label (TP or TN) (Clinical Significance by ClinVar) = var.INFO.get('CLNSIG','')
4. Additional variables used in FH-EARLY:
    * [uncertain] function = for now var.INFO.get('MC','') (eg. synonymous_variant)
    * gene = var.INFO.get('GENEINFO', '')
    * variant type = var.INFO.get('CLNVC','')
    * [uncertain] change = for now included in var.INFO.get('CLNHGVS','') (look for >)
    * OMIM = included in var.INFO.get('CLNDISDB','') (look for OMIM)
5. Additional variables needed to select samples
    * id = var.INFO.get('ALLELEID', '')
    * star review = var.INFO.get('CLNREVSTAT','')
    * rs# = var.INFO.get('RS', '')
6. Additional variables we need to get output from 1000 genomes, gnomAD exome, and dbSNP
    * TODO
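
Several of the INFO fields listed above pack more than one value into a single string. The helpers below are a minimal sketch of how those strings could be split, assuming the formats seen in the tables of this notebook (`GENEINFO` as `symbol:id|symbol:id`, `CLNDISDB` as comma-separated `DB:ID` pairs, `MC` as `SO:id|term`). The function names are hypothetical and not part of cyvcf2.

```python
import re

def gene_symbols(geneinfo: str) -> list[str]:
    """Split a GENEINFO value such as 'LDLR:3949|LDLR-AS1:115271120' into gene symbols."""
    return [entry.split(':')[0] for entry in geneinfo.split('|') if entry]

def omim_ids(clndisdb: str) -> list[str]:
    """Collect the OMIM identifiers in a CLNDISDB value, in order and without duplicates."""
    found = re.findall(r'OMIM:([A-Za-z0-9]+)', clndisdb)
    return list(dict.fromkeys(found))

def consequence_terms(mc: str) -> list[str]:
    """Return the Sequence Ontology terms in an MC value such as 'SO:0001587|nonsense'."""
    return [entry.split('|')[1] for entry in mc.split(',') if '|' in entry]

# Example values copied from the output tables of this notebook
print(gene_symbols('LDLR:3949|LDLR-AS1:115271120'))
print(omim_ids('MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890'))
print(consequence_terms('SO:0001619|non-coding_transcript_variant'))
```

All three helpers return an empty list for an empty string, which is the default used in the `var.INFO.get(..., '')` calls.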

In [66]:
vcf_og = VCF('clinvar_20260208.vcf.gz')

In [67]:
rows = []
for var in vcf_og('19:11087732-11133700'):
    # TODO: select necessary information
    rows.append({
        # identifiers
        'allele_id': var.INFO.get('ALLELEID', ''),
        'rs#': var.INFO.get('RS', ''),
        'review': var.INFO.get('CLNREVSTAT',''),
        'gene': var.INFO.get('GENEINFO', ''),
        'chrom': var.CHROM,
        # useful
        'pos': var.POS,  # 1-based
        # 'ref': var.REF, fetch it from other file
        # 'alt': ','.join(var.ALT), fetch it from other file
        # missing: 'ref_aa': fetch it from other file
        # missing: 'alt_aa': fetch it from other file
        'clinsig': var.INFO.get('CLNSIG', ''),
        'function': var.INFO.get('MC',''),
        'type': var.INFO.get('CLNVC',''),
        'change': var.INFO.get('CLNHGVS',''),
        'onim': var.INFO.get('CLNDISDB','')
    })

vcf_df = pd.DataFrame(rows)
vcf_df

[W::hts_idx_load3] The index file is older than the data file: clinvar_20260208.vcf.gz.tbi


Unnamed: 0,allele_id,rs#,review,gene,chrom,pos,clinsig,function,type,change,onim
0,682121,,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11087729,Pathogenic,,Deletion,NC_000019.10:g.11087732_11090710del,"Human_Phenotype_Ontology:HP:0003124,Human_Phen..."
1,424286,989307060.0,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089263,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089263C>G,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143..."
2,245300,17249134.0,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089281,Benign/Likely_benign,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089281G>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143..."
3,4180223,,no_assertion_criteria_provided,LDLR:3949|LDLR-AS1:115271120,19,11089283,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089283C>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
4,3752756,,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089309,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Microsatellite,NC_000019.10:g.11089310AG[1],"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
5,2967261,2515905935.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089319del,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
6,424287,1555800611.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Pathogenic,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089320_11089459del,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
7,425023,376713337.0,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089321,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089321G>C,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
8,514374,1555800612.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089322,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089322G>C,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
9,245301,879254359.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089324,Uncertain_significance,,Duplication,NC_000019.10:g.11089329dup,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."


* `allele_id` is unique
* `pos` is almost unique: a few variants share the same position
* amino acid (aa) information is missing
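
The uniqueness observations above can be checked directly with pandas. This is a small sketch on three rows copied from the table (allele IDs 2967261 and 424287 share position 11089318); on the real data the same calls would be applied to `vcf_df`.

```python
import pandas as pd

# Three rows copied from the output above
demo = pd.DataFrame({
    'allele_id': [2967261, 424287, 425023],
    'pos': [11089318, 11089318, 11089321],
})

print(demo['allele_id'].is_unique)   # one row per allele ID?
print(demo['pos'].is_unique)         # one row per position?

# keep=False flags every member of a duplicated group, not only the repeats
shared = demo[demo['pos'].duplicated(keep=False)]
print(shared)
```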

# Get AA change info from variant_summary
* Source: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz
* Goal: take the `Name` field and extract ref, pos, aa_ref, aa_pos
* Link the dataframe to the samples available in the VCF: join on AlleleID and select Assembly=='GRCh38'
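
Since only a handful of columns are needed, one option is to restrict the read with `usecols`. The sketch below uses a tiny in-memory stand-in for `variant_summary.txt.gz`; the GRCh38 row is copied from this notebook, while the GRCh37 row is illustrative and only there to show the assembly filter.

```python
import io
import pandas as pd

# Tab-separated stand-in for variant_summary.txt.gz (second row copied from the notebook,
# first row is an illustrative GRCh37 counterpart)
sample = (
    "#AlleleID\tName\tGeneID\tAssembly\tChromosome\tClinSigSimple\n"
    "18722\tNM_000527.5(LDLR):c.97C>T (p.Gln33Ter)\t3949\tGRCh37\t19\t1\n"
    "18722\tNM_000527.5(LDLR):c.97C>T (p.Gln33Ter)\t3949\tGRCh38\t19\t1\n"
)

wanted = ['#AlleleID', 'Name', 'GeneID', 'Assembly', 'ClinSigSimple']
summary = pd.read_csv(io.StringIO(sample), sep='\t', usecols=wanted)

# LDLR (GeneID 3949) on GRCh38 only
ldlr = summary[(summary['GeneID'] == 3949) & (summary['Assembly'] == 'GRCh38')]
print(ldlr.shape)
```

With the real file, `io.StringIO(sample)` would be replaced by the path to `variant_summary.txt.gz`.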

In [68]:
summary_og = pd.read_csv('variant_summary.txt.gz', sep='\t')



In [69]:
summary_df = summary_og.loc[summary_og['GeneID']==3949] # Keep rows for the LDLR gene (GeneID 3949)
summary_df = summary_df.loc[summary_df['Assembly']=='GRCh38'] # Keep GRCh38 coordinates only
print("Shape: ", summary_df.shape)
summary_df.head()

Shape:  (4402, 43)


Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,ReferenceAllele,AlternateAllele,Cytogenetic,ReviewStatus,NumberSubmitters,Guidelines,TestedInGTR,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,SomaticClinicalImpact,SomaticClinicalImpactLastEvaluated,ReviewStatusClinicalImpact,Oncogenicity,OncogenicityLastEvaluated,ReviewStatusOncogenicity,SCVsForAggregateGermlineClassification,SCVsForAggregateSomaticClinicalImpact,SCVsForAggregateOncogenicityClassification
6764,18722,single nucleotide variant,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),3949,LDLR,HGNC:6547,Pathogenic,1,"Mar 25, 2022",121908024,-,RCV000003868|RCV000786350|RCV001034691|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh38,NC_000019.10,19,11100252,11100252,na,na,19p13.2,reviewed by expert panel,20,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023802,LDLR-LOVD, British Heart Foun...",3,3683,11100252,C,T,-,-,-,-,-,-,SCV002506409,-,-
6766,18724,single nucleotide variant,NM_000527.5(LDLR):c.259T>G (p.Trp87Gly),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 07, 2021",121908025,-,RCV000003870|RCV000622852|RCV000776466|RCV0008...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;inherited;unknown,germline,GRCh38,NC_000019.10,19,11102732,11102732,na,na,19p13.2,reviewed by expert panel,30,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023683,LDLR-LOVD, British Heart Foun...",3,3685,11102732,T,G,-,-,-,-,-,-,SCV001960956,-,-
6768,18725,single nucleotide variant,NM_000527.5(LDLR):c.530C>T (p.Ser177Leu),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 03, 2022",121908026,-,RCV000003871|RCV000161958|RCV000588687|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;inherited;not applicable;unknown,germline,GRCh38,NC_000019.10,19,11105436,11105436,na,na,19p13.2,reviewed by expert panel,28,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023715,LDLR-LOVD, British Heart Foun...",3,3686,11105436,C,T,-,-,-,-,-,-,SCV002568105,-,-
6770,18727,single nucleotide variant,NM_000527.5(LDLR):c.1694G>T (p.Gly565Val),3949,LDLR,HGNC:6547,Pathogenic/Likely pathogenic,1,"Jul 08, 2024",28942082,-,RCV000003874|RCV000791454|RCV001195593|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|Familial hyp...",germline;not applicable,germline,GRCh38,NC_000019.10,19,11116201,11116201,na,na,19p13.2,"criteria provided, multiple submitters, no con...",12,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023553,LDLR-LOVD, British Heart Foun...",3,3688,11116201,G,T,-,-,-,-,-,-,SCV000295584|SCV000503387|SCV000583865|SCV0005...,-,-
6772,18728,single nucleotide variant,NM_000527.5(LDLR):c.2000G>A (p.Cys667Tyr),3949,LDLR,HGNC:6547,Likely pathogenic,1,"Jun 18, 2021",28942083,-,RCV000030131|RCV000313287|RCV000775084|RCV0024...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh38,NC_000019.10,19,11120382,11120382,na,na,19p13.2,reviewed by expert panel,21,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"LDLR-LOVD, British Heart Foundation:LDLR_00027...",3,3689,11120382,G,A,-,-,-,-,-,-,SCV001960936,-,-


In [70]:
# Link variant_summary df to vcf_file df
filtered_df = summary_df.loc[summary_df['#AlleleID'].isin(vcf_df['allele_id'])]
print("Shape: ", filtered_df.shape)
filtered_df.head()

Shape:  (4321, 43)


Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,ReferenceAllele,AlternateAllele,Cytogenetic,ReviewStatus,NumberSubmitters,Guidelines,TestedInGTR,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,SomaticClinicalImpact,SomaticClinicalImpactLastEvaluated,ReviewStatusClinicalImpact,Oncogenicity,OncogenicityLastEvaluated,ReviewStatusOncogenicity,SCVsForAggregateGermlineClassification,SCVsForAggregateSomaticClinicalImpact,SCVsForAggregateOncogenicityClassification
6764,18722,single nucleotide variant,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),3949,LDLR,HGNC:6547,Pathogenic,1,"Mar 25, 2022",121908024,-,RCV000003868|RCV000786350|RCV001034691|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh38,NC_000019.10,19,11100252,11100252,na,na,19p13.2,reviewed by expert panel,20,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023802,LDLR-LOVD, British Heart Foun...",3,3683,11100252,C,T,-,-,-,-,-,-,SCV002506409,-,-
6766,18724,single nucleotide variant,NM_000527.5(LDLR):c.259T>G (p.Trp87Gly),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 07, 2021",121908025,-,RCV000003870|RCV000622852|RCV000776466|RCV0008...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;inherited;unknown,germline,GRCh38,NC_000019.10,19,11102732,11102732,na,na,19p13.2,reviewed by expert panel,30,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023683,LDLR-LOVD, British Heart Foun...",3,3685,11102732,T,G,-,-,-,-,-,-,SCV001960956,-,-
6768,18725,single nucleotide variant,NM_000527.5(LDLR):c.530C>T (p.Ser177Leu),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 03, 2022",121908026,-,RCV000003871|RCV000161958|RCV000588687|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;inherited;not applicable;unknown,germline,GRCh38,NC_000019.10,19,11105436,11105436,na,na,19p13.2,reviewed by expert panel,28,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023715,LDLR-LOVD, British Heart Foun...",3,3686,11105436,C,T,-,-,-,-,-,-,SCV002568105,-,-
6770,18727,single nucleotide variant,NM_000527.5(LDLR):c.1694G>T (p.Gly565Val),3949,LDLR,HGNC:6547,Pathogenic/Likely pathogenic,1,"Jul 08, 2024",28942082,-,RCV000003874|RCV000791454|RCV001195593|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|Familial hyp...",germline;not applicable,germline,GRCh38,NC_000019.10,19,11116201,11116201,na,na,19p13.2,"criteria provided, multiple submitters, no con...",12,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023553,LDLR-LOVD, British Heart Foun...",3,3688,11116201,G,T,-,-,-,-,-,-,SCV000295584|SCV000503387|SCV000583865|SCV0005...,-,-
6772,18728,single nucleotide variant,NM_000527.5(LDLR):c.2000G>A (p.Cys667Tyr),3949,LDLR,HGNC:6547,Likely pathogenic,1,"Jun 18, 2021",28942083,-,RCV000030131|RCV000313287|RCV000775084|RCV0024...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh38,NC_000019.10,19,11120382,11120382,na,na,19p13.2,reviewed by expert panel,21,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"LDLR-LOVD, British Heart Foundation:LDLR_00027...",3,3689,11120382,G,A,-,-,-,-,-,-,SCV001960936,-,-


In [71]:
filtered_df.loc[filtered_df['#AlleleID']==18722]

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,ReferenceAllele,AlternateAllele,Cytogenetic,ReviewStatus,NumberSubmitters,Guidelines,TestedInGTR,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,SomaticClinicalImpact,SomaticClinicalImpactLastEvaluated,ReviewStatusClinicalImpact,Oncogenicity,OncogenicityLastEvaluated,ReviewStatusOncogenicity,SCVsForAggregateGermlineClassification,SCVsForAggregateSomaticClinicalImpact,SCVsForAggregateOncogenicityClassification
6764,18722,single nucleotide variant,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),3949,LDLR,HGNC:6547,Pathogenic,1,"Mar 25, 2022",121908024,-,RCV000003868|RCV000786350|RCV001034691|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh38,NC_000019.10,19,11100252,11100252,na,na,19p13.2,reviewed by expert panel,20,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023802,LDLR-LOVD, British Heart Foun...",3,3683,11100252,C,T,-,-,-,-,-,-,SCV002506409,-,-


In [72]:
vcf_df.loc[vcf_df['allele_id']==18722]

Unnamed: 0,allele_id,rs#,review,gene,chrom,pos,clinsig,function,type,change,onim
280,18722,121908024,reviewed_by_expert_panel,LDLR:3949,19,11100252,Pathogenic,SO:0001587|nonsense,single_nucleotide_variant,NC_000019.10:g.11100252C>T,"MedGen:CN230736|MONDO:MONDO:0005439,MedGen:C00..."


In [73]:
cols_to_use = ['#AlleleID', 'RS# (dbSNP)', 'Stop', 'Name', 'ReferenceAlleleVCF', 'AlternateAlleleVCF', 'ClinSigSimple']
filtered_df = filtered_df[cols_to_use]
print("Shape: ", filtered_df.shape)
filtered_df.head()

Shape:  (4321, 7)


Unnamed: 0,#AlleleID,RS# (dbSNP),Stop,Name,ReferenceAlleleVCF,AlternateAlleleVCF,ClinSigSimple
6764,18722,121908024,11100252,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),C,T,1
6766,18724,121908025,11102732,NM_000527.5(LDLR):c.259T>G (p.Trp87Gly),T,G,1
6768,18725,121908026,11105436,NM_000527.5(LDLR):c.530C>T (p.Ser177Leu),C,T,1
6770,18727,28942082,11116201,NM_000527.5(LDLR):c.1694G>T (p.Gly565Val),G,T,1
6772,18728,28942083,11120382,NM_000527.5(LDLR):c.2000G>A (p.Cys667Tyr),G,A,1


# Combine VCF and Variant_Summary dataframes
1. Use the labels from variant_summary (`ClinSigSimple`): far fewer unknown samples (1 = Pathogenic vs 0 = Benign/Uncertain)
2. Remove duplication samples (too complicated to encode)
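
Before relying on `ClinSigSimple` as the label, it may be worth cross-tabulating it against the VCF `CLNSIG` value. The sketch below uses four rows copied from the merged table in this notebook; allele 424286 is `Uncertain_significance` in the VCF yet has `ClinSigSimple` equal to 1, so the two labels do not always agree. The `pathogenic_like` rule is an assumption made for illustration.

```python
import pandas as pd

# Rows copied from the merged output of this notebook
demo = pd.DataFrame({
    'allele_id': [424286, 245300, 2967261, 424287],
    'clinsig': ['Uncertain_significance', 'Benign/Likely_benign',
                'Uncertain_significance', 'Pathogenic'],
    'ClinSigSimple': [1, 0, 0, 1],
})

# Rows: VCF classification, columns: simplified label
agreement = pd.crosstab(demo['clinsig'], demo['ClinSigSimple'])
print(agreement)

# Assumed rule: a CLNSIG value starting with (Likely_)Pathogenic counts as pathogenic
pathogenic_like = demo['clinsig'].str.match(r'(?:Likely_)?[Pp]athogenic')
mismatch = demo[(demo['ClinSigSimple'] == 1) & ~pathogenic_like]
print(mismatch)
```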


In [74]:
vcf_df_filtered = vcf_df.loc[vcf_df['allele_id'].isin(filtered_df['#AlleleID'])]
print(vcf_df_filtered.shape)
vcf_df_filtered.head()

(4321, 11)


Unnamed: 0,allele_id,rs#,review,gene,chrom,pos,clinsig,function,type,change,onim
1,424286,989307060.0,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089263,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089263C>G,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143..."
2,245300,17249134.0,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089281,Benign/Likely_benign,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089281G>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143..."
3,4180223,,no_assertion_criteria_provided,LDLR:3949|LDLR-AS1:115271120,19,11089283,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089283C>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
4,3752756,,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089309,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Microsatellite,NC_000019.10:g.11089310AG[1],"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
5,2967261,2515905935.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089319del,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"


In [75]:
merged_df = vcf_df_filtered.merge(
    filtered_df,
    left_on='allele_id',
    right_on='#AlleleID',
    how='left'
)
print(merged_df.shape)
merged_df.head()

(4321, 18)


Unnamed: 0,allele_id,rs#,review,gene,chrom,pos,clinsig,function,type,change,onim,#AlleleID,RS# (dbSNP),Stop,Name,ReferenceAlleleVCF,AlternateAlleleVCF,ClinSigSimple
0,424286,989307060.0,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089263,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089263C>G,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",424286,989307060,11089263,NM_000527.5(LDLR):c.-286C>G,C,G,1
1,245300,17249134.0,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089281,Benign/Likely_benign,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089281G>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",245300,17249134,11089281,NM_000527.5(LDLR):c.-268G>T,G,T,0
2,4180223,,no_assertion_criteria_provided,LDLR:3949|LDLR-AS1:115271120,19,11089283,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089283C>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",4180223,-1,11089283,NM_000527.4(LDLR):c.-266C>T,C,T,0
3,3752756,,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089309,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Microsatellite,NC_000019.10:g.11089310AG[1],"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",3752756,-1,11089311,NC_000019.10:g.11089310AG[1],CAG,C,0
4,2967261,2515905935.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089319del,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",2967261,2515905935,11089319,NC_000019.10:g.11089319del,AC,A,0


In [76]:
# rename useful columns
merged_df = merged_df.rename(columns={
    'chrom': 'chr',
    'Stop': 'end',
    'ReferenceAlleleVCF': 'ref',
    'AlternateAlleleVCF': 'alt',
    'onim': 'ONIM'
})

In [77]:
# filter on submissions
merged_df['review'].value_counts()

review
criteria_provided,_single_submitter                     1872
criteria_provided,_multiple_submitters,_no_conflicts    1432
reviewed_by_expert_panel                                 534
criteria_provided,_conflicting_classifications           292
no_assertion_criteria_provided                           155
                                                          25
no_classification_provided                                 7
no_classification_for_the_single_variant                   4
Name: count, dtype: int64
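
The review statuses can be mapped to ClinVar's star ratings to make the selection explicit. The mapping below is a sketch based on the ClinVar review-status page and should be verified against it. The counts are copied from the output above. Note that `criteria_provided,_conflicting_classifications` also carries one star, so "at least one star" and "at least one star without conflicts" select different numbers of samples.

```python
import pandas as pd

# Assumed star ratings (check against the ClinVar review_status documentation)
STARS = {
    'practice_guideline': 4,
    'reviewed_by_expert_panel': 3,
    'criteria_provided,_multiple_submitters,_no_conflicts': 2,
    'criteria_provided,_single_submitter': 1,
    'criteria_provided,_conflicting_classifications': 1,
    'no_assertion_criteria_provided': 0,
    'no_classification_provided': 0,
    'no_classification_for_the_single_variant': 0,
}

# Counts copied from merged_df['review'].value_counts() above
review_counts = {
    'criteria_provided,_single_submitter': 1872,
    'criteria_provided,_multiple_submitters,_no_conflicts': 1432,
    'reviewed_by_expert_panel': 534,
    'criteria_provided,_conflicting_classifications': 292,
    'no_assertion_criteria_provided': 155,
    '': 25,
    'no_classification_provided': 7,
    'no_classification_for_the_single_variant': 4,
}

table = pd.DataFrame({'review': list(review_counts), 'n': list(review_counts.values())})
table['stars'] = table['review'].map(STARS).fillna(0).astype(int)  # unknown or empty status: 0 stars

has_star = table['stars'] >= 1
conflicting = table['review'].str.contains('conflicting')
one_star_or_more = int(table.loc[has_star, 'n'].sum())
without_conflicts = int(table.loc[has_star & ~conflicting, 'n'].sum())
print(one_star_or_more, without_conflicts)  # 4130 3838
```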

In [78]:
# define criteria to keep a sample: at least one star and no conflicting classifications (see https://www-ncbi-nlm-nih-gov.tudelft.idm.oclc.org/clinvar/docs/review_status/)
to_keep = ['criteria_provided,_single_submitter', 'criteria_provided,_multiple_submitters,_no_conflicts', 'reviewed_by_expert_panel']
merged_df = merged_df.loc[merged_df['review'].isin(to_keep)]
print(merged_df.shape)
merged_df.head()

(3838, 18)


Unnamed: 0,allele_id,rs#,review,gene,chr,pos,clinsig,function,type,change,ONIM,#AlleleID,RS# (dbSNP),end,Name,ref,alt,ClinSigSimple
0,424286,989307060.0,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089263,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089263C>G,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",424286,989307060,11089263,NM_000527.5(LDLR):c.-286C>G,C,G,1
1,245300,17249134.0,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089281,Benign/Likely_benign,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089281G>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",245300,17249134,11089281,NM_000527.5(LDLR):c.-268G>T,G,T,0
3,3752756,,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089309,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Microsatellite,NC_000019.10:g.11089310AG[1],"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",3752756,-1,11089311,NC_000019.10:g.11089310AG[1],CAG,C,0
4,2967261,2515905935.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089319del,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",2967261,2515905935,11089319,NC_000019.10:g.11089319del,AC,A,0
5,424287,1555800611.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Pathogenic,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089320_11089459del,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...",424287,1555800611,11089458,NM_000527.4(LDLR):c.-229_-90del,ACGGGTTAAAAAGCCGATGTCACATCGGCCGTTCGAAACTCCTCCT...,A,1


In [79]:
# filter on known labels
merged_df = merged_df.loc[~(merged_df['ClinSigSimple']==-1)]
print(merged_df.shape)
merged_df.head()

(3838, 18)


Unnamed: 0,allele_id,rs#,review,gene,chr,pos,clinsig,function,type,change,ONIM,#AlleleID,RS# (dbSNP),end,Name,ref,alt,ClinSigSimple
0,424286,989307060.0,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089263,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089263C>G,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",424286,989307060,11089263,NM_000527.5(LDLR):c.-286C>G,C,G,1
1,245300,17249134.0,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089281,Benign/Likely_benign,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089281G>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",245300,17249134,11089281,NM_000527.5(LDLR):c.-268G>T,G,T,0
3,3752756,,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089309,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Microsatellite,NC_000019.10:g.11089310AG[1],"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",3752756,-1,11089311,NC_000019.10:g.11089310AG[1],CAG,C,0
4,2967261,2515905935.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089319del,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",2967261,2515905935,11089319,NC_000019.10:g.11089319del,AC,A,0
5,424287,1555800611.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Pathogenic,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089320_11089459del,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...",424287,1555800611,11089458,NM_000527.4(LDLR):c.-229_-90del,ACGGGTTAAAAAGCCGATGTCACATCGGCCGTTCGAAACTCCTCCT...,A,1


In [80]:
# get p. info
merged_df['protein_info'] = (
    merged_df['Name'].astype(str)
    .str.extract(r'p\.(.+)', expand=False)
    .str.rstrip(')')
)
merged_df[['aa_ref', 'aa_change', 'aa_alt']] = (
    merged_df['protein_info'].str.extract(r'([A-Z][a-z]{2})(\d+)(.*)')
)
merged_df.loc[~merged_df['protein_info'].isna()].head()

Unnamed: 0,allele_id,rs#,review,gene,chr,pos,clinsig,function,type,change,ONIM,#AlleleID,RS# (dbSNP),end,Name,ref,alt,ClinSigSimple,protein_info,aa_ref,aa_change,aa_alt
139,1866537,2515907041,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089548,Pathogenic,"SO:0001582|initiator_codon_variant,SO:0001589|...",Deletion,NC_000019.10:g.11089549_11089615del,MedGen:C3661900,1866537,2515907041,11089615,NM_000527.5(LDLR):c.1_67del (p.Met1fs),CATGGGGCCCTGGGGCTGGAAATTGCGCTGGACCGTCGCCTTGCTC...,C,1,Met1fs,Met,1,fs
140,245332,879254382,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089549,Pathogenic,"SO:0001582|initiator_codon_variant,SO:0001583|...",single_nucleotide_variant,NC_000019.10:g.11089549A>C,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...",245332,879254382,11089549,NM_000527.5(LDLR):c.1A>C (p.Met1Leu),A,C,1,Met1Leu,Met,1,Leu
141,245333,879254382,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089549,Likely_pathogenic,"SO:0001582|initiator_codon_variant,SO:0001583|...",single_nucleotide_variant,NC_000019.10:g.11089549A>G,"MedGen:CN230736|MONDO:MONDO:0005439,MedGen:C00...",245333,879254382,11089549,NM_000527.5(LDLR):c.1A>G (p.Met1Val),A,G,1,Met1Val,Met,1,Val
142,245334,879254382,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089549,Pathogenic,"SO:0001582|initiator_codon_variant,SO:0001583|...",single_nucleotide_variant,NC_000019.10:g.11089549A>T,"MedGen:CN230736|MONDO:MONDO:0005439,MedGen:C00...",245334,879254382,11089549,NM_000527.5(LDLR):c.1A>T (p.Met1Leu),A,T,1,Met1Leu,Met,1,Leu
143,434816,1555800701,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089550,Likely_pathogenic,"SO:0001582|initiator_codon_variant,SO:0001583|...",single_nucleotide_variant,NC_000019.10:g.11089550T>C,"MedGen:C4229399|MONDO:MONDO:0007750,MedGen:C07...",434816,1555800701,11089550,NM_000527.5(LDLR):c.2T>C (p.Met1Thr),T,C,1,Met1Thr,Met,1,Thr


In [81]:
pd.set_option('display.max_rows', None)
merged_df['aa_alt'].value_counts()

aa_alt
fs                                                                                                                                             560
=                                                                                                                                              547
Ter                                                                                                                                            206
Ser                                                                                                                                            155
Arg                                                                                                                                            153
Gly                                                                                                                                            116
Tyr                                                                                                            

### aa changes to remove (focusing on aa_alt)
Essentially, every change that involves more than one amino acid (aa).
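
An alternative to listing the values to drop is to state positively what a single-aa change looks like. The pattern below is a sketch, not the method used in this notebook: it accepts one three-letter residue, a position, and then another three-letter code (including `Ter`), `=`, or `fs`. The multi-residue example `Gly226_Asp227dup` is illustrative. Rows without protein information are kept, as in the notebook.

```python
import pandas as pd

# One residue, one position, then a residue/Ter, '=' (synonymous) or 'fs' (frameshift)
SINGLE_AA = r'[A-Z][a-z]{2}\d+(?:[A-Z][a-z]{2}|=|fs)'

# First three values copied from this notebook; the fourth is an illustrative multi-residue change;
# None stands for a non-coding variant with no p. annotation
protein_info = pd.Series(['Gln33Ter', 'Met1fs', 'Ser177Leu', 'Gly226_Asp227dup', None])

is_single = protein_info.str.fullmatch(SINGLE_AA, na=False)
keep = is_single | protein_info.isna()
print(keep.tolist())  # [True, True, True, False, True]
```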

In [82]:
aa_alt_counts = merged_df['aa_alt'].value_counts()
to_drop = aa_alt_counts[aa_alt_counts == 1].index.tolist()  # aa_alt values that occur only once
to_drop = to_drop + ['_Asp227dup', '_Ser213del', '_Asp227del', '_Cys231del']
print("Values to remove: ", len(to_drop))

Values to remove:  127


In [83]:
# remove samples that have more than one aa change
merged_df = merged_df.loc[~merged_df['aa_alt'].isin(to_drop)]
print(merged_df.shape)
merged_df.head()

(3706, 22)


Unnamed: 0,allele_id,rs#,review,gene,chr,pos,clinsig,function,type,change,ONIM,#AlleleID,RS# (dbSNP),end,Name,ref,alt,ClinSigSimple,protein_info,aa_ref,aa_change,aa_alt
0,424286,989307060.0,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089263,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089263C>G,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",424286,989307060,11089263,NM_000527.5(LDLR):c.-286C>G,C,G,1,,,,
1,245300,17249134.0,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089281,Benign/Likely_benign,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089281G>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",245300,17249134,11089281,NM_000527.5(LDLR):c.-268G>T,G,T,0,,,,
3,3752756,,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089309,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Microsatellite,NC_000019.10:g.11089310AG[1],"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",3752756,-1,11089311,NC_000019.10:g.11089310AG[1],CAG,C,0,,,,
4,2967261,2515905935.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089319del,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",2967261,2515905935,11089319,NC_000019.10:g.11089319del,AC,A,0,,,,
5,424287,1555800611.0,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089318,Pathogenic,SO:0001619|non-coding_transcript_variant,Deletion,NC_000019.10:g.11089320_11089459del,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...",424287,1555800611,11089458,NM_000527.4(LDLR):c.-229_-90del,ACGGGTTAAAAAGCCGATGTCACATCGGCCGTTCGAAACTCCTCCT...,A,1,,,,


In [84]:
# TODO: Gene LDLR encoding only; ONIM only

In [85]:
useless_cols = ['review', 'clinsig', '#AlleleID', 'RS# (dbSNP)', 'Name', 'protein_info']
on_hold_cols = ['function', 'change']
annovar_cols = ['chr', 'pos', 'end', 'ref', 'alt', 'aa_ref', 'aa_alt', 'ClinSigSimple']

# Prepare Dataset for ANNOVAR

In [86]:
annovar_df = merged_df[annovar_cols]
annovar_df = annovar_df.rename(columns={'pos': 'start'})  # columns= is required; a bare mapping targets index labels
annovar_df = annovar_df.fillna('')
print(annovar_df.shape)
annovar_df.head()

(3706, 8)


Unnamed: 0,chr,start,end,ref,alt,aa_ref,aa_alt,ClinSigSimple
0,19,11089263,11089263,C,G,,,1
1,19,11089281,11089281,G,T,,,0
3,19,11089309,11089311,CAG,C,,,0
4,19,11089318,11089319,AC,A,,,0
5,19,11089318,11089458,ACGGGTTAAAAAGCCGATGTCACATCGGCCGTTCGAAACTCCTCCT...,A,,,1


In [87]:
annovar_df.to_csv("mafalda_files/annovar_file.avinput", sep="\t", header=False, index=False)
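
ANNOVAR's documented input examples write insertions and deletions with `-` rather than with the VCF-style anchor base (for example `AC` to `A`). The function below is a sketch of that conversion, assuming the convention used by `convert2annovar.pl` for VCF input. It should be checked against the ANNOVAR documentation before use, and the function name is hypothetical. Whether the conversion is needed here depends on how ANNOVAR is run on the file.

```python
def vcf_to_avinput(pos: int, ref: str, alt: str) -> tuple[int, int, str, str]:
    """Convert VCF-style (pos, ref, alt) to ANNOVAR-style (start, end, ref, alt).

    Handles only the simple cases where one allele is a prefix of the other;
    anything else is returned as a substitution spanning the reference allele.
    """
    if len(ref) > len(alt) and ref.startswith(alt):   # deletion: drop the shared prefix
        return pos + len(alt), pos + len(ref) - 1, ref[len(alt):], '-'
    if len(alt) > len(ref) and alt.startswith(ref):   # insertion: anchored at the last shared base
        anchor = pos + len(ref) - 1
        return anchor, anchor, '-', alt[len(ref):]
    return pos, pos + len(ref) - 1, ref, alt          # SNV or block substitution

# Alleles copied from the tables of this notebook
print(vcf_to_avinput(11089263, 'C', 'G'))
print(vcf_to_avinput(11089318, 'AC', 'A'))
print(vcf_to_avinput(11102786, 'C', 'CG'))
```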

In [88]:
full_dataset = merged_df.drop(columns=useless_cols).drop(columns=on_hold_cols)
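# Note: rename() needs columns= to rename a column; 'pos' keeps its name because later cells query it.
# full_dataset = full_dataset.rename(columns={'pos': 'start'})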
# full_dataset = full_dataset.fillna('')
print(full_dataset.shape)
full_dataset.head()

(3706, 14)


Unnamed: 0,allele_id,rs#,gene,chr,pos,type,ONIM,end,ref,alt,ClinSigSimple,aa_ref,aa_change,aa_alt
0,424286,989307060.0,LDLR:3949|LDLR-AS1:115271120,19,11089263,single_nucleotide_variant,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",11089263,C,G,1,,,
1,245300,17249134.0,LDLR:3949|LDLR-AS1:115271120,19,11089281,single_nucleotide_variant,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143...",11089281,G,T,0,,,
3,3752756,,LDLR:3949|LDLR-AS1:115271120,19,11089309,Microsatellite,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",11089311,CAG,C,0,,,
4,2967261,2515905935.0,LDLR:3949|LDLR-AS1:115271120,19,11089318,Deletion,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",11089319,AC,A,0,,,
5,424287,1555800611.0,LDLR:3949|LDLR-AS1:115271120,19,11089318,Deletion,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...",11089458,ACGGGTTAAAAAGCCGATGTCACATCGGCCGTTCGAAACTCCTCCT...,A,1,,,


In [90]:
full_dataset.to_csv("mafalda_files/full_dataset.csv", index=False)

In [95]:
df = pd.read_csv("mafalda_files/full_dataset.csv")
df.shape

(3706, 14)

In [94]:
df.loc[df['pos']==11102786]

Unnamed: 0,allele_id,rs#,gene,chr,pos,type,ONIM,end,ref,alt,ClinSigSimple,aa_ref,aa_change,aa_alt
523,245477,879254500.0,LDLR:3949,19,11102786,Duplication,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...",11102787,C,CG,1,,,
524,998358,2077234000.0,LDLR:3949,19,11102786,Duplication,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",11102787,C,CGT,0,,,
525,3499360,,LDLR:3949,19,11102786,Indel,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...",11102787,CG,TC,1,,,
526,1504833,2147218000.0,LDLR:3949,19,11102786,Deletion,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",11102790,CGTAA,C,0,,,
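
Because several variants can share a position, `pos` alone cannot serve as a key. One possible approach, sketched below on the four rows shown above, is a composite key built from chromosome, position, reference allele and alternative allele. The column name `variant_key` is an assumption made for illustration.

```python
import pandas as pd

# The four variants at position 11102786, copied from the output above
demo = pd.DataFrame({
    'chr': [19, 19, 19, 19],
    'pos': [11102786] * 4,
    'ref': ['C', 'C', 'CG', 'CGTAA'],
    'alt': ['CG', 'CGT', 'TC', 'C'],
})

# chr:pos:ref>alt distinguishes variants that start at the same position
demo['variant_key'] = (
    demo['chr'].astype(str) + ':' + demo['pos'].astype(str)
    + ':' + demo['ref'] + '>' + demo['alt']
)
print(demo['variant_key'].tolist())
```

`allele_id` is already unique in this dataset, so the composite key is mainly useful when joining against sources that do not carry ClinVar allele IDs.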
