# Genomic Data for Variant Pathogenicity
This notebook reads the ClinVar VCF file and writes a VCF file containing the information needed to run ANNOVAR and, eventually, to fill in the table template provided in FH-EARLY for the genomic data.

### To download the ClinVar data:
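Go to https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/ and download both the VCF and its index. This notebook uses clinvar_20260208.vcf.gz together with clinvar_20260208.vcf.gz.tbi; the index is needed for the region query further down.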
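The download can also be scripted. The sketch below builds the two URLs from a release date and fetches them with the standard library. The helper names `clinvar_urls` and `download_clinvar` are made up for illustration, and it assumes the dated release is still in the top level of that directory.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/"


def clinvar_urls(release):
    """Return the URLs of the VCF and its tabix index for a release date (YYYYMMDD)."""
    name = f"clinvar_{release}.vcf.gz"
    return [BASE_URL + name, BASE_URL + name + ".tbi"]


def download_clinvar(release, dest="."):
    """Download both files into `dest`, skipping files that already exist."""
    paths = []
    for url in clinvar_urls(release):
        target = Path(dest) / url.rsplit("/", 1)[1]
        if not target.exists():
            urlretrieve(url, str(target))  # network access required
        paths.append(target)
    return paths


# Example call (not executed here because it needs network access):
# download_clinvar("20260208")
```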

In [78]:
# import packages
import pandas as pd
pd.set_option('display.max_columns', None)
from cyvcf2 import VCF # https://github.com/brentp/cyvcf2/tree/main

### sources for choices
1. Genome interval (ClinVar): we consulted https://www-ncbi-nlm-nih-gov.tudelft.idm.oclc.org/clinvar/?term=LDLR%5Bgene%5D and selected the range between the first and the last LDLR variant on chromosome 19.
2. Variables ANNOVAR (following MetaRNN paper - additional file):
    * chromosome = var.CHROM
    * position = var.POS
    * reference allele = var.REF
    * alternative allele = var.ALT
    * [missing] reference aa of the protein [gloria]
    * [missing] alternative aa of the protein [gloria]
    * label (TP or TN) (Clinical Significance by ClinVar) = var.INFO.get('CLNSIG','')
3. Additional variables used in FH-EARLY:
    * [uncertain] function = for now var.INFO.get('MC','') (eg. synonymous_variant)
    * gene = var.INFO.get('GENEINFO', '')
    * variant type = var.INFO.get('CLNVC','')
    * [uncertain] change = for now included in var.INFO.get('CLNHGVS','') (look for >)
    * OMIM = included in var.INFO.get('CLNDISDB','') (look for OMIM)
4. Additional variables needed to select samples
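    * id = var.INFO.get('ALLELEID','') (the ClinVar Allele ID, which matches #AlleleID in variant_summary.txt.gz; var.ID holds the Variation ID instead)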
    * star review = var.INFO.get('CLNREVSTAT','')
5. Additional variables we need to get output from 1000 genomes, gnomAD exome, and dbSNP
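Several of the INFO fields listed above pack more than one value into a single string. The following is a minimal sketch of how they could be unpacked, based only on the value formats visible in this notebook's output. The function names are hypothetical, and other ClinVar records may use formats not handled here.

```python
import re


def gene_symbols(geneinfo):
    """'LDLR:3949|LDLR-AS1:115271120' -> ['LDLR', 'LDLR-AS1']"""
    return [pair.split(':', 1)[0] for pair in (geneinfo or '').split('|') if pair]


def consequences(mc):
    """'SO:0001619|non-coding_transcript_variant' -> ['non-coding_transcript_variant']"""
    return [item.split('|', 1)[1] for item in (mc or '').split(',') if '|' in item]


def omim_ids(clndisdb):
    """Collect every identifier that follows 'OMIM:' (e.g. '143890' or 'PS143890')."""
    return sorted(set(re.findall(r'OMIM:([A-Za-z]*\d+)', clndisdb or '')))


print(gene_symbols('LDLR:3949|LDLR-AS1:115271120'))
print(consequences('SO:0001619|non-coding_transcript_variant'))
print(omim_ids('MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890'))
```

Each helper accepts the empty-string default passed to `var.INFO.get(...)` and returns an empty list in that case.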

In [42]:
vcf_og = VCF('clinvar_20260208.vcf.gz')
print(dir(next(iter(vcf_og))))

['ALT', 'CHROM', 'FILTER', 'FILTERS', 'FORMAT', 'ID', 'INFO', 'POS', 'QUAL', 'REF', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'aaf', 'call_rate', 'end', 'format', 'genotype', 'genotypes', 'gt_alt_depths', 'gt_alt_freqs', 'gt_bases', 'gt_depths', 'gt_phases', 'gt_phred_ll_het', 'gt_phred_ll_homalt', 'gt_phred_ll_homref', 'gt_quals', 'gt_ref_depths', 'gt_types', 'is_deletion', 'is_indel', 'is_mnp', 'is_snp', 'is_sv', 'is_transition', 'nucl_diversity', 'num_called', 'num_het', 'num_hom_alt', 'num_hom_ref', 'num_unknown', 'ploidy', 'relatedness', 'set_format', 'set_pos', 'start', 'var_subtype', 'var_type']


In [69]:
rows = []
for var in vcf_og('19:11087732-11133700'):
    # TODO: select necessary information
    rows.append({
        # identifiers
        'id': var.INFO.get('ALLELEID', ''),
        'review': var.INFO.get('CLNREVSTAT',''),
        'gene': var.INFO.get('GENEINFO', ''),
        'chrom': var.CHROM,
        # useful
        'pos': var.POS,  # 1-based
        'ref': var.REF,
        'alt': ','.join(var.ALT),  # var.ALT is a list, so join it into one string
        # missing: 'ref_aa': 
        # missing: 'alt_aa':
        'clinsig': var.INFO.get('CLNSIG', ''),
        'function': var.INFO.get('MC',''),
        'type': var.INFO.get('CLNVC',''),
        'change': var.INFO.get('CLNHGVS',''),
        'onim': var.INFO.get('CLNDISDB','')
    })

df = pd.DataFrame(rows)
df

Unnamed: 0,id,review,gene,chrom,pos,ref,alt,clinsig,function,type,change,onim
0,682121,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11087729,ACCACGCCCGGCTAATTTTTTGTATTTTTTTTTAGTAGAGGTGGGG...,A,Pathogenic,,Deletion,NC_000019.10:g.11087732_11090710del,"Human_Phenotype_Ontology:HP:0003124,Human_Phen..."
1,424286,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089263,C,G,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089263C>G,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143..."
2,245300,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089281,G,T,Benign/Likely_benign,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089281G>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143..."
3,4180223,no_assertion_criteria_provided,LDLR:3949|LDLR-AS1:115271120,19,11089283,C,T,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089283C>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
4,3752756,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089309,CAG,C,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Microsatellite,NC_000019.10:g.11089310AG[1],"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
...,...,...,...,...,...,...,...,...,...,...,...,...
4319,342703,"criteria_provided,_conflicting_classifications",LDLR:3949,19,11133635,C,G,Conflicting_classifications_of_pathogenicity,SO:0001624|3_prime_UTR_variant,single_nucleotide_variant,NC_000019.10:g.11133635C>G,"MedGen:C3661900|MONDO:MONDO:0007750,MedGen:C07..."
4320,879868,"criteria_provided,_single_submitter",LDLR:3949,19,11133666,C,G,Uncertain_significance,SO:0001624|3_prime_UTR_variant,single_nucleotide_variant,NC_000019.10:g.11133666C>G,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
4321,348054,"criteria_provided,_single_submitter",LDLR:3949,19,11133681,C,T,Uncertain_significance,SO:0001624|3_prime_UTR_variant,single_nucleotide_variant,NC_000019.10:g.11133681C>T,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
4322,349345,"criteria_provided,_single_submitter",LDLR:3949,19,11133682,G,T,Uncertain_significance,SO:0001624|3_prime_UTR_variant,single_nucleotide_variant,NC_000019.10:g.11133682G>T,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
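The `clinsig` and `review` columns are the ones needed for labelling and sample selection. Below is a hedged sketch of both mappings. Which classifications count as TP or TN is an assumption made here for illustration, not a rule taken from the MetaRNN paper or from FH-EARLY. The star values follow ClinVar's published review-status scale, and both function names are hypothetical.

```python
# Assumption: only clear pathogenic/benign classifications receive a label.
PATHOGENIC = {'Pathogenic', 'Likely_pathogenic', 'Pathogenic/Likely_pathogenic'}
BENIGN = {'Benign', 'Likely_benign', 'Benign/Likely_benign'}

# ClinVar review status -> number of gold stars.
REVIEW_STARS = {
    'practice_guideline': 4,
    'reviewed_by_expert_panel': 3,
    'criteria_provided,_multiple_submitters,_no_conflicts': 2,
    'criteria_provided,_single_submitter': 1,
    'criteria_provided,_conflicting_classifications': 1,
    'no_assertion_criteria_provided': 0,
    'no_classification_provided': 0,
    'no_classification_for_the_single_variant': 0,
}


def pathogenicity_label(clnsig):
    """Return 'TP', 'TN', or None for uncertain, conflicting, or other values."""
    if clnsig in PATHOGENIC:
        return 'TP'
    if clnsig in BENIGN:
        return 'TN'
    return None


def review_stars(clnrevstat):
    """Return the star count, or None if the status string is not recognised."""
    return REVIEW_STARS.get(clnrevstat)


# With the DataFrame built above, these could be applied as:
# df['label'] = df['clinsig'].map(pathogenicity_label)
# df['stars'] = df['review'].map(review_stars)
```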


In [74]:
df['id'].value_counts()

id
682121     1
424286     1
245300     1
4180223    1
3752756    1
          ..
342703     1
879868     1
348054     1
349345     1
348063     1
Name: count, Length: 4324, dtype: int64

* id is unique: each of the 4324 values appears exactly once
* pos is almost unique: a few variants share the same position
* amino acid (aa) information is missing from the VCF
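The observation about shared positions can be checked directly on the `rows` list. The sketch below uses toy rows with made-up ids and positions, and the helper name `shared_positions` is hypothetical.

```python
from collections import Counter


def shared_positions(rows):
    """Return {pos: count} for every position that holds more than one variant."""
    counts = Counter(row['pos'] for row in rows)
    return {pos: n for pos, n in counts.items() if n > 1}


# Toy data, not taken from ClinVar.
toy_rows = [
    {'id': 1, 'pos': 100, 'ref': 'C', 'alt': 'G'},
    {'id': 2, 'pos': 100, 'ref': 'C', 'alt': 'T'},
    {'id': 3, 'pos': 101, 'ref': 'G', 'alt': 'A'},
]
print(shared_positions(toy_rows))  # {100: 2}
```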

## Trying to get the aa change from another file (variant_summary.txt.gz)

In [39]:
sum_df = pd.read_csv('variant_summary.txt.gz', sep='\t')
sum_df = sum_df.loc[sum_df['GeneID']==3949]


In [40]:
sum_df

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),...,AlternateAlleleVCF,SomaticClinicalImpact,SomaticClinicalImpactLastEvaluated,ReviewStatusClinicalImpact,Oncogenicity,OncogenicityLastEvaluated,ReviewStatusOncogenicity,SCVsForAggregateGermlineClassification,SCVsForAggregateSomaticClinicalImpact,SCVsForAggregateOncogenicityClassification
6763,18722,single nucleotide variant,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),3949,LDLR,HGNC:6547,Pathogenic,1,"Mar 25, 2022",121908024,...,T,-,-,-,-,-,-,SCV002506409,-,-
6764,18722,single nucleotide variant,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),3949,LDLR,HGNC:6547,Pathogenic,1,"Mar 25, 2022",121908024,...,T,-,-,-,-,-,-,SCV002506409,-,-
6765,18724,single nucleotide variant,NM_000527.5(LDLR):c.259T>G (p.Trp87Gly),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 07, 2021",121908025,...,G,-,-,-,-,-,-,SCV001960956,-,-
6766,18724,single nucleotide variant,NM_000527.5(LDLR):c.259T>G (p.Trp87Gly),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 07, 2021",121908025,...,G,-,-,-,-,-,-,SCV001960956,-,-
6767,18725,single nucleotide variant,NM_000527.5(LDLR):c.530C>T (p.Ser177Leu),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 03, 2022",121908026,...,T,-,-,-,-,-,-,SCV002568105,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8677402,4801188,single nucleotide variant,NM_000527.5(LDLR):c.747C>G (p.Ile249Met),3949,LDLR,HGNC:6547,Uncertain significance,0,"Jan 26, 2026",-1,...,G,-,-,-,-,-,-,SCV007344328,-,-
8677403,4801188,single nucleotide variant,NM_000527.5(LDLR):c.747C>G (p.Ile249Met),3949,LDLR,HGNC:6547,Uncertain significance,0,"Jan 26, 2026",-1,...,G,-,-,-,-,-,-,SCV007344328,-,-
8677879,4801428,single nucleotide variant,NM_000527.5(LDLR):c.2566G>A (p.Glu856Lys),3949,LDLR,HGNC:6547,Uncertain significance,0,"Jul 30, 2025",-1,...,A,-,-,-,-,-,-,SCV007346034,-,-
8677880,4801428,single nucleotide variant,NM_000527.5(LDLR):c.2566G>A (p.Glu856Lys),3949,LDLR,HGNC:6547,Uncertain significance,0,"Jul 30, 2025",-1,...,A,-,-,-,-,-,-,SCV007346034,-,-
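The `Name` column carries the protein change in parentheses, so the missing reference and alternative amino acids could be pulled out with a regular expression. This is a minimal sketch covering only the forms visible in the output above (substitution, `=` for synonymous, `fs` for frameshift). The function name is hypothetical, and any other HGVS protein notation returns `None`.

```python
import re

# Three-letter reference aa, position, then three-letter alt aa, '=' or 'fs'.
PROTEIN_CHANGE = re.compile(r'\(p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|=|fs)')


def parse_aa_change(name):
    """Return (ref_aa, position, alt_aa), or None if no simple change is found."""
    match = PROTEIN_CHANGE.search(name or '')
    if match is None:
        return None
    ref_aa, position, alt_aa = match.groups()
    if alt_aa == '=':  # synonymous: the amino acid is unchanged
        alt_aa = ref_aa
    return ref_aa, int(position), alt_aa


print(parse_aa_change('NM_000527.5(LDLR):c.97C>T (p.Gln33Ter)'))
print(parse_aa_change('NM_000527.5(LDLR):c.282C>T (p.Asp94=)'))
print(parse_aa_change('NM_000527.5(LDLR):c.1628_1632del (p.Lys543fs)'))
```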


In [80]:
sum_df = sum_df.loc[sum_df['Assembly']=='GRCh38']

In [81]:
sum_df.loc[sum_df['#AlleleID'].isin(df['id'])]

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,ReferenceAllele,AlternateAllele,Cytogenetic,ReviewStatus,NumberSubmitters,Guidelines,TestedInGTR,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,SomaticClinicalImpact,SomaticClinicalImpactLastEvaluated,ReviewStatusClinicalImpact,Oncogenicity,OncogenicityLastEvaluated,ReviewStatusOncogenicity,SCVsForAggregateGermlineClassification,SCVsForAggregateSomaticClinicalImpact,SCVsForAggregateOncogenicityClassification
6764,18722,single nucleotide variant,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),3949,LDLR,HGNC:6547,Pathogenic,1,"Mar 25, 2022",121908024,-,RCV000003868|RCV000786350|RCV001034691|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh38,NC_000019.10,19,11100252,11100252,na,na,19p13.2,reviewed by expert panel,20,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023802,LDLR-LOVD, British Heart Foun...",3,3683,11100252,C,T,-,-,-,-,-,-,SCV002506409,-,-
6766,18724,single nucleotide variant,NM_000527.5(LDLR):c.259T>G (p.Trp87Gly),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 07, 2021",121908025,-,RCV000003870|RCV000622852|RCV000776466|RCV0008...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;inherited;unknown,germline,GRCh38,NC_000019.10,19,11102732,11102732,na,na,19p13.2,reviewed by expert panel,30,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023683,LDLR-LOVD, British Heart Foun...",3,3685,11102732,T,G,-,-,-,-,-,-,SCV001960956,-,-
6768,18725,single nucleotide variant,NM_000527.5(LDLR):c.530C>T (p.Ser177Leu),3949,LDLR,HGNC:6547,Pathogenic,1,"Jun 03, 2022",121908026,-,RCV000003871|RCV000161958|RCV000588687|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;inherited;not applicable;unknown,germline,GRCh38,NC_000019.10,19,11105436,11105436,na,na,19p13.2,reviewed by expert panel,28,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023715,LDLR-LOVD, British Heart Foun...",3,3686,11105436,C,T,-,-,-,-,-,-,SCV002568105,-,-
6770,18727,single nucleotide variant,NM_000527.5(LDLR):c.1694G>T (p.Gly565Val),3949,LDLR,HGNC:6547,Pathogenic/Likely pathogenic,1,"Jul 08, 2024",28942082,-,RCV000003874|RCV000791454|RCV001195593|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|Familial hyp...",germline;not applicable,germline,GRCh38,NC_000019.10,19,11116201,11116201,na,na,19p13.2,"criteria provided, multiple submitters, no con...",12,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023553,LDLR-LOVD, British Heart Foun...",3,3688,11116201,G,T,-,-,-,-,-,-,SCV000295584|SCV000503387|SCV000583865|SCV0005...,-,-
6772,18728,single nucleotide variant,NM_000527.5(LDLR):c.2000G>A (p.Cys667Tyr),3949,LDLR,HGNC:6547,Likely pathogenic,1,"Jun 18, 2021",28942083,-,RCV000030131|RCV000313287|RCV000775084|RCV0024...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh38,NC_000019.10,19,11120382,11120382,na,na,19p13.2,reviewed by expert panel,21,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"LDLR-LOVD, British Heart Foundation:LDLR_00027...",3,3689,11120382,G,A,-,-,-,-,-,-,SCV001960936,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8666935,4795747,single nucleotide variant,NM_000527.5(LDLR):c.282C>T (p.Asp94=),3949,LDLR,HGNC:6547,Likely benign,0,"Jul 17, 2025",-1,-,RCV006442944,MedGen:C3661900,not provided,germline,germline,GRCh38,NC_000019.10,19,11102755,11102755,na,na,19p13.2,"criteria provided, single submitter",1,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,-,2,4684308,11102755,C,T,-,-,-,-,-,-,SCV007318927,-,-
8672463,4798621,Deletion,NM_000527.5(LDLR):c.1628_1632del (p.Lys543fs),3949,LDLR,HGNC:6547,Pathogenic,1,"Oct 21, 2025",-1,-,RCV006452967,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890",Familial hypercholesterolemia,germline,germline,GRCh38,NC_000019.10,19,11116134,11116138,na,na,19p13.2,"criteria provided, single submitter",1,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,-,2,4687151,11116133,CAAGAA,C,-,-,-,-,-,-,SCV007336418,-,-
8674689,4799777,single nucleotide variant,NM_000527.5(LDLR):c.2099A>T (p.Asp700Val),3949,LDLR,HGNC:6547,Uncertain significance,0,"Dec 15, 2025",-1,-,RCV006457712,MedGen:CN169374,not specified,germline,germline,GRCh38,NC_000019.10,19,11120481,11120481,na,na,-,"criteria provided, single submitter",1,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,-,2,4688317,11120481,A,T,-,-,-,-,-,-,SCV007340120,-,-
8677403,4801188,single nucleotide variant,NM_000527.5(LDLR):c.747C>G (p.Ile249Met),3949,LDLR,HGNC:6547,Uncertain significance,0,"Jan 26, 2026",-1,-,RCV006456239,MedGen:CN169374,not specified,germline,germline,GRCh38,NC_000019.10,19,11106617,11106617,na,na,-,"criteria provided, single submitter",1,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,-,2,4689731,11106617,C,G,-,-,-,-,-,-,SCV007344328,-,-


#AlleleID
18722      2
18724      2
18725      2
18727      2
18728      2
          ..
4650082    1
4650083    1
4650084    1
4650085    1
4801728    1
Name: count, Length: 4703, dtype: int64

In [79]:
sum_df.loc[sum_df['#AlleleID']==18722]

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,ReferenceAllele,AlternateAllele,Cytogenetic,ReviewStatus,NumberSubmitters,Guidelines,TestedInGTR,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,SomaticClinicalImpact,SomaticClinicalImpactLastEvaluated,ReviewStatusClinicalImpact,Oncogenicity,OncogenicityLastEvaluated,ReviewStatusOncogenicity,SCVsForAggregateGermlineClassification,SCVsForAggregateSomaticClinicalImpact,SCVsForAggregateOncogenicityClassification
6763,18722,single nucleotide variant,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),3949,LDLR,HGNC:6547,Pathogenic,1,"Mar 25, 2022",121908024,-,RCV000003868|RCV000786350|RCV001034691|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh37,NC_000019.9,19,11210928,11210928,na,na,19p13.2,reviewed by expert panel,20,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023802,LDLR-LOVD, British Heart Foun...",3,3683,11210928,C,T,-,-,-,-,-,-,SCV002506409,-,-
6764,18722,single nucleotide variant,NM_000527.5(LDLR):c.97C>T (p.Gln33Ter),3949,LDLR,HGNC:6547,Pathogenic,1,"Mar 25, 2022",121908024,-,RCV000003868|RCV000786350|RCV001034691|RCV0023...,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389...","Hypercholesterolemia, familial, 1|not provided...",germline;unknown,germline,GRCh38,NC_000019.10,19,11100252,11100252,na,na,19p13.2,reviewed by expert panel,20,"ACMG2013,ACMG2016,ACMG2021,ACMG2022",N,"ClinGen:CA023802,LDLR-LOVD, British Heart Foun...",3,3683,11100252,C,T,-,-,-,-,-,-,SCV002506409,-,-
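The two rows for allele 18722 differ only in assembly (GRCh37 versus GRCh38), which explains the counts of 2 seen earlier. One way to attach the summary information to the VCF-derived table is to keep the GRCh38 rows and left-join on the Allele ID. The sketch below uses small stand-in DataFrames: the 18722 values are copied from the rows above, while the second id is made up to show an unmatched variant.

```python
import pandas as pd

# Stand-in for the VCF-derived table (id 999999999 is hypothetical).
df = pd.DataFrame({'id': [18722, 999999999], 'pos': [11100252, 11100300]})

# Stand-in for variant_summary: one row per assembly for the same allele.
sum_df = pd.DataFrame({
    '#AlleleID': [18722, 18722],
    'Name': ['NM_000527.5(LDLR):c.97C>T (p.Gln33Ter)'] * 2,
    'Assembly': ['GRCh37', 'GRCh38'],
    'PositionVCF': [11210928, 11100252],
})

# Keep GRCh38 only, so that each allele appears at most once.
grch38 = (
    sum_df.loc[sum_df['Assembly'] == 'GRCh38', ['#AlleleID', 'Name']]
    .drop_duplicates('#AlleleID')
)

# Left join keeps every VCF row; validate raises if the right keys repeat.
merged = df.merge(
    grch38, how='left', left_on='id', right_on='#AlleleID',
    validate='many_to_one',
)
print(merged[['id', 'pos', 'Name']])
```

Variants without a match keep a missing `Name`, which makes it easy to count how many rows still lack amino acid information.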
