# Genomic Data for Variant Pathogenicity
This notebook reads the ClinVar VCF file and writes a VCF file with the information needed to run ANNOVAR and, eventually, to fill in the genomic-data table template provided in FH-EARLY.

### To download the ClinVar data:
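Go to https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/ and download the VCF together with its tabix index. Our version is clinvar_20260208.vcf.gz with clinvar_20260208.vcf.gz.tbi; the index is needed for the region query used later in this notebook.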
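The same two files can also be fetched from a script. The sketch below assumes only the directory URL and file names given above; `clinvar_urls` and `download_clinvar` are hypothetical helper names, and the URL would need adjusting if the dated release is no longer in that directory.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/"


def clinvar_urls(vcf_name="clinvar_20260208.vcf.gz"):
    """Return the URLs of the VCF and of its tabix index (.tbi)."""
    return [BASE_URL + vcf_name, BASE_URL + vcf_name + ".tbi"]


def download_clinvar(target_dir=".", vcf_name="clinvar_20260208.vcf.gz"):
    """Download both files into target_dir, skipping files already present."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in clinvar_urls(vcf_name):
        path = target / url.rsplit("/", 1)[-1]
        if not path.exists():
            urlretrieve(url, str(path))
        paths.append(path)
    return paths
```

Calling `download_clinvar()` once before the first cell should leave both files next to the notebook.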

In [1]:
# import packages
import pandas as pd
from cyvcf2 import VCF # https://github.com/brentp/cyvcf2/tree/main

### sources for choices
1. Genome interval (ClinVar): consulted https://www-ncbi-nlm-nih-gov.tudelft.idm.oclc.org/clinvar/?term=LDLR%5Bgene%5D to select the range between the first and the last LDLR variant on chromosome 19.
2. Variables for ANNOVAR (following the MetaRNN paper, additional file):
    * chromosome = var.CHROM
    * position = var.POS
    * reference allele = var.REF
    * alternative allele = var.ALT
    * [missing] reference aa of the protein [gloria]
    * [missing] alternative aa of the protein [gloria]
    * label (TP or TN) (Clinical Significance by ClinVar) = var.INFO.get('CLNSIG','')
3. Additional variables used in FH-EARLY:
    * [uncertain] function = for now var.INFO.get('MC','') (e.g. synonymous_variant)
    * gene = var.INFO.get('GENEINFO', '')
    * variant type = var.INFO.get('CLNVC','')
    * [uncertain] change = for now taken from var.INFO.get('CLNHGVS','') (look for >)
    * OMIM = included in var.INFO.get('CLNDISDB','') (look for the OMIM: entries)
4. Additional variables needed to select samples
    * id = var.ID
    * star review = var.INFO.get('CLNREVSTAT','')
5. Additional variables we need to get output from 1000 Genomes, gnomAD exome, and dbSNP
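The review status and clinical significance listed above still have to be turned into a star level and a TP/TN label before samples can be selected. The following is a minimal sketch: the star levels follow ClinVar's review-status scheme, while the TP/TN grouping (pathogenic and likely pathogenic as TP, benign and likely benign as TN) is an assumption that should be checked against the MetaRNN additional file. `review_stars` and `label` are hypothetical helper names.

```python
# Star level per CLNREVSTAT value (ClinVar review-status scheme).
REVIEW_STARS = {
    "practice_guideline": 4,
    "reviewed_by_expert_panel": 3,
    "criteria_provided,_multiple_submitters,_no_conflicts": 2,
    "criteria_provided,_single_submitter": 1,
    "criteria_provided,_conflicting_classifications": 1,
    "no_assertion_criteria_provided": 0,
    "no_classification_provided": 0,
}

# ASSUMPTION: which CLNSIG values count as true positives / true negatives.
PATHOGENIC = {"Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic"}
BENIGN = {"Benign", "Likely_benign", "Benign/Likely_benign"}


def review_stars(clnrevstat):
    """Return the star level, or None for a status not in the table."""
    return REVIEW_STARS.get(clnrevstat)


def label(clnsig):
    """Return 'TP', 'TN', or None when the variant has no usable label."""
    if clnsig in PATHOGENIC:
        return "TP"
    if clnsig in BENIGN:
        return "TN"
    return None  # e.g. Uncertain_significance or conflicting classifications
```

Returning `None` instead of a default keeps unlabeled or unrecognized records visible, so they can be dropped explicitly rather than silently counted as zero-star or benign.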

In [30]:
vcf_og = VCF('clinvar_20260208.vcf.gz')
print(dir(next(iter(vcf_og))))

['ALT', 'CHROM', 'FILTER', 'FILTERS', 'FORMAT', 'ID', 'INFO', 'POS', 'QUAL', 'REF', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'aaf', 'call_rate', 'end', 'format', 'genotype', 'genotypes', 'gt_alt_depths', 'gt_alt_freqs', 'gt_bases', 'gt_depths', 'gt_phases', 'gt_phred_ll_het', 'gt_phred_ll_homalt', 'gt_phred_ll_homref', 'gt_quals', 'gt_ref_depths', 'gt_types', 'is_deletion', 'is_indel', 'is_mnp', 'is_snp', 'is_sv', 'is_transition', 'nucl_diversity', 'num_called', 'num_het', 'num_hom_alt', 'num_hom_ref', 'num_unknown', 'ploidy', 'relatedness', 'set_format', 'set_pos', 'start', 'var_subtype', 'var_type']


In [32]:
rows = []
for var in vcf_og('19:11087732-11133700'):
    # TODO: select necessary information
    rows.append({
        # identifiers
        'id': var.ID,
        'review': var.INFO.get('CLNREVSTAT',''),
        'gene': var.INFO.get('GENEINFO', ''),
        'chrom': var.CHROM,
        # useful
        'pos': var.POS,  # 1-based
        'ref': var.REF,
        'alt': ','.join(var.ALT),  # var.ALT is a list, so join it into one string
        # missing: 'ref_aa': 
        # missing: 'alt_aa':
        'clinsig': var.INFO.get('CLNSIG', ''),
        'function': var.INFO.get('MC',''),
        'type': var.INFO.get('CLNVC',''),
        'change': var.INFO.get('CLNHGVS',''),
        'onim': var.INFO.get('CLNDISDB','')
    })

df = pd.DataFrame(rows)
df

Unnamed: 0,id,review,gene,chrom,pos,ref,alt,clinsig,function,type,change,onim
0,694275,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11087729,ACCACGCCCGGCTAATTTTTTGTATTTTTTTTTAGTAGAGGTGGGG...,A,Pathogenic,,Deletion,NC_000019.10:g.11087732_11090710del,"Human_Phenotype_Ontology:HP:0003124,Human_Phen..."
1,430740,reviewed_by_expert_panel,LDLR:3949|LDLR-AS1:115271120,19,11089263,C,G,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089263C>G,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143..."
2,250925,"criteria_provided,_multiple_submitters,_no_con...",LDLR:3949|LDLR-AS1:115271120,19,11089281,G,T,Benign/Likely_benign,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089281G>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143..."
3,4069787,no_assertion_criteria_provided,LDLR:3949|LDLR-AS1:115271120,19,11089283,C,T,Uncertain_significance,SO:0001619|non-coding_transcript_variant,single_nucleotide_variant,NC_000019.10:g.11089283C>T,"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
4,3628882,"criteria_provided,_single_submitter",LDLR:3949|LDLR-AS1:115271120,19,11089309,CAG,C,Uncertain_significance,SO:0001619|non-coding_transcript_variant,Microsatellite,NC_000019.10:g.11089310AG[1],"MONDO:MONDO:0005439,MedGen:C0020445,OMIM:PS143890"
...,...,...,...,...,...,...,...,...,...,...,...,...
4319,328126,"criteria_provided,_conflicting_classifications",LDLR:3949,19,11133635,C,G,Conflicting_classifications_of_pathogenicity,SO:0001624|3_prime_UTR_variant,single_nucleotide_variant,NC_000019.10:g.11133635C>G,"MedGen:C3661900|MONDO:MONDO:0007750,MedGen:C07..."
4320,890682,"criteria_provided,_single_submitter",LDLR:3949,19,11133666,C,G,Uncertain_significance,SO:0001624|3_prime_UTR_variant,single_nucleotide_variant,NC_000019.10:g.11133666C>G,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
4321,328127,"criteria_provided,_single_submitter",LDLR:3949,19,11133681,C,T,Uncertain_significance,SO:0001624|3_prime_UTR_variant,single_nucleotide_variant,NC_000019.10:g.11133681C>T,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
4322,328128,"criteria_provided,_single_submitter",LDLR:3949,19,11133682,G,T,Uncertain_significance,SO:0001624|3_prime_UTR_variant,single_nucleotide_variant,NC_000019.10:g.11133682G>T,"MONDO:MONDO:0007750,MedGen:C0745103,OMIM:14389..."
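Several columns in this table are still raw INFO strings that pack more than one value. The helpers below are a sketch of how they could be split, based only on the value formats visible in the output above; the function names are hypothetical, and the comma-separated form of `MC` for multiple consequences is an assumption.

```python
import re


def parse_geneinfo(geneinfo):
    """'LDLR:3949|LDLR-AS1:115271120' -> [('LDLR', '3949'), ('LDLR-AS1', '115271120')]"""
    genes = []
    for entry in geneinfo.split("|"):
        if entry:
            symbol, _, gene_id = entry.partition(":")
            genes.append((symbol, gene_id))
    return genes


def parse_mc(mc):
    """Return the consequence names from 'SO:id|name' entries (comma-separated)."""
    return [entry.split("|", 1)[1] for entry in mc.split(",") if "|" in entry]


def omim_ids(clndisdb):
    """Return the distinct OMIM identifiers mentioned in a CLNDISDB string."""
    return sorted(set(re.findall(r"OMIM:([A-Za-z0-9]+)", clndisdb)))


def substitution(clnhgvs):
    """Return (ref, alt) for a 'g.<pos>X>Y' expression, else None."""
    match = re.search(r"([ACGT])>([ACGT])$", clnhgvs)
    return match.groups() if match else None
```

For example, `substitution('NC_000019.10:g.11089263C>G')` gives `('C', 'G')`, while the deletion in row 0 gives `None`, which is one way to see that the `change` column only covers single-nucleotide substitutions.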


* only records for LDLR variants are included
* id is unique
* pos is almost unique: a few variants share the same position
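These observations can be checked directly on the `rows` list built above. The sketch below uses a hypothetical helper, `duplicated_values`, and three made-up toy rows in place of the real data.

```python
from collections import Counter


def duplicated_values(rows, key):
    """Return {value: count} for every value of `key` that occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Toy rows for illustration only (not taken from ClinVar).
example_rows = [
    {"id": "1", "pos": 100},
    {"id": "2", "pos": 100},
    {"id": "3", "pos": 105},
]

print(duplicated_values(example_rows, "id"))   # {}
print(duplicated_values(example_rows, "pos"))  # {100: 2}
```

An empty result for `id` confirms uniqueness, and the result for `pos` lists exactly which positions carry more than one variant.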

### Information needed for ANNOVAR
1. chromosome
2. start position (pos)
3. end position - TODO
4. ref nucleotide (ref)
5. observed nucleotide (alt)
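For the end position, one option is to derive it from `pos` and the length of `ref`. ANNOVAR's documented input format also writes indels with `-` in place of the empty allele, so the sketch below trims the anchor base that VCF puts in front of indels. The insertion convention used here (start and end both set to the anchor base) is an assumption, and `vcf_to_avinput` is a hypothetical helper; its output should be compared with ANNOVAR's own `convert2annovar.pl` before it is relied on.

```python
def vcf_to_avinput(chrom, pos, ref, alt):
    """Return (chrom, start, end, ref, alt) in ANNOVAR-style coordinates."""
    # Count the leading bases shared by ref and alt (the VCF indel anchor).
    shared = 0
    while shared < min(len(ref), len(alt)) and ref[shared] == alt[shared]:
        shared += 1
    # Trim the anchor only for length-changing variants (indels).
    if shared and len(ref) != len(alt):
        ref, alt = ref[shared:], alt[shared:]
        pos += shared
    if not ref:
        # Insertion: ASSUMED convention, start = end = base before the insertion.
        return (chrom, pos - 1, pos - 1, "-", alt)
    end = pos + len(ref) - 1  # 1-based, inclusive
    return (chrom, pos, end, ref, alt or "-")


print(vcf_to_avinput("19", 11089263, "C", "G"))
# ('19', 11089263, 11089263, 'C', 'G')
print(vcf_to_avinput("19", 11089309, "CAG", "C"))
# ('19', 11089310, 11089311, 'AG', '-')
```

The two calls use rows 1 and 4 of the table above: the SNV keeps its position as both start and end, and the two-base deletion shifts by one because the leading `C` is only an anchor.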

## TO CHECK: is ANNOVAR limited to missense variants (single-nucleotide changes)?