# Genomic Data for Variant Pathogenicity
This notebook reads the ClinVar VCF file and writes a VCF file containing the information needed to run ANNOVAR and, eventually, to fill in the table template provided in FH-EARLY for the genomic data.

### To download the ClinVar data:
Go to https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/ (our version is `clinvar_20260208.vcf.gz`). The region query used below also needs the matching tabix index (`.tbi`) next to the VCF.
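
The manual download can also be scripted. The following is a minimal sketch, not part of the original workflow: the helper names `clinvar_urls` and `download_clinvar` are hypothetical, and it assumes the tabix index is published next to the VCF with a `.tbi` suffix. Dated releases may be moved elsewhere on the server over time, so the URL may need adjusting.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the notebook text above
BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/"


def clinvar_urls(filename):
    """Return the URLs of the VCF and of its tabix index (assumed to be '<file>.tbi')."""
    return [BASE_URL + filename, BASE_URL + filename + ".tbi"]


def download_clinvar(filename, dest_dir="."):
    """Download the VCF and its index into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in clinvar_urls(filename):
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():
            urlretrieve(url, str(target))
        paths.append(target)
    return paths


# Example call (performs network access, so it is left commented out):
# download_clinvar("clinvar_20260208.vcf.gz")
```

Keeping the index in the same folder as the VCF is what allows cyvcf2 to answer region queries without scanning the whole file.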

In [4]:
# import packages
import pandas as pd
from cyvcf2 import VCF # https://github.com/brentp/cyvcf2/tree/main

In [None]:
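# Open the ClinVar VCF; region queries require the tabix index (.tbi) alongside it
vcf_og = VCF('clinvar_20260208.vcf.gz')
rows = []
# Iterate over the records overlapping the chr19 region of interest (LDLR)
for var in vcf_og('19:11087732-11133700'):
    # TODO: select necessary information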
    rows.append({
        'chrom': var.CHROM,
        'pos': var.POS,  # 1-based
        'ref': var.REF,
        'alt': ','.join(var.ALT),
        'id': var.ID,
        'qual': var.QUAL,  # empty for these ClinVar records (see output)
        'clnsig': var.INFO.get('CLNSIG', ''),
        'geneinfo': var.INFO.get('GENEINFO', ''),
        'gt_types': var.gt_types.tolist()  # empty list: this VCF has no sample columns
    })

df = pd.DataFrame(rows)
df

Unnamed: 0,chrom,pos,ref,alt,id,qual,clnsig,geneinfo,gt_types
0,19,11087729,ACCACGCCCGGCTAATTTTTTGTATTTTTTTTTAGTAGAGGTGGGG...,A,694275,,Pathogenic,LDLR:3949|LDLR-AS1:115271120,[]
1,19,11089263,C,G,430740,,Uncertain_significance,LDLR:3949|LDLR-AS1:115271120,[]
2,19,11089281,G,T,250925,,Benign/Likely_benign,LDLR:3949|LDLR-AS1:115271120,[]
3,19,11089283,C,T,4069787,,Uncertain_significance,LDLR:3949|LDLR-AS1:115271120,[]
4,19,11089309,CAG,C,3628882,,Uncertain_significance,LDLR:3949|LDLR-AS1:115271120,[]
...,...,...,...,...,...,...,...,...,...
4319,19,11133635,C,G,328126,,Conflicting_classifications_of_pathogenicity,LDLR:3949,[]
4320,19,11133666,C,G,890682,,Uncertain_significance,LDLR:3949,[]
4321,19,11133681,C,T,328127,,Uncertain_significance,LDLR:3949,[]
4322,19,11133682,G,T,328128,,Uncertain_significance,LDLR:3949,[]


In [13]:
df['geneinfo'].value_counts()

geneinfo
LDLR:3949                           4062
LDLR:3949|LDLR-AS1:115271120         220
LDLR:3949|MIR6886:102465534           41
LDLR:3949|LOC126862855:126862855       1
Name: count, dtype: int64

* every record is annotated with LDLR (some also list an overlapping gene, e.g. LDLR-AS1 or MIR6886)
* `id` is unique
* `pos` is almost unique: a few positions carry several variants that differ in `ref`/`alt`

In [18]:
df.loc[df['pos']==11105575]

Unnamed: 0,chrom,pos,ref,alt,id,qual,clnsig,geneinfo,gt_types
1201,19,11105575,G,A,2094541,,Likely_benign,LDLR:3949,[]
1202,19,11105575,G,C,251370,,Likely_pathogenic,LDLR:3949,[]
1203,19,11105575,G,GGACAAA,977984,,Likely_pathogenic,LDLR:3949,[]
1204,19,11105575,G,GGACAAATCT,251377,,Likely_pathogenic,LDLR:3949,[]
1205,19,11105575,G,GGACAAATCTGAC,251376,,Likely_pathogenic,LDLR:3949,[]
1206,19,11105575,GGAC,G,2125438,,Pathogenic,LDLR:3949,[]
1207,19,11105575,GGACAAATCTGACGA,AACTGCGGTAAACTGCGGTAAACT,430761,,Pathogenic,LDLR:3949,[]


In [23]:
to_remove = ['gt_types', 'qual']
df_v2 = df.drop(columns=to_remove)
df_v2.head()

Unnamed: 0,chrom,pos,ref,alt,id,clnsig,geneinfo
0,19,11087729,ACCACGCCCGGCTAATTTTTTGTATTTTTTTTTAGTAGAGGTGGGG...,A,694275,Pathogenic,LDLR:3949|LDLR-AS1:115271120
1,19,11089263,C,G,430740,Uncertain_significance,LDLR:3949|LDLR-AS1:115271120
2,19,11089281,G,T,250925,Benign/Likely_benign,LDLR:3949|LDLR-AS1:115271120
3,19,11089283,C,T,4069787,Uncertain_significance,LDLR:3949|LDLR-AS1:115271120
4,19,11089309,CAG,C,3628882,Uncertain_significance,LDLR:3949|LDLR-AS1:115271120


### Information needed for ANNOVAR:
1. chromosome
2. start position (pos)
3. end position - TODO (not a VCF column; it can be derived from `pos` and the length of `ref`)
4. ref nucleotide (ref)
5. observed nucleotide (alt)
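
The end position marked as TODO in the list above can be derived from the start position and the length of the reference allele. The following is a minimal sketch on a few rows copied from this notebook; it keeps the VCF representation of indels (including the shared leading base), which may differ from the convention ANNOVAR uses internally, and the output file name is hypothetical. ANNOVAR also ships a `convert2annovar.pl` script that can convert a VCF directly, which may be the safer route for indels.

```python
import pandas as pd

# A few variants copied from the table above: an SNV, a deletion and an insertion
variants = pd.DataFrame({
    "chrom": ["19", "19", "19"],
    "pos": [11089263, 11089309, 11105575],
    "ref": ["C", "CAG", "G"],
    "alt": ["G", "C", "GGACAAA"],
})

# End position in VCF-style coordinates: the last reference base covered by `ref`.
# For an SNV (len(ref) == 1) the end equals the start.
variants["end"] = variants["pos"] + variants["ref"].str.len() - 1

# Column order listed above: chromosome, start, end, ref, alt
annovar_cols = variants[["chrom", "pos", "end", "ref", "alt"]]
print(annovar_cols)

# Writing a tab-separated file without header (hypothetical file name):
# annovar_cols.to_csv("ldlr_clinvar.avinput", sep="\t", header=False, index=False)
```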

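## TO CHECK: ANNOVAR is not limited to single-nucleotide variants; its input format also covers insertions, deletions and block substitutions, so the indels above should be usable once converted to its conventions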