# Get Phenopackets

This notebook shows how to pack all phenopackets in this repository into a TAR or ZIP archive.

To do so, the code searches the *notebooks* directory for all subfolders named "phenopackets", copies the
"*.json" files in those directories to a temporary location, creates a TAR or ZIP archive from them, and writes the archive to the location given in the code.
The code also provides two pandas dataframes that can be used to extract files from the archives that satisfy certain criteria, e.g., having a minimum number of HPO terms, having a certain disease diagnosis, etc.
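
The collect-and-archive steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not the `ppktstore` implementation: the helper names `collect_phenopackets` and `make_archive`, and the `<cohort>/<file>.json` layout inside the archive, are assumptions made for this example.

```python
import shutil
import tarfile
import tempfile
import zipfile
from pathlib import Path


def collect_phenopackets(notebook_dir: str, staging_dir: str) -> list[Path]:
    """Copy every *.json file found in a "phenopackets" subfolder into staging_dir."""
    copied = []
    for folder in sorted(Path(notebook_dir).rglob("phenopackets")):
        if not folder.is_dir():
            continue
        # Assumption: the parent folder name (e.g. "FANCC") identifies the cohort.
        target = Path(staging_dir) / folder.parent.name
        target.mkdir(parents=True, exist_ok=True)
        for json_file in sorted(folder.glob("*.json")):
            copied.append(Path(shutil.copy2(json_file, target)))
    return copied


def make_archive(notebook_dir: str, out_base: str, fmt: str = "tar.gz") -> str:
    """Stage the phenopackets in a temporary directory and archive them."""
    if fmt not in ("tar.gz", "zip"):
        raise ValueError(f"Unsupported archive format: {fmt}")
    out_path = f"{out_base}.{fmt}"
    with tempfile.TemporaryDirectory() as staging:
        files = collect_phenopackets(notebook_dir, staging)
        if fmt == "tar.gz":
            with tarfile.open(out_path, "w:gz") as tar:
                for f in files:
                    tar.add(f, arcname=f.relative_to(staging).as_posix())
        else:
            with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
                for f in files:
                    zf.write(f, arcname=f.relative_to(staging).as_posix())
    return out_path


# Example (hypothetical paths):
# make_archive("notebooks", "all_phenopackets", fmt="zip")
```

Staging the files in a temporary directory keeps the repository untouched and lets both archive formats share the same collection step.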

In [1]:
from ppktstore.model import PhenopacketStore

notebook_dir = "notebooks"
store = PhenopacketStore.from_notebook_dir(notebook_dir)

The following table shows a summary of each phenopacket.

In [2]:
from ppktstore.stats import create_summary_df

df = create_summary_df(store)
df.head(2)

| | disease | disease_id | patient_id | gene | allele_1 | allele_2 | PMID | cohort | filename |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Fanconi anemia, complementation group C | OMIM:227645 | proband | FANCC | NM_000136.3:c.67del | NM_000136.3:c.67del | PMID:22701786 | FANCC | FANCC/phenopackets/PMID_22701786_proband.json |
| 1 | Fanconi anemia, complementation group C | OMIM:227645 | first patient | FANCC | NM_000136.3:c.455dup | NM_000136.3:c.1393C>T | PMID:16429406 | FANCC | FANCC/phenopackets/PMID_16429406_firstpatient.... |


In [3]:

from ppktstore.archive import PhenopacketStoreArchiver

archiver = PhenopacketStoreArchiver(store)

In [4]:
from ppktstore.stats import summarize_cohorts

df = summarize_cohorts(store)
df.head()

| | cohort | directory | filename | phenopacket.id | disease | n_hpo | n_var | n_alleles | n_encounters |
|---|---|---|---|---|---|---|---|---|---|
| 0 | FANCC | FANCC/phenopackets | PMID_22701786_proband.json | PMID_22701786_proband | Fanconi anemia, complementation group C (OMIM:... | 11 | 1 | 2 | 2 |
| 1 | FANCC | FANCC/phenopackets | PMID_16429406_firstpatient.json | PMID_16429406_first_patient | Fanconi anemia, complementation group C (OMIM:... | 4 | 2 | 2 | 1 |
| 2 | FANCC | FANCC/phenopackets | PMID_31044565_proband.json | PMID_31044565_proband | Fanconi anemia, complementation group C (OMIM:... | 12 | 1 | 2 | 3 |
| 3 | FANCC | FANCC/phenopackets | PMID_16429406_secondpatient.json | PMID_16429406_second_patient | Fanconi anemia, complementation group C (OMIM:... | 10 | 2 | 2 | 2 |
| 4 | SAMD7 | SAMD7/phenopackets | PMID_38272031_Individual1-1.json | PMID_38272031_Individual_1_1 | Macular dystrophy with or without cone dysfunc... | 7 | 1 | 2 | 1 |
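
A summary table like the one above can be used to pick out the files that meet a criterion, for example a minimum number of HPO terms. The sketch below rebuilds a small DataFrame from the five rows displayed above, so it runs without `ppktstore`. The helper `select_files` is hypothetical, and joining `directory` and `filename` with `/` to form a path is an assumption about how the files are laid out.

```python
import pandas as pd

# A few columns copied from the five rows displayed above.
df = pd.DataFrame({
    "cohort": ["FANCC", "FANCC", "FANCC", "FANCC", "SAMD7"],
    "directory": ["FANCC/phenopackets"] * 4 + ["SAMD7/phenopackets"],
    "filename": [
        "PMID_22701786_proband.json",
        "PMID_16429406_firstpatient.json",
        "PMID_31044565_proband.json",
        "PMID_16429406_secondpatient.json",
        "PMID_38272031_Individual1-1.json",
    ],
    "n_hpo": [11, 4, 12, 10, 7],
})


def select_files(df, min_hpo, cohort=None):
    """Return paths of phenopackets with at least min_hpo HPO terms.

    If cohort is given, only phenopackets from that cohort are returned.
    """
    mask = df["n_hpo"] >= min_hpo
    if cohort is not None:
        mask = mask & (df["cohort"] == cohort)
    selected = df[mask]
    return (selected["directory"] + "/" + selected["filename"]).tolist()


# Phenopackets with at least 10 HPO terms.
print(select_files(df, min_hpo=10))
```

The same pattern applies to the other columns, such as `n_var` or `n_alleles`.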


In [5]:
from ppktstore.stats import summarize_cohort_sizes

summary_df = summarize_cohort_sizes(store)
summary_df.head(50)

| | Cohort | Directory | Count |
|---|---|---|---|
| 0 | FANCC | FANCC/phenopackets | 4 |
| 1 | SAMD7 | SAMD7/phenopackets | 8 |
| 2 | TMEM38B | TMEM38B/phenopackets | 3 |
| 3 | DLL3 | DLL3/phenopackets | 3 |
| 4 | LONP1 | LONP1/phenopackets | 8 |
| 5 | ISCA2 | ISCA2/phenopackets | 16 |
| 6 | CAV3 | CAV3/phenopackets | 8 |
| 7 | HOXC13 | HOXC13/phenopackets | 6 |
| 8 | CHST14 | CHST14/phenopackets | 1 |
| 9 | FOSL2 | FOSL2/phenopackets | 11 |


# Export gzip archive

Write phenopackets into a TAR GZ archive `all_phenopackets.tar.gz`.

In [6]:
archiver.get_store_gzip("all_phenopackets")

# Export zip archive

Write phenopackets into a ZIP archive `all_phenopackets.zip`.

In [None]:
archiver.get_store_zip("all_phenopackets")
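
After exporting, it can be useful to check what ended up in an archive. The following sketch uses only the standard library. `list_archive_members` is a hypothetical helper, not part of `ppktstore`, and it makes no assumption about the folder layout inside the archive.

```python
import tarfile
import zipfile


def list_archive_members(path: str) -> list[str]:
    """Return the sorted names of all regular files in a ZIP or TAR archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # Directory entries in a ZIP archive end with "/"; skip them.
            return sorted(n for n in zf.namelist() if not n.endswith("/"))
    if tarfile.is_tarfile(path):
        # tarfile.open detects gzip compression automatically when reading.
        with tarfile.open(path) as tar:
            return sorted(m.name for m in tar.getmembers() if m.isfile())
    raise ValueError(f"Not a ZIP or TAR archive: {path}")


# Example, assuming the archives were written to the working directory:
# members = list_archive_members("all_phenopackets.zip")
# print(len(members), "files")
```

Comparing the number of members with the sum of the `Count` column in the cohort size table is a quick sanity check.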

# Create Markdown file

We use this function to generate the Markdown report for the online documentation.

In [None]:
import os
from ppktstore.stats import generate_phenopacket_store_report

report = generate_phenopacket_store_report(notebook_dir)

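# Write the report as UTF-8 so the result does not depend on the platform default encoding.
outfile = os.path.join('docs', 'collections.md')
with open(outfile, 'w', encoding='utf-8') as fh:
    fh.write(report)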
