# Phenopacket store statistics

This notebook performs quality assessment and calculates descriptive statistics for a phenopacket-store release.

Note: we recommend installing Phenopacket Store Toolkit into the notebook kernel:

```shell
python3 -m pip install "phenopacket-store-toolkit[release]"
```

In [45]:
import math

import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

The input is the ZIP file that is, or will be, attached to each release.

The ZIP file can be generated by running:

```shell
python3 -m ppktstore package --notebook-dir notebooks --release-tag 0.1.19 --output all_phenopackets
```

This assumes that `phenopacket-store-toolkit` is installed in the active environment and that `notebooks` points to the Phenopacket Store notebook directory.

Note that the value of `--release-tag` must be updated to match the release being packaged.

In [46]:
import zipfile
from ppktstore.model import PhenopacketStore

release_tag = '0.1.19'

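# NOTE: despite its name, `input_zip` points to a local directory here, not to a release ZIP,
# so the store is read with `from_notebook_dir` rather than through `zipfile.ZipFile`.
input_zip = "/Users/leonardo/git/malco/in_multlingual_nov24/prompts/used_ppkts/"
store = PhenopacketStore.from_notebook_dir(input_zip, "jsons/")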
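
Before loading an archive, it can be useful to check what it contains. The following is a minimal sketch that uses only the standard library and an in-memory archive, so it runs without a real release file. The helper name `list_json_entries` and the demo file names are hypothetical, and the assumption that phenopackets are stored as `.json` members is inferred from the `filename` column of the summary table, not from the toolkit documentation.

```python
import io
import zipfile


def list_json_entries(zf: zipfile.ZipFile) -> list[str]:
    """Return the sorted names of all `.json` members of an open ZIP archive."""
    return sorted(
        info.filename
        for info in zf.infolist()
        if not info.is_dir() and info.filename.endswith(".json")
    )


# Build a tiny in-memory archive (hypothetical layout) for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo/jsons/PMID_1_A.json", "{}")
    zf.writestr("demo/jsons/PMID_1_B.json", "{}")
    zf.writestr("demo/README.txt", "not a phenopacket")

# Re-open the archive for reading and list the JSON members.
with zipfile.ZipFile(buffer) as zf:
    entries = list_json_entries(zf)

print(len(entries), "JSON entries")
```

For a real release, the same helper could be applied to `zipfile.ZipFile(path_to_release_zip)`, where `path_to_release_zip` is whatever path the packaging command produced.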

Now we can summarize statistics of the individuals described in the phenopackets, their phenotypic features, diseases, and genotypes.

In [47]:
from ppktstore.release.stats import PPKtStoreStats

stats = PPKtStoreStats(store)

The summary table has one row per phenopacket. Here it is sorted by gene, and the first rows are shown.

In [48]:
df = stats.get_summary_df().sort_values(by='gene')
df.head()

| | patient_id | cohort | disease_id | disease | gene | allele_1 | allele_2 | PMID | filename |
|---|---|---|---|---|---|---|---|---|---|
| 2210 | Family 1 proband | singlecohort | OMIM:148600 | Keratoderma, palmoplantar, punctate type IA | AAGAB | NM_024666.5:c.505_506dup | | PMID:28239884 | singlecohort/jsons/PMID_28239884_Family1proband.json |
| 3458 | Family 3 proband | singlecohort | OMIM:148600 | Keratoderma, palmoplantar, punctate type IA | AAGAB | NM_024666.5:c.870+1G>A | | PMID:28239884 | singlecohort/jsons/PMID_28239884_Family3proband.json |
| 160 | Family 2 proband | singlecohort | OMIM:148600 | Keratoderma, palmoplantar, punctate type IA | AAGAB | NM_024666.5:c.473del | | PMID:28239884 | singlecohort/jsons/PMID_28239884_Family2proband.json |
| 88 | II.2 | singlecohort | OMIM:601718 | Retinitis pigmentosa 19 | ABCA4 | NM_000350.3:c.1938-1G>A | | PMID:10874631 | singlecohort/jsons/PMID_10874631_II2.json |
| 4290 | PATIENT II.1 | singlecohort | OMIM:301310 | Anemia, sideroblastic, and spinocerebellar ataxia | ABCB7 | NM_001271696.3:c.1231G>C | | PMID:11118249 | singlecohort/jsons/PMID_11118249_PATIENTII1.json |


## Individual statistics

In [49]:
from ppktstore.release.stats import summarize_individuals

individuals_df = summarize_individuals(store)
individuals_df.head(10)

| | id | sex | age_in_days | age_in_years | vital_status |
|---|---|---|---|---|---|
| 0 | PMID_34722527_individual_048-051_1_Thaddeus_P__Dryja_Null RPGRIP1 Al-individual_048-051_1_Thaddeus_P__Dryja_Null RPGRIP1 Al | UNKNOWN_SEX | | | |
| 1 | PMID_23407777_23407777_P1-23407777_P1 | FEMALE | 44.0 | 0.120465 | |
| 2 | PMID_31239556_individual_22_father-individual 22, father | MALE | 11322.75 | 31.0 | |
| 3 | PMID_29469822_Family_4_II-2-Family 4 II-2 | MALE | 4.0 | 0.010951 | |
| 4 | PMID_31021519_SATB2_47_from_Zarate_et_al__2018a__Bengani_et_al-SATB2-47 from Zarate et al., 2018a; Bengani et al. | MALE | 2556.75 | 7.0 | |
| 5 | PMID_37196654_Individual_5-Individual 5 | MALE | 9131.25 | 25.0 | |
| 6 | PMID_29290338_Family_UAB_R45201FN_101_individual_RS-Family UAB-R45201FN.101 individual RS | MALE | 1461.0 | 4.0 | UNKNOWN_STATUS |
| 7 | PMID_36446582_Novara_2017_P2-Novara, 2017_P2 | MALE | | | |
| 8 | PMID_29122497_29122497_P8-29122497_P8 | MALE | 300.0 | 0.821355 | |
| 9 | STX_EG1010P-STX_EG1010P | UNKNOWN_SEX | 1461.0 | 4.0 | |
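
The `age_in_days` and `age_in_years` columns in the output above are consistent with a conversion factor of 365.25 days per year (for example, 11322.75 / 365.25 = 31). The sketch below reproduces a few rows under that assumption. The factor is inferred from the displayed values rather than taken from the toolkit source, and `days_to_years` is a hypothetical helper.

```python
# Assumption: one year is 365.25 days, inferred from the values shown in the table.
DAYS_PER_YEAR = 365.25


def days_to_years(age_in_days: float) -> float:
    """Convert an age in days to an age in years."""
    return age_in_days / DAYS_PER_YEAR


# Values taken from the `age_in_days` column of the output above.
for days in (44.0, 4.0, 300.0, 11322.75):
    print(f"{days} days = {days_to_years(days):.6f} years")
```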


### Summary statistics


#### Sex
The number of males and females in all case report collections.

In [50]:
sex_summary = {
    'males': sum(individuals_df.sex=='MALE'),
    'females': sum(individuals_df.sex=='FEMALE'),
    'unknown': sum(individuals_df.sex=='UNKNOWN_SEX')
}
sex_summary

{'males': 1848, 'females': 1611, 'unknown': 1507}

In [51]:
n_w_sex = sex_summary['males'] + sex_summary['females']
perc_w_sex = (100 * n_w_sex) / sum(sex_summary.values())
perc_males = (100 * sex_summary['males']) / n_w_sex
perc_females = (100 * sex_summary['females']) / n_w_sex

f'{n_w_sex} ({perc_w_sex:.1f}%) had the sex specified ({perc_males:.1f}% males, {perc_females:.1f}% females)'

'3459 (69.7%) had the sex specified (53.4% males, 46.6% females)'
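
The same arithmetic can be packaged as a small function. This is a sketch using only the standard library, with a hypothetical helper `summarize_sex`, and it is not part of the toolkit. Unlike the cell above, it takes the total from the number of labels, so any label other than `MALE`, `FEMALE`, or `UNKNOWN_SEX` (the Phenopacket Schema also defines `OTHER_SEX`) would still count toward the denominator. In this release the three counts add up to all 4966 phenopackets, so both approaches give the same result.

```python
from collections import Counter


def summarize_sex(labels: list[str]) -> dict[str, float]:
    """Count MALE/FEMALE labels and express them as percentages."""
    counts = Counter(labels)
    n_male = counts["MALE"]
    n_female = counts["FEMALE"]
    n_known = n_male + n_female
    total = len(labels)
    return {
        "n_known": n_known,
        # Guard against empty input to avoid division by zero.
        "pct_known": 100 * n_known / total if total else 0.0,
        "pct_male": 100 * n_male / n_known if n_known else 0.0,
        "pct_female": 100 * n_female / n_known if n_known else 0.0,
    }


# Rebuild the label list from the counts reported above.
labels = ["MALE"] * 1848 + ["FEMALE"] * 1611 + ["UNKNOWN_SEX"] * 1507
sex_stats = summarize_sex(labels)
print(sex_stats)
```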


#### Age
The number and percentage of individuals with age information available.

In [52]:
n_no_age = sum(individuals_df.age_in_days.isna())
n_w_age = len(individuals_df) - n_no_age
age_summary = {
    'individuals with no age': f'{n_no_age} ({n_no_age * 100 / len(individuals_df):.1f}%)',
    'individuals with age': f'{n_w_age} ({n_w_age * 100 / len(individuals_df):.1f}%)',
}
age_summary

{'individuals with no age': '1977 (39.8%)',
 'individuals with age': '2989 (60.2%)'}

In [54]:
stats_d = stats.get_descriptive_stats(version=release_tag)
# Turn the {name: value} mapping into a two-column table for display.
items = [{"item": k, "value": v} for k, v in stats_d.items()]
pd.DataFrame(items)


| | item | value |
|---|---|---|
| 0 | version | 0.1.19 |
| 1 | phenopackets | 4966 |
| 2 | diseases | 378 |
| 3 | genes | 343 |
| 4 | alleles | 2934 |
| 5 | PMIDs | 729 |
| 6 | individuals per gene (max) | 459 |
| 7 | individuals per gene (min) | 1 |
| 8 | individuals per gene (mean) | 14.478134 |
| 9 | individuals per gene (median) | 3.0 |
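
The "individuals per gene" rows can be read as summary statistics over the number of individuals recorded for each gene. The reported mean is consistent with dividing the number of phenopackets by the number of genes (4966 / 343 ≈ 14.478). The sketch below shows one way such figures could be derived from a list of gene symbols. It assumes that each individual is counted under exactly one gene, which may not match how the toolkit computes them, and `individuals_per_gene` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean, median


def individuals_per_gene(genes: list[str]) -> dict[str, float]:
    """Summarize how many individuals are recorded for each gene symbol."""
    per_gene = list(Counter(genes).values())
    return {
        "max": max(per_gene),
        "min": min(per_gene),
        "mean": mean(per_gene),
        "median": median(per_gene),
    }


# Gene symbols of the five rows shown in the summary table earlier in this notebook.
genes = ["AAGAB", "AAGAB", "AAGAB", "ABCA4", "ABCB7"]
gene_summary = individuals_per_gene(genes)
print(gene_summary)
```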
