# UDN Data Characterization

This notebook characterizes the clinical data in the UDN to give a first glimpse of what the network contains and how it is described.

## 1. DATA: UDN Network Resource 

The Undiagnosed Diseases Network (UDN), funded by the NIH Common Fund, is a research study to improve diagnosis and care of patients with undiagnosed conditions. The UDN established a nationwide network of clinicians and researchers who use both basic and clinical research to uncover the underlying disease mechanisms associated with these conditions. In its first 20 months, the UDN accepted 601 participants undiagnosed by traditional medical practices. Of those who completed their UDN evaluation during this time, 35% were given a diagnosis. Many of these diagnoses were rare genetic diseases including 31 previously unknown syndromes. 

The specific goals of UDN are to: (1) improve the level of diagnosis and care for patients with undiagnosed diseases through the development of common protocols designed by a large community of investigators; (2) facilitate research into the etiology of undiagnosed diseases, by collecting and sharing standardized, high-quality clinical and laboratory data including genotyping, phenotyping, and documentation of environmental exposures; and (3) create an integrated and collaborative community across multiple clinical sites and among laboratory and clinical investigators prepared to investigate the pathophysiology of these new and rare diseases.

For more information, please refer to https://commonfund.nih.gov/diseases

### PIC-SURE API

Databases exposed through PIC-SURE API encompass a wide heterogeneity of architectures and data models underneath. PIC-SURE hides this complexity, allowing researchers to access data in a normalized way and focus on the analysis and medical insights. The API is available in Python and R programming languages. 

The API is actively developed by the Avillach-Lab at Harvard Medical School. For more information, please refer to the GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client

---

### Environment setup

* Pre-requisites: Python >= 3.7

In [None]:
!cat requirements.txt

In [None]:
%load_ext autoreload
%autoreload 2

# set up environment
import sys
!{sys.executable} -m pip install -r requirements.txt
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git


### Imports

In [None]:
# Useful to estimate execution time of the Notebook
from datetime import datetime
then = datetime.now()

# pic-sure api lib
import PicSureHpdsLib
import PicSureClient

# python_lib for pic-sure
from python_lib.HPDS_connection_manager import tokenManager
from python_lib.utils import get_multiIndex_variablesDict

# analysis
import pandas as pd
from pprint import pprint
import matplotlib.pyplot as plt
import numpy as np
import collections as collec

### Functions

In [None]:
def get_data_df(column_head):
    """Enables the user to download the data as a pandas dataframe indexed by UDN IDs (through API)
    Parameters : column_head : string, with the name of the header that will be selected. For example, if the columns that 
                                should be selected contain "this string", then column_head="this string".
    Returns: df : dataframe indexed by UDN IDs of the selected columns
    """
    dictionary=resource.dictionary().find(column_head)
    query=resource.query()
    query.select().add(dictionary.keys())
    query.select().add('\\000_UDN ID\\')
    df=query.getResultsDataFrame()
    df.set_index("\\000_UDN ID\\", inplace=True)
    query.select().clear()
    return df

def get_tree_df(parent_class):
    """Enables the user to show multi-index variable dictionary by parent class (level 0)
    Parameters : parent_class : string, the name of the parent class
    Returns : df : dataframe sliced by the parent class input
    """
    dictdf = resource.dictionary().find().DataFrame()
    vdict = get_multiIndex_variablesDict(dictdf)
    mask = vdict.index.get_level_values(0) == parent_class
    df = vdict.loc[mask,:]
    return df
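
As a rough illustration of what `get_tree_df` does, the sketch below builds a small MultiIndex by splitting backslash-delimited keys and then keeps the rows whose level 0 matches a parent class. The keys and the helper `toy_multiindex` are hypothetical; the real `get_multiIndex_variablesDict` in `python_lib.utils` may build its index differently.

```python
import pandas as pd

# Hypothetical variable keys, shaped like the backslash-delimited paths used in this notebook
toy_keys = [
    "\\00_Demographics\\Gender\\",
    "\\00_Demographics\\Race\\",
    "\\13_Status\\",
]

def toy_multiindex(keys):
    """Split each key on backslashes and pad to equal depth (assumed layout, for illustration only)."""
    parts = [key.strip("\\").split("\\") for key in keys]
    depth = max(len(p) for p in parts)
    padded = [tuple(p + [""] * (depth - len(p))) for p in parts]
    names = ["level_{}".format(i) for i in range(depth)]
    return pd.MultiIndex.from_tuples(padded, names=names)

toy_dict = pd.DataFrame({"KEY": toy_keys}, index=toy_multiindex(toy_keys))

# Same masking pattern as get_tree_df: compare level 0 against the parent class
mask = toy_dict.index.get_level_values(0) == "00_Demographics"
toy_slice = toy_dict.loc[mask, :]
print(toy_slice)
```

Only the two `00_Demographics` rows survive the mask; the `13_Status` row is dropped.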

### Parameters and metadata

In [None]:
# set up displaying options for tables and plots
## tables: 
pd.set_option("display.max_rows", 435)

## plots:
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 14
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

font = {'weight' : 'bold',
        'size'   : 12}

plt.rc('font', **font)

In [None]:
# print metadata
print("The PIC-SURE API libraries versions you've been downloading are: \n- PicSureClient: {0}\n- PicSureHpdsLib: {1}".format(PicSureClient.__version__, PicSureHpdsLib.__version__))

In [None]:
print("UDN database time stamp: {}".format(then))

## 2. DATA ACCESS Workflow
### 1. Connect to the UDN data resource using the HPDS adapter

In [None]:
# token is the individual key given to connect to the UDN resource
token_file = "token.txt"
token = tokenManager(token_file).get_token()

In [None]:
# Connection to the PIC-SURE API w/ key
connection = PicSureClient.Client.connect("https://udn.hms.harvard.edu/picsure", token)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource("c23b6814-7e5b-48d2-80d9-65511d7d2051")

In [None]:
# get object information
resource.help()

### 2. Explore data: data structure description

**Methods**:

    * Search: Dictionary method
    * Retrieve: Query method

**Data structures**:

    * Dictionary object structure
    * Query object structure 

In [None]:
# create a dictionary object and search
dictionary = resource.dictionary()
lookup = dictionary.find()

In [None]:
# description of the dictionary 
pprint({"Count": lookup.count(), 
        "Keys": lookup.keys()[0:2],
        "Entries": lookup.entries()[0:2]})

In [None]:
# show table of records from the dictionary object
data = lookup.DataFrame()
print('data structure: {}'.format(data.shape))
data.head()

The UDN network resource contains 13414 variables described by 10 data fields:
* HpdsDataType
* description
* categorical
* categoryValues
* values
* continuous
* min
* max
* observationCount
* patientCount

#### Characterize patient records

* patient variables (keys)
* patient records (entries data fields)

In [None]:
# some stats
print('## Variables:')
print('* number of variables: {}'.format(len(data)))
print('* data types: {}'.format(data.HpdsDataType.unique()))
print()
print('## Distribution of data types:')
pprint(data.HpdsDataType.value_counts())
print('* list of variables "info": {}'.format(list(data.query('HpdsDataType == "info"').index.values)))
print()
print('## Distribution of categorical variables:')
pprint(data.groupby('HpdsDataType').categorical.value_counts())
print('* list of non categorical variables: {}'.format(list(data.query('categorical == False').index.values)))
print()
print('## Continuous variables: {}'.format(len(data.query('continuous == True'))))
print('* Number of continuous variables: {}'.format(len(data.query('continuous == True'))))
print('* Show continuous variable:')
data.query('continuous == True')

In [None]:
# check the tree structure of the variables
variablesDict = get_multiIndex_variablesDict(data)
print(variablesDict.shape)
# head
variablesDict.head()

In [None]:
# tail
variablesDict.tail()

In [None]:
# get parent classes
pprint(variablesDict.index.get_level_values(0).unique())

In [None]:
print('Number of parent classes: {}'.format(len(variablesDict.index.get_level_values(0).unique())))

#### Variables tree description

Three distinct parent classes (level 0):

1. 'Info' type
2. 000_UDN ID
3. 00..14 parent classes (17 classes): 

    01. 00_Demographics
    02. 01_Primary symptom category reported by patient or caregiver
    03. 02_Sequence Submitted
    04. 02_Type of sequencing
    05. 03_UDN Clinical Site
    06. 04a_Clinical symptoms and physical findings (in HPO, from PhenoTips)
    07. 04b_Clinical symptoms and physical findings (in HPO, from PhenoTips) 
    08. 04c_Clinical symptoms and physical findings (in HPO, from PhenoTips)
    09. 05_Maternal ethnicity (from PhenoTips)
    10. 06_Paternal ethnicity (from PhenoTips)
    11. 08_Family history (from PhenoTips)
    12. 09_Prenatal and perinatal history (from PhenoTips)
    13. 10_Medications (from PhenoTips)
    14. 11_Candidate genes
    15. 12_Candidate variants
    16. 13_Status
    17. 14_Disorders (in OMIM, from PhenoTips)

    
**Phenotips** [https://www.phenotips.com]: Workflows for genomic medicine. Capture patient _symptoms_ and _family history_ with intuitive tools that make your data standardized and interoperable.

In [None]:
# Phenotype variables: patientCount and ObservationCount data fields 
# Number of variables with different patient and observation counts
print('Number of variables with different patient and observation counts: {}'.format(len(data.query('HpdsDataType == "phenotypes" and observationCount != patientCount'))))

In [None]:
# Distribution of phenotype variables by patientCount: most common patientCount
# subset phenotypes variables-patientCounts dataframe
print(data.query('HpdsDataType == "phenotypes"')[['patientCount']].shape)
variable_patientCount_df = (data.query('HpdsDataType == "phenotypes"')
                            [['patientCount']]
                            .sort_values(by='patientCount', ascending=False)
                            .copy()
                           )
variable_patientCount_df.head()

In [None]:
print('number of phenotype variables: {}'.format(len(variable_patientCount_df)))
print('number of patients: {}'.format(variable_patientCount_df.loc['\\000_UDN ID\\','patientCount']))

In [None]:
# description of patientCount unique values
patient_df = variable_patientCount_df.groupby('patientCount')['patientCount'].unique().reset_index(name='uniquePatientCounts').copy()
print('number of patientCount values: {}'.format(len(patient_df)))
print('max patientCount value: {}'.format(patient_df.patientCount.max()))
print('min patientCount value: {}'.format(patient_df.patientCount.min()))
print('median patientCount value: {}'.format(patient_df.patientCount.median()))
print('mean patientCount value: {}'.format(patient_df.patientCount.mean()))
print('st. deviation patientCount value: {}'.format(patient_df.patientCount.std()))
patient_df

In [None]:
patient_df.describe().round()

In [None]:
# plot
plt.plot(patient_df.uniquePatientCounts)

In [None]:
# boxplot of patientCount
plt.boxplot(variable_patientCount_df.patientCount)

In [None]:
# most common patientCount: description of patientCount frequency
patient_freq_df = variable_patientCount_df.groupby('patientCount')['patientCount'].count().reset_index(name='PatientCount Frequency').copy()
print('number of patientCount values: {}'.format(len(patient_freq_df)))
print('max patientCount value: {}'.format(patient_freq_df.patientCount.max()))
print('min patientCount value: {}'.format(patient_freq_df.patientCount.min()))
print('median patientCount value: {}'.format(patient_freq_df.patientCount.median()))
print('mean patientCount value: {}'.format(patient_freq_df.patientCount.mean()))
print('st. deviation patientCount value: {}'.format(patient_freq_df.patientCount.std()))
patient_freq_df

In [None]:
patient_freq_df.describe()

In [None]:
# hist of patientCount
plt.hist(variable_patientCount_df.patientCount,bins=50)

In [None]:
# alternative way to calculate variable distribution by patientCount dataframe
patient_distribution_df = (variable_patientCount_df
       .reset_index()
       .rename(columns={'KEY':'phenotypeVariable'})
       .groupby('patientCount')['phenotypeVariable']
       .nunique()
       .reset_index(name = 'variableFrequency')
)

patient_distribution_df['variablePercentage'] = 100 * patient_distribution_df['variableFrequency'] / patient_distribution_df['variableFrequency'].sum()
print(patient_distribution_df.shape)
patient_distribution_df.round()
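
The distribution computed above boils down to counting how many variables share each `patientCount` value and expressing that count as a share of all variables. A minimal sketch of the same arithmetic on made-up numbers (the values in `toy_patient_counts` are hypothetical, not UDN data):

```python
from collections import Counter

# Hypothetical patientCount values for six phenotype variables
toy_patient_counts = [1, 1, 1, 2, 2, 5]

# Number of variables observed at each patientCount value
variable_frequency = Counter(toy_patient_counts)

# Share of variables at each patientCount value, in percent
total_variables = sum(variable_frequency.values())
variable_percentage = {
    count: 100 * n / total_variables
    for count, n in sorted(variable_frequency.items())
}

for count, pct in variable_percentage.items():
    print("patientCount={}: {} variables ({:.1f}%)".format(count, variable_frequency[count], pct))
```

Here half of the toy variables have `patientCount` equal to 1, which mirrors how the most frequent `patientCount` is read off the real table.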

In [None]:
# barplot
x_values = patient_distribution_df.patientCount
x_pos = np.arange(len(x_values))
y_values = patient_distribution_df.variablePercentage
plt.bar(x_pos, y_values, align='center', alpha=0.5)
plt.xticks(x_pos, x_values)
plt.ylabel('Variables Percentage')
plt.xlabel('PatientCount')
plt.title('Phenotype variables distribution by patientCount')
plt.show()

In [None]:
# clearer distribution: filter patientCount <= 50.0
patient_distribution_df = (variable_patientCount_df
       .reset_index()
       .rename(columns={'KEY':'phenotypeVariable'})
       .groupby('patientCount')['phenotypeVariable']
       .nunique()
       .reset_index(name = 'variableFrequency')
)

patient_distribution_df['variablePercentage'] = 100 * patient_distribution_df['variableFrequency'] / patient_distribution_df['variableFrequency'].sum()
print(patient_distribution_df.shape)
patient_distribution_df.round().sort_values(by='variablePercentage', ascending=False)

In [None]:
# barplot
#patient_distribution_percge1_df = patient_distribution_df.query('variablePercentage >= 1.0').copy()
patient_distribution_percge1_df = patient_distribution_df.query('patientCount <= 50.').copy()
x_values = patient_distribution_percge1_df.patientCount
x_pos = np.arange(len(x_values))
y_values = patient_distribution_percge1_df.variablePercentage
plt.bar(x_pos, y_values, align='center', alpha=0.5)
plt.xticks(x_pos, x_values)
plt.ylabel('Variables Percentage')
plt.xlabel('PatientCount')
plt.xticks(rotation=90)
plt.title('Phenotype variables distribution by patientCount')
plt.show()

In [None]:
# Distribution of patients by variable: most common phenotype variables
# most frequent phenotypes: description of phenotype frequency: 
phenotypes_freq = (data.query('HpdsDataType == "phenotypes"')
                   [['patientCount']]
                   .sort_values(by = 'patientCount', ascending = False)
                   .copy())
phenotypes_freq.head()

In [None]:
patients = phenotypes_freq.loc['\\000_UDN ID\\','patientCount']
phenotypes_freq['patientCount Percentage'] = phenotypes_freq.patientCount.apply(lambda x: round(x*100/patients))
phenotypes_freq.drop_duplicates().head(50)

In [None]:
phenotypes_freq.drop_duplicates().tail(50)

In [None]:
phenotypes_freq.describe().round()

In [None]:
# most frequent abnormal phenotypes
abnormal_phenotypes = resource.dictionary().find('Phenotypic abnormality').DataFrame().copy()
print(abnormal_phenotypes.shape)
abnormal_phenotypes.head(2)

In [None]:
phenotypes = phenotypes_freq.patientCount.count()
abnormalPhenotypes = len(abnormal_phenotypes)
print('There are {} ({}%) phenotypic abnormality phenotypes.'.format(abnormalPhenotypes,round(abnormalPhenotypes*100/phenotypes)))

In [None]:
abnormal_phenotypes_freq = (abnormal_phenotypes[['patientCount']]
                               .sort_values(by = 'patientCount', ascending = False)
                               .copy())
abnormal_phenotypes_freq.head()

In [None]:
patients = phenotypes_freq.loc['\\000_UDN ID\\','patientCount']
abnormal_phenotypes_freq['patientCount Percentage'] = abnormal_phenotypes_freq.patientCount.apply(lambda x: round(x*100/patients))
abnormal_phenotypes_freq.drop_duplicates().head(50)

In [None]:
abnormal_phenotypes_freq.describe().round()

### 3. Data characterization
#### Download data
##### demographics

In [None]:
# download data
demographics=get_data_df("\\00_Demographics\\")
print(demographics.shape)
demographics.head()

In [None]:
# get the Demographics dictionary
demographicsDict = get_tree_df("00_Demographics")
print(demographicsDict.shape)
demographicsDict

##### Age group separation: adult and pediatrics

In [None]:
# break down the analysis in two groups: pediatric (<18 yo) and adults (>=18 yo)
pediatric_patients=list(demographics["\\00_Demographics\\Age at symptom onset in years\\"][demographics["\\00_Demographics\\Age at symptom onset in years\\"]<18.0].index)
adult_patients=list(demographics["\\00_Demographics\\Age at symptom onset in years\\"][demographics["\\00_Demographics\\Age at symptom onset in years\\"]>=18.0].index)

In [None]:
patients = len(demographics)
print('Number of patient records: {}'.format(patients))
pediatrics = len(pediatric_patients)
pediatrics_percentage = round(100 * pediatrics / patients)
print('Number of pediatric patient records: {} individuals ({}%)'.format(pediatrics,pediatrics_percentage))
adults = len(adult_patients)
adults_percentage = round(100 * adults / patients)
print('Number of adult patient records: {} individuals ({}%)'.format(adults,adults_percentage))
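
The split above relies on boolean masks over the age-at-symptom-onset column. A minimal sketch with made-up ages (the IDs and values below are hypothetical) shows one consequence worth keeping in mind: a record with a missing age satisfies neither `< 18.0` nor `>= 18.0`, so it lands in neither group, and the two group sizes need not add up to the cohort size.

```python
import numpy as np
import pandas as pd

# Hypothetical ages at symptom onset; the last participant has no recorded age
toy_age = pd.Series([2.0, 17.9, 18.0, 40.0, np.nan],
                    index=["P1", "P2", "P3", "P4", "P5"])

# Same masking pattern as the notebook cell
toy_pediatric = list(toy_age[toy_age < 18.0].index)
toy_adult = list(toy_age[toy_age >= 18.0].index)

# Comparisons with NaN are False, so missing ages must be found explicitly
toy_missing = list(toy_age[toy_age.isna()].index)

print(toy_pediatric, toy_adult, toy_missing)
```

Counting the missing group separately makes it easy to check that pediatric, adult, and missing records together cover every patient.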

##### Analysis of demographics

In [None]:
# subset the demographics dataframe into pediatric and adult patients
demographics_pediatric = demographics.loc[pediatric_patients]
demographics_adult = demographics.loc[adult_patients]

In [None]:
print("Count eth for general ",collec.Counter(demographics['\\00_Demographics\\Ethnicity\\']))
print("Count eth for pediatric ",collec.Counter(demographics_pediatric['\\00_Demographics\\Ethnicity\\']))
print("Count eth for adult ",collec.Counter(demographics_adult['\\00_Demographics\\Ethnicity\\']))
all_eth = collec.Counter(demographics['\\00_Demographics\\Ethnicity\\'])
pediatric_eth = collec.Counter(demographics_pediatric['\\00_Demographics\\Ethnicity\\'])
adult_eth = collec.Counter(demographics_adult['\\00_Demographics\\Ethnicity\\'])
# format each count as "count (percentage of the whole cohort)"; new dicts keep the Counters' integer counts intact
all_eth_percentage = {key: "{} ({})".format(n, round(n * 100 / patients)) for key, n in all_eth.items()}
pediatric_eth_percentage = {key: "{} ({})".format(n, round(n * 100 / patients)) for key, n in pediatric_eth.items()}
adult_eth_percentage = {key: "{} ({})".format(n, round(n * 100 / patients)) for key, n in adult_eth.items()}
print("\nPercentage eth for general ", all_eth_percentage)
print("Percentage eth for pediatric ", pediatric_eth_percentage)
print("Percentage eth for adult ", adult_eth_percentage)

In [None]:
print("Count race for general ",collec.Counter(demographics["\\00_Demographics\\Race\\"]))
print("Count race for pediatric ",collec.Counter(demographics_pediatric["\\00_Demographics\\Race\\"]))
print("Count race for adult ",collec.Counter(demographics_adult["\\00_Demographics\\Race\\"]))
all_race = collec.Counter(demographics["\\00_Demographics\\Race\\"])
pediatric_race = collec.Counter(demographics_pediatric["\\00_Demographics\\Race\\"])
adult_race = collec.Counter(demographics_adult["\\00_Demographics\\Race\\"])
# format each count as "count (percentage of the whole cohort)"; new dicts keep the Counters' integer counts intact
all_race_percentage = {key: "{} ({})".format(n, round(n * 100 / patients)) for key, n in all_race.items()}
pediatric_race_percentage = {key: "{} ({})".format(n, round(n * 100 / patients)) for key, n in pediatric_race.items()}
adult_race_percentage = {key: "{} ({})".format(n, round(n * 100 / patients)) for key, n in adult_race.items()}
print("\nPercentage race for general ", all_race_percentage)
print("Percentage race for pediatric ", pediatric_race_percentage)
print("Percentage race for adult ", adult_race_percentage)
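
The percentages printed above for the pediatric and adult groups are divided by `patients`, the size of the whole cohort, so they read as "share of all patients" rather than "share of the group". The sketch below contrasts the two denominators on made-up values; `count_with_percentage` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Hypothetical race values for a six-person cohort and its two-person adult subset
toy_race_all = ["White", "White", "White", "Asian", "White", "Asian"]
toy_race_adult = ["White", "Asian"]

def count_with_percentage(values, denominator):
    """Format each category as 'count (rounded percentage of denominator)'."""
    counts = Counter(values)
    return {key: "{} ({})".format(n, round(100 * n / denominator))
            for key, n in counts.items()}

# Denominator = whole cohort (what the notebook cells compute)
share_of_cohort = count_with_percentage(toy_race_adult, len(toy_race_all))

# Denominator = the group itself (within-group share)
share_of_group = count_with_percentage(toy_race_adult, len(toy_race_adult))

print(share_of_cohort)
print(share_of_group)
```

Which denominator is appropriate depends on the question being asked; stating it next to each table avoids mixing the two conventions.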

In [None]:
# get the statistics for demographics for all patients
demographics.describe().round()

In [None]:
# get the statistics for demographics for pediatric patients
demographics_pediatric.describe().round()

In [None]:
# get the statistics for demographics for adult patients
demographics_adult.describe().round()

In [None]:
def show_age_distrib(demographics):
    """Show the age distribution in the network
    Parameters: demographics: pd dataframe, with columns containing age at symptom onset
    Returns: None
    Shows the age distribution as a plot
    """
    # count patients per age at symptom onset (missing ages dropped) and sort by age so the line is drawn left to right
    age_counts = demographics["\\00_Demographics\\Age at symptom onset in years\\"].dropna().value_counts().sort_index()
    plt.figure(figsize=(20,20))
    plt.plot(age_counts.index, age_counts.values)
    plt.title("Age at symptom onset (in y) distribution in UDN")
    plt.xlabel("Age at symptom onset (in y)")
    plt.ylabel("Count of patients")
    plt.show()

In [None]:
show_age_distrib(demographics)

In [None]:
# get the gender count, for adult or pediatric, diagnosed and undiagnosed
gender_count = demographics.groupby("\\00_Demographics\\Gender\\")['Patient ID'].nunique()
gender_count_ped = demographics_pediatric.groupby("\\00_Demographics\\Gender\\")['Patient ID'].nunique()
gender_count_adu = demographics_adult.groupby("\\00_Demographics\\Gender\\")['Patient ID'].nunique()

In [None]:
# groupby sorts the gender labels alphabetically, so position 0 is assumed to be Female and position 1 Male
print("Gender count general")
print(gender_count)
print('All: {}'.format(len(demographics)))
print('Percentage Female: {}'.format(round(100 * gender_count.iloc[0]/len(demographics))))
print('Percentage Male: {}'.format(round(100 * gender_count.iloc[1]/len(demographics))))
print()
print("Gender count pediatric")
print(gender_count_ped)
print('Pediatrics: {}'.format(len(demographics_pediatric)))
print('Percentage Female: {}'.format(round(100 * gender_count_ped.iloc[0]/len(demographics_pediatric))))
print('Percentage Male: {}'.format(round(100 * gender_count_ped.iloc[1]/len(demographics_pediatric))))
print()
print("Gender count adult")
print(gender_count_adu)
print('Adults: {}'.format(len(demographics_adult)))
print('Percentage Female: {}'.format(round(100 * gender_count_adu.iloc[0]/len(demographics_adult))))
print('Percentage Male: {}'.format(round(100 * gender_count_adu.iloc[1]/len(demographics_adult))))

### 4. Results
#### 4.1 General summary

* UDN database accessed on: 2020-01-27
* Cohort: 1570 individuals
* Number of variables: 13414
    * Number of `info` type: 6 (0.04%)
    * Number of `phenotype` type (ICD9 diagnostic codes ?): 13408 (99.96%)
* Categorical variables: all are phenotypes:
    ```
    HpdsDataType  categorical
    phenotypes    True           13402
                  False              6
    ```
    * list of non categorical phenotypes: ('\\00_Demographics\\Age at UDN Evaluation (in years)\\', '\\09_Prenatal and perinatal history (from PhenoTips)\\Paternal Age\\', '\\00_Demographics\\Age at symptom onset in years\\', '\\09_Prenatal and perinatal history (from PhenoTips)\\Gestation\\', '\\09_Prenatal and perinatal history (from PhenoTips)\\Maternal Age\\', '\\00_Demographics\\Current age in years\\')
* Continuous variables: only one, `Variant_frequency_in_ExAC`, of `info` type.
* Variables tree-like organization description: 
    * Number of levels: 0-14
    * Parent class (level 0): 24
        * `info` variables: 6 ('Gene_with_variant', 'Variant_class', 'Variant_consequence_calculated', 'Variant_frequency_as_text', 'Variant_frequency_in_ExAC', 'Variant_severity')
        * `phenotype` variables: 18 
        ```
       '000_UDN ID', 
       '00_Demographics',
       '01_Primary symptom category reported by patient or caregiver',
       '02_Sequence Submitted', '02_Type of sequencing',
       '03_UDN Clinical Site',
       '04a_Clinical symptoms and physical findings (in HPO, from PhenoTips)',
       '04b_Clinical symptoms and physical findings (in HPO, from PhenoTips)',
       '04c_Clinical symptoms and physical findings (in HPO, from PhenoTips)',
       '05_Maternal ethnicity (from PhenoTips)',
       '06_Paternal ethnicity (from PhenoTips)',
       '08_Family history (from PhenoTips)',
       '09_Prenatal and perinatal history (from PhenoTips)',
       '10_Medications (from PhenoTips)', 
       '11_Candidate genes',
       '12_Candidate variants', 
       '13_Status',
       '14_Disorders (in OMIM, from PhenoTips)'
       ```
* Entries description (10 data fields):
    * HpdsDataType
    * description
    * categorical
    * categoryValues
    * values
    * continuous
    * min
    * max
    * observationCount
    * patientCount
* Phenotypes distribution: 46% of phenotypes are recorded for only one patient (patientCount = 1.0). The majority (75%) of patientCount values are uncommon (shared by fewer than 10.8 variables).
* Most frequent patientCount: 1.0
* Patients distribution: 84% of patients have the `Clinical symptoms and physical findings` variable recorded. The majority (75%) of variables are uncommon (patientCount < 4.0).
* Most frequent patient phenotype variables (keys): 
    * 04c_Clinical symptoms and physical findings (in HPO, from PhenoTips)\A 1198 patients (76%)	
    * 04c_Clinical symptoms and physical findings (in HPO, from PhenoTips)\P (72%)
    * 04c_Clinical symptoms and physical findings (in HPO, from PhenoTips)\S (71%)
* Most frequent patient-specific phenotype variables (keys): abnormal phenotypes (12077 (90.0%) phenotypic abnormalities) are not very common; the majority have low frequencies
    * Global developmental delay: 375 patients (24.0%)
    * Seizures: 288 patients (18.0%)
    * Generalized hypotonia: 230 patients (15.0%)

#### 4.1 Table 1
Table 1 presents the characterization of entries recorded for the 1570 UDN individuals.

In [None]:
table_list = [
    {'Attribute':'Age',
     'Value, median (IQR)':'13.0 (5-32)'
    },
    {'Attribute':'Female (%)',
     'Value, median (IQR)':'49.0'
    },
    {'Attribute':'PatientCount Frequency',
     'Value, median (IQR)':'2.0 (1.0-10.8)'
    },
    {'Attribute':'Variables Frequency',
     'Value, median (IQR)':'1.0 (1.0-4.0)'
    },
    {'Attribute':'Specific Phenotypes Frequency',
     'Value, median (IQR)':'1.0 (1.0-4.0)'
    }
]
table = pd.DataFrame(table_list)
table
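
The values in Table 1 are typed in by hand from the `describe()` outputs earlier in the notebook. As a sketch of how a "median (IQR)" string could be produced directly from a column, the hypothetical helper `median_iqr` below uses the 25th, 50th, and 75th percentiles, the same quartiles that `describe()` reports; the input numbers are made up.

```python
import pandas as pd

def median_iqr(values):
    """Format a numeric sequence as 'median (Q1-Q3)' (hypothetical helper, not part of the notebook)."""
    q1, q2, q3 = pd.Series(values).quantile([0.25, 0.5, 0.75])
    return "{:.1f} ({:.1f}-{:.1f})".format(q2, q1, q3)

# Made-up values, for illustration only
toy_summary = median_iqr([1, 2, 3, 4, 5])
print(toy_summary)
```

Generating the strings this way would keep the table in step with the data if the resource is refreshed.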

#### 4.2 Demographics summary

* 1570 patient records in the UDN database
* **Adult** (358 individuals (23%)) and **Pediatric** (1211 individuals (77%))
* Age: mean 20 (IQR 5-32) years at UDN evaluation, mean 11 (IQR 0-15) years at symptom onset
* The Female:Male (F:M) ratio is (774:796), Female (%) is 49.0
* Race: white (1200 individuals (76%) in all, 914 (75%) in pediatrics, 286 (80%) in adults)

#### 4.2 Table 2
Table 2 presents the demographic data of the UDN cohort for adult, pediatric, and all 1570 individuals.

In [None]:
table_list = [
    {'Attribute':'Age mean (IQR) at UDN evaluation (years)',
     'Adult':'47.0 (36.0-58.0)',
     'Pediatric':'12.0 (4.0-17.0)',
     'All':'20.0 (5.0-32.0)'
    },
    {'Attribute':'Age mean (IQR) at symptom onset (years)',
     'Adult':'39.0 (26.0-50.0)',
     'Pediatric':'2.0 (0.0-2.0)',
     'All':'11.0 (0.0-15.0)'
    },
    {'Attribute':'Age mean (IQR) current age (years)',
     'Adult':'49.0 (38.0-60.0)',
     'Pediatric':'14.0 (6.0-18.0)',
     'All':'22.0 (7.0-33.0)'
    },
    {'Attribute':'Female (%)',
     'Adult':'50.0',
     'Pediatric':'49.0',
     'All':'49.0'
    },
    {'Attribute':'Gender ratio (F:M)',
     'Adult':'178:180',
     'Pediatric':'596:615',
     'All':'774:796'
    },
    {'Attribute':'Race individuals (%) | White',
     'Adult':'286 (18)',
     'Pediatric':'914 (58)',
     'All':'1200 (76)'
    },
    {'Attribute':'Race individuals (%) | Asian',
     'Adult':'16 (1)',
     'Pediatric':'96 (6)',
     'All':'112 (7)'
    },
    {'Attribute':'Race individuals (%) | Black or African-American',
     'Adult':'32 (2)',
     'Pediatric':'71 (5)',
     'All':'103 (7)'
    },
    {'Attribute':'Race individuals (%) | American-Indian or Alaska Native',
     'Adult':'1 (0)',
     'Pediatric':'20 (1)',
     'All':'21 (1)'
    },
    {'Attribute':'Race individuals (%) | Native Hawaiian or Pacific Islander',
     'Adult':'0 (0)',
     'Pediatric':'3 (0)',
     'All':'3 (0)'
    },
    {'Attribute':'Race individuals (%) | Other',
     'Adult':'18 (1)',
     'Pediatric':'90 (6)',
     'All':'109 (7)'
    },
    {'Attribute':'Ethnicity individuals (%) | Not Hispanic or Latino',
     'Adult':'298 (19)',
     'Pediatric':'904 (58)',
     'All':'1202 (77)'
    },
    {'Attribute':'Ethnicity individuals (%) | Hispanic or Latino',
     'Adult':'22 (1)',
     'Pediatric':'199 (13)',
     'All':'221 (14)'
    },
    {'Attribute':'Ethnicity individuals (%) | Unknown or Not Reported Ethnicity',
     'Adult':'36 (2)',
     'Pediatric':'107 (7)',
     'All':'144 (9)'
    }
]
table = pd.DataFrame(table_list)
pd.set_option('max_colwidth',500)
table

#### 4.3 Figures

In [None]:
# Gender
# Adult
labels = 'Female (178:358)', 'Male (180:358)'
sizes = [50, 50]
explode = (0.1, 0)  # only "explode" the 1st slice (i.e. 'Female')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

In [None]:
# Pediatric
labels = 'Female (596:1211)', 'Male (615:1211)'
sizes = [49, 51]
explode = (0.1, 0)  # only "explode" the 1st slice (i.e. 'Female')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

In [None]:
# All
labels = 'Female (774:1570)', 'Male (796:1570)'
sizes = [49, 51]
explode = (0.1, 0)  # only "explode" the 1st slice (i.e. 'Female')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

In [None]:
# Race::All
labels = 'White (1200)', 'Asian (112)', 'African-American (103)', 'American-Indian (21)', 'Native Hawaiian (3)', 'Other (109)'
sizes = [76,7,7,1,0,7]
explode = (0,0,0,0,0.8,0)  # only "explode" the 5th slice (i.e. 'Native Hawaiian')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()
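
The pie charts above use hand-typed, rounded percentages. A possible alternative, sketched below, derives the labels and slice sizes from the counts themselves. The counts are copied from the labels of the race chart; the variable names are hypothetical. Note that the listed categories sum to 1548 rather than 1570, so shares computed this way are relative to patients with a listed race and differ slightly from the rounded whole-cohort percentages.

```python
# Counts copied from the labels of the race pie chart above
race_counts = {
    "White": 1200,
    "Asian": 112,
    "African-American": 103,
    "American-Indian": 21,
    "Native Hawaiian": 3,
    "Other": 109,
}

race_total = sum(race_counts.values())

# Labels in the same "name (count)" style as the chart
race_labels = ["{} ({})".format(name, n) for name, n in race_counts.items()]

# Slice sizes as unrounded percentages of the listed categories
race_sizes = [100 * n / race_total for n in race_counts.values()]

for label, size in zip(race_labels, race_sizes):
    print("{}: {:.1f}%".format(label, size))
```

`race_labels` and `race_sizes` could then be passed to `ax1.pie` in place of the literal tuples, which keeps small categories such as Native Hawaiian from being displayed as 0.0%.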