# NCATS Translator Workflow 5, Modules 1-4 - Red Team (COHD)
## Gender-related conditions
This is a Red Team implementation of NCATS Translator Workflow 5, Modules 1-4, using COHD (Columbia Open Health Data) to find conditions that are more prevalent in women than in men, and vice versa.

In [1]:
import pandas as pd
import numpy as np
from cohd_requests import *

### Display settings (optional)

In [2]:
# Pandas display options
pd.options.display.max_colwidth = 255
pd.options.display.max_rows = None

# Wider notebook display
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## 1) Using the 5-year non-hierarchical data set

In [3]:
dataset_id = 1

## 2) Define male and female cohorts

In [4]:
concept_female = 8532
concept_male = 8507
domain = 'Condition'  # This can be changed to 'Drug' or 'Procedure' to find concepts in those domains instead

## 3) Get conditions associated with each gender

In [5]:
df_association_female = obs_exp_ratio(concept_female, concept_id_2=None, domain_id=domain, dataset_id=dataset_id)
df_association_male = obs_exp_ratio(concept_male, concept_id_2=None, domain_id=domain, dataset_id=dataset_id)

### Sample of the data for female and male associations

In [6]:
display(df_association_female.head(100))

,dataset_id,concept_id_1,concept_id_2,concept_2_name,concept_2_domain,observed_count,expected_count,ln_ratio
0,1,8532,197339,"Congenital abnormality of uterus, affecting pregnancy",Condition,13,1.737627,2.012429
1,1,8532,435357,Conjoined twins,Condition,11,2.316837,1.557693
2,1,8532,45770827,Female genital mutilation,Condition,12,2.896046,1.42156
3,1,8532,443538,Plateau iris syndrome,Condition,23,5.792091,1.379001
4,1,8532,438226,Short cord,Condition,11,2.896046,1.334549
5,1,8532,4138173,Intractable ophthalmic migraine,Condition,24,6.371301,1.32625
6,1,8532,4055640,Lung disease with systemic lupus erythematosus,Condition,32,9.267346,1.239239
7,1,8532,45757789,Postpartum gestational diabetes mellitus,Condition,12,3.475255,1.239239
8,1,8532,441087,Vulval and/or perineal hematoma during delivery - delivered,Condition,17,5.212882,1.18208
9,1,8532,81987,Open wound of breast with complication,Condition,45,13.901019,1.1747
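
The `ln_ratio` column in this output appears to be the natural logarithm of `observed_count / expected_count`. The minimal sketch below checks that reading against the first two rows shown above. The formula is inferred from the displayed values rather than taken from the COHD documentation, and `ln_obs_exp` is a hypothetical helper name.

```python
import math

# (concept_2_name, observed_count, expected_count, ln_ratio) copied from the output above
rows = [
    ("Congenital abnormality of uterus, affecting pregnancy", 13, 1.737627, 2.012429),
    ("Conjoined twins", 11, 2.316837, 1.557693),
]


def ln_obs_exp(observed, expected):
    """Natural log of the observed-to-expected count ratio (assumed definition)."""
    return math.log(observed / expected)


recomputed = [ln_obs_exp(obs, exp) for _, obs, exp, _ in rows]
for (name, _, _, reported), value in zip(rows, recomputed):
    print(f"{name}: recomputed={value:.6f}, reported={reported:.6f}")
```

A positive `ln_ratio` therefore means the pair co-occurs more often than expected, and a negative value means less often.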


In [7]:
display(df_association_male.head(100))

,dataset_id,concept_id_1,concept_id_2,concept_2_name,concept_2_domain,observed_count,expected_count,ln_ratio
0,1,8507,433745,"Amphetamine abuse, episodic",Condition,11,1.684205,1.876602
1,1,8507,78009,Traumatic pneumothorax with open wound into thorax,Condition,17,2.947358,1.752304
2,1,8507,374646,Toxic cataract,Condition,11,2.105256,1.653458
3,1,8507,22426,Congenital macrostomia,Condition,16,3.368409,1.558148
4,1,8507,4147021,"Contusion, scrotum or testis",Condition,24,5.894716,1.403997
5,1,8507,4154579,Injury of thoracic cavity,Condition,46,11.368381,1.397806
6,1,8507,193162,Carcinoma in situ of penis,Condition,17,4.210511,1.395629
7,1,8507,4174824,Congenital stricture of urinary meatus,Condition,17,4.210511,1.395629
8,1,8507,200003,Rupture of corpus cavernosum of penis,Condition,49,12.210483,1.389525
9,1,8507,4148093,Abuse of non-dependence-producing substances,Condition,20,5.052614,1.375827


## 4) Filter the list of associated conditions

### 4.1) Exclude concept pairs with low co-occurrence, since these results may be heavily swayed by the Poisson randomization

In [8]:
cooccurrence_threshold = 100
df_association_female = df_association_female[df_association_female['observed_count'] > cooccurrence_threshold]
df_association_male = df_association_male[df_association_male['observed_count'] > cooccurrence_threshold]
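
To see why small counts are the most affected, the sketch below assumes that each reported count is a draw from a Poisson distribution whose mean is the true count. The notebook only says "Poisson randomization", so this model is an assumption, and `poisson_relative_sd` is a hypothetical helper. Under that model the standard deviation is the square root of the count, so the relative noise is `1 / sqrt(count)`.

```python
import math


def poisson_relative_sd(count):
    """Relative standard deviation of a Poisson variable with the given mean."""
    return 1.0 / math.sqrt(count)


# 11 is the smallest count visible in the sample output; 100 is the threshold used above
for count in (11, 100, 1000):
    print(f"count={count:5d}  relative sd ~ {poisson_relative_sd(count):.1%}")
```

Under this assumption a count near 11 carries roughly 30% relative noise, while counts above the threshold of 100 carry less than 10%.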

### 4.2) Exclude gender-specific conditions from the list of associated conditions
Many of the conditions enriched in each gender are simply conditions that do not occur in the opposite gender. To detect them, compare each concept's single-concept prevalence with its co-occurrence count for each gender. If the observed co-occurrence is much lower than the co-occurrence expected under independence (or the pair does not appear in the data at all), exclude that concept from the list.

In [9]:
# Get the prevalences for male and female
df_gender_prev = concept_frequency([concept_female, concept_male], dataset_id=dataset_id)

# Pull out the prevalences
prev_female = df_gender_prev[df_gender_prev['concept_id']==concept_female]['concept_frequency'].iloc[0]
prev_male = df_gender_prev[df_gender_prev['concept_id']==concept_male]['concept_frequency'].iloc[0]

# Get single concept prevalences for all concepts
df_conditions = most_frequent_concepts(limit=1000000, dataset_id=dataset_id, domain_id=domain)

# Set the index to the appropriate concept IDs for join operation
df_conditions = df_conditions.set_index('concept_id')
df_association_female = df_association_female.set_index('concept_id_2')
df_association_male = df_association_male.set_index('concept_id_2')

# Extract the observed counts and rename them for the join operation
df_association_female_join = df_association_female[['observed_count']].rename(columns={'observed_count': 'observed_count_female'})
df_association_male_join = df_association_male[['observed_count']].rename(columns={'observed_count': 'observed_count_male'})

# Left-join the female and male co-occurrence counts onto the single-concept table
df_joined = df_conditions.join([df_association_female_join, df_association_male_join], how='left')

# Flag (potentially) gender-specific concepts: those whose observed gender-concept co-occurrence is much smaller than the co-occurrence expected under independence.
# Note: COHD omits co-occurrences <= 10, and step 4.1 already removed pairs with co-occurrence <= 100, so a missing (NaN) count for either gender also flags the concept as gender specific.
def gender_specific(row):
    threshold = 20
    return np.isnan(row['observed_count_female']) or \
        (row['observed_count_female'] < (row['concept_count'] * prev_female / threshold)) or \
        np.isnan(row['observed_count_male']) or \
        (row['observed_count_male'] < (row['concept_count'] * prev_male / threshold))
    
df_joined['gender_specific'] = df_joined.apply(gender_specific, axis=1)
df_gender_specific = df_joined[['gender_specific']]

# Show the concepts detected as gender specific
display(df_joined[df_joined.gender_specific])

# Remove gender-specific concepts from the associations
df_association_female_filtered = df_association_female.join(df_gender_specific, how='left')
df_association_female_filtered = df_association_female[~df_association_female_filtered['gender_specific']]
df_association_male_filtered = df_association_male.join(df_gender_specific, how='left')
df_association_male_filtered = df_association_male[~df_association_male_filtered['gender_specific']]

concept_id,concept_class_id,concept_count,concept_frequency,concept_name,dataset_id,domain_id,vocabulary_id,observed_count_female,observed_count_male,gender_specific
434169,Clinical Finding,24832,0.013869,Abnormal findings on diagnostic imaging of breast,1,Condition,SNOMED,24734.0,161.0,True
4094910,Clinical Finding,21963,0.012267,Pregnancy test positive,1,Condition,SNOMED,22155.0,,True
4094448,Clinical Finding,21126,0.011799,Pregnancy test negative,1,Condition,SNOMED,21089.0,,True
4299535,Clinical Finding,20553,0.011479,Patient currently pregnant,1,Condition,SNOMED,20441.0,,True
198194,Clinical Finding,19182,0.010714,Female genital organ symptoms,1,Condition,SNOMED,19130.0,,True
198803,Clinical Finding,18852,0.010529,Benign prostatic hyperplasia,1,Condition,SNOMED,,18630.0,True
197236,Clinical Finding,17533,0.009793,Uterine leiomyoma,1,Condition,SNOMED,17284.0,,True
443800,Clinical Finding,17007,0.009499,Amenorrhea,1,Condition,SNOMED,16796.0,,True
441641,Clinical Finding,14848,0.008293,Delivery normal,1,Condition,SNOMED,14570.0,,True
201909,Clinical Finding,14655,0.008185,Female infertility,1,Condition,SNOMED,14824.0,,True
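
The rule applied by `gender_specific` can be traced by hand on a single row. The sketch below restates it as a standalone function, using `None` in place of NaN for a missing count. The prevalences 0.58 and 0.42 are illustrative placeholders, not values read from COHD, and `is_gender_specific` is a hypothetical name.

```python
def is_gender_specific(concept_count, observed_female, observed_male,
                       prev_female=0.58, prev_male=0.42, threshold=20):
    """Return True when either gender's co-occurrence is missing or is less than
    1/threshold of the count expected if gender and concept were independent."""
    if observed_female is None or observed_male is None:
        return True
    expected_female = concept_count * prev_female
    expected_male = concept_count * prev_male
    return (observed_female < expected_female / threshold
            or observed_male < expected_male / threshold)


# Row from the output above: abnormal findings on diagnostic imaging of breast
print(is_gender_specific(24832, 24734, 161))    # male count far below 24832 * 0.42 / 20
# Row with no male co-occurrence in the filtered data: pregnancy test positive
print(is_gender_specific(21963, 22155, None))
# Made-up balanced concept, for contrast
print(is_gender_specific(1000, 600, 400))
```

With these placeholder prevalences the first two calls return `True` and the last returns `False`, which matches the `gender_specific` flags shown for the two real rows.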


## 5) Calculate the ratio of prevalence between genders
Add a column with the prevalence ratio between genders for easier interpretation.

In [10]:
df_female = df_association_female_filtered.rename(columns={'observed_count': 'observed_count_female'})
df_female = df_female.join(df_association_male_filtered[['observed_count']], how='left')
df_female = df_female.rename(columns={'observed_count': 'observed_count_male'})
# Calculate the ratio between prevalences for each gender
df_female['F:M'] = (df_female.observed_count_female / prev_female) / (df_female.observed_count_male / prev_male)

df_male = df_association_male_filtered.rename(columns={'observed_count': 'observed_count_male'})
df_male = df_male.join(df_association_female_filtered[['observed_count']], how='left')
df_male = df_male.rename(columns={'observed_count': 'observed_count_female'})
# Calculate the ratio between prevalences for each gender
df_male['M:F'] = (df_male.observed_count_male / prev_male) / (df_male.observed_count_female / prev_female)

### Female
Showing the top 100 conditions more prevalent in females

In [11]:
display(df_female.head(100))

concept_id_2,dataset_id,concept_id_1,concept_2_name,concept_2_domain,observed_count_female,expected_count,ln_ratio,observed_count_male,F:M
441788,1,8532,Human papilloma virus infection,Condition,20911,12273.44168,0.532838,505,30.101134
73819,1,8532,Pain of breast,Condition,10146,5981.49279,0.528409,368,20.04225
80767,1,8532,Breast lump,Condition,16752,10063.75881,0.509577,551,22.101131
45773176,1,8532,Low grade squamous intraepithelial lesion on cervical Papanicolaou smear,Condition,5210,3133.521448,0.508422,125,30.29892
195007,1,8532,Female stress incontinence,Condition,2169,1322.913676,0.49443,128,12.31825
40479565,1,8532,Carrier of cystic fibrosis gene mutation,Condition,5723,3500.160834,0.491684,292,14.247555
434298,1,8532,Secondary malignant neoplasm of lymph nodes of upper limb,Condition,969,596.006205,0.486014,102,6.905944
80824,1,8532,Senile osteoporosis,Condition,3063,1915.444626,0.469445,228,9.765884
4147829,1,8532,Pain in pelvis,Condition,16639,10446.616052,0.465471,1478,8.183747
4146239,1,8532,Pruritus of genital organs,Condition,1434,901.249422,0.464441,164,6.356305
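
Because `F:M` divides each gender's count by that gender's overall prevalence, it equals the raw count ratio scaled by `prev_male / prev_female`. As a rough consistency check, the sketch below back-calculates the implied `prev_female / prev_male` from three rows of the table above. The function name is hypothetical, and the rows are copied by hand from the displayed output.

```python
# (concept_2_name, observed_count_female, observed_count_male, F:M) copied from the table above
rows = [
    ("Human papilloma virus infection", 20911, 505, 30.101134),
    ("Pain of breast", 10146, 368, 20.04225),
    ("Pain in pelvis", 16639, 1478, 8.183747),
]


def implied_prevalence_ratio(count_female, count_male, f_to_m):
    """Return prev_female / prev_male implied by one row:
    F:M = (count_female / prev_female) / (count_male / prev_male)."""
    return (count_female / count_male) / f_to_m


implied = [implied_prevalence_ratio(f, m, ratio) for _, f, m, ratio in rows]
for (name, *_), value in zip(rows, implied):
    print(f"{name}: implied prev_female / prev_male = {value:.4f}")
```

Every row should give the same value, since the prevalences are constants. Agreement across rows suggests the column was computed as described.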


### Male
Showing the top 100 conditions more prevalent in males

In [12]:
display(df_male.head(100))

concept_id_2,dataset_id,concept_id_1,concept_2_name,concept_2_domain,observed_count_male,expected_count,ln_ratio,observed_count_female,M:F
4237140,1,8507,Abnormal sexual function,Condition,1482,635.366159,0.846946,107,19.053071
4163261,1,8507,Malignant tumor of prostate,Condition,10766,4734.298932,0.821559,676,21.908278
433813,1,8507,Bladder neck obstruction,Condition,971,454.735223,0.758611,160,8.348334
194406,1,8507,Urinary tract obstruction,Condition,5599,2636.20114,0.753249,491,15.686626
4012231,1,8507,Poor stream of urine,Condition,1753,831.997038,0.745255,227,10.623231
197023,1,8507,Bilateral inguinal hernia,Condition,1916,912.417804,0.741897,235,11.215747
195926,1,8507,Slowing of urinary stream,Condition,1384,665.68184,0.731921,148,12.863968
4295624,1,8507,Squamous cell carcinoma of skin of ear,Condition,1477,714.523772,0.726152,193,10.527464
195590,1,8507,Urethral stricture,Condition,1598,775.576186,0.722902,272,8.081806
201688,1,8507,Delay when starting to pass urine,Condition,2696,1319.153198,0.714779,446,8.315447
