## Mondo Mass Obsoletion - SME Review Summary

This notebook summarizes the results of the SME review of the Strategic Refinement (Mondo Mass Obsoletion). The goal is to gauge whether classes predicted to leave a branch when their parent class is obsoleted should actually leave that branch.


In [1]:
# Load packages
import os
import re
import pandas as pd
import numpy as np

# Enable display of all columns in df
pd.set_option('display.max_columns', None)

# Enable display of entire cell value
pd.set_option('display.max_colwidth', None)

In [2]:
# List all files and get their size

path = "./data"

# Get list of all files only in the given directory
data_files = lambda x : os.path.isfile(os.path.join(path,x))
files_list = filter(data_files, os.listdir(path))
 
# Create a list of files in directory along with the size
file_metadata = [
    (file, os.stat(os.path.join(path, file)).st_size)
    for file in files_list
]

# Display number of files
display(len(file_metadata))


# Iterate over list of files along with size
# for filename, size in file_metadata:
#     print("{} : {} B".format(filename, size))

46

In [3]:
# Create dictionary of file groups based on their branch name

branches = {}

seen = []

for full_filename, size in file_metadata:
    branch, reviewer = full_filename.split('_')

    if branch in seen:
        branches[branch].append((full_filename, size))
    else:
        seen.append(branch)
        branches[branch] = [(full_filename, size)]


print('** All Branch Files **')
for k,v in branches.items():
    print("** Branch:", k, '(',len(v), ')', "ReviewerFiles:", v)

** All Branch Files **
** Branch: Cancer-or-benign-tumor ( 5 ) ReviewerFiles: [('Cancer-or-benign-tumor_reviewer-JessicaMester.xlsx', 9095), ('Cancer-or-benign-tumor_reviewer-SumirPandit.xlsx', 9191), ('Cancer-or-benign-tumor_reviewer-GiocondaAlyea.xlsx', 9095), ('Cancer-or-benign-tumor_reviewer-1.xlsx', 9087), ('Cancer-or-benign-tumor_reviewer-NeomiCaban.xlsx', 9095)]
** Branch: Inflammatory-disease ( 3 ) ReviewerFiles: [('Inflammatory-disease_reviewer-MeganKraus.xlsx', 10548), ('Inflammatory-disease_reviewer-GiocondaAlyea.xlsx', 10431), ('Inflammatory-disease_reviewer-1.xlsx', 10032)]
** Branch: Musculoskeletal-system ( 2 ) ReviewerFiles: [('Musculoskeletal-system_reviewer-1.xlsx', 38813), ('Musculoskeletal-system_reviewer-GiocondaAlyea.xlsx', 39783)]
** Branch: Endocrine-Disease ( 3 ) ReviewerFiles: [('Endocrine-Disease_reviewer-JessicaMester.xlsx', 10115), ('Endocrine-Disease_reviewer-GiocondaAlyea.xlsx', 10115), ('Endocrine-Disease_reviewer-1.xlsx', 10106)]
** Branch: .DS ( 1 ) Re
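
The `.DS` entry in the listing above most likely comes from a macOS `.DS_Store` file, which `split('_')` happens to break into exactly two parts. A minimal sketch of a stricter parser follows, assuming every review file follows the `<branch>_reviewer-<name>.xlsx` pattern seen above. `parse_review_filename` and `group_by_branch` are hypothetical helpers, not part of the original analysis.

```python
import re

# Assumed naming convention: <branch>_reviewer-<name>.xlsx
REVIEW_FILE_PATTERN = re.compile(r'^(?P<branch>[^_]+)_reviewer-(?P<reviewer>.+)\.xlsx$')


def parse_review_filename(filename):
    """Return (branch, reviewer) for a review file, or None for anything else."""
    match = REVIEW_FILE_PATTERN.match(filename)
    if match is None:
        return None
    return match.group('branch'), match.group('reviewer')


def group_by_branch(file_metadata):
    """Group (filename, size) tuples by branch, skipping non-review files."""
    grouped = {}
    for filename, size in file_metadata:
        parsed = parse_review_filename(filename)
        if parsed is None:
            continue  # e.g. hidden files such as .DS_Store
        branch, _reviewer = parsed
        grouped.setdefault(branch, []).append((filename, size))
    return grouped
```

Calling `group_by_branch(file_metadata)` would give the same structure as `branches`, without an entry for hidden files.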

In [4]:
# Find template file names

templates = {}

for branch, files in branches.items():
    template_pattern = r'(.+)_reviewer-1\.xlsx'
    
    # Find filename of template file
    for file in files:
        template_filename = re.search(template_pattern, file[0])
        
        if template_filename:
            templates[file[0]] = file[1]


In [5]:
# Check which files were reviewed by comparing file size of "template" file to the other files
# NOTE: Even though some reviewer files were (slightly) larger than the "template", there was
# no reviewer data in the files to analyze 

data_files = {}

for template_filename, template_size in templates.items():
    filenames = []

    branch_name = template_filename.split('_')[0]
    
    for filename, size in branches[branch_name]:
        if filename != template_filename:
            if size > template_size:
                filenames.append(filename)
                
    data_files[branch_name] = filenames
    

# Display results
for key in sorted(data_files.keys()):
    print('**', key, ', NumReviewers: (', len(data_files[key]), ')', data_files[key])
    
    

** Cancer-or-benign-tumor , NumReviewers: ( 4 ) ['Cancer-or-benign-tumor_reviewer-JessicaMester.xlsx', 'Cancer-or-benign-tumor_reviewer-SumirPandit.xlsx', 'Cancer-or-benign-tumor_reviewer-GiocondaAlyea.xlsx', 'Cancer-or-benign-tumor_reviewer-NeomiCaban.xlsx']
** Cardiovascular-Disorder , NumReviewers: ( 2 ) ['Cardiovascular-Disorder_reviewer-DylanGration.xlsx', 'Cardiovascular-Disorder_reviewer-GiocondaAlyea.xlsx']
** Connective-tissue , NumReviewers: ( 0 ) []
** Digestive-system-disorder , NumReviewers: ( 1 ) ['Digestive-system-disorder_reviewer-GiocondaAlyea.xlsx']
** Disorder-of-visual-system , NumReviewers: ( 1 ) ['Disorder-of-visual-system_reviewer-GiocondaAlyea.xlsx']
** Endocrine-Disease , NumReviewers: ( 2 ) ['Endocrine-Disease_reviewer-JessicaMester.xlsx', 'Endocrine-Disease_reviewer-GiocondaAlyea.xlsx']
** Hematologic-disorder , NumReviewers: ( 2 ) ['Hematologic-disorder_reviewer-GiocondaAlyea.xlsx', 'Hematologic-disorder_reviewer-JessicaMester.xlsx']
** Immune-system-disorde
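
As the note in the cell above points out, a larger file size does not guarantee that a reviewer entered anything. A content-based check is sketched below. It assumes the answer columns are named `yes`, `no` and `unsure` (as in the reviewer files read later in this notebook) and that a marked cell holds `True`. `has_review_data` is a hypothetical helper.

```python
import pandas as pd

ANSWER_COLUMNS = ['yes', 'no', 'unsure']


def has_review_data(df, answer_columns=ANSWER_COLUMNS):
    """Return True if at least one answer cell is marked True."""
    answers = df[answer_columns]
    # eq(True) is False for both False and blank (NaN/None) cells
    marked = answers.eq(True)
    return bool(marked.to_numpy().any())
```

A file could then be tested with `has_review_data(pd.read_excel(f"data/{filename}"))` instead of comparing sizes.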

In [None]:
# # Analyze the branch data_files (in batch)
# # NOTE: Batch analysis needs more work. Analyze each set of branch files separately/manually

# df = pd.DataFrame()

# for branch_name, files in data_files.items():
#     for f in files:
#         data = pd.read_excel(f"data/{f}")

#         df = pd.concat(pd.read_excel(f"data/{excel_file}") for excel_file in files)

# df.head()
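
The batch attempt above concatenates the files without recording which reviewer each row came from. One possible approach, shown here only as a sketch, is to stack the files in long format with explicit `branch` and `reviewer` columns. `stack_reviews` is a hypothetical helper. It takes already-loaded DataFrames so that the example does not depend on the files in `data/`.

```python
import pandas as pd


def stack_reviews(frames_by_reviewer, branch_name):
    """Stack one DataFrame per reviewer into a single long-format table.

    frames_by_reviewer maps a reviewer name to that reviewer's DataFrame
    and must contain at least one entry (pd.concat rejects an empty list).
    """
    labelled = []
    for reviewer, frame in frames_by_reviewer.items():
        # assign() returns a new frame, so the inputs are left untouched
        labelled.append(frame.assign(branch=branch_name, reviewer=reviewer))
    return pd.concat(labelled, ignore_index=True)
```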

---
### Analyze Individual Branches in Batch --> too many empty files

In [7]:
#####
# Analyze "Cancer-or-benign-tumor" SME file
# NOTE: While all file_paths are files greater in size than the template file,
# only "Cancer-or-benign-tumor_reviewer-SumirPandit.xlsx" has any data entered. 
#####


branch_name = "Cancer-or-benign-tumor" 
file_paths = ['Cancer-or-benign-tumor_reviewer-JessicaMester.xlsx', 'Cancer-or-benign-tumor_reviewer-SumirPandit.xlsx', 'Cancer-or-benign-tumor_reviewer-GiocondaAlyea.xlsx', 'Cancer-or-benign-tumor_reviewer-NeomiCaban.xlsx']


selected_columns_indices = [0, 1, 2, 3, 4]


# Read the first file into a DataFrame
merged_dataframe = pd.read_excel(f"data/{file_paths[0]}", usecols=selected_columns_indices)

# Iterate through the remaining file paths and merge them based on column 'Mondo ID'
for file_path in file_paths[1:]:
    # Read the next file into a DataFrame
    next_dataframe = pd.read_excel(f"data/{file_path}", usecols=[0,2, 3, 4])
    
    # Extract the reviewer name from the file path
    reviewer_name = file_path.split('-')[-1].split('.')[0]

    # Define custom suffixes based on the reviewer name
    suffix_left = f'_left_{reviewer_name}'
    suffix_right = f'_right_{reviewer_name}'

    # Merge based on column A with custom suffixes
    merged_dataframe = pd.merge(merged_dataframe, next_dataframe, on='Mondo ID', how='outer', suffixes=('', f'_{reviewer_name}')) # suffixes=(suffix_left, suffix_right)

    
    
#     # Define custom suffixes based on the file index
#     suffix_left = f'_left_{file_paths.index(file_path)}'
#     suffix_right = f'_right_{file_paths.index(file_path)}'

#     # Merge based on column A
#     merged_dataframe = pd.merge(merged_dataframe, next_dataframe, on='Mondo ID', how='outer', suffixes=(suffix_left, suffix_right))


# Print the resulting merged DataFrame
# merged_dataframe.head()

### Result for "Cancer-or-benign-tumor"

Only one reviewer entered any data for this branch, and very few files overall contain reviewer data, so the files that do contain data are listed manually below.


---
---
### Check Manually which branches have any review data

---
- Cancer-or-benign-tumor_reviewer-SumirPandit.xlsx

---
- Cardiovascular-Disorder_reviewer-DylanGration.xlsx

---
- Digestive-system-disorder_reviewer-GiocondaAlyea.xlsx

---
- Disorder-of-visual-system_reviewer-GiocondaAlyea.xlsx

---
- Immune-system-disorder_reviewer-RachelSparks.xlsx

---
- Inflammatory-disease_reviewer-GiocondaAlyea.xlsx
- Inflammatory-disease_reviewer-MeganKraus.xlsx

---
- Integumentary-system-disorder_reviewer-GiocondaAlyea.xlsx
- Integumentary-system-disorder_reviewer-MeganKraus.xlsx

---
- Musculoskeletal-system_reviewer-GiocondaAlyea.xlsx

---
- Reproductive-system-disorder_reviewer-PaolaRoncaglia.xlsx

---
- Respiratory-system_reviewer-GiocondaAlyea.xlsx



---
## Read in Mass Obsoletion Report

In [8]:
# Load file
report_df = pd.read_csv('mass_obsoletion_qc_report_15Nov2023.csv', sep=',')
report_df.head()

Unnamed: 0,curie,label,previous_parents,latest_parents,added_parents,removed_parents,obsoleted_parents,previous_branches,latest_branches,is_branch_assignment_changed,added_branches,removed_branches
0,MONDO:0016281,"46,XX ovotesticular disorder of sex development","46,XX disorder of gonadal development(MONDO:0017961)",disorder of sexual differentiation(MONDO:0002145),disorder of sexual differentiation(MONDO:0002145),"46,XX disorder of gonadal development(MONDO:0017961)","46,XX disorder of gonadal development(MONDO:0017961)","disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151), reproductive system disorder(MONDO:0005039)","disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151), reproductive system disorder(MONDO:0005039)",False,,
1,MONDO:0019608,"46,XX disorder of sex development induced by maternal-derived androgen","46,XX disorder of sex development induced by androgens excess(MONDO:0020039)",,,"46,XX disorder of sex development induced by androgens excess(MONDO:0020039)","46,XX disorder of sex development induced by androgens excess(MONDO:0020039)","disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151), reproductive system disorder(MONDO:0005039)",,True,,"reproductive system disorder(MONDO:0005039), disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151)"
2,MONDO:0019593,"46,XX disorder of sex development induced by fetal androgens excess","female reproductive system disorder(MONDO:0002263), 46,XX disorder of sex development induced by androgens excess(MONDO:0020039)",,,"female reproductive system disorder(MONDO:0002263), 46,XX disorder of sex development induced by androgens excess(MONDO:0020039)","46,XX disorder of sex development induced by androgens excess(MONDO:0020039)","disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151), reproductive system disorder(MONDO:0005039)",,True,,"reproductive system disorder(MONDO:0005039), disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151)"
3,MONDO:0017962,"46,XX disorder of sex development induced by fetoplacental androgens excess","female reproductive system disorder(MONDO:0002263), 46,XX disorder of sex development induced by androgens excess(MONDO:0020039)",,,"female reproductive system disorder(MONDO:0002263), 46,XX disorder of sex development induced by androgens excess(MONDO:0020039)","46,XX disorder of sex development induced by androgens excess(MONDO:0020039)","disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151), reproductive system disorder(MONDO:0005039)",,True,,"reproductive system disorder(MONDO:0005039), disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151)"
4,MONDO:0014421,glucocorticoid resistance,"developmental anomaly of metabolic origin(MONDO:0015327), 46,XX disorder of sex development induced by fetal androgens excess(MONDO:0019593), adrenogenital syndrome(MONDO:0015898)",adrenogenital syndrome(MONDO:0015898),,"developmental anomaly of metabolic origin(MONDO:0015327), 46,XX disorder of sex development induced by fetal androgens excess(MONDO:0019593)","46,XX disorder of sex development induced by fetal androgens excess(MONDO:0019593)","disorder of development or morphogenesis(MONDO:0021147), endocrine system disorder(MONDO:0005151), metabolic disease(MONDO:0005066), reproductive system disorder(MONDO:0005039)","endocrine system disorder(MONDO:0005151), metabolic disease(MONDO:0005066)",True,,"reproductive system disorder(MONDO:0005039), disorder of development or morphogenesis(MONDO:0021147)"
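
The parent and branch columns of the report hold strings of the form `label(MONDO:...)`, joined by commas. Because labels can themselves contain commas (for example "46,XX ..."), splitting on commas is unsafe. A small sketch that pulls out only the CURIEs, assuming each one appears in parentheses as `MONDO:` followed by digits. `extract_curies` is a hypothetical helper.

```python
import re

import pandas as pd

# Matches the CURIE inside the trailing parentheses of each entry
CURIE_PATTERN = re.compile(r'\((MONDO:\d+)\)')


def extract_curies(cell):
    """Return the list of MONDO CURIEs in a report cell; [] for empty cells."""
    if pd.isna(cell):
        return []
    return CURIE_PATTERN.findall(cell)
```

For example, `report_df['latest_branches'].apply(extract_curies)` would give one list of branch CURIEs per row, which can be tested for membership exactly rather than by substring.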


---
### Check for Agreement with Cancer or benign tumor data from Reviewer

#### Step 1 (Cancer or benign tumor)

In [14]:
# Copy report_df
report_copy_cancer_df = report_df.copy()

# Read in reviewer file
reviewer_df = pd.read_excel('data/Cancer-or-benign-tumor_reviewer-SumirPandit.xlsx', usecols=['Mondo ID', 'yes', 'no', 'unsure', 'comment'])
reviewer_df.head()

# # Find which column has the first True value for each row --> sometimes reviewers added yes/no and unsure
# reviewer_df['first_true_column'] = reviewer_df[['yes', 'no', 'unsure']].idxmax(axis=1)
# # Display the result
# print(reviewer_df[['Mondo ID', 'first_true_column']])


# List every column marked True for each row (a reviewer may mark yes/no together with unsure)
reviewer_df['columns_marked_as_true'] = reviewer_df[['yes', 'no', 'unsure']].apply(lambda row: row.index[row].tolist(), axis=1)
# Display the result
print(reviewer_df[['Mondo ID', 'columns_marked_as_true']])


# The form question was "In your opinion, are the following diseases considered a "cancer or benign tumor"?"
# Therefore, 'yes' means the term should 'Stay in the branch' and 'no' means it should 'Leave the Branch'.

         Mondo ID columns_marked_as_true
0   MONDO:0019328                   [no]
1   MONDO:0017046                  [yes]
2   MONDO:0017571               [unsure]
3   MONDO:0015756                  [yes]
4   MONDO:0008876                   [no]
5   MONDO:0033954                   [no]
6   MONDO:0015757                  [yes]
7   MONDO:0019329                   [no]
8   MONDO:0008594               [unsure]
9   MONDO:0016725                  [yes]
10  MONDO:0016745                  [yes]
11  MONDO:0019476                  [yes]
12  MONDO:0015185                   [no]
13  MONDO:0018070               [unsure]
14  MONDO:0018192                  [yes]
15  MONDO:0018717                   [no]
16  MONDO:0044350                   [no]
17  MONDO:0011605                   [no]
18  MONDO:0015811                  [yes]
19  MONDO:0015812                  [yes]
20  MONDO:0008390                   [no]
21  MONDO:0011512               [unsure]
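
Since a row may have zero, one or several columns marked, it can help to collapse each list into a single decision label before comparing with the report. The sketch below is one way to do that. The label names and the rule that `yes` or `no` takes precedence over an accompanying `unsure` are assumptions, and `classify_answer` is a hypothetical helper.

```python
def classify_answer(marked_columns):
    """Collapse the list of marked columns into a single decision label."""
    marked = set(marked_columns)
    if not marked:
        return 'unanswered'
    if {'yes', 'no'} <= marked:
        return 'conflict'  # both 'yes' and 'no' were marked
    if 'yes' in marked:
        return 'yes'  # assumption: 'yes' wins over an accompanying 'unsure'
    if 'no' in marked:
        return 'no'  # assumption: 'no' wins over an accompanying 'unsure'
    return 'unsure'
```

It could be applied as `reviewer_df['columns_marked_as_true'].apply(classify_answer)`.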


#### Step 2 (Cancer or benign tumor)

In [20]:
# Check whether there is agreement between Reviewer and Curator Report
# to show that the term will 'Stay in the branch'

#-------
# FIRST - filter dataframe to keep only rows where 'columns_marked_as_true' contains 'yes' or 'no',
# i.e. remove rows marked only 'unsure' (or not marked at all)
df_filtered = reviewer_df[reviewer_df['columns_marked_as_true'].apply(lambda x: isinstance(x, list) and any(value in x for value in ['yes', 'no']))]

# Convert the values in 'columns_marked_as_true' to strings vs. lists
df_filtered['columns_marked_as_true'] = df_filtered['columns_marked_as_true'].apply(lambda x: ', '.join(x))
print(df_filtered)


# cancer_target_value = 'MONDO:0045024' #cancer or benign tumor(MONDO:0045024)

# # Filter rows in reviewer_df where 'first_true_column' is 'yes'
# target_rows_reviewer_df = reviewer_df[reviewer_df['first_true_column'] == 'yes']

# # Merge reviewer_df and report_df based on 'Mondo ID'
# stay_cancer_merged_df = pd.merge(target_rows_reviewer_df, report_copy_cancer_df, left_on='Mondo ID', right_on='curie', how='left')
# # stay_cancer_merged_df.head()


# # Check for the existence of 'MONDO:0045024' in the merged row
# stay_cancer_merged_df['target_exists_in_reportdf'] = stay_cancer_merged_df['latest_branches'].str.contains(cancer_target_value)

# # Display the result
# display(stay_cancer_merged_df[['Mondo ID', 'latest_branches', 'target_exists_in_reportdf']])


# stay_cancer_merged_df['target_exists_in_reportdf'].value_counts()


         Mondo ID    yes     no  unsure columns_marked_as_true
0   MONDO:0019328  False   True   False                     no
1   MONDO:0017046   True  False   False                    yes
3   MONDO:0015756   True  False   False                    yes
4   MONDO:0008876  False   True   False                     no
5   MONDO:0033954  False   True   False                     no
6   MONDO:0015757   True  False   False                    yes
7   MONDO:0019329  False   True   False                     no
9   MONDO:0016725   True  False   False                    yes
10  MONDO:0016745   True  False   False                    yes
11  MONDO:0019476   True  False   False                    yes
12  MONDO:0015185  False   True   False                     no
14  MONDO:0018192   True  False   False                    yes
15  MONDO:0018717  False   True   False                     no
16  MONDO:0044350  False   True   False                     no
17  MONDO:0011605  False   True   False                

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered['columns_marked_as_true'] = df_filtered['columns_marked_as_true'].apply(lambda x: ', '.join(x))
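
The warning above is raised because `df_filtered` is produced by filtering `reviewer_df` and a column is then assigned to it. A common remedy is to take an explicit `.copy()` of the filtered rows before modifying them. The sketch below shows this on a toy stand-in for `reviewer_df`. The IDs and values are illustrative only.

```python
import pandas as pd

# Toy stand-in for reviewer_df (values are illustrative only)
toy_reviewer_df = pd.DataFrame({
    'Mondo ID': ['MONDO:0000001', 'MONDO:0000002', 'MONDO:0000003'],
    'columns_marked_as_true': [['no'], ['unsure'], ['yes']],
})

# Keep rows whose list contains 'yes' or 'no'
keep = toy_reviewer_df['columns_marked_as_true'].apply(
    lambda cols: any(value in cols for value in ('yes', 'no'))
)

# .copy() makes the filtered frame independent of the original,
# so the assignment below does not trigger SettingWithCopyWarning
toy_filtered = toy_reviewer_df[keep].copy()
toy_filtered['columns_marked_as_true'] = toy_filtered['columns_marked_as_true'].apply(', '.join)
```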


#### Step 3 (Cancer)

In [None]:
# Check whether there is agreement between Reviewer and Curator Report
# to show that the term will 'Leave the branch'
cancer_target_value = 'MONDO:0045024' #cancer or benign tumor(MONDO:0045024)

# Filter rows in reviewer_df where the reviewer marked only 'no'
# ('first_true_column' is never created above, so use 'columns_marked_as_true' from Step 1)
target_rows_reviewer_df = reviewer_df[reviewer_df['columns_marked_as_true'].apply(lambda cols: cols == ['no'])]

# Merge reviewer_df and report_df based on 'Mondo ID'
leave_cancer_merged_df = pd.merge(target_rows_reviewer_df, report_copy_cancer_df, left_on='Mondo ID', right_on='curie', how='left')
# leave_cancer_merged_df.head()


# Check for the existence of 'MONDO:0045024' in the merged row
leave_cancer_merged_df['target_exists_in_reportdf'] = leave_cancer_merged_df['latest_branches'].str.contains(cancer_target_value)

# Display the result
display(leave_cancer_merged_df[['Mondo ID', 'latest_branches', 'target_exists_in_reportdf']])


leave_cancer_merged_df['target_exists_in_reportdf'].value_counts()

# NOTE: A result of 'True' when checking for 'Leave the branch' means there is _disagreement_ 
# between reviewer and curator since the term _did_ remain in the branch.
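
The stay/leave checks follow the same steps for every branch: filter on the reviewer's answer, merge with the report, then test for the branch CURIE. A sketch of one reusable function is given below, to cut down on copy-paste slips between branches. `branch_membership` is a hypothetical helper. `answer_column` should name whichever column holds a single answer label per row, and treating a missing `latest_branches` value as "not in the branch" is an assumption.

```python
import pandas as pd


def branch_membership(reviewer_df, report_df, answer, target_curie,
                      answer_column='first_true_column'):
    """Join reviewer rows with a given answer to the report and flag whether
    the target branch CURIE still appears in 'latest_branches'."""
    selected = reviewer_df[reviewer_df[answer_column] == answer]
    merged = pd.merge(selected, report_df, left_on='Mondo ID',
                      right_on='curie', how='left')
    # regex=False: match the CURIE literally; na=False: a missing
    # 'latest_branches' value counts as "not in the branch"
    merged['target_exists_in_reportdf'] = merged['latest_branches'].str.contains(
        target_curie, regex=False, na=False
    )
    return merged
```

For a 'Leave the branch' check, `True` in `target_exists_in_reportdf` would again indicate disagreement, as described in the note above.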

---
---
### Check for Agreement for Cardiovascular Disorder

#### Step 1 (Cardio)


In [None]:
# Make a copy of report_df
report_copy_cardio_df = report_df.copy()

# Read in reviewer file
cardio_reviewer_df = pd.read_excel('data/Cardiovascular-Disorder_reviewer-DylanGration.xlsx', usecols=['Mondo ID', 'yes', 'no', 'unsure'])
cardio_reviewer_df.head()

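# Find which column has the first True value for each row
# NOTE: idxmax returns 'yes' for rows where nothing is marked, so check unanswered rows separately
cardio_reviewer_df['first_true_column'] = cardio_reviewer_df[['yes', 'no', 'unsure']].idxmax(axis=1)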

# Display the result
print(cardio_reviewer_df[['Mondo ID', 'first_true_column']])


# The form question followed the same pattern as in the cancer review, asking whether each disease belongs in this branch.
# Therefore, 'yes' means the term should 'Stay in the branch' and 'no' means it should 'Leave the Branch'.
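
One caveat with `idxmax(axis=1)` on the boolean answer columns: for a row where nothing is marked, every value is `False`, so the first column (`yes`) is returned. The toy example below illustrates this and shows one way to mask unanswered rows. The data are made up for illustration.

```python
import pandas as pd

# Toy answers: row 0 marked 'no', row 1 left blank by the reviewer
toy_answers = pd.DataFrame({
    'yes': [False, False],
    'no': [True, False],
    'unsure': [False, False],
})

# idxmax returns the first column holding the row maximum; for an all-False
# row the maximum is False, so the first column ('yes') is returned
naive = toy_answers.idxmax(axis=1)

# Mask rows without any True value so they are not mistaken for 'yes'
answered = toy_answers.any(axis=1)
safe = naive.where(answered)
```

Here `naive` reports `yes` for the blank row, while `safe` leaves it missing.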


#### Step 2 (Cardio)

In [None]:
# Check whether there is agreement between Reviewer and Curator Report
# to show that the term will 'Stay in the branch'
cardio_target_value = 'MONDO:0004995' #cardiovascular disorder(MONDO:0004995)

# Filter rows in reviewer_df where 'first_true_column' is 'yes'
target_rows_cardio_reviewer_df = cardio_reviewer_df[cardio_reviewer_df['first_true_column'] == 'yes']

# Merge reviewer_df and report_df based on 'Mondo ID'
stay_cardio_merged_df = pd.merge(target_rows_cardio_reviewer_df, report_copy_cardio_df, left_on='Mondo ID', right_on='curie', how='left')
# stay_cardio_merged_df.head()


# Check for the existence of 'MONDO:0004995' in the merged row
stay_cardio_merged_df['target_exists_in_reportdf'] = stay_cardio_merged_df['latest_branches'].str.contains(cardio_target_value)

# Display the result
display(stay_cardio_merged_df[['Mondo ID', 'latest_branches', 'target_exists_in_reportdf']])


stay_cardio_merged_df['target_exists_in_reportdf'].value_counts()


#### Step 3 (Cardio)

In [None]:
# Check whether there is agreement between Reviewer and Curator Report
# to show that the term will 'Leave the branch'
cardio_target_value = 'MONDO:0004995' #cardiovascular disorder(MONDO:0004995)

# Filter rows in cardio_reviewer_df where 'first_true_column' is 'no'
target_rows_reviewer_df = cardio_reviewer_df[cardio_reviewer_df['first_true_column'] == 'no']

# Merge reviewer_df and report_df based on 'Mondo ID'
leave_cardio_merged_df = pd.merge(target_rows_reviewer_df, report_copy_cardio_df, left_on='Mondo ID', right_on='curie', how='left')
# leave_cardio_merged_df.head()


# Check for the existence of cardio_target_value in the merged row
leave_cardio_merged_df['target_exists_in_reportdf'] = leave_cardio_merged_df['latest_branches'].str.contains(cardio_target_value)

# Display the result
display(leave_cardio_merged_df[['Mondo ID', 'latest_branches', 'target_exists_in_reportdf']])


leave_cardio_merged_df['target_exists_in_reportdf'].value_counts()

# NOTE: A result of 'True' when checking for 'Leave the branch' means there is _disagreement_ 
# between reviewer and curator since the term _did_ remain in the branch.

---
---
### Check for Agreement with Digestive data from Reviewer

#### Step 1 (Digestive)

In [None]:
# Make a copy of report_df
report_copy_digestive_df = report_df.copy()

# Read in reviewer file
digestive_reviewer_df = pd.read_excel('data/Digestive-system-disorder_reviewer-GiocondaAlyea.xlsx', usecols=['Mondo ID', 'yes', 'no', 'unsure'])
digestive_reviewer_df.head()

# Find which column has the first True value for each row
digestive_reviewer_df['first_true_column'] = digestive_reviewer_df[['yes', 'no', 'unsure']].idxmax(axis=1)

# Display the result
print(digestive_reviewer_df[['Mondo ID', 'first_true_column']])


# The form question followed the same pattern as in the cancer review, asking whether each disease belongs in this branch.
# Therefore, 'yes' means the term should 'Stay in the branch' and 'no' means it should 'Leave the Branch'.

#### Step 2 (Digestive)

In [None]:
# Check whether there is agreement between Reviewer and Curator Report
# to show that the term will 'Stay in the branch'
digestive_target_value = 'MONDO:0004335' #digestive system disorder(MONDO:0004335)

# Filter rows in digestive_reviewer_df where 'first_true_column' is 'yes'
target_rows_digestive_reviewer_df = digestive_reviewer_df[digestive_reviewer_df['first_true_column'] == 'yes']

# Merge reviewer_df and report_df based on 'Mondo ID'
stay_digestive_merged_df = pd.merge(target_rows_digestive_reviewer_df, report_copy_digestive_df, left_on='Mondo ID', right_on='curie', how='left')

# Check for the existence of 'MONDO:0004335' in the merged row
stay_digestive_merged_df['target_exists_in_reportdf'] = stay_digestive_merged_df['latest_branches'].str.contains(digestive_target_value)

# Display the result
display(stay_digestive_merged_df[['Mondo ID', 'latest_branches', 'target_exists_in_reportdf']])


stay_digestive_merged_df['target_exists_in_reportdf'].value_counts()


#### Step 3 (Digestive)

In [None]:
# Check whether there is agreement between Reviewer and Curator Report
# to show that the term will 'Leave the branch'
digestive_target_value = 'MONDO:0004335' #digestive system disorder(MONDO:0004335)

# Filter rows in digestive_reviewer_df where 'first_true_column' is 'no'
target_rows_reviewer_df = digestive_reviewer_df[digestive_reviewer_df['first_true_column'] == 'no']

# Merge reviewer_df and report_df based on 'Mondo ID'
leave_digestive_merged_df = pd.merge(target_rows_reviewer_df, report_copy_digestive_df, left_on='Mondo ID', right_on='curie', how='left')


# Check for the existence of digestive_target_value in the merged row
leave_digestive_merged_df['target_exists_in_reportdf'] = leave_digestive_merged_df['latest_branches'].str.contains(digestive_target_value)

# Display the result
display(leave_digestive_merged_df[['Mondo ID', 'latest_branches', 'target_exists_in_reportdf']])


leave_digestive_merged_df['target_exists_in_reportdf'].value_counts()

# NOTE: A result of 'True' when checking for 'Leave the branch' means there is _disagreement_ 
# between reviewer and curator since the term _did_ remain in the branch.

---
---
### Check for Agreement on Disorder of visual system

#### Step 1 (Visual)

In [9]:
# Make a copy of report_df
report_copy_visual_df = report_df.copy()

# Read in reviewer file
visual_reviewer_df = pd.read_excel('data/Disorder-of-visual-system_reviewer-GiocondaAlyea.xlsx', usecols=['Mondo ID', 'yes', 'no', 'unsure'])
visual_reviewer_df.head()

# Find which column has the first True value for each row
# ORIG Analysis
# visual_reviewer_df['first_true_column'] = visual_reviewer_df[['yes', 'no', 'unsure']].idxmax(axis=1)

# visual_reviewer_df[['yes', 'no', 'unsure']] = visual_reviewer_df[['yes', 'no', 'unsure']].astype(bool)
# visual_reviewer_df['first_true_column'] = visual_reviewer_df[['yes', 'no', 'unsure']].idxmax(axis=1, skipna=True)

visual_reviewer_df['true_columns'] = visual_reviewer_df[['yes', 'no', 'unsure']].apply(lambda row: row.index[row].tolist(), axis=1)


# Display the result
print(visual_reviewer_df[['Mondo ID', 'true_columns']])


visual_reviewer_df.head(len(visual_reviewer_df))

# The form question followed the same pattern as in the cancer review, asking whether each disease belongs in this branch.
# Therefore, 'yes' means the term should 'Stay in the branch' and 'no' means it should 'Leave the Branch'.

          Mondo ID   true_columns
0    MONDO:0007383          [yes]
1    MONDO:0017304          [yes]
2    MONDO:0017212  [yes, unsure]
3    MONDO:0010181          [yes]
4    MONDO:0008927          [yes]
..             ...            ...
128  MONDO:0020252             []
129  MONDO:0018152             []
130  MONDO:0018997             []
131  MONDO:0009485             []
132  MONDO:0008397             []

[133 rows x 2 columns]


Unnamed: 0,Mondo ID,yes,no,unsure,true_columns
0,MONDO:0007383,True,False,False,[yes]
1,MONDO:0017304,True,False,False,[yes]
2,MONDO:0017212,True,False,True,"[yes, unsure]"
3,MONDO:0010181,True,False,False,[yes]
4,MONDO:0008927,True,False,False,[yes]
...,...,...,...,...,...
128,MONDO:0020252,False,False,False,[]
129,MONDO:0018152,False,False,False,[]
130,MONDO:0018997,False,False,False,[]
131,MONDO:0009485,False,False,False,[]
