# Metadata Model Alignment Score (MAS) calculation.

This script calculates the MAS score for the metadata before and after harmonization by the Insilicom API, as well as for the manually curated gold standard.

We map the data to [NCI's Genomic Data Commons](https://gdc.cancer.gov/).

Date: 2025-04-11


The key function is `assess_data_fairness()`. It takes three parameters:

1. `input_path`: The path to the CSV or JSON file for which you want to calculate the MAS score.
2. `meta_path`: The path to the JSON file that defines the required data standard. This version uses sixteen required variables from GDC.
3. `file_type`: The input file type, either `csv` (a user upload) or `json` (output from our API).
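Since `file_type` has to agree with the file passed as `input_path`, it can be convenient to derive it from the file extension. The following is a minimal sketch under that assumption; `infer_file_type` is a hypothetical helper and is not part of `utils_fairness`.

```python
from pathlib import Path


def infer_file_type(input_path: str) -> str:
    """Hypothetical helper: derive the `file_type` argument from the extension."""
    suffix = Path(input_path).suffix.lower().lstrip(".")
    if suffix not in {"csv", "json"}:
        raise ValueError(
            f"Unsupported input type {suffix!r}; expected 'csv' or 'json'."
        )
    return suffix


print(infer_file_type("data/geo_input_20.csv"))
print(infer_file_type("result/api_all_sample_results.json"))
```

The returned string could then be passed as `file_type` in the calls shown in the sections below.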


## 0. Prep

In [9]:
import json
import os
import pandas as pd
from typing import Dict, List, Tuple, Set
from IPython.display import display

from utils_fairness import assess_data_fairness

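## 1. Define file paths.

Four files are used:

1. `gdc_meta_path`: the GDC dictionary.
2. `raw_input_file`: the GEO raw input JSON file.
3. `api_output_file`: the Insilicom API harmonized output JSON file.
4. `gold_output_file`: the gold standard JSON file.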

In [11]:
benchmark_dir = "/home/jhuang/project/misc/harmonization_benchmark"

gdc_meta_path = os.path.join(benchmark_dir, "data", "filtered_16_gdc_meta.json")

raw_input_file = os.path.join(benchmark_dir, "data", "geo_input_20.json")

api_output_file = os.path.join(benchmark_dir, "result", "api_all_sample_results.json")

gold_output_file = os.path.join(benchmark_dir, "data", "gold_standard_20.json")
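Because the paths above are absolute and machine-specific, a quick existence check before loading can save a confusing traceback later. This is a minimal sketch; `find_missing` is a hypothetical helper (not part of `utils_fairness`), and the paths in the example are placeholders standing in for the variables defined above.

```python
from pathlib import Path
from typing import Dict, List


def find_missing(paths: Dict[str, str]) -> List[str]:
    """Return the labels of all paths that do not point to an existing file."""
    return [label for label, path in paths.items() if not Path(path).is_file()]


# Placeholder paths; in the notebook these would be gdc_meta_path,
# raw_input_file, api_output_file and gold_output_file.
example_paths = {
    "gdc_meta_path": "data/filtered_16_gdc_meta.json",
    "raw_input_file": "data/geo_input_20.json",
}

missing = find_missing(example_paths)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All input files found.")
```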

## 2. Convert the raw input and gold standard JSON files to CSVs.

In [12]:
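# Load the raw GEO metadata (a dict keyed by sample ID)
with open(raw_input_file, 'r') as f:
    luad_data = json.load(f)

# Load the manually curated gold standard (same layout)
with open(gold_output_file, 'r') as f:
    gold_output = json.load(f)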

In [13]:
# Convert to DataFrame
df_raw = pd.DataFrame.from_dict(luad_data, orient="index")

print(f"The DataFrame 'df_raw' contains {df_raw.shape[0]} samples (rows) and {df_raw.shape[1]} variables (columns).")

df_raw.to_csv(os.path.join(benchmark_dir, "data", "geo_input_20.csv"), index=True)

The DataFrame 'df_raw' contains 20 samples (rows) and 148 variables (columns).
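The wide shape of the raw table follows from the conversion: each top-level key becomes a row, and the column set is the union of the keys found in any sample. The sketch below illustrates that layout with the standard library only. The sample values are made up, the `sample_id` header is a hypothetical label for the index column, and blanks stand in for the `NaN` values that pandas would insert.

```python
import csv
import io

# Toy stand-in for the GEO input: top-level keys are sample IDs (made-up values).
samples = {
    "sample_1": {"tissue": "lung", "Sex": "F"},
    "sample_2": {"tissue": "lung", "age": "61"},
}

# The union of all keys across samples, in first-seen order, is the column set.
columns = []
for record in samples.values():
    for key in record:
        if key not in columns:
            columns.append(key)

# Write one row per sample; keys a sample lacks are left as empty cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sample_id"] + columns, restval="")
writer.writeheader()
for sample_id, record in samples.items():
    writer.writerow({"sample_id": sample_id, **record})

print(buffer.getvalue())
```

Two samples with two fields each already produce three columns here, which is how twenty heterogeneous GEO records can expand to well over a hundred variables.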


In [15]:
# Convert to DataFrame
df_gold = pd.DataFrame.from_dict(gold_output, orient="index")
df_gold.rename(columns={'gender': 'sex'}, inplace=True)

print(f"The DataFrame 'df_gold' contains {df_gold.shape[0]} samples (rows) and {df_gold.shape[1]} variables (columns).")

df_gold.to_csv(os.path.join(benchmark_dir, "data", "gold_standard_20.csv"), index=True)

The DataFrame 'df_gold' contains 20 samples (rows) and 23 variables (columns).


## 3. Calculate MAS score for raw input metadata.

In [None]:
# Use a separate name for the CSV path so the JSON path in `raw_input_file` is not overwritten
raw_input_csv = os.path.join(benchmark_dir, "data", "geo_input_20.csv")

counts_df, fairness_score = assess_data_fairness(input_path=raw_input_csv,
                                                 meta_path=gdc_meta_path,
                                                 file_type="csv")
# Display the results
print("Counts DataFrame:")
display(counts_df.style.hide(axis="index"))
print(f"MAS Score: {fairness_score}")

Counts DataFrame:


| Variable | Matched counts | Total counts |
|---|---|---|
| race | 0 | 20 |
| sex | 1 | 20 |


MAS Score: 0.003125
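One way to read the 0.003125 reported above: if the score is taken to be the sum of the per-variable match rates divided by the sixteen required variables, the raw-input counts reproduce it exactly. This scoring rule is an assumption made for illustration. The actual logic lives in `utils_fairness`, which is not shown here, and `mean_match_rate` is a hypothetical name.

```python
REQUIRED_VARIABLES = 16  # number of required GDC variables stated in the introduction

# (matched, total) per variable, copied from the raw-input counts table
raw_counts = {"race": (0, 20), "sex": (1, 20)}


def mean_match_rate(counts, n_required):
    """Assumed scoring rule: sum of per-variable match rates over n_required.

    Required variables that are absent from `counts` contribute 0.
    """
    return sum(matched / total for matched, total in counts.values()) / n_required


print(mean_match_rate(raw_counts, REQUIRED_VARIABLES))  # 0.003125
```

Under this reading, a required variable that is missing from the input lowers the score just as much as one that is present but never matches.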


## 4. Calculate MAS score for API harmonized metadata.


In [17]:
counts_df, fairness_score = assess_data_fairness(input_path = api_output_file, 
                                                 meta_path = gdc_meta_path, file_type="json")

# Display the results
print("Counts DataFrame:")
display(counts_df.style.hide(axis="index"))
print(f"MAS Score: {fairness_score}")

Counts DataFrame:


| Variable | Matched counts | Total counts |
|---|---|---|
| age_at_diagnosis | 12 | 20 |
| days_to_follow_up | 0 | 20 |
| disease_type | 20 | 20 |
| gene_symbol | 2 | 20 |
| molecular_analysis_method | 18 | 20 |
| primary_diagnosis | 20 | 20 |
| primary_site | 20 | 20 |
| race | 1 | 20 |
| sex | 12 | 20 |
| tissue_or_organ_of_origin | 20 | 20 |


MAS Score: 0.40625


## 5. Calculate MAS score for the manually curated metadata.


In [None]:
gold_input_file = os.path.join(benchmark_dir, "data", "gold_standard_20.csv")

counts_df, fairness_score = assess_data_fairness(input_path = gold_input_file,
                                                 meta_path = gdc_meta_path, file_type="csv")

# Display the results
print("Counts DataFrame:")
display(counts_df.style.hide(axis="index"))
print(f"MAS Score: {fairness_score}")

Counts DataFrame:


| Variable | Matched counts | Total counts |
|---|---|---|
| age_at_diagnosis | 12 | 20 |
| days_to_follow_up | 4 | 20 |
| disease_type | 19 | 20 |
| ethnicity | 0 | 20 |
| gene_symbol | 3 | 20 |
| molecular_analysis_method | 20 | 20 |
| primary_diagnosis | 19 | 20 |
| primary_site | 20 | 20 |
| race | 1 | 20 |
| sex | 12 | 20 |


MAS Score: 0.421875
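To put the three reported scores side by side, each can be expressed as a fraction of the gold-standard score. The sketch below only rearranges the numbers printed in sections 3 to 5; `relative_to_gold` is a hypothetical helper, and treating the gold standard as the reference point is a choice made for this illustration.

```python
# Scores as reported in the outputs of sections 3, 4 and 5
scores = {"raw": 0.003125, "api": 0.40625, "gold": 0.421875}


def relative_to_gold(all_scores):
    """Express every score as a fraction of the gold-standard score."""
    gold = all_scores["gold"]
    return {name: value / gold for name, value in all_scores.items()}


for name, ratio in relative_to_gold(scores).items():
    print(f"{name}: {ratio:.1%} of the gold-standard score")
```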
