This code was generated with LLM assistance as described in the paper "Leveraging Large Language Models for Academic Conference Organization" (see Citation in README). When writing the initial Python code for paper-reviewer matching, the LLM was instructed only to generate generic code and had no access to any specific paper or reviewer data. It neither viewed any paper nor directly matched any paper with a reviewer. The conference organizers then revised the code and ran it on AMIA_PC_domains_keywords.csv and AMIA_submission.csv. We are unable to share those files because they contain proprietary information under AMIA terms. For a future conference, however, the code should be easy to adapt (e.g., by updating the CSV headers).

Note that the process is human-in-the-loop: the conference chair, the vice chairs, and AMIA staff closely monitored, interacted with, and revised the LLM-assisted code as needed for automation tasks such as paper-reviewer assignment and duplicate paper detection.
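
Because the two input files cannot be shared, the sketch below builds tiny stand-ins with the column names that the cells in this notebook read (`PC_ID`, `FIRST_NAME`, `LAST_NAME`, `org_domain`, `KEYWORDS_SELECTION`, `submission_id`, `submission_TITLE`). Every value is invented, the real files may contain additional columns, and treating the submission's `org_domain` as a single string is an assumption.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical stand-in for AMIA_PC_domains_keywords.csv; every value is invented.
pc_csv = io.StringIO('''PC_ID,FIRST_NAME,LAST_NAME,org_domain,KEYWORDS_SELECTION
1,Ada,Example,example.edu,"['Data quality', 'Ontologies']"
2,Bo,Sample,sample.org,"['FHIR']"
''')

# Hypothetical stand-in for AMIA_submission.csv; every value is invented.
submission_csv = io.StringIO('''submission_id,submission_TITLE,org_domain,KEYWORDS_SELECTION
S1,A toy title about ontologies,sample.org,"['Ontologies']"
S2,A toy title about FHIR,example.edu,"['FHIR', 'Data standards']"
''')

pc_demo = pd.read_csv(pc_csv)
submission_demo = pd.read_csv(submission_csv)

# KEYWORDS_SELECTION is stored as the text of a Python list, so parse it back into a list.
pc_demo['KEYWORDS_SELECTION'] = pc_demo['KEYWORDS_SELECTION'].apply(literal_eval)
submission_demo['KEYWORDS_SELECTION'] = submission_demo['KEYWORDS_SELECTION'].apply(literal_eval)

print(pc_demo.head())
print(submission_demo.head())
```

Adapting the notebook to another conference would then mostly mean renaming these columns, either in the CSV files or in the code.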

In [None]:
import networkx as nx

In [None]:
keywords = """Advanced data visualization tools and techniques
Bioimaging techniques and applications
Biomarker discovery and development
Biomedical informatics and data science workforce education
Citizen Science and democratization of AI and informatics
Clinical decision support for translational/data science interventions
Clinical genomics/omics and interventions based on omics data
Clinical and research data collection, curation, preservation, or sharing
Clinical trials innovations
Cohort discovery
Collaborative workflow systems
Data commons
Data-driven research and discovery
Data Integration
Data Literacy and numeracy
Data/system integration, standardization and interoperability
Data mining and knowledge discovery
Data quality
Data security and privacy
Data sharing / interoperability
Data standards
Data transformation/ETL
Digital research enterprise
Drug discovery, repurposing, and side-effect discovery
Education and Training
EHR-based phenotyping
Enterprise data warehouse/Data lake
Epigenomics
Ethical, legal, and social issues
Exposome and data integration
Fairness and disparity research in health informatics
FHIR
Genomics/Omic data interpretation
Genotype-phenotype association studies (including GWAS)
Geographical information systems (GIS)
Health Information and biomedical data dissemination strategies
Health literacy issues and solutions
Implementation Science
Infectious disease modeling
Influence of informatics on pharma and insurance industry
Informatics for cancer immunotherapy
Informatics research/biomedical informatics research methods
Integrating omics, clinical, and imaging data
Integrative omic analysis
Knowledge representation, management, or engineering
Learning healthcare system
Machine learning and predictive modeling
Medical Imaging
Measuring outcomes
Mobile Health, wearable devices and patient-generated health data
Natural Language Processing
Ontologies
Open Science for biomedical research and translational medicine
Outcomes research, clinical epidemiology, population health
Patient centered research and care
Pharmacogenomics
Phenomics and phenome-wide association studies
Proactive machine learning and reinforcement learning
Public health informatics
Real-world evidence and policy making
Recruitment technologies
Reproducible research methods and tools
Single Cell Analysis
Secondary use of EHR data
Social determinants of health
Stakeholder (i.e., patients or community) engagement
Sustainable research data infrastructure
Systems biology and network analysis
Transcriptomics
Telehealth and remote care"""
keywords = keywords.split('\n')

In [None]:
from ast import literal_eval
import pandas as pd
pc = pd.read_csv('AMIA_PC_domains_keywords.csv') # program committee (reviewer) information
pc.KEYWORDS_SELECTION = pc.KEYWORDS_SELECTION.apply(literal_eval)
pc.head()

In [None]:
submission = pd.read_csv('AMIA_submission.csv') # submission information
submission.KEYWORDS_SELECTION = submission.KEYWORDS_SELECTION.apply(literal_eval)
# Keywords selected for a submission that do not appear in the keyword list defined above
submission['diff'] = submission.KEYWORDS_SELECTION.apply(lambda x: set(x).difference(set(keywords)))

submission.head()

In [None]:
import pandas as pd
from collections import Counter
import string

# Function to normalize titles
def normalize_title(title):
    lower = title.lower()
    return lower.translate(str.maketrans('', '', string.punctuation))

# Normalize titles: lowercase and remove punctuation
submission['normalized_title'] = submission['submission_TITLE'].apply(normalize_title)

# Find exact duplicates
title_counts = Counter(submission['normalized_title'])
duplicates = {title for title, count in title_counts.items() if count > 1}

# Create a column to flag duplicates
submission['is_duplicate'] = submission['normalized_title'].apply(lambda x: x in duplicates)

# Optionally, display or process the DataFrame with flagged duplicates
submission[['submission_TITLE', 'is_duplicate']]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Vectorize the titles
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(submission['normalized_title'])

# Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix)

# Flag near-duplicates (adjust threshold as needed)
threshold = 0.8  # Example threshold
# Exact matches (similarity 1) are kept; self-pairs are skipped by the i < j check below
near_duplicates = np.where(cosine_sim > threshold)

# Print each flagged pair once
for i, j in zip(*near_duplicates):
    if i < j:  # skip self-pairs (i == j) and mirrored pairs (j, i)
        first, second = submission.iloc[i], submission.iloc[j]
        print(f"Potential near-duplicate pair {first['submission_id']}-{second['submission_id']}: "
              f"'{first['submission_TITLE']}' and '{second['submission_TITLE']}'")
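
The `i < j` check is needed because a cosine similarity matrix is symmetric and has ones on its diagonal, so a plain threshold test reports every title against itself and every pair twice. As an illustration only, the sketch below uses a hand-written 3x3 matrix (invented values standing in for `cosine_sim`) and masks the diagonal and lower triangle with `np.triu` before thresholding, which gives the same pairs without the explicit check.

```python
import numpy as np

# Hand-written similarity matrix standing in for `cosine_sim`; values are invented.
sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
threshold = 0.8

# k=1 zeroes the diagonal and everything below it, so each pair is seen once.
upper = np.triu(sim, k=1)
pairs = [(int(i), int(j)) for i, j in np.argwhere(upper > threshold)]
print(pairs)  # [(0, 1)]
```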

In [None]:
N_pc = len(pc)
N_submission = len(submission)

# Per-reviewer cap: floor of the average number of submissions per PC member, plus one
max_workload = N_submission//N_pc + 1

In [None]:
N_pc, N_submission, max_workload
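
To make the cap concrete, the sketch below compares the floor-plus-one formula used above with a plain ceiling on invented counts. The helper names are hypothetical. The two agree unless the number of submissions is an exact multiple of the number of PC members, in which case the formula above allows one extra paper per reviewer.

```python
import math

def workload_cap(n_submission, n_pc):
    """Cap as computed in this notebook: floor division plus one."""
    return n_submission // n_pc + 1

def workload_ceil(n_submission, n_pc):
    """Alternative for comparison: smallest cap that can cover every submission."""
    return math.ceil(n_submission / n_pc)

# Invented counts, for illustration only.
for n_submission, n_pc in [(100, 30), (90, 30)]:
    print(n_submission, n_pc, workload_cap(n_submission, n_pc), workload_ceil(n_submission, n_pc))
# 100 30 4 4
# 90 30 4 3
```

The extra slot can be useful slack, since the organization and keyword constraints applied later remove many reviewer-submission pairs.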

In [None]:
# Duplicate each PC member max_workload times
pc_expanded = pd.DataFrame(
    pc.loc[pc.index.repeat(max_workload)].values, 
    columns=pc.columns
)
pc_expanded['PC_ID'] = pc_expanded['PC_ID'].astype(str) + '_' + (pc_expanded.groupby('PC_ID').cumcount() + 1).astype(str)
pc_expanded

In [None]:
# Maximum Bipartite Matching

# Create a bipartite graph
B = nx.Graph()

# Add nodes with the attribute 'bipartite'
B.add_nodes_from(pc_expanded['PC_ID'], bipartite=0)  # PC nodes
B.add_nodes_from(submission['submission_id'], bipartite=1)  # Submission nodes

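# Add an edge when the reviewer's org_domain does not appear in the submission's
# org_domain (i.e., different organizations) and the two share at least one keyword
for _, pc_row in pc_expanded.iterrows():
    for _, submission_row in submission.iterrows():
        if pc_row['org_domain'] not in submission_row['org_domain']:
            # Count overlapping keywords
            overlap = len(set(pc_row['KEYWORDS_SELECTION']) & set(submission_row['KEYWORDS_SELECTION']))
            if overlap > 0:
                # Store the overlap as the edge weight. Note: maximum_matching below
                # maximizes the number of matched pairs and does not use this weight.
                B.add_edge(pc_row['PC_ID'], submission_row['submission_id'], weight=overlap)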

            
# Iterate over the edges in the graph
# for (u, v, wt) in B.edges(data='weight'):
#     print(f"Edge from {u} to {v} with weight: {wt}")

# Apply maximum (cardinality) matching; the returned dict lists each matched pair in both directions
max_match = nx.bipartite.maximum_matching(B, top_nodes=set(pc_expanded['PC_ID']))

# Keep only the entries keyed by submission (submission -> PC)
submission_ids = set(submission['submission_id'])
filtered_match = {k: v for k, v in max_match.items() if k in submission_ids}
print(max_match)

print(filtered_match)

# Convert the matching result to a DataFrame
results_df = pd.DataFrame(list(filtered_match.items()), columns=['submission_id', 'PC_ID'])

# Sort the DataFrame based on submission_id
results_df.sort_values(by='submission_id', inplace=True)

# Reset index of the DataFrame
results_df.reset_index(drop=True, inplace=True)

# Join results_df with pc_expanded on the PC_ID column
# This join will add the FIRST_NAME and LAST_NAME columns to the results_df
merged_df = pd.merge(results_df, pc_expanded[['PC_ID', 'FIRST_NAME', 'LAST_NAME']], on='PC_ID', how='left')

# Reorder columns; copy so that later column assignments do not raise SettingWithCopyWarning
final_df = merged_df[['PC_ID', 'FIRST_NAME', 'LAST_NAME', 'submission_id']].copy()
final_df.reset_index(drop=True, inplace=True)

# Save the assignments to CSV
final_df.to_csv('assignment_results.csv', index=False)
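
The keyword overlap is stored on each edge as `weight`, but `nx.bipartite.maximum_matching` only maximizes the number of matched pairs and does not read it. If the overlap should influence who reviews what, one option (not what this notebook does) is NetworkX's general-graph routine `nx.max_weight_matching`. The sketch below runs it on a toy graph. The node names and weights are invented.

```python
import networkx as nx

# Toy graph: two reviewer slots and two submissions; weights are invented overlap counts.
G = nx.Graph()
G.add_nodes_from(['pc1_1', 'pc2_1'], bipartite=0)
G.add_nodes_from(['S1', 'S2'], bipartite=1)
G.add_weighted_edges_from([
    ('pc1_1', 'S1', 3),
    ('pc1_1', 'S2', 1),
    ('pc2_1', 'S1', 1),
    ('pc2_1', 'S2', 2),
])

# maxcardinality=True considers only maximum-cardinality matchings
# and returns the one with the largest total weight.
weighted = nx.max_weight_matching(G, maxcardinality=True, weight='weight')

# Edges come back as tuples whose internal order is arbitrary; normalize before comparing.
pairs = {frozenset(edge) for edge in weighted}
total = sum(G[u][v]['weight'] for u, v in weighted)
print(pairs, total)
```

Both possible assignments here match two pairs, but only pc1_1 with S1 and pc2_1 with S2 reaches the total overlap of 5. This routine is designed for general graphs and may be noticeably slower than the bipartite matching used above on a graph of realistic size.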

In [None]:
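# Strip the "_<n>" slot suffix added during expansion to recover the original PC_ID
# (rsplit with n=1 removes only the last suffix, so IDs that contain "_" stay intact)
final_df['PC_ID'] = final_df['PC_ID'].str.rsplit('_', n=1).str[0]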

# Group by PC_ID and collapse submission_id into a list
grouped_df = final_df.groupby('PC_ID')['submission_id'].apply(list).reset_index()

# Display or use the grouped DataFrame
print(grouped_df)

In [None]:
# Group by PC_ID and aggregate
grouped_df = final_df.groupby('PC_ID').agg({
    'submission_id': lambda x: list(x),
    'FIRST_NAME': 'first',
    'LAST_NAME': 'first'
}).reset_index()
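
A maximum matching does not necessarily cover every submission: a submission with no eligible reviewer (no shared keyword, or only same-organization reviewers) has no edge in the graph and is left out of the result. The sketch below shows one way to list such submissions so they can be assigned by hand. The toy frames stand in for `submission` and `results_df`, and all IDs are invented.

```python
import pandas as pd

# Toy stand-ins for `submission` and `results_df`; all IDs are invented.
submission_toy = pd.DataFrame({'submission_id': ['S1', 'S2', 'S3']})
results_toy = pd.DataFrame({'submission_id': ['S1', 'S3'], 'PC_ID': ['7_1', '9_2']})

# Submissions that do not appear in the matching result
unassigned = submission_toy.loc[
    ~submission_toy['submission_id'].isin(results_toy['submission_id']),
    'submission_id',
].tolist()
print(unassigned)  # ['S2']
```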

