In [1]:
from biomed_eval import BiomedicalRetrievalEvaluator

data_path = 'intermedia_data/gold_df.pkl'  # pickled DataFrame of labeled edges

In [2]:
evaluator = BiomedicalRetrievalEvaluator(data_path, predicate='biolink:treats')
evaluator.check_embedding_cache_status()

Loaded 1778 edges total
After excluding 'maybe': 1432 edges
After filtering for predicate 'biolink:treats': 375 edges
Label distribution:
abstract_support?
yes    200
no     175
Name: count, dtype: int64

EMBEDDING CACHE STATUS

concat_general:
  Cached examples: 742
  Cache size: 11.30 MB

concat_biomedical:
  Cached examples: 742
  Cache size: 22.33 MB

ai_general:
  Cached examples: 742
  Cache size: 11.30 MB

ai_biomedical:
  Cached examples: 742
  Cache size: 22.32 MB

TOTAL:
  Cached examples: 2968
  Total cache size: 67.25 MB


In [5]:
evaluator.split_data()
# evaluator.precompute_all_embeddings()
validation_result = evaluator.run_validation_pipeline()
test_result = evaluator.run_test_evaluation()


Splitting data into validation/test by predicate and label...
Found 1 unique predicates: ['biolink:treats']
  biolink:treats + yes: 100 val, 100 test
  biolink:treats + no: 88 val, 87 test

Final split sizes:
  val: 188 edges
    Labels: {'yes': 100, 'no': 88}
    Predicates: {'biolink:treats': 188}
  test: 187 edges
    Labels: {'yes': 100, 'no': 87}
    Predicates: {'biolink:treats': 187}
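
The split sizes above are consistent with a 50/50 split performed separately inside each (predicate, label) group, with the odd example going to validation. The following is a minimal sketch of that idea, not the `biomed_eval` implementation: the `predicate` and `label` keys, the `stratified_split` name, and the round-up rule are assumptions inferred from the printed counts.

```python
import math
import random
from collections import defaultdict


def stratified_split(rows, val_fraction=0.5, seed=0):
    """Split rows into (val, test) separately within each (predicate, label) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["predicate"], row["label"])].append(row)

    rng = random.Random(seed)
    val, test = [], []
    for key in sorted(groups):
        members = list(groups[key])
        rng.shuffle(members)
        # Assumption: an odd group gives its extra example to validation.
        n_val = math.ceil(len(members) * val_fraction)
        val.extend(members[:n_val])
        test.extend(members[n_val:])
    return val, test


# Toy rows with the same label counts as the 'biolink:treats' subset.
rows = (
    [{"predicate": "biolink:treats", "label": "yes"} for _ in range(200)]
    + [{"predicate": "biolink:treats", "label": "no"} for _ in range(175)]
)
val, test = stratified_split(rows)
print(len(val), len(test))  # prints: 188 187
```

Splitting per group keeps the yes/no ratio nearly identical in both halves, which matters here because thresholds are tuned on validation and reported on test.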

VALIDATION PHASE: Comparing configurations
Filtered by predicate: 'biolink:treats'

--- Configuration: concat_general ---

Computing similarities: concat + general
  Checking embedding cache: concat + general
  ✓ Embeddings: 188 from cache, 0 newly computed
Best aggregation: max (PR-AUC: 0.8281)
  Optimal threshold: 0.6082
    Precision: 0.7802
    Recall: 0.7100

--- Configuration: concat_biomedical ---

Computing similarities: concat + biomedical
  Checking embedding cache: concat + biomedical
  ✓ Embeddings: 188 from cache, 0 newly computed
Best aggregation: max (PR-AUC: 0.7796)
  Optimal thres
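
Each configuration above reports an "optimal threshold" together with its precision and recall. The truncated output does not say which criterion defines "optimal", so the sketch below assumes F1 is maximized over candidate thresholds; `best_f1_threshold` and the toy scores are illustrative, not taken from `biomed_eval`.

```python
def best_f1_threshold(scores, labels):
    """Return (threshold, precision, recall, f1) maximizing F1 for `score >= threshold`."""
    total_pos = sum(labels)
    best = (None, 0.0, 0.0, -1.0)
    # Every distinct score is a candidate cut-off.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        if tp == 0:
            continue  # precision and F1 are undefined or zero here
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[3]:
            best = (t, precision, recall, f1)
    return best


# Toy abstract-level scores (1 = supported, 0 = not supported).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
threshold, precision, recall, f1 = best_f1_threshold(scores, labels)
print(threshold, precision, recall)
```

Because the threshold is fitted on the validation half only, the test-phase numbers are the ones to trust when comparing configurations.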

In [7]:
evaluator.save_results(filepath='evaluation_results.json')


DETERMINING BEST SENTENCE-LEVEL THRESHOLD

Analyzing AI + General:

SENTENCE-LEVEL THRESHOLD ANALYSIS
Purpose: Find optimal similarity threshold for vector search filtering
Filtered by predicate: 'biolink:treats'

Using configuration: ai_general

RECOMMENDED MILVUS PARAMETERS:

For Milvus range search with COSINE metric:
  (Returns entities with similarity in [radius, range_filter])

  For BALANCED (Max F1):
    radius: 0.6500
    range_filter: 1.0
    → Precision: 0.4393, Recall: 0.4691, F1: 0.4537
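
The recommended `radius` and `range_filter` values can be packed into the search-parameter dictionary that Milvus expects for a range search. This is a minimal sketch: the helper name is made up, and the collection name, URI, and `query_vector` in the commented-out usage are placeholders. Milvus's documentation describes the lower bound as exclusive for COSINE (radius < similarity <= range_filter), so it is worth verifying boundary behaviour against the Milvus version in use.

```python
def milvus_range_search_params(radius, range_filter=1.0, metric_type="COSINE"):
    """Build the search-parameter dict for a Milvus range search."""
    if not radius < range_filter:
        raise ValueError("for COSINE, radius must be smaller than range_filter")
    return {
        "metric_type": metric_type,
        "params": {"radius": radius, "range_filter": range_filter},
    }


# Balanced (max F1) setting reported above for ai_general.
search_params = milvus_range_search_params(radius=0.65)
print(search_params)

# Hypothetical usage with pymilvus (needs a running Milvus instance):
# from pymilvus import MilvusClient
# client = MilvusClient(uri="http://localhost:19530")
# hits = client.search(
#     collection_name="sentences",   # placeholder collection name
#     data=[query_vector],           # placeholder query embedding
#     limit=50,
#     search_params=search_params,
# )
```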


Analyzing Concat + General:

SENTENCE-LEVEL THRESHOLD ANALYSIS
Purpose: Find optimal similarity threshold for vector search filtering
Filtered by predicate: 'biolink:treats'

Using configuration: concat_general
  Recomputing similarities for requested configuration...

Computing similarities: concat + general
  Checking embedding cache: concat + general
  ✓ Embeddings: 187 from cache, 0 newly computed

RECOMMENDED MILVUS PARAMETERS:

For Milvus range search with COSINE metric:
  (Returns

In [9]:
evaluator = BiomedicalRetrievalEvaluator(data_path, predicate='biolink:affects')
# evaluator.check_embedding_cache_status()
evaluator.split_data()
# evaluator.precompute_all_embeddings()
validation_result = evaluator.run_validation_pipeline()
test_result = evaluator.run_test_evaluation()
evaluator.save_results(filepath='evaluation_results.json')

Loaded 1778 edges total
After excluding 'maybe': 1432 edges
After filtering for predicate 'biolink:affects': 367 edges
Label distribution:
abstract_support?
no     184
yes    183
Name: count, dtype: int64

Splitting data into validation/test by predicate and label...
Found 1 unique predicates: ['biolink:affects']
  biolink:affects + yes: 92 val, 91 test
  biolink:affects + no: 92 val, 92 test

Final split sizes:
  val: 184 edges
    Labels: {'yes': 92, 'no': 92}
    Predicates: {'biolink:affects': 184}
  test: 183 edges
    Labels: {'no': 92, 'yes': 91}
    Predicates: {'biolink:affects': 183}

VALIDATION PHASE: Comparing configurations
Filtered by predicate: 'biolink:affects'

--- Configuration: concat_general ---

Computing similarities: concat + general
  Checking embedding cache: concat + general
  ✓ Embeddings: 184 from cache, 0 newly computed
Best aggregation: max (PR-AUC: 0.7213)
  Optimal threshold: 0.4246
    Precision: 0.6283
    Recall: 0.7717

--- Configuration: concat_biom

In [11]:
# Option 1: Get all predicates' parameters
all_params = evaluator.get_latest_parameters('evaluation_results.json')


Loaded evaluation history from: evaluation_results.json
Total evaluations: 2

LATEST PARAMETERS FOR ALL PREDICATES

biolink:affects:
  Updated: 2025-10-21T15:31:13.141028
  N examples: 367
  Sentence threshold: 0.550 (ai + general)
  Abstract threshold: 0.547 (ai + top2_mean)
  Test MRR: 0.652, PR-AUC: 0.768

biolink:treats:
  Updated: 2025-10-21T15:31:05.380874
  N examples: 375
  Sentence threshold: 0.600 (concat + general)
  Abstract threshold: 0.647 (ai + max)
  Test MRR: 0.886, PR-AUC: 0.743
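
The abstract thresholds above are paired with an aggregation (`max` or `top2_mean`) that turns per-sentence similarities into one abstract-level score. A minimal sketch of the two aggregations follows, assuming `top2_mean` means the mean of the two highest sentence similarities; that reading comes from the name only, and the `aggregate` function and similarity values are illustrative.

```python
def aggregate(similarities, method="max"):
    """Collapse per-sentence similarities into a single abstract-level score."""
    if not similarities:
        raise ValueError("need at least one sentence similarity")
    ordered = sorted(similarities, reverse=True)
    if method == "max":
        return ordered[0]
    if method == "top2_mean":
        top = ordered[:2]  # falls back to one value for single-sentence abstracts
        return sum(top) / len(top)
    raise ValueError(f"unknown aggregation: {method}")


# Toy similarities between one edge and the sentences of one abstract.
sentence_sims = [0.71, 0.42, 0.63, 0.18]

score_max = aggregate(sentence_sims, "max")
score_top2 = aggregate(sentence_sims, "top2_mean")

# Classify with the thresholds reported above for each predicate.
supports_treats = score_max >= 0.647    # biolink:treats, ai + max
supports_affects = score_top2 >= 0.547  # biolink:affects, ai + top2_mean
print(score_max, score_top2, supports_treats, supports_affects)
```

`max` rewards a single strongly matching sentence, while `top2_mean` asks for some corroboration from a second sentence, which may explain why different predicates end up preferring different aggregations.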


In [15]:
# Option 2: Get specific predicate's parameters
treats_params = evaluator.get_latest_parameters(
    'evaluation_results.json',
    predicate='biolink:treats',
)


Loaded evaluation history from: evaluation_results.json
Total evaluations: 2

LATEST PARAMETERS FOR PREDICATE: 'biolink:treats'
Timestamp: 2025-10-21T15:31:05.380874
Based on 375 examples (188 val, 187 test)

Sentence-Level Search:
  Model: sentence-transformers/all-MiniLM-L6-v2
  Representation: concat
  Threshold: 0.600
  Expected F1: 0.454
  Expected Precision: 0.394
  Expected Recall: 0.537
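
As a quick arithmetic check, F1 is the harmonic mean of precision and recall, so the expected F1 can be recomputed from the two values printed above. This is a sketch with an illustrative helper name; because the inputs are already rounded to three decimals, agreement with the reported 0.454 is only to within rounding.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported above for the sentence-level search.
f1 = f1_from_precision_recall(0.394, 0.537)
print(f"{f1:.4f}")
```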

Abstract-Level Classification:
  Model: sentence-transformers/all-MiniLM-L6-v2
  Representation: ai
  Aggregation: max
  Threshold: 0.647

Test Performance:
  MRR: 0.886
  Recall@3: 0.849
  PR-AUC: 0.743
  Classification F1: 0.683
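
For reference, the ranking metrics above can be sketched as follows. The exact definitions used by `biomed_eval` are not shown in this notebook, so this assumes the common ones: MRR averages 1/rank of the first relevant result per query, and Recall@3 is taken here as the fraction of queries with at least one relevant result in the top 3. Function names and toy rankings are illustrative.

```python
def reciprocal_rank(ranked_relevance):
    """1/rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def recall_at_k(queries, k=3):
    """Assumed definition: share of queries with a relevant item in the top k."""
    hits = sum(1 for q in queries if any(q[:k]))
    return hits / len(queries)


# Each inner list marks relevance of the ranked results for one query.
queries = [
    [True, False, False],
    [False, False, True, False],
    [False, False, False, True],
]
mrr = mean_reciprocal_rank(queries)
recall3 = recall_at_k(queries, k=3)
print(round(mrr, 3), round(recall3, 3))
```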
