# Experimenting with Approaches to Embedding and Comparing Text

# Setup

In [1]:
#standard library
import os
import sys
import warnings
from pathlib import Path

#third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv

#suppress warnings before importing the local module, as in the original cell order
warnings.filterwarnings("ignore")

#local
from approach_tester import compare_approaches, COMBINATIONS_DICT, RECIPIENTS_SECTIONS

#plot palette
colours = sns.color_palette("Set2")

#get keys from env
load_dotenv()
url = os.getenv("SUPABASE_URL")
key = os.getenv("SUPABASE_KEY")

----

# Retrieving Data from Checkpoints

I will reuse the checkpoints created in the model selection notebook and use the same data to explore the best approach to comparing the similarity of embeddings.

In [2]:
#get checkpoint folder
checkpoint_folder = Path("./7.1_checkpoints/")

recipients_df = pd.read_pickle(checkpoint_folder / "recipients_df.pkl")
funders_df = pd.read_pickle(checkpoint_folder / "funders_df.pkl")
embedding_pairs = pd.read_pickle(checkpoint_folder / "embedding_pairs.pkl")

In [3]:
#check dfs
print(f"Recipients: {recipients_df.shape} | Funders: {funders_df.shape} | Evaluation Pairs: {embedding_pairs.shape}")

Recipients: (12, 6) | Funders: (12, 14) | Evaluation Pairs: (12, 6)


----

# Approach Evaluation - Recipient Text as Independent Variable

Based on the model selection notebook, `all-roberta-large-v1` performed best when comparing funder and recipient text. Now I will test which of the following approaches to embedding and comparison works best:
1. Like-for-like comparisons (i.e. funder activities vs recipient activities, or objectives vs objectives)
2. Combining some columns (e.g. activities + objectives)
3. Combining all available columns
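
To make the three options concrete, here is a minimal sketch of how several text columns might be merged into one string before embedding. The helper `combine_text` and the toy funder record are hypothetical; the real logic lives in `approach_tester` and is not shown in this notebook, so the separator and the handling of missing values are assumptions. Plain dicts are used instead of a DataFrame only to keep the sketch dependency-free.

```python
def combine_text(row, columns, sep=" "):
    """Join the non-empty text found in `columns` of `row` into one string.

    Hypothetical helper: missing or blank sections are skipped so that an
    absent column does not add stray separators to the text being embedded.
    """
    parts = []
    for col in columns:
        value = row.get(col)
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return sep.join(parts)


# Toy funder record (illustrative text, not taken from the checkpoints).
funder = {
    "activities": "Funds youth sport.",
    "objectives": "Improve wellbeing.",
    "grant_policy": None,  # a missing section
}

like_for_like_text = combine_text(funder, ["activities"])
combined_text = combine_text(funder, ["activities", "objectives"])
all_text = combine_text(funder, ["activities", "objectives", "grant_policy"])
```

Each approach then differs only in which column list is passed for the funder and which for the recipient.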

In [4]:
#set model constant
MODEL = "all-roberta-large-v1"

## Test 1 - Like-for-Like Comparisons

In [5]:
#define like-for-like approaches
like_for_like = {
    "activities_only": (["activities"], ["recipient_activities"]),
    "objectives_only": (["objectives"], ["recipient_objectives"])
}

#test approaches
results_lfl, pairs_lfl = compare_approaches(
    model_name=MODEL,
    funders_df=funders_df,
    recipients_df=recipients_df,
    embedding_pairs=embedding_pairs,
    approaches_dict=like_for_like
)

activities_only: r=0.724, time=4.5s
objectives_only: r=0.688, time=5.2s

Total time: 9.7s
Best approach(es):
  activities_only: r=0.724, time=4.5s
  objectives_only: r=0.688, time=5.2s
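
The `r` values above are described later in this notebook as the correlation between the embeddings and my ratings. The scoring code inside `compare_approaches` is not shown, so the following is only a sketch of one plausible reading: cosine similarity between each funder/recipient embedding pair, then Pearson's r between those similarities and the manual ratings. The vectors and ratings are toy values, and both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy stand-ins for embeddings and manual ratings (assumed, not real data).
funder_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recipient_vecs = [[1.0, 0.1], [1.0, 0.0], [0.5, 1.0]]
ratings = [0.9, 0.1, 0.8]

similarities = [cosine_similarity(f, r) for f, r in zip(funder_vecs, recipient_vecs)]
r_value = pearson_r(similarities, ratings)
```

A high `r_value` here means the similarity scores rank the pairs in roughly the same way as the ratings do, which is the property each approach is being compared on.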


## Test 2 - Combined Column Comparisons

In [6]:
#test combined column approaches
results_combined, pairs_combined = compare_approaches(
    model_name=MODEL,
    funders_df=funders_df,
    recipients_df=recipients_df,
    embedding_pairs=embedding_pairs,
    approaches_dict=COMBINATIONS_DICT
)

acts_objs: r=0.688, time=5.7s
acts_objacts: r=0.855, time=5.2s
acts_achs: r=0.860, time=6.2s
acts_policy: r=0.919, time=9.3s
objs_objacts: r=0.812, time=6.2s
objs_achs: r=0.823, time=6.0s
objs_policy: r=0.905, time=7.1s
acts_objs_objacts: r=0.803, time=5.9s
acts_objs_achs: r=0.865, time=44.2s
acts_objs_policy: r=0.871, time=59.1s
acts_objacts_achs: r=0.824, time=17.4s
acts_objacts_policy: r=0.859, time=13.4s
acts_achs_policy: r=0.873, time=14.3s
objs_objacts_achs: r=0.811, time=22.6s
objs_objacts_policy: r=0.858, time=16.3s
objs_achs_policy: r=0.813, time=47.3s
acts_objs_objacts_achs: r=0.828, time=36.6s
acts_objs_objacts_policy: r=0.844, time=31.2s
acts_objs_achs_policy: r=0.872, time=10.2s
acts_objacts_achs_policy: r=0.830, time=7.6s
objs_objacts_achs_policy: r=0.802, time=18.5s

Total time: 390.5s
Best approach(es):
  acts_policy: r=0.919, time=9.3s
  objs_policy: r=0.905, time=7.1s
  acts_achs_policy: r=0.873, time=14.3s
  acts_objs_achs_policy: r=0.872, time=10.2s
  acts_objs_poli
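
The definition of `COMBINATIONS_DICT` is not shown here. Judging only from the 21 names in the output above, it appears to cover every combination of two to four funder columns that includes at least one of `activities` or `objectives`. The sketch below reproduces that naming pattern under that assumption; the short-name mapping is inferred from the output, and the recipient column list is an assumed placeholder for whatever `approach_tester` actually uses.

```python
from itertools import combinations

# Short names inferred from the approach names printed above.
SHORT_NAMES = {
    "activities": "acts",
    "objectives": "objs",
    "objectives_activities": "objacts",
    "achievements_performance": "achs",
    "grant_policy": "policy",
}

# Assumption: every combination keeps at least one of these two columns.
CORE_COLUMNS = {"activities", "objectives"}

# Assumption: placeholder for the recipient side of each approach.
RECIPIENT_COLUMNS = ["recipient_activities", "recipient_objectives"]


def build_combinations(sizes=(2, 3, 4)):
    """Return {approach_name: (funder_columns, recipient_columns)}."""
    approaches = {}
    for size in sizes:
        for cols in combinations(list(SHORT_NAMES), size):
            if CORE_COLUMNS.isdisjoint(cols):
                continue  # skip combinations with neither core column
            name = "_".join(SHORT_NAMES[col] for col in cols)
            approaches[name] = (list(cols), list(RECIPIENT_COLUMNS))
    return approaches


sketch_combinations = build_combinations()
```

The tuple layout mirrors the `like_for_like` dictionary from Test 1, so a dictionary built this way could be passed as `approaches_dict` in the same manner.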

## Test 3 - Full Text Comparisons

In [7]:
#define full text approaches
full_text = {
    "all_columns": (["activities", "objectives", "objectives_activities", "achievements_performance", "grant_policy"], RECIPIENTS_SECTIONS)
}

#test approaches
results_full, pairs_full = compare_approaches(
    model_name=MODEL,
    funders_df=funders_df,
    recipients_df=recipients_df,
    embedding_pairs=embedding_pairs,
    approaches_dict=full_text
)

all_columns: r=0.831, time=6.4s

Total time: 6.4s
Best approach(es):
  all_columns: r=0.831, time=6.4s


## Observations

Whilst the combination of `activities` and `grant_policy` achieved the highest correlation (r=0.92), I would hesitate to use this for the final model due to concerns about missing context. From the top 5 results, I will select `acts_objs_achs_policy` (r=0.87). This decision reflects:

- The small sample size of these experiments (n=12) means that the gap of roughly 0.05 in r between the best and fourth-best performers (0.919 vs 0.872) may not be reliable.
- That said, a score of 0.87 still represents a very strong correlation between the embeddings and my ratings.
- Domain expertise: in UK trust fundraising, activities data is critical for alignment assessment, because funders explicitly evaluate what organisations do, not just what they aim to achieve. I am confident that excluding either activities or objectives would be inadvisable due to the risk of losing important contextual data.
- The four-way combination of `activities`, `objectives`, `achievements_performance` and `grant_policy` achieved essentially identical performance to the third-best combination (r=0.872 vs r=0.873 respectively), whilst taking approximately 30% less computation time (10.2s vs 14.3s).
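
The sample-size concern can be illustrated with approximate 95% confidence intervals from the Fisher z-transformation. This is a rough sketch only: it assumes the reported values are Pearson correlations over 12 independent pairs, and it ignores the fact that both approaches were scored on the same pairs. The function name `fisher_ci` is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via Fisher's z."""
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


N_PAIRS = 12  # evaluation pairs used in these experiments

ci_best = fisher_ci(0.919, N_PAIRS)     # acts_policy
ci_chosen = fisher_ci(0.872, N_PAIRS)   # acts_objs_achs_policy

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = ci_best[0] <= ci_chosen[1] and ci_chosen[0] <= ci_best[1]

print(f"acts_policy: {ci_best[0]:.2f} to {ci_best[1]:.2f}")
print(f"acts_objs_achs_policy: {ci_chosen[0]:.2f} to {ci_chosen[1]:.2f}")
print(f"Intervals overlap: {intervals_overlap}")
```

Under these assumptions the two intervals overlap, which is consistent with treating the gap between the first and fourth results as inconclusive at n=12.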

# Approach Evaluation - Funder Text as Independent Variable

Finally, I will test the best-performing approaches against different combinations of the recipient sections, to evaluate whether separating out `recipient_activities` and `recipient_objectives` produces better results.
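
One way these tests could be set up is to hold the funder columns fixed at the selected combination and vary only the recipient side, reusing the `(funder_columns, recipient_columns)` tuple layout from Test 1. The dictionary below is a hypothetical sketch of such a configuration; the approach names are illustrative, and it does not reproduce any cell that was actually run.

```python
# Funder columns of the combination selected in the observations above.
SELECTED_FUNDER_COLUMNS = [
    "activities",
    "objectives",
    "achievements_performance",
    "grant_policy",
]

# Hypothetical recipient-side variants: one section at a time.
recipient_variants = {
    "recipient_activities_only": ["recipient_activities"],
    "recipient_objectives_only": ["recipient_objectives"],
}

# Same layout as `like_for_like`: name -> (funder columns, recipient columns).
recipient_split = {
    name: (list(SELECTED_FUNDER_COLUMNS), recipient_cols)
    for name, recipient_cols in recipient_variants.items()
}
```

A dictionary in this shape could then be passed to `compare_approaches` as `approaches_dict`, in the same way as in the earlier tests.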

## Test 5 - Recipient Activities Text Only

## Test 6 - Recipient Objectives Text Only