# Experimenting with Approaches to Embedding and Comparing Text

# Setup

In [1]:
import os
import sys
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv

from approach_tester import test_embedding_approach, compare_approaches

#suppress warnings and set plot palette
warnings.filterwarnings("ignore")
colours = sns.color_palette("Set2")

#get keys from env
load_dotenv()
url = os.getenv("SUPABASE_URL")
key = os.getenv("SUPABASE_KEY")

----

# Retrieving Data from Checkpoints

I will reuse the checkpoints created in the model selection notebook, so that the same data is used to explore which approach to comparing embeddings works best.

In [2]:
#get checkpoint folder
checkpoint_folder = Path("./7.1_checkpoints/")

recipients_df = pd.read_pickle(checkpoint_folder / "recipients_df.pkl")
funders_df = pd.read_pickle(checkpoint_folder / "funders_df.pkl")
embedding_pairs = pd.read_pickle(checkpoint_folder / "embedding_pairs.pkl")

In [3]:
#check dfs
print(f"Recipients: {recipients_df.shape} | Funders: {funders_df.shape} | Evaluation Pairs: {embedding_pairs.shape}")

Recipients: (12, 6) | Funders: (12, 14) | Evaluation Pairs: (12, 6)


----

# Approach Evaluation

Based on the model selection notebook, `all-roberta-large-v1` performed best when comparing funder and recipient text. Now I will test which of the following approaches to embedding and comparison works best:
1. Like-for-like comparisons (i.e. funder activities vs recipient activities, or objectives vs objectives)
2. Combining some columns (e.g. activities + objectives)
3. Combining all available columns
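
Each approach below is reported with a score `r`. The internals of `approach_tester` are not shown in this notebook, so the following is only a minimal sketch of how such a score could be computed, assuming `r` is a Pearson correlation between the cosine similarity of each funder/recipient embedding pair and a reference rating for that pair. The function names, vectors and ratings are all hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_pairs(funder_vecs, recipient_vecs, ratings):
    """Pearson r between per-pair cosine similarity and reference ratings."""
    sims = [cosine_similarity(f, r) for f, r in zip(funder_vecs, recipient_vecs)]
    return float(np.corrcoef(sims, ratings)[0, 1])

#toy 2-D "embeddings" standing in for real model output (hypothetical values)
funder_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
recipient_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
ratings = [1.0, 0.5, 0.0]  #hypothetical reference score for each pair

r = score_pairs(funder_vecs, recipient_vecs, ratings)
print(f"r={r:.3f}")
```

Under this assumption, a higher `r` means the similarity scores rank the pairs more consistently with the reference ratings, which is how the approaches could be compared against each other.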

In [4]:
#set model constant
MODEL = "all-roberta-large-v1"

## Test 1 - Like-for-Like Comparisons

First, I will test whether comparing similar sections works best - activities to activities, and objectives to objectives.

In [6]:
#define like-for-like approaches
like_for_like = {
    "activities_only": (["activities"], ["recipient_activities"]),
    "objectives_only": (["objectives"], ["recipient_objectives"])
}

#test approaches
results_lfl, pairs_lfl = compare_approaches(
    model_name=MODEL,
    funders_df=funders_df,
    recipients_df=recipients_df,
    embedding_pairs=embedding_pairs,
    approaches_dict=like_for_like
)

activities_only: r=0.724, time=2.9s
objectives_only: r=0.688, time=4.5s

Total time: 7.4s
Best approach: activities_only (r=0.724)


## Test 2 - Combined Column Comparisons

Next, I will test whether combining activities and objectives, either with each other or with other columns, works better than comparing them separately.
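
How `compare_approaches` merges several columns before embedding is not shown here. One plausible way, sketched below as an assumption, is to join the selected text columns into a single string per row and skip missing values. `combine_columns` is a hypothetical helper and the toy values are made up; only the column names come from this notebook.

```python
import pandas as pd

def combine_columns(df, columns, sep=" "):
    """Join the given text columns row by row, skipping missing or empty values."""
    def join_row(row):
        parts = [str(v).strip() for v in row if pd.notna(v) and str(v).strip()]
        return sep.join(parts)
    return df[columns].apply(join_row, axis=1)

#toy frame using funder column names from this notebook (values are made up)
toy_funders = pd.DataFrame({
    "activities": ["Grants to youth clubs", None],
    "objectives": ["Relief of poverty", "Advancement of education"],
})

combined = combine_columns(toy_funders, ["activities", "objectives"])
print(combined.tolist())
```

Skipping missing values avoids embedding literal strings such as "None" or "nan" when a funder has only some of the columns filled in.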

In [None]:
#define combined column approaches


## Test 3 - Full Text Comparisons

Finally, I will test whether using all available text columns (including extracted accounts data) gives better results than the narrower comparisons above.

In [None]:
#define full text approaches
full_text_approaches = {
    "all_columns": (
        ["activities", "objectives", "objectives_activities", "achievements_performance", "grant_policy"],
        ["recipient_activities", "recipient_objectives"]
    )
}

#test approaches
results_full, pairs_full = compare_approaches(
    model_name=MODEL,
    funders_df=funders_df,
    recipients_df=recipients_df,
    embedding_pairs=embedding_pairs,
    approaches_dict=full_text_approaches
)