# The Scoring Logic - Testing

# Setup

In [1]:
import pandas as pd
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")
from sentence_transformers import SentenceTransformer
from logic_utils import calculate_similarity_score, format_test_1, format_tests
from logic_calculations import *

----

# Retrieving Data from Checkpoints

In [18]:
#get checkpoint folder
checkpoint_folder = Path("./10.1_checkpoints/")

#get checkpoint
funders_df = pd.read_pickle(checkpoint_folder / "funders_df.pkl")
grants_df = pd.read_pickle(checkpoint_folder / "grants_df.pkl")
pairs_df = pd.read_pickle(checkpoint_folder / "eval_final_df.pkl")
areas_df = pd.read_pickle(checkpoint_folder / "areas_df.pkl")
hierarchies_df = pd.read_pickle(checkpoint_folder / "hierarchies_df.pkl")

----

# Preparation of User and Funder Data

To test the logic, I will use the 12 funder-recipient pairs that I identified earlier. As with the logic development process, I will use the recipients of these pairs as proxies for user input.

First, the user will input information about their charity (the applicant), then embeddings will be created for the inputted text data. For the purposes of this testing notebook, I will simulate users' keyword input by using extracted classifications from recipients' data, but in the final artefact, the user will be asked to enter their own keywords.

## Creation of Embeddings from User Input

In [3]:
model = SentenceTransformer("all-roberta-large-v1")
user_cols = ["recipient_name", "recipient_activities", "recipient_objectives"]

for col in user_cols:
    #replace nans with empty string
    texts = pairs_df[col].fillna("").tolist()
    embeddings = model.encode(texts)
    
    #add to df
    pairs_df[f"{col}_em"] = list(embeddings)

pairs_df["concat_text"] = pairs_df[user_cols[0]].fillna("")
for col in user_cols[1:]:
    pairs_df["concat_text"] += " " + pairs_df[col].fillna("")

#make lowercase
pairs_df["concat_text"] = pairs_df["concat_text"].str.lower()

#create embeddings
texts = pairs_df["concat_text"].tolist()
embeddings = model.encode(texts)
pairs_df["user_concat_em"] = list(embeddings)

#drop concatenated text
pairs_df = pairs_df.drop(columns=["concat_text"])

#change recipient_ to user_
pairs_df = pairs_df.rename(columns=lambda col: f"user_{col[len('recipient_'):]}" if col.startswith("recipient_") else col)

In [4]:
pd.set_option("display.max_columns", None)
pairs_df.head(1)

Unnamed: 0,id,funder_registered_num,user_id,name,website,activities,objectives,income_latest,expenditure_latest,objectives_activities,achievements_performance,grant_policy,is_potential_sbf,is_on_list,is_nua,name_em,activities_em,objectives_em,objectives_activities_em,achievements_performance_em,grant_policy_em,concat_em,extracted_class,causes,areas,beneficiaries,income_history,expenditure_history,list_entries,user_name,user_activities,user_objectives,user_areas,user_causes,user_beneficiaries,user_extracted_class,user_name_em,user_activities_em,user_objectives_em,user_concat_em
0,1,1124856,328729,ROSA FUND,https://www.rosauk.org,ROSA IS THE FIRST UK-WIDE FUND FOR WOMEN'S INI...,THE OBJECTS OF THE CHARITY ARE TO FURTHER ANY ...,1407453.0,1372296.0,,,,False,False,False,"[0.036068745,0.02428467,-0.026885081,-0.001224...","[0.005698055,0.011768709,-0.015513399,0.004063...","[-0.023482107,-0.02466424,0.0036000705,-0.0259...","[-0.019817753,-0.00571729,0.022262126,-0.03666...","[-0.019817753,-0.00571729,0.022262126,-0.03666...","[-0.019817753,-0.00571729,0.022262126,-0.03666...","[0.00031545621,0.014991585,-0.003386881,0.0022...","[""UK"",""WALES"",""GIRLS"",""WOMEN"",""CHARITY AND VCS...",[General Charitable Purposes],[Throughout England And Wales],"[Other Charities Or Voluntary Bodies, Other De...","{2020: 155612.0, 2021: 4478996.0, 2022: 237267...","{2020: 974678.0, 2021: 2118687.0, 2022: 266530...",[],ASYLUM AID,THE PROVISION OF LEGAL ADVICE AND REPRESENTATI...,2. OBJECTS2.1 THE CHARITY IS ESTABLISHED FOR T...,[Throughout England And Wales],"[Education/training, The Prevention Or Relief ...",[People Of A Particular Ethnic Or Racial Origi...,"[""UK"",""ASYLUM SEEKERS AND REFUGEES"",""MIGRANTS""...","[-0.008021083, 0.0150393685, -0.02299179, -0.0...","[-0.022860443, 0.029296884, -0.017596053, -0.0...","[0.016410686, 0.00963383, -0.04015383, -0.0136...","[-0.00083443156, -0.01810797, -0.0055338936, -..."


In [5]:
pairs_backup = pairs_df.copy()

----

# The Alignment Score Calculator

In [19]:
#set up weights
weights_v1 = {
    "areas_weight": 0.06,
    "beneficiaries_weight": 0.04,
    "causes_weight": 0.02,
    "text_similarity_weight": 0.17,
    "keyword_similarity_weight": 0.15,
    "name_rp_weight": 0.16,
    "grants_rp_weight": 0.2,
    "recipients_rp_weight": 0.2
}

weights_v2 = {
    "areas_weight": 0.08,
    "beneficiaries_weight": 0.04,
    "causes_weight": 0.02,
    "text_similarity_weight": 0.16,
    "keyword_similarity_weight": 0.13,
    "name_rp_weight": 0.16,
    "grants_rp_weight": 0.2,
    "recipients_rp_weight": 0.21
}

weights_v3 = {
    "areas_weight": 0.08,
    "beneficiaries_weight": 0.04,
    "causes_weight": 0.02,
    "text_similarity_weight": 0.16,
    "keyword_similarity_weight": 0.11,
    "name_rp_weight": 0.17,
    "grants_rp_weight": 0.21,
    "recipients_rp_weight": 0.21
}

weight_dicts = {
    "v1": weights_v1,
    "v2": weights_v2,
    "v3": weights_v3
}

model = SentenceTransformer("all-roberta-large-v1")
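
Because the weights are combined as a weighted sum, each set should total 1 so that the three versions are comparable. The sketch below is a quick sanity check under that assumption; `weights_sum_to_one` is a hypothetical helper that does not exist in `logic_utils` or `logic_calculations`, and the weight values are copied from the cell above.

```python
import math

# Weight sets copied from the cell above
weight_dicts = {
    "v1": {"areas_weight": 0.06, "beneficiaries_weight": 0.04, "causes_weight": 0.02,
           "text_similarity_weight": 0.17, "keyword_similarity_weight": 0.15,
           "name_rp_weight": 0.16, "grants_rp_weight": 0.2, "recipients_rp_weight": 0.2},
    "v2": {"areas_weight": 0.08, "beneficiaries_weight": 0.04, "causes_weight": 0.02,
           "text_similarity_weight": 0.16, "keyword_similarity_weight": 0.13,
           "name_rp_weight": 0.16, "grants_rp_weight": 0.2, "recipients_rp_weight": 0.21},
    "v3": {"areas_weight": 0.08, "beneficiaries_weight": 0.04, "causes_weight": 0.02,
           "text_similarity_weight": 0.16, "keyword_similarity_weight": 0.11,
           "name_rp_weight": 0.17, "grants_rp_weight": 0.21, "recipients_rp_weight": 0.21},
}


def weights_sum_to_one(weights, tol=1e-9):
    """Hypothetical helper: True if the weights total 1 within a small tolerance."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=tol)


#report the total for each weight set
for version, weights in weight_dicts.items():
    print(version, round(sum(weights.values()), 2), weights_sum_to_one(weights))
```

The tolerance is there because floating-point addition of values such as 0.06 and 0.17 rarely lands on exactly 1.0.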

## Stage 1: Scores and Reasonings Retrieval Function

For the first stage of the test, I will simply combine all of the functions to produce all scores and reasonings. This will provide a first layer to the final alignment calculator. From this, I will be able to gain an understanding of the data available for each pair and the scores that are produced by the first layer.

In [28]:
def get_scores_and_reasonings(pairs_df, idx, grants_df, areas_df, hierarchies_df, model):
    """
    Calls all calculation functions to get scores and reasonings for each step.
    """

    #get funder's data
    funder_num = pairs_df["funder_registered_num"].iloc[idx]
    funder_grants_df = grants_df[grants_df["funder_num"] == funder_num].copy()
    has_grants_data = not funder_grants_df.empty
    
    #1 check if funder has a single beneficiary
    is_sbf = pairs_df["is_potential_sbf"].iloc[idx]

    #2 check if funder states no unsolicited applications
    is_nua = pairs_df["is_nua"].iloc[idx]

    #3 check if funder is on the list
    is_on_list = pairs_df["is_on_list"].iloc[idx]
    list_reasoning = set(pairs_df["list_entries"].iloc[idx]) if is_on_list else None

    #4 check if funder has ever given a grant to applicant
    user_num = pairs_df["user_id"].iloc[idx]
    existing_relationship, num_grants, relationship = check_existing_relationship(grants_df, funder_num, user_num)

    #5 get areas score
    funder_areas = pairs_df["areas"].iloc[idx].copy()
    user_areas = pairs_df["user_areas"].iloc[idx].copy()
    areas_score, areas_reasoning = check_areas(funder_areas, user_areas, areas_df, hierarchies_df)

    #6 get beneficiaries score
    funder_beneficiaries = pairs_df["beneficiaries"].iloc[idx].copy()
    user_beneficiaries = pairs_df["user_beneficiaries"].iloc[idx].copy()
    beneficiaries_score, beneficiaries_reasoning = check_beneficiaries(funder_beneficiaries, user_beneficiaries)

    #7 get causes score
    funder_causes = pairs_df["causes"].iloc[idx].copy()
    user_causes = pairs_df["user_causes"].iloc[idx].copy()
    causes_score, causes_reasoning, has_gcp = check_causes(funder_causes, user_causes)

    #8 get text semantic similarity score
    funder_embedding = pairs_df["concat_em"].iloc[idx]
    user_embedding = pairs_df["user_concat_em"].iloc[idx]
    text_similarity_score = calculate_similarity_score(funder_embedding, user_embedding)

    #9 get keyword semantic similarity score
    funder_keywords = pairs_df["extracted_class"].iloc[idx]
    user_keywords = pairs_df["user_extracted_class"].iloc[idx]
    keyword_similarity_score, keyword_strong_matches, keyword_reasoning, keyword_gets_bonus = check_keywords(funder_keywords, user_keywords, model)

    #10 get name (RP) semantic similarity score
    recipients_name_all_em = dict(zip(funder_grants_df["recipient_name"], funder_grants_df["recipient_name_em"]))
    user_name_em = pairs_df["user_name_em"].iloc[idx]
    user_name = pairs_df["user_name"].iloc[idx]
    name_rp_score, name_rp_reasoning = check_name_rp(recipients_name_all_em, user_name_em, user_name)

    #11 get grants (RP) semantic similarity score
    non_empty_grants = funder_grants_df[
        (funder_grants_df["grant_title"].notna() & (funder_grants_df["grant_title"] != "")) |
        (funder_grants_df["grant_desc"].notna() & (funder_grants_df["grant_desc"] != ""))
    ]
    grants_all_em = dict(zip(non_empty_grants["recipient_name"], non_empty_grants["grant_concat_em"]))
    user_concat_em = pairs_df["user_concat_em"].iloc[idx]
    user_name = pairs_df["user_name"].iloc[idx]
    grants_rp_score, grants_rp_reasoning = check_grants_rp(grants_all_em, user_concat_em, user_name)

    #12 get recipients (RP) semantic similarity score
    recipients_all_em = dict(zip(funder_grants_df["recipient_name"], funder_grants_df["recipient_concat_em"]))
    user_concat_em = pairs_df["user_concat_em"].iloc[idx]
    user_name = pairs_df["user_name"].iloc[idx]
    recipients_rp_score, recipients_rp_reasoning = check_recipients_rp(recipients_all_em, user_concat_em, user_name)

    #13 get sbf penalty
    sbf_penalty = 0.1 if is_sbf else 1.0

    #14 get nua penalty
    if existing_relationship:
        nua_penalty = 1.0
    else:
        nua_penalty = 0.2 if is_nua else 1.0       

    #15 get keywords bonus
    if keyword_strong_matches:
        ukcat_url = "https://raw.githubusercontent.com/lico27/ukcat/main/data/ukcat.csv"
        ukcat_df = pd.read_csv(ukcat_url)
        keywords_bonus = calculate_keywords_bonus(keyword_strong_matches, ukcat_df)
    else:
        keywords_bonus = 1.0

    #16 get relationship bonus
    if existing_relationship:
        time_lapsed, relationship_bonus, last_grant_year = calculate_relationship_bonus(relationship)
    else:
        time_lapsed = None
        relationship_bonus = 1.0
        last_grant_year = None

    #17 get gcp bonus
    gcp_bonus = 1.2 if has_gcp else 1.0

    #18 get areas (RP) bonus
    user_areas = pairs_df["user_areas"].iloc[idx].copy()
    areas_rp_bonus, areas_rp_reasoning = calculate_areas_bonus_rp(funder_grants_df, user_areas, areas_df, hierarchies_df)

    #19 get keywords (RP) bonus
    user_keywords = pairs_df["user_extracted_class"].iloc[idx]
    keywords_rp_bonus, keywords_rp_reasoning = calculate_keywords_bonus_rp(funder_grants_df, user_keywords)

    #20 get low variance penalty
    lv_penalty = calculate_lv_penalty(funder_grants_df)
    
    return (is_sbf, is_nua, is_on_list, list_reasoning, existing_relationship, num_grants, relationship, areas_score, areas_reasoning, 
            beneficiaries_score, beneficiaries_reasoning, causes_score, causes_reasoning, has_gcp, text_similarity_score,
            keyword_similarity_score, keyword_strong_matches, keyword_reasoning, keyword_gets_bonus, name_rp_score, name_rp_reasoning,
            grants_rp_score, grants_rp_reasoning, recipients_rp_score, recipients_rp_reasoning, sbf_penalty, nua_penalty, keywords_bonus,
            time_lapsed, relationship_bonus, last_grant_year, gcp_bonus, areas_rp_bonus, areas_rp_reasoning, keywords_rp_bonus, keywords_rp_reasoning, lv_penalty,
            has_grants_data
    )

## Test 1: All Pairs Separate Scores

I will now produce a simple card for each funder-recipient pair to display the binary outputs and the calculated first layer scores.

In [29]:
#get scores for all pairs and display
for idx, row in pairs_df.iterrows():
    result = get_scores_and_reasonings(pairs_df, idx, grants_df, areas_df, hierarchies_df, model)
    format_test_1(idx, row, result)

## Stage 2: Simple Alignment Score Function

This second stage builds the first iteration of the alignment score calculator, which calls the scoring function defined above and combines its output scores using the simple additive weighting (SAW) method. The second layer then applies the multipliers (bonuses and penalties) to the weighted score.
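
To make the two layers concrete, here is a minimal sketch of the arithmetic on its own. The first-layer scores and the two multipliers are made-up illustrative values, not results from any real pair, and `saw_score` and `apply_multipliers` are hypothetical names used only for this example. The weights are the v1 set.

```python
#hypothetical first-layer scores for one pair (illustrative values only)
scores = {
    "areas": 1.0, "beneficiaries": 0.5, "causes": 0.5, "text_similarity": 0.6,
    "keyword_similarity": 0.4, "name_rp": 0.3, "grants_rp": 0.5, "recipients_rp": 0.5,
}

#v1 weights from this notebook
weights = {
    "areas_weight": 0.06, "beneficiaries_weight": 0.04, "causes_weight": 0.02,
    "text_similarity_weight": 0.17, "keyword_similarity_weight": 0.15,
    "name_rp_weight": 0.16, "grants_rp_weight": 0.2, "recipients_rp_weight": 0.2,
}


def saw_score(scores, weights):
    """Layer 1: weighted sum of the individual scores."""
    return sum(score * weights[f"{name}_weight"] for name, score in scores.items())


def apply_multipliers(base, multipliers, floor=0.05, ceiling=0.95):
    """Layer 2: multiply by each bonus/penalty, then clamp to [floor, ceiling]."""
    result = base
    for value in multipliers.values():
        result *= value
    return min(max(result, floor), ceiling)


base = saw_score(scores, weights)
final = apply_multipliers(base, {"nua_penalty": 1.0, "gcp_bonus": 1.2})
print(round(base, 2), round(final, 2))
```

Because the multipliers are applied after the weighted sum, a single strong penalty (for example an SBF penalty of 0.1) can dominate the result regardless of how well the individual criteria match.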

In [30]:
def calculate_alignment_score_v1(pairs_df, idx, grants_df, areas_df, hierarchies_df, model, weights):
    """
    Combines all 20 scoring steps to produce one final alignment score. 
    """

    #get scores
    result = get_scores_and_reasonings(pairs_df, idx, grants_df, areas_df, hierarchies_df, model)

    #unpack score elements
    (is_sbf, is_nua, is_on_list, list_reasoning,
    existing_relationship, num_grants, relationship, areas_score,
    areas_reasoning, beneficiaries_score, beneficiaries_reasoning,
    causes_score, causes_reasoning, has_gcp, text_similarity_score,
    keyword_similarity_score, keyword_strong_matches, keyword_reasoning, keyword_gets_bonus,
    name_rp_score, name_rp_reasoning, grants_rp_score, grants_rp_reasoning,
    recipients_rp_score, recipients_rp_reasoning, sbf_penalty, nua_penalty, keywords_bonus,
    time_lapsed, relationship_bonus, last_grant_year, gcp_bonus, areas_rp_bonus, areas_rp_reasoning,
    keywords_rp_bonus, keywords_rp_reasoning, lv_penalty, has_grants_data) = result

    weighted_scores = (
        areas_score * weights["areas_weight"] +
        beneficiaries_score * weights["beneficiaries_weight"] +
        causes_score * weights["causes_weight"] +
        text_similarity_score * weights["text_similarity_weight"] +
        keyword_similarity_score * weights["keyword_similarity_weight"] +
        name_rp_score * weights["name_rp_weight"] +
        grants_rp_score * weights["grants_rp_weight"] +
        recipients_rp_score * weights["recipients_rp_weight"]
    )

    final_score = (
        weighted_scores *
        sbf_penalty *
        nua_penalty *
        keywords_bonus *
        relationship_bonus *
        gcp_bonus *
        areas_rp_bonus *
        keywords_rp_bonus *
        lv_penalty
    )

    final_score = min(max(final_score, 0.05), 0.95)
    
    return final_score

## Test 2: All Pairs Simple Alignment

I will call the calculator function for each set of weights, and display the resulting alignment scores to gain an understanding of which weights produce the most logical results.

In [31]:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)

In [32]:
#create results dataframe
results_v1 = []

for idx, row in pairs_df.iterrows():
    result_row = {
        "pair_id": idx + 1,
        "funder": row["name"],
        "recipient": row["user_name"]
    }

    #calculate score for each weight set
    for version, weights in weight_dicts.items():
        score = calculate_alignment_score_v1(pairs_df, idx, grants_df, areas_df, hierarchies_df, model, weights)
        result_row[f"score_{version}"] = round(score * 100, 2)

    results_v1.append(result_row)

#create and display dataframe
results_v1_df = pd.DataFrame(results_v1)
results_v1_df

| | pair_id | funder | recipient | score_v1 | score_v2 | score_v3 |
|---|---|---|---|---|---|---|
| 0 | 1 | BACON CHARITABLE TRUST | CATTON GROVE COMMUNITY CENTRE CIO | 35.33 | 37.37 | 37.27 |
| 1 | 2 | BEAVERBROOK FOUNDATION | FAITH IN LATER LIFE LTD | 28.54 | 26.22 | 24.77 |
| 2 | 3 | JESSIE SPENCER TRUST | COMBAT STRESS | 86.81 | 87.04 | 86.45 |
| 3 | 4 | GRUT TRUST | NORTHERN CANCER VOICES | 11.46 | 10.57 | 9.97 |
| 4 | 5 | JOHN WHIPPY FOUNDATION | ANIMAL RESCUE CYMRU | 12.72 | 11.44 | 10.48 |
| 5 | 6 | MRS WATERHOUSE CHARITABLE TRUST | KIDZ KLUB - LEEDS | 64.23 | 63.72 | 62.24 |
| 6 | 7 | TESLER FOUNDATION | SEVENTH HEAVEN | 29.88 | 30.23 | 29.08 |
| 7 | 8 | FRIENDS OF FAWLEY CHURCH | BERKSHIRE, BUCKINGHAMSHIRE AND OXFORDSHIRE WIL... | 5.0 | 5.0 | 5.0 |
| 8 | 9 | 3 TS CHARITABLE TRUST | WELDMAR HOSPICECARE | 5.0 | 5.0 | 5.0 |
| 9 | 10 | DAVID AND RUTH BEHREND FUND | FREEDOM FROM TORTURE | 72.23 | 73.47 | 73.03 |


## Stage 3: Alignment with Reweighting Score Function

The scores above roughly align with my expectations based on my own prospecting of these funder-recipient pairs. However, I feel that the SAW method produces artificially deflated scores for funders without a grants history. I will add another layer to the calculation, to reweight the scores proportionally based on whether or not a funder has a grants history.
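
As a worked example of what proportional reweighting means here: when a funder has no grants history, the three revealed-preference (RP) criteria cannot contribute, so the stated-preference (SP) weights are scaled up until they cover the full total. The sketch below computes that scaling factor for the v3 weights; `reweight_factor`, `SP_KEYS` and `RP_KEYS` are hypothetical names for illustration.

```python
#v3 weights from this notebook
weights_v3 = {
    "areas_weight": 0.08, "beneficiaries_weight": 0.04, "causes_weight": 0.02,
    "text_similarity_weight": 0.16, "keyword_similarity_weight": 0.11,
    "name_rp_weight": 0.17, "grants_rp_weight": 0.21, "recipients_rp_weight": 0.21,
}

#stated-preference criteria (always available)
SP_KEYS = ("areas_weight", "beneficiaries_weight", "causes_weight",
           "text_similarity_weight", "keyword_similarity_weight")
#revealed-preference criteria (need a grants history)
RP_KEYS = ("name_rp_weight", "grants_rp_weight", "recipients_rp_weight")


def reweight_factor(weights):
    """Factor that scales the SP weights up to the combined SP + RP total."""
    sp_total = sum(weights[key] for key in SP_KEYS)
    rp_total = sum(weights[key] for key in RP_KEYS)
    return (sp_total + rp_total) / sp_total


factor = reweight_factor(weights_v3)
print(round(factor, 3))  # 2.439, since the SP weights total 0.41
```

Under v3, a funder without grants data therefore has its stated-preference weighted sum multiplied by roughly 2.44 before the bonuses and penalties are applied.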

In [33]:
def calculate_alignment_score_v2(pairs_df, idx, grants_df, areas_df, hierarchies_df, model, weights):
    """
    Combines all 20 scoring steps to produce one final alignment score with reweighting to account for missing data where funders have no grants history.
    """

    #get scores
    result = get_scores_and_reasonings(pairs_df, idx, grants_df, areas_df, hierarchies_df, model)
    
    #unpack score elements
    (is_sbf, is_nua, is_on_list, list_reasoning,
     existing_relationship, num_grants, relationship, areas_score,
     areas_reasoning, beneficiaries_score, beneficiaries_reasoning,
     causes_score, causes_reasoning, has_gcp, text_similarity_score,
     keyword_similarity_score, keyword_strong_matches, keyword_reasoning, keyword_gets_bonus,
     name_rp_score, name_rp_reasoning, grants_rp_score, grants_rp_reasoning,
     recipients_rp_score, recipients_rp_reasoning, sbf_penalty, nua_penalty, keywords_bonus,
     time_lapsed, relationship_bonus, last_grant_year, gcp_bonus, areas_rp_bonus, areas_rp_reasoning,
     keywords_rp_bonus, keywords_rp_reasoning, lv_penalty, has_grants_data) = result

    #define weights based on stated/revealed preferences
    sp_weights = {
        "areas": (areas_score, weights["areas_weight"]),
        "beneficiaries": (beneficiaries_score, weights["beneficiaries_weight"]),
        "causes": (causes_score, weights["causes_weight"]),
        "text_similarity": (text_similarity_score, weights["text_similarity_weight"]),
        "keyword_similarity": (keyword_similarity_score, weights["keyword_similarity_weight"])
    }

    rp_weights = {
        "name_rp": (name_rp_score, weights["name_rp_weight"]),
        "grants_rp": (grants_rp_score, weights["grants_rp_weight"]),
        "recipients_rp": (recipients_rp_score, weights["recipients_rp_weight"])
    }

    #calculate scores with proportional reweighting
    if has_grants_data:
        #normal calculation when grants history exists
        weighted_scores = sum(score * weight for score, weight in sp_weights.values())
        weighted_scores += sum(score * weight for score, weight in rp_weights.values())
    else:
        #get total weights when no grants history
        sp_total = sum(weight for _, weight in sp_weights.values())
        rp_total = sum(weight for _, weight in rp_weights.values())
        
        #apply proportional reweighting
        reweight_proportion = (sp_total + rp_total) / sp_total
        weighted_scores = sum(score * weight * reweight_proportion for score, weight in sp_weights.values())
    
    final_score = (
        weighted_scores *
        sbf_penalty *
        nua_penalty *
        keywords_bonus *
        relationship_bonus *
        gcp_bonus *
        areas_rp_bonus *
        keywords_rp_bonus *
        lv_penalty
    )
    
    final_score = min(max(final_score, 0.05), 0.95)
    
    return final_score

## Test 3: All Pairs with Reweighting

In [34]:
#create results dataframe
results_v2 = []

for idx, row in pairs_df.iterrows():
    result_row = {
        "pair_id": idx + 1,
        "funder": row["name"],
        "recipient": row["user_name"]
    }

    #calculate score for each weight set
    for version, weights in weight_dicts.items():
        score = calculate_alignment_score_v2(pairs_df, idx, grants_df, areas_df, hierarchies_df, model, weights)
        result_row[f"score_{version}"] = round(score * 100, 2)

    results_v2.append(result_row)

#create and display dataframe
results_v2_df = pd.DataFrame(results_v2)
results_v2_df

| | pair_id | funder | recipient | score_v1 | score_v2 | score_v3 |
|---|---|---|---|---|---|---|
| 0 | 1 | BACON CHARITABLE TRUST | CATTON GROVE COMMUNITY CENTRE CIO | 35.33 | 37.37 | 37.27 |
| 1 | 2 | BEAVERBROOK FOUNDATION | FAITH IN LATER LIFE LTD | 64.87 | 60.98 | 60.43 |
| 2 | 3 | JESSIE SPENCER TRUST | COMBAT STRESS | 86.81 | 87.04 | 86.45 |
| 3 | 4 | GRUT TRUST | NORTHERN CANCER VOICES | 26.05 | 24.58 | 24.31 |
| 4 | 5 | JOHN WHIPPY FOUNDATION | ANIMAL RESCUE CYMRU | 28.91 | 26.6 | 25.56 |
| 5 | 6 | MRS WATERHOUSE CHARITABLE TRUST | KIDZ KLUB - LEEDS | 64.23 | 63.72 | 62.24 |
| 6 | 7 | TESLER FOUNDATION | SEVENTH HEAVEN | 29.88 | 30.23 | 29.08 |
| 7 | 8 | FRIENDS OF FAWLEY CHURCH | BERKSHIRE, BUCKINGHAMSHIRE AND OXFORDSHIRE WIL... | 6.12 | 6.5 | 6.51 |
| 8 | 9 | 3 TS CHARITABLE TRUST | WELDMAR HOSPICECARE | 7.23 | 7.33 | 7.31 |
| 9 | 10 | DAVID AND RUTH BEHREND FUND | FREEDOM FROM TORTURE | 72.23 | 73.47 | 73.03 |


### Observations

The differences between the weighting versions are subtle, but I am satisfied that v3 produces results that align with my own expectations. Having added the reweighting layer, I am also now satisfied that funders without a grants history are not over-penalised. I have also clamped the final score using `max` and `min` so that it always falls between 5% and 95%, to avoid implying a level of certainty beyond what prospie can provide.

This has very much confirmed the importance of contextual information to the prospecting process. My alignment score is informed and measurable, per project objective #3; however, looking at the score alone highlights that the true power of prospie will come from the reasoning presented to the user in the final prototype. This is what market competitors lack, and what will be the USP of prospie.

## Final Version

In [35]:
#get scores for all pairs and display
for idx, row in pairs_df.iterrows():
    result_v2 = calculate_alignment_score_v2(pairs_df, idx, grants_df, areas_df, hierarchies_df, model, weights_v3)
    format_tests(idx, row, result_v2)