# Concept Utility Framework

### ENRICH ENDOH CSV WITH WORD COUNT + PUBMED FREQUENCY (LAST 5 YEARS) 

**Importing necessary dependencies**

In [None]:
import pandas as pd
from Bio import Entrez
import time
from datetime import datetime

**Load original ENDOH.csv**

ENDOH.csv contains the current version of the Environmental Determinants of Health (EnDOH) ontology from BioPortal.

In [None]:
df = pd.read_csv("ENDOH.csv")
df = df[['Preferred Label', 'Parents']].dropna()
df['Preferred Label'] = df['Preferred Label'].astype(str)

**Compute Word Count (treat underscores as spaces)**

Here we compute the word count for each concept present in EnDOH 

In [None]:
df['Word Count'] = df['Preferred Label'].apply(lambda x: len(x.replace('_', ' ').split()))

**Set up Entrez API**

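The Entrez API key and contact email are used to retrieve the PubMed frequency of occurrence for every concept in the seed ontology (EnDOH). The key is read from an environment variable so that it is not stored in the notebook.

In [None]:
import os

# Read the NCBI key from the environment; if it is unset, requests are sent without a key
Entrez.api_key = os.environ.get("NCBI_API_KEY")
Entrez.email = "name@example.com"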

**Compute PubMed Frequency for the last 5 years**

This step retrieves each concept's PubMed frequency of occurrence over the last 5 years, since recent publications best reflect a concept's current presence in the research domain.

In [None]:
start_year = datetime.now().year - 5
end_year = datetime.now().year

def get_pubmed_freq(term):
    try:
        query = f'"{term.replace("_", " ")}" AND ({start_year}[PDAT] : {end_year}[PDAT])'
        handle = Entrez.esearch(db="pubmed", term=query, retmax=1)
        record = Entrez.read(handle)
        handle.close()
        return int(record["Count"])
    except Exception:  # treat any lookup or parsing failure as zero hits
        return 0

# Apply to all concepts
frequencies = []
for idx, term in enumerate(df['Preferred Label']):
    freq = get_pubmed_freq(term)
    frequencies.append(freq)
    if idx % 10 == 0:
        print(f"Processed {idx + 1} of {len(df)}")
    time.sleep(0.34)  # stay within NCBI limit

df['Frequency'] = frequencies
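The query string sent to PubMed can be inspected without calling the API. The following is a minimal sketch, assuming a hypothetical helper `build_pubmed_query` that mirrors the f-string inside `get_pubmed_freq`; the years are illustrative values, not the ones used in the run above.

```python
def build_pubmed_query(term, start_year, end_year):
    """Build a quoted-phrase query restricted to a publication-date range.

    Hypothetical helper for illustration only: it mirrors the f-string used
    in get_pubmed_freq and performs no network access.
    """
    phrase = term.replace("_", " ")
    return f'"{phrase}" AND ({start_year}[PDAT] : {end_year}[PDAT])'


query = build_pubmed_query("loss_of_urban_forest", 2020, 2025)
print(query)  # "loss of urban forest" AND (2020[PDAT] : 2025[PDAT])
```

Both endpoints of the date range are inclusive, so a window from `year - 5` to `year` spans six calendar years.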

**Save enriched file**

In [1]:
df.to_csv("ENDOH_enriched.csv", index=False)
print("✅ ENDOH_enriched.csv saved with Word Count + Frequency (5-year window).")

Processed 1 of 102
Processed 11 of 102
Processed 21 of 102
Processed 31 of 102
Processed 41 of 102
Processed 51 of 102
Processed 61 of 102
Processed 71 of 102
Processed 81 of 102
Processed 91 of 102
Processed 101 of 102
✅ ENDOH_enriched.csv saved with Word Count + Frequency (5-year window).


### CELL 1: Stage 1 - Utility Score (US)

In [None]:
import torch
from sentence_transformers import SentenceTransformer, util

**Load enriched CSV**

In [None]:
df = pd.read_csv("ENDOH_enriched.csv")
df = df[['Preferred Label', 'Parents', 'Word Count', 'Frequency']].dropna()
df['Preferred Label'] = df['Preferred Label'].astype(str)
df['Cluster'] = df['Parents'].apply(lambda x: x.strip().split('#')[-1])
cluster_dict = df.groupby('Cluster')['Preferred Label'].apply(list).to_dict()

**Concept and weights**

The Utility Score formula has two weighted parameters:

- Semantic similarity to a cluster (how well the concept fits into an existing topic group), weighted by w1
- Redundancy (how much the concept's wording overlaps with existing concepts, which is penalized), weighted by w2
    
**concept_x** represents the concept we want to evaluate for inclusion in the ontology.

In [None]:
concept_x = "badly_maintained_urban_public_parks"
w1 = 1.0  # Semantic similarity
w2 = 1.0  # Redundancy

**Embedding model**

Here we load a pre-trained sentence embedding model from the Sentence Transformers library. 'all-MiniLM-L6-v2' is a lightweight, fast model that performs well on semantic similarity tasks.

- Convert the concept from underscore_case to normal spaced text for better language model interpretation.
- Then encode it into a vector (embedding) that represents the meaning of the phrase.
- This embedding will later be compared to other concept embeddings to measure similarity.

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
x_embedding = model.encode(concept_x.replace('_', ' '), convert_to_tensor=True)

**Semantic similarity**

For each sub-hierarchy, this step computes the average cosine similarity between the embedding of concept_x and the embeddings of all concepts already present in that cluster. Before encoding, underscores in concept labels are replaced with spaces to better match the language model's training data.

The Sentence-BERT model (all-MiniLM-L6-v2) is used to generate these embeddings. The resulting average similarity score represents how close in meaning the test concept is to that cluster.

The sub-hierarchy with the highest average similarity is selected as the “best fit” for concept_x, and this highest similarity value becomes the best_similarity score. This value is a key component of the Stage 1 utility score, as it reflects the semantic alignment of the new concept with existing structures in the ontology.

In [None]:
semantic_scores = {}
for cluster_name, terms in cluster_dict.items():
    cluster_embeddings = model.encode([t.replace('_', ' ') for t in terms], convert_to_tensor=True)
    cosine_scores = util.cos_sim(x_embedding, cluster_embeddings)
    avg_score = torch.mean(cosine_scores).item()
    semantic_scores[cluster_name] = avg_score

best_cluster = max(semantic_scores, key=semantic_scores.get)
best_similarity = semantic_scores[best_cluster]
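The cluster-selection rule can be illustrated without loading the embedding model. This is a minimal sketch that uses hand-made 2-D vectors in place of real embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors); the function names and toy clusters are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_cluster_by_mean_similarity(x, clusters):
    """Return (cluster name, mean similarity) for the closest cluster."""
    means = {
        name: sum(cosine(x, v) for v in vectors) / len(vectors)
        for name, vectors in clusters.items()
    }
    best = max(means, key=means.get)
    return best, means[best]


# Toy vectors standing in for concept embeddings (illustrative only)
toy_clusters = {
    "cluster_a": [[1.0, 0.0], [0.8, 0.6]],
    "cluster_b": [[0.0, 1.0], [-1.0, 0.0]],
}
name, score = best_cluster_by_mean_similarity([1.0, 0.0], toy_clusters)
print(name, round(score, 2))  # cluster_a 0.9
```

Because the score is a mean over all members, a cluster with one very close concept and many distant ones can rank below a cluster whose members are all moderately close.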

**Redundancy (Jaccard): only use the best (semantically similar) sub-hierarchy**

This section calculates the redundancy of the concept based on Jaccard similarity, but only within the cluster identified as the most semantically similar (i.e., the best sub-hierarchy for concept_x). Redundancy is measured by comparing the words in concept_x to the words in each concept already present in that cluster. For each comparison, the Jaccard similarity is computed as the size of the intersection divided by the size of the union of the word sets. The highest redundancy score across all comparisons is retained as max_redundancy. 

By focusing only on the most semantically relevant cluster, the algorithm avoids penalizing the concept for similarities with unrelated parts of the ontology, and ensures that redundancy is assessed in the most contextually appropriate way. This score is subtracted from the semantic similarity to calculate the final utility score in Stage 1.

In [None]:
def jaccard_sim(a, b):
    a_words = set(a.lower().split('_'))
    b_words = set(b.lower().split('_'))
    union = a_words | b_words
    intersection = a_words & b_words
    return len(intersection) / len(union) if union else 0

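# Start at 0.0 so an empty or self-only cluster cannot yield a negative redundancy
max_redundancy = 0.0
most_redundant_concept = ""
for term in cluster_dict[best_cluster]:
    if term == concept_x:
        continue  # do not compare the concept with itself
    score = jaccard_sim(concept_x, term)
    if score > max_redundancy:
        max_redundancy = score
        most_redundant_concept = term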
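As a worked example of the Jaccard computation described above, the sketch below compares the test concept with `loss_of_urban_forest`, the term reported in the Stage 1 output. The function repeats the logic of `jaccard_sim` under a different name so that the block runs on its own.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity of two underscore-separated labels."""
    a_words = set(a.lower().split("_"))
    b_words = set(b.lower().split("_"))
    union = a_words | b_words
    return len(a_words & b_words) / len(union) if union else 0.0


# Shared words: {"urban"} -> 1; distinct words overall: 5 + 4 - 1 = 8
score = jaccard("badly_maintained_urban_public_parks", "loss_of_urban_forest")
print(score)  # 0.125
```

The measure is purely lexical: function words such as "of" count as much as content words, and synonyms contribute nothing.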

**Utility Score**

Calculate the **utility_score** based on the **semantic similarity score** and the **maximum redundancy score**

In [None]:
utility_score = (w1 * best_similarity) - (w2 * max_redundancy)
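To see how the two weights trade off, the sketch below evaluates the same formula with the values reported in the Stage 1 output (similarity 0.5462, redundancy 0.125). The function name `utility` and the alternative weight `w2=2.0` are illustrative assumptions.

```python
def utility(similarity, redundancy, w1=1.0, w2=1.0):
    """Weighted semantic similarity minus weighted redundancy."""
    return w1 * similarity - w2 * redundancy


baseline = utility(0.5462, 0.125)           # weights used in this notebook
stricter = utility(0.5462, 0.125, w2=2.0)   # hypothetical heavier redundancy penalty
print(round(baseline, 4))  # 0.4212
print(round(stricter, 4))
```

With equal weights, a concept can only score above zero if its average similarity to the best cluster exceeds its largest word overlap with any single member.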

**Word count & frequency stats from seed ontology**

This code calculates descriptive statistics for the two quantitative features used in Stage 2 (the Goodness Score): word count and PubMed frequency. Calling .describe() on the entire seed ontology yields the mean, standard deviation, minimum, and maximum of each feature.

In [None]:
word_stats = df['Word Count'].describe()
freq_stats = df['Frequency'].describe()

seed_stats = {
    'mean_wc': word_stats['mean'], 'std_wc': word_stats['std'],
    'min_wc': word_stats['min'], 'max_wc': word_stats['max'],
    'mean_freq': freq_stats['mean'], 'std_freq': freq_stats['std'],
    'min_freq': freq_stats['min'], 'max_freq': freq_stats['max']
}

## Stage 1 - Calculations

In [27]:
print("\n=== Stage 1 ===")
print("Concept:", concept_x)
print("Most Similar Cluster:", best_cluster)
print("Avg. Semantic Similarity:", round(best_similarity, 4))
print("Max Redundancy (Jaccard):", round(max_redundancy, 4), f"with '{most_redundant_concept}'")
print("Utility Score (US):", round(utility_score, 4))


=== Stage 1 ===
Concept: badly_maintained_urban_public_parks
Most Similar Cluster: Accessibility_to_green_space
Avg. Semantic Similarity: 0.5462
Max Redundancy (Jaccard): 0.125 with 'loss_of_urban_forest'
Utility Score (US): 0.4212


----------------

### CELL 2: Stage 2 - Goodness Score (Improved) 

In [None]:
import requests
import re
from Bio import Entrez

API Setup

In [None]:
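import os

# Keys are read from the environment rather than stored in the notebook;
# the variable names below are assumptions and can be changed freely.
merriam_key = os.environ.get("MERRIAM_WEBSTER_API_KEY")
entrez_key = os.environ.get("NCBI_API_KEY")
Entrez.api_key = entrez_key
Entrez.email = "name@example.com"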

Valid POS Combinations

In [None]:
valid_combos = {
    frozenset(['noun', 'noun']): 1.0,
    frozenset(['adjective', 'noun']): 0.95,
    frozenset(['noun', 'noun', 'noun']): 0.9,
    frozenset(['verb', 'noun']): 0.85,
    frozenset(['noun', 'adjective']): 0.8,
    frozenset(['noun', 'verb']): 0.75,
    frozenset(['adjective', 'noun', 'noun']): 0.7,
    frozenset(['adjective', 'adjective', 'noun']): 0.65,
    frozenset(['noun', 'prepositional phrase']): 0.6,
    frozenset(['adjective', 'adjective', 'adjective', 'noun']): 0.55,
    frozenset(['noun', 'noun', 'prepositional phrase']): 0.5,
    frozenset(['adjective', 'noun', 'noun', 'noun']): 0.45,
    frozenset(['noun', 'noun', 'noun', 'noun']): 0.4,
    frozenset(['noun', 'adjective', 'noun', 'noun']): 0.35,
    frozenset(['adjective', 'noun', 'noun', 'noun', 'noun']): 0.3
}
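One property of this table is worth checking: `frozenset` discards duplicates and ordering, so keys that differ only in repetition or order become the same key, and the last weight assigned to that key replaces the earlier ones. The sketch below demonstrates the effect on a two-entry excerpt and shows tuple keys as one order-preserving alternative; using tuples is an assumption about the intent, not part of the notebook.

```python
# Two entries from the table above collapse into a single key
a = frozenset(["noun", "noun"])
b = frozenset(["noun", "noun", "noun"])
print(a == b)  # True

demo = {
    frozenset(["noun", "noun"]): 1.0,
    frozenset(["noun", "noun", "noun"]): 0.9,
}
print(len(demo), demo[frozenset(["noun"])])  # 1 0.9

# Tuples keep both order and repetition, so each pattern stays distinct
ordered = {
    ("noun", "noun"): 1.0,
    ("adjective", "noun"): 0.95,
    ("noun", "adjective"): 0.8,
}
print(len(ordered), ordered[("noun", "adjective")])  # 3 0.8
```

With tuple keys, a lookup would need the tag sequence itself (for example `tuple(tags)`) rather than a set of tags.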

Merriam-Webster Combination Score

In [None]:
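def check_merriam(term):
    """Return the part-of-speech label ('fl') of the exact-match entry, or None."""
    url = f"https://www.dictionaryapi.com/api/v3/references/medical/json/{term}?key={merriam_key}"
    try:
        r = requests.get(url, timeout=10).json()
        for entry in r:
            if isinstance(entry, dict) and 'meta' in entry and entry['meta']['id'] == term:
                if 'fl' in entry:
                    return entry['fl']
        return None
    except (requests.RequestException, ValueError, KeyError, TypeError):
        # Network failure, non-JSON response, or unexpected response structure
        return None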

def combination_score(term):
    tags = []
    for word in term.replace('_', ' ').split():
        tag = check_merriam(word)
        if tag:
            tags.append(tag)
    pos_set = frozenset(tags)
    for combo in valid_combos:
        if pos_set.issubset(combo):
            return valid_combos[combo]
    return 0.0

Normalize Functions

In [None]:
def normalize_freq(f, stats):
    return max(0, min(1, (f - stats['min_freq']) / (stats['max_freq'] - stats['min_freq'])))

def normalize_wc(wc, stats):
    return max(0, min(1, 1 - ((wc - stats['min_wc']) / (stats['max_wc'] - stats['min_wc']))))
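As a worked example of the min-max normalization above, the sketch below reproduces the two normalized values from the Stage 2 output using the seed statistics reported there (word count 1 to 6, frequency 0 to 532625). The helper `min_max` and its zero-range guard are assumptions added for illustration.

```python
def min_max(value, lo, hi):
    """Scale value into [0, 1] relative to the range [lo, hi], with clamping."""
    if hi == lo:  # assumed guard: avoid division by zero for a constant feature
        return 0.0
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))


# Word count is inverted: shorter labels score higher
norm_wc = 1 - min_max(5, 1, 6)
# Frequency is used directly: more PubMed hits score higher
norm_freq = min_max(0, 0, 532625)

print(round(norm_wc, 3), norm_freq)  # 0.2 0.0
```

Because the frequency range is set by a single very common seed term, most concepts will land near zero on this scale.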

Google Translate API

In [None]:
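def translate_to_german(term):
    """Translate an English term to German; return the input unchanged on failure."""
    try:
        url = "https://translate.googleapis.com/translate_a/single"
        params = {"client": "gtx", "sl": "en", "tl": "de", "dt": "t", "q": term.replace('_', ' ')}
        res = requests.get(url, params=params, timeout=10).json()
        return res[0][0][0]
    except (requests.RequestException, ValueError, IndexError, KeyError, TypeError):
        # Network failure, non-JSON response, or unexpected response structure
        return term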

Word Utilities

In [None]:
def word_count(term):
    if not isinstance(term, str):
        return 0
    cleaned_term = re.sub(r'[^A-Za-z0-9\\s]', '', term)
    return len(cleaned_term.split())

def extract_words(term):
    if not isinstance(term, str):
        return set()
    cleaned_term = re.sub(r'[^A-Za-z0-9\\s]', '', term)
    return set(cleaned_term.split())

def decompose_german_term(term):
    if not isinstance(term, str):
        return set()
    segments = re.findall(r'[A-ZÄÖÜa-zäöüß]+', term)
    return set(segments)

def is_compound_word(eng_term, ger_term):
    eng_words = extract_words(eng_term)
    ger_words = decompose_german_term(ger_term)
    return bool(eng_words & ger_words)
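The pattern used in `word_count` and `extract_words` deserves a closer look. In a raw string, `\\s` inside a character class is a literal backslash followed by the letter `s`, not the whitespace class, so whitespace is removed together with punctuation and underscores. The sketch below shows the effect and a possible variant that keeps whitespace, treats underscores as spaces, and preserves umlauts; the intended behaviour is an assumption, and `count_words` is a hypothetical name.

```python
import re

term = "loss of urban forest"

# Pattern as written in the notebook: spaces are stripped as well
as_written = re.sub(r'[^A-Za-z0-9\\s]', '', term)
print(as_written, len(as_written.split()))  # lossofurbanforest 1


def count_words(term):
    """Count words, treating underscores as spaces and keeping Unicode letters."""
    if not isinstance(term, str):
        return 0
    cleaned = re.sub(r"[^\w\s]", "", term.replace("_", " "))
    return len(cleaned.split())


print(count_words("badly_maintained_urban_public_parks"))  # 5
print(count_words("schlecht gepflegte städtische öffentliche Parks"))  # 5
```

Switching to such a variant would change the word counts fed into `translation_score`, so any recorded scores would need to be recomputed.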

Translation Quality Score

In [None]:
def translation_score(eng_term, ger_term):
    eng_wc = word_count(eng_term)
    ger_wc = word_count(ger_term)
    score = 0.0

    # Case 1: Short concepts (1-3 words)
    if 1 <= eng_wc <= 3:
        if ger_wc <= eng_wc:
            score = 1.0
        elif ger_wc <= eng_wc + 1:
            score = 0.8
        else:
            score = 0.5

    # Case 2: Medium concepts (4-6 words)
    elif 4 <= eng_wc <= 6:
        if ger_wc < eng_wc:
            score = 1.0
        elif ger_wc == eng_wc:
            score = 0.8
        else:
            score = 0.5

    # Case 3: Longer concepts (7-20 words)
    elif 7 <= eng_wc <= 20:
        if ger_wc < eng_wc * 0.8:
            score = 0.9
        elif ger_wc < eng_wc:
            score = 0.7
        else:
            score = 0.4

    # Case 4: Very long concepts (21-80 words)
    elif 21 <= eng_wc <= 80:
        if ger_wc < eng_wc * 0.8:
            score = 0.7
        else:
            score = 0.4

    # Bonus: Compound word check
    if is_compound_word(eng_term, ger_term):
        score += 0.1

    return min(score, 1.0)

Final Goodness Score Calculation

In [28]:
alpha, beta, lambd, theta = 0.15, 0.22, 0.31, 0.27

combo = combination_score(concept_x)
freq = get_pubmed_freq(concept_x)
norm_freq = normalize_freq(freq, seed_stats)
wc = len(concept_x.replace('_', ' ').split())
norm_wc = normalize_wc(wc, seed_stats)
german = translate_to_german(concept_x)
tscore = translation_score(concept_x, german)

goodness = (alpha * combo) + (beta * norm_wc) + (lambd * tscore) + (theta * norm_freq)




=== Stage 2 ===
German Translation: schlecht gepflegte städtische öffentliche Parks
Raw Word Count: 5 (Seed min: 1.0, max: 6.0)
Raw PubMed Frequency: 0 (Seed min: 0.0, max: 532625.0)

Combination Score: 0.4
Normalized Word Count: 0.2
Translation Quality Score: 1.0
Normalized Frequency: 0

🎯 Final Goodness Score: 0.414 [Weights: α=0.15, β=0.22, λ=0.31, θ=0.27]
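The final score is a weighted sum of the four components. The sketch below recomputes it from the component values printed above; the dictionary layout and key names are illustrative assumptions.

```python
weights = {"combo": 0.15, "word_count": 0.22, "translation": 0.31, "frequency": 0.27}
scores = {"combo": 0.4, "word_count": 0.2, "translation": 1.0, "frequency": 0.0}

# 0.15*0.4 + 0.22*0.2 + 0.31*1.0 + 0.27*0.0
goodness = sum(weights[k] * scores[k] for k in weights)
print(round(goodness, 4))  # 0.414

# The weights sum to 0.95, which is therefore the highest attainable score
print(round(sum(weights.values()), 2))  # 0.95
```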


## Stage 2 - Calculations

In [None]:
print("\n=== Stage 2 ===")
print(f"German Translation: {german}")
print(f"Raw Word Count: {wc} (Seed min: {seed_stats['min_wc']}, max: {seed_stats['max_wc']})")
print(f"Raw PubMed Frequency: {freq} (Seed min: {seed_stats['min_freq']}, max: {seed_stats['max_freq']})\n")

print(f"Combination Score: {round(combo, 3)}")
print(f"Normalized Word Count: {round(norm_wc, 3)}")
print(f"Translation Quality Score: {round(tscore, 3)}")
print(f"Normalized Frequency: {round(norm_freq, 3)}")

print(f"\n🎯 Final Goodness Score: {round(goodness, 4)} [Weights: α={alpha}, β={beta}, λ={lambd}, θ={theta}]")

------------------

## Concept Classification

Each concept falls into one of three categories, depending on which stages it passes:

**Categories:**

- Pass Stage 1 and Stage 2 → Accept as good concept.
- Pass Stage 1 only → Useful but may need refinement or more validation.
- Fail Stage 1 → Likely irrelevant or redundant concept.



Thresholds

In [29]:
STAGE1_THRESHOLD = 0.17  # Utility Score threshold
STAGE2_THRESHOLD = 0.45   # Goodness Score threshold

Classification Result

In [30]:
print("\n=== Concept Classification ===")
if utility_score >= STAGE1_THRESHOLD and goodness >= STAGE2_THRESHOLD:
    print("✅ Passes Stage 1 and Stage 2 – Strong candidate for inclusion.")
elif utility_score >= STAGE1_THRESHOLD and goodness < STAGE2_THRESHOLD:
    print("⚠️ Passes Stage 1 only – Semantically relevant but lacks other goodness criteria.")
else:
    print("❌ Fails Stage 1 – Not a useful or unique enough concept.")


=== Concept Classification ===
⚠️ Passes Stage 1 only – Semantically relevant but lacks other goodness criteria.
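The same decision rule can be wrapped in a function so that several concepts can be classified in one pass. This is a minimal sketch, assuming the thresholds defined above; the function name and the returned labels are hypothetical.

```python
def classify(utility_score, goodness, stage1=0.17, stage2=0.45):
    """Map the two scores to one of the three categories."""
    if utility_score < stage1:
        return "fail_stage1"
    if goodness < stage2:
        return "stage1_only"
    return "pass_both"


# Scores reported above for badly_maintained_urban_public_parks
label = classify(0.4212, 0.414)
print(label)  # stage1_only
```

Checking Stage 1 first means the Goodness Score is only consulted for concepts that already cleared the utility threshold.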


Example results:

- Good: climate_change, noise_regulations
- Bad: housing_pressure
- Stage 1 only: badly_maintained_urban_public_parks (extremely verbose)