IR Project: NSF Grant Retrieval

## 1. Preprocessing

In [None]:
import pandas as pd

df = pd.read_csv("nsf_dataset.csv")

# Basic cleaning
df = df.dropna(subset=['abstract'])
df = df.drop_duplicates(subset=['abstract'])

# Normalize program names
df['program_element'] = df['program_element'].fillna("").str.upper().str.strip()

### NSF Categories Used in This Project

To simplify the dataset, each grant was assigned to one of four high-level NSF research categories:

- **BIO — Biological Sciences**  
  Topics include protein interactions, cancer detection, cellular processes, and biomedical AI.

- **CNS — Computer and Network Systems**  
  Covers cybersecurity, cloud systems, networks, and secure computation.

- **IIS — Information & Intelligent Systems**  
  Includes AI, machine learning, recommendation systems, and intelligent information processing.

- **OTHER — All remaining NSF programs**  
  Used as a catch-all when the program element does not match BIO, CNS, or IIS.


In [None]:
# These labels will be used as ground truth for evaluating relevance.


bio_programs = ['BIO', 'MCB', 'DBI', 'IOS']
iis_programs = ['IIS', 'CISE', 'SCC']
cns_programs = ['CNS', 'ENG', 'ECCS']
df['program_element'] = df['program_element'].fillna("").astype(str)
def map_bucket(program):
    if program == "":
        return "OTHER"
    tokens = program.split()

    if any(p in tokens for p in bio_programs):
        return "BIO"
    elif any(p in tokens for p in iis_programs):
        return "IIS"
    elif any(p in tokens for p in cns_programs):
        return "CNS"
    else:
        return "OTHER"

df['category'] = df['program_element'].apply(map_bucket)

print(df['category'].value_counts())


category
OTHER    39939
CNS        459
IIS        228
BIO        190
Name: count, dtype: int64
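
The counts above depend on `map_bucket` matching whole whitespace-separated tokens. The sketch below repeats that rule on a few made-up `program_element` strings (the sample strings are assumptions for illustration; the real column values may look different) to show which inputs fall through to OTHER.

```python
# Program-code lists copied from the notebook cell above.
bio_programs = ['BIO', 'MCB', 'DBI', 'IOS']
iis_programs = ['IIS', 'CISE', 'SCC']
cns_programs = ['CNS', 'ENG', 'ECCS']

def map_bucket(program):
    """Same rule as the notebook: exact token match, checked in BIO, IIS, CNS order."""
    if program == "":
        return "OTHER"
    tokens = program.split()
    if any(p in tokens for p in bio_programs):
        return "BIO"
    elif any(p in tokens for p in iis_programs):
        return "IIS"
    elif any(p in tokens for p in cns_programs):
        return "CNS"
    return "OTHER"

# Hypothetical program_element strings, chosen only to exercise each branch.
samples = ["CNS CORE", "IIS BIO", "BIOLOGY", "CNS-CORE", ""]
for s in samples:
    print(repr(s), "->", map_bucket(s))
```

Substrings ("BIOLOGY") and hyphenated codes ("CNS-CORE") do not match any token, and a string containing both an IIS and a BIO code is assigned to BIO because BIO is checked first. This strictness may be one reason the OTHER bucket is so large.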


In [None]:
# Downsample the OTHER category so it does not dominate the evaluation set
df_bio = df[df.category=="BIO"]
df_iis = df[df.category=="IIS"]
df_cns = df[df.category=="CNS"]
df_other = df[df.category=="OTHER"].sample(n=2000, random_state=42)  # reduce from 40k → 2k

df_eval = pd.concat([df_bio, df_iis, df_cns, df_other]).reset_index(drop=True)
print(df_eval['category'].value_counts())


category
OTHER    2000
CNS       459
IIS       228
BIO       190
Name: count, dtype: int64


In [None]:
# Clean DataFrame used by the retrieval part of the Streamlit UI

df_eval = df_eval[['id', 'award_title', 'abstract', 'program_element', 'category']]
df_eval.to_csv("nsf_grants_clean.csv", index=False)

## 2. Baseline Retrieval (BM25)

In [None]:
%pip install rank_bm25



In [None]:
import pandas as pd
from rank_bm25 import BM25Okapi
import numpy as np

df = pd.read_csv("nsf_grants_clean.csv")
df_eval = df[df['category'].isin(['BIO','IIS','CNS','OTHER'])].copy()

# Tokenize abstracts (lowercase, split on whitespace)
def tokenize(text):
    return text.lower().split()

corpus = df_eval['abstract'].tolist()
tokenized_corpus = [tokenize(doc) for doc in corpus]

# Build BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Example queries selected to represent each NSF domain.
# Each query has an expected category for evaluation.

queries = [
    ("Developing AI-based models for early detection of cancer using blood biomarkers and imaging data to improve patient survival rates.", "BIO"),
    ("Developing machine learning models to study protein interactions and cellular processes in biological systems.","BIO"),
    ("Designing advanced cybersecurity methods to protect cloud-based systems from zero-day attacks and data breaches in enterprise networks.", "CNS"),
    ("Building machine learning algorithms to enhance personalized recommendation systems for e-commerce platforms, improving user engagement and sales","IIS")
]
top_k = 5


In [None]:
# === Baseline BM25 Metrics ===
#---------------------------------------------------

def precision_at_k(retrieved_categories, true_category, k):
    """
    Precision@k = (relevant documents in top-k) / k
    Uses category match as the relevance signal.
    """
    return sum([c == true_category for c in retrieved_categories[:k]]) / k


def dcg(relevance_scores):
    """Discounted cumulative gain for computing nDCG."""
    return sum([(2**rel - 1) / np.log2(idx + 2) for idx, rel in enumerate(relevance_scores)])


def ndcg_at_k(retrieved_categories, true_category, k):
    """
    Normalized DCG = DCG(actual ranking) / DCG(ideal ranking)
    Rewards systems that rank relevant items higher.
    Note: the ideal ranking reorders only the retrieved top-k items,
    so relevant documents that were never retrieved are not counted.
    """
    rel = [1 if c == true_category else 0 for c in retrieved_categories[:k]]
    ideal_dcg = dcg(sorted(rel, reverse=True))
    return dcg(rel) / ideal_dcg if ideal_dcg > 0 else 0

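def mrr_at_k(retrieved_categories, true_category, k):
    """
    Reciprocal rank for one query = 1 / rank_of_first_relevant_item
    within the top-k (0 if none is relevant).
    Averaging this value over queries gives Mean Reciprocal Rank.
    """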

    for idx, c in enumerate(retrieved_categories[:k]):
        if c == true_category:
            return 1 / (idx + 1)
    return 0
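
As a quick sanity check on the three metrics, the sketch below recomputes them for a single made-up ranking in which only the second result is relevant. It works directly on a 0/1 relevance list instead of category labels, and the function names are illustrative rather than the ones defined above.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k

def reciprocal_rank(rels, k):
    """1 / rank of the first relevant result in the top-k, or 0 if none."""
    for idx, rel in enumerate(rels[:k]):
        if rel:
            return 1 / (idx + 1)
    return 0.0

def dcg(rels):
    """Discounted cumulative gain with the same 2**rel - 1 gain as the notebook."""
    return sum((2**rel - 1) / math.log2(idx + 2) for idx, rel in enumerate(rels))

def ndcg_at_k(rels, k):
    """DCG of the ranking divided by DCG of the same items sorted best-first."""
    rels = rels[:k]
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

rels = [0, 1, 0, 0, 0]           # hypothetical ranking: only result 2 is relevant
p = precision_at_k(rels, 5)      # 1 relevant out of 5 -> 0.2
rr = reciprocal_rank(rels, 5)    # first relevant at rank 2 -> 0.5
n = ndcg_at_k(rels, 5)           # (1 / log2(3)) / 1 -> about 0.631
print(f"P@5={p:.3f}  RR={rr:.3f}  nDCG@5={n:.3f}")
```

For binary labels the gain `2**rel - 1` equals `rel`, so this DCG matches a plain `rel / log2(rank + 1)` formulation.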


In [None]:
#---------------------------------------------------
# BASELINE EXPERIMENT: Evaluate BM25 keyword retrieval
#---------------------------------------------------

# This block tests how well BM25 performs on the 4 example
# queries BEFORE adding SBERT or hybrid ranking.
# The purpose is to show BM25’s limitations.

for query_text, true_cat in queries:
    tokenized_query = tokenize(query_text)
    # 1. BM25 computes a similarity score for each abstract
    scores = bm25.get_scores(tokenized_query)

    # 2. Retrieve the top-k highest-scoring documents
    top_indices = scores.argsort()[-top_k:][::-1]

    # 3. Extract titles and categories of the retrieved grants
    retrieved_titles = df_eval.iloc[top_indices]['award_title'].tolist()
    retrieved_categories = df_eval.iloc[top_indices]['category'].tolist()

    print(f"\nQuery: {query_text} (Expected category: {true_cat})")
    print(f"Top-{top_k} retrieved grants:")
    for i, (title, cat) in enumerate(zip(retrieved_titles, retrieved_categories), 1):
        print(f"{i}. {title} ({cat})")

    # 4. Compute the BM25 baseline metrics with the functions defined above
    prec = precision_at_k(retrieved_categories, true_cat, top_k)
    ndcg = ndcg_at_k(retrieved_categories, true_cat, top_k)
    mrr = mrr_at_k(retrieved_categories, true_cat, top_k)

    print(f"Precision@{top_k}: {prec:.2f}, nDCG@{top_k}: {ndcg:.2f}, MRR@{top_k}: {mrr:.2f}")



Query: Developing AI-based models for early detection of cancer using blood biomarkers and imaging data to improve patient survival rates. (Expected category: BIO)
Top-5 retrieved grants:
1. CAREER: Branching Processes on Graphs Inform Early Detection of Colorectal Cancer (OTHER)
2. SBIR Phase I:  Advanced microfluidic systems enabling development of novel circulating tumor cell diagnostics (OTHER)
3. III: Medium: Knowledge-Guided Meta Learning for Multi-Omics Survival Analysis (OTHER)
4. SBIR Phase I:  NARROW-BEAM DEDICATED BREAST COMPUTED TOMOGRAPHY (OTHER)
5. A Self-Contained Optical Heterodyned Second Harmonic Sensor (OTHER)
Precision@5: 0.00, nDCG@5: 0.00, MRR@5: 0.00

Query: Developing machine learning models to study protein interactions and cellular processes in biological systems. (Expected category: BIO)
Top-5 retrieved grants:
1. CDS&E: Controlling Protein - Protein Interactions: Computations and Experiments (CNS)
2. Collaborative Research: ABI Development: Integrated platf

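The low scores above reflect a limitation of BM25: it scores documents only on exact term overlap with the query, so it misses grants that use different wording or describe complex ideas. The scores also depend on the relevance signal used here: a retrieved grant counts as relevant only if its category label matches the expected one, so topically related grants labeled OTHER (such as the cancer-detection results for the first query) count as misses.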
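
To make the exact-overlap point concrete, the sketch below compares the query and document token sets directly. The query and document strings are made up, and `tokenize_words` is a hypothetical alternative tokenizer, not something the notebook uses. Swapping it in would change the BM25 numbers reported above.

```python
import re

def tokenize(text):
    """Whitespace tokenizer, as used in the notebook."""
    return text.lower().split()

def tokenize_words(text):
    """Hypothetical alternative: keep alphanumeric runs only, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Made-up strings for illustration.
query = "early detection of cancer."
doc = "Screening methods for cancer, tumors and related disease"

# BM25 only gives credit for query terms that also occur in the document.
overlap_ws = set(tokenize(query)) & set(tokenize(doc))
overlap_re = set(tokenize_words(query)) & set(tokenize_words(doc))

print("whitespace overlap:", overlap_ws)   # "cancer." and "cancer," are different tokens
print("regex overlap:     ", overlap_re)
```

With whitespace splitting, trailing punctuation stays attached to the token, so even a shared word can fail to match. Neither tokenizer links "detection" to "screening", which is the gap dense retrieval is meant to address.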



## 3. SBERT + Hybrid Retrieval

In [None]:
!pip install sentence-transformers faiss-cpu rank_bm25 pandas tqdm




In [None]:
# Build SBERT embeddings and FAISS index for semantic retrieval.
# This allows us to go beyond exact word matching and capture semantic similarity.
#---------------------------------------------------

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

dense_embeddings = model.encode(
    df_eval['abstract'].tolist(),
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)
faiss.normalize_L2(dense_embeddings)

d = dense_embeddings.shape[1]
index = faiss.IndexFlatIP(d)
index.add(dense_embeddings)


In [None]:
np.save("nsf_sbert_embeddings.npy", dense_embeddings)
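
Calling `faiss.normalize_L2` before adding vectors to `IndexFlatIP` means the index ranks by cosine similarity, because the inner product of two unit-length vectors equals their cosine. The sketch below checks that identity on two made-up 2-D vectors in plain Python (no FAISS or NumPy needed); the helper names are illustrative.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors standing in for a document and a query embedding.
doc_vec = [3.0, 4.0]
query_vec = [4.0, 3.0]

ip_normalized = dot(l2_normalize(doc_vec), l2_normalize(query_vec))
cos = cosine(doc_vec, query_vec)
print(ip_normalized, cos)  # both are 24/25 up to floating-point error
```

This is also why the query embedding has to be normalized the same way before searching the index.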


In [None]:
#---------------------------------------------------
# HYBRID RETRIEVAL (BM25 + SBERT)
#---------------------------------------------------
# This combines:
# - BM25 keyword scores  → captures lexical similarity
# - SBERT dense scores   → captures semantic similarity
#
# The goal is to overcome BM25’s weakness on paraphrased queries
# by mixing both signals through a weighted score (alpha).
#---------------------------------------------------

def get_dense_scores(query, top_k=200):
    """
    Embeds the query using SBERT, normalizes it, and retrieves
    top-k dense matches from the FAISS index.
    Returns:
      D = similarity scores
      I = retrieved document indices
    """
    q_emb = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, top_k)
    return D[0], I[0]

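def hybrid_rank(query, top_k=5, alpha=0.5):
    """
    Produces a hybrid ranking by combining:
      - BM25 keyword matching scores
      - SBERT dense embedding scores

    alpha = weight for BM25 (0.5 means equal contribution)

    Only the top-200 dense candidates are re-scored, so a grant outside
    that set cannot be returned even if its BM25 score is high.
    """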
    q_tokens = tokenize(query)
    bm25_scores = bm25.get_scores(q_tokens)

    D, I = get_dense_scores(query, top_k=200)
    candidates = I.tolist()
    bm_vals = bm25_scores[candidates]

    # normalize both score ranges to [0,1]
    bm_norm = (bm_vals - bm_vals.min()) / (bm_vals.max() - bm_vals.min() + 1e-9)
    dense_norm = (D - D.min()) / (D.max() - D.min() + 1e-9)

    # weighted hybrid score
    hybrid = alpha * bm_norm + (1 - alpha) * dense_norm

    sorted_idx = np.argsort(hybrid)[::-1][:top_k]
    final_indices = [candidates[i] for i in sorted_idx]
    final_scores = [hybrid[i] for i in sorted_idx]

    return list(zip(final_indices, final_scores))
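
The fusion step in `hybrid_rank` can be traced by hand on a tiny example. The sketch below applies the same min-max normalization and alpha weighting to four made-up candidates in plain Python; the scores and the helper names `min_max` and `fuse` are assumptions for illustration only.

```python
def min_max(values, eps=1e-9):
    """Rescale scores to roughly [0, 1]; eps avoids division by zero."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]

def fuse(bm25_vals, dense_vals, alpha=0.5):
    """Weighted sum of normalized BM25 and dense scores (alpha weights BM25)."""
    bm_norm = min_max(bm25_vals)
    dense_norm = min_max(dense_vals)
    return [alpha * b + (1 - alpha) * d for b, d in zip(bm_norm, dense_norm)]

def rank(scores):
    """Candidate positions ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical scores for four candidate documents.
bm25_vals = [10.0, 0.0, 8.0, 4.0]   # raw BM25 scores (unbounded)
dense_vals = [0.2, 0.8, 0.6, 0.9]   # cosine similarities

equal_weight = rank(fuse(bm25_vals, dense_vals, alpha=0.5))
bm25_heavy = rank(fuse(bm25_vals, dense_vals, alpha=0.8))
print("alpha=0.5:", equal_weight)
print("alpha=0.8:", bm25_heavy)
```

With equal weights the candidate with the best dense score comes first; raising alpha to 0.8 moves the best BM25 candidate to the top. Min-max scaling is sensitive to outliers in the candidate set, so the effective balance between the two signals can shift from query to query.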


## 4. Evaluation (Human Labels + LLM Labels)

In [None]:
#---------------------------------------------------
# Four example research queries covering BIO, CNS, and IIS.
# The same topics are used across BM25, SBERT, hybrid, and LLM-based
# evaluation for a fair comparison. The strings here are shortened;
# the evaluate_query calls below pass the full-length query text.
#---------------------------------------------------

queries = [
    ("Developing AI-based models for early detection of cancer using blood biomarkers.", "BIO"),
    ("Developing machine learning models to study protein interactions and cellular processes in biological systems.","BIO"),
    ("Designing cybersecurity defenses for cloud systems against zero-day attacks.", "CNS"),
    ("Machine learning methods for personalized recommendation in e-commerce.", "IIS")
]

In [None]:
# Human relevance labels for the top-5 retrieved results of each query.
# 1 = relevant to the query, 0 = not relevant.
# These act as ground truth for computing IR metrics.
#---------------------------------------------------

human_labels_dict = {
    "bio_cancer": [0,0,0,1,0],
    "bio_protein": [0,1,0,0,0],
    "cns_zero_day": [0,1,1,1,0],
    "iis_recommendation": [0,1,0,0,0]
}


In [None]:
from openai import OpenAI
import os

# Note: to run the LLM evaluation, set the OPENAI_API_KEY environment variable.

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

#---------------------------------------------------
# Use an LLM to act as a strict binary relevance judge (1 or 0)
# and also generate a short natural-language explanation.
#---------------------------------------------------

def llm_relevance_and_explanation(query, title, abstract, model="gpt-4o-mini"):
    """
    Returns:
      llm_label: 1 or 0
      explanation: short text explaining relevance or non-relevance
    """
    text = abstract[:1500]

    prompt = f"""
You are a STRICT binary relevance judge for an information retrieval system.

Your task:
- Read the user query and the grant abstract.
- Decide if the abstract is TRULY relevant to the query.
- You MUST respond in exactly this format:

Relevance: 1 or 0
Explanation: <your short explanation>

Rules:
- Relevance: 1 ONLY if the abstract CLEARLY matches the main topic of the query.
- If relevance = 0, the explanation MUST start with:
  "This grant is NOT relevant because ..."
- Do NOT stretch or force connections.
- If unsure, choose relevance = 0.

QUERY:
{query}

TITLE:
{title}

ABSTRACT:
{text}
"""

    try:
        res = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200,
        )
        output = res.choices[0].message.content.strip()

        # --- Parse relevance ---
        # Default to 0 if we don't see "Relevance: 1"
        llm_label = 0
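        for line in output.splitlines():
            if "Relevance:" in line:
                # Inspect only the text after "Relevance:" so a stray "1"
                # elsewhere on the line does not flip the label.
                value = line.split("Relevance:", 1)[1].strip()
                llm_label = 1 if value.startswith("1") else 0
                break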

        # --- Parse explanation ---
        explanation = output
        for line in output.splitlines():
            if line.strip().startswith("Explanation:"):
                explanation = line.strip().replace("Explanation:", "").strip()
                break

        return llm_label, explanation

    except Exception as e:
        return 0, f"[LLM error: {e}]"
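
The reply parsing can be tested without calling the API by pulling it into a small helper. The sketch below is one possible regex-based version; `parse_judgment` is a hypothetical name, and the sample replies are made up to mimic the format requested in the prompt.

```python
import re

def parse_judgment(output):
    """Hypothetical helper: extract (label, explanation) from the judge's reply.

    Falls back to label 0 when no 'Relevance: 0/1' field is found, and to the
    full reply text when no 'Explanation:' field is found.
    """
    label_match = re.search(r"Relevance:\s*([01])\b", output)
    label = int(label_match.group(1)) if label_match else 0

    expl_match = re.search(r"Explanation:\s*(.+)", output, flags=re.DOTALL)
    explanation = expl_match.group(1).strip() if expl_match else output.strip()
    return label, explanation

# Made-up replies in the format the prompt asks for.
relevant = "Relevance: 1\nExplanation: The abstract targets early cancer detection."
not_relevant = (
    "Relevance: 0\n"
    "Explanation: This grant is NOT relevant because it covers 1 unrelated topic."
)
malformed = "I cannot decide."

print(parse_judgment(relevant))
print(parse_judgment(not_relevant))
print(parse_judgment(malformed))
```

Anchoring the match on the digit right after `Relevance:` keeps a "1" in the explanation from being read as the label, and defaulting to 0 follows the prompt's "if unsure, choose 0" rule.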

In [None]:
import numpy as np
#---------------------------------------------------
# Run a full evaluation for one query:
# - retrieve top-5 grants using hybrid rank
# - show human labels
# - compute Precision@5, MRR, nDCG
# - get LLM labels + explanations
# - compute human vs LLM agreement
#---------------------------------------------------

def evaluate_query(query_text, true_category, query_key, top_k=5):
    print("\n====================================================")
    print("QUERY:", query_text)
    print("EXPECTED CATEGORY:", true_category)

    # ---- 1. RETRIEVAL ----
    result = hybrid_rank(query_text, top_k=top_k, alpha=0.5)
    retrieved_indices = [idx for idx, score in result]

    print("\nTOP-5 RETRIEVED GRANTS:")
    for i, idx in enumerate(retrieved_indices, start=1):
        title = df_eval.iloc[idx]["award_title"]
        cat   = df_eval.iloc[idx]["category"]
        print(f"{i}. {title}  ({cat})")

    # ---- 2. HUMAN RELEVANCE LABELS ----
    human_labels = human_labels_dict[query_key]
    print("\nHUMAN RELEVANCE LABELS (1 = relevant, 0 = not):", human_labels)

    # ---- 3. METRICS (USING HUMAN LABELS) ----

    # Precision@k: (relevant documents in top-k) / k
    precision = sum(human_labels) / top_k

    # Reciprocal rank: 1 / rank of the first relevant item (0 if none)
    mrr = 0.0
    for i, rel in enumerate(human_labels):
        if rel == 1:
            mrr = 1.0 / (i + 1)
            break

    # DCG: discounted cumulative gain, used to compute nDCG
    def dcg(rels):
        return sum(rel / np.log2(i + 2) for i, rel in enumerate(rels))

    ideal = sorted(human_labels, reverse=True)
    dcg_val = dcg(human_labels)
    idcg_val = dcg(ideal)
    ndcg = dcg_val / idcg_val if idcg_val > 0 else 0.0

    print("\n--- METRICS (Human-based) ---")
    print("Precision@5:", round(precision, 3))
    print("MRR@5:", round(mrr, 3))
    print("nDCG@5:", round(ndcg, 3))

    # ---- 4. LLM RELEVANCE + EXPLANATIONS ----
    print("\n--- LLM RELEVANCE + EXPLANATIONS ---")
    llm_labels = []

    for i, idx in enumerate(retrieved_indices, start=1):
        title = df_eval.iloc[idx]["award_title"]
        abstract = df_eval.iloc[idx]["abstract"]

        llm_rel, explanation = llm_relevance_and_explanation(
            query_text, title, abstract
        )
        llm_labels.append(llm_rel)

        print(f"\n{i}. {title}")
        print("LLM Relevance:", llm_rel)
        print("LLM Explanation:", explanation)

    print("\nLLM LABELS:", llm_labels)

    # ---- 5. HUMAN VS LLM AGREEMENT ----
    if len(llm_labels) == len(human_labels):
        agree = sum(1 for h, l in zip(human_labels, llm_labels) if h == l)
        agreement_rate = agree / len(human_labels)
        print("\nHuman vs LLM agreement rate:", round(agreement_rate, 3))
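
`evaluate_query` reports metrics one query at a time. A minimal sketch of averaging them over the four queries follows, reusing the human labels defined earlier. The helper names are illustrative, and with only four queries the averages should be read as rough indicators.

```python
# Human labels copied from the notebook cell above.
human_labels_dict = {
    "bio_cancer": [0, 0, 0, 1, 0],
    "bio_protein": [0, 1, 0, 0, 0],
    "cns_zero_day": [0, 1, 1, 1, 0],
    "iis_recommendation": [0, 1, 0, 0, 0],
}

def precision_at_k(rels, k):
    """Fraction of the top-k results labeled relevant."""
    return sum(rels[:k]) / k

def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for idx, rel in enumerate(rels):
        if rel == 1:
            return 1.0 / (idx + 1)
    return 0.0

k = 5
n_queries = len(human_labels_dict)

# Average each per-query metric over all queries.
mean_precision = sum(precision_at_k(r, k) for r in human_labels_dict.values()) / n_queries
mean_reciprocal_rank = sum(reciprocal_rank(r) for r in human_labels_dict.values()) / n_queries

print(f"Mean Precision@{k}: {mean_precision:.3f}")
print(f"MRR@{k}: {mean_reciprocal_rank:.4f}")
```

The per-query reciprocal ranks are 0.25, 0.5, 0.5, and 0.5, so the mean is 0.4375; the mean Precision@5 is 0.3.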





FINAL OUTPUT — Evaluation Report for All 4 Queries

(shown in the UI)



In [None]:
# 1) BIO – cancer / biomarkers
evaluate_query(
    "Developing AI-based models for early detection of cancer using blood biomarkers and imaging data to improve patient survival rates.",
    "BIO",
    query_key="bio_cancer"
)



QUERY: Developing AI-based models for early detection of cancer using blood biomarkers and imaging data to improve patient survival rates.
EXPECTED CATEGORY: BIO

TOP-5 RETRIEVED GRANTS:
1. SBIR Phase I:  Advanced microfluidic systems enabling development of novel circulating tumor cell diagnostics  (OTHER)
2. III: Medium: Knowledge-Guided Meta Learning for Multi-Omics Survival Analysis  (OTHER)
3. CAREER: Branching Processes on Graphs Inform Early Detection of Colorectal Cancer  (OTHER)
4. NSF/FDA: Towards an active surveillance framework to detect AI/ML-enabled Software as a Medical Device (SaMD) data and performance drift in clinical flow  (CNS)
5. SCH: Interpretable survival analysis of complex longitudinal data  (OTHER)

HUMAN RELEVANCE LABELS (1 = relevant, 0 = not): [0, 0, 0, 1, 0]

--- METRICS (Human-based) ---
Precision@5: 0.2
MRR@5: 0.25
nDCG@5: 0.431

--- LLM RELEVANCE + EXPLANATIONS ---

1. SBIR Phase I:  Advanced microfluidic systems enabling development of novel circulat

In [None]:
# 2) BIO – protein interactions
evaluate_query(
    "Developing machine learning models to study protein interactions and cellular processes in biological systems.",
    "BIO",
    query_key="bio_protein"
)



QUERY: Developing machine learning models to study protein interactions and cellular processes in biological systems.
EXPECTED CATEGORY: BIO

TOP-5 RETRIEVED GRANTS:
1. CDS&E: Controlling Protein - Protein Interactions: Computations and Experiments  (CNS)
2. Collaborative Research: ABI Development: Integrated platforms for protein structure and function predictions  (BIO)
3. SBIR Phase I:  Highly resource-efficient protein engineering using machine learning  (OTHER)
4. Heteropolymeric Semi-Autonomous Repeat Proteins: Coupling Energetics, Structure, and Function  (OTHER)
5. Collaborative Research: CISE-MSI: RPEP: III: celtSTEM Research Collaborative: Catapulting MSI Faculty and Students into Computational Research.  (IIS)

HUMAN RELEVANCE LABELS (1 = relevant, 0 = not): [0, 1, 0, 0, 0]

--- METRICS (Human-based) ---
Precision@5: 0.2
MRR@5: 0.5
nDCG@5: 0.631

--- LLM RELEVANCE + EXPLANATIONS ---

1. CDS&E: Controlling Protein - Protein Interactions: Computations and Experiments
LLM Rele

In [None]:
# 3) CNS – zero-day cloud cybersecurity
evaluate_query(
    "Designing advanced cybersecurity methods to protect cloud-based systems from zero-day attacks and data breaches in enterprise networks.",
    "CNS",
    query_key="cns_zero_day"
)



QUERY: Designing advanced cybersecurity methods to protect cloud-based systems from zero-day attacks and data breaches in enterprise networks.
EXPECTED CATEGORY: CNS

TOP-5 RETRIEVED GRANTS:
1. CAREER: SaTC: Bridging the Gap Between Research and Practice: Automation and Metrics in Security Operation Centers  (CNS)
2. Collaborative Research: CISE-MSI: RCBP-RF: SaTC: Building Research Capacity in AI Based Anomaly Detection in Cybersecurity  (IIS)
3. Active Control-Enabled Approaches for Handling Cyberattacks on Process Control Systems  (CNS)
4. CAREER: Accelerating General-Purpose Encrypted Computation on Diverse Hardware  (CNS)
5. CRII: SaTC: Identifying Emerging Threats in the Online Hacker Community for Proactive Cyber Threat Intelligence: A Diachronic Graph Convolutional Autoencoder Framework  (IIS)

HUMAN RELEVANCE LABELS (1 = relevant, 0 = not): [0, 1, 1, 1, 0]

--- METRICS (Human-based) ---
Precision@5: 0.6
MRR@5: 0.5
nDCG@5: 0.733

--- LLM RELEVANCE + EXPLANATIONS ---

1. CAREER

In [None]:
# 4) IIS – recommendation systems
evaluate_query(
    "Building machine learning algorithms to enhance personalized recommendation systems for e-commerce platforms, improving user engagement and sales.",
    "IIS",
    query_key="iis_recommendation"
)



QUERY: Building machine learning algorithms to enhance personalized recommendation systems for e-commerce platforms, improving user engagement and sales.
EXPECTED CATEGORY: IIS

TOP-5 RETRIEVED GRANTS:
1. Collaborative Research: CCRI: New: A Research News Recommender Infrastructure with Live Users for Algorithm and Interface Experimentation  (OTHER)
2. CAREER: Leveraging Recommendations for Self-Actualization  (IIS)
3. SBIR Phase I: A Method to Expand Personalized Experiential Learning  (OTHER)
4. CAREER: Enhanced Analysis & Algorithms to Minimize the Spread of Misinformation in Social Networks  (CNS)
5. Collaborative Research: CNS Core: Medium: Learning to Cache and Caching to Learn in High Performance Caching Systems  (CNS)

HUMAN RELEVANCE LABELS (1 = relevant, 0 = not): [0, 1, 0, 0, 0]

--- METRICS (Human-based) ---
Precision@5: 0.2
MRR@5: 0.5
nDCG@5: 0.631

--- LLM RELEVANCE + EXPLANATIONS ---

1. Collaborative Research: CCRI: New: A Research News Recommender Infrastructure with 