<a href="https://colab.research.google.com/github/gr98765/Semantic-hybrid-retrieval-for-funding-discovery/blob/main/IR_project_grant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IR PROJECT

In [None]:
import pandas as pd

df = pd.read_csv(
    "nsf_dataset.csv",
    engine='python',
    on_bad_lines='skip'
)

In [None]:
# Quick inspection (uncomment as needed):
# print(df.columns)
# print(df.shape)
# df.head()

DROP DUPLICATES AND MAP CATEGORIES

In [None]:
# Drop rows without an abstract, then rows whose abstract is an exact duplicate
df = df.dropna(subset=['abstract'])
df = df.drop_duplicates(subset=['abstract'])

# Substrings that assign a program element to each evaluation bucket
bio_programs = ['BIO', 'MCB', 'CBET']
iis_programs = ['IIS', 'AI', 'CISE']
cns_programs = ['CNS', 'ENG', 'NSF']

def map_bucket(program):
    # Case-sensitive substring match; buckets are checked in order, so the first match wins
    if pd.isna(program):
        return 'OTHER'
    elif any(p in program for p in bio_programs):
        return 'BIO'
    elif any(p in program for p in iis_programs):
        return 'IIS'
    elif any(p in program for p in cns_programs):
        return 'CNS'
    else:
        return 'OTHER'

df['category'] = df['program_element'].apply(map_bucket)
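
A standalone sketch of the same first-match-wins rule, to show how the substring check behaves. It uses `math.isnan` in place of `pd.isna` so it runs without pandas, and the sample program strings are hypothetical (they are not taken from the dataset).

```python
import math

BIO_PROGRAMS = ['BIO', 'MCB', 'CBET']
IIS_PROGRAMS = ['IIS', 'AI', 'CISE']
CNS_PROGRAMS = ['CNS', 'ENG', 'NSF']

def is_missing(value):
    # Stand-in for pd.isna on scalars: None or float NaN
    return value is None or (isinstance(value, float) and math.isnan(value))

def map_bucket(program):
    if is_missing(program):
        return 'OTHER'
    # Buckets are checked in this order, so the first matching bucket wins
    for bucket, markers in (('BIO', BIO_PROGRAMS),
                            ('IIS', IIS_PROGRAMS),
                            ('CNS', CNS_PROGRAMS)):
        if any(marker in program for marker in markers):
            return bucket
    return 'OTHER'

# Hypothetical program_element strings, chosen only to exercise each branch
samples = ["CBET Cellular Engineering", "IIS Robust Intelligence",
           "CNS Core", "RETAIL ANALYTICS", float("nan")]
for s in samples:
    print(repr(s), "->", map_bucket(s))
```

Because the match is a case-sensitive substring test, an unrelated string such as "RETAIL ANALYTICS" lands in IIS simply because it contains "AI". Short markers like 'AI', 'ENG', and 'NSF' may therefore be worth checking against the real `program_element` values.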


CLEAN THE FILE

In [None]:
# Take an explicit copy so the rename does not act on a slice of df
# (this avoids pandas' SettingWithCopyWarning)
df_clean = df[['id', 'award_title', 'abstract', 'program_element', 'category']].copy()
df_clean = df_clean.rename(columns={'award_title': 'title'})
df_clean.to_csv("nsf_grants_clean.csv", index=False)
print(df_clean['category'].value_counts())


category
OTHER    38734
BIO        822
CNS        675
IIS        585
Name: count, dtype: int64
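
For context when reading the retrieval scores, the counts above can be turned into category shares. Under the simplifying assumption of a ranker that picks documents uniformly at random, the expected Precision@k for a category equals its share of the corpus. This is only a rough chance-level reference, not a result produced by the notebook.

```python
# Category counts copied from the value_counts() output above
counts = {"OTHER": 38734, "BIO": 822, "CNS": 675, "IIS": 585}

total = sum(counts.values())
# Share of the corpus per category = expected Precision@k of a uniformly random ranker
shares = {category: n / total for category, n in counts.items()}

for category, share in shares.items():
    print(f"{category:<5} {counts[category]:>6}  {share:.1%}")
```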


BM25 keyword matching retrieval

In [None]:
%pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [None]:
import pandas as pd
from rank_bm25 import BM25Okapi
import numpy as np

df = pd.read_csv("nsf_grants_clean.csv")
df_eval = df[df['category'].isin(['BIO','IIS','CNS','OTHER'])].copy()

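# Tokenize abstracts: lowercase and split on whitespace (punctuation stays attached to words)
def tokenize(text):
    return text.lower().split()

corpus = df_eval['abstract'].tolist()
tokenized_corpus = [tokenize(doc) for doc in corpus]

# Build the BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Evaluation queries, each paired with its expected category
queries = [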
    ("Developing AI-based models for early detection of cancer using blood biomarkers and imaging data to improve patient survival rates.", "BIO"),
    ("Designing advanced cybersecurity methods to protect cloud-based systems from zero-day attacks and data breaches in enterprise networks.", "CNS"),
    ("Building machine learning algorithms to enhance personalized recommendation systems for e-commerce platforms, improving user engagement and sales","IIS")
]
top_k = 5


Evaluation metrics

In [None]:
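def precision_at_k(retrieved_categories, true_category, k):
    # Fraction of the top-k results whose category matches the expected one
    return sum(c == true_category for c in retrieved_categories[:k]) / k

def dcg(relevance_scores):
    # Discounted cumulative gain: gain (2^rel - 1) divided by log2(rank + 1), with rank starting at 1
    return sum((2**rel - 1) / np.log2(idx + 2) for idx, rel in enumerate(relevance_scores))

def ndcg_at_k(retrieved_categories, true_category, k):
    # Binary relevance: 1 if the category matches, else 0
    relevance = [1 if c == true_category else 0 for c in retrieved_categories[:k]]
    # The ideal ordering is built from the retrieved items only, not from the whole corpus
    ideal_relevance = sorted(relevance, reverse=True)
    ideal_dcg = dcg(ideal_relevance)
    return dcg(relevance) / ideal_dcg if ideal_dcg > 0 else 0

def mrr_at_k(retrieved_categories, true_category, k):
    # Reciprocal rank of the first matching result within the top k; 0 if none matches
    for idx, c in enumerate(retrieved_categories[:k]):
        if c == true_category:
            return 1 / (idx + 1)
    return 0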
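
A small worked example of the three metrics on a made-up ranking, as a sanity check. The functions mirror the definitions above but use `math.log2` instead of `np.log2` so the sketch runs with the standard library only. The `retrieved` list is hypothetical.

```python
import math

def precision_at_k(retrieved_categories, true_category, k):
    return sum(c == true_category for c in retrieved_categories[:k]) / k

def dcg(relevance_scores):
    return sum((2**rel - 1) / math.log2(idx + 2)
               for idx, rel in enumerate(relevance_scores))

def ndcg_at_k(retrieved_categories, true_category, k):
    relevance = [1 if c == true_category else 0 for c in retrieved_categories[:k]]
    ideal_dcg = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal_dcg if ideal_dcg > 0 else 0

def mrr_at_k(retrieved_categories, true_category, k):
    for idx, c in enumerate(retrieved_categories[:k]):
        if c == true_category:
            return 1 / (idx + 1)
    return 0

# Hypothetical ranking: relevant (BIO) results at ranks 2 and 4
retrieved = ['OTHER', 'BIO', 'OTHER', 'BIO', 'OTHER']

p = precision_at_k(retrieved, 'BIO', 5)   # 2 of 5 results match
n = ndcg_at_k(retrieved, 'BIO', 5)        # (1/log2(3) + 1/log2(5)) / (1 + 1/log2(3))
m = mrr_at_k(retrieved, 'BIO', 5)         # first match is at rank 2
print(f"Precision@5:{p:.2f},nDCG@5:{n:.2f},MRR@5:{m:.2f}")
```

Since the ideal ordering is taken from the retrieved list itself, a ranking with a single relevant result at rank 1 scores nDCG 1.0 even though Precision@5 is only 0.2. The two numbers are best read together.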

Evaluation of queries

In [None]:
for query_text, true_cat in queries:
    tokenized_query = tokenize(query_text)
    scores = bm25.get_scores(tokenized_query)
    # Indices of the top_k highest scores, best first
    top_indices = scores.argsort()[-top_k:][::-1]

    retrieved_titles = df_eval.iloc[top_indices]['title'].tolist()
    retrieved_categories = df_eval.iloc[top_indices]['category'].tolist()

    print(f"\nQuery: {query_text} (Expected category: {true_cat})")
    print(f"Top-{top_k} retrieved grants:")
    for i, (title, cat) in enumerate(zip(retrieved_titles, retrieved_categories), 1):
        print(f"{i}. {title} ({cat})")

    # Precision: fraction of the top-k results in the expected category
    prec = precision_at_k(retrieved_categories, true_cat, top_k)
    # nDCG: measures ranking quality
    ndcg = ndcg_at_k(retrieved_categories, true_cat, top_k)
    # MRR: measures how early the first relevant result appears
    mrr = mrr_at_k(retrieved_categories, true_cat, top_k)

    print(f"Precision@{top_k}:{prec:.2f},nDCG@{top_k}:{ndcg:.2f},MRR@{top_k}:{mrr:.2f}")


Query: Developing AI-based models for early detection of cancer using blood biomarkers and imaging data to improve patient survival rates. (Expected category: BIO)
Top-5 retrieved grants:
1. I-Corps:  Translation potential of a new Surface-Enhanced Raman Spectroscopy (SERS) substrate for early detection of cancer (OTHER)
2. I-Corps: Rapid Ultrasensitive Biodetection Chip for Early Lung Cancer Diagnosis (OTHER)
3. I-Corps:  Multiplex diagnostic assay using interdigitated nano-sensing technology implemented point-of-care device (OTHER)
4. PFI-TT:  Point-of-Care Sensor Based on Electric Fields and Machine Learning for the  Detection of Circulating MicroRNA to Identify Early Stage Pancreatic Cancer (OTHER)
5. I-Corps:  Early Detection of Recurrence for High-Risk Breast Cancer Patients (OTHER)
Precision@5:0.00,nDCG@5:0.00,MRR@5:0.00

Query: Designing advanced cybersecurity methods to protect cloud-based systems from zero-day attacks and data breaches in enterprise networks. (Expected categ

The low scores above reflect a limitation of BM25: it relies only on exact word overlap, so it fails when queries use different wording or describe complex ideas.
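
To make the exact-overlap point concrete, here is a minimal pure-Python BM25 sketch. It is not the `rank_bm25` implementation: the IDF uses the common `log(1 + (N - n + 0.5) / (n + 0.5))` form, the defaults `k1=1.5` and `b=0.75` are assumptions, and the three documents and both queries are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus
docs = [
    "early detection of lung cancer with a biosensor chip",
    "securing cloud systems against network intrusions",
    "recommendation algorithms for online retail",
]

exact = bm25_scores("cancer detection", docs)      # shares two words with docs[0]
paraphrase = bm25_scores("tumor screening", docs)  # same intent, no shared words
print("exact:     ", [round(s, 3) for s in exact])
print("paraphrase:", [round(s, 3) for s in paraphrase])
```

The paraphrased query scores zero against every document because none of its tokens occur in the corpus, even though it asks for the same thing as the first query.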

Next step: semantic search with SBERT, which is meant to overcome these limitations (work in progress).