## Research Influencer Identification Notebook

**Objective:** Identify influential researchers within a dataset of research awards based on collaboration patterns and research breadth.

**Methodology:**
1.  **Load Data:** Read award data from JSON files, extracting relevant information about awards, organizations, programs, and investigators (PIs).
2.  **Clean & Preprocess:** Create a Pandas DataFrame, clean the data (handle missing values, filter roles), and perform feature engineering (leadership indicator, experience).
3.  **Generate Embeddings:** Create text embeddings for each award's descriptive text using a Sentence Transformer model (`all-MiniLM-L6-v2`) to capture semantic meaning.
4.  **Aggregate PI Data:** Group the data by Principal Investigator (PI) ID, aggregating metrics like average experience, leadership status, total awards, and average text embedding.
5.  **Normalize Features:** Scale numeric features (experience, award count) for fair comparison.
6.  **Candidate Selection:** Implement functions to select a pool of candidate PIs based on specific criteria:
    * **Topic Similarity:** Find PIs whose aggregated research text embeddings are most similar to a given research topic.
    * **Department:** Find PIs associated with a specific academic department, potentially ranking them by award count.
7.  **LLM-Based Ranking:**
    * Format the aggregated data (project count, unique collaborators, field diversity) for the selected candidates into a structured text format.
    * Generate a prompt instructing a Large Language Model (LLM; here, Google's Gemini) to rank these candidates based on the provided metrics and a definition of "influence" (project volume, collaborator network size, field diversity).
    * Send the prompt to the LLM and retrieve the ranked list with justifications.
8.  **Verification:** Implement a function to independently recalculate the key metrics (project count, collaborator count, field count) for a specific PI directly from the source data to verify the numbers used in the LLM ranking.

**Libraries Used:**
* `dotenv`: To load environment variables (like API keys).
* `os`: For file system operations (listing directories/files).
* `google.generativeai`: Google Gemini API client.
* `IPython.display`: For displaying rich content in notebooks (optional).
* `time`: For timing API calls.
* `typing`: For type hinting.
* `datetime`: For calculating experience based on dates.
* `sentence_transformers`: For generating text embeddings.
* `sklearn.metrics.pairwise`: For calculating cosine similarity (for topic search).
* `sklearn.preprocessing`: For data normalization (MinMaxScaler).
* `json`: For reading JSON data files.
* `pandas`: For data manipulation and analysis.
* `numpy`: For numerical operations, especially with embeddings.

In [48]:
import os
import time
import json
from datetime import datetime
from typing import List, Dict, Tuple, Optional  # For type hinting

import numpy as np
import pandas as pd
import google.generativeai as genai
from dotenv import load_dotenv
from IPython.display import display
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

load_dotenv()  # Load environment variables (e.g. GOOGLE_API_KEY) from a .env file

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

In [49]:
## Load Gemini model
# Note: the model ID below is an experimental release; replace it if it becomes unavailable
try:
    # model = genai.GenerativeModel('gemini-1.0-pro')
    model = genai.GenerativeModel('gemini-2.0-flash-thinking-exp-01-21')
    print("Gemini model loaded successfully.")
except Exception as e:
    print(f"Error loading Gemini model: {e}")
    exit() # Exit if model cannot be loaded

data_directory = 'data/ranking_data/'
records = []

Gemini model loaded successfully.


## Data Loading and Initial Parsing

This section reads JSON files containing award data. Each file typically represents one award and may contain information about multiple investigators (PIs) associated with it.

We define a helper function `safe_get` to navigate potentially missing keys or incorrect structures within the JSON data gracefully. The code iterates through subdirectories and files, extracts key fields for each award and associated PI, and stores them in a list of dictionaries (`records`).


In [50]:
def safe_get(data, keys, default=None):
    """
    Safely get a nested value from dicts and lists using a list of keys/indices.
    Returns default if any key is missing or any index is out of range.
    """
    temp_data = data
    for key in keys:
        if isinstance(temp_data, dict) and key in temp_data:
            temp_data = temp_data[key]
        # Lists are indexed only with in-range, non-negative integers;
        # the isinstance check avoids a TypeError when a string key meets a list
        elif isinstance(temp_data, list) and isinstance(key, int) and 0 <= key < len(temp_data):
            temp_data = temp_data[key]
        else:
            # print(f"Debug: Key '{key}' not found or invalid index in path {keys} within data: {data}")
            return default
    return temp_data
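
As a quick illustration of how the helper behaves, the sketch below restates it in compact form (with an explicit integer check for list indices) and applies it to a made-up award record. The field values are hypothetical; only the key names mirror the ones used in this notebook.

```python
def safe_get(data, keys, default=None):
    """Walk nested dicts/lists along `keys`; return `default` on any miss."""
    current = data
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and isinstance(key, int) and 0 <= key < len(current):
            current = current[key]
        else:
            return default
    return current

# Hypothetical record, shaped like the award JSON read in this notebook
award = {
    "org": {"org_long_name": "Example Directorate"},
    "pgm_ele": [{"pgm_ele_long_name": "Example Program"}],
}

org_name = safe_get(award, ["org", "org_long_name"])
program = safe_get(award, ["pgm_ele", 0, "pgm_ele_long_name"])
missing = safe_get(award, ["perf_inst", "perf_inst_name"], default="unknown")
bad_index = safe_get(award, ["pgm_ele", 3, "pgm_ele_long_name"])
print(org_name, program, missing, bad_index)
```

A missing key and an out-of-range index both fall through to the default, so the loading loop never has to wrap each lookup in its own `try`/`except`.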

In [52]:
print(f"Starting data loading from: {data_directory}")
if not os.path.isdir(data_directory):
    print(f"Error: Data directory '{data_directory}' not found.")
    exit()
    
for sub_dir in os.listdir(data_directory):
    sub_directory_path = os.path.join(data_directory, sub_dir)
    if not os.path.isdir(sub_directory_path):
        continue # Skip if it's not a directory

    print(f"Reading files in {sub_dir}...")
    for filename in os.listdir(sub_directory_path):
        if filename.endswith('.json'):
            filepath = os.path.join(sub_directory_path, filename)
            try:
                with open(filepath, 'r', encoding='utf-8') as file: # Specify encoding
                    data = json.load(file)
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON from {filepath}: {e}")
                continue
            except Exception as e:
                print(f"Error reading {filepath}: {e}")
                continue

            # Extract award-level context information safely
            award_type = data.get("awd_istr_txt")
            award_title = data.get("awd_titl_txt")
            abstract = data.get("abst_narr_txt")
            org_name = safe_get(data, ["org", "org_long_name"]) # Adjusted path based on potential structure
            # org_name2 = data.get("org_long_name2") # Keep if exists, otherwise remove or use safe_get
            perf_inst_name = safe_get(data, ["perf_inst", "perf_inst_name"])

            # Extract program element and reference safely (checking if list exists)
            pgm_ele_list = data.get("pgm_ele")
            program_element = None
            if isinstance(pgm_ele_list, list) and len(pgm_ele_list) > 0:
                 program_element = safe_get(pgm_ele_list, [0, "pgm_ele_long_name"]) # Safe get within list

            pgm_ref_list = data.get("pgm_ref")
            program_reference = None
            if isinstance(pgm_ref_list, list) and len(pgm_ref_list) > 0:
                 program_reference = safe_get(pgm_ref_list, [0, "pgm_ref_long_name"]) # Safe get within list

            # Get investigator information, ensuring it's a list
            pi_list = data.get("pi")
            if not isinstance(pi_list, list):
                # print(f"Skipping file {filename}: 'pi' data is not a list.")
                continue # Skip if 'pi' is not a list

            # Loop through each investigator in the file
            for pi in pi_list:
                 # Ensure pi is a dictionary before trying to get keys
                 if not isinstance(pi, dict):
                     # print(f"Skipping PI entry in {filename}: Entry is not a dictionary: {pi}")
                     continue

                 record = {
                    "award_type": award_type,
                    "award_title": award_title,
                    "abstract": abstract,
                    "org_name": org_name,
                    # "org_name2": org_name2, # Remove if not consistently present
                    "perf_inst_name": perf_inst_name,
                    "program_element": program_element,
                    "program_reference": program_reference,
                    "pi_id": pi.get("pi_id"),
                    "pi_full_name": pi.get("pi_full_name", "").strip() if pi.get("pi_full_name") else None,
                    "role": pi.get("proj_role_code2", "").strip() if pi.get("proj_role_code2") else None,
                    "department": pi.get("pi_dept_name"),
                    "email": pi.get("pi_email_addr"),
                    "start_date": pi.get("start_date")
                }
                 records.append(record)

print(f"Finished reading files. Total records loaded: {len(records)}")

# Create a DataFrame from the records
if not records:
    print("Error: No records were loaded. Cannot create DataFrame.")
    exit()

df = pd.DataFrame(records)
print(f"DataFrame created with shape: {df.shape}")

Starting data loading from: data/ranking_data/
Reading files in 2022...
Reading files in 2024...
Reading files in 2023...
Reading files in 2021...
Reading files in 2020...
Finished reading files. Total records loaded: 89056
DataFrame created with shape: (89056, 13)


## DataFrame Creation and Preprocessing

Converts the list of records into a Pandas DataFrame. Performs essential cleaning:
* Drops records missing a `pi_id`.
* Filters the DataFrame to include only records where the investigator's role is 'Principal Investigator' or 'Co-Principal Investigator', as these roles are central to the influence analysis.
* Handles potential `NaN` values in text columns before combining them.

## Feature Engineering

Creates new features based on the existing data:
* **`combined_text`**: Concatenates relevant textual fields (title, abstract, program names, etc.) into a single string for each record. This text will be used to generate embeddings. Missing values in text columns are filled with empty strings before concatenation.
* **`leadership`**: A binary indicator (1 or 0) set to 1 if the PI's role is 'Principal Investigator', signifying primary leadership on the award.
* **`experience_years`**: Calculates the approximate years of experience based on the award's `start_date` relative to the current date. Handles potential errors in date conversion and fills missing experience values with 0.
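
The two derived features can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's pandas code: the rows are hypothetical, the `%m/%d/%Y` date format is assumed, and the reference date is fixed so the result is reproducible. Note the exact comparison for the role: a substring test for 'Principal Investigator' would also match 'Co-Principal Investigator'.

```python
from datetime import datetime

# Hypothetical rows; the role strings match the ones this notebook filters on
rows = [
    {"role": "Principal Investigator", "start_date": "01/15/2020"},
    {"role": "Co-Principal Investigator", "start_date": "not a date"},
]

def leadership_flag(role):
    """Return 1 only for an exact 'Principal Investigator' role."""
    return 1 if role == "Principal Investigator" else 0

def experience_years(start_date, reference, fmt="%m/%d/%Y"):
    """Years between start_date and reference; 0.0 if the date cannot be parsed."""
    try:
        start = datetime.strptime(start_date, fmt)
    except (TypeError, ValueError):
        return 0.0
    return (reference - start).days / 365.25

reference = datetime(2025, 1, 15)  # fixed reference date (assumption for the example)
features = [
    (leadership_flag(r["role"]), round(experience_years(r["start_date"], reference), 2))
    for r in rows
]
print(features)
```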


In [56]:
# --- Data Cleaning and Preprocessing ---
# Drop rows where pi_id is missing, as it's crucial for grouping
df.dropna(subset=['pi_id'], inplace=True)
print(f"Shape after dropping rows with missing pi_id: {df.shape}")

# Filter for relevant roles
df = df[df['role'].isin(['Co-Principal Investigator', 'Principal Investigator'])].copy() # Use .copy()
print(f"Shape after filtering for PI/Co-PI roles: {df.shape}")

if df.empty:
    print("Error: DataFrame is empty after filtering for PI/Co-PI roles. Check data or filters.")
    exit()

# Combine relevant text columns into one
text_columns = [
    "award_type", "award_title", "abstract",
    "org_name", #"org_name2", # Remove if not used
    "perf_inst_name",
    "program_element", "program_reference"
]
# Keep only the columns that exist (without mutating the list while iterating over it),
# then fill NaN values with empty strings before combining
for col in text_columns:
    if col not in df.columns:
        print(f"Warning: Column '{col}' not found in DataFrame. Skipping for combined_text.")
text_columns = [col for col in text_columns if col in df.columns]
for col in text_columns:
    df[col] = df[col].fillna('') # Fill NaNs specifically

df["combined_text"] = df[text_columns].astype(str).agg(" ".join, axis=1)

Shape after dropping rows with missing pi_id: (89056, 15)
Shape after filtering for PI/Co-PI roles: (83112, 15)


In [57]:
# Feature Engineering
# a. Leadership indicator: 1 only for an exact 'Principal Investigator' role.
#    (A substring test would also match 'Co-Principal Investigator'.)
df["leadership"] = (df["role"] == "Principal Investigator").astype(int)

# b. Experience in years
df["start_date"] = pd.to_datetime(df["start_date"], errors='coerce')
reference_date = datetime.now()
df["experience_years"] = (reference_date - df["start_date"]).dt.days / 365.25
# Assign back instead of calling fillna(inplace=True) on a column, which pandas warns about
df["experience_years"] = df["experience_years"].fillna(0) # Handle potential NaNs from bad dates


## Text Embedding Generation
Uses a pre-trained Sentence Transformer model (`all-MiniLM-L6-v2`) to convert the `combined_text` for each award record into a dense vector representation (embedding). These embeddings capture the semantic meaning of the text. Error handling is included for cases where text cannot be embedded.


In [58]:
# Load Sentence Transformer
try:
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    print("SentenceTransformer model loaded.")
except Exception as e:
    print(f"Error loading SentenceTransformer model: {e}")
    exit()

# Compute embeddings (handle potential errors during embedding)
embeddings = []
for text in df["combined_text"]:
    try:
        embeddings.append(embedder.encode(text))
    except Exception as e:
        print(f"Error embedding text: {text[:100]}... Error: {e}")
        # Append a zero vector or handle as appropriate
        embeddings.append(np.zeros(embedder.get_sentence_embedding_dimension()))
df["text_embedding"] = embeddings

SentenceTransformer model loaded.


## Data Aggregation by PI

Groups the DataFrame by `pi_id` to create a summary profile for each unique investigator. It aggregates the features calculated earlier:
* `award_count`: Total number of awards the PI is listed on (as PI or Co-PI).
* `experience_years`: Average experience across all their awards.
* `leadership`: Maximum value of the leadership indicator (1 if they were a PI on at least one award, 0 otherwise).
* `text_embedding`: Average of the text embeddings from all their awards. This creates a single vector representing the PI's overall research area(s).

Error handling is included for cases where a PI might have missing embedding data.
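
To make the aggregation concrete, here is a minimal plain-Python sketch of the same per-PI summary (count, mean experience, max leadership, mean-pooled embedding). The rows and two-dimensional embeddings are hypothetical stand-ins; the notebook itself does this with `groupby` on the DataFrame.

```python
from collections import defaultdict

# Hypothetical per-award rows: (pi_id, experience_years, leadership, embedding)
rows = [
    ("A", 4.0, 0, [1.0, 0.0]),
    ("A", 2.0, 1, [0.0, 1.0]),
    ("B", 1.0, 0, [0.5, 0.5]),
]

# Collect each PI's rows under their ID
groups = defaultdict(list)
for pi_id, exp, lead, emb in rows:
    groups[pi_id].append((exp, lead, emb))

profiles = {}
for pi_id, items in groups.items():
    n = len(items)
    dims = len(items[0][2])
    profiles[pi_id] = {
        "award_count": n,
        "experience_years": sum(exp for exp, _, _ in items) / n,
        "leadership": max(lead for _, lead, _ in items),
        # Element-wise mean of the embeddings (mean pooling)
        "text_embedding": [sum(emb[d] for _, _, emb in items) / n for d in range(dims)],
    }
print(profiles["A"])
```

Mean pooling keeps one vector per PI, at the cost of blurring PIs whose awards span very different topics.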

In [59]:
# Group by PI
print("Grouping data by PI ID...")
award_counts = df.groupby("pi_id").size().reset_index(name="award_count")

# Ensure text_embedding exists before aggregation
if "text_embedding" not in df.columns:
     print("Error: 'text_embedding' column not found before grouping.")
     # Handle error appropriately, maybe exit or create dummy embeddings
     exit()

# Check for empty groups or non-numeric data before aggregation
numeric_cols_agg = {
    "experience_years": "mean",
    "leadership": "max"
}
lambda_agg = {
    "text_embedding": lambda embs: np.mean(np.stack(embs), axis=0) if len(embs) > 0 else np.zeros(embedder.get_sentence_embedding_dimension())
}

# Perform aggregation
df_grouped = df.groupby("pi_id").agg({**numeric_cols_agg, **lambda_agg}).reset_index()

# Merge award counts
df_grouped = df_grouped.merge(award_counts, on="pi_id", how="left")
print(f"Grouped DataFrame created with shape: {df_grouped.shape}")

Grouping data by PI ID...
Grouped DataFrame created with shape: (54006, 5)


## Feature Normalization

Scales the aggregated numeric features (`experience_years`, `award_count`) to a range between 0 and 1 using `MinMaxScaler`. Normalization prevents features with larger ranges from disproportionately influencing distance-based calculations or models sensitive to feature scales.
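
Min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below applies that formula in plain Python to hypothetical award counts; the handling of a constant column (all zeros) is a choice made for this example.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

award_counts = [1, 3, 5, 21]  # hypothetical counts
award_norm = min_max_scale(award_counts)
print(award_norm)
```

Because the scale is set by the extremes, a single PI with an unusually high award count compresses everyone else toward 0.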


In [60]:
# Normalize numeric features
print("Normalizing features...")
scaler = MinMaxScaler()
# Ensure columns exist before scaling
features_to_scale = ["experience_years", "award_count"]
if all(col in df_grouped.columns for col in features_to_scale):
    df_grouped[["exp_norm", "award_norm"]] = scaler.fit_transform(df_grouped[features_to_scale])
    print("Features normalized.")
else:
    print(f"Warning: One or more columns ({features_to_scale}) not found for normalization.")

Normalizing features...
Features normalized.


## Helper and Core Logic Functions

This section defines the functions that implement the core logic for finding and ranking influencers.

In [61]:
# --- Function to get collaborators (as used in format_influencer_data) ---
def get_collaborators_for_awards(df: pd.DataFrame, award_titles: List[str]) -> Dict[str, List[str]]:
    """Helper to get collaborators for specific awards."""
    collaborators = {}
    # Optimize by filtering the main df once for relevant roles and titles
    relevant_awards_df = df[
        df['award_title'].isin(award_titles) &
        df['role'].isin(['Principal Investigator', 'Co-Principal Investigator'])
    ]
    for title in award_titles:
        # Filter further for the specific title
        award_pis = relevant_awards_df[relevant_awards_df['award_title'] == title]
        # Get unique names, handling potential missing names and ensuring they are strings
        names = [str(name) for name in award_pis['pi_full_name'].unique() if pd.notna(name)]
        collaborators[title] = names
    return collaborators

In [62]:
# --- Function to format data (as used in the LLM call) ---
def format_influencer_data(df: pd.DataFrame, pi_ids: List[str]) -> Tuple[str, Dict[str, str]]:
    print(f"Formatting influencer data for PI IDs: {pi_ids}...")
    formatted_data = ""
    pi_names = {} # Dictionary to store PI names mapped to IDs

    # Filter main df once for all relevant PIs and valid roles
    filtered_df = df[
        df['pi_id'].isin(pi_ids) &
        df['role'].isin(['Principal Investigator', 'Co-Principal Investigator'])
    ].copy() # Ensure it's a copy

    if filtered_df.empty:
        print("Warning: No data found for any specified PI IDs with roles PI/Co-PI.")
        formatted_data = "No data could be retrieved for the specified potential influencers with roles PI/Co-PI.\n"
        for pi_id in pi_ids:
             pi_names[pi_id] = f"PI ID {pi_id}" # Use ID as placeholder name
        return formatted_data, pi_names

    # Iterate through the requested PI IDs to structure the output
    for pi_id in pi_ids:
        pi_specific_data = filtered_df[filtered_df['pi_id'] == pi_id]

        if not pi_specific_data.empty:
            # Get consistent name from the first entry (handle potential NAs)
            full_name = pi_specific_data['pi_full_name'].dropna().iloc[0] if not pi_specific_data['pi_full_name'].dropna().empty else f"PI ID {pi_id}"
            pi_names[pi_id] = full_name
            print(f"  Processing data for {full_name} ({pi_id})...")

            formatted_data += f"--- Potential Influencer: {full_name} (ID: {pi_id}) ---\n"

            # --- Project & Connection Analysis ---
            # Count unique *non-null* award titles for this PI
            unique_award_titles = pi_specific_data['award_title'].dropna().unique()
            num_projects = len(unique_award_titles)
            formatted_data += f"Total Projects Involved In (as PI/Co-PI): {num_projects}\n"

            # Get all collaborators across these unique projects using the helper function
            # We need the full df here to find collaborators on the *same* awards
            collaborators_by_award = get_collaborators_for_awards(df, list(unique_award_titles))
            all_collaborators = set()
            for title, names in collaborators_by_award.items():
                # Add collaborators, excluding the PI themselves
                all_collaborators.update(name for name in names if name != full_name and pd.notna(name)) # Ensure name is not NA

            num_unique_collaborators = len(all_collaborators)
            formatted_data += f"Total Unique Collaborators (excluding self): {num_unique_collaborators}\n"
            # Optionally list some collaborators:
            collaborators_preview = ", ".join(list(all_collaborators)[:5]) # Preview first 5
            formatted_data += f"  Collaborators Sample: {collaborators_preview}{'...' if num_unique_collaborators > 5 else ''}\n"

            # --- Field Diversity Analysis ---
            # Use the PI-specific data for fields
            unique_elements = pi_specific_data['program_element'].dropna().unique()
            unique_references = pi_specific_data['program_reference'].dropna().unique()
            # Combine unique fields into a set
            all_fields = set(unique_elements) | set(unique_references)
            num_unique_fields = len(all_fields)
            formatted_data += f"Number of Unique Research Fields (Program Elements/References): {num_unique_fields}\n"
            # Optionally list some fields:
            fields_preview = ", ".join(list(all_fields)[:5]) # Preview first 5
            formatted_data += f"  Fields Sample: {fields_preview}{'...' if num_unique_fields > 5 else ''}\n\n"

        else:
            # Handle case where a specific PI ID from the list had no PI/Co-PI data
            formatted_data += f"--- Potential Influencer ID: {pi_id} ---\n"
            formatted_data += "No award data found in the provided dataset for this PI with role PI/Co-PI.\n\n"
            pi_names[pi_id] = f"PI ID {pi_id}" # Use ID as placeholder name

    print("Influencer data formatting complete.")
    return formatted_data, pi_names
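
The three metrics that end up in the formatted text (projects, unique collaborators, unique fields) can be illustrated on a tiny hypothetical dataset. This is a simplified plain-Python sketch, not the function above: `influence_metrics` and the sample rows are made up for the example, and, like the notebook, it identifies awards by title and collaborators by name.

```python
# Hypothetical rows: one row per (award, investigator) pair
rows = [
    {"award_title": "Award 1", "pi_id": "A", "pi_full_name": "Ada",
     "program_element": "Robotics", "program_reference": "AI"},
    {"award_title": "Award 1", "pi_id": "B", "pi_full_name": "Ben",
     "program_element": "Robotics", "program_reference": "AI"},
    {"award_title": "Award 2", "pi_id": "A", "pi_full_name": "Ada",
     "program_element": "Robotics", "program_reference": None},
    {"award_title": "Award 2", "pi_id": "C", "pi_full_name": "Cy",
     "program_element": "Robotics", "program_reference": None},
    {"award_title": "Award 3", "pi_id": "B", "pi_full_name": "Ben",
     "program_element": "Optics", "program_reference": "AI"},
    {"award_title": "Award 3", "pi_id": "D", "pi_full_name": "Dee",
     "program_element": "Optics", "program_reference": "AI"},
]

def influence_metrics(rows, pi_id):
    """Count distinct projects, collaborators (excluding self), and fields for one PI."""
    own = [r for r in rows if r["pi_id"] == pi_id]
    own_names = {r["pi_full_name"] for r in own}
    titles = {r["award_title"] for r in own}
    # Everyone listed on the same awards, minus the PI's own name
    collaborators = {r["pi_full_name"] for r in rows if r["award_title"] in titles} - own_names
    # Union of program elements and program references, skipping missing values
    fields = {r[k] for r in own for k in ("program_element", "program_reference") if r[k]}
    return {"projects": len(titles), "collaborators": len(collaborators), "fields": len(fields)}

metrics_a = influence_metrics(rows, "A")
metrics_d = influence_metrics(rows, "D")
print(metrics_a, metrics_d)
```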

In [63]:
# --- Generate Prompt Function ---
def generate_influencer_prompt(formatted_data_string: str, pi_names_dict: Dict[str, str]) -> str:
    print("Generating influencer prompt...")
    # Get names from the dictionary values, filtering out placeholders if needed
    candidate_names_list = ", ".join(name for name in pi_names_dict.values() if not name.startswith("PI ID"))
    if not candidate_names_list:
        candidate_names_list = "the specified PIs" # Fallback

    prompt = f"""
        Context:
        You are an AI assistant analyzing research collaboration data to identify 'influencers'. An influencer is defined as a researcher who has significant connections within the network, demonstrated by:
        1.  Being involved (as PI or Co-PI) in a relatively high number of distinct projects/awards.
        2.  Having collaborated with a relatively high number of unique individuals (other PIs or Co-PIs on the same projects).
        3.  Having experience across a diverse range of research fields (indicated by different Program Elements or Program References associated with their awards).

        Below is summarized data for potential influencers ({candidate_names_list}):

        {formatted_data_string}

        Task:
        Based *only* on the summarized information provided above, please analyze each researcher's profile according to the 'influencer' criteria (number of projects, number of unique collaborators, and field diversity count).

        Rank these individuals ({candidate_names_list}) from most influential to least influential based *solely* on the metrics presented in the context.

        Provide a clear ranking (numbered list) and a concise justification for your ranking. For each person in the ranking, explicitly state the project count, collaborator count, and field count that you used from the provided context to make your decision.
    """
    # Removed the last sentence "The ranking prioritizes..." from the prompt to avoid leading the model.
    print("Influencer prompt generated.")
    return prompt

In [64]:
# --- Get Gemini Response Function ---
def get_gemini_response(model: genai.GenerativeModel, prompt: str) -> Tuple[Optional[str], float]:
    """Sends the prompt to Gemini (non-streaming) and returns the response text and elapsed seconds."""
    print("--- Sending Request to Gemini ---")
    start_time = time.time()
    full_response_text = ""
    contents = [prompt]

    try:
        # generate_content is called without stream=True, so the full response arrives in one object
        response = model.generate_content(contents)
        # Check if response has parts and text (structure might vary)
        if hasattr(response, 'text'):
            full_response_text = response.text
        elif hasattr(response, 'parts') and response.parts:
             full_response_text = "".join(part.text for part in response.parts)
        else:
             # Fallback or specific handling if the structure is different
             # Check the actual response object structure if this happens
             print("Warning: Unexpected response structure from Gemini.")
             # Try a common alternative structure (may need adjustment)
             if hasattr(response, 'candidates') and response.candidates:
                 full_response_text = "".join(part.text for part in response.candidates[0].content.parts)

        response_time = time.time() - start_time
        # Simple check if the response seems empty or like an error message
        if not full_response_text or "error" in full_response_text.lower() or "could not process" in full_response_text.lower():
             print(f"\nWarning: Received potentially empty or error response from Gemini after {response_time:.2f}s.")
             print(f"Response received: {full_response_text}")
        else:
             print(f"\nResponse generated successfully in {response_time:.2f} seconds.")

        # print(f"\n--- Full Raw Response ---\n{full_response_text}\n------------------------") # Optional: print raw response for debug

        return full_response_text, response_time

    except AttributeError as e:
        response_time = time.time() - start_time
        print(f"\nError: Attribute error during API call (check model object/API setup): {e}")
        return None, response_time
    except Exception as e:
        response_time = time.time() - start_time
        # More detailed error logging
        import traceback
        print(f"\nAn error occurred during the Gemini API call: {e}")
        print(f"Traceback: {traceback.format_exc()}")
        print(f"Attempt failed after {response_time:.2f} seconds.")
        return None, response_time

In [71]:
# --- Function to identify influencers using LLM ---
def identify_influencer_llm(df: pd.DataFrame, model: genai.GenerativeModel, pi_ids: List[str]) -> Optional[str]:
    print(f"\n--- Starting Influencer Identification Process for PI IDs: {pi_ids} ---")

    # 1. Format Data for Influencer Analysis
    formatted_text, pi_names = format_influencer_data(df, pi_ids)

    # Add a check here: If formatted_text indicates no data, stop.
    if "No data could be retrieved" in formatted_text or not pi_names or all(name.startswith("PI ID") for name in pi_names.values()):
         print("Stopping LLM influencer identification: No usable data formatted for the specified PIs.")
         return "Could not generate influencer ranking: No valid data found for the specified PI IDs with PI/Co-PI roles."

    # 2. Generate Influencer Prompt
    prompt_text = generate_influencer_prompt(formatted_text, pi_names)
    # print("\n--- Generated Prompt ---") # Optional: Debugging
    # print(prompt_text) # Optional: Debugging
    # print("----------------------") # Optional: Debugging

    # 3. Get Response
    print("--- Sending Request to Gemini for Influencer Ranking ---")
    ranking_result, duration = get_gemini_response(model, prompt_text)

    if ranking_result:
        print(f"--- Influencer Identification LLM call complete ({duration:.2f}s) ---")
        return ranking_result # Return the text received from the LLM
    else:
        print("--- Influencer Identification Failed (LLM Error) ---")
        return "Failed to get influencer ranking from the model due to API error."

In [66]:
# unique_award = df[df.pi_id == '269779708'].award_title.unique()
# len(unique_award), unique_award

In [67]:
# --- Function to Select Candidate PIs ---
def select_candidate_pis(
    df: pd.DataFrame,
    df_grouped: pd.DataFrame,
    embedder, # Your SentenceTransformer model
    criterion_type: str, # "topic" or "department"
    criterion_value: str, # The actual topic or department name
    top_k: int = 10 # Number of top candidates to select
) -> List[str]:
    """Selects candidate PI IDs based on topic or department."""
    print(f"Selecting top {top_k} candidates based on {criterion_type}: '{criterion_value}'...")
    candidate_ids = []

    if criterion_type == "topic":
        # Check necessary components
        if 'text_embedding' not in df_grouped.columns or embedder is None:
            print("Error: df_grouped['text_embedding'] and embedder required for topic search.")
            return []
        if df_grouped['text_embedding'].isnull().any():
             print("Warning: Null values found in 'text_embedding'. Filling with zeros for similarity calculation.")
             # Ensure embeddings are numpy arrays and handle potential Nones/NaNs before stacking
             embeddings_list = [emb if isinstance(emb, np.ndarray) else np.zeros(embedder.get_sentence_embedding_dimension()) for emb in df_grouped['text_embedding']]
             all_embeddings = np.stack(embeddings_list)
        else:
             all_embeddings = np.stack(df_grouped['text_embedding'].values)

        if not criterion_value or not isinstance(criterion_value, str):
            print("Error: Invalid topic criterion value provided.")
            return []

        # Compute embedding for the research topic
        try:
            topic_emb = embedder.encode(criterion_value)
        except Exception as e:
            print(f"Error encoding topic '{criterion_value}': {e}")
            return []

        # Calculate similarity
        try:
            similarities = cosine_similarity([topic_emb], all_embeddings)[0]
            # Get indices of top k PIs sorted by similarity
            # Handle cases where k is larger than the number of PIs
            num_candidates = min(top_k, len(df_grouped))
            top_indices = np.argsort(similarities)[::-1][:num_candidates]
            # Get the corresponding PI IDs
            candidate_ids = df_grouped.iloc[top_indices]['pi_id'].tolist()
        except Exception as e:
            print(f"Error calculating similarities or selecting top K: {e}")
            return []


    elif criterion_type == "department":
        # Ensure 'department' column exists
        if 'department' not in df.columns:
             print("Error: 'department' column not found in the main DataFrame.")
             return []
        if not criterion_value or not isinstance(criterion_value, str):
             print("Error: Invalid department criterion value provided.")
             return []

        # Filter the main df by department (case-insensitive partial match)
        try:
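            # Literal (non-regex) substring match, robust to NaN and case;
            # regex=False keeps characters such as '(' or '&' in department names from being parsed as a pattern
            dept_match_df = df[df['department'].str.contains(criterion_value, case=False, na=False, regex=False)]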
        except Exception as e:
            print(f"Error filtering departments: {e}")
            return []

        if dept_match_df.empty:
            print(f"No PIs found matching department pattern: '{criterion_value}'")
            return []

        # Get unique PI IDs from the matching departments
        unique_dept_pi_ids = dept_match_df['pi_id'].unique().tolist() # Convert to list

        # If more than top_k PIs, rank them by award count from df_grouped
        if len(unique_dept_pi_ids) > top_k:
            # Ensure df_grouped and 'award_count' are available
            if 'award_count' not in df_grouped.columns:
                print("Warning: 'award_count' not in df_grouped. Selecting first K PIs found.")
                candidate_ids = unique_dept_pi_ids[:top_k]
            else:
                # Filter df_grouped for these PIs and sort
                candidate_subset = df_grouped[df_grouped['pi_id'].isin(unique_dept_pi_ids)].copy() # Use .copy()
                # Handle potential NaNs in award_count before sorting
                candidate_subset['award_count'] = candidate_subset['award_count'].fillna(0)
                ranked_candidates = candidate_subset.sort_values(by='award_count', ascending=False)
                candidate_ids = ranked_candidates.head(top_k)['pi_id'].tolist()
                print(f"  (Found {len(unique_dept_pi_ids)} PIs, selected top {top_k} based on award count)")
        else:
            candidate_ids = unique_dept_pi_ids # Already a list

    else:
        print(f"Error: Invalid criterion_type '{criterion_type}'. Use 'topic' or 'department'.")
        return []

    if not candidate_ids:
        print("No candidate PI IDs were selected.")
    else:
        print(f"Selected candidate PI IDs: {candidate_ids}")
    return candidate_ids

In [68]:
# --- Orchestrator Function ---
def find_influencers_by_criterion(
    df: pd.DataFrame,
    df_grouped: pd.DataFrame,
    embedder,
    model: genai.GenerativeModel,
    criterion_type: str,
    criterion_value: str,
    top_k_candidates: int = 10
) -> Optional[str]:
    """Finds influencers based on a criterion using candidate selection and LLM ranking."""
    print(f"\n--- Starting Influencer Search by {criterion_type.capitalize()}: '{criterion_value}' ---")

    # 1. Select Candidate PIs
    candidate_pi_ids = select_candidate_pis(
        df, df_grouped, embedder, criterion_type, criterion_value, top_k=top_k_candidates
    )

    if not candidate_pi_ids:
        print("No candidates found for the specified criterion.")
        # Provide a more informative message depending on the criterion type
        if criterion_type == "department":
            return f"Could not find potential influencers: No PIs found matching department pattern '{criterion_value}'."
        elif criterion_type == "topic":
            return f"Could not find potential influencers: No PIs found sufficiently matching the topic '{criterion_value}'."
        else:
            return "Could not find potential influencers: No candidates selected."


    # 2. Analyze Selected Candidates using LLM
    print(f"\n--- Analyzing {len(candidate_pi_ids)} Selected Candidate(s) for Influence ---")
    influencer_ranking_result = identify_influencer_llm(df, model, candidate_pi_ids)

    return influencer_ranking_result

## Verification Function
#
This function recalculates the key metrics (project count, unique collaborator count, field diversity count) for a *single, specified PI* directly from the base `df` DataFrame. Its purpose is to **verify** that the numbers generated by the `format_influencer_data` function (which are fed to the LLM) are correct according to the raw data and the defined logic. This is useful for debugging and ensuring the LLM's ranking is based on accurate inputs.
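
The three metric definitions can be checked by hand on a tiny data set before trusting them on the full `df`. The sketch below is an illustration only: the records, the names, and the helper `toy_metrics` are made up and are not part of the notebook. It uses plain Python sets rather than pandas so the expected numbers are easy to follow, and it assumes the same rules described above (only PI/Co-PI roles count, collaborators exclude the target by name, fields are the union of program elements and references).

```python
# Hypothetical toy records; the field names mirror the columns used in the notebook.
records = [
    {"pi_id": "1", "pi_full_name": "Ada A", "role": "Principal Investigator",
     "award_title": "Project X", "program_element": "E1", "program_reference": "R1"},
    {"pi_id": "2", "pi_full_name": "Bo B", "role": "Co-Principal Investigator",
     "award_title": "Project X", "program_element": "E1", "program_reference": "R1"},
    {"pi_id": "1", "pi_full_name": "Ada A", "role": "Principal Investigator",
     "award_title": "Project Y", "program_element": "E2", "program_reference": None},
    {"pi_id": "3", "pi_full_name": "Cy C", "role": "Co-Principal Investigator",
     "award_title": "Project Y", "program_element": "E2", "program_reference": None},
    # This role is outside the PI/Co-PI filter, so the person is not a collaborator.
    {"pi_id": "4", "pi_full_name": "Di D", "role": "Former Principal Investigator",
     "award_title": "Project Y", "program_element": "E2", "program_reference": None},
]

ROLES = {"Principal Investigator", "Co-Principal Investigator"}


def toy_metrics(records, target_pi_id, target_pi_name):
    """Return (project_count, collaborator_count, field_count) for one PI."""
    # Rows where the target holds a PI or Co-PI role
    own = [r for r in records if r["pi_id"] == target_pi_id and r["role"] in ROLES]

    # 1. Project count: unique non-null award titles
    projects = {r["award_title"] for r in own if r["award_title"] is not None}

    # 2. Collaborators: other PI/Co-PI names on those projects, excluding the target
    involved = {
        r["pi_full_name"] for r in records
        if r["award_title"] in projects
        and r["role"] in ROLES
        and r["pi_full_name"] is not None
    }
    collaborators = involved - {target_pi_name}

    # 3. Field diversity: union of non-null program elements and references
    elements = {r["program_element"] for r in own if r["program_element"] is not None}
    references = {r["program_reference"] for r in own if r["program_reference"] is not None}
    fields = elements | references

    return len(projects), len(collaborators), len(fields)


print(toy_metrics(records, "1", "Ada A"))  # prints (2, 2, 3)
```

For "Ada A" the result is two projects (X and Y), two collaborators (Bo B and Cy C, with Di D dropped by the role filter), and three fields (E1, E2, R1). Because the target is excluded by name, a misspelled name would leave the PI counted as their own collaborator.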


In [None]:
# --- Verification Function ---
def verify_pi_metrics(df: pd.DataFrame, target_pi_id: str, target_pi_name: str):
    """
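    Calculates and prints key influencer metrics for a specific PI directly
    from the DataFrame for verification purposes.

    Args:
        df: The main DataFrame containing all award and PI records.
        target_pi_id: The pi_id of the researcher to verify.
        target_pi_name: The full name of the researcher to verify. It is removed from
            the collaborator set, so it must match the `pi_full_name` column exactly.

    Returns:
        A tuple (project_count, collaborator_count, field_count), or
        (None, None, None) if the PI has no PI/Co-PI records.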
    """
    print(f"\n--- Verification Check for {target_pi_name} (ID: {target_pi_id}) ---")

    # Filter data for the specific PI and relevant roles
    pi_data = df[
        (df['pi_id'] == target_pi_id) &
        (df['role'].isin(['Principal Investigator', 'Co-Principal Investigator']))
    ].copy() # Use .copy()

    if pi_data.empty:
        print(f"No data found for PI {target_pi_name} (ID: {target_pi_id}) with roles PI/Co-PI.")
        print("Metrics cannot be verified.")
        return None, None, None # Return None for all metrics

    # 1. Calculate Project Count
    # Count unique non-null award titles for this PI
    unique_projects = pi_data['award_title'].dropna().unique()
    project_count = len(unique_projects)
    print(f"  1. Project Count (Unique Awards as PI/Co-PI): {project_count}")
    # print(f"     Projects: {list(unique_projects)}") # Optional: list projects

    # 2. Calculate Unique Collaborators Count
    if project_count > 0:
        # Find all people (PIs/Co-PIs) associated with these specific projects
        collaborator_df = df[
            df['award_title'].isin(unique_projects) &
            df['role'].isin(['Principal Investigator', 'Co-Principal Investigator'])
        ]
        # Get all unique names involved in these projects
        all_involved_names = set(collaborator_df['pi_full_name'].dropna())
        # Remove the target PI's name to get collaborators
        unique_collaborators = all_involved_names - {target_pi_name}
        collaborator_count = len(unique_collaborators)
        print(f"  2. Unique Collaborators Count (Excluding Self): {collaborator_count}")
        # print(f"     Collaborators: {list(unique_collaborators)}") # Optional: list collaborators
    else:
        collaborator_count = 0
        print(f"  2. Unique Collaborators Count: {collaborator_count} (No projects found)")


    # 3. Calculate Field Diversity Count
    # Combine unique non-null program elements and references from the PI's data
    unique_elements = set(pi_data['program_element'].dropna().unique())
    unique_references = set(pi_data['program_reference'].dropna().unique())
    all_fields = unique_elements | unique_references
    field_count = len(all_fields)
    print(f"  3. Field Diversity Count (Unique Program Elements/References): {field_count}")
    # print(f"     Fields: {list(all_fields)}") # Optional: list fields

    print("--- Verification Check Complete ---")
    return project_count, collaborator_count, field_count
# --- End Verification Function ---

In [None]:
# --- Example Usage ---
try:
    # --- Find Influencers by DEPARTMENT ---
    dept_to_search = "Computer Science"
    dept_ranking_result_text = find_influencers_by_criterion(
        df, df_grouped, embedder, model,
        criterion_type="department",
        criterion_value=dept_to_search,
        top_k_candidates=20 # Number of candidates passed to the LLM for ranking
    )

    # Print the LLM's ranking result
    if dept_ranking_result_text:
        print(f"\n--- LLM Influencer Ranking Result for Department '{dept_to_search}' ---")
        print(dept_ranking_result_text)
    else:
        print(f"\n--- LLM Influencer Ranking Failed for Department '{dept_to_search}' ---")


    # --- Verification Step for a specific PI from the results ---
    # Manually pick the PI ID and name to check from the candidate list / ranking output
    pi_id_to_verify = '269779708' # ID of Prasad Calyam, taken from the ranking output
    pi_name_to_verify = 'Prasad Calyam' # Must match the pi_full_name value exactly

    print(f"\n<<< Running Verification for {pi_name_to_verify} >>>")
    # Call the verification function
    verified_proj, verified_collab, verified_field = verify_pi_metrics(
        df, pi_id_to_verify, pi_name_to_verify
    )

    # Compare with LLM output (manual comparison or parse LLM output if needed)
    if verified_proj is not None: # Check if verification ran successfully
        print(f"\nVerification Summary for {pi_name_to_verify}:")
        print(f" - Verified Project Count: {verified_proj}")
        print(f" - Verified Collaborator Count: {verified_collab}")
        print(f" - Verified Field Count: {verified_field}")
        print("\nCompare these numbers with the metrics reported for this PI in the LLM's ranking above.")
        # Metrics the LLM reported for Prasad Calyam in an example run (hardcoded; update after a new run)
        llm_proj, llm_collab, llm_field = 14, 25, 22
        print(f"LLM Reported Metrics (from example output): Projects={llm_proj}, Collaborators={llm_collab}, Fields={llm_field}")
        print("--------------------------------------------")
        match_proj = verified_proj == llm_proj
        match_collab = verified_collab == llm_collab
        match_field = verified_field == llm_field
        print(f"Project Count Match: {match_proj}")
        print(f"Collaborator Count Match: {match_collab}")
        print(f"Field Count Match: {match_field}")
        if match_proj and match_collab and match_field:
            print("--> Verification PASSED: All calculated metrics match the example LLM report.")
        else:
            print("--> Verification FAILED: One or more calculated metrics DO NOT match the example LLM report. Review data/logic.")
        print("--------------------------------------------")


except NameError as e:
    # Report the first missing prerequisite, checked in pipeline order
    import traceback
    prerequisites = {
        'df': "DataFrame 'df' is not defined. Data loading might have failed.",
        'df_grouped': "Grouped DataFrame 'df_grouped' is not defined. Grouping might have failed.",
        'embedder': "SentenceTransformer 'embedder' is not defined. Model loading might have failed.",
        'model': "Gemini 'model' is not defined. Model loading or configuration might have failed.",
    }
    for name, message in prerequisites.items():
        if name not in locals() and name not in globals():
            print(f"Error: {message}")
            break
    else:
        print(f"Error: Required variable not defined. Details: {e}")
    traceback.print_exc() # Print traceback for debugging NameErrors

except Exception as e:
    print(f"An unexpected error occurred during the main execution: {e}")
    import traceback
    traceback.print_exc() # Print detailed traceback


--- Starting Influencer Search by Department: 'Computer Science' ---
Selecting top 20 candidates based on department: 'Computer Science'...
  (Found 2938 PIs, selected top 20 based on award count)
Selected candidate PI IDs: ['269935164', '269779708', '269765937', '000235919', '270031750', '000207040', '270018850', '269779084', '269985475', '269680242', '270018579', '269852294', '000025066', '000265186', '000234696', '269904795', '269746833', '269823209', '269786775', '269749707']

--- Analyzing 20 Selected Candidate(s) for Influence ---

--- Starting Influencer Identification Process for PI IDs: ['269935164', '269779708', '269765937', '000235919', '270031750', '000207040', '270018850', '269779084', '269985475', '269680242', '270018579', '269852294', '000025066', '000265186', '000234696', '269904795', '269746833', '269823209', '269786775', '269749707'] ---
Formatting influencer data for PI IDs: ['269935164', '269779708', '269765937', '000235919', '270031750', '000207040', '270018850', 