# Modeling Uncertainty & Probabilistic Reasoning with Dimensions Data

This notebook shows how to model **uncertainty** in research trends and funding using probabilistic methods applied to Dimensions-style data.

1. **Modeling Uncertainty in Research Trends**
   - Estimate the probability that a topic’s publications increase next year.

2. **Bayesian Inference for Funding Allocation**
   - Estimate `P(Funding | Proposal Features)` and update as new data arrives.

3. **Markov Models for Research Topic Evolution**
   - Model topic transitions over time with a Markov chain.

4. **Handling Missing Data with Probabilistic Models**
   - Use model-based approaches (EM-style) to estimate missing citation counts.

5. **Risk Assessment in Collaborative Research**
   - Quantify risk of low impact for collaborative grants.

Assumed data sources:

- `pubs`: yearly publication counts by topic.
- `grants`: grant-level data with features & outcomes.
- `collabs`: subset or flag for collaborative projects.

In [None]:
import numpy as np
import pandas as pd

np.random.seed(42)

# --- Example structures you can map to real Dimensions exports ---
# The file names below are placeholders; point them at your own exports.

# 1) Yearly publication counts by topic
# columns: topic, year, n_pubs
pubs = pd.read_csv("topic_publications_by_year.csv")

# 2) Grants / proposals dataset
# columns (example):
#   grant_id, year, primary_topic, funding_awarded (0/1),
#   topic_ai_score, topic_data_repo_score, institution_tier,
#   citations_5yr, is_collaborative, n_institutions
grants = pd.read_csv("grants.csv")

print("Pubs columns:", pubs.columns.tolist())
print("Grants columns:", grants.columns.tolist())

## 1. Modeling Uncertainty in Research Trends

We want to quantify statements like:

> “What is the probability that publications on topic **T** will increase next year?”

Let:

- `n_t` = number of publications on topic T in year t.
- Define an “increase indicator”:
  - \( Y_t = 1 \) if \( n_t > n_{t-1} \), else 0.

We can estimate:

\[
P(\text{Increase next year for topic T} \mid \text{past trends})
\]

A simple frequentist estimate:

\[
\hat{p} = \frac{\text{number of years with an increase}}{\text{number of year-to-year transitions}}
\]

A simple Bayesian **Beta-Binomial** version:

- Prior: \( p \sim \mathrm{Beta}(\alpha_0, \beta_0) \)
- Data: \( k \) increases out of \( N \) transitions
- Posterior: \( \mathrm{Beta}(\alpha_0 + k, \beta_0 + N - k) \)
- Posterior mean: \( \mathbb{E}[p] = \frac{\alpha_0 + k}{\alpha_0 + \beta_0 + N} \)
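
As a worked instance of the update above, the sketch below uses made-up counts (7 increases in 10 transitions, chosen only for illustration) with a uniform Beta(1, 1) prior. It also approximates a 95% credible interval by sampling from the posterior with NumPy, which is one of several ways to obtain such an interval.

```python
import numpy as np

# Hypothetical counts, chosen only to illustrate the arithmetic
alpha0, beta0 = 1.0, 1.0   # uniform prior
k, N = 7, 10               # increases, year-to-year transitions

alpha_post = alpha0 + k            # 8.0
beta_post = beta0 + (N - k)        # 4.0
posterior_mean = alpha_post / (alpha_post + beta_post)   # 8 / 12

# Monte Carlo approximation of the central 95% credible interval
rng = np.random.default_rng(0)
draws = rng.beta(alpha_post, beta_post, size=100_000)
ci_low, ci_high = np.quantile(draws, [0.025, 0.975])

print(f"posterior mean = {posterior_mean:.3f}")
print(f"95% credible interval ~ ({ci_low:.2f}, {ci_high:.2f})")
```

The interval conveys how much the estimate could move with more data, which the posterior mean alone does not.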

In [None]:
def compute_topic_increase_posterior(pubs, topic, alpha0=1.0, beta0=1.0):
    """
    Given a 'pubs' DataFrame with columns (topic, year, n_pubs),
    compute the posterior for the probability that a topic's
    publications increase from year to year.
    """
    df = (
        pubs[pubs["topic"] == topic]
        .sort_values("year")
        .reset_index(drop=True)
    )
    if len(df) < 2:
        raise ValueError("Need at least 2 years of data for this topic.")

    n = df["n_pubs"].values
    increases = (n[1:] > n[:-1]).astype(int)

    k = increases.sum()             # number of years with an increase
    N = len(increases)              # number of year-to-year transitions

    alpha_post = alpha0 + k
    beta_post = beta0 + (N - k)
    posterior_mean = alpha_post / (alpha_post + beta_post)

    return {
        "topic": topic,
        "alpha_post": alpha_post,
        "beta_post": beta_post,
        "posterior_mean": posterior_mean,
        "k_increases": int(k),
        "N_transitions": int(N),
    }

# Example for a specific topic:
topic_example = pubs["topic"].iloc[0]
res = compute_topic_increase_posterior(pubs, topic_example, alpha0=1.0, beta0=1.0)
res

## 2. Bayesian Inference for Funding Allocation

We want \( P(\text{Funding} \mid \text{Proposal Features}) \).

**Bayes' theorem:**

\[
P(F \mid x) = \frac{P(x \mid F) P(F)}{P(x)}
\]

Where:
- \( F \) = event that a proposal is funded.
- \( x \) = feature vector (e.g., topic score bins, institution tier).

A practical approximation:

1. Discretize features into a small number of categories.
2. Estimate:
   - \( P(F) \) (baseline funding rate).
   - \( P(x \mid F) \) and \( P(x \mid \neg F) \) as empirical frequencies.
3. Use Bayes’ rule to compute \( P(F \mid x) \) for new proposals.
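
The three steps can be checked by hand on a single feature value. The numbers below are invented for illustration and do not come from any dataset.

```python
# Hypothetical inputs, for illustration only
p_F = 0.20             # baseline funding rate P(F)
p_x_given_F = 0.50     # P(x | F)
p_x_given_notF = 0.10  # P(x | not F)

# Law of total probability gives the denominator P(x)
p_x = p_x_given_F * p_F + p_x_given_notF * (1 - p_F)

# Bayes' rule
p_F_given_x = p_x_given_F * p_F / p_x
print(round(p_F_given_x, 4))  # 0.5556
```

Here the feature value is five times more common among funded proposals, which lifts the funding probability from the 20% baseline to about 56%.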

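We’ll show a simple **Naive Bayes-style** implementation over binned features. It treats each (AI-score bin, institution tier) combination as a single categorical value, so it estimates the joint likelihood directly rather than assuming the features are conditionally independent.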

In [None]:
# Create simple categorical features from grants:
gr = grants.copy()

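# Example binning. Scores are assumed to lie in [0, 1]; values above 1.0 fall
# outside the bins, and missing scores are filled with 0.0 and land in "low".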
gr["ai_bin"] = pd.cut(gr["topic_ai_score"].fillna(0.0),
                      bins=[-np.inf, 0.2, 0.5, 1.0],
                      labels=["low", "medium", "high"])

gr["inst_bin"] = gr["institution_tier"].fillna("unknown")

# We'll use a combined feature x = (ai_bin, inst_bin)
gr["feature_combo"] = gr["ai_bin"].astype(str) + "|" + gr["inst_bin"].astype(str)

# Funding indicator F
gr["F"] = gr["funding_awarded"].astype(int)

# Prior P(F)
p_F = gr["F"].mean()
p_notF = 1 - p_F

# Likelihood P(x | F) and P(x | not F) as empirical frequencies
counts_F = gr[gr["F"] == 1]["feature_combo"].value_counts()
counts_notF = gr[gr["F"] == 0]["feature_combo"].value_counts()
all_x = set(counts_F.index).union(set(counts_notF.index))

# Laplace smoothing
alpha = 1.0

likelihood_F = {}
likelihood_notF = {}

for x_val in all_x:
    likelihood_F[x_val] = (counts_F.get(x_val, 0) + alpha) / (counts_F.sum() + alpha * len(all_x))
    likelihood_notF[x_val] = (counts_notF.get(x_val, 0) + alpha) / (counts_notF.sum() + alpha * len(all_x))

def bayes_funding_prob(ai_bin, inst_bin):
    """
    Compute P(Funding | ai_bin, inst_bin) using the Naive Bayes-style model.
    """
    x_val = f"{ai_bin}|{inst_bin}"
    # if unseen combo, fall back to uniform likelihood
    if x_val not in all_x:
        like_F = 1 / len(all_x)
        like_notF = 1 / len(all_x)
    else:
        like_F = likelihood_F[x_val]
        like_notF = likelihood_notF[x_val]

    # Bayes
    numerator = like_F * p_F
    denominator = like_F * p_F + like_notF * p_notF
    return numerator / denominator if denominator > 0 else p_F

# Example: medium AI score, Tier1 institution
bayes_funding_prob("medium", "Tier1")
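
The cell above conditions on the joint `feature_combo`. For comparison, here is a minimal sketch of the factorized form that the term Naive Bayes usually refers to, where features are assumed conditionally independent given the funding outcome, so \( P(x \mid F) = \prod_j P(x_j \mid F) \). The toy DataFrame and the helper `naive_bayes_funding_prob` are made up for illustration and are not part of any Dimensions export.

```python
import pandas as pd

# Toy data, invented purely to make the sketch runnable
toy = pd.DataFrame({
    "ai_bin":   ["low", "low", "medium", "high", "high", "medium", "high", "low"],
    "inst_bin": ["Tier1", "Tier2", "Tier1", "Tier1", "Tier2", "Tier2", "Tier1", "Tier2"],
    "F":        [0, 0, 1, 1, 0, 0, 1, 0],
})
features = ["ai_bin", "inst_bin"]
alpha = 1.0  # Laplace smoothing

def naive_bayes_funding_prob(row, df=toy):
    """P(F=1 | row), assuming features are independent given F."""
    scores = {}
    for f_val in (0, 1):
        subset = df[df["F"] == f_val]
        score = len(subset) / len(df)          # prior P(F = f_val)
        for col in features:
            n_levels = df[col].nunique()
            match = (subset[col] == row[col]).sum()
            # smoothed per-feature likelihood P(x_j | F = f_val)
            score *= (match + alpha) / (len(subset) + alpha * n_levels)
        scores[f_val] = score
    return scores[1] / (scores[0] + scores[1])

p = naive_bayes_funding_prob({"ai_bin": "high", "inst_bin": "Tier1"})
print(f"P(F | high, Tier1) = {p:.3f}")
```

The factorized form needs far fewer observations per cell than the joint version, at the cost of ignoring interactions between features.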

## 3. Markov Models for Research Topic Evolution

We want to model how topics **transition** over time.

Example:

- States: high-level topics (e.g., AI/ML, DRKB, DMC, CB/SM, etc.).
- Transition: topic at year t → topic at year t+1.

We build a **Markov chain**:

- Transition matrix \( P_{ij} = P(\text{Topic}_{t+1} = j \mid \text{Topic}_t = i) \).
- Use it to:
  - Analyze how research focus shifts.
  - Simulate possible future trajectories.
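
One way to obtain genuine topic-to-topic transitions is to follow the same entity over time. The sketch below assumes a hypothetical table with one row per entity and year (`entity_id`, `year`, `topic`), which is not among the columns listed for `grants`; the values are invented for illustration. Consecutive rows of the same entity are paired and counted, then each row of the count matrix is normalized.

```python
import numpy as np
import pandas as pd

# Hypothetical per-entity topic history (e.g., a researcher's main topic per year)
hist = pd.DataFrame({
    "entity_id": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "year":      [2020, 2021, 2022, 2020, 2021, 2022, 2021, 2022],
    "topic":     ["AI/ML", "AI/ML", "DRKB", "DRKB", "AI/ML", "AI/ML", "DMC", "AI/ML"],
})

hist = hist.sort_values(["entity_id", "year"])
grp = hist.groupby("entity_id")
hist["next_topic"] = grp["topic"].shift(-1)
hist["next_year"] = grp["year"].shift(-1)

# Keep only pairs that are exactly one year apart
pairs = hist[hist["next_year"] == hist["year"] + 1]

states = sorted(hist["topic"].unique())
counts = (
    pd.crosstab(pairs["topic"], pairs["next_topic"])
    .reindex(index=states, columns=states, fill_value=0)
)

# Additive smoothing keeps rows with no observed transitions well defined
smoothed = counts + 0.01
P_hat = smoothed.div(smoothed.sum(axis=1), axis=0)
print(P_hat.round(3))
```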

In [None]:
# Assume grants has columns: grant_id, year, primary_topic

# Step 1: get unique topics and index them
topics = sorted(grants["primary_topic"].dropna().unique())
topic_to_idx = {t: i for i, t in enumerate(topics)}
n_topics = len(topics)

# Step 2: construct transitions from year to year
# For simplicity, aggregate at (topic, year) and look at dominant transitions
topic_year = (
    grants.groupby(["year", "primary_topic"])["grant_id"]
    .nunique()
    .reset_index(name="n_grants")
)

topic_year = topic_year.sort_values(["primary_topic", "year"])

# Build counts of transitions topic_t -> topic_{t+1}
transition_counts = np.zeros((n_topics, n_topics), dtype=float)

# Aggregated (topic, year) counts cannot reveal moves between topics, so this
# illustration only records persistence: a topic active in two consecutive
# years contributes one self-transition. To measure real shifts, track the
# topic of the same entity (e.g., a researcher or grant series) across years.

for t_idx, topic in enumerate(topics):
    topic_data = topic_year[topic_year["primary_topic"] == topic].sort_values("year")
    years = topic_data["year"].values
    # count a self-transition only when the two years are adjacent
    for y1, y2 in zip(years[:-1], years[1:]):
        if y2 - y1 == 1:
            transition_counts[t_idx, t_idx] += 1

# Add a little smoothing noise (e.g., topic diffusion)
epsilon = 0.01
transition_counts += epsilon

# Step 3: normalize rows to sum to 1 to get transition matrix
row_sums = transition_counts.sum(axis=1, keepdims=True)
P = transition_counts / row_sums

pd.DataFrame(P, index=topics, columns=topics).head()

In [None]:
def simulate_topic_chain(P, topics, start_topic, n_steps=5):
    topic_to_idx = {t: i for i, t in enumerate(topics)}
    idx_to_topic = {i: t for i, t in enumerate(topics)}

    current_idx = topic_to_idx[start_topic]
    trajectory = [start_topic]

    for _ in range(n_steps):
        probs = P[current_idx]
        next_idx = np.random.choice(len(topics), p=probs)
        trajectory.append(idx_to_topic[next_idx])
        current_idx = next_idx

    return trajectory

simulate_topic_chain(P, topics, start_topic=topics[0], n_steps=10)
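
Besides simulating individual trajectories, the transition matrix gives the distribution over topics after several steps directly: a start distribution \( \pi_0 \) evolves as \( \pi_n = \pi_0 P^n \). The sketch below uses a small made-up 3-state matrix (`P_demo`, not derived from any data) so that it runs on its own.

```python
import numpy as np

# Made-up row-stochastic matrix for three hypothetical topics
P_demo = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
])

pi0 = np.array([1.0, 0.0, 0.0])          # start in the first topic

# Distribution after 5 steps: pi0 @ P^5
pi5 = pi0 @ np.linalg.matrix_power(P_demo, 5)

# Long-run (stationary) distribution: left eigenvector for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P_demo.T)
v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = v / v.sum()

print("after 5 steps:", pi5.round(3))
print("stationary:   ", stationary.round(3))
```

With the near-diagonal `P` built from self-transitions above, the chain mixes very slowly, so the long-run distribution is only informative once real cross-topic transitions are estimated.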

## 4. Handling Missing Citation Data with Probabilistic Models

Citation data often has missing values (e.g., incomplete time windows, delayed indexing).

We can:

- Assume citations follow a mixture of "low-impact" and "high-impact" distributions.
- Use an **EM-style** approach:
  - **E-step:** estimate probability each grant is low/high impact.
  - **M-step:** update parameters (means/variances).
- Use the learned parameters to **impute missing citations**.

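For simplicity, we’ll:

- Fit a 2-component Gaussian mixture (low/high) to the observed citation counts.
- Impute missing values with the weight-averaged component means. With no other predictors this is essentially the mean of the observed citations; component-specific means only add information once other features indicate which component a grant belongs to.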
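
To make the E-step and M-step concrete, here is a minimal NumPy sketch of EM for a two-component one-dimensional Gaussian mixture. The data are synthetic draws (real citation counts are non-negative integers), and the function name `em_two_gaussians` is hypothetical; it is a teaching sketch, not a replacement for a library implementation.

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for a 2-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    x = np.asarray(x, dtype=float)
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])        # crude initialization at the extremes
    var = np.full(2, x.var() + 1e-6)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6

    return w, mu, var

# Synthetic "citations": a low-impact and a high-impact group
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(5, 2, 300), rng.normal(60, 10, 100)])

w, mu, var = em_two_gaussians(x)
print("weights:", w.round(2), "means:", mu.round(1))
```

After every M-step the weighted average of the component means equals the sample mean, which is why imputing with the overall mixture mean coincides with plain mean imputation.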

In [None]:
from sklearn.mixture import GaussianMixture

# Take a copy and isolate the citations column
cit_df = grants[["grant_id", "citations_5yr"]].copy()

# Mask missing
mask_missing = cit_df["citations_5yr"].isna()
observed = cit_df[~mask_missing]["citations_5yr"].values.reshape(-1, 1)

# Fit a 2-component Gaussian mixture to observed citations
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(observed)

# For missing ones, we can impute using the overall expected value under the mixture
overall_mean = (gmm.weights_ * gmm.means_.flatten()).sum()
print("Overall mixture mean:", overall_mean)

# Simple imputation: fill missing with mixture mean (or choose component-specific mean if you have other predictors)
cit_df.loc[mask_missing, "citations_5yr_imputed"] = overall_mean
cit_df.loc[~mask_missing, "citations_5yr_imputed"] = cit_df.loc[~mask_missing, "citations_5yr"]

cit_df.head()

## 5. Risk Assessment in Collaborative Research

Collaborative projects can have varying impact, and we’d like to quantify **risk**:

> “What is the probability that a collaborative project has **low impact**?”

Example:

1. Define impact categories based on citations:
   - Low: citations_5yr < 10
   - Medium: 10–50
   - High: > 50

2. Filter to **collaborative** projects (`is_collaborative == 1`).

3. Estimate:
   - \( P(\text{Low impact} \mid \text{collab type, topic, etc.}) \)
   - Use historical data + Bayesian smoothing (Dirichlet prior).
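
As a small worked instance of the Dirichlet smoothing in step 3, suppose one group has the made-up counts below (they are illustrative, not taken from any dataset). With a symmetric Dirichlet(1, 1, 1) prior the posterior is Dirichlet(prior + counts), and sampling from it gives both the posterior mean and a rough credible interval for the probability of low impact.

```python
import numpy as np

levels = ["low", "medium", "high"]
prior = np.array([1.0, 1.0, 1.0])        # symmetric Dirichlet prior
counts = np.array([6, 3, 1])             # hypothetical counts for one group

alpha_post = prior + counts              # Dirichlet posterior parameters
post_mean = alpha_post / alpha_post.sum()

# Sample the posterior to get an interval for P(low impact)
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha_post, size=50_000)
low_lo, low_hi = np.quantile(samples[:, 0], [0.025, 0.975])

for lvl, m in zip(levels, post_mean):
    print(f"P({lvl}) posterior mean = {m:.3f}")
print(f"95% interval for P(low) ~ ({low_lo:.2f}, {low_hi:.2f})")
```

For small groups the interval is wide, which is a useful warning that the point estimate of risk rests on few observations.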

In [None]:
collabs = grants[grants["is_collaborative"] == 1].copy()

# Define impact categories
def impact_category(cites):
    if cites < 10:
        return "low"
    elif cites <= 50:
        return "medium"
    else:
        return "high"

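# Drop grants with missing citations rather than counting them as zero,
# which would inflate the "low" category.
collabs = collabs.dropna(subset=["citations_5yr"]).copy()
collabs["impact_cat"] = collabs["citations_5yr"].apply(impact_category)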

# Example stratification: by primary_topic and n_institutions bin
collabs["inst_bin"] = pd.cut(collabs["n_institutions"].fillna(1),
                             bins=[0, 2, 5, np.inf],
                             labels=["small", "medium", "large"])

collabs["group_key"] = collabs["primary_topic"].astype(str) + "|" + collabs["inst_bin"].astype(str)

# Dirichlet prior alpha for each category
alpha_prior = {"low": 1.0, "medium": 1.0, "high": 1.0}
impact_levels = ["low", "medium", "high"]

def impact_risk_for_group(df_group):
    """Posterior mean P(low/medium/high) for one group under the Dirichlet prior."""
    counts = df_group["impact_cat"].value_counts().to_dict()

    alpha_post = {
        lvl: alpha_prior[lvl] + counts.get(lvl, 0)
        for lvl in impact_levels
    }
    alpha_sum = sum(alpha_post.values())

    # Posterior probabilities
    post_probs = {lvl: alpha_post[lvl] / alpha_sum for lvl in impact_levels}
    return post_probs

# Compute risk per group
group_risk = (
    collabs.groupby("group_key")
    .apply(impact_risk_for_group)
)

# Example: inspect one group’s risk (posterior P(low/medium/high))
group_risk.head()


## Summary

1. **Modeling Uncertainty in Trends**
   - Bayesian Beta-Binomial model for probability of publication increases per topic.

2. **Bayesian Inference for Funding**
   - Simple Naive Bayes-style model for \( P(\text{Funding} \mid \text{Features}) \).

3. **Markov Models for Topic Evolution**
   - Transition matrix and simulations of topic trajectories.

4. **Missing Data Handling**
   - EM-like imputation with a Gaussian mixture model for citations.

5. **Risk Assessment in Collaboration**
   - Dirichlet-smoothed probabilities of low/medium/high impact for collaborative groups.

These techniques help **quantify uncertainty**, build **probabilistic forecasts**, and support **decision-making** in research funding and portfolio analysis.
