# CS50 AI Concepts Applied to Dimensions Research Data

This notebook maps core CS50 AI topics to **Dimensions-style** data:

1. **Search & Problem Solving**
   - State-space search (BFS, A*, etc.) on co-authorship and collaboration graphs.

2. **Knowledge Representation**
   - Propositional-style rules over publication metadata.

3. **Uncertainty (Probability, Bayes Nets)**
   - Probabilistic modeling of citations, funding, and topic trends.

4. **Machine Learning**
   - Classification, clustering, regression, and NLP on abstracts.

5. **Reinforcement Learning**
   - Simulated decision-making for institutional research strategies.

In [None]:
import numpy as np
import pandas as pd

from collections import defaultdict, deque

# For ML
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Fix NumPy's global seed so the random draws in the RL example are reproducible
np.random.seed(42)

# --- Toy "papers" dataset ---
papers = pd.DataFrame({
    "paper_id": ["P1", "P2", "P3", "P4"],
    "title": [
        "AI for infectious disease modeling",
        "Data repositories for immunology",
        "Collaborative networks in pandemic preparedness",
        "Applied ML for clinical trials"
    ],
    "primary_topic": ["AI", "DRKB", "Collab", "AI"],
    "author_a": ["A1", "A2", "A1", "A3"],
    "author_b": ["A2", "A3", "A3", "A4"],  # simple pair co-authorship
    "inst_a": ["NIAID", "UniversityX", "NIAID", "HospitalY"],
    "inst_b": ["UniversityX", "InstituteY", "UniversityX", "NIAID"],
    "citations_5yr": [120, 30, 80, 55],
    "is_basic_research": [1, 0, 1, 0]  # toy label: 1 = basic, 0 = applied
})

# --- Toy "grants" dataset ---
grants = pd.DataFrame({
    "grant_id": ["G1", "G2", "G3", "G4"],
    "topic_ai_score": [0.9, 0.2, 0.7, 0.4],
    "topic_repo_score": [0.3, 0.8, 0.1, 0.5],
    "total_funding": [2e6, 5e5, 1.2e6, 8e5],
    "citations_5yr": [200, 40, 150, 60],
    "funding_awarded": [1, 1, 0, 0]
})

# --- Toy "topic_trends" dataset ---
topic_trends = pd.DataFrame({
    "topic": ["AI", "AI", "AI", "DRKB", "DRKB", "DRKB"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "n_pubs": [50, 70, 95, 20, 25, 22]
})

papers

## 1. Search & Problem Solving on Collaboration Networks

From CS50 AI: **state-space search** (DFS, BFS, A*, etc.).

Application with Dimensions data:

- Nodes = researchers or institutions.
- Edges = co-authorship or co-funding links.
- Use **BFS** to find the shortest collaboration path.
- Use **A*** if edges have weights (e.g., inverse of collaboration strength).

**Example:** Find the shortest co-authorship path between two institutions via shared papers.

In [None]:
# Build an undirected graph of institutions based on co-authorship in papers
graph_inst = defaultdict(set)

for _, row in papers.iterrows():
    a = row["inst_a"]
    b = row["inst_b"]
    graph_inst[a].add(b)
    graph_inst[b].add(a)

In [None]:
def bfs_shortest_path(graph, start, goal):
    """
    Standard BFS to find the shortest path in an unweighted graph.
    """
    queue = deque([[start]])
    visited = set([start])

    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path

        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

bfs_shortest_path(graph_inst, "NIAID", "InstituteY")

#### A* on Weighted Graph (Strongest Influence Path)

In [None]:
# Co-authorship counts define edge weights. Each pair starts at 1.0,
# so k shared papers give a collaboration strength of k + 1.
edge_weights = defaultdict(lambda: 1.0)

for _, row in papers.iterrows():
    a = row["inst_a"]
    b = row["inst_b"]
    edge_weights[(a, b)] += 1
    edge_weights[(b, a)] += 1

# Weight = 1 / collaboration_strength (higher co-auth count → lower cost)
def weight(u, v):
    return 1.0 / edge_weights[(u, v)]

def astar_search(graph, start, goal, h=None):
    """
    Simple A* where h is a heuristic function h(node).
    For illustration, we set h = 0 (reduces to Dijkstra).
    """
    if h is None:
        h = lambda x: 0

    open_set = {start}
    came_from = {}

    g_score = defaultdict(lambda: float("inf"))
    g_score[start] = 0.0

    f_score = defaultdict(lambda: float("inf"))
    f_score[start] = h(start)

    while open_set:
        current = min(open_set, key=lambda n: f_score[n])
        if current == goal:
            # reconstruct path
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return list(reversed(path)), g_score[goal]

        open_set.remove(current)

        for neighbor in graph[current]:
            tentative_g = g_score[current] + weight(current, neighbor)
            if tentative_g < g_score[neighbor]:
                came_from[neighbor] = current
                g_score[neighbor] = tentative_g
                f_score[neighbor] = tentative_g + h(neighbor)
                open_set.add(neighbor)

    return None, float("inf")

astar_search(graph_inst, "NIAID", "InstituteY")
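For larger graphs, scanning the whole open set for the minimum on every step gets slow. Below is a minimal sketch of the same search using a binary heap from the standard library. It is self-contained, so the institution pairs are repeated inline, and it makes two assumptions that differ from the cell above: edge cost is `1 / (number of shared papers)` without the starting value of 1.0, and the heuristic is consistent (the default `h = 0` trivially is, which makes this Dijkstra's algorithm). The names `shortest_weighted_path` and `cost` are illustrative only.

```python
import heapq
from collections import defaultdict

# Institution pairs, one per toy paper (P1..P4).
pairs = [
    ("NIAID", "UniversityX"),
    ("UniversityX", "InstituteY"),
    ("NIAID", "UniversityX"),
    ("HospitalY", "NIAID"),
]

graph = defaultdict(set)
shared = defaultdict(int)
for a, b in pairs:
    graph[a].add(b)
    graph[b].add(a)
    shared[(a, b)] += 1
    shared[(b, a)] += 1

# Assumption of this sketch: cost = 1 / shared-paper count.
cost = {edge: 1.0 / count for edge, count in shared.items()}


def shortest_weighted_path(graph, cost, start, goal, h=lambda node: 0.0):
    """A* with a heap-based frontier; with h == 0 this is Dijkstra."""
    g_score = {start: 0.0}
    came_from = {}
    heap = [(h(start), start)]
    closed = set()
    while heap:
        _, current = heapq.heappop(heap)
        if current in closed:
            continue  # stale entry left behind by an earlier, worse push
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1], g_score[goal]
        closed.add(current)
        for neighbor in graph[current]:
            tentative = g_score[current] + cost[(current, neighbor)]
            if tentative < g_score.get(neighbor, float("inf")):
                g_score[neighbor] = tentative
                came_from[neighbor] = current
                heapq.heappush(heap, (tentative + h(neighbor), neighbor))
    return None, float("inf")


path, total_cost = shortest_weighted_path(graph, cost, "NIAID", "InstituteY")
print(path, total_cost)
```

Stale heap entries are skipped rather than removed, which keeps the code short at the price of a slightly larger heap.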

## 2. Knowledge Representation over Dimensions Metadata

From CS50 AI: **propositional logic** & **inference**.

Application:

- Represent facts like:
  - `works_on(A, X)` – researcher A works on topic X.
  - `coauthor(A, B)` – A and B co-authored a paper.
- Rule:
  - If A works on X, A co-authored with B, and B co-authored with C, then C may also work on X.

\[
works\_on(A,X) \land coauthor(A,B) \land coauthor(B,C) \rightarrow works\_on(C,X)
\]


In [None]:
# Step 1: derive "works_on(author, topic)" facts from primary_topic
works_on = set()
coauthor = set()

for _, row in papers.iterrows():
    topic = row["primary_topic"]
    works_on.add((row["author_a"], topic))
    works_on.add((row["author_b"], topic))
    coauthor.add((row["author_a"], row["author_b"]))
    coauthor.add((row["author_b"], row["author_a"]))

works_on, coauthor

In [None]:
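def infer_new_works_on(works_on, coauthor, max_iterations=3):
    """
    Repeatedly apply the rule
      works_on(A,X) & coauthor(A,B) & coauthor(B,C) -> works_on(C,X)
    until no new facts appear or max_iterations passes have run
    (so the result may stop short of the full closure on larger graphs).
    """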
    inferred = set(works_on)
    for _ in range(max_iterations):
        changed = False
        for (A, X) in list(inferred):
            # A works on X, find B such that coauthor(A,B)
            Bs = [b for (a, b) in coauthor if a == A]
            for B in Bs:
                # find C such that coauthor(B,C)
                Cs = [c for (b, c) in coauthor if b == B]
                for C in Cs:
                    if (C, X) not in inferred:
                        inferred.add((C, X))
                        changed = True
        if not changed:
            break
    return inferred

inferred_works_on = infer_new_works_on(works_on, coauthor)
sorted(inferred_works_on)
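It helps to see how far this rule reaches. The sketch below is a self-contained variant that runs the same rule to a true fixpoint using a worklist and an adjacency index, instead of a capped number of passes. The helper name `closure` and the inline tuple data are illustrative. The data mirrors the toy `papers` table.

```python
from collections import defaultdict

# (author_a, author_b, primary_topic) for the four toy papers.
toy_papers = [
    ("A1", "A2", "AI"),
    ("A2", "A3", "DRKB"),
    ("A1", "A3", "Collab"),
    ("A3", "A4", "AI"),
]

neighbors = defaultdict(set)   # co-authorship adjacency index
base_facts = set()             # works_on(author, topic) facts
for a, b, topic in toy_papers:
    neighbors[a].add(b)
    neighbors[b].add(a)
    base_facts.add((a, topic))
    base_facts.add((b, topic))


def closure(facts, neighbors):
    """Apply works_on(A,X) & coauthor(A,B) & coauthor(B,C) -> works_on(C,X)
    until no new fact can be derived."""
    inferred = set(facts)
    worklist = list(facts)
    while worklist:
        author, topic = worklist.pop()
        for b in neighbors[author]:
            for c in neighbors[b]:
                if (c, topic) not in inferred:
                    inferred.add((c, topic))
                    worklist.append((c, topic))
    return inferred


all_facts = closure(base_facts, neighbors)
new_facts = sorted(all_facts - base_facts)
print(len(base_facts), "base facts ->", len(all_facts), "after inference")
print(new_facts)
```

On this small connected graph the rule saturates: every author ends up linked to every topic. That is why the rule is best read as "may also work on", and why a real application would likely need a confidence score or a hop limit.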

## 3. Uncertainty & Probabilistic Modeling

From CS50 AI: **probability, Bayes Nets, Bayesian reasoning**.

Applications:

- Predict **whether a paper/grant is highly cited**.
- Predict **funding outcome** from proposal features.
- Model topic evolution or “hot topics”.

We’ll do a simple probabilistic model:

1. Binary label: `high_cited = 1 if citations_5yr >= 100 else 0`.
2. Features: topic scores, funding.
3. Train logistic regression → approximate `P(high_cited | features)`.

In [None]:
gr = grants.copy()
gr["high_cited"] = (gr["citations_5yr"] >= 100).astype(int)

features = ["topic_ai_score", "topic_repo_score", "total_funding"]
gr[features] = gr[features].fillna(0.0)
gr["total_funding"] = np.log1p(gr["total_funding"])

X = gr[features].values
y = gr["high_cited"].values

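# With only four rows, stratify so both classes land in the training half;
# LogisticRegression cannot fit on a single-class sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)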

logit = LogisticRegression()
logit.fit(X_train, y_train)
proba = logit.predict_proba(X_test)[:, 1]

print("Test probabilities of high-citation:", proba)
print("Test labels:", y_test)
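Logistic regression gives a conditional probability directly. The Bayesian-reasoning side of this section can also be shown with a hand-computed application of Bayes' rule. The sketch below is a simplification of the toy `grants` table under two assumptions that are not in the original: each grant is reduced to a single "dominant topic" (the larger of its two topic scores), and likelihoods use add-one smoothing so that four rows do not produce probabilities of exactly 0 or 1.

```python
# (dominant_topic, high_cited) per toy grant G1..G4, with
# high_cited = citations_5yr >= 100 (assumed reduction of the grants table).
observations = [("AI", 1), ("repo", 0), ("AI", 1), ("repo", 0)]
N_TOPICS = 2  # "AI" and "repo"


def posterior_high_given_topic(observations, topic, smoothing=1.0):
    """P(high_cited | topic) via Bayes' rule with smoothed likelihoods."""
    high = [t for t, label in observations if label == 1]
    low = [t for t, label in observations if label == 0]

    prior_high = len(high) / len(observations)
    prior_low = 1.0 - prior_high

    # Smoothed likelihoods P(topic | high) and P(topic | low).
    like_high = (high.count(topic) + smoothing) / (len(high) + N_TOPICS * smoothing)
    like_low = (low.count(topic) + smoothing) / (len(low) + N_TOPICS * smoothing)

    evidence = like_high * prior_high + like_low * prior_low
    return like_high * prior_high / evidence


p_ai = posterior_high_given_topic(observations, "AI")
p_repo = posterior_high_given_topic(observations, "repo")
print(f"P(high_cited | AI)   = {p_ai:.2f}")
print(f"P(high_cited | repo) = {p_repo:.2f}")
```

With so few rows the posterior is dominated by the smoothing term, so the numbers illustrate the mechanics rather than anything about real funding data.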

## 4. Machine Learning with Dimensions Data

From CS50 AI: supervised/unsupervised learning, neural networks, NLP.

Examples:

- **Classification**: label papers as basic vs applied research.
- **Clustering**: group publications into emerging research areas.
- **Regression**: predict citation counts.
- **NLP**: topic modeling on abstracts.

We’ll show:

1. Classification: basic vs applied (`is_basic_research`).
2. Clustering: unsupervised topic clusters from titles.
3. Topic modeling: LDA on titles (as a lightweight NLP example).

#### Classification (Basic vs Applied)

In [None]:
X_ml = pd.DataFrame({
    "citations_5yr": papers["citations_5yr"],
    "is_ai_topic": (papers["primary_topic"] == "AI").astype(int)
})
y_ml = papers["is_basic_research"].values

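# With only four rows, stratify so both classes land in the training half;
# LogisticRegression cannot fit on a single-class sample.
X_train, X_test, y_train, y_test = train_test_split(
    X_ml.values, y_ml, test_size=0.5, random_state=42, stratify=y_ml
)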

clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Classification accuracy (basic vs applied):", accuracy_score(y_test, y_pred))

#### Clustering & Topic Modeling with Titles

In [None]:
# TF-IDF on titles
vectorizer = TfidfVectorizer(max_features=100, stop_words="english")
X_text = vectorizer.fit_transform(papers["title"])

# k-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_text)
papers["cluster"] = clusters
papers[["paper_id", "title", "cluster"]]

In [None]:
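# LDA topic modeling on titles (tiny toy example, but the pattern holds).
# LDA models word counts, so fit it on raw term counts rather than TF-IDF weights.
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(max_features=100, stop_words="english")
X_counts = count_vectorizer.fit_transform(papers["title"])

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X_counts)

terms = np.array(count_vectorizer.get_feature_names_out())
for topic_idx, comp in enumerate(lda.components_):
    top_indices = comp.argsort()[-5:][::-1]
    print(f"Topic {topic_idx}: ", terms[top_indices])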
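Regression appears in this section's list of examples but has no cell of its own. As a minimal sketch, the block below fits a one-variable least-squares line predicting `citations_5yr` from `topic_ai_score`, using the four toy grant rows. It uses only the standard library so it runs on its own, and `fit_line` is an illustrative helper. With four points and no held-out data, the error reported is training error only.

```python
# Toy grant columns (G1..G4).
ai_score = [0.9, 0.2, 0.7, 0.4]
citations = [200, 40, 150, 60]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ai_score, citations)
predictions = [intercept + slope * x for x in ai_score]
residuals = [y - p for y, p in zip(citations, predictions)]
mae = sum(abs(r) for r in residuals) / len(residuals)

print(f"slope={slope:.1f}, intercept={intercept:.1f}, training MAE={mae:.1f}")
```

On real Dimensions exports, citation counts are heavily skewed, so a log transform of the target or a count model may be a better starting point than a plain linear fit.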

## 5. Reinforcement Learning for Research Strategy

From CS50 AI: **reinforcement learning, MDPs, Q-learning**.

Dimensions-style application:

- An institution chooses strategies (actions) over time:
  - e.g., allocate more funding to AI, DRKB, or Vaccines.
- State includes:
  - current topic portfolio.
  - recent impact metrics.
- Reward:
  - simulated citations, strategic alignment, or funding success.

We’ll build a toy MDP:

- **States**: 0 = low AI emphasis, 1 = medium, 2 = high.
- **Actions**: 0 = decrease AI emphasis, 1 = maintain, 2 = increase.
- **Reward**: higher AI emphasis yields higher expected impact in this toy.

#### Simple Q-Learning Example

In [None]:
n_states = 3
n_actions = 3

# Toy transition & reward model
def step(state, action):
    """
    Simple dynamics:
    - state ∈ {0,1,2} (AI emphasis level)
    - action ∈ {0=down,1=same,2=up}
    - transitions clip to [0,2]
    - reward depends on the next state (higher emphasis => higher reward), plus noise.
    """
    if action == 0:
        next_state = max(0, state - 1)
    elif action == 2:
        next_state = min(2, state + 1)
    else:
        next_state = state

    base_reward = [10, 30, 60][next_state]
    reward = base_reward + np.random.normal(0, 5)
    return next_state, reward

Q = np.zeros((n_states, n_actions))

alpha = 0.1   # learning rate
gamma = 0.9   # discount factor
epsilon = 0.1 # exploration

def choose_action(state):
    if np.random.rand() < epsilon:
        return np.random.randint(0, n_actions)
    return int(np.argmax(Q[state]))

# Q-learning loop
for episode in range(200):
    state = np.random.randint(0, n_states)  # random starting emphasis
    for t in range(10):  # short horizon
        action = choose_action(state)
        next_state, reward = step(state, action)
        best_next = np.max(Q[next_state])
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * best_next - Q[state, action])
        state = next_state

Q
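One way to sanity-check a learned Q-table is to compare it with the values the same MDP has when the reward noise is removed. The sketch below does this with value iteration, which is a different technique from Q-learning and is used here only as a reference. It re-declares the toy dynamics so that it is self-contained. The names `next_state_of` and `value_iteration` are illustrative.

```python
N_STATES, N_ACTIONS = 3, 3
GAMMA = 0.9
BASE_REWARD = [10.0, 30.0, 60.0]  # expected reward for landing in each state


def next_state_of(state, action):
    """Actions 0/1/2 move the emphasis level by -1/0/+1, clipped to [0, 2]."""
    return min(2, max(0, state + action - 1))


def value_iteration(tol=1e-9):
    """Iterate the Bellman optimality update until Q changes by less than tol."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    while True:
        max_change = 0.0
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                s2 = next_state_of(s, a)
                target = BASE_REWARD[s2] + GAMMA * max(q[s2])
                max_change = max(max_change, abs(target - q[s][a]))
                q[s][a] = target
        if max_change < tol:
            return q


q_star = value_iteration()
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q_star[s][a])
                 for s in range(N_STATES)]

for s in range(N_STATES):
    print(s, [round(v, 1) for v in q_star[s]], "-> action", greedy_policy[s])
```

In this noise-free model the best action in states 0 and 1 is to increase emphasis. In state 2, "maintain" and "increase" are equivalent because both stay at the highest level. A Q-table learned from 200 short episodes may not have reached these magnitudes, so comparing the argmax per row is more informative than comparing raw values.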

## 6. Example Research Questions & Methods Mapping

### Q1. “Which topics are most likely to receive funding in the next 5 years?”

- **Data**: past grants with topic scores, funding outcomes, years.
- **Tools**:
  - Supervised learning on `funding_awarded` (classification).
  - Probabilistic prediction: `P(funding | topic features, institution, year)`.
- **Approach**:
  - Logistic regression / random forest on grant features.
  - Calibrate predicted probabilities, then aggregate by topic.
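The aggregation step above can be sketched as a score-weighted average, assuming each grant carries a funding probability and soft topic scores. In this illustration the `p_funded` field is a stand-in: it holds the observed `funding_awarded` values from the toy table, where a real pipeline would put calibrated model probabilities. The function name `topic_funding_rate` is hypothetical.

```python
# Toy grants; p_funded stands in for a calibrated probability (assumption).
toy_grants = [
    {"grant_id": "G1", "topic_ai_score": 0.9, "topic_repo_score": 0.3, "p_funded": 1.0},
    {"grant_id": "G2", "topic_ai_score": 0.2, "topic_repo_score": 0.8, "p_funded": 1.0},
    {"grant_id": "G3", "topic_ai_score": 0.7, "topic_repo_score": 0.1, "p_funded": 0.0},
    {"grant_id": "G4", "topic_ai_score": 0.4, "topic_repo_score": 0.5, "p_funded": 0.0},
]


def topic_funding_rate(grants, score_key, prob_key="p_funded"):
    """Average funding probability, weighting each grant by its topic score."""
    total_weight = sum(g[score_key] for g in grants)
    weighted = sum(g[score_key] * g[prob_key] for g in grants)
    return weighted / total_weight


ai_rate = topic_funding_rate(toy_grants, "topic_ai_score")
repo_rate = topic_funding_rate(toy_grants, "topic_repo_score")
print(f"AI: {ai_rate:.3f}  repositories: {repo_rate:.3f}")
```

Weighting by topic score lets a grant that spans several topics contribute to each in proportion, rather than forcing a single label per grant.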

---

### Q2. “What is the most efficient collaboration network for high-impact research?”

- **Data**: co-authorship & citations (papers or grants).
- **Tools**:
  - Search algorithms (BFS, A*) on co-authorship graph.
  - Graph metrics (betweenness, eigenvector centrality).
- **Approach**:
  - Build author- or institution-level graph.
  - Use shortest paths & centrality to identify key intermediaries.
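Identifying intermediaries can be sketched with betweenness centrality. The block below is a plain-Python version of Brandes' algorithm for unweighted, undirected graphs, applied to the toy institution graph from Section 1 (repeated inline so the block is self-contained). A graph library would normally be used for this. The hand-rolled version only shows what the metric counts.

```python
from collections import defaultdict, deque

pairs = [
    ("NIAID", "UniversityX"),
    ("UniversityX", "InstituteY"),
    ("NIAID", "UniversityX"),
    ("HospitalY", "NIAID"),
]
graph = defaultdict(set)
for a, b in pairs:
    graph[a].add(b)
    graph[b].add(a)


def betweenness_centrality(graph):
    """Unnormalized betweenness: number of shortest paths through each node."""
    centrality = {node: 0.0 for node in graph}
    for source in graph:
        # BFS from source, counting shortest paths (sigma) to every node.
        order = []
        predecessors = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each undirected pair was visited from both endpoints.
    return {node: value / 2.0 for node, value in centrality.items()}


scores = betweenness_centrality(graph)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score}")
```

The toy graph is a simple chain, so the two middle institutions carry all the shortest paths and the two endpoints carry none.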

---

### Q3. “Can we predict the next hot research area in AI?”

- **Data**: yearly publication counts, topics from NLP on abstracts.
- **Tools**:
  - Unsupervised learning: clustering (k-means), topic modeling (LDA).
  - Time series & probabilistic trend models (e.g., Beta-Binomial trend).
- **Approach**:
  - Extract topic clusters from text.
  - Track cluster growth over time.
  - Rank topics by predicted probability of increase or growth rate.
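The ranking step can be sketched with the toy `topic_trends` counts, using compound annual growth rate as a simple stand-in for a fitted trend model. The data is repeated inline so the block runs on its own, and the helper name `annual_growth` is illustrative. Three yearly points per topic are far too few for a real forecast.

```python
# (topic, year, n_pubs), mirroring the toy topic_trends table.
trend_rows = [
    ("AI", 2019, 50), ("AI", 2020, 70), ("AI", 2021, 95),
    ("DRKB", 2019, 20), ("DRKB", 2020, 25), ("DRKB", 2021, 22),
]


def annual_growth(rows):
    """Compound annual growth rate per topic between first and last year."""
    by_topic = {}
    for topic, year, n_pubs in rows:
        by_topic.setdefault(topic, []).append((year, n_pubs))
    growth = {}
    for topic, series in by_topic.items():
        series.sort()
        (first_year, first_n), (last_year, last_n) = series[0], series[-1]
        years = last_year - first_year
        growth[topic] = (last_n / first_n) ** (1.0 / years) - 1.0
    return growth


growth = annual_growth(trend_rows)
ranking = sorted(growth.items(), key=lambda kv: kv[1], reverse=True)
for topic, rate in ranking:
    print(f"{topic}: {rate:+.1%} per year")
```

Endpoint-based growth ignores the middle of the series (DRKB peaks in 2020 here), which is one reason to prefer a fitted trend or a probabilistic model once more years are available.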

---

## Summary

This notebook ties together:

- **Search** → traversal & optimization on collaboration networks.
- **Knowledge Representation** → rule-based reasoning about topics & authors.
- **Uncertainty** → probabilistic models over citations, funding, and trends.
- **Machine Learning** → classifiers, clusters, regressors, and NLP models.
- **Reinforcement Learning** → simulated decision-making for research strategy.
