# Core Machine Learning with Dimensions Grant Data

This notebook introduces **core ML concepts** using Dimensions-style grant data:

1. **Supervised Learning**
   - Classification: k-NN, Perceptron, SVM
   - Regression
   - Loss functions (0–1, L₁, L₂)
   - Overfitting & Regularization
   - Validation: holdout & k-fold cross-validation

2. **Unsupervised Learning**
   - Clustering and k-means

3. **Reinforcement Learning**
   - Markov Decision Process (MDP) intuition
   - Q-Learning with ε-greedy policy

4. **Tools**
   - `scikit-learn` models: `Perceptron`, `SVC`, `KNeighborsClassifier`, `GaussianNB`

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    classification_report
)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# For reproducibility
np.random.seed(42)

# Load your Dimensions-style data (adjust the path to your own export)
grants = pd.read_csv("grants.csv")

print("Columns:", grants.columns.tolist())

# Example: verify we have the needed columns (adapt if your names differ)
expected_cols = [
    "grant_id",
    "topic_ai_score",
    "topic_bioinfo_score",
    "topic_data_repo_score",
    "total_funding",
    "citations_5yr",
    "is_ai_ml"
]
missing = [c for c in expected_cols if c not in grants.columns]
if missing:
    print("WARNING: Missing columns:", missing)
else:
    print("All expected columns found.")
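If no export is at hand, a small synthetic table with the same column names can stand in while exploring the notebook. This is only a sketch: the name `grants_demo`, the row count, and every distribution below are assumptions chosen for illustration, and the values are random, not real Dimensions records.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the demo table is reproducible
n = 200                         # arbitrary demo size

ai = rng.uniform(0, 1, n)       # synthetic AI topic score in [0, 1]
grants_demo = pd.DataFrame({
    "grant_id": [f"demo-{i}" for i in range(n)],
    "topic_ai_score": ai,
    "topic_bioinfo_score": rng.uniform(0, 1, n),
    "topic_data_repo_score": rng.uniform(0, 1, n),
    # Heavy-tailed funding amounts (assumed shape, not fitted to real data)
    "total_funding": rng.lognormal(mean=12.0, sigma=1.0, size=n),
    "citations_5yr": rng.poisson(lam=5.0, size=n),
    # Label derived from the AI score purely so the classes are learnable
    "is_ai_ml": (ai > 0.5).astype(int),
})

print(grants_demo.head())
```

Assigning `grants = grants_demo` would let the remaining cells run end to end on the synthetic rows.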

## 1. Supervised Learning

Supervised learning learns a function that maps **inputs → outputs** given labeled examples.


- **Inputs (features):** topic scores + log-transformed funding
- **Outputs (labels):**
  - Classification: `is_ai_ml` (AI vs non-AI grant)
  - Regression: `citations_5yr` (continuous)

### Feature and Label Setup

In [None]:
features = ["topic_ai_score", "topic_bioinfo_score", "topic_data_repo_score", "total_funding"]

df = grants.copy()
df[features] = df[features].fillna(0.0)
df["total_funding"] = np.log1p(df["total_funding"])  # log(1 + x) compresses the heavy right tail of funding amounts
df["citations_5yr"] = df["citations_5yr"].fillna(0.0)
df["is_ai_ml"] = df["is_ai_ml"].fillna(0).astype(int)

X = df[features].values
y_class = df["is_ai_ml"].values
y_reg = df["citations_5yr"].values

X_train, X_test, y_train_class, y_test_class = train_test_split(
    X, y_class, test_size=0.2, random_state=42, stratify=y_class
)

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y_reg, test_size=0.2, random_state=42
)

### k-Nearest Neighbors (k-NN)

For a new grant, k-NN:

1. Finds the **k closest** training examples in feature space.
2. Predicts the **majority class** among those neighbors.
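The two steps above can be sketched by hand in NumPy. The toy points, labels, and the helper name `knn_predict` are made up for illustration; `KNeighborsClassifier` handles ties, distance metrics, and efficient search far more carefully.

```python
import numpy as np
from collections import Counter

# Six made-up 2-D points: three per class
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x_new, X_ref, y_ref, k=3):
    """Hypothetical helper: majority vote among the k nearest reference points."""
    dists = np.linalg.norm(X_ref - x_new, axis=1)   # step 1: Euclidean distances
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(y_ref[nearest].tolist())        # step 2: count neighbor labels
    return votes.most_common(1)[0][0]

pred_low = knn_predict(np.array([0.1, 0.1]), X_toy, y_toy)
pred_high = knn_predict(np.array([1.0, 0.95]), X_toy, y_toy)
print(pred_low, pred_high)
```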

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train_class)
y_pred_knn = knn.predict(X_test)
print("k-NN accuracy:", accuracy_score(y_test_class, y_pred_knn))

### Perceptron

A Perceptron:

- Computes a weighted sum of inputs + bias.
- Applies a step-like decision function.
- Adjusts weights via updates based on misclassified examples.
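As a rough illustration of those three bullets, here is a from-scratch training loop on a tiny, linearly separable toy set. The data and the learning rate `eta` are assumptions for the sketch; scikit-learn's `Perceptron` adds shuffling, stopping criteria, and optional penalties on top of this basic rule.

```python
import numpy as np

# Made-up, linearly separable points with labels in {0, 1}
X_toy = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y_toy = np.array([1, 1, 0, 0])

w = np.zeros(2)   # weights
b = 0.0           # bias
eta = 0.1         # learning rate (assumed value)

for epoch in range(10):
    errors = 0
    for x_i, y_i in zip(X_toy, y_toy):
        y_hat = int(w @ x_i + b > 0)     # weighted sum + bias, then step function
        update = eta * (y_i - y_hat)     # zero when the prediction is correct
        if update != 0:
            w += update * x_i            # move the boundary toward the mistake
            b += update
            errors += 1
    if errors == 0:                      # a full pass with no mistakes: stop
        break

preds = (X_toy @ w + b > 0).astype(int)
print("weights:", w, "bias:", b, "predictions:", preds)
```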

In [None]:
perc = Perceptron(max_iter=1000, eta0=0.1, random_state=42)
perc.fit(X_train, y_train_class)
y_pred_perc = perc.predict(X_test)
print("Perceptron accuracy:", accuracy_score(y_test_class, y_pred_perc))

### Support Vector Machine (SVM)

SVM:

- Finds the boundary that **maximizes the margin** between classes.
- Can handle non-linear separation via kernels (e.g. RBF).
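To make the kernel idea a little more concrete, the sketch below computes the RBF kernel, exp(−γ‖x − x′‖²), by hand and compares it with `sklearn.metrics.pairwise.rbf_kernel`. The three points and the value of `gamma_demo` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

A = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])  # made-up points
gamma_demo = 0.5                                     # assumed kernel width

# Pairwise squared Euclidean distances via broadcasting
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
K_manual = np.exp(-gamma_demo * sq_dists)

K_sklearn = rbf_kernel(A, A, gamma=gamma_demo)
print(np.round(K_manual, 4))
print("Matches scikit-learn:", np.allclose(K_manual, K_sklearn))
```

Nearby points get similarity close to 1 and distant points close to 0, which is what lets an RBF SVM draw curved boundaries in the original feature space.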

In [None]:
svm = SVC(kernel="rbf", C=1.0, gamma="scale", random_state=42)
svm.fit(X_train, y_train_class)
y_pred_svm = svm.predict(X_test)
print("SVM accuracy:", accuracy_score(y_test_class, y_pred_svm))

### Regression

Goal: predict **continuous values**, e.g. `citations_5yr`.

- Fit a simple linear regression model.
- Evaluate using:
  - **L₁ loss:** mean absolute error (MAE)
  - **L₂ loss:** mean squared error (MSE)
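For intuition about what fitting a linear model means, the following sketch solves an ordinary least-squares problem directly with `np.linalg.lstsq`. The data are made up and noise-free, so the fit recovers the generating line; real citation counts will not behave this cleanly.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                         # made-up target: slope 2, intercept 1

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef

print("intercept:", intercept, "slope:", slope)
```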

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train_reg, y_train_reg)
y_pred_reg = reg.predict(X_test_reg)

mae = mean_absolute_error(y_test_reg, y_pred_reg)  # L1
mse = mean_squared_error(y_test_reg, y_pred_reg)   # L2

print("Regression MAE (L1):", mae)
print("Regression MSE (L2):", mse)

### Loss Functions

- **0–1 Loss (classification):** 1 if prediction ≠ truth, else 0.
- **L₁ Loss (regression):** |y − ŷ|
- **L₂ Loss (regression):** (y − ŷ)²
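A small worked example of the three losses, averaged over a handful of made-up predictions (the arrays below are illustrative only):

```python
import numpy as np

# Classification: two of four predictions are wrong
y_true_cls = np.array([1, 0, 1, 1])
y_pred_cls = np.array([1, 1, 1, 0])
zero_one = np.mean(y_true_cls != y_pred_cls)        # (0 + 1 + 0 + 1) / 4

# Regression: errors of 0.5, 0.0 and -2.0
y_true_num = np.array([3.0, 5.0, 2.0])
y_pred_num = np.array([2.5, 5.0, 4.0])
l1 = np.mean(np.abs(y_true_num - y_pred_num))       # (0.5 + 0 + 2) / 3
l2 = np.mean((y_true_num - y_pred_num) ** 2)        # (0.25 + 0 + 4) / 3

print("0-1:", zero_one, "L1:", l1, "L2:", l2)
```

Note how the single large error dominates the L₂ value much more than the L₁ value, because squaring amplifies large residuals.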

In [None]:
# 0–1 loss for SVM classifier
zero_one_loss_svm = np.mean(y_test_class != y_pred_svm)
print("0-1 loss (SVM classification):", zero_one_loss_svm)

# L1 and L2 computed by hand; these should match the MAE and MSE reported by scikit-learn
l1_manual = np.mean(np.abs(y_test_reg - y_pred_reg))
l2_manual = np.mean((y_test_reg - y_pred_reg) ** 2)
print("Manual L1:", l1_manual, "Manual L2:", l2_manual)

### Overfitting & Regularization

- **Overfitting:** model fits training data too closely, fails to generalize.
- **Regularization:** penalizes overly complex models.

Example: Logistic Regression with **L₂ penalty**:
- Large `C` → weak regularization → higher risk of overfitting.
- Small `C` → stronger regularization.
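One way to see the effect of `C` is to compare the size of the learned coefficients. The sketch below uses synthetic data from `make_classification` (an assumption, not the grant data) and two arbitrary `C` values; stronger regularization should shrink the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, unrelated to the grants table
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

norms = {}
for C in (1000.0, 0.01):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(model.coef_))   # L2 norm of the weights

print("||w|| with weak regularization (C=1000):", norms[1000.0])
print("||w|| with strong regularization (C=0.01):", norms[0.01])
```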

In [None]:
from sklearn.metrics import roc_auc_score

# Weak regularization (large C)
logit_weak = LogisticRegression(C=1000.0, penalty="l2", max_iter=1000)
logit_weak.fit(X_train, y_train_class)
proba_weak = logit_weak.predict_proba(X_test)[:, 1]
auc_weak = roc_auc_score(y_test_class, proba_weak)

# Strong regularization (small C)
logit_strong = LogisticRegression(C=0.1, penalty="l2", max_iter=1000)
logit_strong.fit(X_train, y_train_class)
proba_strong = logit_strong.predict_proba(X_test)[:, 1]
auc_strong = roc_auc_score(y_test_class, proba_strong)

print("AUC with weak regularization (C=1000):", auc_weak)
print("AUC with strong regularization (C=0.1):", auc_strong)

### Validation

- **Holdout:** one-time train/test split (we’ve already used this).
- **k-Fold Cross-Validation:** rotate which fold is used for testing to get more stable performance estimates.
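The rotation of folds can be inspected directly. This sketch splits ten dummy indices into five folds and confirms that each index is held out exactly once; the sizes and seed are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10)   # stand-in for 10 samples
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(indices)):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    test_folds.append(test_idx)

# Every sample lands in exactly one test fold
held_out = sorted(np.concatenate(test_folds).tolist())
print("Each sample held out once:", held_out == list(range(10)))
```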

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

svm_cv = SVC(kernel="rbf", C=1.0, gamma="scale", random_state=42)

cv_scores = cross_val_score(svm_cv, X, y_class, cv=kf, scoring="accuracy")

print("5-fold CV accuracies:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())

## 2. Unsupervised Learning

Unsupervised learning finds **patterns without labels**.

### Clustering

- Groups similar data points.
- We’ll use **k-means** to cluster grants based on topic and funding features.
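Under the hood, k-means alternates between assigning points to the nearest center and moving each center to the mean of its points. The sketch below shows that loop on two synthetic, well-separated blobs. The data, the fixed iteration count, and the hand-picked starting centers are assumptions for illustration; `KMeans` uses smarter initialization and a convergence check.

```python
import numpy as np

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))   # synthetic cluster near (0, 0)
blob_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))   # synthetic cluster near (5, 5)
points = np.vstack([blob_a, blob_b])

centers = points[[0, 20]].copy()   # one starting center taken from each blob

for _ in range(10):
    # Assignment step: distance from every point to every center
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points
    centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print("centers:\n", np.round(centers, 2))
```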

In [None]:
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)

df["cluster"] = cluster_labels
print(df[["grant_id", "cluster"]].head())

# Inspect cluster centers
centers = pd.DataFrame(kmeans.cluster_centers_, columns=features)
centers

## 3. Reinforcement Learning

RL learns by **trial and error** via feedback (rewards).

**Markov Decision Process (MDP)**:
- **States (s):** describe environment (e.g., grant type).
- **Actions (a):** decisions (e.g., fund vs not fund).
- **Transition model:** how actions move between states.
- **Reward function:** immediate feedback.
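Q-learning ties these pieces together through one update rule, Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s′, ·) − Q(s, a)). A single update worked by hand, using made-up numbers for the reward and the next-state value:

```python
alpha = 0.1       # learning rate
gamma = 0.9       # discount factor

q_sa = 0.0        # current estimate Q(s, a)
reward = 1.0      # immediate reward r (illustrative)
best_next = 2.0   # max over actions of Q(s', a') (illustrative)

td_target = reward + gamma * best_next        # 1.0 + 0.9 * 2.0 = 2.8
q_sa_new = q_sa + alpha * (td_target - q_sa)  # 0.0 + 0.1 * 2.8 = 0.28

print("TD target:", td_target, "updated Q(s, a):", q_sa_new)
```

The estimate moves only a fraction `alpha` of the way toward the target, which is why many repeated visits are needed before the table settles.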


In [None]:
# Discretize AI score into 3 bins → states
ai_score = df["topic_ai_score"].fillna(0.0)
df["state"] = pd.qcut(ai_score, q=3, labels=False, duplicates="drop")

env_grants = df[["grant_id", "state", "citations_5yr"]].dropna().head(500)

n_states = int(env_grants["state"].max()) + 1  # size by the largest state index so Q[s] is always in bounds
n_actions = 2   # 0 = don't fund, 1 = fund

Q = np.zeros((n_states, n_actions))

alpha = 0.1   # learning rate
gamma = 0.9   # discount factor
epsilon = 0.1 # exploration rate

def choose_action(state):
    if np.random.rand() < epsilon:
        return np.random.randint(0, n_actions)  # explore
    return int(np.argmax(Q[state]))             # exploit

def reward_function(row, action):
    if action == 1:  # fund
        return float(row["citations_5yr"] / 10.0)
    else:            # don't fund
        return 0.0

# Simple episodic loop: each episode iterates over grants
for episode in range(50):
    for _, g in env_grants.iterrows():
        s = int(g["state"])
        a = choose_action(s)
        r = reward_function(g, a)
        s_next = s  # toy environment: the state never changes, so the bootstrap term reuses Q[s]

        best_next = np.max(Q[s_next])
        Q[s, a] = Q[s, a] + alpha * (r + gamma * best_next - Q[s, a])

print("Learned Q-table (state × action):")
print(Q)
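
A learned Q-table is usually turned into a policy by taking the highest-valued action in each state. The table below is a made-up stand-in (not output from the loop above) so the snippet runs on its own; `Q_demo` and `action_names` are illustrative names.

```python
import numpy as np

# Illustrative Q-table: rows are states, columns are actions (0 = don't fund, 1 = fund)
Q_demo = np.array([
    [0.0, 1.2],
    [0.5, 0.1],
    [0.0, 3.0],
])
action_names = ["don't fund", "fund"]

greedy_policy = Q_demo.argmax(axis=1)   # best action index per state
for state, action in enumerate(greedy_policy):
    print(f"state {state}: {action_names[action]}")
```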

## 4. Tools: scikit-learn Models

`scikit-learn` provides consistent APIs for:

- **Classification**
  - `KNeighborsClassifier`
  - `Perceptron`
  - `SVC` (Support Vector Machine)
  - Naive Bayes variants from `sklearn.naive_bayes` (`GaussianNB`, `MultinomialNB`, etc.)
- **Regression**
  - `LinearRegression`, `Ridge`, `Lasso`, etc.
- **Clustering**
  - `KMeans`, `DBSCAN`, etc.

Below is a small “cheat sheet” that instantiates and fits the main classifiers.

In [None]:
# k-NN
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train_class)

# Perceptron
perc = Perceptron(max_iter=1000, eta0=0.1, random_state=42).fit(X_train, y_train_class)

# SVM
svm = SVC(kernel="rbf", C=1.0, gamma="scale", random_state=42).fit(X_train, y_train_class)

# Naive Bayes (Gaussian - good for continuous features)
gnb = GaussianNB().fit(X_train, y_train_class)

print("k-NN prediction sample:", knn.predict(X_test[:5]))
print("Perceptron prediction sample:", perc.predict(X_test[:5]))
print("SVM prediction sample:", svm.predict(X_test[:5]))
print("GaussianNB prediction sample:", gnb.predict(X_test[:5]))

## Summary

This notebook covered:

1. **Supervised Learning**
   - k-NN, Perceptron, SVM for classification
   - Linear regression for citations
   - Loss functions (0–1, L₁, L₂)
   - Overfitting & regularization with logistic regression
   - Holdout and k-fold cross-validation

2. **Unsupervised Learning**
   - k-means clustering on grant features

3. **Reinforcement Learning**
   - Toy Q-learning example with discretized grant “states”

4. **Tools**
   - `scikit-learn` classifiers, regressors, and clustering models

This notebook forms your **core ML module** built directly on top of Dimensions grant data.  
You can now plug in different features (abstract embeddings, PI country, disease tags) or different labels (funding decisions, impact tiers) with the same patterns.