<a href="https://colab.research.google.com/github/lawrennd/economic-fitness/blob/main/reddit_user_word_complexity_2d.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Reddit user × word data: “fitness / complexity” analysis

This notebook:

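- Downloads a sample of Reddit users from BigQuery and aggregates their comments into per-user text.
- Builds a `user` $\times$ `word` matrix and its *support* (analogous to `country` $\times$ `product`).
- Computes the 1D Fitness–Complexity fixed point on the support as a baseline.
- Fits a rank-2 (vector) fitness/complexity extension and plots the resulting word coordinates.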
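
To make the matrix and its support concrete, here is a minimal sketch on a made-up 3-user, 4-word count table (the numbers are illustrative and not taken from the Reddit data). The support is $M_{uw} = 1$ wherever the count is positive; its row sums play the role of diversification and its column sums the role of ubiquity.

```python
import numpy as np

# Toy counts: rows are users, columns are words (illustrative values only)
X_toy = np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 0],
    [1, 1, 4, 2],
])

# Support: 1 where the user has used the word at least once
M_toy = (X_toy > 0).astype(int)

diversification = M_toy.sum(axis=1)  # number of distinct words per user
ubiquity = M_toy.sum(axis=0)         # number of distinct users per word

print(M_toy)
print("diversification:", diversification)
print("ubiquity:", ubiquity)
```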


### BigQuery Setup Instructions

To run the BigQuery cells in this notebook, you need a Google Cloud project with the BigQuery API enabled and authentication set up.

Here's a general guide:

1.  **Google Cloud Account**: If you don't have one, sign up for a Google Cloud account. You might be eligible for a free trial.
    *   [Sign up for Google Cloud](https://cloud.google.com/free)

2.  **Create/Select a Project**: In the [Google Cloud Console](https://console.cloud.google.com/), create a new project or select an existing one.
    *   Ensure that **billing is enabled** for your project, as BigQuery usage incurs costs (though often minimal for small queries, especially with the free tier).

3.  **Enable the BigQuery API**: For your selected project, ensure the BigQuery API is enabled.
    *   Go to the [API Library](https://console.cloud.google.com/apis/library) in the Cloud Console.
    *   Search for "BigQuery API" and enable it if it's not already enabled.

4.  **Authenticate Colab**: In Colab, `google.colab.auth.authenticate_user()` (called inside `load_or_query_reddit` when no cache is found) prompts you to log in with your Google account. This provides the credentials needed for BigQuery access.

    Alternatively, if you are working locally or need specific application-default credentials, you might use the `gcloud` CLI:
    ```bash
    gcloud auth application-default login
    ```

Once these steps are complete, you should be able to run the BigQuery cells successfully!

### 0) Setup

You’ll need BigQuery credentials (the Colab login prompt, or locally e.g. `gcloud auth application-default login`) and permission to query the public dataset `fh-bigquery.reddit_comments`.

If you don’t have BigQuery access, you can still run the later cells by loading a cached dataframe (see the caching cell below).
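
If neither BigQuery nor a cache file is available, one option is to hand-build a small dataframe with the same columns the query returns (`author`, `user_text`, `n_comments`). The sketch below does that with invented users and text; `make_toy_reddit_df` is a hypothetical helper that is not part of the notebook, and the result is only useful for checking that the later cells execute.

```python
import pandas as pd


def make_toy_reddit_df() -> pd.DataFrame:
    """Hypothetical stand-in for the query result; all users and text are invented."""
    comments_by_user = {
        "user_a": ["the model overfits the data", "try more data and a simpler model"],
        "user_b": [
            "plot the data before you fit a model",
            "the plot shows the outliers",
            "check the outliers",
        ],
        "user_c": ["a simpler model can fit the data"],
    }
    # One row per user: comments joined with newlines, plus a comment count
    rows = [
        {"author": author, "user_text": "\n".join(comments), "n_comments": len(comments)}
        for author, comments in comments_by_user.items()
    ]
    df = pd.DataFrame(rows, columns=["author", "user_text", "n_comments"])
    # Same ordering as the query: most active users first
    return df.sort_values("n_comments", ascending=False).reset_index(drop=True)


df_toy = make_toy_reddit_df()
print(df_toy[["author", "n_comments"]])
```

Writing this frame to the cache location with `df_toy.to_parquet(CACHE_PATH, index=False)` before running the query cell would make `load_or_query_reddit` pick it up. The thresholds used later (`min_df=3`, `min_user_mass`, `min_word_mass`) are set for real data and would likely need lowering for a frame this small.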


In [None]:
# Core
import os
from dataclasses import dataclass

import numpy as np
import pandas as pd

# Sparse matrices
import scipy.sparse as sp

# Text features
from sklearn.feature_extraction.text import CountVectorizer

# Plotting
import matplotlib.pyplot as plt


@dataclass(frozen=True)
class QueryConfig:
    target_subreddit: str = "datascience"
    start_suffix: str = "2015_01"
    end_suffix: str = "2015_03"
    max_authors: int = 200
    max_rows: int = 200_000
    min_comments_per_author: int = 20
    max_docs_per_author: int = 2000


cfg = QueryConfig()

CACHE_DIR = "data"
os.makedirs(CACHE_DIR, exist_ok=True)

CACHE_PATH = os.path.join(
    CACHE_DIR,
    f"reddit_{cfg.target_subreddit}_{cfg.start_suffix}_{cfg.end_suffix}_authors{cfg.max_authors}_rows{cfg.max_rows}.parquet",
)

print("Cache path:", CACHE_PATH)


In [None]:
# 1) Run the BigQuery query (or load cache)

def load_or_query_reddit(cfg: QueryConfig, cache_path: str) -> pd.DataFrame:
    if os.path.exists(cache_path):
        print("Loading cached dataframe…")
        df = pd.read_parquet(cache_path)
        return df

    print("No cache found. Querying BigQuery…")
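    # Imported here so the notebook still imports without GCP
    from google.cloud import bigquery
    import subprocess

    # In Colab, prompt for a Google login; elsewhere, fall back to
    # application-default credentials (`gcloud auth application-default login`).
    try:
        from google.colab import auth
        auth.authenticate_user()
    except ImportError:
        print("Not in Colab; relying on application-default credentials.")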

    # Automate project selection: Find a project with BigQuery enabled
    print("Searching for a valid BigQuery project...")
    client = None

    try:
        # Get list of projects using gcloud
        proc = subprocess.run(['gcloud', 'projects', 'list', '--format=value(projectId)'], capture_output=True, text=True)
        projects = proc.stdout.strip().split('\n')

        for pid in projects:
            if not pid: continue
            try:
                print(f"Trying project: {pid}...")
                c = bigquery.Client(project=pid)
                # Run a minimal test query to check permissions/API status
                c.query("SELECT 1").result()
                print(f"-> Success! Using project: {pid}")
                client = c
                break
            except Exception as e:
                print(f"   Skipping {pid} (BigQuery not accessible): {e}")

        if client is None:
            raise RuntimeError("Could not find any project with BigQuery enabled. Please enable the BigQuery API on at least one project in your Google Cloud Console.")

    except Exception as e:
        print(f"Automatic project setup failed: {e}")
        raise

    QUERY = f"""
DECLARE target_subreddit STRING DEFAULT "{cfg.target_subreddit}";
DECLARE start_suffix STRING DEFAULT "{cfg.start_suffix}";
DECLARE end_suffix   STRING DEFAULT "{cfg.end_suffix}";
DECLARE min_comments_per_author INT64 DEFAULT {cfg.min_comments_per_author};

-- LIMIT counts are interpolated as literals: BigQuery expects a literal or
-- query parameter there, not a script variable.
WITH base AS (
  SELECT author, body
  FROM `fh-bigquery.reddit_comments.*`
  WHERE _TABLE_SUFFIX BETWEEN start_suffix AND end_suffix
    AND subreddit = target_subreddit
    AND author IS NOT NULL
    AND author NOT IN ("[deleted]", "[removed]")
    AND body IS NOT NULL
  LIMIT {cfg.max_rows}
),
sampled_authors AS (
  SELECT author
  FROM base
  GROUP BY author
  HAVING COUNT(*) >= min_comments_per_author
  ORDER BY RAND()
  LIMIT {cfg.max_authors}
)
SELECT
  b.author,
  STRING_AGG(b.body, "\\n" ORDER BY LENGTH(b.body) DESC LIMIT {cfg.max_docs_per_author}) AS user_text,
  COUNT(*) AS n_comments
FROM base b
JOIN sampled_authors a
USING (author)
GROUP BY author
ORDER BY n_comments DESC
"""

    df = client.query(QUERY).to_dataframe()
    print("Saving cache…")
    df.to_parquet(cache_path, index=False)
    return df


df = load_or_query_reddit(cfg, CACHE_PATH)
print(df.head())
print("N users:", len(df))
print(df["n_comments"].describe())

In [None]:
# List available Google Cloud projects to find the Project ID
!gcloud projects list

In [None]:
# 2) Build user × word matrix
#
# In the paper's language, we will treat the *support* as M_{uw} = 1{X_{uw} > 0}.
# For the rank-2 extension to be identifiable, it's helpful to keep *counts* (not just binary).
# If you prefer pure support-only (presence/absence), set BINARY=True.

BINARY = False

vectorizer = CountVectorizer(
    lowercase=True,
    stop_words="english",
    min_df=3,          # ignore words used by <3 users
    max_features=5000, # keep it demo-friendly
    binary=BINARY,
)

X = vectorizer.fit_transform(df["user_text"].fillna(""))
X = X.astype(np.float64)

vocab = vectorizer.get_feature_names_out()
user_ids = df["author"].astype(str).tolist()

print("Users:", X.shape[0], "Vocab:", X.shape[1], "binary:", BINARY)

# Support mask (structural zeros off-support)
M = X.copy()
M.data = np.ones_like(M.data)

# Basic margins (analogues of diversification and ubiquity)
user_strength = np.asarray(X.sum(axis=1)).ravel()
word_strength = np.asarray(X.sum(axis=0)).ravel()

print("User strength:", pd.Series(user_strength).describe())
print("Word strength:", pd.Series(word_strength).describe())

# Filter out degenerate rows/cols (helps numerics)
min_user_mass = 5
min_word_mass = 5

keep_users = user_strength >= min_user_mass
keep_words = word_strength >= min_word_mass

X = X[keep_users][:, keep_words]
M = M[keep_users][:, keep_words]

user_ids = [u for u, ok in zip(user_ids, keep_users) if ok]
vocab = vocab[keep_words]

user_strength = np.asarray(X.sum(axis=1)).ravel()
word_strength = np.asarray(X.sum(axis=0)).ravel()

print("After filtering -> Users:", X.shape[0], "Vocab:", X.shape[1])


### 3) Baseline: 1D Pietronero Fitness–Complexity fixed point

This is the usual nonlinear rank-1 fixed point on the **support matrix** \(M\) (binary incidence). We’ll compute it as a scalar reference, then move to the rank-2 extension.
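
For reference, the iteration in `fitness_complexity_1d` can be written as follows. This is a sketch of the update order used in this notebook (fitness first, then complexity from the freshly normalised fitness), starting from \(F_u^{(0)} = Q_w^{(0)} = 1\) and with \(\langle\cdot\rangle\) denoting the mean over users or over words. The hats mark unnormalised intermediates and are notation introduced here. The more commonly stated form computes both updates from the previous iterate; its fixed-point equations are the same.

\[
\hat F_u^{(n)} = \sum_w M_{uw}\,Q_w^{(n-1)},
\qquad
F_u^{(n)} = \frac{\hat F_u^{(n)}}{\langle \hat F^{(n)}\rangle},
\]

\[
\hat Q_w^{(n)} = \left(\sum_u \frac{M_{uw}}{F_u^{(n)}}\right)^{-1},
\qquad
Q_w^{(n)} = \frac{\hat Q_w^{(n)}}{\langle \hat Q^{(n)}\rangle}.
\]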


In [None]:
def fitness_complexity_1d(M_bin: sp.spmatrix, n_iter: int = 200, tol: float = 1e-10):
    """Compute 1D Fitness–Complexity fixed point on binary incidence matrix M.

    M_bin: scipy sparse matrix (n_users × n_words), entries in {0,1}

    Returns:
      F (n_users,), Q (n_words,)
    """
    n_users, n_words = M_bin.shape
    F = np.ones(n_users, dtype=float)
    Q = np.ones(n_words, dtype=float)

    M_csr = M_bin.tocsr()
    M_csc = M_bin.tocsc()

    for it in range(n_iter):
        F_new = M_csr @ Q
        F_new = np.maximum(F_new, 1e-12)
        F_new = F_new / F_new.mean()

        invF = 1.0 / F_new
        denom = M_csc.T @ invF
        denom = np.maximum(denom, 1e-12)
        Q_new = 1.0 / denom
        Q_new = Q_new / Q_new.mean()

        delta = max(np.max(np.abs(F_new - F)), np.max(np.abs(Q_new - Q)))
        F, Q = F_new, Q_new
        if delta < tol:
            print(f"Converged in {it+1} iterations")
            break

    return F, Q


F1, Q1 = fitness_complexity_1d(M)

word_scores_1d = pd.Series(Q1, index=vocab).sort_values(ascending=False)
user_scores_1d = pd.Series(F1, index=user_ids).sort_values(ascending=False)

print("Top 15 words by 1D complexity:")
print(word_scores_1d.head(15))


### 4) 2D Pietronero-style extension (vector fitness / complexity)

Per `economic-fitness.tex`, the multidimensional extension can be written on the support as

\[
\mu_{uw} = M_{uw}\,a_u\,b_w\,\exp(u_u^\top v_w),\qquad u_u, v_w\in\mathbb{R}^2.
\]

But to keep the interpretation “everything lives in fitness and complexity”, we will **absorb** \(a,b\) into *augmented* vectors (paper’s reparameterisation):

\[
\widetilde F_u = \begin{bmatrix}\log a_u\\ 1\\ u_u\end{bmatrix},
\qquad
\widetilde Q_w = \begin{bmatrix}1\\ \log b_w\\ v_w\end{bmatrix},
\qquad
\mu_{uw} = M_{uw}\exp(\widetilde F_u^\top \widetilde Q_w).
\]
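
A quick numerical check of this reparameterisation, as a sketch on random toy values (none of the numbers below come from the notebook's data): the inner product of the augmented vectors is \(\log a_u + \log b_w + u_u^\top v_w\), so both forms should give the same \(\mu\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_words, k = 3, 4, 2

# Arbitrary positive scalings, latent coordinates and support (illustrative only)
a = rng.uniform(0.5, 2.0, size=n_users)
b = rng.uniform(0.5, 2.0, size=n_words)
U = rng.standard_normal((n_users, k))
V = rng.standard_normal((n_words, k))
M = (rng.random((n_users, n_words)) < 0.7).astype(float)

# Direct form: mu = M * a_u * b_w * exp(u_u . v_w)
mu_direct = M * a[:, None] * b[None, :] * np.exp(U @ V.T)

# Augmented form: Ftilde = [log a, 1, u], Qtilde = [1, log b, v]
Ftilde = np.column_stack([np.log(a), np.ones(n_users), U])
Qtilde = np.column_stack([np.ones(n_words), np.log(b), V])
mu_augmented = M * np.exp(Ftilde @ Qtilde.T)

print("augmented dimension:", Ftilde.shape[1])
print("forms agree:", np.allclose(mu_direct, mu_augmented))
```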

Algorithmically (same as the paper’s Sinkhorn/IPF story):

- For **fixed** \(u,v\), we solve the diagonal scalings \(a,b\) via **IPF/Sinkhorn** on the masked kernel \(K_{uw}=M_{uw}\exp(u_u^\top v_w)\).
- Then we update \(u,v\) to better explain the observed data on the support.
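
The first step can be sketched on a small dense example. The function `ipf_dense` below is a simplified illustration written for this note, not the sparse routine used later in the notebook; it assumes every row and column of the support is non-empty and that the target margins are compatible with the support pattern.

```python
import numpy as np


def ipf_dense(K, r, c, n_iter=500, tol=1e-12):
    """Find a, b so that diag(a) K diag(b) has row sums r and column sums c."""
    a = np.ones(K.shape[0])
    b = np.ones(K.shape[1])
    for _ in range(n_iter):
        a = r / (K @ b)      # rescale rows to match r
        b = c / (K.T @ a)    # rescale columns to match c
        # Columns match exactly after the b update, so only rows need checking
        if np.max(np.abs(a * (K @ b) - r)) < tol:
            break
    return a, b


rng = np.random.default_rng(0)

# Toy support (illustrative) and small latent coordinates
M = np.array([
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
])
U = 0.1 * rng.standard_normal((3, 2))
V = 0.1 * rng.standard_normal((3, 2))
K = M * np.exp(U @ V.T)          # masked kernel K_uw = M_uw exp(u_u . v_w)

r = np.full(3, 1.0 / 3.0)        # uniform row margins
c = np.full(3, 1.0 / 3.0)        # uniform column margins

a, b = ipf_dense(K, r, c)
mu = a[:, None] * K * b[None, :]
print("row sums:", mu.sum(axis=1))
print("col sums:", mu.sum(axis=0))
```

Entries off the support stay exactly zero, because the scaling only multiplies existing entries of \(K\).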

The “2D” part here refers to the dimension of the extra coordinates \(u_u\) and \(v_w\) (rank-2 association), not the total dimension of \(\widetilde F,\widetilde Q\) (which is \(2+2=4\)).


In [None]:
def sinkhorn_scale_sparse(K: sp.spmatrix, r: np.ndarray, c: np.ndarray, n_iter: int = 500, tol: float = 1e-9):
    """Diagonal scaling w = diag(A) K diag(B) to match marginals r,c.

    K: nonnegative sparse matrix
    r: desired row sums (n_rows,)
    c: desired col sums (n_cols,)

    Returns:
      A (n_rows,), B (n_cols,)
    """
    K_csr = K.tocsr()
    K_csc = K.tocsc()

    A = np.ones(K.shape[0], dtype=np.float64)
    B = np.ones(K.shape[1], dtype=np.float64)

    # Safety epsilons
    eps = 1e-32

    for it in range(n_iter):
        # A <- r / (K B)
        KB = K_csr @ B
        KB = np.maximum(KB, eps)
        A_new = r / KB

        # B <- c / (K^T A)
        KtA = K_csc.T @ A_new
        KtA = np.maximum(KtA, eps)
        B_new = c / KtA

        # Check marginal residuals occasionally
        if it % 25 == 0 or it == n_iter - 1:
            w_row = A_new * (K_csr @ B_new)
            w_col = B_new * (K_csc.T @ A_new)
            err = max(np.max(np.abs(w_row - r)), np.max(np.abs(w_col - c)))
            if err < tol:
                A, B = A_new, B_new
                break

        A, B = A_new, B_new

    return A, B


def build_kernel_from_uv(M_bin: sp.spmatrix, U: np.ndarray, V: np.ndarray) -> sp.coo_matrix:
    """Kernel K = M * exp(U V^T) evaluated only on support of M."""
    M_coo = M_bin.tocoo()
    dots = np.einsum("ik,ik->i", U[M_coo.row], V[M_coo.col])
    data = np.exp(np.clip(dots, -50, 50))
    return sp.coo_matrix((data, (M_coo.row, M_coo.col)), shape=M_bin.shape)


In [None]:
def fit_vector_fitness_complexity(
    X_counts: sp.spmatrix,
    M_bin: sp.spmatrix,
    k: int = 2,
    margins: str = "data",  # "data" or "uniform"
    n_outer: int = 60,
    sinkhorn_iter: int = 300,
    lr: float = 0.05,
    reg: float = 1e-3,
    seed: int = 0,
):
    """Fit the vector fitness/complexity extension from `economic-fitness.tex`.

    Base (rank-1 / separable) model on support:
      mu_uw = M_uw * a_u * b_w

    Rank-k extension on support:
      mu_uw = M_uw * a_u * b_w * exp(U_u^T V_w)

    Reparameterisation (paper): define augmented vectors in R^{k+2}
      Ftilde_u = [log a_u, 1, U_u]
      Qtilde_w = [1, log b_w, V_w]
    so that
      mu_uw = M_uw * exp(Ftilde_u^T Qtilde_w).

    Returns:
      U (n_users,k), V (n_words,k)
      a (n_users,), b (n_words,)
      Ftilde (n_users,k+2), Qtilde (n_words,k+2)
      mu_support (coo)
    """
    rng = np.random.default_rng(seed)
    n_users, n_words = X_counts.shape

    # Choose marginals (paper: data margins natural for counts; uniform for support-only)
    if margins == "data":
        r = np.asarray(X_counts.sum(axis=1)).ravel().astype(np.float64)
        c = np.asarray(X_counts.sum(axis=0)).ravel().astype(np.float64)
        total = r.sum()
        r = r / max(total, 1e-12)
        c = c / max(total, 1e-12)
    elif margins == "uniform":
        r = np.full(n_users, 1.0 / n_users, dtype=np.float64)
        c = np.full(n_words, 1.0 / n_words, dtype=np.float64)
    else:
        raise ValueError("margins must be 'data' or 'uniform'")

    # Initialize latent coordinates small
    U = 0.01 * rng.standard_normal((n_users, k))
    V = 0.01 * rng.standard_normal((n_words, k))

    X_coo = X_counts.tocoo()
    M_coo = M_bin.tocoo()

    for t in range(n_outer):
        # 1) given U,V: build kernel and solve diagonal scaling (Sinkhorn/IPF)
        K = build_kernel_from_uv(M_bin, U, V)
        a, b = sinkhorn_scale_sparse(K, r=r, c=c, n_iter=sinkhorn_iter, tol=1e-10)

        # mu on support (same sparsity as M)
        mu_data = a[M_coo.row] * b[M_coo.col] * K.data
        mu_csr = sp.csr_matrix((mu_data, (M_coo.row, M_coo.col)), shape=M_bin.shape)

        # mu at the nonzeros of X (0 off-support), floored so the log below is finite
        mu_x = np.asarray(mu_csr[X_coo.row, X_coo.col]).ravel()
        mu_x = np.maximum(mu_x, 1e-32)

        # 2) gradient step on U,V (Poisson log-likelihood on observed support)
        # d/d(dot) [x log mu - mu] = x - mu, accumulated per user and per word
        err = X_coo.data - mu_x
        dU = np.zeros_like(U)
        dV = np.zeros_like(V)
        np.add.at(dU, X_coo.row, err[:, None] * V[X_coo.col])
        np.add.at(dV, X_coo.col, err[:, None] * U[X_coo.row])

        dU -= reg * U
        dV -= reg * V

        U = U + lr * dU
        V = V + lr * dV

        if t % 10 == 0 or t == n_outer - 1:
            # Log-likelihood at the iterate used for this step (before the update)
            ll = float(np.sum(X_coo.data * np.log(mu_x) - mu_x))
            print(f"iter={t:03d}  ll={ll:.3e}")

    # Final scaling and augmented vectors
    K = build_kernel_from_uv(M_bin, U, V)
    a, b = sinkhorn_scale_sparse(K, r=r, c=c, n_iter=sinkhorn_iter, tol=1e-10)

    loga = np.log(np.maximum(a, 1e-300))
    logb = np.log(np.maximum(b, 1e-300))

    Ftilde = np.concatenate([loga[:, None], np.ones((n_users, 1)), U], axis=1)
    Qtilde = np.concatenate([np.ones((n_words, 1)), logb[:, None], V], axis=1)

    # Fitted mu on the support, reusing the final kernel K and M_coo from above
    mu_data = a[M_coo.row] * b[M_coo.col] * K.data
    mu_support = sp.coo_matrix((mu_data, (M_coo.row, M_coo.col)), shape=M_bin.shape)

    return U, V, a, b, Ftilde, Qtilde, mu_support


U2, V2, a2, b2, Ftilde2, Qtilde2, mu2 = fit_vector_fitness_complexity(
    X_counts=X,
    M_bin=M,
    k=2,
    margins="uniform",  # try "data" as well
    n_outer=60,
    sinkhorn_iter=300,
    lr=0.02,
    reg=1e-3,
    seed=0,
)

# For a 2D analysis we typically visualize the *latent* coordinates V2 (the last k dims of Qtilde)
word_xy = pd.DataFrame(V2, index=vocab, columns=["dim1", "dim2"])
word_xy["complexity_2d"] = np.linalg.norm(V2, axis=1)

# Full vector complexity objects (in R^{k+2})
word_Qtilde = pd.DataFrame(Qtilde2, index=vocab, columns=["const", "log_b", "v1", "v2"])

print("Top 15 words by latent (v) norm:")
print(word_xy.sort_values("complexity_2d", ascending=False).head(15))


In [None]:
# Plot word embedding (dim1, dim2)

plt.figure(figsize=(9, 7))
plt.scatter(word_xy["dim1"], word_xy["dim2"], s=10, alpha=0.4)
plt.axhline(0, color="black", linewidth=0.5)
plt.axvline(0, color="black", linewidth=0.5)
plt.title(f"Word embedding from rank-2 Pietronero extension (margins=uniform, binary={BINARY})")
plt.xlabel("dim1")
plt.ylabel("dim2")

# Label a few most complex words
top = word_xy.sort_values("complexity_2d", ascending=False).head(20)
for w, row in top.iterrows():
    plt.text(row["dim1"], row["dim2"], w, fontsize=8)

plt.show()


### Notes on “weighting” / marginals (what the paper says)

`economic-fitness.tex` is explicit that after you define a support \(\mathcal{S}\), the remaining scale degrees of freedom are fixed by choosing row/column marginals \(r_u, c_w\):

- For **quantitative flows** \(X\): the natural choice is data margins \(r_u=X_{u\cdot}\), \(c_w=X_{\cdot w}\).
- For **support-only** work: you can choose convenient margins (e.g. uniform) to remove size/volume information.

In this notebook you can switch that choice by setting `margins="uniform"` or `margins="data"` in the rank-2 fit.
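
The two choices can be compared on a toy count matrix, as a minimal sketch (the values are invented). As in `fit_vector_fitness_complexity`, both sets of margins are normalised to sum to one.

```python
import numpy as np

# Toy counts: rows are users, columns are words (illustrative values only)
X_toy = np.array([
    [8.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 5.0],
])
n_users, n_words = X_toy.shape
total = X_toy.sum()

# margins="data": proportional to observed row and column totals
r_data = X_toy.sum(axis=1) / total
c_data = X_toy.sum(axis=0) / total

# margins="uniform": every user and every word gets equal mass
r_unif = np.full(n_users, 1.0 / n_users)
c_unif = np.full(n_words, 1.0 / n_words)

print("data margins:    r =", r_data, " c =", c_data)
print("uniform margins: r =", r_unif, " c =", c_unif)
```

With data margins the heavy first user keeps half of the total mass; with uniform margins that size information is removed and only the pattern of association remains to be explained.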
