# Company Knowledge Graphs from Financial Text

## Context
This is where my background maps most directly to Numerai's needs. My Cambridge University Press textbook (524 pages, 147 figures) covers spectral methods, community detection, and statistical inference on networks — all applicable to extracting graph-based features from financial text.

## Why Knowledge Graphs for Numerai
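- **Largely orthogonal**: Graph topology features (centrality, clustering, community structure) are built from relationships between companies, unlike Barra-style factors (momentum, value, size), which describe one stock at a time. Numerai's neutralization should therefore leave much of their signal intact, though this needs to be checked empirically.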
- **Supply chain propagation**: Shocks propagate through company networks. A supplier's bad news affects its customers — graph features capture this.
- **Co-mention networks**: Companies mentioned together in news often share latent relationships that may help predict correlated returns.

## Pipeline
Financial text → NER (extract company names) → Co-mention graph → Graph features per node → Stock-level predictions

## Connection to My Research
- Textbook: "Hands-On Network Machine Learning" (Cambridge UP, 2025) — spectral graph theory, community detection, random graph models
- Graspologic: Co-authored open-source graph statistics library (adopted by Microsoft Research)
- Blue Halo: Led knowledge graph effort for object detection project

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from collections import defaultdict, Counter
import re

## 1. Sample Financial News Articles
Each article mentions one or more companies. Co-mentions reveal relationships.
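
As a minimal sketch of what the input could look like, the block below builds a handful of synthetic articles. The company names, dates, and field names (`date`, `text`) are all invented for illustration and are not taken from any real feed.

```python
# Synthetic articles about fictional companies (illustrative only, not real news).
articles = [
    {"date": "2024-02-17", "text": "Dune Freight raised shipping rates for Cobalt Ridge Mining."},
    {"date": "2024-01-05", "text": "Acme Motors signed a battery supply deal with Borealis Chips."},
    {"date": "2024-02-03", "text": "Cobalt Ridge Mining reported record quarterly output."},
    {"date": "2024-01-20", "text": "Borealis Chips warned of shortages affecting Acme Motors and Cobalt Ridge Mining."},
]

# ISO-formatted dates sort correctly as plain strings,
# which keeps later time-windowing simple.
articles.sort(key=lambda article: article["date"])

for article in articles:
    print(article["date"], "|", article["text"])
```

If a DataFrame is preferred, `pd.DataFrame(articles)` would turn this list into a table with `date` and `text` columns.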

In [None]:
# Synthetic financial news articles with company mentions
# In production: from Common Crawl, news APIs, or SEC filings

# TODO: implement
...

## 2. Named Entity Recognition (Company Extraction)
In production, we'd use spaCy NER or a fine-tuned model. Here we use dictionary matching.
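
Dictionary matching can be sketched as below. This is one possible body for an extraction function, not necessarily the one intended for this notebook: the `COMPANY_TO_TICKER` mapping, the fictional company names, and the choice to match case-insensitively are all assumptions.

```python
import re

# Hypothetical name -> ticker dictionary; the companies are fictional.
COMPANY_TO_TICKER = {
    "Acme Motors": "ACME",
    "Borealis Chips": "BORL",
    "Cobalt Ridge Mining": "CBLT",
}

# One compiled alternation; longer names go first so they win over shorter prefixes.
_NAMES = sorted(COMPANY_TO_TICKER, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b",
    flags=re.IGNORECASE,
)
_LOOKUP = {name.lower(): ticker for name, ticker in COMPANY_TO_TICKER.items()}


def extract_companies(text):
    """Return the sorted, de-duplicated tickers of companies named in `text`."""
    found = {_LOOKUP[match.group(1).lower()] for match in _PATTERN.finditer(text)}
    return sorted(found)


print(extract_companies(
    "Acme Motors signed a supply deal with Borealis Chips; ACME MOTORS shares rose."
))
# prints ['ACME', 'BORL']
```

Dictionary matching misses aliases and abbreviations that are not in the mapping, which is the main reason to move to a trained NER model later.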

In [None]:
# Company name -> ticker mapping (in production: use a comprehensive database)
def extract_companies(text):
    """Extract company mentions from text using dictionary matching."""
    ...


## 3. Build Co-Mention Graph
Edge weight = number of articles mentioning both companies. This is the simplest network construction.
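
The edge-weight rule above can be sketched with the standard library alone. The input `article_tickers` (one list of tickers per article) and the function name `comention_edges` are hypothetical; the sketch assumes each unordered pair counts at most once per article.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-article ticker lists, e.g. the output of an entity-extraction step.
article_tickers = [
    ["ACME", "BORL"],
    ["ACME", "BORL", "CBLT"],
    ["CBLT"],                   # a single mention contributes no edge
    ["BORL", "ACME", "ACME"],   # duplicates within one article count once
]


def comention_edges(article_tickers):
    """Count, for each unordered ticker pair, the articles mentioning both."""
    edges = Counter()
    for tickers in article_tickers:
        # sorted(set(...)) gives each pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(set(tickers)), 2):
            edges[(a, b)] += 1
    return edges


edges = comention_edges(article_tickers)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The counts load directly into NetworkX with `G.add_weighted_edges_from((a, b, w) for (a, b), w in edges.items())`.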

In [None]:
# Build co-mention graph

# TODO: implement
...

## 4. Compute Graph Features Per Node
These become stock-level features for prediction.
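
A minimal sketch of per-node feature extraction with NetworkX follows. The toy graph, the fictional tickers, and the helper name `node_features` are assumptions; the feature set shown (degree, weighted degree, betweenness, clustering) is a subset chosen for illustration.

```python
import networkx as nx

# Toy co-mention graph with fictional tickers; weights are co-mention counts.
G = nx.Graph()
G.add_weighted_edges_from([
    ("ACME", "BORL", 3),
    ("ACME", "CBLT", 1),
    ("BORL", "CBLT", 1),
    ("CBLT", "DUNE", 2),
])


def node_features(G):
    """Return {ticker: {feature_name: value}} for a weighted, undirected graph."""
    degree = dict(G.degree())
    strength = dict(G.degree(weight="weight"))   # sum of co-mention counts
    # Betweenness is left unweighted here: NetworkX treats weights as distances,
    # whereas co-mention counts are similarities, so they would need inverting first.
    betweenness = nx.betweenness_centrality(G)
    clustering = nx.clustering(G)
    return {
        node: {
            "degree": degree[node],
            "strength": strength[node],
            "betweenness": betweenness[node],
            "clustering": clustering[node],
        }
        for node in G.nodes
    }


features = node_features(G)
for ticker, values in sorted(features.items()):
    print(ticker, values)
```

PageRank can be added the same way with `nx.pagerank(G, weight="weight")`. The nested dictionary converts to one row per stock via `pd.DataFrame.from_dict(features, orient="index")`.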

In [None]:
# Compute graph features for each node

# TODO: implement
...

## 5. Visualize the Network

In [None]:
# Sector mapping for coloring

# TODO: implement
...

## 6. Community Detection
Identifying clusters of companies that are discussed together reveals latent industry relationships.
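
One way to sketch this is greedy modularity maximization from NetworkX. The choice of algorithm is an assumption (the notebook does not name one), and the toy graph of two tightly linked groups joined by a weak bridge uses fictional tickers.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two tightly co-mentioned groups joined by one weak bridge (CBLT-DUNE).
G = nx.Graph()
G.add_weighted_edges_from([
    ("ACME", "BORL", 3), ("ACME", "CBLT", 3), ("BORL", "CBLT", 3),
    ("DUNE", "ECHO", 3), ("DUNE", "FLUX", 3), ("ECHO", "FLUX", 3),
    ("CBLT", "DUNE", 1),
])

# Each community is a set of tickers; edge weights are the co-mention counts.
communities = [set(c) for c in greedy_modularity_communities(G, weight="weight")]

# Flatten to a per-stock label that can be used as a categorical feature.
community_of = {
    ticker: idx for idx, members in enumerate(communities) for ticker in members
}

for idx, members in enumerate(communities):
    print(idx, sorted(members))
```

Community labels are arbitrary integers, so downstream models should treat them as categories (or compare them against GICS sectors) rather than as ordered values.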

In [None]:
# Community detection

# TODO: implement
...

## 7. Dynamic Network Features (Temporal Evolution)
Track how a company's network position changes over time.
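
A rough sketch of windowed features: group articles by calendar month, build one graph per month, and difference a node-level measure between consecutive windows. Weighted degree is used here as a simple stand-in for PageRank. The input format, the monthly window, and the helper `build_window_graph` are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical (date, tickers) pairs with fictional tickers.
articles = [
    ("2024-01-05", ["ACME", "BORL"]),
    ("2024-01-20", ["ACME", "CBLT"]),
    ("2024-02-03", ["BORL", "CBLT"]),
    ("2024-02-17", ["BORL", "DUNE"]),
]


def build_window_graph(ticker_lists):
    """Build a weighted co-mention graph from the articles of one time window."""
    G = nx.Graph()
    for tickers in ticker_lists:
        for a, b in combinations(sorted(set(tickers)), 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)
    return G


# Bucket articles by 'YYYY-MM'.
windows = defaultdict(list)
for date, tickers in articles:
    windows[date[:7]].append(tickers)

months = sorted(windows)
strength = {
    month: dict(build_window_graph(windows[month]).degree(weight="weight"))
    for month in months
}

# Change between the last two windows; a stock absent from a window counts as 0.
prev, curr = strength[months[-2]], strength[months[-1]]
delta = {t: curr.get(t, 0) - prev.get(t, 0) for t in sorted(set(prev) | set(curr))}
print(delta)
```

Treating an absent stock as zero is a design choice; an alternative is to leave the delta missing so the model can distinguish "no coverage" from "coverage dropped".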

In [None]:
# Split articles by date and build time-windowed graphs
def build_graph(articles_df):
    ...


## 8. Spectral Features (From My Textbook)

The eigenvalues of the graph Laplacian encode global network structure. The Fiedler vector (the eigenvector associated with the second-smallest eigenvalue) gives a natural two-way partition of the network: the signs of its entries split the nodes into two groups.
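
As a small illustration, the sketch below forms the unnormalized Laplacian $L = D - A$ with NumPy and reads off the second-smallest eigenvalue and its eigenvector. The toy graph (two triangles joined by one edge) and the fictional tickers are assumptions; a real co-mention graph would normally use weighted edges.

```python
import numpy as np

nodes = ["ACME", "BORL", "CBLT", "DUNE", "ECHO", "FLUX"]
edges = [
    ("ACME", "BORL"), ("ACME", "CBLT"), ("BORL", "CBLT"),   # first triangle
    ("DUNE", "ECHO"), ("DUNE", "FLUX"), ("ECHO", "FLUX"),   # second triangle
    ("CBLT", "DUNE"),                                       # bridge
]

# Symmetric adjacency matrix A.
index = {name: i for i, name in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:
    A[index[u], index[v]] = A[index[v], index[u]] = 1.0

# Unnormalized Laplacian L = D - A, where D is the diagonal degree matrix.
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler_value = eigenvalues[1]        # algebraic connectivity (one number per graph)
fiedler_vector = eigenvectors[:, 1]   # one entry per node

# The sign of each entry assigns the node to one side of the partition.
side = {name: bool(fiedler_vector[index[name]] > 0) for name in nodes}
print(round(float(fiedler_value), 4), side)
```

The overall sign of an eigenvector is arbitrary, so only the grouping (which nodes share a sign) is meaningful, not which group is labeled `True`.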

In [None]:
# Spectral analysis of the co-mention graph

# TODO: implement
...

## Discussion & Interview Talking Points

### Why This Is My Strongest Angle for Numerai
- **My textbook**: 524-page Cambridge University Press book on network ML — I wrote the chapter on spectral methods
- **Graspologic**: Co-authored open-source graph statistics library (adopted by Microsoft Research)
- **Blue Halo**: Led knowledge graph effort for a defense project — same entity extraction -> graph -> features pipeline

### Feature Summary for Numerai Signals
Per-stock graph features (all per time window):
| Feature | What It Captures | Orthogonality |
|---------|-----------------|---------------|
| PageRank | Overall importance in news network | High |
| Betweenness Centrality | Bridges between industry clusters | Very High |
| Clustering Coefficient | How tightly connected a company's neighbors are | High |
| Community Assignment | Latent industry grouping from news (not standard GICS) | Very High |
| Fiedler Vector Entry | Position in spectral partition | Very High |
| Delta PageRank | Change in network importance over time | Very High |

### Why These Features Are Orthogonal
- **Barra factors** (momentum, value, size) are computed from each stock's own price history and fundamentals
- **Graph features** operate on inter-stock relationships derived from text
- These are structurally different — neutralization shouldn't remove them
- "Supply chain shocks propagate through networks — my graph features capture this propagation before it shows up in returns"
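
To make the neutralization argument concrete, here is a generic linear neutralization sketch: regress a feature on factor exposures and keep the residual. This is an assumed, textbook-style formulation and not a description of Numerai's exact procedure. The exposures and features are random placeholders, so the numbers say nothing about real graph features.

```python
import numpy as np


def neutralize(feature, exposures):
    """Return the residual of `feature` after least-squares projection onto `exposures`."""
    beta, *_ = np.linalg.lstsq(exposures, feature, rcond=None)
    return feature - exposures @ beta


rng = np.random.default_rng(0)
n_stocks = 200

# Placeholder exposures: an intercept plus three random "factor" columns.
exposures = np.column_stack([np.ones(n_stocks), rng.standard_normal((n_stocks, 3))])

# A feature unrelated to the factors, and one that is a pure mix of them.
unrelated_feature = rng.standard_normal(n_stocks)
factor_clone = exposures @ np.array([0.0, 1.0, -2.0, 0.5])

# Share of each feature's standard deviation that survives neutralization.
kept_unrelated = np.std(neutralize(unrelated_feature, exposures)) / np.std(unrelated_feature)
kept_clone = np.std(neutralize(factor_clone, exposures)) / np.std(factor_clone)
print(f"unrelated feature keeps {kept_unrelated:.2f}, factor clone keeps {kept_clone:.2e}")
```

Running the same ratio on real graph features against real factor exposures would be a direct way to test the orthogonality claims in the table above.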

### Extensions (TODO)
- [ ] Use spaCy NER instead of dictionary matching for entity extraction
- [ ] Build directed graphs (company A mentioned as supplier OF company B)
- [ ] Add edge sentiment (positive vs negative co-mention)
- [ ] Adjacency spectral embedding (Chapter 5 of my textbook)
- [ ] Test on real news data from GDELT or Tiingo API
- [ ] Dynamic stochastic block model for evolving communities