# Company Knowledge Graphs from Financial Text

## Context
This is where my background maps most directly to Numerai's needs. My Cambridge University Press textbook (524 pages, 147 figures) covers spectral methods, community detection, and statistical inference on networks — all applicable to extracting graph-based features from financial text.

## Why Knowledge Graphs for Numerai
- **Likely orthogonal to standard factors**: Graph topology features (centrality, clustering, community structure) measure relationships between companies, whereas Barra-style factors (momentum, value, size) are computed per stock. Numerai's neutralization should therefore leave most of their signal intact, though this needs to be verified empirically.
- **Supply chain propagation**: Shocks propagate through company networks. A supplier's bad news affects its customers, and graph features can capture this.
- **Co-mention networks**: Companies mentioned together in news often share latent relationships that may predict correlated returns.
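
To make the propagation idea concrete, here is a minimal sketch that is separate from the pipeline below. A shock on one company is passed to its neighbors in proportion to normalized edge weights. The toy graph, the `propagate_shock` helper, and the `decay` parameter are illustrative assumptions, not an established model.

```python
# Hypothetical toy graph: {node: {neighbor: edge weight}} (illustrative only)
toy_graph = {
    "SUPPLIER": {"CUSTOMER_A": 3, "CUSTOMER_B": 1},
    "CUSTOMER_A": {"SUPPLIER": 3},
    "CUSTOMER_B": {"SUPPLIER": 1},
    "UNRELATED": {},
}

def propagate_shock(graph, shocks, decay=0.5):
    """One diffusion step: each shocked node keeps its own shock and passes
    decay * (weight-normalized share) of it to each neighbor."""
    exposure = dict.fromkeys(graph, 0.0)
    for node, shock in shocks.items():
        exposure[node] += shock
        total = sum(graph[node].values())
        for neighbor, weight in graph[node].items():
            exposure[neighbor] += decay * shock * weight / total
    return exposure

exposure = propagate_shock(toy_graph, {"SUPPLIER": -1.0})
for name, value in exposure.items():
    print(f"{name}: {value:+.3f}")
```

The more heavily connected customer absorbs more of the supplier's shock, while the unrelated company is untouched. A per-stock "neighbor shock exposure" of this kind is one way a graph could feed a stock-level feature.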

## Pipeline
Financial text → NER (extract company names) → Co-mention graph → Graph features per node → Stock-level predictions

## Connection to My Research
- Textbook: "Hands-On Network Machine Learning" (Cambridge UP, 2025) — spectral graph theory, community detection, random graph models
- Graspologic: Co-authored open-source graph statistics library (adopted by Microsoft Research)
- Blue Halo: Led knowledge graph effort for object detection project

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from collections import defaultdict, Counter
import re

## 1. Sample Financial News Articles
Each article mentions one or more companies. Co-mentions reveal relationships.

In [None]:
# Synthetic financial news articles with company mentions
# In production: from Common Crawl, news APIs, or SEC filings
articles = [
    {"date": "2024-01-15", "text": "Apple and Microsoft are leading the AI race, with both companies integrating generative AI across their product lines.", "source": "reuters"},
    {"date": "2024-01-15", "text": "NVIDIA reported record revenue driven by demand from Microsoft Azure, Google Cloud, and Amazon Web Services.", "source": "bloomberg"},
    {"date": "2024-01-15", "text": "Tesla and BYD are engaged in an intense price war across the Chinese electric vehicle market.", "source": "reuters"},
    {"date": "2024-01-16", "text": "Apple's new Vision Pro headset uses chips manufactured by TSMC, continuing their long-standing partnership.", "source": "wsj"},
    {"date": "2024-01-16", "text": "Meta and Google are competing for advertising dominance, both leveraging AI recommendation engines.", "source": "ft"},
    {"date": "2024-01-16", "text": "Amazon and Microsoft cloud services face growing competition from Google Cloud Platform.", "source": "bloomberg"},
    {"date": "2024-01-16", "text": "JPMorgan and Goldman Sachs both reported strong Q4 earnings, beating analyst expectations.", "source": "reuters"},
    {"date": "2024-01-17", "text": "NVIDIA's new Blackwell GPU architecture is expected to power next-gen AI training at Google, Meta, and Microsoft data centers.", "source": "techcrunch"},
    {"date": "2024-01-17", "text": "Intel struggles to compete with NVIDIA and AMD in the data center GPU market.", "source": "bloomberg"},
    {"date": "2024-01-17", "text": "ExxonMobil and Chevron are both expanding their Permian Basin operations.", "source": "reuters"},
    {"date": "2024-01-17", "text": "Apple supplier Foxconn reports strong revenue growth, suggesting robust iPhone demand.", "source": "nikkei"},
    {"date": "2024-01-18", "text": "Samsung and TSMC are racing to develop 2nm chip manufacturing processes.", "source": "bloomberg"},
    {"date": "2024-01-18", "text": "Pfizer and Johnson & Johnson face headwinds as post-pandemic pharmaceutical demand normalizes.", "source": "reuters"},
    {"date": "2024-01-18", "text": "Google's antitrust case could reshape the tech landscape, impacting Apple's search deal revenue.", "source": "wsj"},
    {"date": "2024-01-18", "text": "Tesla supplier Panasonic expands battery production capacity for the Cybertruck.", "source": "nikkei"},
    {"date": "2024-01-19", "text": "Microsoft and OpenAI deepen their partnership with additional $10B investment.", "source": "ft"},
    {"date": "2024-01-19", "text": "Meta launches Llama 3, challenging OpenAI and Google in the open-source AI model space.", "source": "techcrunch"},
    {"date": "2024-01-19", "text": "Amazon's advertising business grows rapidly, eating into Google and Meta's market share.", "source": "bloomberg"},
    {"date": "2024-01-19", "text": "Boeing faces new safety scrutiny after incidents involving its 737 MAX aircraft.", "source": "reuters"},
    {"date": "2024-01-19", "text": "Walmart and Amazon compete for dominance in the grocery delivery space.", "source": "wsj"},
    {"date": "2024-01-20", "text": "Apple, Google, and Samsung dominate the global smartphone market.", "source": "idc"},
    {"date": "2024-01-20", "text": "Broadcom and NVIDIA are the biggest beneficiaries of the AI infrastructure boom.", "source": "bloomberg"},
    {"date": "2024-01-20", "text": "JPMorgan warns that commercial real estate risks could impact Goldman Sachs and Morgan Stanley.", "source": "ft"},
    {"date": "2024-01-20", "text": "Tesla's autonomous driving technology faces scrutiny from regulators at NHTSA.", "source": "reuters"},
    {"date": "2024-01-20", "text": "Microsoft and Google race to integrate AI into enterprise productivity software.", "source": "techcrunch"},
]

df = pd.DataFrame(articles)
df["date"] = pd.to_datetime(df["date"])
print(f"Loaded {len(df)} articles spanning {df['date'].min().date()} to {df['date'].max().date()}")

## 2. Named Entity Recognition (Company Extraction)
In production, we'd use spaCy NER or a fine-tuned model. Here we use dictionary matching.

In [None]:
# Company name -> ticker mapping (in production: use a comprehensive database)
COMPANY_TICKERS = {
    "Apple": "AAPL", "Microsoft": "MSFT", "Google": "GOOGL", "Alphabet": "GOOGL",
    "Amazon": "AMZN", "Meta": "META", "Facebook": "META", "NVIDIA": "NVDA",
    "Tesla": "TSLA", "JPMorgan": "JPM", "Goldman Sachs": "GS", "Morgan Stanley": "MS",
    "Intel": "INTC", "AMD": "AMD", "Samsung": "005930.KS", "TSMC": "TSM",
    "ExxonMobil": "XOM", "Chevron": "CVX", "Pfizer": "PFE",
    "Johnson & Johnson": "JNJ", "Boeing": "BA", "Walmart": "WMT",
    "Foxconn": "2317.TW", "Panasonic": "6752.T", "BYD": "BYDDY",
    "Broadcom": "AVGO", "OpenAI": "OPENAI",  # private, but included for network
}

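def extract_companies(text):
    """Extract company mentions from text using whole-word dictionary matching."""
    found = set()
    for company, ticker in COMPANY_TICKERS.items():
        # Word boundaries stop "Intel" matching "intelligence" or "Meta" matching "metadata"
        if re.search(rf"\b{re.escape(company)}\b", text, flags=re.IGNORECASE):
            found.add(ticker)
    return sorted(found)  # deduplicated, in a stable order across runs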

# Extract companies from each article
df["companies"] = df["text"].apply(extract_companies)
df["n_companies"] = df["companies"].apply(len)

print("Sample extractions:")
for _, row in df.head(5).iterrows():
    print(f"  [{', '.join(row['companies'])}] {row['text'][:80]}...")
print(f"\nArticles with 2+ companies: {(df['n_companies'] >= 2).sum()}/{len(df)}")

## 3. Build Co-Mention Graph
Edge weight = number of articles mentioning both companies. This is the simplest network construction.
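
As a small worked example of this weighting rule, the sketch below counts co-mentions on three made-up articles. The ticker lists and the `toy_counts` name are illustrative and not part of the notebook's data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical ticker lists, one per article (illustrative only)
toy_articles = [
    ["AAPL", "MSFT"],
    ["MSFT", "AAPL", "NVDA"],
    ["NVDA"],
]

toy_counts = Counter()
for tickers_in_article in toy_articles:
    # Sort and deduplicate so (A, B) and (B, A) count as the same undirected edge
    for pair in combinations(sorted(set(tickers_in_article)), 2):
        toy_counts[pair] += 1

print(dict(toy_counts))
# {('AAPL', 'MSFT'): 2, ('AAPL', 'NVDA'): 1, ('MSFT', 'NVDA'): 1}
```

The single-company article contributes no edge, which is why such companies have to be added to the graph as isolated nodes.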

In [None]:
# Build co-mention graph
G = nx.Graph()

# Count co-mentions
co_mentions = defaultdict(int)
for _, row in df.iterrows():
    companies = row["companies"]
    for i in range(len(companies)):
        for j in range(i + 1, len(companies)):
            pair = tuple(sorted([companies[i], companies[j]]))
            co_mentions[pair] += 1

# Add edges
for (c1, c2), weight in co_mentions.items():
    G.add_edge(c1, c2, weight=weight)

# Add isolated nodes (companies mentioned but never co-mentioned)
all_companies = set()
for companies in df["companies"]:
    all_companies.update(companies)
for c in all_companies:
    if c not in G:
        G.add_node(c)

print(f"Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
print(f"Density: {nx.density(G):.3f}")
print("\nTop co-mentions:")
for (c1, c2), w in sorted(co_mentions.items(), key=lambda x: -x[1])[:10]:
    print(f"  {c1} — {c2}: {w} articles")

## 4. Compute Graph Features Per Node
These become stock-level features for prediction.

In [None]:
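# Compute graph features for each node
pagerank = nx.pagerank(G, weight='weight')
degree_cent = nx.degree_centrality(G)

# Shortest-path measures read the weight as a distance, so invert the
# co-mention count: more co-mentions means two companies are closer
for u, v, data in G.edges(data=True):
    data['distance'] = 1.0 / data['weight']
betweenness = nx.betweenness_centrality(G, weight='distance')
closeness = nx.closeness_centrality(G)  # unweighted (hop-count) closeness
clustering = nx.clustering(G, weight='weight')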

# Weighted degree (total co-mention count)
weighted_degree = dict(G.degree(weight='weight'))

# Compile into DataFrame
graph_features = pd.DataFrame({
    "ticker": list(G.nodes()),
    "pagerank": [pagerank[n] for n in G.nodes()],
    "degree_centrality": [degree_cent[n] for n in G.nodes()],
    "betweenness_centrality": [betweenness[n] for n in G.nodes()],
    "closeness_centrality": [closeness[n] for n in G.nodes()],
    "clustering_coefficient": [clustering[n] for n in G.nodes()],
    "weighted_degree": [weighted_degree[n] for n in G.nodes()],
    "n_neighbors": [len(list(G.neighbors(n))) for n in G.nodes()],
}).sort_values("pagerank", ascending=False)

print("Graph Features Per Stock:")
print(graph_features.to_string(index=False, float_format="%.4f"))

## 5. Visualize the Network

In [None]:
# Sector mapping for coloring
sector_map = {
    "AAPL": "Tech", "MSFT": "Tech", "GOOGL": "Tech", "AMZN": "Tech", "META": "Tech",
    "NVDA": "Tech", "INTC": "Tech", "AMD": "Tech", "AVGO": "Tech", "OPENAI": "Tech",
    "TSM": "Semiconductor", "005930.KS": "Semiconductor",
    "TSLA": "Auto", "BYDDY": "Auto",
    "JPM": "Finance", "GS": "Finance", "MS": "Finance",
    "XOM": "Energy", "CVX": "Energy",
    "PFE": "Healthcare", "JNJ": "Healthcare",
    "BA": "Industrial", "WMT": "Retail",
    "2317.TW": "Manufacturing", "6752.T": "Manufacturing",
}
sector_colors = {
    "Tech": "#4285F4", "Semiconductor": "#34A853", "Auto": "#EA4335",
    "Finance": "#FBBC04", "Energy": "#FF6D01", "Healthcare": "#46BDC6",
    "Industrial": "#7B1FA2", "Retail": "#E91E63", "Manufacturing": "#795548",
}

fig, ax = plt.subplots(figsize=(16, 12))
pos = nx.spring_layout(G, k=2, seed=42, weight='weight')

# Node sizes proportional to PageRank
node_sizes = [pagerank[n] * 15000 + 200 for n in G.nodes()]
# Tickers missing from sector_map fall back to gray instead of being colored as Tech
node_colors = [sector_colors.get(sector_map.get(n), "#999999") for n in G.nodes()]

# Edge widths proportional to weight
edge_weights = [G[u][v]['weight'] for u, v in G.edges()]
max_weight = max(edge_weights) if edge_weights else 1

# Draw
nx.draw_networkx_edges(G, pos, width=[w / max_weight * 4 for w in edge_weights], 
                       alpha=0.3, edge_color='gray', ax=ax)
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color=node_colors,
                       edgecolors='black', linewidths=0.5, alpha=0.8, ax=ax)
nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold', ax=ax)

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=s) for s, c in sector_colors.items() if any(sector_map.get(n) == s for n in G.nodes())]
ax.legend(handles=legend_elements, loc='upper left', fontsize=9)
ax.set_title("Company Co-Mention Network from Financial News", fontsize=14, fontweight='bold')
ax.axis('off')
plt.tight_layout()
plt.show()

## 6. Community Detection
Identifying clusters of companies that are discussed together reveals latent industry relationships.
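
The cell below uses greedy modularity maximization. As a rough sketch of what modularity rewards, the helper here computes Q for a given partition of a toy graph. The `modularity` function and the two-triangle graph are illustrative assumptions, written with the standard formula at resolution 1.

```python
def modularity(edges, communities):
    """Newman modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ] for an undirected
    weighted graph. edges: {(u, v): weight}; communities: list of node sets."""
    m = sum(edges.values())  # total edge weight
    q = 0.0
    for community in communities:
        # L_c: weight of edges with both endpoints inside the community
        internal = sum(w for (u, v), w in edges.items()
                       if u in community and v in community)
        # d_c: total weighted degree of the community's nodes
        degree = sum(w * ((u in community) + (v in community))
                     for (u, v), w in edges.items())
        q += internal / m - (degree / (2 * m)) ** 2
    return q

# Hypothetical toy graph: two triangles joined by a single bridge edge
toy_edges = {
    ("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 1,
    ("D", "E"): 1, ("E", "F"): 1, ("D", "F"): 1,
    ("C", "D"): 1,
}
split = [{"A", "B", "C"}, {"D", "E", "F"}]
merged = [{"A", "B", "C", "D", "E", "F"}]

print(f"Q(two triangles) = {modularity(toy_edges, split):.3f}")  # 0.357
print(f"Q(one community) = {modularity(toy_edges, merged):.3f}")  # 0.000
```

Splitting along the bridge scores higher than lumping everything together, which is the kind of partition a modularity-based method tends to favor.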

In [None]:
# Community detection
from networkx.algorithms.community import greedy_modularity_communities

communities = list(greedy_modularity_communities(G, weight='weight'))
print(f"Detected {len(communities)} communities:\n")
for i, comm in enumerate(communities):
    sectors = [sector_map.get(n, "?") for n in comm]
    print(f"  Community {i}: {sorted(comm)}")
    print(f"    Sectors: {Counter(sectors).most_common()}\n")

# Add community assignment as a feature
community_map = {}
for i, comm in enumerate(communities):
    for node in comm:
        community_map[node] = i

graph_features["community"] = graph_features["ticker"].map(community_map)

# Visualize communities
fig, ax = plt.subplots(figsize=(16, 12))
community_colors = plt.cm.Set3(np.linspace(0, 1, len(communities)))
node_colors_comm = [community_colors[community_map.get(n, 0)] for n in G.nodes()]

nx.draw_networkx_edges(G, pos, width=[w / max_weight * 4 for w in edge_weights],
                       alpha=0.3, edge_color='gray', ax=ax)
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color=node_colors_comm,
                       edgecolors='black', linewidths=1, alpha=0.8, ax=ax)
nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold', ax=ax)
ax.set_title("Community Structure in Co-Mention Network", fontsize=14, fontweight='bold')
ax.axis('off')
plt.tight_layout()
plt.show()

## 7. Dynamic Network Features (Temporal Evolution)
Track how a company's network position changes over time.

In [None]:
# Split articles by date and build time-windowed graphs
dates = sorted(df["date"].unique())
mid = len(dates) // 2
window1 = df[df["date"] <= dates[mid]]
window2 = df[df["date"] > dates[mid]]

def build_graph(articles_df):
    g = nx.Graph()
    co_m = defaultdict(int)
    for _, row in articles_df.iterrows():
        companies = row["companies"]
        for i in range(len(companies)):
            for j in range(i + 1, len(companies)):
                pair = tuple(sorted([companies[i], companies[j]]))
                co_m[pair] += 1
    for (c1, c2), w in co_m.items():
        g.add_edge(c1, c2, weight=w)
    return g

G1 = build_graph(window1)
G2 = build_graph(window2)

# Compare PageRank across windows
pr1 = nx.pagerank(G1, weight='weight')
pr2 = nx.pagerank(G2, weight='weight')

# Compute delta
all_nodes = sorted(set(list(pr1.keys()) + list(pr2.keys())))
delta_pr = {n: pr2.get(n, 0) - pr1.get(n, 0) for n in all_nodes}

# Plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Top movers
sorted_delta = sorted(delta_pr.items(), key=lambda x: abs(x[1]), reverse=True)[:15]
tickers = [x[0] for x in sorted_delta]
deltas = [x[1] for x in sorted_delta]
colors = ['green' if d > 0 else 'red' for d in deltas]

axes[0].barh(tickers[::-1], deltas[::-1], color=colors[::-1], edgecolor='black', alpha=0.7)
axes[0].axvline(x=0, color='black', linewidth=0.5)
axes[0].set_xlabel("PageRank Change")
axes[0].set_title("PageRank Change: Window 1 -> Window 2")

# Side-by-side PageRank
# Rank over the union of both windows so nodes that only appear in window 2 are included
top_nodes = sorted(all_nodes, key=lambda x: pr1.get(x, 0) + pr2.get(x, 0), reverse=True)[:12]
x = np.arange(len(top_nodes))
width = 0.35
axes[1].bar(x - width/2, [pr1.get(n, 0) for n in top_nodes], width, label="Window 1", color="steelblue")
axes[1].bar(x + width/2, [pr2.get(n, 0) for n in top_nodes], width, label="Window 2", color="coral")
axes[1].set_xticks(x)
axes[1].set_xticklabels(top_nodes, rotation=45, ha='right')
axes[1].set_ylabel("PageRank")
axes[1].set_title("PageRank by Time Window")
axes[1].legend()

plt.suptitle("Dynamic Network Features: How Company Importance Evolves", fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

## 8. Spectral Features (From My Textbook)

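The eigenvalues of the graph Laplacian encode global network structure. The Fiedler vector (the eigenvector associated with the second-smallest eigenvalue) reveals the fundamental partition of a connected graph. If the graph is disconnected, that eigenvalue is 0 and the corresponding eigenvectors merely separate components.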
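
A minimal sketch of the idea on a toy graph, assuming the unnormalized Laplacian L = D - A. The two-triangle graph and the `_toy` names are illustrative and not part of the notebook's data.

```python
import numpy as np

# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3
A_toy = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A_toy[u, v] = A_toy[v, u] = 1.0

# Unnormalized (combinatorial) Laplacian L = D - A
L_toy = np.diag(A_toy.sum(axis=1)) - A_toy

# eigh returns eigenvalues in ascending order for symmetric matrices
vals_toy, vecs_toy = np.linalg.eigh(L_toy)
fiedler_toy = vecs_toy[:, 1]  # eigenvector of the second-smallest eigenvalue

print("Two smallest eigenvalues:", np.round(vals_toy[:2], 3))
print("Sign partition:", (fiedler_toy > 0).astype(int))
```

The sign pattern separates one triangle from the other. Which side gets the positive sign is arbitrary, so only the grouping is meaningful as a feature.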

In [None]:
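# Spectral analysis of the co-mention graph
# The co-mention graph is disconnected, so its Laplacian has one zero eigenvalue
# per component and the second eigenvector would not be a true Fiedler vector.
# Restrict the analysis to the largest connected component.
largest_cc = max(nx.connected_components(G), key=len)
G_cc = G.subgraph(largest_cc)
nodes = list(G_cc.nodes())
L = nx.laplacian_matrix(G_cc, nodelist=nodes).toarray().astype(float)
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Fiedler vector: eigenvector of the second-smallest eigenvalue
fiedler = eigenvectors[:, 1]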

# Plot spectrum
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(eigenvalues[:15], 'o-', color='steelblue', markersize=8)
axes[0].set_xlabel("Index")
axes[0].set_ylabel("Eigenvalue")
axes[0].set_title("Graph Laplacian Spectrum (first 15)")
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.3)

# Fiedler vector values per node
sorted_idx = np.argsort(fiedler)
axes[1].barh([nodes[i] for i in sorted_idx], fiedler[sorted_idx],
             color=['steelblue' if f > 0 else 'coral' for f in fiedler[sorted_idx]],
             edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='black', linewidth=0.5)
axes[1].set_xlabel("Fiedler Vector Value")
axes[1].set_title("Spectral Partition (Fiedler Vector)")

plt.suptitle("Spectral Graph Features from Co-Mention Network", fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

# Add spectral features
spectral_features = pd.DataFrame({
    "ticker": nodes,
    "fiedler_value": fiedler,
    "spectral_partition": (fiedler > 0).astype(int),
})
print("Spectral features:")
print(spectral_features.sort_values("fiedler_value").to_string(index=False))

## Discussion & Interview Talking Points

### Why This Is My Strongest Angle for Numerai
- **My textbook**: 524-page Cambridge University Press book on network ML — I wrote the chapter on spectral methods
- **Graspologic**: Co-authored open-source graph statistics library (adopted by Microsoft Research)
- **Blue Halo**: Led knowledge graph effort for a defense project — same entity extraction -> graph -> features pipeline

### Feature Summary for Numerai Signals
Per-stock graph features (all per time window). The orthogonality column reflects my expectation, not a measured result:

| Feature | What It Captures | Expected Orthogonality |
|---------|-----------------|---------------|
| PageRank | Overall importance in news network | High |
| Betweenness Centrality | Bridges between industry clusters | Very High |
| Clustering Coefficient | How tightly connected a company's neighbors are | High |
| Community Assignment | Latent industry grouping from news (not standard GICS) | Very High |
| Fiedler Value | Position in spectral partition | Very High |
| Delta PageRank | Change in network importance over time | Very High |

### Why These Features Are Orthogonal
- **Barra factors** (momentum, value, size) operate on single-stock time series
- **Graph features** operate on inter-stock relationships derived from text
- These are structurally different — neutralization shouldn't remove them
- "Supply chain shocks propagate through networks — my graph features capture this propagation before it shows up in returns"
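
The orthogonality argument can be checked numerically. The sketch below uses plain least-squares neutralization on synthetic data, which is an assumption of mine and not necessarily Numerai's exact procedure. The factor matrix, the feature, and the `neutralize` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks = 500

# Hypothetical factor exposures (stand-ins for momentum, value, size)
factors = rng.normal(size=(n_stocks, 3))

# Hypothetical graph feature: partly explained by one factor, partly independent
independent_part = rng.normal(size=n_stocks)
feature = 0.6 * factors[:, 2] + independent_part

def neutralize(x, exposures):
    """Remove the least-squares projection of x onto the exposures plus an intercept."""
    design = np.column_stack([np.ones(len(x)), exposures])
    beta, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ beta

residual = neutralize(feature, factors)

max_factor_corr = max(abs(np.corrcoef(residual, factors[:, k])[0, 1]) for k in range(3))
kept_corr = np.corrcoef(residual, independent_part)[0, 1]
print(f"max |corr(residual, factor)| = {max_factor_corr:.2e}")
print(f"corr(residual, independent part) = {kept_corr:.3f}")
```

Neutralization removes only the component that the factors can explain linearly. How much of a graph feature survives therefore depends on how much of it is independent of those factors, which has to be measured on real data.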

### Extensions (TODO)
- [ ] Use spaCy NER instead of dictionary matching for entity extraction
- [ ] Build directed graphs (company A mentioned as supplier OF company B)
- [ ] Add edge sentiment (positive vs negative co-mention)
- [ ] Adjacency spectral embedding (Chapter 5 of my textbook)
- [ ] Test on real news data from GDELT or Tiingo API
- [ ] Dynamic stochastic block model for evolving communities