# Part 2: Community Detection for Portfolio Optimization

## Background

Building upon my exploratory data analysis from [Part 1](https://github.com/naomatheus/modulariti-opt-public/blob/0f6e04ef977bf7d0c2a48ffd6017c7fa6831e2b8/mod-opt/eda_modularity_opt_cuda.ipynb), I now implement community detection algorithms to identify meaningful clusters within the NASDAQ 100 stock dataset. Community detection reveals hidden structures in complex networks. This technique is particularly valuable in financial markets where asset relationships can provide crucial insights for portfolio construction.

Financial networks exhibit natural clustering tendencies where groups of stocks often move together due to shared industry characteristics, common risk factors, or other market dynamics. By identifying these communities, I can construct more resilient portfolios that better distribute risk across distinct market segments.

Most importantly, I focus on rapid analysis of extremely large datasets to generate insights that are useful to a human operator. While the financial sector is familiar with highly technical solutions, there is typically a human being on either end of any financial transaction or financial decision. All corporations represented in the index in this dataset and in other indices across the world are human organizations, with goals, agency, constraints, and activities. This is what I'm measuring, not just the numbers. The guiding principle in this demonstration is to be able to "Drill Down" into a very complex system and make sense of it, thus being able to make better decisions.

## Theoretical Justification

My approach draws inspiration from network science applications in finance, particularly the work by [Zhao et. al. 2021](https://arxiv.org/abs/2112.13383). They demonstrated that community detection methods effectively identify modular structures in stock market networks. These structures can then be leveraged to create diversified portfolios with improved risk-return profiles.

The fundamental insight driving my methodology is that stocks belonging to different communities tend to exhibit weaker correlations than stocks within the same community. This property makes community detection particularly valuable for portfolio diversification. Selecting stocks from different communities may provide better protection against market downturns and geopolitical volatility than traditional sector-based diversification approaches.

## Methodology

Building on the engineered features from Part 1, I transform stock metrics into network representations for community detection. My previous work established Price Change Percentage (PCP), Volume-Weighted Average Price (VWAP), and Price Volatility measurements. I now extend this analysis by introducing a `VolatilityChange` metric as the foundation for network construction.

This `VolatilityChange` metric serves as the weight connecting nodes (stocks) in my graph network. It captures day-to-day relationships across the time series. By analyzing how volatility changes propagate across stocks throughout the dataset, I identify natural community formations that reveal underlying market structures beyond traditional sector classifications.

### GPU Acceleration
I take advantage of GPU Acceleration to make the analysis done here take mere miliseconds to complete. My network has over 2 Million individual stocks and this represents a network of around 400 Million data points across connections and lower level attributes. With GPU Acceleration, every single calculation takes the blink of an eye to complete.

### Community Detection Algorithms

I deliberately selected three distinct algorithms for community detection:

1. **Spectral Clustering** - My initial baseline approach uses eigendecomposition of the network's Laplacian matrix to identify communities. This provides a mathematical foundation for community separation.
   
2. **Modularity Optimization** - A technique that maximizes the modularity metric Q, defined as:
   
   $$Q = \frac{1}{2m} \sum_{vw} \left[A_{vw} - \frac{k_v k_w}{2m}\right] \delta(c_v, c_w)$$
   
   where $A_{vw}$ represents the edge weight between nodes v and w, $k_v$ and $k_w$ are the degrees of nodes v and w respectively, m is the total edge weight in the network, and $\delta(c_v, c_w)$ equals 1 if nodes v and w belong to the same community and 0 otherwise.

3. **Leiden Algorithm** - Lastly, I applied an advanced method that addresses limitations in modularity optimization by preventing the isolation of well-connected nodes. This algorithm allows nodes to remain in their communities even when traditional modularity approaches would suggest movement. The result is more coherent community structures that better reflect real-world relationships.

My baseline measures using spectral clustering through modularity optimization initially produced sparse results with inefficient community detection. The application of the Leiden algorithm significantly improved community coherence. It maintained meaningful connections that would otherwise be disrupted by strict modularity maximization.

### "Drill Down" Capabilities for Financial Services

A key innovation in my approach is the implementation of "drill down" capabilities that simulate answers to critical business questions in financial services. This functionality allows financial professionals to quickly analyze aggregate metrics across portfolios or trading activity datasets. These capabilities directly address real-world scenarios faced by market participants.

For example, a fund manager reviewing a large portfolio can instantly identify which stock communities are exhibiting concerning volatility patterns. A risk manager can quickly assess exposure across different market segments without manually grouping hundreds of positions. An auditor can trace the movement of stocks between communities to identify potential anomalies or market shifts.

My system enables financial professionals to:

1. **Track Stock Movement Across Communities** - Monitor how individual stocks transition between different communities over time. This highlights potential shifts in market dynamics or risk exposures that might otherwise go unnoticed in traditional analysis.

2. **Examine Aggregate Community Metrics** - View consolidated statistics of engineered features across all stocks within identified communities. This provides a macro-level perspective on market segments without getting lost in individual stock details.

3. **Perform Aggregate Volatility Selections** - Identify communities exhibiting volatility levels above specified thresholds. This enables rapid identification of high-risk market segments. Financial professionals can quickly wind down positions in volatile communities or strategically enter communities that have moved below risk thresholds. This facilitates efficient construction of diversified positions based on actual market behavior rather than predefined sectors.

This approach transforms abstract network analysis into actionable financial intelligence. It provides a novel framework for portfolio construction and risk management that complements traditional sector-based approaches. Most importantly, it puts powerful analytical capabilities in the hands of human decision-makers and enhances their ability to navigate a complex market environment like todays.

Author: Matt K Robinson <mattkrobinson@berkeley.edu>

Date: 21-04-25


In [8]:
# Verify running on nvidia gpu
!nvidia-smi --version | grep "CUDA"


CUDA Version        : 12.4


In [None]:
! pip install --upgrade pip

In [None]:
# Install cugraph (check that the version is correct)
! pip install cugraph-cu12 --extra-index-url=https://pypi.nvidia.com
# source https://docs.rapids.ai/api/cugraph/stable/installation/getting_cugraph/#pip

In [None]:
# Install cudf (check that version is correct) (though likely already installed)
! pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12

In [None]:
# if running this in Google Colab - connect google drive
from google.colab import drive
drive.mount('/content/gdrive')

In [13]:
# Read in data created from previous notebook Part 1
DATA_PATH = "/content/gdrive/MyDrive/Portfolio_Berkeley"
# prepare paths
df_w_vwap = "df_w_vwap.csv"
sharpe_ratios = "sharpe_ratios.csv"

In [None]:
import pandas as pd
import cudf
# load data into pandas dataframes
stock_df = cudf.read_csv(f"{DATA_PATH}/{df_w_vwap}")
sharpe_df = cudf.read_csv(f"{DATA_PATH}/{sharpe_ratios}")
stock_df.describe, sharpe_df.describe

In [15]:
# Examine columns in stock data
stock_df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Name',
       'volatility', 'pcp', 'vwap'],
      dtype='object')

In [16]:
# Examine columns in sharpe ratios
sharpe_df.columns

Index(['Name', 'Sharpe'], dtype='object')

In [17]:
# Import needed libraries
import cugraph
from cugraph.experimental import PropertyGraph
import numpy as np

In [18]:
# Create a property graph with Nodes/Vertices target attributes
pG = PropertyGraph()
pG.add_vertex_data(stock_df,vertex_col_name="Date",type_name="TradingDay",property_columns=([col for col in stock_df.columns if col in ['Name','Date','volatility','pcp','vwap']]))

# Here we get a dataframe back
trading_days_cudf = pG.get_vertex_data()
trading_days_cudf.head(10)

Unnamed: 0,_VERTEX_,Date,Name,volatility,pcp,vwap,_TYPE_
0,2010-01-04,2010-01-04,AAPL,0.9933,0.2718,7.6296,TradingDay
1,2010-01-05,2010-01-05,AAPL,1.0904,-0.1025,7.6574,TradingDay
2,2010-01-06,2010-01-06,AAPL,2.0898,-1.5906,7.5827,TradingDay
3,2010-01-07,2010-01-07,AAPL,1.3932,-0.5525,7.5194,TradingDay
4,2010-01-08,2010-01-08,AAPL,1.398,0.7989,7.5362,TradingDay
5,2010-01-11,2010-01-11,AAPL,2.1382,-1.2641,7.5186,TradingDay
6,2010-01-12,2010-01-12,AAPL,1.6014,-0.7027,7.4275,TradingDay
7,2010-01-13,2010-01-13,AAPL,3.2857,1.3374,7.4486,TradingDay
8,2010-01-14,2010-01-14,AAPL,0.6854,-0.3236,7.487,TradingDay
9,2010-01-15,2010-01-15,AAPL,2.7165,-2.3705,7.4214,TradingDay


In [19]:
# Calculate `VolatilityChange`
# Sort
trading_days_cudf = trading_days_cudf.sort_values(by=['Name', 'Date'])

# Add shifted times series as next day's volatility
trading_days_cudf['next_volatility'] = trading_days_cudf.groupby('Name').volatility.shift(-1)

# Compute volatility change directly
# Note: some volatilities will be 0, as when there was no price change or insufficient volume for the day
# Apply the calculation conditionally and leave values as 0 if there are 0 volatility days
trading_days_cudf['volatility_change'] = (
    (trading_days_cudf['next_volatility'] - trading_days_cudf['volatility']) /
    trading_days_cudf['volatility']
).where(trading_days_cudf['volatility'] != 0, 0) # Set to 0 when current_volatility == 0

# Drop rows where volatility_change is NaN (last day for each stock)
trading_days_cudf = trading_days_cudf.dropna(subset=['volatility_change'])
trading_days_cudf = trading_days_cudf.dropna(subset=['next_volatility'])

# Drop the temporary next_volatility column (optional)
trading_days_cudf = trading_days_cudf.drop(columns=['next_volatility'])

In [20]:
import hashlib

# Ensure the data is sorted by Name and Date
trading_days_cudf = trading_days_cudf.sort_values(by=['Name', 'Date'])

# Create source and destination columns for edges
trading_days_cudf['source'] = trading_days_cudf['Date'].astype(str) + '_' + stock_df['Name']
trading_days_cudf['destination'] = trading_days_cudf.groupby('Name')['source'].shift(-1)

# Drop rows where destination is NaN (last day for each stock)
trading_days_cudf = trading_days_cudf.dropna(subset=['destination'])

# Generate unique edge IDs using GPU hashing
trading_days_cudf['edge_string'] = trading_days_cudf['source'] + '_' + trading_days_cudf['destination']

# Use cuDF's hash_values to generate a unique hash for each edge
trading_days_cudf['edge_id'] = trading_days_cudf['edge_string'].hash_values().astype(str)

# Apply Min-Max scaling to volalitity change to increase the likelihood of detecting communities in our baseline
min_vol_change = trading_days_cudf['volatility_change'].min()
max_vol_change = trading_days_cudf['volatility_change'].max()
scale_factor = 10

trading_days_cudf['scaled_volatility_change'] = (trading_days_cudf['volatility_change'] - min_vol_change) / (max_vol_change - min_vol_change) * scale_factor

# Drop the temporary edge_string column (optional)
trading_days_cudf = trading_days_cudf.drop(columns=['edge_string'])
trading_days_cudf.head(5)

Unnamed: 0,_VERTEX_,Date,Name,volatility,pcp,vwap,_TYPE_,volatility_change,source,destination,edge_id,scaled_volatility_change
0,2010-01-04,2010-01-04,AAPL,0.9933,0.2718,7.6296,TradingDay,0.097755,2010-01-04_AAPL,2010-01-05_AAPL,2699677915,0.505183
1,2010-01-05,2010-01-05,AAPL,1.0904,-0.1025,7.6574,TradingDay,0.916544,2010-01-05_AAPL,2010-01-06_AAPL,1370729126,0.881988
2,2010-01-06,2010-01-06,AAPL,2.0898,-1.5906,7.5827,TradingDay,-0.333333,2010-01-06_AAPL,2010-01-07_AAPL,419919682,0.306798
3,2010-01-07,2010-01-07,AAPL,1.3932,-0.5525,7.5194,TradingDay,0.003445,2010-01-07_AAPL,2010-01-08_AAPL,4292718468,0.461782
4,2010-01-08,2010-01-08,AAPL,1.398,0.7989,7.5362,TradingDay,0.529471,2010-01-08_AAPL,2010-01-11_AAPL,2975937858,0.703857


In [21]:
# Now we must define the Links/relations/edges (These are like a linkedlist where we will use the trading day of each stock node to link all trading days sequentially)
pG.add_edge_data(
    trading_days_cudf,
    vertex_col_names=['source','destination'],
    edge_id_col_name='edge_id',
    type_name="VolatilityChange",
    property_columns=['scaled_volatility_change'],
    # vector_properties=[] # here we can optionally include other metrics on the graph's edges
  )

In [22]:
pG.get_edge_data()

Unnamed: 0,_EDGE_ID_,_TYPE_,scaled_volatility_change,_SRC_,_DST_
0,2699677915,VolatilityChange,0.505183,2010-01-04_AAPL,2010-01-05_AAPL
1,1370729126,VolatilityChange,0.881988,2010-01-05_AAPL,2010-01-06_AAPL
2,419919682,VolatilityChange,0.306798,2010-01-06_AAPL,2010-01-07_AAPL
3,4292718468,VolatilityChange,0.461782,2010-01-07_AAPL,2010-01-08_AAPL
4,2975937858,VolatilityChange,0.703857,2010-01-08_AAPL,2010-01-11_AAPL
...,...,...,...,...,...
271471,2889526176,VolatilityChange,0.309142,2021-09-01_ZM,2021-09-02_ZM
271472,2890987050,VolatilityChange,0.710185,2021-09-02_ZM,2021-09-03_ZM
271473,2104335710,VolatilityChange,0.288056,2021-09-03_ZM,2021-09-07_ZM
271474,2605490554,VolatilityChange,0.699269,2021-09-07_ZM,2021-09-08_ZM


### Apply Modularity Optimization (Maximalization)

In [23]:
# Extract a subgraph to verify that the edge list contains the right information
sub_graph = pG.extract_subgraph(create_using=pG, edge_weight_property='scaled_volatility_change')
sub_graph_edge_data = sub_graph.get_edge_data()
sub_graph_edge_data.head()

Unnamed: 0,_EDGE_ID_,_TYPE_,scaled_volatility_change,_SRC_,_DST_
0,2699677915,VolatilityChange,0.505183,2010-01-04_AAPL,2010-01-05_AAPL
1,1370729126,VolatilityChange,0.881988,2010-01-05_AAPL,2010-01-06_AAPL
2,419919682,VolatilityChange,0.306798,2010-01-06_AAPL,2010-01-07_AAPL
3,4292718468,VolatilityChange,0.461782,2010-01-07_AAPL,2010-01-08_AAPL
4,2975937858,VolatilityChange,0.703857,2010-01-08_AAPL,2010-01-11_AAPL


In [24]:
# Create the graph with edge weights
G = cugraph.Graph()
G.from_cudf_edgelist(
    sub_graph_edge_data,
    source='_SRC_',
    destination='_DST_',
    weight='scaled_volatility_change',
)

In [25]:
# Apply Spectral Modularity Clustering
mod_opt_graph = cugraph.community.spectralModularityMaximizationClustering(
    G,
    num_clusters=80,
    num_eigen_vects=80,
    evs_tolerance=1e-4,
    evs_max_iter=1000,
    kmean_tolerance=1e-4,
    kmean_max_iter=4000
)

## Evaluation of Baseline Model

In [26]:
# calculate modularity score
score = cugraph.analyzeClustering_modularity(G, 80, mod_opt_graph, 'vertex', 'cluster')

print("Modularity Score:", score)

Modularity Score: 0.0356435290768389


### Challenges with Modularity Optimization on Time Series Data

A modularity score ranging from ~0.031 to ~0.049 after manual hyperparameter tuning revealed an important insight: the granular nature of the time series dataset likely causes sparse connectivity when applying the Modularity Optimization algorithm.

#### Why This Occurs

Modularity optimization algorithms leverage spectral clustering, which identifies eigenvectors within the input matrix to detect community structures. The algorithm tests for characteristic quality (where eigenvectors remain constant when a linear transform of the input matrix is applied to a transposed input matrix) and organizes communities around these stable vectors.

With highly granular time series data, many tests within the algorithm identify characteristic eigenvalues, resulting in the formation of discrete communities across the volatility change attribute. In simpler terms, the structure of volatility changes rarely crosses detection thresholds, causing many data points to cluster into the same communities. This leaves us with limited coherent structure for meaningful analysis.

Fortunately, I identified a solution to this challenge through the application of the Leiden algorithm, which addresses these limitations by taking a different approach to community formation.


### Leiden approach for sparse networks

In [27]:
leiden_results, modularity_score = cugraph.community.leiden(G)
# Snapshot of results
print("Modularity Score:", modularity_score)
leiden_results.head(5)

Modularity Score: 0.9989169380531239


Unnamed: 0,partition,vertex
0,376,2014-06-30_ATVI
1,102,2013-03-04_EXC
2,335,2016-12-16_XLNX
3,585,2012-11-12_EBAY
4,64,2019-07-15_FISV


## Evaluation (Leiden)

### Breakthrough with the Leiden Algorithm

With the implementation of the Leiden algorithm, the modularity score for the graph network approaches 1 - an optimal criterion. This significant improvement indicates that the Leiden-based community detection and modularity optimization, when applied to volatility weights in the graph, successfully identifies optimal community structures within the network.

The near-ideal modularity score provides strong validation of the detected communities. This breakthrough gives me clear parameters to analyze the resulting clusters across the time series and gain more meaningful insights into stock behavior patterns and their evolution over time. The improved community detection creates a solid foundation for the subsequent portfolio optimization and risk analysis phases of this project.


# Drill Down

### "Drill Down" Capabilities for Financial Services

A key innovation in my approach is the implementation of "drill down" capabilities that simulate answers to critical business questions in financial services. This functionality allows financial professionals to quickly analyze aggregate metrics across portfolios or trading activity datasets. These capabilities directly address real-world scenarios faced by market participants.

For example, a fund manager reviewing a large portfolio can instantly identify which stock communities are exhibiting concerning volatility patterns. A risk manager can quickly assess exposure across different market segments without manually grouping hundreds of positions. An auditor can trace the movement of stocks between communities to identify potential anomalies or market shifts.

### Evaluation and Data Analysis

In [116]:
# Join the clustering results back to the original data
leiden_results = leiden_results.rename(columns={'vertex': 'source'})  # Rename for merging
result_df = trading_days_cudf.merge(leiden_results, on='source', how='left')

# View the results
result_df[['Date', 'Name', 'volatility_change', 'partition']].head(5)

Unnamed: 0,Date,Name,volatility_change,partition
0,2014-12-03,ADI,-0.641972,215
1,2014-12-04,ADI,0.499404,215
2,2014-12-05,ADI,0.41554,215
3,2014-12-08,ADI,-0.355751,215
4,2014-12-09,ADI,0.269206,215


### Tracking Transitions Between Communities

To track transitions between communities for each stock, I calculated the differences in partition assignments over time. The transition rate tells us there's a measurable degree of movement across the volatility metric between communities of stocks. Many communities remain stable throughout the dataset, indicating certain NASDAQ100 stocks maintain consistent relationship patterns over long periods.

The visualization shows darker colors for communities with fewer stocks, larger bubbles for more transitions between partitions, and brighter colors for communities with higher stock counts. We would be particularly interested in comparing the communities that stay stable with few transitions (smaller, lower bubbles) versus the communities with large, higher bubbles where there seems to be significant volatility movement. These high-transition communities likely represent areas of market instability or sectors experiencing structural changes worth investigating further.


In [29]:
# Sort the dataframe by Name and Date
sorted_results = result_df.sort_values(by=['Name', 'Date'])

# Rename
stock_movements = sorted_results

# Reset the index for easier access (Optional)
stock_movements = stock_movements.reset_index(drop=True)

# calculate transitions (difference in partitions)
stock_movements['partition_transition'] = stock_movements.groupby('Name')['partition'].diff()

In [30]:
import cudf
import cupy as cp
import plotly.express as px


# Convert the data to a CUDF DataFrame
df = stock_movements
df['Date'] = cudf.to_datetime(df['Date'])


# Preprocess the partition_transition column
# Convert to binary (1 for transitions, 0 otherwise)
df['is_transition'] = (df['partition_transition'] != 0).astype('int32')

# Group by Date and calculate the number of transitions
daily_transitions = (
    df.groupby('Date')['is_transition']
    .sum()  # Sum up the binary transitions
    .reset_index()
    .rename(columns={'is_transition': 'total_transitions'})
)

# Count the number of unique stocks per date
daily_stock_count = (
    df.groupby('Date')['Name']
    .nunique()
    .reset_index()
    .rename(columns={'Name': 'stock_count'})
)

# Merge results and calculate the transition rate
transition_rate = daily_transitions.merge(daily_stock_count, on='Date')
transition_rate['transition_rate'] = (transition_rate['total_transitions'] / transition_rate['stock_count'])

# Drop NaNs before visualization
transition_rate = transition_rate.dropna()

# Convert to Pandas for visualization with Plotly
transition_rate_pd = transition_rate.to_pandas()

# GPU-Accelerated Scatter Plot with Plotly
fig = px.scatter(
    transition_rate_pd,
    x='Date',
    y='transition_rate',
    size='total_transitions',  # Bubble size based on total transitions
    color='stock_count',       # Color based on the number of stocks
    title='Community Transition Rate Over Time',
    labels={'transition_rate': 'Transition Rate', 'Date': 'Date'},
    hover_data=['total_transitions', 'stock_count']
)
fig.update_layout(template='plotly_dark', xaxis_title='Date', yaxis_title='Transition Rate')
fig.show()

## Aggregate Community Metrics

### Community-Level Market Analysis

This community analysis provides a more comprehensive view of market behavior than traditional methods. Rather than examining individual stocks, industries, or comparables in isolation, this approach synthesizes information into meaningful clusters.

I grouped stocks into **communities** based on shared volatility characteristics and aggregated their metrics over time. This summarizes complex market dynamics into digestible patterns, allowing me to analyze market activity at the **community level**.

The power of this approach lies in the aggregate metrics. By combining volatility, price changes, and volume data across community members, I create composite indicators that smooth out individual stock noise while preserving meaningful signals. These aggregate metrics reveal patterns that individual stock analysis might miss—such as when an entire community shows increasing volatility before the broader market, or when price movements in one community consistently precede similar movements in another.

Financial professionals can leverage these community-level metrics to make more informed decisions. Rather than monitoring hundreds of individual securities or relying on the consistency of ETF analytists, one can track a manageable number of communities, intervening when aggregate metrics cross significant thresholds. This approach also facilitates portfolio construction by identifying diversification opportunities across communities with different behavioral characteristics.


In [114]:
# Group by 'Date' (i.e., the community cluster) and aggregate metrics
community_stats = result_df.groupby('Date').agg({
    'Name': 'nunique',  # number of stocks in each community by date
    'Date': 'count',    # Total entries per community
    'volatility': ['mean', 'median', 'min', 'max'],
    'pcp': ['mean', 'median', 'min', 'max'],
    'vwap': ['mean', 'median', 'min', 'max'],
    'partition': 'unique'
})

# Reset the index to keep 'Date' as a column
community_stats = community_stats.reset_index()

# Renaming columns for clarity
community_stats.columns = [
    'date', 'num_stocks', 'num_entries',
    'volatility_mean', 'volatility_median', 'volatility_min', 'volatility_max',
    'pcp_mean', 'pcp_median', 'pcp_min', 'pcp_max',
    'vwap_mean', 'vwap_median', 'vwap_min', 'vwap_max',
    'partition_ids'
]

# Display a sample of the resulting dataset
community_stats.sample(5)

Unnamed: 0,date,num_stocks,num_entries,volatility_mean,volatility_median,volatility_min,volatility_max,pcp_mean,pcp_median,pcp_min,pcp_max,vwap_mean,vwap_median,vwap_min,vwap_max,partition_ids
601,2012-05-22,87,87,2.639879,2.3289,0.9588,8.1263,-0.196502,-0.0828,-5.6964,3.9776,53.051372,31.4633,1.9567,664.97,"[1024, 970, 734, 968, 257, 314, 507, 48, 26, 5..."
1260,2015-01-06,90,90,3.21388,2.9376,1.4592,8.3916,-1.409478,-1.47805,-4.8123,3.7063,93.121401,56.53165,2.6133,1085.48,"[345, 1085, 542, 215, 702, 640, 518, 362, 853,..."
1543,2016-02-22,93,93,2.428546,1.9366,0.9222,6.6759,0.604391,0.4768,-2.8144,5.3271,101.961584,58.6633,1.9933,1282.62,"[94, 879, 301, 690, 660, 599, 774, 59, 859, 93..."
830,2013-04-23,88,88,2.505669,2.14695,0.98,8.5066,0.721128,0.6463,-3.0723,5.3439,63.836989,36.84665,2.5167,700.74,"[276, 934, 607, 718, 520, 314, 662, 729, 407, ..."
411,2011-08-19,85,85,4.955332,4.1758,1.7234,16.8482,-0.072253,-0.1501,-14.2491,9.2229,42.1542,28.0933,1.7633,457.3,"[8, 65, 963, 914, 1083, 455, 930, 924, 1059, 7..."


## Aggregate Volatility Selections
### Volatility Windows  

In 2025, global financial markets are experiencing unusually high volatility and investors urgently need tools to identify where risks and opportunities lie. The Volatility Window metrics provide a rapid, efficient snapshot of volatility concentration, duration, and intensity across the market. As a financial professional, you can drill down into specific volatility windows, revealing potential alpha sources or hedging opportunities that might otherwise remain hidden in turbulent conditions.

In [132]:
# Set a threshold to filter high-volatility communities
VOLATILITY_THRESHOLD = 5

# Group by 'partition' (i.e., the community cluster) and aggregate metrics
volatility_stats = result_df.groupby('partition').agg({
    'Name': 'count',
    'Date': ['min','max'],
    'volatility': ['mean', 'median', 'min', 'max'],
    'pcp': ['mean', 'median', 'min', 'max'],
    'vwap': ['mean', 'median', 'min', 'max'],
})

# Convert the aggregated 'min' and 'max' Date columns to datetime
volatility_stats[('Date', 'min')] = cudf.to_datetime(volatility_stats[('Date', 'min')])
volatility_stats[('Date', 'max')] = cudf.to_datetime(volatility_stats[('Date', 'max')])


# Compute the length of the time frame for each partition in days.
# This measures the time span of the dates in that partition.
volatility_stats['time_frame_length'] = (volatility_stats[('Date', 'max')] - volatility_stats[('Date', 'min')]).dt.days


# Renaming columns for clarity
volatility_stats.columns = [
    'Number Stocks',
    'Part_Start', 'Part_End',
    'volatility_mean', 'volatility_median', 'volatility_min', 'volatility_max',
    'pcp_mean', 'pcp_median', 'pcp_min', 'pcp_max',
    'vwap_mean', 'vwap_median', 'vwap_min', 'vwap_max',
    'Part Time Frame (d)'
]

# Display a sample of the resulting dataset
volatility_stats.sample(5)

Unnamed: 0_level_0,Number Stocks,Part_Start,Part_End,volatility_mean,volatility_median,volatility_min,volatility_max,pcp_mean,pcp_median,pcp_min,pcp_max,vwap_mean,vwap_median,vwap_min,vwap_max,Part Time Frame (d)
partition,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
470,334,2011-07-22,2012-11-16,4.146199,3.51085,1.0681,24.6895,0.225128,0.13815,-9.643,10.9437,101.86579,113.64165,44.95,162.9767,483
762,264,2012-09-14,2013-10-03,1.542131,1.4092,0.5156,11.0411,0.002331,-0.1054,-8.0128,3.9927,399.928872,402.08375,321.495,459.785,384
1137,284,2010-01-04,2011-02-16,4.477278,4.0,0.7194,22.7273,0.033208,0.0,-13.7255,11.9565,1.149778,1.0533,0.64,1.8367,408
456,251,2012-02-24,2013-02-25,1.870103,1.7201,0.6897,4.8905,-0.062029,-0.0772,-4.1356,3.0394,61.846294,62.24,53.7433,68.5667,367
203,123,2010-10-11,2011-04-05,1.890723,1.6954,0.8837,5.9337,0.265187,0.1406,-2.9595,4.4879,32.930572,33.1867,28.0267,36.4633,176


In [133]:
# Drill down to individual stocks among high-volatility communities
hi_vols = volatility_stats[volatility_stats['volatility_mean'] > VOLATILITY_THRESHOLD]

# Get the list of high-volatility community IDs
high_vol_community_ids = result_df['partition'].to_arrow().to_pylist()

# Select stocks from Leiden Communities that are in high-volatility communities
high_vol_stocks = result_df[result_df['partition'].isin(high_vol_community_ids)]['Name'].unique()

In [139]:
# Convert the high-volatility stocks to a CUDF DataFrame
high_vol_stocks_df = cudf.DataFrame({'Stock Name': high_vol_stocks})

# Convert to Pandas for clean display in Jupyter Notebook
high_vol_stocks_df = high_vol_stocks_df.to_pandas()
high_vol_stocks_df.sample(5)

Unnamed: 0,Stock Name
8,AMAT
22,CHTR
6,AMD
7,ALGN
63,MRVL


### Volatility Variance vs. Volatility Change

**Volatility Variance** measures the spread of volatility across stocks within each community on a given date. High variance indicates diverse behavior within a community, while low variance suggests uniform activity among member stocks.

Analyzing volatility variance within partitions provides crucial insights beyond average measures. When a community shows high internal volatility variance, it signals potential fracturing of previously correlated stocks—often a leading indicator of broader market regime changes. Conversely, communities with consistently low volatility variance represent stable market segments that might offer reliable hedging opportunities during turbulent periods.

**Volatility Change**, meanwhile, captures the rate at which volatility is increasing or decreasing within communities. This dynamic measure reveals acceleration patterns in market stress or recovery that static variance measures might miss.

In my analysis, I observe communities with high volatility variance but low volatility change, indicating established but stable dispersion in stock behavior. Other communities show low variance but high change, suggesting uniformly shifting risk profiles. The most concerning pattern—high variance coupled with high change—appears in several communities during major market events, providing early warning of potential contagion effects. By monitoring both metrics simultaneously, financial professionals can distinguish between normal market diversity and emerging instability.


In [None]:
# Remove zeros
df_no_zeros = result_df[result_df['volatility'] != 0]

# Group by 'Date' and calculate volatility aggregates, keep Date as a column
vol_stats_by_date = df_no_zeros.groupby('Date').agg({
    'volatility': ['var', 'mean', 'max']
}).reset_index()

# rename the column names for clarity
vol_stats_by_date.columns = ['Date', 'vol_variance', 'mean_volatility', 'max_volatility']

# Order by 'Date'
vol_stats_by_date = vol_stats_by_date.sort_values(by='Date')

In [72]:
# Replace NaN with 0 for plotting
vol_stats_by_date['vol_variance'] = vol_stats_by_date['vol_variance'].fillna(0)

# GPU Accelerated Plot using Plotly
fig = px.line(
    vol_stats_by_date.to_pandas(),  # Convert to Pandas for Plotly compatibility
    x='Date',
    y='vol_variance',
    title='Volatility Variance Over Time',
    labels={'vol_variance': 'Volatility Variance', 'Date': 'Date'}
)

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Volatility Variance',
    xaxis=dict(tickangle=45),
    template='plotly_dark',
    width=1000,
    height=600
)

# Show the plot
fig.show()

In [136]:
# Remove zeros
volatility_cov = result_df[result_df['volatility'] != 0]

# Group by 'Date' and calculate aggregates
vol_stats_by_partition = volatility_cov.groupby('Date').agg({
    'volatility_change': ['var', 'mean', 'min', 'max'],
    'partition': 'collect'
}).reset_index()

# Renaming columns for clarity
vol_stats_by_partition.columns = [
    'date',
    'vol_change_var',
    'vol_change_mean',
    'vol_change_min',
    'vol_change_max',
    'partition_ids'
]

In [137]:
# Replace NaN with 0 for plotting
vol_stats_by_partition['vol_change_var'] = vol_stats_by_partition['vol_change_var'].fillna(0)

# get partition ids as list
partition_ids_list = vol_stats_by_partition['partition_ids'].to_arrow().to_pylist()

# Sample some partition ids for each row (These are lookup ids for individual stocks)
sampled_ids_list = [
    cp.random.choice(cp.array(partition), size=min(3, len(partition)), replace=False).tolist()
    for partition in partition_ids_list
]


# Add the sampled IDs back to the CUDF DataFrame
vol_stats_by_partition['Stocks in Part'] = sampled_ids_list

# GPU Accelerated Plot using Plotly
fig = px.line(
    vol_stats_by_partition.to_pandas(),  # Convert to Pandas for Plotly compatibility
    x='date',
    y='vol_change_var',
    title='Volatility Change Over Time By Partition',
    labels={'vol_change_var': 'Volatility Change', 'Date': 'Date'},
    hover_data={'Stocks in Part': True}
)

# Customize the layout for better readability
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Volatility Change',
    xaxis=dict(tickangle=45),
    template='plotly_dark',
    width=1000,
    height=600
)

# Show the plot
fig.show()