
# Categorical Homophily and Degree-Based Analysis

This notebook performs two key analyses:
1. **Categorical Homophily Analysis**: Examines the tendency of nodes (users) to form connections based on shared categorical attributes, such as group affiliation.
2. **Degree-Based Assortativity Analysis**: Evaluates whether nodes with similar degrees (number of connections) are more likely to connect.

## Dataset Overview
The analyses are conducted using two datasets from the online platform `derStandard.at`:
- **Votes Data**: Represents undirected interactions between users based on shared voting behavior.
- **Replies Data**: Represents undirected interactions between users based on replies.

## Objectives
1. **Categorical Homophily**:
   - Identify intra-group connections.
   - Compute modularity to measure clustering by categories.
2. **Degree-Based Assortativity**:
   - Compute the degree assortativity coefficient.
   - Interpret the patterns of connectivity based on node degree.

Each step is explained in detail, with results and interpretations provided.



## Categorical Homophily Analysis

This section analyzes whether users tend to connect with others who share a categorical attribute.
Because no real metadata is available, each node is assigned one of two synthetic categories (`A` or `B`) at random for demonstration purposes.

### Steps:
1. Load the dataset containing user interactions based on shared voting behavior.
2. Create an undirected graph with weights representing the interaction strength.
3. Assign random synthetic categories to nodes.
4. Calculate:
   - The number of intra-group edges.
   - Modularity to measure clustering by categories.
5. Interpret the results, keeping in mind that randomly assigned categories make this run a random baseline in its own right.


In [3]:

import pandas as pd
import networkx as nx
import random

# Load the dataset for categorical homophily analysis
votes_df = pd.read_parquet(
    "./df_edge_list_undirected_users_votes_to_same_postings_net.parquet"
)

# Create an undirected graph
G_votes = nx.Graph()
G_votes.add_weighted_edges_from(
    votes_df[
        ["ID_CommunityIdentity_min", "ID_CommunityIdentity_max", "count_votes_to_same_posting_net"]
    ].itertuples(index=False, name=None)
)

# Assign random categories to nodes (unseeded, so the exact values below vary between runs)
categories = ["A", "B"]
node_categories = {node: random.choice(categories) for node in G_votes.nodes()}
nx.set_node_attributes(G_votes, node_categories, "category")

# Calculate the number of intra-group edges
intra_group_edges = sum(
    1
    for u, v in G_votes.edges()
    if G_votes.nodes[u]["category"] == G_votes.nodes[v]["category"]
)

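# Calculate modularity of the partition induced by the categories.
# Note: modularity() uses the "weight" edge attribute by default, so Q is
# weighted by vote counts, while the edge count above is unweighted.
communities = [
    {n for n, c in node_categories.items() if c == category}
    for category in categories
]
modularity = nx.algorithms.community.quality.modularity(G_votes, communities)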

# Display results
print(f"Number of intra-group edges: {intra_group_edges}")
print(f"Modularity (Q): {modularity}")


Number of intra-group edges: 7091396
Modularity (Q): -0.0004246397518191114



### Results Interpretation

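1. **Intra-Group Edges**:
   - The number of edges connecting nodes of the same category indicates the extent of homophily only relative to the total number of edges.
   - With two equally likely random categories, about half of all edges are expected to be intra-group, so the raw count of 7,091,396 is meaningful only as a share of all edges.

2. **Modularity (Q)**:
   - Positive values indicate more within-category edge weight than expected from the node degrees alone, i.e. clustering by category.
   - Values close to 0 indicate no preference for within-group connections. The observed Q ≈ -0.0004 falls in this range, as expected for randomly assigned categories.
   - Clearly negative values would indicate disassortative mixing, i.e. fewer within-category connections than expected.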
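
To make the sign of Q concrete, the following is a minimal sketch that computes modularity by hand for an unweighted graph. The function name `modularity`, the toy edge lists, and the labels are illustrative assumptions, not part of the dataset; the analysis above relies on the networkx implementation, which additionally takes edge weights into account.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Newman modularity Q of an unweighted, undirected graph for a given labelling."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the category
    degree_sum = defaultdict(int)  # total degree of the nodes in the category
    for u, v in edges:
        degree_sum[labels[u]] += 1
        degree_sum[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by one bridge, labelled by triangle: homophilous.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

# A 4-cycle with alternating labels: every edge crosses categories.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
alternating = {0: "A", 1: "B", 2: "A", 3: "B"}

q_homophilous = modularity(triangles, by_triangle)  # 5/14, about 0.357
q_disassortative = modularity(cycle, alternating)   # -0.5
print(q_homophilous, q_disassortative)
```

The homophilous labelling yields a positive Q, while the labelling in which every edge crosses categories yields a negative one.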

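Because the categories in this run are assigned at random, the values above are themselves a random baseline. With real categorical metadata, homophily would show up as an intra-group share and a modularity clearly above these values.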
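
One common way to build such a baseline for real categories is to shuffle the labels over the nodes many times and compare the observed intra-group edge count with the shuffled counts. The sketch below illustrates the idea on a toy graph; the helper names, the number of shuffles, and the data are assumptions made for illustration, and the notebook itself does not run this procedure.

```python
import random


def count_intra_group_edges(edges, labels):
    """Number of edges whose two endpoints carry the same label."""
    return sum(1 for u, v in edges if labels[u] == labels[v])


def shuffled_baseline(edges, labels, n_shuffles=1000, seed=0):
    """Intra-group edge counts after randomly permuting the labels over the nodes."""
    rng = random.Random(seed)  # seeded so the baseline is reproducible
    nodes = list(labels)
    values = list(labels.values())
    counts = []
    for _ in range(n_shuffles):
        rng.shuffle(values)
        counts.append(count_intra_group_edges(edges, dict(zip(nodes, values))))
    return counts


# Toy graph (hypothetical data): two triangles joined by one bridge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

observed = count_intra_group_edges(edges, labels)
baseline = shuffled_baseline(edges, labels)
baseline_mean = sum(baseline) / len(baseline)
# Share of shuffles with at least as many intra-group edges as observed.
p_value = sum(c >= observed for c in baseline) / len(baseline)

print(f"observed: {observed}, baseline mean: {baseline_mean:.2f}, p = {p_value:.3f}")
```

Shuffling keeps the graph and the category sizes fixed, so any gap between the observed count and the shuffled counts reflects how the labels are placed on the network rather than how many nodes each category has.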



## Degree-Based Assortativity Analysis

This section evaluates whether users with similar degrees (number of connections) are more likely to interact.

### Steps:
1. Load the dataset containing user interactions based on replies.
2. Create an undirected graph where edge weights represent the reply count.
3. Compute the degree assortativity coefficient, i.e. the Pearson correlation between the degrees of the two nodes at either end of each edge.
4. Interpret the results to determine whether the network exhibits assortative or disassortative mixing.
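
As an illustration of what the coefficient measures, the sketch below computes it by hand for unweighted simple graphs as the Pearson correlation of endpoint degrees. The function name `degree_assortativity` and the toy graphs are assumptions for illustration; the analysis itself uses the networkx implementation.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of endpoint degrees over the edges of an undirected graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Each undirected edge contributes both orientations, keeping x and y symmetric.
    xs = [degree[u] for u, v in edges] + [degree[v] for u, v in edges]
    ys = [degree[v] for u, v in edges] + [degree[u] for u, v in edges]
    mean = sum(xs) / len(xs)  # xs and ys hold the same values, so one mean suffices
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var if var else float("nan")  # undefined when all degrees are equal


# Star: one hub connected to five leaves, so high degree always meets low degree.
star = [(0, leaf) for leaf in range(1, 6)]

# A 4-clique (all degrees 3) next to a separate 4-cycle (all degrees 2):
# every edge joins two nodes of equal degree.
clique = [(a, b) for a in range(4) for b in range(a + 1, 4)]
ring = [(4, 5), (5, 6), (6, 7), (7, 4)]

r_star = degree_assortativity(star)             # perfectly disassortative
r_regular = degree_assortativity(clique + ring)  # perfectly assortative
print(r_star, r_regular)
```

The star sits at the disassortative extreme (coefficient of -1) and the two regular components at the assortative extreme (coefficient of 1, up to floating-point rounding); real networks usually fall somewhere in between.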


In [None]:

# Load the dataset for degree-based analysis
replies_df = pd.read_parquet(
    "./shared/194.050-2024W/Data/Group_Project/df_edge_list_undirected_users_postings_replies.parquet"
)

# Create an undirected graph
G_replies = nx.Graph()
G_replies.add_weighted_edges_from(
    replies_df[
        ["ID_CommunityIdentity_min", "ID_CommunityIdentity_max", "count_posting_replies"]
    ].itertuples(index=False, name=None)
)

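# Compute the degree assortativity coefficient.
# Note: weight="weight" makes networkx use weighted degrees (the sum of reply
# counts per node); pass weight=None to use the number of connections instead.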
degree_assortativity = nx.degree_assortativity_coefficient(G_replies, weight="weight")

# Display results
print(f"Degree assortativity coefficient: {degree_assortativity}")

# Interpret the coefficient
if degree_assortativity > 0:
    print("The network exhibits assortative mixing: nodes with similar degrees are more likely to connect.")
elif degree_assortativity < 0:
    print("The network exhibits disassortative mixing: nodes with dissimilar degrees are more likely to connect.")
else:
    print("The network does not show a clear preference for assortative or disassortative mixing.")
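
Since the cell above passes `weight="weight"`, it helps to see how a node's weighted degree (often called strength) can differ from its number of connections. The sketch below uses a hypothetical weighted edge list, with made-up user names and reply counts, and assumes each pair of users appears only once.

```python
from collections import defaultdict

# Hypothetical weighted edge list: (user, user, reply count).
weighted_edges = [("a", "b", 10), ("a", "c", 1), ("b", "c", 1), ("c", "d", 1)]

degree = defaultdict(int)    # number of distinct neighbours
strength = defaultdict(int)  # sum of incident edge weights
for u, v, w in weighted_edges:
    degree[u] += 1
    degree[v] += 1
    strength[u] += w
    strength[v] += w

for node in sorted(degree):
    print(node, degree[node], strength[node])
```

In this toy example `c` has the most connections, while `a` and `b` have the highest strength because of a single heavily used edge. A coefficient based on strength can therefore tell a different story from one based on the number of connections.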
