In [1]:
import hypergraph
import pickle
import pandas as pd
import numpy as np
import os

When node metadata is available, we can label each inferred community based on the estimated community membership matrix `U` and the metadata. In our manuscript, we streamlined this process using a Large Language Model (LLM). Here, we demonstrate the procedure to generate the prompt used for this purpose.

In this example, we use the **high-school hypergraph**, constructed from contact data among students in a French high school. Each node (student) has metadata indicating their class affiliation. There are 9 classes in total, and this hypergraph is known to exhibit a strong community structure corresponding to these classes.

First, please download `contact-high-school.zip` from the following website and unzip it in the `./data/` directory:
[https://www.cs.cornell.edu/~arb/data/contact-high-school-labeled/](https://www.cs.cornell.edu/~arb/data/contact-high-school-labeled/)

This will create a `./data/contact-high-school` directory containing `hyperedges-contact-high-school.txt` and `node-labels-contact-high-school.txt`.
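A quick way to confirm that the archive was extracted where the loader expects it is to check for both files. The sketch below is a minimal check, assuming exactly the directory layout described above; the helper name `missing_data_files` is hypothetical and is not part of the `hypergraph` module.

```python
import os

def missing_data_files(data_dir="./data/contact-high-school"):
    """Return the expected data files that are not present in data_dir."""
    expected = [
        "hyperedges-contact-high-school.txt",
        "node-labels-contact-high-school.txt",
    ]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]

missing = missing_data_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Both data files found.")
```

If the first message is printed, re-check where the archive was unzipped before running the next cell.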

After confirming the files, run the following code to generate the high-school hypergraph object. A `contact-high-school.pickle` file will be created in the `./data/` directory.

In [2]:
hypergraph_name = "contact-high-school"

H = hypergraph.read_benson_hypergraph_data(
    data_name=hypergraph_name,
    multiple_hyperedges=False,
)

f_path = './data/' + str(hypergraph_name) + '.pickle'
with open(f_path, mode='wb') as f:
    pickle.dump(H, f)

Number of nodes: 327
Number of hyperedges: 7818
Sum of hyperedge weights: 7818
Average degree of the node: 55.63302752293578
Average size of the hyperedge: 2.3269378357636223
Maximum size of the hyperedge: 5
Hyperedge size distribution: [(2, 5498), (3, 2091), (4, 222), (5, 7)]
Connected hypergraph: True
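The summary statistics printed above are mutually consistent and can be recomputed from the hyperedge size distribution alone. The following sketch hard-codes the printed values rather than using the `hypergraph` module, and it assumes that a node's degree counts the hyperedges the node belongs to.

```python
# Values copied from the printed summary above
num_nodes = 327
size_distribution = [(2, 5498), (3, 2091), (4, 222), (5, 7)]

# Number of hyperedges is the sum of the counts over all sizes
num_hyperedges = sum(count for _, count in size_distribution)

# Total number of (node, hyperedge) incidences
total_incidences = sum(size * count for size, count in size_distribution)

avg_hyperedge_size = total_incidences / num_hyperedges
avg_degree = total_incidences / num_nodes

print(num_hyperedges)      # 7818
print(avg_hyperedge_size)  # about 2.327
print(avg_degree)          # about 55.633
```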



We load the parameters `U` (membership matrix) and `W` (affinity matrices) inferred by HyperMOSBM for the high-school data. 
In this example, we assume the results are stored in `./results/hypermosbm_inferred_u_w.pickle`.
(Note: You need to run the inference beforehand to generate this file.)

In [3]:
with open("./results/hypermosbm_inferred_u_w.pickle", mode='rb') as f:
    (U, W) = pickle.load(f)

We create a pandas DataFrame to facilitate prompt generation. This DataFrame stores the class label (one of "2BIO1", "2BIO2", "2BIO3", "MP*1", "MP*2", "PSI*", "PC", "PC*", "MP") and the inferred community membership vector for each node.

In [4]:
def create_dataframe(H, U):
    
    # Work on a copy so that the caller's U is not modified in place
    U = np.array(U, dtype=float)
    U[U < 0] = 0
    row_sums = U.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1
    U_normalized = U / row_sums
    
    N, K = int(U.shape[0]), int(U.shape[1])
    
    # Map class ID to class name
    classroom = {0: "2BIO1", 1: "2BIO2", 2: "2BIO3", 3: "MP*1", 4: "MP*2", 5: "PSI*", 6: "PC", 7: "PC*", 8: "MP"}
    try:
        classes = [classroom[H.X[i]] for i in range(N)]
    except (AttributeError, IndexError, KeyError) as e:
        # Raise instead of calling exit(), which would stop the notebook session
        raise ValueError("Error accessing H.X. Please check its structure.") from e
    
    df = pd.DataFrame({
        'class': classes,
        'membership_vector': list(U_normalized)
    })
    
    print(f"Created DataFrame for {N} students and {K} communities.")
        
    return df

In [5]:
df = create_dataframe(H, U)
df.head()

Created DataFrame for 327 students and 9 communities.


|   | class | membership_vector |
|---|-------|-------------------|
| 0 | MP | [6.084673452719183e-133, 0.0, 0.0, 1.0, 0.0, 6... |
| 1 | MP | [5.099610544950103e-35, 0.0, 0.0, 1.0, 0.0, 0.... |
| 2 | 2BIO3 | [0.0, 0.0, 0.0, 0.043312838837001, 0.0, 0.0, 0... |
| 3 | 2BIO3 | [0.0, 2.3935928889182956e-120, 0.0, 0.0, 0.042... |
| 4 | PC* | [0.9497389805800429, 0.05026101941995713, 0.0,... |
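To make the normalization inside `create_dataframe` concrete, the sketch below applies the same clipping and row normalization to a small made-up matrix. The values in `U_toy` are illustrative only and are not inferred from the high-school data.

```python
import numpy as np

U_toy = np.array([
    [2.0, 6.0, 0.0],   # ordinary row
    [-1.0, 3.0, 1.0],  # the negative entry is clipped to 0 first
    [0.0, 0.0, 0.0],   # an all-zero row stays all-zero
])

# Clip negative entries, then divide each row by its sum
U_clipped = np.clip(U_toy, 0, None)
row_sums = U_clipped.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1  # avoid division by zero for empty rows
U_toy_normalized = U_clipped / row_sums

print(U_toy_normalized)
```

Each nonzero row of the result sums to 1, so its entries can be read as membership proportions across communities.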


We define the `generate_prompt` function to create a text prompt for the LLM. 
This function performs the following steps:

1.  **Identify Representative Members**: For each inferred community, it identifies "representative members", i.e., nodes that have a high membership probability (default: $\ge 0.9$) for that community.
2.  **Construct Prompt Text**: It constructs a prompt that instructs the LLM to act as an expert in social network analysis. The prompt includes the list of representative members (with their class labels) for each community.
3.  **Save to File**: The generated prompt is saved as a text file (e.g., `./prompts/contact-high-school_hypermosbm_prompt.txt`).

This text file can then be copied and pasted into an LLM (like ChatGPT or Gemini) to obtain interpretable labels and descriptions for each community.
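The selection of representative members can be illustrated on a toy DataFrame with the same two columns. This is a minimal sketch: the data are made up, and the helper `representative_members` is a hypothetical stand-alone version of the filtering step, not the function defined in the next cell.

```python
import pandas as pd

# Made-up data: four nodes, two communities
toy_df = pd.DataFrame({
    "class": ["MP", "MP", "PC", "PC*"],
    "membership_vector": [
        [0.95, 0.05],
        [0.99, 0.01],
        [0.50, 0.50],
        [0.02, 0.98],
    ],
})

def representative_members(df, k, prob_threshold=0.9):
    """Return rows whose membership in community k is at least
    prob_threshold, sorted from highest to lowest probability."""
    probs = df["membership_vector"].apply(lambda vec: vec[k])
    selected = df.assign(target_prob=probs)
    selected = selected[selected["target_prob"] >= prob_threshold]
    return selected.sort_values(by="target_prob", ascending=False)

for k in range(2):
    reps = representative_members(toy_df, k)
    print(f"Community {k}:", list(reps["class"]))
```

The node with membership vector `[0.50, 0.50]` is not representative of either community, which is the intended effect of a high threshold: only nodes that belong clearly to one community are shown to the LLM.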

In [6]:
def generate_prompt(
    hypergraph_name: str,
    df: pd.DataFrame,
    prob_threshold: float = 0.9,
    output_directory: str = "./prompts",
):

    def create_representative_nodes_text(dataframe):
        K_count = len(dataframe['membership_vector'].iloc[0])
        all_communities_text = []
        
        for k in range(K_count):
            community_text = []
            community_text.append(f"**Community {k}**\n")
            
            df_copy = dataframe.copy()
            df_copy['target_prob'] = df_copy['membership_vector'].apply(lambda vec: vec[k])
            
            nodes_above_threshold = df_copy[df_copy['target_prob'] >= prob_threshold]
            nodes_sorted = nodes_above_threshold.sort_values(by='target_prob', ascending=False)
            
            community_text.append(f"*   **Representative members (Membership Probability >= {prob_threshold}): {len(nodes_sorted)} members found.**")
            
            if nodes_sorted.empty:
                community_text.append("    1.  (No members met the threshold for this community)")
            else:
                # List members by rank (node IDs are not included) with their metadata
                for i, (_, row) in enumerate(nodes_sorted.iterrows()):
                    # Assuming 'class' column contains the metadata
                    node_metadata = row['class'] 
                    prob = row['target_prob']
                    community_text.append(f"    {i+1}. (Prob: {prob:.3f}) {node_metadata}")
            
            all_communities_text.append("\n".join(community_text))
            
        return "\n\n---\n\n".join(all_communities_text)

    U_matrix = np.vstack(df['membership_vector'].values)
    K = U_matrix.shape[1]
    
    rep_nodes_text = create_representative_nodes_text(df)

    prompt_template = f"""
You are an expert in social network analysis.
I will provide you with data on {K} communities of individuals, extracted from a contact network using a machine learning model. Your task is to analyze this data, infer the social group that each community represents, and provide a suitable English label for each, along with your reasoning.

**Analysis Steps:**
1.  Analyze the "Representative Members List" for each community. This list shows the metadata (e.g., class affiliation) of members who have a membership probability of {prob_threshold * 100:g}% or higher for that community.
2.  Based on this information, interpret what kind of group each community is (e.g., 'MP', 'Mainly Class 2BIO1', 'Mixed Group of MP*1 and MP*2'). Then, create your response following the requirements below.

**Output Requirements:**
*   **Label:** A concise English label that represents the nature of the community. If a metadata class name accurately describes the community, prioritize using that name without modification. If there are two or more communities with the same label, differentiate them (e.g., 'First Group ...', 'Second Group ...').
*   **Composition:** Describe which classes primarily constitute this community, based on the representative members list.
*   **Rationale:** Explain your reasoning for assigning the label, specifically based on the class composition of the representative members.

**Output Format:**
Strictly adhere to the following Markdown format for each community.

```markdown
### Community Labeling Results

**Community 0**
*   **Label:** [Enter the English label here]
*   **Composition:** [Describe the class composition]
*   **Rationale:** [Provide your reasoning]

---

**Community 1**
*   **Label:** [Enter the English label here]
*   **Composition:** [Describe the class composition]
*   **Rationale:** [Provide your reasoning]

---
(... continue for all communities up to Community {K-1} ...)

---

**Summary Output:**
Finally, provide a summary of all community labels in the following Python dictionary format.

```python
{{
    0: "Label for Community 0",
    1: "Label for Community 1",
    2: "Label for Community 2",
    ...
    # (continue up to Community {K-1})
}}

### **Data for Analysis**
---

Representative Members List for Each Community

{rep_nodes_text}

Your response:

"""
    
    # Create the output directory if it does not already exist
    os.makedirs(output_directory, exist_ok=True)
        
    file_path = os.path.join(output_directory, f"{hypergraph_name}_hypermosbm_prompt.txt")
    
    try:
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(prompt_template.strip())
        print(f"\nSuccessfully generated and saved prompt to: {file_path}")
    except Exception as e:
        print(f"\nError saving prompt file: {e}")

In [7]:
generate_prompt(hypergraph_name=hypergraph_name, df=df)


Successfully generated and saved prompt to: ./prompts/contact-high-school_hypermosbm_prompt.txt
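Once the LLM has responded, the summary dictionary requested in the prompt can be converted into a Python object. The sketch below is one possible way to do this and is not part of the original workflow: the helper `extract_label_dict` is hypothetical, and `response_text` is a made-up fragment rather than real LLM output. It relies on `ast.literal_eval`, which evaluates only literals and therefore does not execute code contained in the response.

```python
import ast
import re

def extract_label_dict(response_text):
    """Parse the last {...} block in the response as a dict of labels."""
    matches = re.findall(r"\{[^{}]*\}", response_text)
    if not matches:
        raise ValueError("No dictionary found in the response.")
    labels = ast.literal_eval(matches[-1])
    if not isinstance(labels, dict):
        raise ValueError("The parsed block is not a dictionary.")
    return labels

# Made-up response fragment, for illustration only
response_text = '''
Summary:
{
    0: "PC*",
    1: "Mainly Class 2BIO1",
}
'''

labels = extract_label_dict(response_text)
print(labels[1])
```

The resulting dictionary maps community indices to labels. A real response may deviate from the requested format, in which case the parsing step would need to be adapted.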
