<div style="text-align: right;">© 2026 Moses Boudourides. All Rights Reserved.</div>

# LLMs for Qualitative and Mixed-Methods Social Network Analysis (SNA)
## Moses Boudourides

# Session 6: Transition to Practice and Research Agendas

In [1]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
from pyvis.network import Network
import os
import json
import sys
import hashlib
import random
from sklearn.datasets import fetch_20newsgroups
import IPython
from openai import OpenAI
# import google.generativeai as genai

In [2]:
# --- 1. & 2. KEY LOADING & INITIALIZATION ---

# Force Google to use REST to avoid ALTS/GCP credential errors
os.environ["GOOGLE_API_USE_MTLS"] = "never" 

def get_api_key(file_path):
    if os.path.exists(file_path):
        with open(file_path, 'r') as f:
            return f.read().strip().replace('"', '').replace("'", "")
    return None

oa_key = get_api_key("openai_key.txt")
# gem_key = get_api_key("gemini_key.txt")

# Initialize OpenAI
client_oa = OpenAI(api_key=oa_key)

# # Initialize Gemini using 'rest' transport to bypass gRPC/ALTS errors
# genai.configure(api_key=gem_key, transport='rest')

# # Dynamic Model Selection
# available_models = [m.name for m in genai.list_models() if 'generateContent' in m.supported_generation_methods]
# target_model = 'gemini-1.5-flash' if 'models/gemini-1.5-flash' in available_models else available_models[0].split('/')[-1]
# model_gemini = genai.GenerativeModel(target_model)

## Part 1: Transition to Practice and Research Agendas

**Goal:**  
Provide reusable templates and conceptual scaffolding for independent research using LLM-augmented qualitative SNA.

**Important Note:** This notebook is a **template**, not a pipeline. It is designed to be adapted, critiqued, and extended.

In [3]:
# --- 3. DATA & PERSISTENT QUERY STEP ---

# 2. Persistence Logic
CACHE_FILE = "llm_cache_s6.json"

if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, "r") as f:
        cache = json.load(f)
else:
    cache = {}

def get_label(model_id, text, api_func, prompt_type="practice"):
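    # Key on the model, the prompt type, and a hash of the full text, so that texts
    # sharing their first characters cannot collide. Changing the prompt wording
    # itself still requires a new prompt_type (or clearing the cache file).
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_key = f"{model_id}_{prompt_type}_{text_hash}"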
    
    if cache_key in cache:
        return cache[cache_key]
    
    # Cache Miss: Call API
    result = api_func(text)
    cache[cache_key] = result
    
    # Save updated cache to disk
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    return result

# 3. API Execution Wrappers
def query_openai_extraction(text):
    prompt = f"""Extract relationships and their meanings from this text.
Identify the type of relationship (e.g., rivalry, mentorship, collaboration).

Text: {text}"""
    res = client_oa.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    return res.choices[0].message.content.strip()

## Research Design Checklist

Before using LLMs in your research, clarify these essential elements.

In [4]:
# Research Design Metadata Template
research_design = {
    "question": "How do relational meanings shape network structure in online communities?",
    "qualitative_sources": "Newsgroup posts, interviews, ethnographic notes",
    "network_concept": "Directed graph of interaction patterns",
    "llm_role": "Provisional extraction and interpretation of relational meanings",
    "ethical_risks": "Hallucination, misrepresentation, privacy violations",
    "mitigations": "Human validation, anonymization, audit trails"
}

print("RESEARCH DESIGN CHECKLIST:")
print("="*60)
print()
for key, value in research_design.items():
    print(f"{key.upper()}:")
    print(f"  {value}")
    print()

RESEARCH DESIGN CHECKLIST:

QUESTION:
  How do relational meanings shape network structure in online communities?

QUALITATIVE_SOURCES:
  Newsgroup posts, interviews, ethnographic notes

NETWORK_CONCEPT:
  Directed graph of interaction patterns

LLM_ROLE:
  Provisional extraction and interpretation of relational meanings

ETHICAL_RISKS:
  Hallucination, misrepresentation, privacy violations

MITIGATIONS:
  Human validation, anonymization, audit trails



## Data Layer

Describe who produced the data, in what context, and with what expectations.

In [5]:
# Data Layer: Qualitative texts
texts = [
    "Alice and Bob work together on the new project.",
    "Bob and Carol are friends and often collaborate.",
    "Carol mentors David, a new member of the team."
]

df_data = pd.DataFrame({"text": texts})

print("DATA LAYER:")
print("="*60)
print()
print("Data Source Metadata:")
print("- Producer: Simulated organizational texts")
print("- Context: Team collaboration and mentorship")
print("- Expectations: Explicit relationship descriptions")
print()
print("Data Corpus:")
print(df_data.to_string())

DATA LAYER:

Data Source Metadata:
- Producer: Simulated organizational texts
- Context: Team collaboration and mentorship
- Expectations: Explicit relationship descriptions

Data Corpus:
                                               text
0   Alice and Bob work together on the new project.
1  Bob and Carol are friends and often collaborate.
2    Carol mentors David, a new member of the team.


## Interpretive Coding Layer

Coding schemes are theoretical instruments. Document their evolution.

In [6]:
# Interpretive Coding Layer
codes = {
    ("Alice", "Bob"): "collaborative work",
    ("Bob", "Carol"): "friendship and collaboration",
    ("Carol", "David"): "mentorship"
}

print("INTERPRETIVE CODING LAYER:")
print("="*60)
print()
print("Coding Scheme:")
for pair, code in codes.items():
    print(f"  {pair[0]} -- {pair[1]}: {code}")
print()
print("Note: Coding schemes evolve through iterative analysis.")

INTERPRETIVE CODING LAYER:

Coding Scheme:
  Alice -- Bob: collaborative work
  Bob -- Carol: friendship and collaboration
  Carol -- David: mentorship

Note: Coding schemes evolve through iterative analysis.
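
One way to document how a coding scheme evolves is to keep an append-only revision history for each coded pair instead of overwriting the code. The sketch below is a minimal illustration under that assumption; `CodeRevision`, `revise`, `current_codes`, and the example rationales and dates are hypothetical and not part of the notebook.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CodeRevision:
    """One dated revision of the code assigned to a pair of actors."""
    code: str
    rationale: str
    revised_on: date


# Append-only history: pair -> list of revisions, oldest first.
code_history = {}


def revise(pair, code, rationale, revised_on):
    """Record a new code for `pair` without discarding earlier ones."""
    code_history.setdefault(pair, []).append(CodeRevision(code, rationale, revised_on))


def current_codes():
    """Return the latest code per pair, in the same shape as the `codes` dict."""
    return {pair: revisions[-1].code for pair, revisions in code_history.items()}


# Illustrative entries only (hypothetical rationales and dates).
revise(("Bob", "Carol"), "collaboration",
       "First pass: the text mentions collaborating", date(2026, 1, 10))
revise(("Bob", "Carol"), "friendship and collaboration",
       "Second pass: 'friends' is explicit in the text", date(2026, 1, 17))

print(current_codes())
```

Keeping the superseded revisions makes it possible to report when and why a code changed, which a single overwritten dictionary cannot do.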


## LLM Interaction Template

For each LLM use, record purpose, prompt, model, output, and researcher decision.

In [7]:
# LLM Log Template
llm_log = []

print("LLM INTERACTION TEMPLATE:")
print("="*60)
print()

for i, text in enumerate(texts):
    extraction = get_label("openai", text, query_openai_extraction, "extraction")
    
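    log_entry = {
        "iteration": i + 1,
        "purpose": "Extract relational meanings",
        "input_text": text[:50],
        "model": "gpt-4o-mini",
        # Truncated for display; a real audit trail should keep the full output.
        "output": extraction[:100],
        # Placeholder value: record this only after a human has actually
        # compared the output with the coding scheme.
        "researcher_decision": "Validated against coding scheme"
    }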
    llm_log.append(log_entry)
    
    print(f"Iteration {i+1}:")
    print(f"  Text: {text}")
    print(f"  LLM Output: {extraction[:80]}...")
    print()

LLM INTERACTION TEMPLATE:

Iteration 1:
  Text: Alice and Bob work together on the new project.
  LLM Output: From the text provided, the following relationship can be extracted:

- **Relati...

Iteration 2:
  Text: Bob and Carol are friends and often collaborate.
  LLM Output: 1. **Relationship**: Bob and Carol  
   **Type**: Friendship  
   **Meaning**: A...

Iteration 3:
  Text: Carol mentors David, a new member of the team.
  LLM Output: 1. **Relationship:** Carol and David  
   **Type:** Mentorship  
   **Meaning:**...
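
The five fields named at the start of this section (purpose, prompt, model, output, researcher decision) can be captured in one record per call and appended to a line-delimited JSON file. This is a sketch under stated assumptions: `make_log_entry`, `append_jsonl`, the default decision string, and the file name `llm_log_s6.jsonl` are hypothetical names introduced here.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_log_entry(purpose, prompt, model, output, decision="pending human review"):
    """Build one audit-trail record that keeps the full prompt and full output."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "purpose": purpose,
        "prompt": prompt,
        # A hash makes it easy to check later that the logged prompt is unchanged.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "model": model,
        "output": output,
        "researcher_decision": decision,
    }


def append_jsonl(entry, path="llm_log_s6.jsonl"):
    """Append one record as a single JSON line (hypothetical file name)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


entry = make_log_entry(
    purpose="Extract relational meanings",
    prompt="Extract relationships and their meanings from this text.\n\n"
           "Text: Carol mentors David, a new member of the team.",
    model="gpt-4o-mini",
    output="(full model response goes here)",
)
# append_jsonl(entry)  # uncomment to write the record to disk
```

The decision field defaults to a pending state so that a log entry never claims validation before a researcher has reviewed the output.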



## Network Construction Layer

Network edges are interpretive artifacts. Justify inclusion and exclusion.

In [8]:
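# Network Construction Layer
# Note: nx.Graph() is undirected, so the direction of mentorship (Carol -> David)
# is not stored. Use nx.DiGraph() if direction matters for the analysis.
G_practice = nx.Graph()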

# Add edges based on interpretive coding
edges_practice = [
    ("Alice", "Bob", {"meaning": "collaborative work"}),
    ("Bob", "Carol", {"meaning": "friendship and collaboration"}),
    ("Carol", "David", {"meaning": "mentorship"})
]

for u, v, attr in edges_practice:
    G_practice.add_edge(u, v, **attr)

print("NETWORK CONSTRUCTION LAYER:")
print("="*60)
print()
print(f"Nodes: {list(G_practice.nodes())}")
print(f"Edges: {list(G_practice.edges())}")
print()
print("Edge Meanings:")
for u, v, d in G_practice.edges(data=True):
    print(f"  {u} -- {v}: {d['meaning']}")

NETWORK CONSTRUCTION LAYER:

Nodes: ['Alice', 'Bob', 'Carol', 'David']
Edges: [('Alice', 'Bob'), ('Bob', 'Carol'), ('Carol', 'David')]

Edge Meanings:
  Alice -- Bob: collaborative work
  Bob -- Carol: friendship and collaboration
  Carol -- David: mentorship
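
To make inclusion and exclusion decisions explicit, each candidate tie can carry its own justification, including ties that were considered and left out. The sketch below is library-free and purely illustrative; `TieDecision`, the justification strings, and the excluded Alice/David candidate are assumptions added for the example.

```python
from dataclasses import dataclass


@dataclass
class TieDecision:
    """One candidate tie together with the reason it was kept or dropped."""
    source: str
    target: str
    included: bool
    meaning: str
    justification: str


decisions = [
    TieDecision("Alice", "Bob", True, "collaborative work",
                "Text 0 states that they work together"),
    TieDecision("Bob", "Carol", True, "friendship and collaboration",
                "Text 1 names both relations"),
    TieDecision("Carol", "David", True, "mentorship",
                "Text 2 states that Carol mentors David"),
    # Hypothetical excluded candidate: no text mentions this pair together.
    TieDecision("Alice", "David", False, "(none coded)",
                "No text links the pair directly"),
]

included = [d for d in decisions if d.included]
excluded = [d for d in decisions if not d.included]

# Attribute dicts in the form accepted by G_practice.add_edge(u, v, **attrs).
edge_attrs = {
    (d.source, d.target): {"meaning": d.meaning, "justification": d.justification}
    for d in included
}

for d in excluded:
    print(f"Excluded {d.source} -- {d.target}: {d.justification}")
```

Recording the excluded candidates alongside the included ones keeps the absence of an edge interpretable later.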


## Interpretation Layer

Ask: What does this network represent? What does it obscure? How does it relate to theory?

In [9]:
print("INTERPRETATION LAYER:")
print("="*60)
print()
print("Critical Questions:")
print()
print("1. What does this network represent?")
print("   - A small team with mentorship and collaborative relationships")
print("   - Directed flows of knowledge and support")
print()
print("2. What does it obscure?")
print("   - Temporal dynamics (how relationships change over time)")
print("   - Emotional dimensions (satisfaction, conflict)")
print("   - Structural holes and brokerage roles")
print()
print("3. How does it relate to theory?")
print("   - Social capital theory: Mentorship as knowledge transfer")
print("   - Network analysis: Density and clustering")
print("   - Qualitative SNA: Meaning-making in relationships")

INTERPRETATION LAYER:

Critical Questions:

1. What does this network represent?
   - A small team with mentorship and collaborative relationships
   - Directed flows of knowledge and support

2. What does it obscure?
   - Temporal dynamics (how relationships change over time)
   - Emotional dimensions (satisfaction, conflict)
   - Structural holes and brokerage roles

3. How does it relate to theory?
   - Social capital theory: Mentorship as knowledge transfer
   - Network analysis: Density and clustering
   - Qualitative SNA: Meaning-making in relationships
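
The structural measures mentioned under "Network analysis" can be computed by hand for a graph this small, which helps show how little they say on their own. The sketch below rebuilds the three-edge practice network with plain Python; calling `nx.density(G_practice)` should give the same density value.

```python
from itertools import combinations

edges = [("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "David")]
nodes = sorted({node for edge in edges for node in edge})

# Undirected adjacency sets.
neighbors = {node: set() for node in nodes}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

n_nodes, n_edges = len(nodes), len(edges)

# Undirected density: observed edges over possible edges, 2m / (n(n-1)).
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

degree = {node: len(neighbors[node]) for node in nodes}

# A triangle is a triple of nodes that are all mutually connected.
triangles = sum(
    1
    for a, b, c in combinations(nodes, 3)
    if b in neighbors[a] and c in neighbors[a] and c in neighbors[b]
)

print(f"density={density:.2f}, degree={degree}, triangles={triangles}")
```

The network is a simple path, so there are no triangles and every clustering coefficient is zero; the numbers describe shape, not the meanings coded on each tie.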


## Ethical Reflection Layer

Document potential harms, representation risks, and disclosure strategies.

In [10]:
print("ETHICAL REFLECTION LAYER:")
print("="*60)
print()
print("Potential Harms:")
print("- Misrepresentation of relationships")
print("- Privacy violations if identities are disclosed")
print("- Stigmatization of certain actors or relationships")
print()
print("Representation Risks:")
print("- Network visualization may oversimplify complexity")
print("- Absence of edges may be misinterpreted as lack of relationship")
print("- Centrality measures may be misused for evaluation")
print()
print("Disclosure Strategies:")
print("- Use anonymization and aggregation")
print("- Provide context and caveats")
print("- Obtain informed consent for any publication")
print("- Allow participants to review and comment")

ETHICAL REFLECTION LAYER:

Potential Harms:
- Misrepresentation of relationships
- Privacy violations if identities are disclosed
- Stigmatization of certain actors or relationships

Representation Risks:
- Network visualization may oversimplify complexity
- Absence of edges may be misinterpreted as lack of relationship
- Centrality measures may be misused for evaluation

Disclosure Strategies:
- Use anonymization and aggregation
- Provide context and caveats
- Obtain informed consent for any publication
- Allow participants to review and comment
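
As one possible anonymization step, actor names can be replaced with keyed pseudonyms. A plain hash of a name from a small, guessable pool can be reversed by hashing every candidate, whereas a keyed hash (HMAC) cannot be recomputed without the key. The sketch below assumes a per-project secret kept apart from the data; `PROJECT_KEY` and `pseudonymize` are hypothetical names.

```python
import hashlib
import hmac
import secrets

# Assumption: generated once per project and stored separately from the data.
PROJECT_KEY = secrets.token_bytes(32)


def pseudonymize(name, key=PROJECT_KEY, length=8):
    """Return a stable keyed pseudonym for `name` (first `length` hex characters)."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


for actor in ["Alice", "Bob", "Carol", "David"]:
    print(actor, "->", pseudonymize(actor))
```

Short pseudonyms are easier to read in a figure but collide more often as the number of actors grows, so the length is a trade-off. Pseudonyms alone do not prevent re-identification from the quoted text or from network position.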


## Replication Package Checklist

Ensure reproducibility and transparency by including all necessary materials.

In [11]:
replication_checklist = pd.DataFrame([
    {"Item": "Code", "Status": "✓", "Notes": "All analysis scripts included"},
    {"Item": "Prompts", "Status": "✓", "Notes": "All LLM prompts documented"},
    {"Item": "Logs", "Status": "✓", "Notes": "Audit trails for all LLM calls"},
    {"Item": "Interpretive Notes", "Status": "✓", "Notes": "Coding decisions documented"},
    {"Item": "Ethical Disclosures", "Status": "✓", "Notes": "Risks and mitigations listed"},
    {"Item": "Data (Anonymized)", "Status": "✓", "Notes": "Can be shared if consented"}
])

print("REPLICATION PACKAGE CHECKLIST:")
print("="*60)
print()
print(replication_checklist.to_string(index=False))

REPLICATION PACKAGE CHECKLIST:

               Item Status                          Notes
               Code      ✓  All analysis scripts included
            Prompts      ✓     All LLM prompts documented
               Logs      ✓ Audit trails for all LLM calls
 Interpretive Notes      ✓    Coding decisions documented
Ethical Disclosures      ✓   Risks and mitigations listed
  Data (Anonymized)      ✓     Can be shared if consented
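
A checklist is easier to verify when the package also ships a manifest of its files. The sketch below records size and SHA-256 checksum for each file and flags anything missing. The two file names are the ones this notebook writes; `sha256_of` and `build_manifest` are hypothetical helpers.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(paths):
    """Map each file to its size and checksum, or flag it as missing."""
    manifest = {}
    for p in map(Path, paths):
        if p.exists():
            manifest[str(p)] = {"bytes": p.stat().st_size, "sha256": sha256_of(p)}
        else:
            manifest[str(p)] = {"missing": True}
    return manifest


# Extend this list with scripts, prompts, and interpretive notes as needed.
manifest = build_manifest(["llm_cache_s6.json", "news_snapshot_m100_516e4df7.csv"])
print(json.dumps(manifest, indent=2))
```

A reader of the replication package can rerun the same function and compare checksums to confirm that the cached LLM outputs and the sampled data are the ones used in the analysis.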


## Open Research Directions

Consider these extensions and future research directions.

In [12]:
print("OPEN RESEARCH DIRECTIONS:")
print("="*60)
print()
print("1. Narrative Multiplexity")
print("   - How do different narratives about the same relationship coexist?")
print("   - Can we model multiple relational meanings simultaneously?")
print()
print("2. Dynamic Relational Meanings")
print("   - How do relationship meanings change over time?")
print("   - Can LLMs track semantic drift in relationships?")
print()
print("3. Role Transitions")
print("   - How do actors shift between roles (mentor, peer, subordinate)?")
print("   - What triggers role changes?")
print()
print("4. Interpretive Uncertainty")
print("   - How do we quantify ambiguity in relational meanings?")
print("   - Can we model disagreement between coders?")

OPEN RESEARCH DIRECTIONS:

1. Narrative Multiplexity
   - How do different narratives about the same relationship coexist?
   - Can we model multiple relational meanings simultaneously?

2. Dynamic Relational Meanings
   - How do relationship meanings change over time?
   - Can LLMs track semantic drift in relationships?

3. Role Transitions
   - How do actors shift between roles (mentor, peer, subordinate)?
   - What triggers role changes?

4. Interpretive Uncertainty
   - How do we quantify ambiguity in relational meanings?
   - Can we model disagreement between coders?
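
For the question about coder disagreement, one common starting point is Cohen's kappa, which corrects raw agreement between two coders for the agreement expected by chance. The sketch below implements it with the standard library; the two label lists are made-up illustrations, not results from this notebook, and `cohen_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items on which the coders match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label; returning 1.0 is a convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six ties (illustration only).
human = ["mentorship", "collaboration", "friendship", "collaboration", "rivalry", "mentorship"]
llm = ["mentorship", "collaboration", "collaboration", "collaboration", "rivalry", "friendship"]

kappa = cohen_kappa(human, llm)
print(f"kappa = {kappa:.3f}")
```

Kappa treats every disagreement as equally serious, so it is only a first summary; it does not capture cases where two codes are both defensible readings of the same text.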


## Session 6 Takeaway: Seminar Closure

LLMs do not make qualitative SNA obsolete. They make its epistemic commitments more visible, more scalable, and more consequential.

## Part 2: Final Synthesis with 20 Newsgroups Dataset

Apply the complete research design template to the 20 Newsgroups dataset.

In [13]:
# --- CONFIGURATION ---
n = 20  # Number of Nodes (Researchers)
m = 100  # Number of Edges (Interactions/posts)

# Dataset Description
# The 20 Newsgroups dataset is a collection of approximately 18,000 Usenet posts
# from the early days of the internet, spread across 20 newsgroups and loaded here
# with sklearn.datasets.fetch_20newsgroups.
# In principle, the posts can be displayed as a social network (a directed weighted
# multigraph) among thousands of unique posters replying to one another.
# In this notebook, however, headers are stripped and posts are assigned at random
# to n placeholder "Researcher" nodes, so the resulting structure is simulated.

# Generate a unique filename based on m to avoid mixing samples
config_hash = hashlib.md5(f"{m}_newsgroups_s6".encode()).hexdigest()[:8]
SNAPSHOT_FILE = f"news_snapshot_m{m}_{config_hash}.csv"

# CHECK IF WE ALREADY HAVE THE COMPLETE DATA
if os.path.exists(SNAPSHOT_FILE):
    print(f"✅ LOADING PERMANENT SNAPSHOT: {SNAPSHOT_FILE}")
    interactions = pd.read_csv(SNAPSHOT_FILE)
else:
    print("🚀 SNAPSHOT NOT FOUND. GENERATING NEW SAMPLE...")
    
    # 1. Fetch the big dataset (11,000+ posts)
    newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
    full_df = pd.DataFrame({'text': newsgroups.data})
    
    # 2. Filter and Sample M posts
    df = full_df[full_df['text'].str.strip().str.len() > 20].copy()
    subset = df.sample(n=m, random_state=42).reset_index(drop=True)
    
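    # 3. Assign the Social Structure (Source/Target)
    # Note: sources and targets are drawn at random without a seed, so the ties are
    # synthetic and differ between fresh runs; the saved snapshot keeps later runs stable.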
    user_pool = [f"Researcher_{i:02d}" for i in range(n)]
    sources = [random.choice(user_pool) for _ in range(m)]
    targets = [random.choice([u for u in user_pool if u != s]) for s in sources]

    interactions = pd.DataFrame({
        "source": sources,
        "target": targets,
        "text": subset['text'].str[:300].replace('\n', ' ', regex=True)
    })

    # 4. IMMEDIATELY SAVE (before LLM processing)
    interactions.to_csv(SNAPSHOT_FILE, index=False)
    print(f"💾 PERMANENTLY SAVED: {SNAPSHOT_FILE}")

print(f"\n--- READY: {len(interactions)} interactions between {n} nodes ---")
interactions.head()

🚀 SNAPSHOT NOT FOUND. GENERATING NEW SAMPLE...
💾 PERMANENTLY SAVED: news_snapshot_m100_516e4df7.csv

--- READY: 100 interactions between 20 nodes ---


          source         target  text
0  Researcher_10  Researcher_16  In case you missed it on the news....the first...
1  Researcher_02  Researcher_08  We have no way of knowing because we cann...
2  Researcher_12  Researcher_14  The lengthy article you quote doesn't imply ...
3  Researcher_19  Researcher_18  The recent rise of nostalgia in this group, co...
4  Researcher_08  Researcher_12  # ## Absolutely nothing, seeing as there is no...


## Final Pipeline: Relational Extraction with Meaning Attribution

Integrate textual meaning into relational attributes.

In [14]:
# Sample interactions for analysis
sample_size = min(5, len(interactions))
sample_interactions = interactions.sample(n=sample_size, random_state=42)

print("FINAL PIPELINE: Relational Extraction")
print("="*60)
print()

# Build final network with meanings
G_final = nx.DiGraph()

for _, row in sample_interactions.iterrows():
    source = row['source']
    target = row['target']
    text = row['text']
    
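    # Determine tie type with a crude keyword heuristic. This is substring matching,
    # so "help" also matches "helped" and "good" also matches "goods".
    positive_words = ["good", "great", "help", "support", "appreciate"]
    tie_type = "Positive" if any(word in text.lower() for word in positive_words) else "Neutral"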
    
    # Extract meaning with LLM
    meaning = get_label("openai", text, query_openai_extraction, "final_extraction")
    
    # Add edge with attributes
    G_final.add_edge(source, target, 
                     tie_type=tie_type, 
                     narrative=text[:50],
                     meaning=meaning[:50])
    
    print(f"{source} -> {target}")
    print(f"  Type: {tie_type}")
    print(f"  Text: {text[:60]}...")
    print()

FINAL PIPELINE: Relational Extraction

Researcher_13 -> Researcher_16
  Type: Neutral
  Text:     Ok boys & girls, hang on; here we go!      Christ's Eter...

Researcher_05 -> Researcher_01
  Type: Positive
  Text:     J.N. Darby was one of the founders of the "Plymouth Bret...

Researcher_19 -> Researcher_02
  Type: Neutral
  Text:  I missed the presentations given in the morning session (wh...

Researcher_07 -> Researcher_04
  Type: Neutral
  Text:  I think the original poster meant opening the mouse, not ju...

Researcher_15 -> Researcher_01
  Type: Neutral
  Text:   [After a small refresh Hasan got on the track again.]     ...
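
The keyword test used for `tie_type` matches substrings, which can label a post as Positive for incidental reasons. The sketch below contrasts that behavior with whole-word matching using a regular expression. It is an illustrative alternative, not the notebook's method, and `tie_type_substring` and `tie_type_whole_word` are hypothetical names.

```python
import re

POSITIVE_WORDS = ["good", "great", "help", "support", "appreciate"]

# \b anchors each keyword at word boundaries, so "goods" no longer matches "good".
POSITIVE_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, POSITIVE_WORDS)) + r")\b",
    re.IGNORECASE,
)


def tie_type_substring(text):
    """Same logic as the pipeline cell: any keyword found anywhere in the text."""
    return "Positive" if any(w in text.lower() for w in POSITIVE_WORDS) else "Neutral"


def tie_type_whole_word(text):
    """Stricter variant: a keyword must appear as a complete word."""
    return "Positive" if POSITIVE_RE.search(text) else "Neutral"


for sample in ["The goods arrived late.", "Thanks for the help!"]:
    print(sample, "|", tie_type_substring(sample), "|", tie_type_whole_word(sample))
```

Whole-word matching trades one error for another: it stops matching "goods" but also stops matching inflected forms such as "helped". Either way the label describes vocabulary in a post, not the quality of a relationship, so it should be read as provisional.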



In [15]:
print("\nFinal Network Summary:")
print(f"Nodes: {G_final.number_of_nodes()}")
print(f"Edges: {G_final.number_of_edges()}")
print()
print("Edges with Qualitative Attributes:")
for u, v, d in G_final.edges(data=True):
    print(f"{u} -> {v}: {d['tie_type']} ({d['narrative']})")


Final Network Summary:
Nodes: 9
Edges: 5

Edges with Qualitative Attributes:
Researcher_13 -> Researcher_16: Neutral (    Ok boys & girls, hang on; here we go!      Chr)
Researcher_05 -> Researcher_01: Positive (    J.N. Darby was one of the founders of the "Ply)
Researcher_19 -> Researcher_02: Neutral ( I missed the presentations given in the morning s)
Researcher_07 -> Researcher_04: Neutral ( I think the original poster meant opening the mou)
Researcher_15 -> Researcher_01: Neutral (  [After a small refresh Hasan got on the track ag)


In [16]:
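# Pseudonymize for visualization.
# Note: an unsalted hash of a small, guessable name pool can be reversed by hashing
# every candidate name, so treat this as display masking rather than full anonymization.
def anonymize_user(username):
    return hashlib.sha256(username.encode()).hexdigest()[:8]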

# Create anonymized version
G_anon = nx.DiGraph()
for u, v, d in G_final.edges(data=True):
    G_anon.add_edge(anonymize_user(u), anonymize_user(v), **d)

# Visualize
net = Network(height="500px", width="100%", directed=True, bgcolor="#ffffff")

# Add Nodes (labels only)
for node in G_anon.nodes():
    net.add_node(
        node, 
        label=node, 
        shape='dot',
        size=1,
        color='#ffffff',
        borderWidth=0,
        font={'size': 12, 'color': 'black', 'align': 'center'}
    )
    
# Add Edges with colors based on tie type
for source, target, d in G_anon.edges(data=True):
    edge_color = '#00aa00' if d.get('tie_type') == 'Positive' else '#cccccc'
    net.add_edge(
        source, 
        target, 
        color=edge_color,
        arrows={'to': {'enabled': True, 'scaleFactor': 0.5}},
        smooth={'type': 'curvedCW', 'roundness': 0.2},
        title=d.get('meaning', 'N/A')
    )

# Physics
net.set_options("""
var options = {
  "physics": {
    "barnesHut": { "gravitationalConstant": -3000, "springLength": 150 }
  }
}
""")

html_content = net.generate_html()
with open("newsgroups_graph_s6.html", "w", encoding="utf-8") as f:
    f.write(html_content)

IPython.display.IFrame(src="newsgroups_graph_s6.html", width='100%', height='550px')