# FCM Extractor Tutorial: Working with the Codebase

This notebook is a tutorial for working with the FCM Extractor codebase, a tool for extracting Fuzzy Cognitive Maps (FCMs) from interview transcripts using NLP, clustering, and semantic analysis.

## Table of Contents
1. [Project Overview](#project-overview)
2. [Setting Up the Environment](#setting-up-environment)
3. [Understanding the Codebase Structure](#codebase-structure)
4. [Basic Usage Examples](#basic-usage)
5. [Advanced Configuration](#configuration)
6. [Pipeline Components](#pipeline-components)
7. [Working with Results](#results)
8. [Common Tasks and Examples](#examples)
9. [Troubleshooting](#troubleshooting)

## 1. Project Overview {#project-overview}

The FCM Extractor is a Python package that automates the extraction of Fuzzy Cognitive Maps from qualitative interview data. It uses:

- **NLP and LLMs** for concept extraction
- **Semantic clustering** using embeddings and LLMs 
- **Edge inference** to identify causal relationships
- **Interactive visualization** for exploring results

### Key Features:
- Automated concept extraction from interview transcripts
- Advanced clustering with hybrid embedding/LLM approaches
- Causal relationship inference with confidence scoring
- Interactive HTML visualizations
- Evaluation against ground truth FCMs

## 2. Setting Up the Environment {#setting-up-environment}

### Prerequisites and Installation

In [None]:
# First, ensure you're in the correct directory
import os
print("Current working directory:", os.getcwd())

# Change to fcm_extractor directory if needed
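if os.path.basename(os.getcwd()) == 'fcm_extractor':
    # Already inside the package directory (e.g. when this cell is re-run)
    print("Already in the fcm_extractor directory")
elif os.path.isdir('fcm_extractor'):
    os.chdir('fcm_extractor')
    print("Changed to:", os.getcwd())
else:
    print("Please navigate to the directory containing the 'fcm_extractor' folder")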

In [None]:
# Install required packages (if not already installed)
!pip install -r requirements.txt

In [None]:
# Set up API keys (replace the placeholders with your actual keys)
import os

# Set your API keys here - NEVER commit these to git!
# setdefault keeps any key already exported in your shell instead of overwriting it.
os.environ.setdefault('OPENAI_API_KEY', 'your-openai-api-key-here')
os.environ.setdefault('GOOGLE_API_KEY', 'your-google-api-key-here')

# Verify API keys are set (an unreplaced placeholder does not count)
for name in ('OPENAI_API_KEY', 'GOOGLE_API_KEY'):
    value = os.environ.get(name, '')
    print(f"{name} set:", bool(value) and not value.startswith('your-'))

## 3. Understanding the Codebase Structure {#codebase-structure}

Let's explore the project structure:

In [None]:
# Display the project structure
import os
from pathlib import Path

def show_tree(path, prefix="", max_depth=3, current_depth=0):
    """Display directory tree structure"""
    if current_depth > max_depth:
        return
    
    path = Path(path)
    if not path.exists():
        return
    
    items = sorted([p for p in path.iterdir() if not p.name.startswith('.')],
                  key=lambda x: (x.is_file(), x.name.lower()))
    
    for i, item in enumerate(items):
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item.name}")
        
        if item.is_dir() and current_depth < max_depth:
            next_prefix = prefix + ("    " if is_last else "│   ")
            show_tree(item, next_prefix, max_depth, current_depth + 1)

print("FCM Extractor Project Structure:")
show_tree(".")

### Key Components Explanation:

- **`run_extraction.py`**: Main entry point for running FCM extraction
- **`config/constants.py`**: All configuration parameters
- **`src/core/`**: Core concept extraction and graph building
- **`src/clustering/`**: Concept clustering algorithms
- **`src/edge_inference/`**: Causal relationship inference
- **`src/pipeline/`**: Complete processing pipeline
- **`utils/`**: Visualization, scoring, and utilities

## 4. Basic Usage Examples {#basic-usage}

### Running the Main Pipeline

In [None]:
# Basic usage: Process the default interview file
# This is equivalent to running: python run_extraction.py

import sys
sys.path.insert(0, '.')

from src.pipeline import process_interviews
from config.constants import INTERVIEWS_DIRECTORY, OUTPUT_DIRECTORY, DEFAULT_INTERVIEW_FILE

print(f"Default interview file: {DEFAULT_INTERVIEW_FILE}")
print(f"Input directory: {INTERVIEWS_DIRECTORY}")
print(f"Output directory: {OUTPUT_DIRECTORY}")

# Process the default file (uncomment to run)
# results = process_interviews(
#     interviews_dir=INTERVIEWS_DIRECTORY,
#     output_dir=OUTPUT_DIRECTORY,
#     specific_file=DEFAULT_INTERVIEW_FILE
# )
# print(f"Processed {len(results)} documents")

In [None]:
# Process a specific interview file
specific_file = "BD007.docx"  # Change this to any interview file you want

# Uncomment to run processing:
# results = process_interviews(
#     interviews_dir=INTERVIEWS_DIRECTORY,
#     output_dir=OUTPUT_DIRECTORY, 
#     specific_file=specific_file
# )

print(f"To process {specific_file}, uncomment the lines above")

### Understanding the Processing Pipeline

In [None]:
# Let's examine what happens in the pipeline step by step
from src.core.extract_concepts import extract_concepts_with_metadata
from src.clustering.improved_clustering import cluster_concepts_with_metadata
from src.edge_inference.edge_inference import infer_edges

# Sample interview text (normally this would come from a .docx file)
sample_text = """
I find that when I'm stressed at work, it really affects my sleep quality. 
Poor sleep then leads to reduced concentration the next day, which impacts my productivity. 
This creates a cycle where I feel more stressed because I'm not getting things done efficiently.
Exercise helps break this cycle - when I work out regularly, I sleep better and feel less stressed.
"""

print("Sample interview text:")
print(sample_text)
print("\n" + "="*50 + "\n")

## 5. Advanced Configuration {#configuration}

The system is highly configurable through `config/constants.py`. Let's explore key settings:

In [None]:
# View current configuration settings
from config import constants

print("=== MODEL CONFIGURATION ===")
print(f"Concept Extraction Model: {constants.CONCEPT_EXTRACTION_MODEL}")
print(f"Edge Inference Model: {constants.EDGE_INFERENCE_MODEL}")
print(f"LLM Clustering Model: {constants.LLM_CLUSTERING_MODEL}")

print("\n=== CLUSTERING CONFIGURATION ===")
print(f"Clustering Method: {constants.CLUSTERING_METHOD}")
print(f"Embedding Model: {constants.CLUSTERING_EMBEDDING_MODEL}")
print(f"Algorithm: {constants.CLUSTERING_ALGORITHM}")
print(f"Min Cluster Size: {constants.HDBSCAN_MIN_CLUSTER_SIZE}")

print("\n=== EDGE INFERENCE CONFIGURATION ===")
print(f"Confidence Threshold: {constants.EDGE_CONFIDENCE_THRESHOLD}")
print(f"Use Confidence Filtering: {constants.USE_CONFIDENCE_FILTERING}")
print(f"Enable Intra-cluster Edges: {constants.ENABLE_INTRA_CLUSTER_EDGES}")

print("\n=== POST-CLUSTERING CONFIGURATION ===")
print(f"Enable Post-clustering: {constants.ENABLE_POST_CLUSTERING}")
print(f"Similarity Threshold: {constants.POST_CLUSTERING_SIMILARITY_THRESHOLD}")

In [None]:
# How to modify configuration at runtime
import config.constants as config

# Example: Change clustering method
print(f"Current clustering method: {config.CLUSTERING_METHOD}")

# Temporarily change it (this affects the current session only)
original_method = config.CLUSTERING_METHOD
config.CLUSTERING_METHOD = "embedding_enhanced"  # Options: llm_only, hybrid, embedding_enhanced
print(f"Changed to: {config.CLUSTERING_METHOD}")

# Reset it back
config.CLUSTERING_METHOD = original_method
print(f"Reset to: {config.CLUSTERING_METHOD}")

print("\nNote: To make permanent changes, edit config/constants.py directly")
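
The save/change/reset pattern above can be wrapped in a context manager so the original value is restored even if the code in between raises. This is a minimal sketch, not part of the FCM Extractor codebase: `temporary_settings` is a hypothetical helper, and a `SimpleNamespace` stands in for `config.constants` so the example runs on its own. One caveat that follows from how Python imports work: a module that did `from config.constants import CLUSTERING_METHOD` holds its own copy of the value and will not see a runtime change made this way.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_settings(settings, **overrides):
    """Temporarily override attributes on a settings object, restoring them on exit."""
    missing = [name for name in overrides if not hasattr(settings, name)]
    if missing:
        raise AttributeError(f"Unknown setting(s): {missing}")
    original = {name: getattr(settings, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(settings, name, value)
        yield settings
    finally:
        # Runs on normal exit and on exceptions alike
        for name, value in original.items():
            setattr(settings, name, value)


# Stand-in for the config.constants module; the starting value is illustrative only
demo_config = SimpleNamespace(CLUSTERING_METHOD="hybrid")

with temporary_settings(demo_config, CLUSTERING_METHOD="embedding_enhanced"):
    inside = demo_config.CLUSTERING_METHOD
after = demo_config.CLUSTERING_METHOD

print(f"Inside the block: {inside}")
print(f"After the block:  {after}")
```

With the real module, the same helper would be called as `temporary_settings(config, CLUSTERING_METHOD="embedding_enhanced")`.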

## 6. Pipeline Components {#pipeline-components}

Let's examine each component of the pipeline:

In [None]:
# Step 1: Concept Extraction
from src.core.extract_concepts import extract_concepts_with_metadata

sample_text = """
Work stress significantly impacts my ability to maintain work-life balance.
When deadlines are tight, I tend to skip exercise and social activities.
This leads to increased anxiety and reduced job satisfaction over time.
"""

print("=== CONCEPT EXTRACTION ===")
print(f"Input text: {sample_text.strip()}")
print("\nExtracting concepts...")

# This would normally call the LLM - we'll show what the output looks like
# concepts_with_meta = extract_concepts_with_metadata(sample_text)

# Mock output for demonstration
mock_concepts = [
    "work stress", "work-life balance", "deadlines", "exercise", 
    "social activities", "anxiety", "job satisfaction"
]

print(f"Extracted concepts: {mock_concepts}")
print(f"Number of concepts: {len(mock_concepts)}")

In [None]:
# Step 2: Concept Clustering
from src.clustering.improved_clustering import cluster_concepts_with_metadata

print("=== CONCEPT CLUSTERING ===")
print("Grouping related concepts...")

# Mock clustering result
mock_clusters = {
    "Work Stressors": ["work stress", "deadlines"],
    "Wellness Activities": ["exercise", "social activities"],
    "Psychological Outcomes": ["anxiety", "job satisfaction"],
    "Life Balance": ["work-life balance"]
}

print("\nClustering results:")
for cluster_name, concepts in mock_clusters.items():
    print(f"  {cluster_name}: {concepts}")

print(f"\nNumber of clusters: {len(mock_clusters)}")

In [None]:
# Step 3: Edge Inference
print("=== EDGE INFERENCE ===")
print("Identifying causal relationships between clusters...")

# Mock edge inference results
mock_edges = [
    {"source": "Work Stressors", "target": "Wellness Activities", 
     "relationship": "negative", "confidence": 0.85, "weight": -0.7},
    {"source": "Work Stressors", "target": "Psychological Outcomes", 
     "relationship": "negative", "confidence": 0.90, "weight": -0.8},
    {"source": "Wellness Activities", "target": "Psychological Outcomes", 
     "relationship": "positive", "confidence": 0.75, "weight": 0.6},
    {"source": "Work Stressors", "target": "Life Balance", 
     "relationship": "negative", "confidence": 0.88, "weight": -0.75}
]

print("\nInferred relationships:")
for edge in mock_edges:
    print(f"  {edge['source']} → {edge['target']} ({edge['relationship']}, conf: {edge['confidence']})")

print(f"\nNumber of edges: {len(mock_edges)}")
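
To make the role of the confidence threshold concrete, here is a small sketch that filters the mock edges and turns the survivors into a signed adjacency matrix. It illustrates the idea only: `filter_edges` and `adjacency_matrix` are hypothetical helpers, and the `>=` comparison is an assumption about how `EDGE_CONFIDENCE_THRESHOLD` is applied, not a description of the pipeline's code.

```python
def filter_edges(edges, threshold):
    """Keep only edges whose confidence is at least `threshold` (assumed semantics)."""
    return [edge for edge in edges if edge["confidence"] >= threshold]


def adjacency_matrix(edges):
    """Return (labels, matrix) where matrix[i][j] is the weight of labels[i] -> labels[j]."""
    labels = sorted({e["source"] for e in edges} | {e["target"] for e in edges})
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0.0] * len(labels) for _ in labels]
    for edge in edges:
        matrix[index[edge["source"]]][index[edge["target"]]] = edge["weight"]
    return labels, matrix


# Same values as the mock edges in the cell above
demo_edges = [
    {"source": "Work Stressors", "target": "Wellness Activities", "confidence": 0.85, "weight": -0.7},
    {"source": "Work Stressors", "target": "Psychological Outcomes", "confidence": 0.90, "weight": -0.8},
    {"source": "Wellness Activities", "target": "Psychological Outcomes", "confidence": 0.75, "weight": 0.6},
    {"source": "Work Stressors", "target": "Life Balance", "confidence": 0.88, "weight": -0.75},
]

kept = filter_edges(demo_edges, threshold=0.8)
labels, matrix = adjacency_matrix(kept)

print(f"Kept {len(kept)} of {len(demo_edges)} edges")
for label, row in zip(labels, matrix):
    print(f"  {label:<24} {row}")
```

Raising the threshold trades recall for precision: the 0.75-confidence edge is dropped here, which is the behaviour the troubleshooting tips refer to when they suggest lowering the threshold if an FCM has too few edges.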

## 7. Working with Results {#results}

Let's explore how to work with the output files:

In [None]:
# Examine typical output structure
import json
from pathlib import Path

# Look for existing outputs
output_dir = Path("../fcm_outputs_gpt-mini")
if output_dir.exists():
    print("Available output directories:")
    for subdir in output_dir.iterdir():
        if subdir.is_dir():
            files = list(subdir.glob("*"))
            print(f"  {subdir.name}: {len(files)} files")
            
    # Look at a specific output
    sample_dirs = [d for d in output_dir.iterdir() if d.is_dir()]
    if sample_dirs:
        sample_dir = sample_dirs[0]
        print(f"\nFiles in {sample_dir.name}:")
        for file in sample_dir.iterdir():
            print(f"  - {file.name}")
else:
    print("No output directory found. Run the pipeline first to generate results.")

In [None]:
# Load and examine an FCM JSON file
import json
from pathlib import Path

# Try to load a sample FCM file
output_dir = Path("../fcm_outputs_gpt-mini")
fcm_files = list(output_dir.glob("*/*/fcm.json"))

if fcm_files:
    sample_fcm_file = fcm_files[0]
    print(f"Loading FCM from: {sample_fcm_file}")
    
    with open(sample_fcm_file, 'r', encoding='utf-8') as f:
        fcm_data = json.load(f)
    
    print("\n=== FCM DATA STRUCTURE ===")
    print(f"Keys in FCM data: {list(fcm_data.keys())}")
    
    if 'nodes' in fcm_data:
        print(f"\nNumber of nodes: {len(fcm_data['nodes'])}")
        print("First few nodes:")
        for i, node in enumerate(fcm_data['nodes'][:3]):
            print(f"  {i+1}. {node}")
    
    if 'edges' in fcm_data:
        print(f"\nNumber of edges: {len(fcm_data['edges'])}")
        print("First few edges:")
        for i, edge in enumerate(fcm_data['edges'][:3]):
            print(f"  {i+1}. {edge['source']} → {edge['target']} (weight: {edge.get('weight', 'N/A')})")
else:
    print("No FCM files found. Generate some results first!")
    
    # Show what an FCM structure typically looks like
    sample_fcm = {
        "nodes": [
            {"id": "cluster_1", "label": "Work Stressors", "type": "cluster"},
            {"id": "cluster_2", "label": "Wellness Activities", "type": "cluster"}
        ],
        "edges": [
            {
                "source": "cluster_1",
                "target": "cluster_2", 
                "weight": -0.7,
                "confidence": 0.85,
                "relationship_type": "negative"
            }
        ],
        "metadata": {
            "document": "sample_interview",
            "extraction_date": "2024-01-01",
            "num_concepts_extracted": 15,
            "num_clusters": 4
        }
    }
    
    print("\n=== SAMPLE FCM STRUCTURE ===")
    print(json.dumps(sample_fcm, indent=2))
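
Before analysing a loaded FCM it can help to sanity-check it. The sketch below assumes the `nodes`/`edges` layout shown in the sample structure above and the usual FCM convention that weights lie in [-1, 1]; `validate_fcm` is a hypothetical helper, not a function from the codebase.

```python
def validate_fcm(fcm):
    """Return a list of human-readable problems found in an FCM dictionary."""
    problems = []
    node_ids = {node.get("id") for node in fcm.get("nodes", [])}
    for i, edge in enumerate(fcm.get("edges", [])):
        # Every edge endpoint should refer to a declared node
        for end in ("source", "target"):
            if edge.get(end) not in node_ids:
                problems.append(f"edge {i}: unknown {end} {edge.get(end)!r}")
        # FCM weights are conventionally signed values in [-1, 1]
        weight = edge.get("weight")
        if not isinstance(weight, (int, float)) or not -1.0 <= weight <= 1.0:
            problems.append(f"edge {i}: weight {weight!r} is outside [-1, 1]")
    return problems


good_fcm = {
    "nodes": [{"id": "cluster_1"}, {"id": "cluster_2"}],
    "edges": [{"source": "cluster_1", "target": "cluster_2", "weight": -0.7}],
}
bad_fcm = {
    "nodes": [{"id": "cluster_1"}],
    "edges": [{"source": "cluster_1", "target": "cluster_9", "weight": 1.5}],
}

print("good:", validate_fcm(good_fcm))
for problem in validate_fcm(bad_fcm):
    print("bad: ", problem)
```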

## 8. Common Tasks and Examples {#examples}

### Task 1: Scoring FCMs Against Ground Truth

In [None]:
# Scoring FCMs against ground truth
from utils.score_fcm import score_fcm_semantic
from pathlib import Path

print("=== FCM SCORING ===")

# Show available ground truth files
gt_dir = Path("../ground_truth")
if gt_dir.exists():
    gt_files = list(gt_dir.glob("*.csv"))
    print(f"Available ground truth files: {len(gt_files)}")
    for gt_file in gt_files[:5]:  # Show first 5
        print(f"  - {gt_file.name}")
        
    # Print an example scoring command to run from a shell (adjust --gen-path to your output file)
    if gt_files:
        sample_gt = gt_files[0]
        print(f"\nTo score against {sample_gt.name}:")
        print(f"python utils/score_fcm.py --gt-path {sample_gt} --gen-path ../fcm_outputs/sample/sample_fcm.json")
else:
    print("Ground truth directory not found")

In [None]:
# Understanding scoring metrics
print("=== SCORING METRICS EXPLAINED ===")

scoring_info = """
The scoring system uses semantic similarity to compare generated FCMs with ground truth:

1. **Node Matching**: 
   - Uses embeddings to find semantically similar concepts
   - Threshold-based matching (default: 0.7 similarity)

2. **Edge Matching**:
   - Compares causal relationships between matched nodes
   - Considers relationship direction and polarity

3. **Metrics Computed**:
   - Precision: % of generated edges that match ground truth
   - Recall: % of ground truth edges found in generated FCM
   - F1-Score: Harmonic mean of precision and recall
   - Semantic similarity scores for nodes and edges

4. **Output Files**:
   - `*_scoring_results.csv`: Detailed matching results
   - `*_generated_matrix.csv`: Generated adjacency matrix
"""

print(scoring_info)
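
The metrics described above can be illustrated with a toy example. This is a sketch of the general technique (threshold-based cosine matching of nodes, then precision/recall/F1 over matched edges), not the implementation in `utils/score_fcm.py`. The 2-D vectors are made-up stand-ins for real embeddings, edges are simplified to `(source, target, sign)` triples, and `match_nodes` and `edge_scores` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_nodes(gen_vecs, gt_vecs, threshold=0.7):
    """Map each generated node to its most similar ground-truth node, if similar enough."""
    mapping = {}
    if not gt_vecs:
        return mapping
    for gen_name, gen_vec in gen_vecs.items():
        best_name, best_sim = max(
            ((gt_name, cosine(gen_vec, gt_vec)) for gt_name, gt_vec in gt_vecs.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            mapping[gen_name] = best_name
    return mapping


def edge_scores(gen_edges, gt_edges, mapping):
    """Precision, recall and F1 over (source, target, sign) triples."""
    translated = {
        (mapping[s], mapping[t], sign)
        for s, t, sign in gen_edges
        if s in mapping and t in mapping
    }
    true_positives = len(translated & set(gt_edges))
    precision = true_positives / len(gen_edges) if gen_edges else 0.0
    recall = true_positives / len(gt_edges) if gt_edges else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy "embeddings": illustrative values only
gen_vecs = {"job stress": (1.0, 0.1), "sleep": (0.1, 1.0), "weather": (-1.0, 0.0)}
gt_vecs = {"work stress": (1.0, 0.0), "sleep quality": (0.0, 1.0)}

gen_edges = [("job stress", "sleep", -1), ("weather", "sleep", 1)]
gt_edges = [("work stress", "sleep quality", -1), ("sleep quality", "work stress", -1)]

mapping = match_nodes(gen_vecs, gt_vecs)
precision, recall, f1 = edge_scores(gen_edges, gt_edges, mapping)
print("Node mapping:", mapping)
print(f"Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")
```

Here "weather" has no ground-truth counterpart above the threshold, so its edge counts against precision, and the ground-truth edge that was never generated counts against recall.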

### Task 2: Creating Custom Visualizations

In [None]:
# Creating visualizations
from utils.visualize_fcm import create_interactive_visualization
import json

print("=== FCM VISUALIZATION ===")

# Create a sample FCM for visualization
sample_fcm_data = {
    "nodes": [
        {"id": "stress", "label": "Work Stress", "type": "cluster", "concepts": ["deadlines", "workload"]},
        {"id": "wellness", "label": "Wellness", "type": "cluster", "concepts": ["exercise", "sleep"]},
        {"id": "performance", "label": "Performance", "type": "cluster", "concepts": ["productivity", "focus"]}
    ],
    "edges": [
        {"source": "stress", "target": "wellness", "weight": -0.7, "confidence": 0.85},
        {"source": "wellness", "target": "performance", "weight": 0.6, "confidence": 0.78},
        {"source": "stress", "target": "performance", "weight": -0.5, "confidence": 0.82}
    ]
}

print("Sample FCM structure:")
print(json.dumps(sample_fcm_data, indent=2))

# Create visualization (uncomment to run)
# output_file = "sample_fcm_interactive.html"
# create_interactive_visualization(sample_fcm_data, output_file)
# print(f"\nVisualization created: {output_file}")
# print("Open this file in a web browser to view the interactive FCM")
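
If you want a static picture rather than interactive HTML, one option is to export the FCM as Graphviz DOT text and render it with an external tool. The sketch below is an alternative to `create_interactive_visualization`, not a wrapper around it: `fcm_to_dot` is a hypothetical helper that relies only on the `id`, `label`, `source`, `target` and `weight` fields shown in the sample structure, and it assumes labels contain no double quotes.

```python
def fcm_to_dot(fcm):
    """Render an FCM dictionary as Graphviz DOT text (green = positive, red = negative)."""
    lines = ["digraph FCM {"]
    for node in fcm["nodes"]:
        lines.append(f'  "{node["id"]}" [label="{node["label"]}"];')
    for edge in fcm["edges"]:
        color = "green" if edge["weight"] >= 0 else "red"
        lines.append(
            f'  "{edge["source"]}" -> "{edge["target"]}" '
            f'[label="{edge["weight"]:+.2f}", color={color}];'
        )
    lines.append("}")
    return "\n".join(lines)


demo_fcm = {
    "nodes": [
        {"id": "stress", "label": "Work Stress"},
        {"id": "wellness", "label": "Wellness"},
        {"id": "performance", "label": "Performance"},
    ],
    "edges": [
        {"source": "stress", "target": "wellness", "weight": -0.7},
        {"source": "wellness", "target": "performance", "weight": 0.6},
    ],
}

dot_text = fcm_to_dot(demo_fcm)
print(dot_text)
```

Saving `dot_text` to a `.dot` file lets Graphviz render it, for example with `dot -Tpng fcm.dot -o fcm.png` if Graphviz is installed.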

### Task 3: Batch Processing Multiple Interviews

In [None]:
# Batch processing multiple interviews
from src.pipeline import process_interviews
from pathlib import Path

print("=== BATCH PROCESSING ===")

# Check available interview files
interviews_dir = Path("../interviews")
if interviews_dir.exists():
    interview_files = list(interviews_dir.glob("*.docx")) + list(interviews_dir.glob("*.doc"))
    print(f"Found {len(interview_files)} interview files:")
    for file in interview_files[:10]:  # Show first 10
        print(f"  - {file.name}")
    
    # Command-line equivalent
    print("\nTo process all files:")
    print("python run_extraction.py --all")
    
    # Or programmatically (uncomment to run):
    # results = process_interviews(
    #     interviews_dir="../interviews",
    #     output_dir="../fcm_outputs",
    #     specific_file=None  # Process all files
    # )
else:
    print("Interviews directory not found")
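
When a batch is large, processing files one at a time and recording failures keeps a single bad transcript from stopping the whole run. The sketch below shows that pattern in isolation. `process_each` is a hypothetical helper, and `fake_process` is a stand-in so the example runs without API calls.

```python
from pathlib import Path


def process_each(files, process_one):
    """Run `process_one` on every file, collecting results and failures separately."""
    results, failures = {}, {}
    for path in files:
        name = Path(path).name
        try:
            results[name] = process_one(path)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_process(path):
    """Stand-in for the real pipeline call; rejects anything that is not .docx."""
    if Path(path).suffix != ".docx":
        raise ValueError("unsupported format")
    return {"document": Path(path).stem}


results, failures = process_each(["BD007.docx", "notes.pdf"], fake_process)
print("Processed:", sorted(results))
print("Failed:   ", failures)
```

With the real pipeline, `process_one` could be a small function that calls `process_interviews(...)` with `specific_file` set to the file's name, using the same keyword arguments shown earlier in this notebook.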

### Task 4: Custom Configuration for Different Domains

In [None]:
# Customizing configuration for different domains
import config.constants as config

print("=== DOMAIN-SPECIFIC CONFIGURATIONS ===")

# Configuration for medical/healthcare interviews
healthcare_config = {
    "CLUSTERING_EMBEDDING_MODEL": "sentence-transformers/allenai-specter",  # Good for scientific text
    "HDBSCAN_MIN_CLUSTER_SIZE": 2,  # Smaller clusters for detailed medical concepts
    "EDGE_CONFIDENCE_THRESHOLD": 0.8,  # Higher confidence for medical relationships
    "POST_CLUSTERING_SIMILARITY_THRESHOLD": 0.7  # Conservative merging
}

# Configuration for general social science interviews
social_science_config = {
    "CLUSTERING_EMBEDDING_MODEL": "sentence-transformers/all-mpnet-base-v2",  # Good general model
    "HDBSCAN_MIN_CLUSTER_SIZE": 3,  # Moderate clustering
    "EDGE_CONFIDENCE_THRESHOLD": 0.7,  # Standard confidence
    "POST_CLUSTERING_SIMILARITY_THRESHOLD": 0.6  # Moderate merging
}

# Configuration for business/organizational interviews
business_config = {
    "CLUSTERING_EMBEDDING_MODEL": "sentence-transformers/all-MiniLM-L12-v2",  # Fast processing
    "HDBSCAN_MIN_CLUSTER_SIZE": 4,  # Larger clusters for high-level concepts
    "EDGE_CONFIDENCE_THRESHOLD": 0.6,  # Lower threshold for exploratory analysis
    "POST_CLUSTERING_SIMILARITY_THRESHOLD": 0.5  # Aggressive merging
}

configs = {
    "Healthcare/Medical": healthcare_config,
    "Social Science": social_science_config, 
    "Business/Organizational": business_config
}

for domain, cfg in configs.items():
    print(f"\n{domain} Configuration:")
    for key, value in cfg.items():
        print(f"  {key}: {value}")

print("\nTo apply a configuration, modify config/constants.py with these values")
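
Before editing `config/constants.py`, it can be useful to see which values a preset would actually change. The following sketch compares a preset dictionary against a settings object. `preset_diff` is a hypothetical helper, and the `SimpleNamespace` with made-up current values stands in for the real `config.constants` module.

```python
from types import SimpleNamespace


def preset_diff(settings, preset):
    """Return {name: (current, proposed)} for every preset value that differs."""
    missing = object()  # sentinel so that a setting equal to None is still detected
    diff = {}
    for name, proposed in preset.items():
        current = getattr(settings, name, missing)
        if current is missing:
            diff[name] = ("<not defined>", proposed)
        elif current != proposed:
            diff[name] = (current, proposed)
    return diff


# Illustrative current values, not the project's actual defaults
current_settings = SimpleNamespace(
    HDBSCAN_MIN_CLUSTER_SIZE=3,
    EDGE_CONFIDENCE_THRESHOLD=0.7,
)
preset = {
    "HDBSCAN_MIN_CLUSTER_SIZE": 2,
    "EDGE_CONFIDENCE_THRESHOLD": 0.7,
    "POST_CLUSTERING_SIMILARITY_THRESHOLD": 0.7,
}

changes = preset_diff(current_settings, preset)
for name, (current, proposed) in changes.items():
    print(f"  {name}: {current} -> {proposed}")
```

Passing the imported `config` module and one of the dictionaries above (for example `healthcare_config`) in place of the stand-ins gives the list of constants to edit.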

## 9. Troubleshooting {#troubleshooting}

Common issues and solutions:

In [None]:
# Debugging and troubleshooting tools
import sys
import os
from pathlib import Path

print("=== SYSTEM CHECK ===")

# Check Python version
print(f"Python version: {sys.version}")

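# Check if we can import key modules (import name -> pip package name)
modules_to_check = {
    'openai': 'openai', 'sentence_transformers': 'sentence-transformers',
    'sklearn': 'scikit-learn', 'networkx': 'networkx', 'pandas': 'pandas',
    'numpy': 'numpy', 'matplotlib': 'matplotlib', 'plotly': 'plotly'
}

print("\nModule availability:")
for module, package in modules_to_check.items():
    try:
        __import__(module)
        print(f"  ✅ {module}")
    except ImportError:
        # The pip package name can differ from the import name (e.g. sklearn)
        print(f"  ❌ {module} - run: pip install {package}")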

# Check API keys
print("\nAPI Keys:")
print(f"  OpenAI: {'✅' if os.environ.get('OPENAI_API_KEY') else '❌ Not set'}")
print(f"  Google: {'✅' if os.environ.get('GOOGLE_API_KEY') else '❌ Not set'}")

# Check directory structure
print("\nDirectory Structure:")
required_dirs = ['config', 'src', 'utils']
for dir_name in required_dirs:
    exists = Path(dir_name).exists()
    print(f"  {dir_name}: {'✅' if exists else '❌ Missing'}")

In [None]:
# Common troubleshooting tips
troubleshooting_guide = """
=== COMMON ISSUES AND SOLUTIONS ===

1. **"Module not found" errors**:
   - Run: pip install -r requirements.txt
   - Make sure you're in the fcm_extractor directory

2. **API key errors**:
   - Set environment variables: export OPENAI_API_KEY="your-key"
   - Check that your API keys are valid and have sufficient credits

3. **"No documents processed" error**:
   - Check that interview files exist in the specified directory
   - Verify file formats (.docx, .doc, .txt are supported)
   - Check file permissions

4. **Memory errors during processing**:
   - Reduce HDBSCAN_MIN_CLUSTER_SIZE for smaller clusters
   - Use smaller, faster embedding models (e.g., all-MiniLM-L6-v2)
   - Process files individually instead of batch processing

5. **Poor clustering results**:
   - Adjust HDBSCAN_MIN_CLUSTER_SIZE (lower = more clusters)
   - Try different embedding models
   - Enable/disable post-clustering merge

6. **Few or no edges in FCM**:
   - Lower EDGE_CONFIDENCE_THRESHOLD (e.g., from 0.7 to 0.5)
   - Enable ENABLE_INTRA_CLUSTER_EDGES for more connections
   - Check that your interview text contains causal language

7. **Visualization not displaying**:
   - Open HTML files in a modern web browser
   - Check that the FCM JSON file has nodes and edges
   - Try the static PNG visualization instead

8. **Logging issues**:
   - Set ENABLE_FILE_LOGGING = True in constants.py
   - Check that the logs directory has write permissions
   - Look in the logs/ directory for detailed error messages
"""

print(troubleshooting_guide)

## Conclusion

This tutorial covered the essential aspects of working with the FCM Extractor codebase:

1. **Project structure** and key components
2. **Basic usage** for processing interview transcripts
3. **Advanced configuration** for different domains and use cases
4. **Pipeline components** (concept extraction, clustering, edge inference)
5. **Working with results** (FCM files, visualizations, scoring)
6. **Common tasks** and practical examples
7. **Troubleshooting** common issues

### Next Steps:

1. **Practice with your data**: Try processing your own interview files
2. **Experiment with configurations**: Test different clustering and inference settings
3. **Evaluate results**: Use the scoring system to validate against ground truth
4. **Customize for your domain**: Adapt the configuration for your specific research area
5. **Extend the pipeline**: Add custom processing steps as needed

### Resources:

- `README.md`: Comprehensive documentation
- `config/constants.py`: All configuration options
- `utils/`: Utility scripts for scoring, visualization, etc.
- `logs/`: Processing logs for debugging

Happy FCM extraction! 🗺️📊