# Developer Task Suite

This notebook defines a comprehensive suite of real-world developer tasks and maps them to:
- **Evaluation Metrics**: How to measure task performance
- **Expected Best Rungs**: Which representation rungs should work best
- **Implementation Status**: What's been implemented and what's pending

## Purpose

This is a **reference/planning document** (not an execution notebook). It serves as:
1. **Task Catalog**: Complete list of developer tasks we want to evaluate
2. **Evaluation Guide**: Metrics and methods for each task
3. **Rung Selection**: Expected best rungs based on task characteristics
4. **Progress Tracker**: Implementation status across the research pipeline

## Representation Rungs

Our privacy-preserving abstraction ladder has 6 rungs:

1. **tokens** (Rung 1): Token-level with PII redaction - exact text changes, lowest privacy
2. **semantic_edits** (Rung 2): AST-based edit operations - intent without raw code
3. **functions** (Rung 3): Function signatures and module-level changes
4. **files** (Rung 4): File-level collaboration graph with action counts
5. **dependencies** (Rung 5): Dependency graph - import relationships
6. **motifs** (Rung 6): Workflow patterns and high-level sequences - highest abstraction
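
The rung order above can be encoded so that rung names used elsewhere in the notebook are checked against it. The following is a minimal sketch, not part of the project code: the `Rung` enum and both helper functions are hypothetical names introduced here for illustration. A check like this would flag entries such as `"all"` or `"context_events"`, which appear in `expected_best_rungs` later in this notebook but are not among the six rung identifiers.

```python
from enum import IntEnum
from typing import Iterable, List


class Rung(IntEnum):
    """Hypothetical encoding of the six rungs (1 = tokens ... 6 = motifs)."""
    tokens = 1
    semantic_edits = 2
    functions = 3
    files = 4
    dependencies = 5
    motifs = 6


def unknown_rungs(names: Iterable[str]) -> List[str]:
    """Return the names that are not one of the six rung identifiers."""
    valid = set(Rung.__members__)
    return [name for name in names if name not in valid]


def most_abstract(names: Iterable[str]) -> str:
    """Return the highest-numbered (most abstract) rung among valid names."""
    return max(Rung[name] for name in names).name


print(unknown_rungs(["files", "dependencies", "context_events"]))
print(most_abstract(["semantic_edits", "functions", "motifs"]))
```

Using an `IntEnum` keeps the ladder order comparable, so "more abstract than" becomes a plain `>` comparison.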


In [1]:
import json
import pandas as pd
from pathlib import Path
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from enum import Enum

class TaskStatus(Enum):
    NOT_STARTED = "not_started"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"
    PARTIAL = "partial"

class TaskCategory(Enum):
    RETRIEVAL = "retrieval"
    PREDICTION = "prediction"
    CLASSIFICATION = "classification"
    SEARCH = "search"
    SEGMENTATION = "segmentation"
    ANALYSIS = "analysis"

@dataclass
class DeveloperTask:
    """Represents a developer task with evaluation details."""
    id: str
    name: str
    description: str
    category: TaskCategory
    metrics: List[str]
    expected_best_rungs: List[str]
    rationale: str
    status: TaskStatus
    implementation_file: Optional[str] = None
    results_file: Optional[str] = None
    notes: Optional[str] = None
    
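    def to_dict(self) -> Dict[str, object]:
        """Return a JSON-serializable dict with enum fields replaced by their values."""
        d = asdict(self)
        d['category'] = self.category.value
        d['status'] = self.status.value
        return d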

# Initialize task suite
tasks: List[DeveloperTask] = []


## Task Suite Definition

### Category 1: Context Retrieval Tasks

Tasks that involve finding relevant code/context for a given prompt or task.


In [2]:
# Context Retrieval: Find relevant files/snippets for a prompt
tasks.append(DeveloperTask(
    id="context_retrieval",
    name="Context Retrieval",
    description="Given a prompt/task, retrieve the most relevant files and code snippets from the codebase",
    category=TaskCategory.RETRIEVAL,
    metrics=["MRR", "recall@1", "recall@5", "recall@10", "precision@k", "representation_size"],
    expected_best_rungs=["semantic_edits", "functions", "files"],
    rationale="Semantic edits capture intent, functions provide structure, files show scope",
    status=TaskStatus.COMPLETE,
    implementation_file="research/evaluation/retrieval/context_retrieval.ipynb",
    results_file="research/results/context_retrieval_performance.png",
    notes="Implemented with TF-IDF retrieval across all rungs"
))

# File-level Context: Find files that should be in context window
tasks.append(DeveloperTask(
    id="file_context_retrieval",
    name="File-level Context Retrieval",
    description="Retrieve files that should be included in context window for a given task",
    category=TaskCategory.RETRIEVAL,
    metrics=["MRR", "recall@k", "file_coverage", "dependency_accuracy"],
    expected_best_rungs=["files", "dependencies", "functions"],
    rationale="File and dependency graphs capture structural relationships",
    status=TaskStatus.COMPLETE,
    implementation_file="research/evaluation/retrieval/context_retrieval.ipynb",
    results_file="research/results/context_file_actions.png",
    notes="Part of context_retrieval evaluation"
))
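
The retrieval tasks above list `MRR` and `recall@k` among their metrics. As a rough illustration of what those names usually mean, here is a minimal sketch using the standard textbook definitions. The function names and the example file names are assumptions made for this example and are not taken from `context_retrieval.ipynb`.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranking, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Hypothetical rankings for two prompts.
queries = [
    (["a.py", "b.py", "c.py", "d.py"], {"b.py", "d.py"}),
    (["x.py", "y.py"], {"x.py"}),
]
print(mean_reciprocal_rank(queries))
print(recall_at_k(*queries[0], k=2))
```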


### Category 2: Prediction Tasks

Tasks that predict future events or actions based on current context.


In [3]:
# Next Event Prediction: Predict the next action/event
tasks.append(DeveloperTask(
    id="next_event_prediction",
    name="Next Event Prediction",
    description="Predict the next event (code change, file navigation, terminal command) given current context",
    category=TaskCategory.PREDICTION,
    metrics=["accuracy", "top_k_accuracy", "perplexity", "cross_entropy"],
    expected_best_rungs=["semantic_edits", "functions", "motifs"],
    rationale="Semantic edits show immediate patterns, functions show structure, motifs show workflow",
    status=TaskStatus.COMPLETE,
    implementation_file="research/evaluation/probes/probes_baseline.ipynb",
    results_file="research/results/probe_metrics.json",
    notes="Implemented as classification probe in probes_baseline"
))

# Next File Prediction: Predict which file will be edited next
tasks.append(DeveloperTask(
    id="next_file_prediction",
    name="Next File Prediction",
    description="Predict which file the developer will edit next based on current session",
    category=TaskCategory.PREDICTION,
    metrics=["accuracy", "recall@k", "file_rank"],
    expected_best_rungs=["files", "dependencies", "motifs"],
    rationale="File relationships and workflow patterns predict next file",
    status=TaskStatus.PARTIAL,
    implementation_file="research/evaluation/probes/probes_baseline.ipynb",
    notes="Can be derived from next_event_prediction with file filtering"
))
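
For the prediction tasks, `top_k_accuracy`, `cross_entropy`, and `perplexity` can all be computed from a predicted probability distribution per step. The sketch below assumes each prediction is a dict mapping event labels to probabilities. That input format and the event labels are assumptions for illustration, and the probes in `probes_baseline.ipynb` may compute these metrics differently.

```python
import math
from typing import Dict, List


def top_k_accuracy(predictions: List[Dict[str, float]], targets: List[str], k: int) -> float:
    """Share of steps whose true label is among the k most probable labels."""
    hits = 0
    for probs, target in zip(predictions, targets):
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


def cross_entropy(predictions: List[Dict[str, float]], targets: List[str],
                  eps: float = 1e-12) -> float:
    """Mean negative log-probability (natural log) assigned to the true label."""
    total = 0.0
    for probs, target in zip(predictions, targets):
        total -= math.log(max(probs.get(target, 0.0), eps))
    return total / len(targets)


def perplexity(predictions: List[Dict[str, float]], targets: List[str]) -> float:
    """Exponentiated cross-entropy."""
    return math.exp(cross_entropy(predictions, targets))


# Hypothetical two-step session with three event types.
predictions = [
    {"edit": 0.5, "navigate": 0.25, "terminal": 0.25},
    {"edit": 0.125, "navigate": 0.25, "terminal": 0.625},
]
targets = ["edit", "navigate"]
print(top_k_accuracy(predictions, targets, k=1))
print(perplexity(predictions, targets))
```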


### Category 3: Classification Tasks

Tasks that classify or categorize developer activities.


In [4]:
# Activity Classification: Classify type of coding activity
tasks.append(DeveloperTask(
    id="activity_classification",
    name="Activity Classification",
    description="Classify the type of coding activity (feature addition, bug fix, refactoring, etc.)",
    category=TaskCategory.CLASSIFICATION,
    metrics=["accuracy", "f1_score", "per_class_f1", "confusion_matrix"],
    expected_best_rungs=["semantic_edits", "functions", "motifs"],
    rationale="Semantic edits show change patterns, functions show scope, motifs show workflow intent",
    status=TaskStatus.COMPLETE,
    implementation_file="research/evaluation/probes/probes_baseline.ipynb",
    results_file="research/results/classification_accuracy_by_rung.png",
    notes="Implemented as multi-class classification probe"
))

# Intent Classification: Classify developer intent
tasks.append(DeveloperTask(
    id="intent_classification",
    name="Intent Classification",
    description="Classify developer intent (DEBUG, FEATURE, REFACTOR, etc.) from activity patterns",
    category=TaskCategory.CLASSIFICATION,
    metrics=["accuracy", "f1_score", "intent_aware_metrics"],
    expected_best_rungs=["motifs", "semantic_edits", "functions"],
    rationale="Motifs capture high-level workflow patterns that indicate intent",
    status=TaskStatus.COMPLETE,
    implementation_file="research/evaluation/probes/probes_baseline.ipynb",
    results_file="research/results/probe_metrics_intent_aware.json",
    notes="Intent-aware probe evaluation implemented"
))

# Anomaly Detection: Detect unusual or suspicious patterns
tasks.append(DeveloperTask(
    id="anomaly_detection",
    name="Anomaly Detection",
    description="Detect anomalous or unusual developer activity patterns",
    category=TaskCategory.CLASSIFICATION,
    metrics=["precision", "recall", "f1_score", "roc_auc", "anomaly_score_distribution"],
    expected_best_rungs=["motifs", "semantic_edits", "functions"],
    rationale="Motifs show normal patterns, deviations indicate anomalies",
    status=TaskStatus.COMPLETE,
    implementation_file="research/evaluation/probes/probes_baseline.ipynb",
    results_file="research/results/probe_metrics.json",
    notes="Implemented as binary classification probe"
))

# Multi-file Activity Detection: Detect when activity spans multiple files
tasks.append(DeveloperTask(
    id="multi_file_detection",
    name="Multi-file Activity Detection",
    description="Detect when a coding task involves changes across multiple files",
    category=TaskCategory.CLASSIFICATION,
    metrics=["accuracy", "f1_score", "file_count_accuracy"],
    expected_best_rungs=["files", "dependencies", "functions"],
    rationale="File and dependency graphs directly show multi-file relationships",
    status=TaskStatus.PARTIAL,
    implementation_file="research/evaluation/probes/probes_baseline.ipynb",
    notes="Can be derived from existing probes"
))
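
Several classification tasks report `per_class_f1` alongside a single `f1_score`. A minimal sketch of how per-class and macro-averaged F1 relate is shown below, using the standard definition F1 = 2TP / (2TP + FP + FN). The helper names and the label sequence are hypothetical, and whether the probes report macro, micro, or weighted F1 is not stated in this notebook.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 per label, computed from true/false positive and false negative counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label received a false positive
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


y_true = ["DEBUG", "DEBUG", "FEATURE", "REFACTOR"]
y_pred = ["DEBUG", "FEATURE", "FEATURE", "REFACTOR"]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```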


### Category 4: Search Tasks

Tasks that involve searching for specific patterns, workflows, or code.


In [5]:
# Event Retrieval: Search for specific events by natural language query
tasks.append(DeveloperTask(
    id="event_retrieval",
    name="Event Retrieval",
    description="Retrieve relevant events from history using natural language queries",
    category=TaskCategory.SEARCH,
    metrics=["precision@k", "recall@k", "NDCG@k", "MRR", "overlap_analysis"],
    expected_best_rungs=["semantic_edits", "functions", "motifs", "tokens"],
    rationale="Different query types need different rungs - semantic for intent, tokens for exact matches",
    status=TaskStatus.NOT_STARTED,
    implementation_file="research/evaluation/search/event_retrieval_evaluation.ipynb",
    notes="Notebook exists but not yet executed"
))

# Procedural Search: Search for workflow patterns and sequences
tasks.append(DeveloperTask(
    id="procedural_search",
    name="Procedural Search",
    description="Search for workflow patterns, temporal sequences, and procedural knowledge",
    category=TaskCategory.SEARCH,
    metrics=["precision@k", "recall@k", "NDCG@k", "pattern_distinctness", "workflow_coherence"],
    expected_best_rungs=["motifs", "semantic_edits", "functions"],
    rationale="Motifs capture workflow patterns, semantic edits show step-by-step processes",
    status=TaskStatus.COMPLETE,
    implementation_file="research/evaluation/search/procedural_search_evaluation.ipynb",
    results_file="research/results/search_evaluation_results.json",
    notes="Evaluates baseline, intent-only, and intent+representation search"
))

# Code Pattern Search: Search for specific code patterns or structures
tasks.append(DeveloperTask(
    id="code_pattern_search",
    name="Code Pattern Search",
    description="Search for specific code patterns, structures, or implementations",
    category=TaskCategory.SEARCH,
    metrics=["precision@k", "recall@k", "pattern_match_accuracy"],
    expected_best_rungs=["tokens", "semantic_edits", "functions"],
    rationale="Tokens for exact patterns, semantic edits for structural patterns, functions for API patterns",
    status=TaskStatus.PARTIAL,
    notes="Can be evaluated using event_retrieval with code-focused queries"
))

# Similar Session Search: Find similar past sessions/workflows
tasks.append(DeveloperTask(
    id="similar_session_search",
    name="Similar Session Search",
    description="Find past sessions with similar workflows or patterns",
    category=TaskCategory.SEARCH,
    metrics=["precision@k", "recall@k", "session_similarity_score"],
    expected_best_rungs=["motifs", "semantic_edits", "functions"],
    rationale="Motifs capture high-level workflow similarity",
    status=TaskStatus.PARTIAL,
    notes="Can use motif clustering and similarity metrics"
))
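
The search tasks add `NDCG@k` to the retrieval metrics. The sketch below uses the common log2-discounted form. One simplifying assumption is labeled in the code: the ideal ranking is built from the gains of the retrieved list only, whereas a full evaluation would normally use all relevant items in the corpus. Function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering.

    Assumption: the ideal ordering is a re-sort of the retrieved gains only.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Relevance of the results returned for one hypothetical query, in rank order.
print(ndcg_at_k([1, 0, 1], k=3))
print(ndcg_at_k([1, 1, 0], k=3))
```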


### Category 5: Segmentation Tasks

Tasks that involve dividing activity into meaningful segments or sessions.


In [None]:
# Temporal Segmentation: Segment activity into task-based chunks
tasks.append(DeveloperTask(
    id="temporal_segmentation",
    name="Temporal Segmentation",
    description="Segment developer activity into meaningful task-based segments",
    category=TaskCategory.SEGMENTATION,
    metrics=["completeness", "homogeneity", "boundary_quality", "semantic_coherence", "f1_score"],
    expected_best_rungs=["motifs", "semantic_edits", "functions"],
    rationale="Motifs show natural workflow boundaries, semantic edits show task transitions",
    status=TaskStatus.COMPLETE,
    implementation_file="research/evaluation/segmentation/embedding_ground_truth_evaluation.ipynb",
    results_file="research/results/temporal_segmentation_inactivity_results.json",
    notes="Evaluates inactivity-based segmentation with embedding ground truth"
))

# Intent-based Segmentation: Segment by developer intent
tasks.append(DeveloperTask(
    id="intent_segmentation",
    name="Intent-based Segmentation",
    description="Segment activity by developer intent (each segment = one intent)",
    category=TaskCategory.SEGMENTATION,
    metrics=["intent_purity", "segment_coherence", "boundary_accuracy"],
    expected_best_rungs=["motifs", "semantic_edits"],
    rationale="Motifs capture intent patterns, semantic edits show intent transitions",
    status=TaskStatus.PARTIAL,
    notes="Can be derived from intent classification + temporal segmentation"
))
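
The notes for Temporal Segmentation mention inactivity-based segmentation. One common way to do that is to start a new segment whenever the gap between consecutive events exceeds a threshold. The sketch below illustrates that idea only. The function name, the threshold value, and the assumption that timestamps are sorted seconds are all introduced here and are not taken from the evaluation notebook.

```python
from typing import List, Sequence


def segment_by_inactivity(timestamps: Sequence[float], max_gap: float) -> List[List[int]]:
    """Group event indices into segments, splitting where the gap exceeds max_gap.

    Assumes timestamps are sorted in ascending order.
    """
    segments: List[List[int]] = []
    for index, ts in enumerate(timestamps):
        if index == 0 or ts - timestamps[index - 1] > max_gap:
            segments.append([])  # long pause (or first event): open a new segment
        segments[-1].append(index)
    return segments


# Hypothetical event times in seconds, split at pauses longer than 5 minutes.
events = [0, 30, 60, 1000, 1020, 5000]
print(segment_by_inactivity(events, max_gap=300))
```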


### Category 6: Analysis Tasks

Tasks that involve analyzing code, patterns, or developer behavior.


In [None]:
# Expressiveness-Privacy Analysis: Measure trade-offs
tasks.append(DeveloperTask(
    id="expressiveness_privacy",
    name="Expressiveness-Privacy Trade-off",
    description="Measure expressiveness (within-trace similarity) vs privacy (cross-trace similarity)",
    category=TaskCategory.ANALYSIS,
    metrics=["drift", "epsilon_equivalent", "k_anonymity", "within_similarity", "cross_similarity"],
    expected_best_rungs=["all"],
    rationale="Requires comparison across all rungs to understand trade-offs",
    status=TaskStatus.COMPLETE,
    implementation_file="research/analysis/expressiveness/LLM_reconstruction_test.ipynb",
    results_file="research/results/expressiveness_privacy_tradeoff.png",
    notes="Core privacy-utility analysis"
))

# Context Usage Analysis: Analyze how context is used
tasks.append(DeveloperTask(
    id="context_usage_analysis",
    name="Context Usage Analysis",
    description="Analyze how developers use context (files, snippets, dependencies) in their workflow",
    category=TaskCategory.ANALYSIS,
    metrics=["context_coverage", "context_efficiency", "file_action_distribution"],
    expected_best_rungs=["files", "dependencies", "context_events"],
    rationale="File and dependency graphs show context relationships",
    status=TaskStatus.COMPLETE,
    implementation_file="research/analysis/context/event_context_extraction.ipynb",
    results_file="research/results/event_context_table.csv",
    notes="Analyzes context extraction and usage patterns"
))

# Workflow Pattern Analysis: Analyze common workflow patterns
tasks.append(DeveloperTask(
    id="workflow_pattern_analysis",
    name="Workflow Pattern Analysis",
    description="Identify and analyze common workflow patterns in developer activity",
    category=TaskCategory.ANALYSIS,
    metrics=["pattern_frequency", "pattern_diversity", "pattern_clustering"],
    expected_best_rungs=["motifs", "semantic_edits"],
    rationale="Motifs are designed to capture workflow patterns",
    status=TaskStatus.PARTIAL,
    notes="Can use motif mining and clustering results"
))


## Task Summary Table


In [None]:
# Create summary DataFrame
df = pd.DataFrame([task.to_dict() for task in tasks])

# Display summary
print("=" * 80)
print("DEVELOPER TASK SUITE SUMMARY")
print("=" * 80)
print(f"\nTotal Tasks: {len(tasks)}")
print("\nBy Status:")
print(df['status'].value_counts())
print("\nBy Category:")
print(df['category'].value_counts())

# Display detailed table
print("\n" + "=" * 80)
print("DETAILED TASK LIST")
print("=" * 80)
display_cols = ['id', 'name', 'category', 'status', 'expected_best_rungs', 'implementation_file']
print(df[display_cols].to_string(index=False))


DEVELOPER TASK SUITE SUMMARY

Total Tasks: 17

By Status:
status
complete       10
partial         6
not_started     1
Name: count, dtype: int64

By Category:
category
classification    4
search            4
analysis          3
retrieval         2
prediction        2
segmentation      2
Name: count, dtype: int64

DETAILED TASK LIST
                       id                             name       category      status                         expected_best_rungs                                                      implementation_file
        context_retrieval                Context Retrieval      retrieval    complete          [semantic_edits, functions, files]                    research/evaluation/retrieval/context_retrieval.ipynb
   file_context_retrieval     File-level Context Retrieval      retrieval    complete            [files, dependencies, functions]                    research/evaluation/retrieval/context_retrieval.ipynb
    next_event_prediction            Next Event Predictio
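
Beyond the status and category counts, it can be useful to see how often each rung is named as an expected best rung and which tasks name it. The following is a small sketch over three entries copied from the task definitions above. The `expected`, `rung_counts`, and `tasks_by_rung` names are hypothetical, and in the notebook the same mapping could be built from `tasks` directly.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Subset of the suite: task id -> expected_best_rungs.
expected: Dict[str, List[str]] = {
    "context_retrieval": ["semantic_edits", "functions", "files"],
    "file_context_retrieval": ["files", "dependencies", "functions"],
    "procedural_search": ["motifs", "semantic_edits", "functions"],
}

# How many tasks list each rung.
rung_counts = Counter(rung for rungs in expected.values() for rung in rungs)

# Inverse index: rung -> task ids that list it.
tasks_by_rung: Dict[str, List[str]] = defaultdict(list)
for task_id, rungs in expected.items():
    for rung in rungs:
        tasks_by_rung[rung].append(task_id)

for rung, count in rung_counts.most_common():
    print(f"{rung:15s} {count}  {', '.join(tasks_by_rung[rung])}")
```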

## Rung Selection Guide

Based on task characteristics, here's a guide for selecting the best rung:


In [None]:
# Rung selection guide
rung_guide = {
    "tokens": {
        "best_for": ["Exact code matching", "Precise text search", "Low-level pattern detection"],
        "trade_off": "Lowest privacy, highest storage"
    },
    "semantic_edits": {
        "best_for": ["Intent-based retrieval", "Pattern matching", "Next event prediction"],
        "trade_off": "Good balance of expressiveness and privacy"
    },
    "functions": {
        "best_for": ["API-level patterns", "Structure-based search", "Multi-file detection"],
        "trade_off": "Abstracts implementation, good for structure"
    },
    "files": {
        "best_for": ["File-level context", "Multi-file tasks", "Dependency analysis"],
        "trade_off": "High abstraction, good privacy"
    },
    "dependencies": {
        "best_for": ["Structural relationships", "Import analysis", "Module-level tasks"],
        "trade_off": "Pure structure, no content"
    },
    "motifs": {
        "best_for": ["Workflow patterns", "Intent classification", "High-level search"],
        "trade_off": "Highest privacy, workflow-level only"
    }
}

print("RUNG SELECTION GUIDE")
print("=" * 80)
for rung, info in rung_guide.items():
    print(f"\n{rung.upper()}")
    print(f"  Best for: {', '.join(info['best_for'])}")
    print(f"  Trade-off: {info['trade_off']}")


RUNG SELECTION GUIDE

TOKENS
  Best for: Exact code matching, Precise text search, Low-level pattern detection
  Trade-off: Lowest privacy, highest storage

SEMANTIC_EDITS
  Best for: Intent-based retrieval, Pattern matching, Next event prediction
  Trade-off: Good balance of expressiveness and privacy

FUNCTIONS
  Best for: API-level patterns, Structure-based search, Multi-file detection
  Trade-off: Abstracts implementation, good for structure

FILES
  Best for: File-level context, Multi-file tasks, Dependency analysis
  Trade-off: High abstraction, good privacy

DEPENDENCIES
  Best for: Structural relationships, Import analysis, Module-level tasks
  Trade-off: Pure structure, no content

MOTIFS
  Best for: Workflow patterns, Intent classification, High-level search
  Trade-off: Highest privacy, workflow-level only


## Implementation Status

### ✅ Complete Tasks (10)
- Context Retrieval
- File-level Context Retrieval  
- Next Event Prediction
- Activity Classification
- Intent Classification
- Anomaly Detection
- Procedural Search
- Temporal Segmentation
- Expressiveness-Privacy Analysis
- Context Usage Analysis

### ⚠️ Partial Tasks (6)
- Next File Prediction
- Multi-file Activity Detection
- Code Pattern Search
- Similar Session Search
- Intent-based Segmentation
- Workflow Pattern Analysis

### ❌ Not Started (1)
- Event Retrieval
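
Because this status list is maintained by hand, its counts can drift from the task definitions. One option is to render the section from the task records instead. The sketch below assumes records shaped like the output of `to_dict()` (only `name` and `status` are used). The `render_status` helper and the `HEADINGS` mapping are hypothetical, and the `in_progress` status is omitted because no task currently uses it.

```python
from collections import defaultdict
from typing import Dict, List

# Status value -> heading text, in display order.
HEADINGS = {
    "complete": "✅ Complete Tasks",
    "partial": "⚠️ Partial Tasks",
    "not_started": "❌ Not Started",
}


def render_status(records: List[Dict[str, str]]) -> str:
    """Build the markdown status section, with counts derived from the data."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        grouped[record["status"]].append(record["name"])
    lines: List[str] = []
    for status, heading in HEADINGS.items():
        names = grouped.get(status, [])
        lines.append(f"### {heading} ({len(names)})")
        lines.extend(f"- {name}" for name in names)
        lines.append("")
    return "\n".join(lines).rstrip()


records = [
    {"name": "Context Retrieval", "status": "complete"},
    {"name": "Next File Prediction", "status": "partial"},
    {"name": "Event Retrieval", "status": "not_started"},
    {"name": "Procedural Search", "status": "complete"},
]
markdown = render_status(records)
print(markdown)
```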

## Next Steps

1. **Complete Event Retrieval**: Run `event_retrieval_evaluation.ipynb` (the notebook exists but has not yet been executed)
2. **Expand Partial Tasks**: Add dedicated evaluations for partial tasks
3. **Advanced Probes**: Run `advanced_probes_comparison.ipynb` for ML model comparisons
4. **Cross-task Analysis**: Compare rung performance across all tasks


In [None]:
# Export task suite to JSON for reference
output_file = Path("research/results/developer_task_suite.json")
output_file.parent.mkdir(parents=True, exist_ok=True)

with open(output_file, 'w') as f:
    json.dump([task.to_dict() for task in tasks], f, indent=2)

print(f"✅ Task suite exported to {output_file}")
print(f"   Total tasks: {len(tasks)}")
print(f"   Complete: {sum(1 for t in tasks if t.status == TaskStatus.COMPLETE)}")
print(f"   Partial: {sum(1 for t in tasks if t.status == TaskStatus.PARTIAL)}")
print(f"   Not started: {sum(1 for t in tasks if t.status == TaskStatus.NOT_STARTED)}")


✅ Task suite exported to research/results/developer_task_suite.json
   Total tasks: 17
   Complete: 10
   Partial: 6
   Not started: 1
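
The exported JSON stores `category` and `status` as plain strings, so reading it back requires converting them to enums again. The sketch below shows one possible round trip. The `task_from_dict` helper is hypothetical (the notebook defines only `to_dict`), the class definitions are repeated so the example runs on its own, and the round trip is done in memory rather than through `developer_task_suite.json`.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Dict, List, Optional


class TaskStatus(Enum):
    NOT_STARTED = "not_started"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"
    PARTIAL = "partial"


class TaskCategory(Enum):
    RETRIEVAL = "retrieval"
    PREDICTION = "prediction"
    CLASSIFICATION = "classification"
    SEARCH = "search"
    SEGMENTATION = "segmentation"
    ANALYSIS = "analysis"


@dataclass
class DeveloperTask:
    id: str
    name: str
    description: str
    category: TaskCategory
    metrics: List[str]
    expected_best_rungs: List[str]
    rationale: str
    status: TaskStatus
    implementation_file: Optional[str] = None
    results_file: Optional[str] = None
    notes: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        d = asdict(self)
        d["category"] = self.category.value
        d["status"] = self.status.value
        return d


def task_from_dict(record: Dict[str, Any]) -> DeveloperTask:
    """Hypothetical inverse of to_dict: rebuild the enums from their string values."""
    fields = dict(record)
    fields["category"] = TaskCategory(fields["category"])
    fields["status"] = TaskStatus(fields["status"])
    return DeveloperTask(**fields)


original = DeveloperTask(
    id="similar_session_search",
    name="Similar Session Search",
    description="Find past sessions with similar workflows or patterns",
    category=TaskCategory.SEARCH,
    metrics=["precision@k", "recall@k", "session_similarity_score"],
    expected_best_rungs=["motifs", "semantic_edits", "functions"],
    rationale="Motifs capture high-level workflow similarity",
    status=TaskStatus.PARTIAL,
)

payload = json.dumps([original.to_dict()], indent=2)
restored = [task_from_dict(record) for record in json.loads(payload)]
print(restored[0] == original)
```

An unknown status or category string raises `ValueError` when the enum is constructed, which surfaces hand-edited or stale JSON early.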
