# RL Configuration Manager

This module provides a centralized configuration interface for the RL-based educational coaching system. It manages all static settings required by the agent and environment to interpret observations, execute actions, and compute rewards.

The configuration is saved and loaded from a JSON file (`rl_config.json`). If the file does not exist, a default configuration is created automatically. This ensures a reproducible setup and simplifies testing across modules.

## Configuration Structure

- `state_variables`:  
  Defines the numerical features observed by the RL agent at each time step. These features represent the student’s academic state, ethical posture, writing progression, and contextual traits.

- `actions`:  
  A dictionary of discrete, recommendation-style actions the agent can select. Each key is an action ID (e.g. `"eth_0"`), and each value is a human-readable description.

- `reward_config`:  
  Defines the scalar reward shaping used during training. Rewards and penalties reflect fluency, ethical alignment, advisor trust, autonomy, and creativity.

- `action_effects`:  
  Maps agent actions to updates in the simulated state. Each action has associated side-effects that alter one or more student-related variables (e.g., reducing `ai_usage` or increasing `advisor_trust`).

## Usage Example

```python

from rl_config_manager import RLConfigManager

# Load existing configuration or initialize default
config = RLConfigManager.load_config()

# Access specific parts
print("Available state variables:")
for var in config["state_variables"]:
    print("-", var)

In [None]:
# rl_config_manager.py
"""
Adaptive RL Optimizer Configuration Manager

Defines and enforces the JSON schema for the RL-based academic coaching agent’s
centralized configuration. Uses Pydantic for validation and metadata on versioning,
timestamps, reward rationales, and action-to-state effect mappings.
"""
import json
import os
import shutil
from datetime import datetime
from pydantic import BaseModel, Field, field_validator
from typing import Dict, List, Any


# ----------------------------- Schema Definitions -----------------------------


class RewardItem(BaseModel):
    """
    Represents a single reward entry in the multi-objective shaping scheme.

    Attributes:
        value (float):
            The numerical reward (positive or negative) assigned for this signal.
        justification (str):
            A brief rationale (min. 5 characters) explaining why this reward level was chosen.
        risk (str):
            A description (min. 5 characters) of any potential downside or misuse if over-emphasized.
    """
    value: float
    justification: str = Field(..., min_length=5)
    risk: str = Field(..., min_length=5)

class ConfigSchema(BaseModel):
    """
    Full schema for the RL configuration file. Validates metadata, state/action definitions,
    reward shaping, and the mapping from chosen actions to changes in latent state variables.

    Attributes:
        config_version (float):
            Version number of the configuration. Must be ≥ 1.0 to ensure compatibility.
        created_at (str):
            ISO-formatted timestamp marking when this config was generated or last updated.
        state_variables (List[str]):
            Ordered list of state feature names that the RL agent will observe.
        actions (Dict[str, str]):
            Mapping from action keys (e.g., "eth_0") to human-readable labels.
        reward_config (Dict[str, RewardItem]):
            Detailed reward items for each supervision signal, including value, justification, and risk.
        action_effects (Dict[str, Dict[str, float]]):
            Defines how each action modifies one or more state variables, as a delta change.
    """
    config_version: float
    created_at:       str
    state_variables:  List[str]
    actions:          Dict[str, str]
    reward_config:    Dict[str, RewardItem]
    action_effects:   Dict[str, Dict[str, float]]

    @field_validator("config_version")
    @classmethod
    def validate_version(cls, v):
        """
        Ensure the configuration version is compatible with the agent's codebase.

        Args:
            v (float): The version number provided in the JSON.

        Returns:
            float: The same version number, if it meets requirements.

        Raises:
            ValueError: If v < 1.0, indicating an unsupported config format.
        """
        if v < 1.0:
            raise ValueError("Config version must be ≥ 1.0")
        return v
# ----------------------------- Configuration Manager -----------------------------
class ConfigIO:
    """
    Pure I/O: load & save the RL optimizer’s JSON config, with safe versioning and backups.

    - If no config file exists, write out a default config (DEFAULT).
    - On first save(): stamps `created_at`.
    - On every save(): increments `config_version` by +0.1 and makes a timestamped backup of the old file.
    """
    CONFIG_FILE = "rl_config.json"
    BACKUP_DIR  = "config_backups"

    # Default config skeleton—matches your full specification
    DEFAULT: Dict[str, Any] = {
        "config_version": 1.0,
        "created_at":     "",   # filled in when we first save()
        "state_variables": [
            "ai_usage", "ethical_flags", "advisor_trust",
            "thesis_quality", "deadline_ratio", "thesis_difficulty",
            "student_autonomy", "language_proficiency",
            "emotional_state", "creativity_score", "timestep"
        ],
        "actions": {
            "eth_0": "Display ethical reminder",
            "eth_1": "Propose AI restriction",
            "eth_2": "Recommend advisor check-in",
            "eth_3": "Log academic concern",
            "brain_0": "Prompt open-ended reflection",
            "brain_1": "Offer question inversion",
            "brain_2": "Stimulate cross-topic merge",
            "brain_3": "Show novelty heatmap",
            "write_0": "Suggest rewriting section",
            "write_1": "Recommend outline reform",
            "write_2": "Display writing tip",
            "write_3": "Enable feedback loop",
            "emo_0": "Encourage autonomy",
            "emo_1": "Acknowledge deadline stress",
            "emo_2": "Suggest micro-break",
            "emo_3": "Offer motivational boost"
        },
        "reward_config": {
            "fluency_improved": {
                "value": 1.5,
                "justification": "Improves clarity and coherence",
                "risk": "May incentivize style over substance"
            },
            "trust_earned": {
                "value": 2.0,
                "justification": "Advisor feedback acknowledged and used",
                "risk": "May reward form without deep content change"
            },
            "creativity_expressed": {
                "value": 2.5,
                "justification": "Encourages safe novelty and synthesis",
                "risk": "May drift into irrelevant tangents"
            },
            "autonomy_respected": {
                "value": 1.0,
                "justification": "Student took initiative",
                "risk": "Passive neglect might appear as autonomy"
            },
            "ai_dependency_violation": {
                "value": -4.0,
                "justification": "Detected AI overuse",
                "risk": "Could punish legitimate drafting support"
            },
            "ethical_boundary_crossed": {
                "value": -6.0,
                "justification": "Clear breach of academic norms",
                "risk": "Non-compensable — agent must intervene"
            },
            "deadline_panic_detected": {
                "value": -1.0,
                "justification": "Urgency spike detected",
                "risk": "Might suppress productive deadline use"
            },
            "milestone_completed": {
                "value": 5.0,
                "justification": "Goal achieved within scope",
                "risk": "May mask ethics issues if used alone"
            },
            "novel_but_safe": {
                "value": 3.0,
                "justification": "Original idea aligned with context",
                "risk": "Requires semantic checking"
            },
            "supervisor_disappointment": {
                "value": -5.0,
                "justification": "Advisor flags trust breakdown",
                "risk": "Recovery should be possible over time"
            }
        },
        "action_effects": {
            "eth_0": {"ethical_flags": -0.1},
            "eth_1": {"ai_usage": -0.2},
            "eth_2": {"advisor_trust": 0.15},
            "eth_3": {"ethical_flags": 0.2, "advisor_trust": -0.3},
            "brain_0": {"creativity_score": 0.05},
            "brain_1": {"creativity_score": 0.07},
            "brain_2": {"creativity_score": 0.10},
            "brain_3": {"thesis_quality": 0.05},
            "write_0": {"thesis_quality": 0.10},
            "write_1": {"thesis_quality": 0.07},
            "write_2": {"thesis_quality": 0.05},
            "write_3": {"thesis_quality": 0.05, "advisor_trust": 0.1},
            "emo_0": {"student_autonomy": 0.1},
            "emo_1": {"emotional_state": -0.05},
            "emo_2": {"emotional_state": 0.1},
            "emo_3": {"emotional_state": 0.15}
        }
    }

    @classmethod
    def load(cls) -> Dict[str, Any]:
        """
        Load the JSON config from disk. If missing, write DEFAULT → disk then load.
        """
        if not os.path.exists(cls.CONFIG_FILE):
            os.makedirs(cls.BACKUP_DIR, exist_ok=True)
            with open(cls.CONFIG_FILE, "w") as f:
                json.dump(cls.DEFAULT, f, indent=4)
        with open(cls.CONFIG_FILE, "r") as f:
            return json.load(f)

    @classmethod
    def save(cls, config: Dict[str, Any]) -> None:
        """
        Persist `config` back to disk safely:

        1) On first ever save, fill in `created_at`.
        2) Always bump `config_version` by +0.1.
        3) Copy the old file into `config_backups/rl_config_<timestamp>.json`.
        4) Overwrite the main JSON.
        """
        # 1) stamp created_at if missing
        if not config.get("created_at"):
            config["created_at"] = datetime.utcnow().isoformat()

        # 2) bump version
        config["config_version"] = round(config.get("config_version", 1.0) + 0.1, 2)

        # 3) backup old
        if os.path.exists(cls.CONFIG_FILE):
            ts = datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
            os.makedirs(cls.BACKUP_DIR, exist_ok=True)
            shutil.copy(
                cls.CONFIG_FILE,
                os.path.join(cls.BACKUP_DIR, f"rl_config_{ts}.json")
            )

        # 4) write new
        with open(cls.CONFIG_FILE, "w") as f:
            json.dump(config, f, indent=4)

class StateVarManager:
    """CRUD for state_variables through a validated ConfigSchema."""
    def __init__(self, io: ConfigIO):
        self.io    = io
        self.model = ConfigSchema(**io.load())

    def list(self) -> List[str]:
        return self.model.state_variables

    def add(self, var: str):
        if var not in self.model.state_variables:
            self.model.state_variables.append(var)
            self.io.save(self.model.model_dump())

    def remove(self, var: str):
        if var in self.model.state_variables:
            self.model.state_variables.remove(var)
            self.io.save(self.model.model_dump())

class ActionManager:
    """CRUD for actions & their effects, via a validated ConfigSchema."""
    def __init__(self, io: ConfigIO):
        self.io    = io
        self.model = ConfigSchema(**io.load())

    def list(self) -> Dict[str,str]:
        return self.model.actions

    def add(self, key:str, label:str):
        self.model.actions[key] = label
        self.io.save(self.model.model_dump())

    def remove(self, key:str):
        self.model.actions.pop(key, None)
        self.model.action_effects.pop(key, None)
        self.io.save(self.model.model_dump())

    def list_effects(self) -> Dict[str,Dict[str,float]]:
        return self.model.action_effects

    def set_effect(self, action_key:str, var:str, delta:float):
        self.model.action_effects.setdefault(action_key, {})[var] = delta
        self.io.save(self.model.model_dump())

    def remove_effect(self, action_key:str, var:str):
        self.model.action_effects.get(action_key, {}).pop(var, None)
        self.io.save(self.model.model_dump())

class RewardManager:
    """CRUD for reward_config (with full RewardItem metadata)."""
    def __init__(self, io: ConfigIO):
        self.io    = io
        self.model = ConfigSchema(**io.load())

    def list(self) -> Dict[str,RewardItem]:
        return self.model.reward_config

    def set(self, key:str, item:RewardItem):
        self.model.reward_config[key] = item
        self.io.save(self.model.model_dump())

    def remove(self, key:str):
        self.model.reward_config.pop(key, None)
        self.io.save(self.model.model_dump())


# -------------------------- Top-level Façade Class ----------------------------

class RLConfigManager:
    """
    Unified facade: bundles I/O + per-domain managers, all validated
    by ConfigSchema on each save() call.
    """
    def __init__(self):
        self.io   = ConfigIO()
        self.vars = StateVarManager(self.io)
        self.act  = ActionManager(self.io)
        self.rwd  = RewardManager(self.io)

    @classmethod
    def default_config(cls) -> Dict[str, Any]:
        """Returns the default configuration dictionary."""
        return ConfigIO.DEFAULT

    def load_config(self) -> Dict[str, Any]:
        """Loads the configuration from disk and returns it as a dictionary."""
        return self.io.load()

    def save_config(self, config: Dict[str, Any]) -> None:
        """Saves the given configuration dictionary to disk."""
        self.io.save(config)


# ------------------------------ Quick Smoke-test ------------------------------

if __name__=="__main__":
    print("--- Smoke Test ---")
    # Create a temporary configuration object without saving
    temp_io = type('TempConfigIO', (object,), {
        'load': lambda self: ConfigIO.DEFAULT,
        'save': lambda self, config: None # Do nothing on save
    })()
    mgr = RLConfigManager()
    mgr.io = temp_io # Replace the real IO with the temporary one
    mgr.vars = StateVarManager(mgr.io)
    mgr.act = ActionManager(mgr.io)
    mgr.rwd = RewardManager(mgr.io)


    print("\nState Variables:")
    initial_vars = mgr.vars.list()
    print(f"- Initial: {len(initial_vars)} variables")
    mgr.vars.add("time_spent_recent")
    print("- Added: 'time_spent_recent'")
    print(f"- After add: {len(mgr.vars.list())} variables")
    mgr.vars.remove("time_spent_recent")
    print("- Removed: 'time_spent_recent'")
    print(f"- After remove: {len(mgr.vars.list())} variables")


    print("\nActions:")
    initial_actions = mgr.act.list()
    print(f"- Initial: {len(initial_actions)} actions")
    mgr.act.add("tmp_0","Demo action")
    print("- Added: 'tmp_0': 'Demo action'")
    print(f"- With tmp_0: {len(mgr.act.list())} actions")
    mgr.act.remove("tmp_0")
    print("- Removed: 'tmp_0'")
    print(f"- After remove: {len(mgr.act.list())} actions")


    print("\nEffects:")
    initial_effects = mgr.act.list_effects()
    print(f"- Initial effects for 'eth_0': {initial_effects.get('eth_0', {})}")
    mgr.act.set_effect("eth_0","ai_usage",-0.05)
    print("- Set effect for 'eth_0': 'ai_usage' = -0.05")
    print(f"- After set effects for 'eth_0': {mgr.act.list_effects().get('eth_0', {})}")
    mgr.act.remove_effect("eth_0","ai_usage")
    print("- Removed effect for 'eth_0': 'ai_usage'")
    print(f"- After remove effects for 'eth_0': {mgr.act.list_effects().get('eth_0', {})}")


    print("\nRewards:")
    initial_rewards = mgr.rwd.list()
    print(f"- Initial: {len(initial_rewards)} rewards")
    demo = RewardItem(
        value=0.7,
        justification="Demo done",
        risk="minimal potential"
    )
    mgr.rwd.set("demo_reward", demo)
    print(f"- Added 'demo_reward': {demo}")
    print(f"- With demo_reward: {len(mgr.rwd.list())} rewards")
    mgr.rwd.remove("demo_reward")
    print("- Removed: 'demo_reward'")
    print(f"- After remove: {len(mgr.rwd.list())} rewards")


    print("\n--- Smoke Test Complete ---")

--- Smoke Test ---

State Variables:
- Initial: 11 variables
- Added: 'time_spent_recent'
- After add: 12 variables
- Removed: 'time_spent_recent'
- After remove: 11 variables

Actions:
- Initial: 16 actions
- Added: 'tmp_0': 'Demo action'
- With tmp_0: 17 actions
- Removed: 'tmp_0'
- After remove: 16 actions

Effects:
- Initial effects for 'eth_0': {'ethical_flags': -0.1}
- Set effect for 'eth_0': 'ai_usage' = -0.05
- After set effects for 'eth_0': {'ethical_flags': -0.1, 'ai_usage': -0.05}
- Removed effect for 'eth_0': 'ai_usage'
- After remove effects for 'eth_0': {'ethical_flags': -0.1}

Rewards:
- Initial: 10 rewards
- Added 'demo_reward': value=0.7 justification='Demo done' risk='minimal potential'
- With demo_reward: 11 rewards
- Removed: 'demo_reward'
- After remove: 10 rewards

--- Smoke Test Complete ---


In [17]:
# --- This cell contains code intended for two files:
# --- langgraph_policy.py and thesis_modules.py
# --- Clear separators are used for easy splitting later.

# --- START OF FILE: thesis_modules.py ---

# --- MODULE 0: State Definition ---
from typing import TypedDict, List, Dict, Any
# LangGraph and related imports will go in langgraph_policy.py
# from langgraph.graph import StateGraph, END

# Assuming RLConfigManager and RewardItem are defined in a separate file (e.g., rl_config_manager.py)
# from rl_config_manager import RLConfigManager, RewardItem

class CoreState(TypedDict):
    """
    Core state variables for the thesis assistant.

    These variables represent the student’s current status across different
    dimensions relevant to the RL policy.
    """
    stage: str # Current stage of the thesis (e.g., planning, drafting, editing)
    advisor_trust: float # Level of trust from the advisor (e.g., 0.0 to 1.0)
    creativity_score: float # Metric for the novelty and originality of ideas (e.g., 0.0 to 1.0)
    ethical_flags: float # Aggregated score indicating potential ethical concerns (e.g., 0.0 to 1.0, higher is worse)
    ai_usage: float # Metric for the level of AI assistance used (e.g., 0.0 to 1.0)
    thesis_quality: float # Estimated quality of the thesis content (e.g., 0.0 to 1.0)
    deadline_ratio: float # Progress towards the deadline (e.g., 0.0 to 1.0)
    thesis_difficulty: float # Perceived difficulty of the thesis topic/task (e.g., 0.0 to 1.0)
    student_autonomy: float # Level of student self-direction and initiative (e.g., 0.0 to 1.0)
    language_proficiency: float # Assessment of writing and language skills (e.g., 0.0 to 1.0)
    emotional_state: float # Proxy for the student's emotional well-being (e.g., 0.0 to 1.0, higher is better)
    timestep: int # Current simulation timestep

class ThesisState(TypedDict):
    """
    Full state definition for the LangGraph.

    Combines core state variables with policy execution trace, logs,
    and the configuration dictionary.
    """
    core: CoreState # The core state variables
    policy_trace: List[str] # List of action keys executed in sequence
    log: List[str] # Log of messages or events generated by actions
    config: Dict[str, Any] # The loaded RL configuration


# --- MODULE 1: Dummy Action Modules (Intended for thesis_modules.py) ---

# ethics_module.py
def display_eth_warning(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs an ethics warning.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Ethics warning issued.")
    # No core state changes in this dummy
    return state

def propose_ai_restriction(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs a proposed AI restriction.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("AI restriction proposed.")
    # No core state changes in this dummy
    return state

def recommend_advisor_checkin(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs a recommended advisor check-in.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Advisor check-in recommended.")
    # No core state changes in this dummy
    return state

def log_academic_concern(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs an academic concern.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Academic concern logged.")
    # No core state changes in this dummy
    return state


ethics_module_actions = {
    "eth_0": display_eth_warning,
    "eth_1": propose_ai_restriction,
    "eth_2": recommend_advisor_checkin,
    "eth_3": log_academic_concern,
}


# writing_support_module.py
def suggest_rewrite(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs a rewrite suggestion.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Rewrite suggested.")
    # No core state changes in this dummy
    return state

def recommend_outline_reform(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs a recommendation for outline reform.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Outline reform recommended.")
    # No core state changes in this dummy
    return state

def display_writing_tip(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs a writing tip display.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Writing tip displayed.")
    # No core state changes in this dummy
    return state

def enable_feedback_loop(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs enabling feedback loop.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Feedback loop enabled.")
    # No core state changes in this dummy
    return state


writing_support_module_actions = {
    "write_0": suggest_rewrite,
    "write_1": recommend_outline_reform,
    "write_2": display_writing_tip,
    "write_3": enable_feedback_loop,
}

# emotion_support_module.py
def encourage_autonomy(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs encouragement of autonomy.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Autonomy encouraged.")
    # No core state changes in this dummy
    return state

def acknowledge_deadline_stress(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs acknowledging deadline stress.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Deadline stress acknowledged.")
    # No core state changes in this dummy
    return state

def suggest_micro_break(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs suggesting a micro break.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Micro break suggested.")
    # No core state changes in this dummy
    return state

def offer_motivational_boost(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs offering a motivational boost.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Motivational boost offered.")
    # No core state changes in this dummy
    return state

emotion_support_module_actions = {
    "emo_0": encourage_autonomy,
    "emo_1": acknowledge_deadline_stress,
    "emo_2": suggest_micro_break,
    "emo_3": offer_motivational_boost,
}

# idea_brainstorm_module.py
def prompt_open_ended_reflection(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs prompting open-ended reflection.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Open-ended reflection prompted.")
    # No core state changes in this dummy
    return state

def offer_question_inversion(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs offering question inversion.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Question inversion offered.")
    # No core state changes in this dummy
    return state

def stimulate_cross_topic_merge(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs stimulating cross-topic merge.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Cross-topic merge stimulated.")
    # No core state changes in this dummy
    return state

def show_novelty_heatmap(state: ThesisState) -> ThesisState:
    """
    Dummy action: Logs showing novelty heatmap.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        ThesisState: The updated state.
    """
    state["log"].append("Novelty heatmap shown.")
    # No core state changes in this dummy
    return state

idea_brainstorm_module_actions = {
    "brain_0": prompt_open_ended_reflection,
    "brain_1": offer_question_inversion,
    "brain_2": stimulate_cross_topic_merge,
    "brain_3": show_novelty_heatmap,
}


# Combine all actions from modules into a single dictionary
# This dictionary is intended to be imported by langgraph_policy.py
__actions__ = {
    **ethics_module_actions,
    **writing_support_module_actions,
    **emotion_support_module_actions,
    **idea_brainstorm_module_actions,
}

# --- END OF FILE: thesis_modules.py ---


# --- START OF FILE: langgraph_policy.py ---

# Import necessary modules
from langgraph.graph import StateGraph, END
from typing import Dict, Any, List # Import necessary types

# Assuming ThesisState and __actions__ are defined in thesis_modules.py
# from thesis_modules import ThesisState, __actions__

# Assuming RLConfigManager is defined in rl_config_manager.py
# from rl_config_manager import RLConfigManager

# --- MODULE 2: Decorator for Action Effects ---

def apply_action_effects(action_key: str, effects_config: Dict[str, Dict[str, float]]):
    """
    Decorator to apply action effects to the state after an action function executes.

    This decorator wraps an action function. After the original action function
    updates the state (e.g., logs a message), this decorator applies the
    numerical changes to the core state variables as defined in the
    `action_effects` section of the configuration for the given `action_key`.
    It also appends the executed action's key to the `policy_trace`.

    Args:
        action_key (str): The key of the action being executed (e.g., "eth_0").
        effects_config (Dict[str, Dict[str, float]]): A dictionary mapping
            action keys to dictionaries of state variable changes (deltas).

    Returns:
        Callable: The decorated action function.
    """
    def decorator(fn):
        def wrapped(state: ThesisState) -> ThesisState:
            # Execute the original action function
            state = fn(state)

            # Apply action effects if defined for the action_key
            delta = effects_config.get(action_key, {})
            for key, val in delta.items():
                # Ensure the state variable exists in core state before applying effect
                if key in state["core"]:
                    state["core"][key] += val
                # Optional: Add logging for applied effects
                # state["log"].append(f"Applied effect for {action_key}: {key} changed by {val:+}")

            # Append the action to the policy trace
            state["policy_trace"].append(action_key)

            # Increment timestep after each action
            state["core"]["timestep"] += 1

            return state
        return wrapped
    return decorator




In [18]:
# --- MODULE 3: LangGraph Policy Builder ---

def build_policy_graph(config: Dict[str, Any]) -> StateGraph:
    """
    Builds the LangGraph policy from the loaded configuration and action modules.

    This function constructs the LangGraph dynamically. It adds nodes for each
    action defined in the config (and available in `__actions__`), applies
    the `apply_action_effects` decorator to each action node, and sets up
    conditional edges based on the `route_action` function and the
    `action_transitions` from the configuration.

    Args:
        config (Dict[str, Any]): The loaded configuration dictionary,
            expected to contain "actions", "action_effects", and "action_transitions".

    Returns:
        StateGraph: The compiled LangGraph ready for invocation.
    """
    graph = StateGraph(ThesisState)

    action_effects = config.get("action_effects", {})
    action_transitions = config.get("action_transitions", {})
    actions = config.get("actions", {})

    # Add nodes for each action defined in the combined actions dictionary
    # Only add nodes for actions that are present in the config's "actions"
    for action_key, action_fn in __actions__.items():
        if action_key in actions:
             # Wrap the action function with the apply_action_effects decorator
             decorated_action_fn = apply_action_effects(action_key, action_effects)(action_fn)
             graph.add_node(action_key, decorated_action_fn)

    # Add the END node, which signifies the termination of a policy rollout
    graph.add_node("END", lambda state: state) # END node doesn't modify state


    # Add conditional edges from each action node based on the routing function
    # After an action node executes, it transitions to the next node determined by route_action.
    for action_key in actions.keys():
         graph.add_conditional_edges(
             action_key, # The node to route FROM (an action node)
             route_action, # The function that determines the next node
             # No mapping needed here; route_action returns the target node name directly.
             # LangGraph uses the string returned by route_action as the next node name.
         )

    # Define a special entry point node that immediately routes
    graph.add_node("start_node", lambda state: state) # Dummy node, doesn't modify state

    # Set the entry point to the dummy start node
    graph.set_entry_point("start_node")

    # Add a conditional edge from the start node using the route_action function
    # This will determine the first actual action to take based on the initial state
    graph.add_conditional_edges(
        "start_node", # Route from the dummy start node
        route_action,  # Use route_action to determine the next node
    )

    return graph.compile()

# --- MODULE 4: Routing Function ---

def route_action(state: ThesisState) -> str:
    """
    Determines the next action based on the current state and action transitions in the config.

    This function serves as the policy's core logic. It looks at the current
    state (specifically the `policy_trace` to find the last action) and the
    `action_transitions` defined in the config to decide the next action.
    In a real RL scenario, this would be replaced by a trained policy model
    that takes the state as input and outputs the next action.

    Args:
        state (ThesisState): The current state of the LangGraph.

    Returns:
        str: The key of the next action node to transition to, or "END" to stop.
    """
    config = state.get("config", {})
    action_transitions = config.get("action_transitions", {})
    actions = config.get("actions", {})
    policy_trace = state.get("policy_trace", [])

    # If no actions have been taken yet (first step), determine the starting action.
    # For this example, we'll just pick the first action key from the config.
    if not policy_trace:
        action_keys = list(actions.keys())
        return action_keys[0] if action_keys else END

    # Get the key of the last action that was executed.
    last_action = policy_trace[-1]

    # Look up the transitions defined for the last action in the config.
    if last_action in action_transitions:
        transitions = action_transitions[last_action]
        # Assuming a simple transition structure like {"next": "next_action_key"}
        # In a more complex scenario, transitions could be conditional based on state.
        next_node = transitions.get("next", END)
        return next_node
    else:
        # If no transition is defined for the last action, terminate the graph.
        return END

# --- END OF FILE: langgraph_policy.py ---


# --- MODULE 5: Simulate One Run (Demo) ---

if __name__ == "__main__":
    print("--- LangGraph Policy Demo ---")

    # Assuming RLConfigManager is available (from the previous cell)
    # from rl_config_manager import RLConfigManager

    # Load configuration using the RLConfigManager
    try:
        mgr = RLConfigManager()
        config = mgr.load_config()
        print("✅ Configuration loaded.")

        # Ensure action_transitions exist for the demo, add if not
        if "action_transitions" not in config:
            config["action_transitions"] = {
                "eth_0": {"next": "write_0"},
                "write_0": {"next": "emo_0"}, # Added another transition
                "emo_0": {"next": "END"}    # Added transition to END
            }
            print("ℹ️ Added default action_transitions for demo.")
            # Note: This change is only in the 'config' dictionary in memory for the demo.
            # To make it persistent, you would need to call mgr.save_config(config)

    except Exception as e:
        print(f"❌ Failed to load configuration: {e}")
        # Fallback to a basic default config if loading fails
        config = {
            "actions": {
                "dummy_action_1": "A first dummy action",
                "dummy_action_2": "A second dummy action",
            },
            "action_effects": {},
            "action_transitions": {
                "dummy_action_1": {"next": "dummy_action_2"},
                "dummy_action_2": {"next": "END"},
            },
            "state_variables": [],
             "reward_config": {},
        }
        print("⚠️ Using fallback configuration.")


    # Define the initial state for the simulation
    initial_state: ThesisState = {
        "core": {
            "stage": "planning",
            "advisor_trust": 0.5,
            "creativity_score": 0.5,
            "ethical_flags": 0.5,
            "ai_usage": 0.5,
            "thesis_quality": 0.5,
            "deadline_ratio": 0.0, # Start at the beginning
            "thesis_difficulty": 0.5,
            "student_autonomy": 0.5,
            "language_proficiency": 0.5,
            "emotional_state": 0.5,
            "timestep": 0,
        },
        "policy_trace": [], # Start with an empty trace
        "log": [], # Start with an empty log
        "config": config # Pass the loaded config to the state
    }

    print("\nInitial State:", initial_state["core"])
    print("Policy Trace:", initial_state["policy_trace"])
    print("Log:", initial_state["log"])


    # Build and run the LangGraph
    try:
        thesis_graph = build_policy_graph(config)
        print("\n✅ LangGraph built.")

        # Invoke the graph to run the policy
        # The graph will start at the entry point ('start_node'),
        # which will immediately route based on route_action and the initial state.
        print("\nRunning LangGraph...")
        final_state = thesis_graph.invoke(initial_state)
        print("✅ LangGraph execution finished.")

        # Inspect the final state
        print("\n--- Final State ---")
        print("Trace:", final_state["policy_trace"])
        print("Log:", final_state["log"])
        print("Final core state:", final_state["core"])

    except Exception as e:
        print(f"\n❌ An error occurred during LangGraph execution: {e}")


    print("\n--- Demo Complete ---")

--- LangGraph Policy Demo ---
✅ Configuration loaded.
ℹ️ Added default action_transitions for demo.

Initial State: {'stage': 'planning', 'advisor_trust': 0.5, 'creativity_score': 0.5, 'ethical_flags': 0.5, 'ai_usage': 0.5, 'thesis_quality': 0.5, 'deadline_ratio': 0.0, 'thesis_difficulty': 0.5, 'student_autonomy': 0.5, 'language_proficiency': 0.5, 'emotional_state': 0.5, 'timestep': 0}
Policy Trace: []
Log: []

✅ LangGraph built.

Running LangGraph...
✅ LangGraph execution finished.

--- Final State ---
Trace: ['eth_0', 'write_0', 'emo_0']
Final core state: {'stage': 'planning', 'advisor_trust': 0.5, 'creativity_score': 0.5, 'ethical_flags': 0.4, 'ai_usage': 0.5, 'thesis_quality': 0.6, 'deadline_ratio': 0.0, 'thesis_difficulty': 0.5, 'student_autonomy': 0.6, 'language_proficiency': 0.5, 'emotional_state': 0.5, 'timestep': 3}

--- Demo Complete ---
