<a href="https://colab.research.google.com/github/hanjiadong0/chatbot-/blob/RL/rl_components.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Adaptive RL Optimizer for Human-in-the-Loop Academic Coaching

This project implements a reinforcement learning (RL) optimizer designed to model adaptive, ethically aligned coaching within the thesis-writing process. The RL agent learns to recommend helpful actions while balancing academic integrity, advisor trust, writing fluency, and student autonomy.

While real-world RLHF systems remain challenging to deploy in this domain, this controlled simulation framework enables principled experiments in policy optimization for education-support agents.

## System Overview

The optimizer operates within a configurable environment consisting of ten modular components:

- Configuration Manager  
- Developer Dashboard  
- Data Preprocessor  
- RL Environment  
- PPO Supervisor  
- Continual Training Loop  
- Synthetic Thesis Student Simulator  
- Multi-Student Cohort Generator  
- Pretraining Pipeline  
- RL Training Launcher

These components collectively support the training, evaluation, and logging of a human-centered RL agent with transparent behavior and tunable oversight.

## Agent Architecture

The RL agent is trained using Proximal Policy Optimization (PPO), with design objectives that include:

- Responding adaptively to different student profiles
- Encouraging ethical writing behavior
- Supporting reflection, creativity, and revision
- Managing deadline pressure and advisor feedback
- Preserving student agency under imperfect supervision

The optimizer interacts with synthetic students and advisor models across iterative training loops.
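
For readers new to PPO, the per-sample clipped surrogate objective it optimizes can be sketched in a few lines. This is the standard textbook formulation, not code from this repository; the function name and the default `clip_eps=0.2` are illustrative assumptions.

```python
def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` is the advantage estimate.
    """
    # Clamp the probability ratio into [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to move the policy too far in one update.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio above `1 + clip_eps` earns no extra objective value, which keeps each policy update conservative.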

## State Representation

At each time step, the RL agent observes a flattened vector encoding ethical behavior, academic status, and student progression. All features are numerically encoded. Categorical variables (e.g., thesis stage) are one-hot encoded.

### State Features

**Ethics & AI Use**

- `ai_usage`: Normalized recent AI tool usage (e.g., 0.0 to 1.0)
- `ethical_flags`: Accumulated binary violations or concerns
- `advisor_trust`: A float score estimating advisor confidence in student behavior

**Writing Progress**

- `thesis_quality`: Noisy score from GPT-based heuristic (0–1), with injected randomness
- `deadline_ratio`: Time remaining until key deadline (normalized between 0 and 1)
- `thesis_difficulty`: A float indicating complexity (e.g., based on topic category or prior workload)
- `timestep`: Incrementing integer that marks time steps across one episode

**Student Traits & Meta-State**

- `student_autonomy`: Estimate of student independence (inferred from action compliance)
- `language_proficiency`: Static or adaptive float reflecting writing fluency
- `emotional_state`: Real-time estimate of cognitive/emotional strain (e.g., from prompt rejection or overuse)
- `creativity_score`: Proxy for originality, possibly increased by divergent prompts

All features are normalized before being passed to the agent's policy network.
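
A minimal sketch of how such an observation vector might be assembled is shown below. The feature order follows the default `state_variables` list in this notebook; the thesis-stage categories, `MAX_TIMESTEP`, and the clipping to `[0, 1]` are assumptions made for illustration, not the repository's actual preprocessing.

```python
from typing import Dict, List

# Feature order taken from the default `state_variables` list in this notebook.
STATE_VARIABLES: List[str] = [
    "ai_usage", "ethical_flags", "advisor_trust",
    "thesis_quality", "deadline_ratio", "thesis_difficulty",
    "student_autonomy", "language_proficiency",
    "emotional_state", "creativity_score", "timestep",
]

# Hypothetical categorical feature; these stage names are assumptions.
THESIS_STAGES: List[str] = ["proposal", "drafting", "revision", "submission"]

MAX_TIMESTEP = 100  # assumed episode length, used only to scale `timestep`


def one_hot(value: str, categories: List[str]) -> List[float]:
    """Return a one-hot list with 1.0 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def build_observation(state: Dict[str, float], thesis_stage: str) -> List[float]:
    """Flatten the state dict into a normalized vector plus a one-hot stage."""
    obs = []
    for name in STATE_VARIABLES:
        raw = float(state.get(name, 0.0))  # missing features default to 0.0
        if name == "timestep":
            raw = raw / MAX_TIMESTEP  # scale the step counter into [0, 1]
        obs.append(min(1.0, max(0.0, raw)))  # clip every feature to [0, 1]
    return obs + one_hot(thesis_stage, THESIS_STAGES)
```

Under these assumptions the vector has 11 numeric entries followed by 4 one-hot entries, so a policy network would see 15 inputs.
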

## Action Space

The agent selects from a finite set of educational suggestions, not direct commands. Actions are grouped into modules:

- **Ethics**: issue reminders, recommend AI restraint, log concerns
- **Writing**: suggest rewriting, outline reform, style tips
- **Cognition**: trigger reflection, novelty prompts
- **Emotion**: acknowledge stress, encourage rest or autonomy

All actions are defined in configuration files and mapped to effects on student state variables.
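
Because a discrete policy outputs an integer, the environment needs a stable mapping from that integer to an action ID. The sketch below shows one way to do this using a subset of the default action IDs; `decode_action` and `module_of` are hypothetical helper names, not functions from this repository.

```python
from typing import Dict, List

# A subset of the action IDs from the default configuration, for illustration.
ACTIONS: Dict[str, str] = {
    "eth_0": "Display ethical reminder",
    "brain_0": "Prompt open-ended reflection",
    "write_0": "Suggest rewriting section",
    "emo_0": "Encourage autonomy",
}

# Dicts preserve insertion order, so this index-to-ID mapping is deterministic.
ACTION_IDS: List[str] = list(ACTIONS)


def decode_action(index: int) -> str:
    """Map the integer chosen by a discrete policy to its action ID."""
    return ACTION_IDS[index]


def module_of(action_id: str) -> str:
    """Return the module prefix of an action ID, e.g. 'eth' for 'eth_0'."""
    return action_id.rsplit("_", 1)[0]
```

Keeping the module in the ID prefix makes it straightforward to group logged actions by Ethics, Writing, Cognition, or Emotion.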

## Reward Structure

The reward function combines multi-objective signals for advisor trust, writing fluency, creativity, ethics, and progress. Sample rewards, with the corresponding key in the default `reward_config`:

- `+2.5` for safe, original idea generation (`creativity_expressed`)
- `-6.0` for breaching academic integrity (`ethical_boundary_crossed`)
- `+1.5` for demonstrable writing improvement (`fluency_improved`)
- `-5.0` for supervisor disappointment (`supervisor_disappointment`)
- `+1.0` for supporting autonomous behavior (`autonomy_respected`)
- `-1.0` for shortcut behaviors under time pressure (`deadline_panic_detected`)

The reward uses dual-critic logic: a strict reviewer penalizes flaws, while a supportive advisor rewards learning and repair. Some penalties (e.g., ethics violations) are treated as non-compensable.
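
One possible reading of "non-compensable" is that positive rewards earned in the same step cannot offset the penalty. The sketch below implements that interpretation; the reward values come from the default configuration, while `NON_COMPENSABLE` and `step_reward` are assumptions introduced here for illustration.

```python
from typing import Dict, Iterable

# Values copied from the default `reward_config` in this notebook.
REWARD_VALUES: Dict[str, float] = {
    "fluency_improved": 1.5,
    "creativity_expressed": 2.5,
    "autonomy_respected": 1.0,
    "ethical_boundary_crossed": -6.0,
    "supervisor_disappointment": -5.0,
}

# Assumption: which events are treated as non-compensable.
NON_COMPENSABLE = {"ethical_boundary_crossed"}


def step_reward(events: Iterable[str]) -> float:
    """Sum event rewards; a non-compensable event voids all positive terms."""
    events = list(events)
    if any(e in NON_COMPENSABLE for e in events):
        # Keep only the penalties so that bonuses cannot offset the violation.
        return sum(min(0.0, REWARD_VALUES[e]) for e in events)
    return sum(REWARD_VALUES[e] for e in events)
```

For example, a step with both `creativity_expressed` and `ethical_boundary_crossed` yields `-6.0` rather than `-3.5`.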

## Environment Dynamics

The training environment simulates student behavior, feedback lags, and psychological drift over time. Features include:

- Trust hysteresis after misconduct
- Ethical decay near deadlines
- Noisy quality assessments using GPT-like scoring
- Delayed feedback from advisor
- Emotionally reactive student models

Synthetic students are parameterized by risk profiles, engagement levels, and writing styles.
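
Trust hysteresis can be modeled as an asymmetric update: trust falls at full speed but recovers more slowly once misconduct has been observed. The following is a sketch of that idea only; `update_trust`, `recovery_factor`, and the `[0, 1]` trust range are assumptions rather than the simulator's actual dynamics.

```python
def update_trust(trust: float, delta: float, misconduct_seen: bool,
                 recovery_factor: float = 0.5) -> float:
    """Apply a trust change, damping gains after misconduct (hysteresis).

    `recovery_factor` is a hypothetical parameter: the fraction of any
    positive change that is kept once misconduct has been observed.
    """
    if delta > 0 and misconduct_seen:
        delta *= recovery_factor  # trust is regained more slowly than it is lost
    # Keep trust inside the assumed [0, 1] range.
    return min(1.0, max(0.0, trust + delta))
```

Losses are never damped, so a student with a misconduct history needs more positive steps to return to the same trust level.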

## Logging and Auditability

All agent actions and outcomes are logged with timestamps, state snapshots, and reward signals. Policies are checkpointed for reproducibility. A trained policy can also be distilled into a simpler model for post-hoc interpretability.
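
A step log of this kind is often written as JSON Lines, one record per step. The sketch below assumes that format; `log_step` and the record field names are hypothetical and do not reflect the repository's actual logging schema.

```python
import io
import json
import time
from typing import Any, Dict, TextIO


def log_step(stream: TextIO, state: Dict[str, float],
             action_id: str, reward: float) -> Dict[str, Any]:
    """Append one step record to `stream` as a single JSON line."""
    record = {
        "timestamp": time.time(),  # seconds since the epoch
        "state": dict(state),      # copy, so later mutation cannot alter the log
        "action": action_id,
        "reward": reward,
    }
    stream.write(json.dumps(record) + "\n")
    return record


# Usage: an in-memory buffer stands in for a log file here.
buffer = io.StringIO()
log_step(buffer, {"ai_usage": 0.4, "advisor_trust": 0.6}, "eth_1", 0.0)
```

One record per line keeps the log appendable and easy to load later for audits or replay.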

## Safety Constraints

- Hard constraints on unethical behaviors
- Configurable reward and action-effect maps
- Fully observable logs for intervention or override
- Human-in-the-loop architecture
- Version control for agent updates

## Future Directions

This framework supports further experimentation in:

- Dynamic curriculum modeling
- Longitudinal adaptation via student memory
- Real-time tutor feedback loop injection
- Variational policies for student diversity
- Integrating real writing samples (with consent and anonymization)

## Citation

Please cite this repository if used for research or prototyping in RL for education or AI tutoring systems.

## Contact

For development inquiries, research collaboration, or code review, contact the maintainer.


In [1]:
!pip install streamlit gymnasium stable-baselines3
!pip install "pydantic>=2" # The config schema below uses the Pydantic v2 API (field_validator)
!pip install numpy pandas scipy # Ensure these are installed if not already

Collecting streamlit
  Downloading streamlit-1.46.1-py3-none-any.whl.metadata (9.0 kB)
Collecting stable-baselines3
  Downloading stable_baselines3-2.6.0-py3-none-any.whl.metadata (4.8 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit)
  Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl.metadata (44 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3.0,>=2.3->stable-basel

# RL Configuration Manager

This module provides a centralized configuration interface for the RL-based educational coaching system. It manages all static settings required by the agent and environment to interpret observations, execute actions, and compute rewards.

The configuration is saved and loaded from a JSON file (`rl_config.json`). If the file does not exist, a default configuration is created automatically. This ensures a reproducible setup and simplifies testing across modules.

## Configuration Structure

- `state_variables`:  
  Defines the numerical features observed by the RL agent at each time step. These features represent the student’s academic state, ethical posture, writing progression, and contextual traits.

- `actions`:  
  A dictionary of discrete, recommendation-style actions the agent can select. Each key is an action ID (e.g. `"eth_0"`), and each value is a human-readable description.

- `reward_config`:  
  Defines the scalar reward shaping used during training. Rewards and penalties reflect fluency, ethical alignment, advisor trust, autonomy, and creativity.

- `action_effects`:  
  Maps agent actions to updates in the simulated state. Each action has associated side-effects that alter one or more student-related variables (e.g., reducing `ai_usage` or increasing `advisor_trust`).
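
A minimal sketch of how an `action_effects` entry might be applied to the simulated state follows. The deltas match the default `eth_3` entry in this notebook; `apply_action_effects` and the clipping to `[0, 1]` are assumptions, since the environment's actual update rule is not shown here.

```python
from typing import Dict


def apply_action_effects(state: Dict[str, float],
                         effects: Dict[str, float]) -> Dict[str, float]:
    """Return a new state with each effect delta added and clipped to [0, 1]."""
    new_state = dict(state)  # leave the caller's state untouched
    for name, delta in effects.items():
        updated = new_state.get(name, 0.0) + delta
        new_state[name] = min(1.0, max(0.0, updated))  # assumed value range
    return new_state


# Usage with the default effects of "eth_3" (Log academic concern).
before = {"ethical_flags": 0.1, "advisor_trust": 0.5}
after = apply_action_effects(before, {"ethical_flags": 0.2, "advisor_trust": -0.3})
```

Returning a copy rather than mutating in place makes it easier to log before-and-after snapshots for each step.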

## Usage Example

# Load existing configuration or initialize default
config = RLConfigManager.load_config()

# Access specific parts
print("Available state variables:")
for var in config["state_variables"]:
    print("-", var)

In [11]:
import json
import os
from datetime import datetime
from pydantic import BaseModel, Field, field_validator
from typing import Any, Dict  # Any is used in the RLConfigManager type hints


# ----------------------------- Schema Definitions -----------------------------


class RewardItem(BaseModel):
    value: float
    justification: str = Field(..., min_length=5)
    risk: str = Field(..., min_length=5)

class ConfigSchema(BaseModel):
    config_version: float
    created_at: str
    state_variables: list[str]
    actions: Dict[str, str]
    reward_config: Dict[str, RewardItem]
    action_effects: Dict[str, Dict[str, float]]

    @field_validator("config_version")
    @classmethod
    def validate_version(cls, v):
        if v < 1.0:
            raise ValueError("Config version must be ≥ 1.0")
        return v


# ----------------------------- Configuration Manager -----------------------------

class RLConfigManager:
    CONFIG_FILE = "rl_config.json"

    @classmethod
    def load_config(cls) -> Dict[str, Any]:
        """Load the config from disk, creating the default file if it is missing."""
        if not os.path.exists(cls.CONFIG_FILE):
            cls.save_config(cls.default_config())

        with open(cls.CONFIG_FILE, "r") as f:
            raw = json.load(f)

        # Validate against the schema; raises pydantic.ValidationError on bad input.
        validated = ConfigSchema(**raw)
        # model_dump() replaces the deprecated .dict() in Pydantic v2.
        return validated.model_dump()

    @classmethod
    def save_config(cls, config: Dict[str, Any]) -> None:
        """Validate the config, back up any existing file, then write to disk."""
        config["created_at"] = datetime.utcnow().isoformat()
        config["config_version"] = config.get("config_version", 1.0)

        # Validate first so an invalid config never touches the file or its backup.
        validated = ConfigSchema(**config)

        # Back up the current config before overwriting it.
        if os.path.exists(cls.CONFIG_FILE):
            with open(cls.CONFIG_FILE, "r") as old:
                with open(cls.CONFIG_FILE + ".backup", "w") as bak:
                    bak.write(old.read())

        with open(cls.CONFIG_FILE, "w") as f:
            json.dump(validated.model_dump(), f, indent=4)

    @classmethod
    def default_config(cls) -> Dict[str, Any]:
        return {
            "config_version": 1.0,
            "created_at": "",
            "state_variables": [
                "ai_usage", "ethical_flags", "advisor_trust",
                "thesis_quality", "deadline_ratio", "thesis_difficulty",
                "student_autonomy", "language_proficiency",
                "emotional_state", "creativity_score", "timestep"
            ],
            "actions": {
                "eth_0": "Display ethical reminder",
                "eth_1": "Propose AI restriction",
                "eth_2": "Recommend advisor check-in",
                "eth_3": "Log academic concern",
                "brain_0": "Prompt open-ended reflection",
                "brain_1": "Offer question inversion",
                "brain_2": "Stimulate cross-topic merge",
                "brain_3": "Show novelty heatmap",
                "write_0": "Suggest rewriting section",
                "write_1": "Recommend outline reform",
                "write_2": "Display writing tip",
                "write_3": "Enable feedback loop",
                "emo_0": "Encourage autonomy",
                "emo_1": "Acknowledge deadline stress",
                "emo_2": "Suggest micro-break",
                "emo_3": "Offer motivational boost"
            },
            "reward_config": {
                "fluency_improved": {
                    "value": 1.5,
                    "justification": "Improves clarity and coherence",
                    "risk": "May incentivize style over substance"
                },
                "trust_earned": {
                    "value": 2.0,
                    "justification": "Advisor feedback acknowledged and used",
                    "risk": "May reward form without deep content change"
                },
                "creativity_expressed": {
                    "value": 2.5,
                    "justification": "Encourages safe novelty and synthesis",
                    "risk": "May drift into irrelevant tangents"
                },
                "autonomy_respected": {
                    "value": 1.0,
                    "justification": "Student took initiative",
                    "risk": "Passive neglect might appear as autonomy"
                },
                "ai_dependency_violation": {
                    "value": -4.0,
                    "justification": "Detected AI overuse",
                    "risk": "Could punish legitimate drafting support"
                },
                "ethical_boundary_crossed": {
                    "value": -6.0,
                    "justification": "Clear breach of academic norms",
                    "risk": "Non-compensable — agent must intervene"
                },
                "deadline_panic_detected": {
                    "value": -1.0,
                    "justification": "Urgency spike detected",
                    "risk": "Might suppress productive deadline use"
                },
                "milestone_completed": {
                    "value": 5.0,
                    "justification": "Goal achieved within scope",
                    "risk": "May mask ethics issues if used alone"
                },
                "novel_but_safe": {
                    "value": 3.0,
                    "justification": "Original idea aligned with context",
                    "risk": "Requires semantic checking"
                },
                "supervisor_disappointment": {
                    "value": -5.0,
                    "justification": "Advisor flags trust breakdown",
                    "risk": "Recovery should be possible over time"
                }
            },
            "action_effects": {
                "eth_0": {"ethical_flags": -0.1},
                "eth_1": {"ai_usage": -0.2},
                "eth_2": {"advisor_trust": 0.15},
                "eth_3": {"ethical_flags": +0.2, "advisor_trust": -0.3},
                "brain_0": {"creativity_score": 0.05},
                "brain_1": {"creativity_score": 0.07},
                "brain_2": {"creativity_score": 0.1},
                "brain_3": {"thesis_quality": 0.05},
                "write_0": {"thesis_quality": 0.1},
                "write_1": {"thesis_quality": 0.07},
                "write_2": {"thesis_quality": 0.05},
                "write_3": {"thesis_quality": 0.05, "advisor_trust": 0.1},
                "emo_0": {"student_autonomy": 0.1},
                "emo_1": {"emotional_state": -0.05},
                "emo_2": {"emotional_state": 0.1},
                "emo_3": {"emotional_state": 0.15}
            }
        }


In [17]:
def reset_config():
    print("⚠️ Resetting to default configuration...")
    default = RLConfigManager.default_config()
    RLConfigManager.save_config(default)
    print("✅ Default configuration saved.")


if __name__ == "__main__":
    reset_config()

    config = RLConfigManager.load_config()

    print("\nCONFIG VERSION:", config["config_version"])
    print("CREATED AT:", config["created_at"])

    print("\nSTATE VARIABLES:")
    for var in config["state_variables"]:
        print(f" - {var}")

    print("\nACTIONS:")
    for action_id, description in config["actions"].items():
        print(f" {action_id}: {description}")

    print("\nREWARD CONFIGURATION:")
    for key, reward in config["reward_config"].items():
        print(f" {key}: value={reward['value']} | reason: {reward['justification']}")

    sample = "brain_2"
    print(f"\nACTION EFFECTS for '{sample}':")
    if sample in config["action_effects"]:
        for k, v in config["action_effects"][sample].items():
            print(f"  - {k}: {v:+}")
    else:
        print("  Not found.")

    print("\n✅ Configuration loaded and validated successfully.")



⚠️ Resetting to default configuration...
✅ Default configuration saved.

CONFIG VERSION: 1.0
CREATED AT: 2025-06-28T20:35:13.375654

STATE VARIABLES:
 - ai_usage
 - ethical_flags
 - advisor_trust
 - thesis_quality
 - deadline_ratio
 - thesis_difficulty
 - student_autonomy
 - language_proficiency
 - emotional_state
 - creativity_score
 - timestep

ACTIONS:
 eth_0: Display ethical reminder
 eth_1: Propose AI restriction
 eth_2: Recommend advisor check-in
 eth_3: Log academic concern
 brain_0: Prompt open-ended reflection
 brain_1: Offer question inversion
 brain_2: Stimulate cross-topic merge
 brain_3: Show novelty heatmap
 write_0: Suggest rewriting section
 write_1: Recommend outline reform
 write_2: Display writing tip
 write_3: Enable feedback loop
 emo_0: Encourage autonomy
 emo_1: Acknowledge deadline stress
 emo_2: Suggest micro-break
 emo_3: Offer motivational boost

REWARD CONFIGURATION:
 fluency_improved: value=1.5 | reason: Improves clarity and coherence
 trust_earned: value=2

