
# Task
Program the ethics module for the thesis assistant project based on the provided structure, focusing on authorship tracking, AI labelling, and human-in-the-loop prompts, orchestrated by an EthicsSupervisor Agent and utilizing the defined submodules (AI_Detector, Usage_Logger, HumanPromptChecker, AdvisorFeedbackSync, EthicalViolationAlert) and interfaces.

## Define the scope of the ethics module

### Subtask:
Clarify what aspects of ethics the module should address within the context of your thesis killer project, focusing on authorship tracking, AI labelling, and human-in-the-loop prompts, orchestrated by an EthicsSupervisor Agent.


**Reasoning**:
Describe the ethical concerns, explain the relation of authorship tracking, AI labelling, and human-in-the-loop prompts, outline the responsibilities of the EthicsSupervisor Agent, and briefly explain the contribution of the submodules, as per the instructions.



# Task
Program the ethics module for your thesis killer project based on the provided description, including the EthicsSupervisor Agent implemented as a Reinforcement Learning monitor, the specified submodules (AI_Detector, Usage_Logger, HumanPromptChecker, AdvisorFeedbackSync, EthicalViolationAlert), and the interfaces for connecting to external tools and logging information.

## Define the scope of the ethics module

### Subtask:
Clarify what aspects of ethics the module should address within the context of your thesis killer project, focusing on authorship tracking, AI labelling, and human-in-the-loop prompts, orchestrated by an EthicsSupervisor Agent implemented as a Reinforcement Learning monitor.


**Reasoning**:
Describe the ethical concerns, explain the relation of authorship tracking, AI labelling, and human-in-the-loop prompts, outline the responsibilities of the EthicsSupervisor Agent, and briefly explain the contribution of the submodules, as per the instructions.



## Identify relevant ethical guidelines or frameworks

### Subtask:
Research and select appropriate ethical principles or frameworks applicable to your project's domain, considering how they relate to the functions of the defined submodules and how these can be translated into states, actions, and reward signals for the RL-based EthicsSupervisor.

**Reasoning**:
Selecting appropriate ethical guidelines is crucial for ensuring the ethics module effectively addresses the challenges identified in the project's presentation. These guidelines will inform the design and implementation of the EthicsSupervisor and its submodules, as well as the definition of states, actions, and reward signals for the Reinforcement Learning approach.

## Design the module's structure

### Subtask:
Outline the components and functionalities of the ethics module, with the EthicsSupervisor Agent implemented as a Reinforcement Learning monitor that has access to agent decisions, user responses, LLM-generated content, timing logs, and human feedback loops. Define the roles and interactions of the submodules (AI_Detector, Usage_Logger, HumanPromptChecker, AdvisorFeedbackSync, and EthicalViolationAlert) and how they will utilize interfaces to connect with tools like GPTZero, Copyleaks, or custom DetectGPT, log timestamps, usage intent, and tool confidence, and use rules/classifiers for warnings and suggestions. Design how the information from these submodules and interfaces will be used as state, action, and reward signals for the RL model.

**Reasoning**:
A well-defined structure is essential for implementing a complex module like the ethics module. Clearly outlining the roles and interactions of the EthicsSupervisor, submodules, and interfaces, and specifically designing how information will be used for the RL model, will ensure a cohesive and functional design that addresses the identified ethical challenges.


In [None]:
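from google.colab import userdata
from smolagents import OpenAIServerModel

# Read the Gemini API key from Colab Secrets instead of hard-coding it in the notebook.
# "GEMINI_API_KEY" is an assumed secret name; use the name the key is stored under.
api_key = userdata.get("GEMINI_API_KEY")
model = OpenAIServerModel(
    model_id="gemini-2.0-flash",  # Gemini model served through the OpenAI-compatible endpoint
    api_base="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=api_key,
)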

In [None]:


# Import the OpenAI library
from openai import OpenAI
# Colab Secrets keeps the credentials out of the notebook source
from google.colab import userdata

# Read the API key, organization, and project ID from Colab Secrets
openai_api_key_secure = userdata.get('OPENAI_API_KEY')
openai_organization = userdata.get('OPENAI_ORGANIZATION')
openai_project = userdata.get('OPENAI_PROJECT_ID')

# Create the OpenAI client. The credentials are passed to the constructor
# rather than assigned as attributes on the OpenAI class.
client = OpenAI(
    api_key=openai_api_key_secure,
    organization=openai_organization,
    project=openai_project,
)



In [None]:
# Create a request to the Chat Completions endpoint
response = client.chat.completions.create(
  # Specify the model
  model="gpt-4o-mini",
  messages=[
    # Assign the correct role
    {"role": "user",
     "content": "Write a polite reply accepting an AI Engineer job offer within 20 words."}]
)

print(response.choices[0].message.content)


Subject: Acceptance of Job Offer

Dear [Hiring Manager's Name],

I am thrilled to accept the AI Engineer position. Thank you for this opportunity!

Best regards,  
[Your Name]


In [None]:
import time  # used for a short pause after each detection call
import pandas as pd
from scipy.spatial import distance  # cosine distance between embeddings; requires scipy

class EthicsModule:
    def __init__(self, openai_client):
        self.usage_logs = []
        self.usage_embeddings = [] # Initialize list to store embeddings
        self.ai_detection_threshold = 0.7 # Simple threshold for AI detection
        # Use the provided OpenAI client
        self.client = openai_client


    def log_usage(self, prompt, intent, thesis_stage="unknown"):
        """Logs the usage of the thesis assistant with more details."""
        log_entry = {
            'timestamp': pd.Timestamp.now(),
            'prompt': prompt,
            'intent': intent,
            'thesis_stage': thesis_stage # Added thesis stage
        }
        self.usage_logs.append(log_entry)
        print(f"Usage logged: Timestamp={log_entry['timestamp']}, Prompt='{prompt}', Intent='{intent}', Thesis Stage='{thesis_stage}'")
        # Optionally generate embedding for the new log entry immediately
        # self._generate_embedding_for_log(log_entry)


    def _generate_embedding_for_log(self, log_entry):
        """Generates embedding for a single log entry and stores it."""
        try:
            prompt_text = log_entry['prompt']
            response = self.client.embeddings.create(
                model="text-embedding-3-small", # Use the embedding model
                input=prompt_text
            )
            embedding = response.data[0].embedding
            # Store the embedding along with a reference to the original log index
            self.usage_embeddings.append({'embedding': embedding, 'original_index': len(self.usage_logs) - 1})
            print(f"Generated embedding for log entry {len(self.usage_logs) - 1}")
        except Exception as e:
            print(f"Error generating embedding for log entry: {e}")


    def generate_all_usage_embeddings(self):
        """Generates embeddings for all usage logs using OpenAI API."""
        print("Generating embeddings for all usage logs using OpenAI API...")
        self.usage_embeddings = [] # Clear existing embeddings
        for i, log_entry in enumerate(self.usage_logs):
            try:
                prompt_text = log_entry['prompt']
                response = self.client.embeddings.create(
                    model="text-embedding-3-small", # Use the embedding model
                    input=prompt_text
                )
                embedding = response.data[0].embedding
                # Store the embedding along with a reference to the original log index
                self.usage_embeddings.append({'embedding': embedding, 'original_index': i})
            except Exception as e:
                print(f"Error generating embedding for log entry {i}: {e}")

        print(f"Generated {len(self.usage_embeddings)} embeddings.")


    def find_similar_usage(self, query_prompt, n=3):
        """
        Finds the n most similar usage logs based on prompt embedding similarity.

        Args:
            query_prompt (str): The prompt to find similar usage for.
            n (int): The number of closest usage logs to find.

        Returns:
            list of dict: A list of dictionaries for the n most similar usage logs,
                          each containing 'distance', 'original_log' (the full log entry).
                          Returns an empty list if no embeddings are available.
        """
        if not self.usage_embeddings:
            print("No usage embeddings available to query.")
            return []

        try:
            # Generate embedding for the query prompt
            query_response = self.client.embeddings.create(
                model="text-embedding-3-small", # Use the embedding model
                input=query_prompt
            )
            query_embedding = query_response.data[0].embedding

            distances = []
            for item in self.usage_embeddings:
                dist = distance.cosine(query_embedding, item['embedding'])
                distances.append({
                    "distance": dist,
                    "original_index": item['original_index']
                    })

            distances_sorted = sorted(distances, key=lambda x: x['distance'])

            # Get the original log entries for the n closest
            similar_logs = []
            for item in distances_sorted[0:n]:
                original_log = self.usage_logs[item['original_index']]
                similar_logs.append({
                    "distance": item['distance'],
                    "original_log": original_log
                })

            return similar_logs

        except Exception as e:
            print(f"Error during similar usage query: {e}")
            return []


    def detect_ai(self, text):
        """Uses the OpenAI LLM to assess if content is AI generated."""
        print("Using OpenAI LLM for AI detection...")
        try:
            # Craft a prompt for the LLM to assess AI generation
            # This prompt might need refinement for better results
            prompt_text = f"Assess the likelihood that the following text was generated by an AI. Respond ONLY with a score between 0 and 1, where 1 is highly likely to be AI generated, followed by a brief explanation on a new line.\n\nText to assess:\n{text}\n\nScore:"

            response = self.client.chat.completions.create(
                model="gpt-4o-mini", # Or another suitable model
                messages=[
                    {"role": "system", "content": "You are an AI text detection assistant."},
                    {"role": "user", "content": prompt_text}
                ],
                max_tokens=50 # Restrict tokens to manage cost
            )

            # Attempt to parse the score from the LLM's response
            response_text = response.choices[0].message.content.strip()
            print(f"LLM Raw Response: {response_text}") # Print raw response for debugging
            try:
                # The prompt asks for the score on the first line; clamp it to [0, 1]
                # in case the model returns an out-of-range number
                detection_score = min(1.0, max(0.0, float(response_text.splitlines()[0])))
            except (ValueError, IndexError):
                print(f"Could not parse score from LLM response: '{response_text}'. Assuming a default score.")
                detection_score = 0.5 # Default score if parsing fails

            # Simulate some processing time (optional, but good for realism)
            time.sleep(0.5)

            is_ai_generated = detection_score > self.ai_detection_threshold
            print(f"AI detection score from LLM: {detection_score:.2f}. Is likely AI: {is_ai_generated}")
            # Return the AI detection status, score, and potentially the full LLM response
            return is_ai_generated, detection_score, response_text

        except Exception as e:
            print(f"Error during OpenAI LLM AI detection: {e}")
            # Fallback in case of API errors
            return False, 0.0, f"Error: {e}"


    def check_ethical_usage(self, prompt, generated_text):
        """Basic check for ethical usage, combining prompt analysis and AI detection."""
        print("Checking for ethical usage...")

        # Basic Human Prompt Checker logic (simplified)
        prompt_lower = prompt.lower()
        if "write my entire thesis" in prompt_lower or "do my whole thesis" in prompt_lower:
            print("Ethical Alert: Skeptical usage detected (attempting to write entire thesis). Encourage ethical use and own writing.")
        elif "generate abstract" in prompt_lower or "write introduction" in prompt_lower:
            print("Ethical Note: AI used for structural writing. Remember to review and rephrase carefully.")
        elif "analyze this concept" in prompt_lower or "explain this" in prompt_lower:
            print("Ethical Usage: AI used for understanding/analysis. Good practice!")
        else:
            print("Prompt intent: Could be ethical, further analysis needed in a complex model.")


        # Basic Ethical Violation Alert logic (simplified, tied to AI detection and prompt analysis)
        is_ai, score, llm_response = self.detect_ai(generated_text)

        if is_ai and ("write my entire thesis" in prompt_lower or "do my whole thesis" in prompt_lower):
            print("Ethical VIOLATION Alert: High potential for academic dishonesty due to prompt and AI content.")
        elif is_ai:
            print("Ethical Alert: Potential AI-generated content detected. Encourage rephrasing.")

        # Example of checking for over-reliance (very basic) - in a real model, this would look at usage patterns over time
        # Flags the case where each of the five most recent logged prompts contains "generate"
        if len(self.usage_logs) >= 5 and all("generate" in entry['prompt'].lower() for entry in self.usage_logs[-5:]):
            print("Ethical Alert: Potential over-reliance on AI generation detected. Encourage critical thinking and original writing.")


# Example Usage (after creating and initializing the openai client):
# Make sure the 'client' object is defined from a previous cell
# ethics_module = EthicsModule(openai_client=client)
# ethics_module.log_usage("help me understand this concept", "research", thesis_stage="literature review")
# ethics_module.generate_all_usage_embeddings() # Generate embeddings after logging
# similar_logs = ethics_module.find_similar_usage("find papers on NLP")
# print("\nSimilar Usage Logs:")
# for log in similar_logs:
#      print(f"  - Distance: {log['distance']:.4f}, Prompt: '{log['original_log']['prompt']}'")
# is_ai, score, llm_response = ethics_module.detect_ai("The quick brown fox jumps over the lazy dog.")
# print(f"Detect AI Result: Is AI: {is_ai}, Score: {score:.2f}, LLM Response: {llm_response}")
# is_ai, score, llm_response = ethics_module.detect_ai("As an AI language model, I can help with that.")
# print(f"Detect AI Result: Is AI: {is_ai}, Score: {score:.2f}, LLM Response: {llm_response}")
# ethics_module.check_ethical_usage("write my entire thesis", "Here is a thesis.")
# ethics_module.check_ethical_usage("analyze this concept", "Based on my training data, this concept is...")

In [None]:

# Example Usage (after creating and initializing the openai client):

if 'client' in locals():
    # Initialize the ethics module with the created client
    # Make sure the EthicsModule class is defined in a previous cell
    ethics_module = EthicsModule(openai_client=client)

    print("--- Testing log_usage ---")
    ethics_module.log_usage("Help me find papers on natural language processing", "research", thesis_stage="literature review")
    ethics_module.log_usage("Generate an outline for my introduction", "writing_support", thesis_stage="introduction")
    print("\nCurrent usage logs:")
    display(pd.DataFrame(ethics_module.usage_logs))

    print("\n--- Testing detect_ai ---")
    # Test with human-like text (shorter)
    is_ai_human, score_human, llm_response_human = ethics_module.detect_ai("The quick brown fox jumps over the lazy dog. This is a short sentence.")
    print(f"Test 1 Result: Is AI: {is_ai_human}, Score: {score_human:.2f}, LLM Response: {llm_response_human}")

    # Test with text likely generated by an AI (shorter)
    is_ai_ai, score_ai, llm_response_ai = ethics_module.detect_ai("As an AI language model, I can assist you.")
    print(f"Test 2 Result: Is AI: {is_ai_ai}, Score: {score_ai:.2f}, LLM Response: {llm_response_ai}")

    # Test with some placeholder generated text (shorter)
    is_ai_generated, score_generated, llm_response_generated = ethics_module.detect_ai("Generated text about a topic.")
    print(f"Test 3 Result: Is AI: {is_ai_generated}, Score: {score_generated:.2f}, LLM Response: {llm_response_generated}")


    print("\n--- Testing check_ethical_usage ---")
    # Test with an ethical prompt and seemingly human text (shorter)
    ethics_module.check_ethical_usage("analyze this concept", "Based on my understanding, this concept is complex.")

    # Test with a skeptical prompt and seemingly AI text (shorter)
    ethics_module.check_ethical_usage("write my entire thesis", "Here is a short thesis summary.")

    # Test with an ethical prompt and text likely flagged as AI (shorter)
    ethics_module.check_ethical_usage("explain this theory", "Based on my training data, this theory is interesting.")

    # Test simple over-reliance check (might require more log entries to trigger)
    print("\n--- Testing potential over-reliance check (might need more logs) ---")
    # Add more "generate" prompts to usage logs to potentially trigger over-reliance alert (shorter prompts)
    for _ in range(5):
        ethics_module.log_usage("generate text", "writing_support")
    ethics_module.check_ethical_usage("continue writing", "More text.")

else:
    print("Error: 'client' object not found. Please run the cell to set up the OpenAI client first.")

--- Testing log_usage ---
Usage logged: Timestamp=2025-06-19 15:58:34.555343, Prompt='Help me find papers on natural language processing', Intent='research', Thesis Stage='literature review'
Usage logged: Timestamp=2025-06-19 15:58:34.555446, Prompt='Generate an outline for my introduction', Intent='writing_support', Thesis Stage='introduction'

Current usage logs:


|   | timestamp | prompt | intent | thesis_stage |
|---|---|---|---|---|
| 0 | 2025-06-19 15:58:34.555343 | Help me find papers on natural language proces... | research | literature review |
| 1 | 2025-06-19 15:58:34.555446 | Generate an outline for my introduction | writing_support | introduction |



--- Testing detect_ai ---
Using OpenAI LLM for AI detection...
LLM Raw Response: 0.1  
The text is a well-known pangram and consists of simple, straightforward sentences that are common in human writing. The lack of complexity and the use of a familiar expression suggest a low likelihood of AI generation.
AI detection score from LLM: 0.10. Is likely AI: False
Test 1 Result: Is AI: False, Score: 0.10, LLM Response: 0.1  
The text is a well-known pangram and consists of simple, straightforward sentences that are common in human writing. The lack of complexity and the use of a familiar expression suggest a low likelihood of AI generation.
Using OpenAI LLM for AI detection...
LLM Raw Response: 0.2  
This text is quite generic and could be easily produced by both an AI and a human, making it less likely to be definitively AI-generated.
AI detection score from LLM: 0.20. Is likely AI: False
Test 2 Result: Is AI: False, Score: 0.20, LLM Response: 0.2  
This text is quite generic and could be

## Proposed RL Agent State Structure

This section outlines a potential structure for the state that the Reinforcement Learning agent (overseeing project evolution and ethics) would observe. The state combines information from the Ethics Module with broader thesis progress details.

The state would likely be represented as a numerical vector or a structured object that the RL model can process.

**Components of the State:**

1.  **Ethical State Features (from Ethics Module):**
    *   **AI Detection Score:** The score from the `detect_ai` function for the most recent generated content (e.g., a value between 0 and 1).
    *   **Prompt Classification:** A categorical or numerical representation of the last prompt's ethical classification (e.g., 0 for ethical, 1 for structural, 2 for skeptical/dangerous).
    *   **Usage Frequency:** Metrics on recent LLM usage (e.g., number of LLM interactions in the last hour/day, proportion of "generate" prompts).
    *   **Embedding Similarity:** The similarity score from `find_similar_usage` when querying the current prompt against past usage logs (e.g., the distance to the most similar ethical/skeptical past interaction).
    *   **Ethical Alert Status:** Flags indicating if any ethical violations or warnings are currently active (e.g., binary flags for over-reliance alert, academic dishonesty alert).
    *   **Human Engagement:** Metrics on user interaction with previous ethical interventions (e.g., did the user rephrase AI content, did they engage with a reflection prompt).

2.  **Thesis Progress Features:**
    *   **Current Thesis Stage:** A categorical or numerical representation of the current stage of the thesis (e.g., 0 for planning, 1 for literature review, 2 for methodology, 3 for writing, etc.).
    *   **Task Completion:** Percentage of planned tasks completed for the current stage or overall project.
    *   **Time-based Metrics:** Time spent on the project recently, time remaining until deadlines.
    *   **Advisor Feedback Status:** A flag or metric indicating the presence and recency of unaddressed advisor feedback.

3.  **Performance Features:**
    *   **Work Quality Score:** A metric representing the quality of recent thesis work (this would be challenging to define and might require human evaluation or proxy metrics).
    *   **Progress Rate:** A measure of how quickly tasks are being completed or milestones are being reached.
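
The "Usage Frequency" feature listed above can be sketched from the existing usage logs. This is a minimal sketch, not the project's implementation: the helper name `usage_frequency_features` and the one-hour default window are assumptions, and the log entries only need the `timestamp` and `prompt` keys that `log_usage` already records.

```python
from datetime import datetime, timedelta


def usage_frequency_features(usage_logs, now, window=timedelta(hours=1)):
    """Return (number of interactions in the window, share of 'generate' prompts).

    Hypothetical helper: `usage_logs` is a list of dicts with 'timestamp'
    and 'prompt' keys, as produced by EthicsModule.log_usage.
    """
    recent = [entry for entry in usage_logs if now - entry["timestamp"] <= window]
    if not recent:
        return 0, 0.0
    generate_count = sum("generate" in entry["prompt"].lower() for entry in recent)
    return len(recent), generate_count / len(recent)


# Illustrative data only; the timestamps are made up for the example.
now = datetime(2025, 6, 19, 16, 0, 0)
logs = [
    {"timestamp": now - timedelta(minutes=5), "prompt": "Generate an outline for my introduction"},
    {"timestamp": now - timedelta(minutes=30), "prompt": "Help me find papers on NLP"},
    {"timestamp": now - timedelta(hours=3), "prompt": "generate text"},  # outside the window
]
count, generate_share = usage_frequency_features(logs, now)
print(count, generate_share)
```

Because `pd.Timestamp` is a `datetime` subclass, the same function should accept the timestamps stored by `log_usage` without conversion.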

**Combining the State:**

These individual features would be combined into a single state representation that the RL agent's model can process. For a neural network-based RL model, this would typically be a flattened numerical vector. Categorical features would need to be appropriately encoded (e.g., one-hot encoding).
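
As a rough illustration of the flattening step, the sketch below one-hot encodes two categorical features and concatenates them with a few numeric ones. The feature names, the category lists, and the helper `build_state_vector` are all assumptions chosen for the example; the actual feature set is still to be defined.

```python
# Assumed category lists; the real stages and classes are not fixed yet.
THESIS_STAGES = ["planning", "literature review", "methodology", "writing"]
PROMPT_CLASSES = ["ethical", "structural", "skeptical"]


def one_hot(value, categories):
    """Encode `value` as a one-hot list; unknown values map to all zeros."""
    return [1.0 if value == category else 0.0 for category in categories]


def build_state_vector(features):
    """Flatten a feature dict into a list of floats (hypothetical layout)."""
    vector = [
        float(features["ai_detection_score"]),          # 0..1 score from detect_ai
        float(features["generate_share"]),              # share of "generate" prompts
        1.0 if features["over_reliance_alert"] else 0.0,  # binary alert flag
        float(features["task_completion"]),             # fraction of tasks done
    ]
    vector += one_hot(features["prompt_class"], PROMPT_CLASSES)
    vector += one_hot(features["thesis_stage"], THESIS_STAGES)
    return vector


state = build_state_vector({
    "ai_detection_score": 0.2,
    "generate_share": 0.5,
    "over_reliance_alert": False,
    "task_completion": 0.4,
    "prompt_class": "structural",
    "thesis_stage": "literature review",
})
print(len(state), state)
```

Keeping the category lists in one place means the vector length stays fixed, which a neural-network policy needs for its input layer.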

**Next Steps for Implementation (for later):**

*   Define the specific numerical or categorical representation for each state feature.
*   Develop the logic within the thesis assistant to collect and compile this information into the state vector at each time step.
*   Ensure the Ethics Module submodules (Usage_Logger, AI_Detector, etc.) are providing the necessary data points in a format that can be easily integrated into the state.
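
One way the prompt check could hand a numeric "Prompt Classification" to the state is sketched below. It reuses the keyword phrases from `check_ethical_usage` and the 0/1/2 coding suggested above; the names `PromptClass` and `classify_prompt`, and the extra `UNCLASSIFIED` value for prompts that match no rule, are assumptions made for this example.

```python
from enum import IntEnum


class PromptClass(IntEnum):
    ETHICAL = 0
    STRUCTURAL = 1
    SKEPTICAL = 2
    UNCLASSIFIED = 3  # assumption: not part of the 0/1/2 scheme described above


# Keyword phrases copied from check_ethical_usage; checked in this order.
_KEYWORD_RULES = [
    (PromptClass.SKEPTICAL, ("write my entire thesis", "do my whole thesis")),
    (PromptClass.STRUCTURAL, ("generate abstract", "write introduction")),
    (PromptClass.ETHICAL, ("analyze this concept", "explain this")),
]


def classify_prompt(prompt):
    """Return the PromptClass of the first rule whose phrase occurs in the prompt."""
    prompt_lower = prompt.lower()
    for label, phrases in _KEYWORD_RULES:
        if any(phrase in prompt_lower for phrase in phrases):
            return label
    return PromptClass.UNCLASSIFIED


print(int(classify_prompt("Please write my entire thesis")))
print(classify_prompt("explain this theory").name)
```

Returning an `IntEnum` keeps the value usable both as a readable label in logs and as an integer when the state vector is assembled.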

In [5]:
!pip install streamlit gymnasium stable-baselines3
!pip install numpy # Ensure numpy is installed if not already
!pip install pandas # Ensure pandas is installed if not already
!pip install scipy # Ensure scipy is installed if not already


## Part 1: Configuration Manager

The `RLConfigManager` class is responsible for managing the centralized configuration of the Reinforcement Learning system for the Thesis Assistant's ethics module. This configuration is dynamic, allowing developers to define and modify key aspects of the RL environment and agent behavior without directly altering the core code.

**Purpose:**
The primary purpose of this class is to provide a persistent and easily modifiable way to store the settings that govern the RL agent's learning and decision-making process. This includes defining the observable state variables, the available actions the agent can take, the reward values associated with different events, and how each action influences the state.

**Key Components:**
- `CONFIG_FILE`: A class variable specifying the name of the JSON file used for storing the configuration (`rl_config.json`).
- `load_config()`: A class method that loads the configuration from the `CONFIG_FILE`. If the file does not exist, it creates a default configuration and saves it.
- `save_config(config)`: A class method that saves a given configuration dictionary to the `CONFIG_FILE`.

**How to Use:**
- To get the current configuration, call `RLConfigManager.load_config()`. This will return a dictionary containing the settings.
- To update the configuration, modify the dictionary obtained from `load_config()` and then call `RLConfigManager.save_config(updated_config)`.

**Interaction with Other Components:**
- **Developer Dashboard:** The `DeveloperDashboard` uses `RLConfigManager` to load and save the configuration, allowing developers to interactively modify the settings via a Streamlit interface.
- **Data Preprocessor:** The `DataPreprocessor` uses the configuration (specifically, `state_variables` and `reward_config`) to determine how to convert raw usage logs into state vectors and compute rewards.
- **RL Environment:** The `EthicsSupervisorEnv` is built dynamically based on the configuration loaded by `RLConfigManager`, defining its observation space, action space, and state transition logic (`action_effects`).
- **RL Training Loop and Trainer:** These components implicitly rely on the configuration loaded by `RLConfigManager` via the environment and preprocessor.
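
To make the reward lookup concrete, here is a minimal sketch of how a preprocessor might turn logged events into a scalar reward. The values mirror the default `reward_config` created by `load_config()`; the function name `compute_reward` and the idea that each step carries a list of event names are assumptions, since the `DataPreprocessor` itself is not shown in this section.

```python
# Values copied from the default configuration created by load_config().
REWARD_CONFIG = {
    "user_revised": 2,
    "ai_violation": -3,
    "advisor_positive": 3,
    "rewrite_accepted": 1,
    "milestone_completed": 5,
    "hallucination_detected": -2,
}


def compute_reward(events, reward_config):
    """Sum the configured reward for each event name.

    Hypothetical helper: events without an entry in `reward_config`
    contribute 0 rather than raising an error.
    """
    return sum(reward_config.get(event, 0) for event in events)


print(compute_reward(["user_revised", "ai_violation", "unknown_event"], REWARD_CONFIG))
```

Treating unknown events as zero reward keeps old logs usable after the configuration changes, at the cost of silently ignoring misspelled event names.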

**Data Structures and Configuration:**
The configuration is stored in a JSON file (`rl_config.json`) and represented in Python as a dictionary with the following structure:
- "state_variables": A list of strings, where each string is the name of a variable that constitutes the state observed by the RL agent.
- "actions": A list of strings, where each string is a human-readable label for an action the RL agent can take. The index of the action in this list is the action ID used by the RL model.
- "reward_config": A dictionary mapping event names (as found in usage logs) to numerical reward values.
- "action_effects": A nested dictionary defining how actions change state variables. The outer keys are string representations of action indices, and the inner dictionaries map state variable names to the delta value (change) applied to that variable when the action is taken.
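
The `action_effects` structure can be read as a table of per-action deltas. The sketch below applies those deltas to a state held as a dictionary; it is an illustration under assumptions, not the environment's actual transition code. The helper name `apply_action` is hypothetical, and clamping to the range 0 to 1 assumes the state variables are normalized, which the configuration does not state.

```python
# Default action_effects from load_config(); keys are strings because
# JSON object keys are always strings.
ACTION_EFFECTS = {
    "1": {"ai_usage": -0.05, "embedding_drift": -0.05},
    "3": {"ethical_flags": -0.1},
    "4": {"advisor_feedback": 0.1},
}


def apply_action(state, action_index, action_effects):
    """Return a copy of `state` with the deltas for `action_index` applied."""
    new_state = dict(state)
    for variable, delta in action_effects.get(str(action_index), {}).items():
        if variable in new_state:
            # Assumption: variables are normalized, so clamp to [0, 1].
            new_state[variable] = min(1.0, max(0.0, new_state[variable] + delta))
    return new_state


state = {"embedding_drift": 0.5, "ai_usage": 0.02, "ethical_flags": 0.3,
         "advisor_feedback": 0.0, "timestep": 0.0}
after_reflection = apply_action(state, 1, ACTION_EFFECTS)  # "Suggest reflection"
print(after_reflection["ai_usage"], after_reflection["embedding_drift"])
```

Actions with no entry, such as index 0 ("Allow prompt") in the default configuration, leave the state unchanged.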


In [23]:
# PART 1 — CONFIGURATION MANAGER
# ===========================================================
import json
import os
import gymnasium as gym
import streamlit as st
import numpy as np
import random
from stable_baselines3 import PPO

class RLConfigManager:
    """
    Manages centralized configuration for the dynamic Reinforcement Learning system.

    This class handles loading and saving configuration settings for the RL environment,
    including state variables, action space definitions, reward shaping values,
    and the effects of actions on state variables.
    """

    CONFIG_FILE = "rl_config.json"

    @classmethod
    def load_config(cls):
        """
        Load RL configuration from a JSON file.

        If the configuration file does not exist, a default configuration is created
        and saved to the file.

        Returns:
            config (dict): The loaded configuration dictionary.
        """
        if not os.path.exists(cls.CONFIG_FILE):
            # Define a default configuration if the file is not found
            default_config = {
                "state_variables": ["embedding_drift", "ai_usage", "ethical_flags", "advisor_feedback", "timestep"],
                "actions": ["Allow prompt", "Suggest reflection", "Ethical warning", "Suggest rewriting", "Advisor feedback reminder", "Disable AI feature"],
                "reward_config": {"user_revised": 2, "ai_violation": -3, "advisor_positive": 3, "rewrite_accepted": 1, "milestone_completed": 5, "hallucination_detected": -2},
                "action_effects": {"1": {"ai_usage": -0.05, "embedding_drift": -0.05}, "3": {"ethical_flags": -0.1}, "4": {"advisor_feedback": 0.1}}
            }
            cls.save_config(default_config)
        # Load the configuration from the JSON file
        with open(cls.CONFIG_FILE, "r") as f:
            return json.load(f)

    @classmethod
    def save_config(cls, config):
        """
        Save the current configuration to the JSON file.

        Args:
            config (dict): The configuration dictionary to save.
        """
        with open(cls.CONFIG_FILE, "w") as f:
            json.dump(config, f, indent=2) # Save with indentation for readability


## Part 2: Developer Dashboard

The `DeveloperDashboard` class provides a graphical user interface (GUI) built with Streamlit, allowing developers to interactively configure the Reinforcement Learning system. This dashboard simplifies the process of modifying the state space, action space, reward shaping, and action effects without directly editing the configuration JSON file.

**Purpose:**
The main purpose is to offer a user-friendly way for developers to experiment with and tune the RL agent's behavior and the environment's dynamics. This is crucial during the development and testing phases of the RL ethics supervisor.

**Key Components:**
- `__init__()`: Initializes the dashboard by loading the current configuration using `RLConfigManager.load_config()`.
- `launch()`: The main method to launch the Streamlit application. It sets the title and calls methods to display and edit different parts of the configuration.
- `edit_action_space()`: Displays the current actions and provides an input field and button to add new actions to the configuration.
- `edit_reward_shaping()`: Displays sliders for each item in the `reward_config`, allowing developers to adjust the reward values.
- `edit_action_effects()`: Provides input fields to define how a specific action (identified by index) affects a specific state variable, allowing developers to add these effects to the `action_effects` dictionary in the configuration.
- `save_button()`: Displays a button that, when clicked, saves the currently modified configuration back to the `rl_config.json` file using `RLConfigManager.save_config()`.

**How to Use:**
- To run the dashboard, execute the `launch()` method of an instance of `DeveloperDashboard`. Note that Streamlit applications are typically run from the command line using `streamlit run your_script_name.py`. In a notebook environment, you might need specific integrations or run the relevant cell and access the output via a provided URL.
- Use the input fields, sliders, and buttons in the web interface to modify the configuration settings.
- Click the "Save Full Configuration" button to persist the changes.

**Interaction with Other Components:**
- **Configuration Manager:** The `DeveloperDashboard` directly interacts with `RLConfigManager` to load the initial configuration when launched and to save the updated configuration.
- **RL System Components:** The changes made through the dashboard directly influence how the `DataPreprocessor`, `EthicsSupervisorEnv`, and the RL agent (`EthicsSupervisorRL`) behave when they load the updated configuration.

**Data Structures and Configuration:**
The dashboard works directly with the dictionary structure managed by `RLConfigManager`. Changes made in the UI are reflected in the `self.config` dictionary within the `DeveloperDashboard` instance before being saved.
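
The nested dictionary that the dashboard edits can be illustrated without Streamlit. The sketch below mirrors the `setdefault` logic of `edit_action_effects()` in a hypothetical helper, `add_action_effect` (not part of the notebook). It also shows why effects are keyed by the string form of the action index: JSON object keys are always strings, so integer keys would come back as strings after a save and reload anyway.

```python
import json


def add_action_effect(config, action_idx, variable, delta):
    """Record that taking `action_idx` shifts `variable` by `delta`.

    Hypothetical helper mirroring DeveloperDashboard.edit_action_effects();
    keys are stored as strings so they survive a JSON round trip unchanged.
    """
    effects = config.setdefault("action_effects", {})
    effects.setdefault(str(action_idx), {})[variable] = float(delta)
    return config


config = {"actions": ["Allow prompt", "Suggest reflection"]}
add_action_effect(config, 1, "ai_usage", -0.05)
add_action_effect(config, 1, "embedding_drift", -0.05)

# Simulate saving to and reloading from rl_config.json.
reloaded = json.loads(json.dumps(config, indent=2))
print(reloaded["action_effects"])
```

This is also the reason the environment looks effects up with `str(action)` rather than with the integer action index.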

In [24]:
# PART 2: DEVELOPER DASHBOARD (Class-Based Streamlit Interface with Full Docstrings)
# ===========================================================

class DeveloperDashboard:
    """
    Streamlit-based developer interface for interactively updating RL configuration.

    This dashboard allows developers to view and modify the action space, reward shaping,
    and action effects defined in the RL configuration file.
    """

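    def __init__(self):
        """
        Initialize the dashboard by loading the current configuration.

        Streamlit reruns the whole script on every widget interaction, so the
        working copy is kept in `st.session_state` to keep unsaved edits alive
        between reruns.
        """
        if "rl_config" not in st.session_state:
            st.session_state["rl_config"] = RLConfigManager.load_config()
        self.config = st.session_state["rl_config"]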

    def launch(self):
        """
        Launch the Streamlit dashboard interface.
        """
        st.title("🎯 Thesis RL Developer Dashboard")
        self.edit_action_space()
        self.edit_reward_shaping()
        self.edit_action_effects()
        self.save_button()

    def edit_action_space(self):
        """
        Display and edit the RL action space.

        Developers can add new high-level intervention actions here.
        """
        st.header("1️⃣ Manage Action Space")
        st.write("Define high-level interventions available to RL agent:")

        st.subheader("Current Actions:")
        # Display current actions with their indices
        for idx, action in enumerate(self.config["actions"]):
            st.write(f"**{idx}:** {action}")

        new_action = st.text_input("Add New Action:")
        if st.button("Add Action"):
            if new_action.strip(): # Ensure the input is not empty or just whitespace
                self.config["actions"].append(new_action.strip())
                st.success(f"✅ Added action: '{new_action}'")

    def edit_reward_shaping(self):
        """
        Display and edit the reward shaping values.

        Developers can adjust the reward values associated with key supervision signals.
        """
        st.header("2️⃣ Edit Reward Shaping")
        st.write("Adjust reward values for key supervision signals:")

        # Create sliders for each reward configuration item
        for key, val in self.config["reward_config"].items():
            new_val = st.slider(f"Reward for {key}:", -10, 10, val)
            self.config["reward_config"][key] = new_val

    def edit_action_effects(self):
        """
        Display and define the effects of actions on state variables.

        Developers can specify how choosing a particular action changes the values
        of specific state variables.
        """
        st.header("3️⃣ Define Action Effects")
        st.write("Specify which state variables are influenced by actions:")

        action_idx = st.text_input("Action Index (integer):", value="1")
        variable_name = st.text_input("State Variable (e.g. ai_usage):")
        delta = st.number_input("Delta Change (+/-):", step=0.01, value=0.0)

        if st.button("Add Effect"):
            # Add the defined effect to the configuration
            effects = self.config.setdefault("action_effects", {}) # Get or create action_effects dictionary
            action_effect = effects.setdefault(str(action_idx), {}) # Get or create effects for the specific action index
            action_effect[variable_name] = delta
            st.success(f"✅ Effect added: Action {action_idx} → {variable_name} += {delta}")

    def save_button(self):
        """
        Provide a save button to persist updated configuration back to disk.
        """
        if st.button("Save Full Configuration"):
            RLConfigManager.save_config(self.config)
            st.success("✅ All changes saved successfully!")


## Part 3: Data Preprocessor

The `DataPreprocessor` class is a crucial component of the RL training pipeline. Its role is to bridge the gap between the raw usage logs generated by the thesis assistant and the structured input required by the Reinforcement Learning environment and agent.

**Purpose:**
The main purpose is to transform detailed log entries, which capture various events and state information from a user's interaction with the assistant, into a standardized numerical state vector that the RL agent can observe and a corresponding reward signal based on the events in the log.

**Key Components:**
- `__init__(config)`: Initializes the preprocessor with the current RL configuration, which includes definitions of state variables and reward mapping.
- `extract_state(log_entry)`: Takes a single log entry (a dictionary) and converts it into a NumPy `float32` array representing the state vector. Values are read in the order of the `state_variables` defined in the configuration and are assumed to be already scaled to [0, 1]; the `timestep` slot is filled from the log's `deadline_ratio` field.
- `compute_reward(log_entry)`: Calculates the reward associated with a log entry. It looks for specific event keys within the log entry (as defined in the `reward_config` in the configuration) and sums up the corresponding reward values.

**How to Use:**
- Create an instance of `DataPreprocessor`, passing the loaded RL configuration: `preprocessor = DataPreprocessor(config)`.
- For each raw usage log entry (a dictionary), call `preprocessor.extract_state(log_entry)` to get the state vector and `preprocessor.compute_reward(log_entry)` to get the reward.

**Interaction with Other Components:**
- **Configuration Manager:** The `DataPreprocessor` relies heavily on the configuration loaded by `RLConfigManager` to know which log attributes correspond to state variables and how to map events to reward values.
- **RL Training Loop:** The `RLTrainingLoop` uses the `DataPreprocessor` to process batches of logs. The resulting states and rewards are either fed to the RL agent for training or used to update the environment's state and compute rewards in a more interactive simulation.
- **Simulators (Synthetic Cohort/Student):** The simulator classes generate log entries that are in a format expected by the `DataPreprocessor`.

**Data Structures and Configuration:**
- It processes input in the form of dictionaries (log entries).
- It outputs a NumPy array for the state and a float for the reward.
- It uses the `state_variables` and `reward_config` from the RL configuration dictionary to perform the conversion and computation. A reward is added only when the matching `reward_config` key is present in the `log_entry` with the boolean value `True`; missing keys and other values contribute nothing.
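
As a dependency-free sketch of the conversion described above, the two functions below reproduce the logic of `extract_state` and `compute_reward` as standalone helpers. They return a plain list instead of the `np.float32` array used by the real class, and the log entry is a made-up example, not real usage data.

```python
def extract_state(config, log_entry):
    """Build the state vector in the order given by config["state_variables"]."""
    state = []
    for var in config["state_variables"]:
        # The "timestep" slot is filled from the log's deadline_ratio field.
        key = "deadline_ratio" if var == "timestep" else var
        state.append(float(log_entry.get(key, 0.0)))
    return state


def compute_reward(config, log_entry):
    """Sum the rewards of every event flagged as exactly True in the log entry."""
    return float(sum(
        value for key, value in config["reward_config"].items()
        if log_entry.get(key) is True
    ))


config = {
    "state_variables": ["embedding_drift", "ai_usage", "ethical_flags",
                        "advisor_feedback", "timestep"],
    "reward_config": {"user_revised": 2, "ai_violation": -3, "milestone_completed": 5},
}
# Hypothetical log entry; advisor_feedback is missing on purpose.
log_entry = {"embedding_drift": 0.2, "ai_usage": 0.5, "ethical_flags": 0.0,
             "deadline_ratio": 0.4, "user_revised": True, "ai_violation": True,
             "milestone_completed": False}

state = extract_state(config, log_entry)
reward = compute_reward(config, log_entry)
print(state, reward)
```

The missing `advisor_feedback` value defaults to 0.0, and the reward is the sum of `user_revised` (+2) and `ai_violation` (-3); `milestone_completed` is `False` and therefore ignored.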

In [25]:
# PART 3 — DATA PREPROCESSOR (LOG TO STATE CONVERSION)
# ===========================================================

class DataPreprocessor:
    """
    Converts raw usage logs into RL state vectors and reward labels.

    This class is responsible for transforming the detailed log entries
    from the thesis assistant's usage into a format (state vectors and rewards)
    that the Reinforcement Learning agent can understand and use for training.
    """

    def __init__(self, config):
        """
        Initialize with the current RL configuration structure.

        Args:
            config (dict): Loaded RL configuration dictionary.
        """
        self.config = config

    def extract_state(self, log_entry):
        """
        Convert a single log entry into an RL state vector.

        The state vector is constructed based on the state variables defined
        in the loaded configuration. Values are used as logged, so they are
        expected to lie in [0, 1] already.

        Args:
            log_entry (dict): A single usage log entry.

        Returns:
            state (np.ndarray): The normalized RL state vector.
        """
        state = []
        # Iterate through state variables defined in the config
        for var in self.config["state_variables"]:
            if var == "timestep":
                # The "timestep" slot is filled from the log's deadline_ratio, which is assumed to already lie in [0, 1]
                state.append(log_entry.get("deadline_ratio", 0.0))
            else:
                # For other variables, get the value directly from the log entry (defaulting to 0.0 if not present)
                value = log_entry.get(var, 0.0)
                # Ensure the value is a float for consistency
                state.append(float(value))
        return np.array(state, dtype=np.float32) # Ensure float32 dtype for compatibility with RL libraries


    def compute_reward(self, log_entry):
        """
        Compute the shaped reward for a given log event.

        The reward is calculated based on the 'reward_config' in the loaded
        configuration and the presence of specific keys (representing events)
        in the log entry.

        Args:
            log_entry (dict): A single usage log entry.

        Returns:
            reward (float): The computed reward value.
        """
        reward = 0.0
        # Iterate through the reward configuration items
        for key, value in self.config["reward_config"].items():
            # Count an event only when its flag is exactly True in the log entry;
            # missing keys and non-boolean values contribute nothing
            if log_entry.get(key) is True:
                reward += value
        return reward

## Part 4: RL Environment (EthicsSupervisorEnv)

The `EthicsSupervisorEnv` class is a custom Reinforcement Learning environment built on the Gymnasium library. It simulates the interaction between the Thesis Assistant's ethics module and a student's thesis writing process so that an RL agent can learn intervention policies. Its state space, action space, and state transitions are defined by the loaded configuration; the reward is currently computed by a hard-coded placeholder.

**Purpose:**
To provide a simulated environment where the RL agent (the Ethics Supervisor) can learn through trial and error. The environment presents states representing the current situation (ethical flags, AI usage, progress, etc.) and provides feedback (rewards) based on the agent's chosen actions and the resulting changes in the simulated student's state.

**Key Components:**
- `__init__(ethics_module, config)`: Initializes the environment. It takes a simulated or actual `ethics_module` object (which holds the state variables) and the RL configuration. It dynamically defines the `observation_space` and `action_space` based on the configuration.
- `reset(seed=None)`: Resets the environment to an initial state at the beginning of a new simulation episode. It also resets the internal timestep counter.
- `step(action)`: Executes one step in the environment based on the `action` taken by the RL agent. It computes a reward, applies the effects of the action to the state (using `_apply_action_effects`), increments the timestep, and determines if the episode is done.
- `_get_state()`: Constructs the current state vector observed by the agent. It gathers the values of the state variables from the `ethics_module` object based on the configuration and normalizes them.
- `_compute_reward(action)`: Calculates the reward signal for the current step. The provided implementation is a placeholder that considers the timestep and a simple cost related to the action index. In a more complete system, this would incorporate ethical outcomes, user feedback, advisor input, etc.
- `_apply_action_effects(action)`: Updates the state variables in the `ethics_module` object based on the effects defined for the taken action in the configuration.
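
The state transition performed by `_apply_action_effects` can be sketched on a plain dictionary. The helper below is a hypothetical stand-in that uses a dict where the environment uses attributes of the `ethics_module` object; the effect values are taken from the default configuration, and the starting state is invented for illustration.

```python
def apply_action_effects(state, action_effects, action):
    """Shift each affected variable by its delta and clip the result to [0, 1]."""
    for variable, delta in action_effects.get(str(action), {}).items():
        if variable in state:  # unknown variables are skipped, as in the environment
            state[variable] = min(1.0, max(0.0, state[variable] + delta))
    return state


action_effects = {"1": {"ai_usage": -0.05, "embedding_drift": -0.05},
                  "4": {"advisor_feedback": 0.1}}
state = {"ai_usage": 0.03, "embedding_drift": 0.50, "advisor_feedback": 0.95}

apply_action_effects(state, action_effects, 1)  # "Suggest reflection"
apply_action_effects(state, action_effects, 4)  # "Advisor feedback reminder"
apply_action_effects(state, action_effects, 0)  # no effects configured for action 0
print(state)
```

Clipping keeps every variable inside the `[0, 1]` bounds declared by the observation space, so `ai_usage` stops at 0.0 and `advisor_feedback` stops at 1.0 here.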

**How to Use:**
- Initialize the environment: `env = EthicsSupervisorEnv(mock_ethics_module_instance, config)`. The `mock_ethics_module_instance` should be an object (like `MockEthicsModule`) that holds the current values of the state variables defined in the config.
- Call `env.reset()` to start a new episode.
- In a training or inference loop, get an action from the RL agent and call `env.step(action)` to advance the simulation. The `step` method returns the next state, reward, and episode status.
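
The loop in the last step follows the Gymnasium convention: `reset()` returns `(observation, info)` and `step()` returns a five-element tuple. The sketch below uses `TinyEnv`, a hypothetical stand-in that only mimics those return shapes, so it runs without Gymnasium installed. With the real classes, `TinyEnv()` would be replaced by `EthicsSupervisorEnv(MockEthicsModule(), config)` and the random policy by the trained agent.

```python
import random


class TinyEnv:
    """Hypothetical stand-in that mimics the reset()/step() return shapes only."""

    def __init__(self, n_actions=6, max_steps=100):
        self.n_actions = n_actions
        self.max_steps = max_steps
        self.timestep = 0

    def reset(self, seed=None):
        self.timestep = 0
        return [0.0], {}                    # (observation, info)

    def step(self, action):
        self.timestep += 1
        obs = [self.timestep / self.max_steps]
        done = self.timestep >= self.max_steps
        return obs, 1.0, done, False, {}    # (obs, reward, done, truncated, info)


def run_episode(env, policy):
    """Roll out one episode and return (number of steps, total reward)."""
    obs, _info = env.reset()
    steps, total, done, truncated = 0, 0.0, False, False
    while not (done or truncated):
        obs, reward, done, truncated, _info = env.step(policy(obs))
        steps += 1
        total += reward
    return steps, total


rng = random.Random(0)
env = TinyEnv()
steps, total = run_episode(env, lambda obs: rng.randrange(env.n_actions))
print(steps, total)
```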

**Interaction with Other Components:**
- **Mock Ethics Module:** The environment directly reads state variable values from and writes updated values to an instance of `MockEthicsModule` (or a similar object representing the system state).
- **Configuration Manager:** The environment's fundamental structure (state and action spaces, action effects) is determined by the configuration loaded by `RLConfigManager`.
- **PPO Supervisor (RL Agent):** The `EthicsSupervisorRL` class interacts with the `EthicsSupervisorEnv` to train the PPO model by calling `reset()` and `step()` and receiving state and reward information.
- **Data Preprocessor:** While not directly used by the environment during a `step`, the state variables and reward structure defined in the configuration used by the environment are consistent with what the `DataPreprocessor` expects when processing raw logs.

**Data Structures and Configuration:**
- The environment's state is represented as a NumPy array (`observation_space`).
- Actions are discrete integers (`action_space`).
- It uses the `state_variables`, `actions`, and `action_effects` dictionaries from the RL configuration to define its behavior.
- The `_compute_reward` method is a placeholder and would ideally use the `reward_config` from the configuration and events from the `ethics_module` state.
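
To make the placeholder reward concrete, the sketch below restates the formula from `_compute_reward` as a standalone function (the name `placeholder_reward` is hypothetical) and evaluates it at a few steps.

```python
def placeholder_reward(timestep, action, base_reward=1.0, lambda_cost=10.0):
    """Reproduce the placeholder formula from EthicsSupervisorEnv._compute_reward."""
    time_scaling = 1 + 2.0 * min(1.0, timestep / 80.0)  # ramps from 1 to 3, flat after step 80
    api_cost = 0.002 * action                           # grows with the action index
    return base_reward * time_scaling - lambda_cost * api_cost


for t, a in [(0, 0), (40, 3), (80, 5), (99, 5)]:
    print(t, a, round(placeholder_reward(t, a), 3))
```

Because this formula depends only on the timestep and the action index, and the cost term grows with the index, an agent trained against it is rewarded most for always choosing action 0 regardless of the state. This is why replacing the placeholder with a reward based on `reward_config` matters before any meaningful training.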

In [26]:
# PART 4: RL ENVIRONMENT (Fully Dynamic Gym-Compatible Environment)
# ===========================================================
#
# This module defines the RL environment the PPO agent interacts with.
#
# - Config-driven:
#   - The state space (variables used)
#   - The action space (interventions)
#   - The state update effects per action
# - The reward function is currently a hard-coded placeholder.
#
# -----------------------------------------------------------
# ✅ WHY FULLY DYNAMIC STATE?
# -----------------------------------------------------------
# - Allows easy expansion of system complexity.
# - Developers can add new state features via config without touching any code.
# - Keeps RL model compatible with evolving assistant behavior.
#
# -----------------------------------------------------------
# ✅ KEY CONCEPTS (NOW FULLY DYNAMIC):
# -----------------------------------------------------------
# - State Variables: loaded from `state_variables` in config
# - Action Effects: loaded from `action_effects` in config
# - Observation Space: dynamically computed based on config
# ===========================================================

class EthicsSupervisorEnv(gym.Env):
    """
    Fully dynamic Gym-compatible RL environment for the Thesis Assistant Ethics Supervisor.

    This environment simulates the state of a student's thesis progress and ethical
    interactions, allowing an RL agent to learn intervention policies. The state space,
    action space, and state transitions are defined by the provided configuration
    dictionary; the reward is currently a hard-coded placeholder (see `_compute_reward`).

    Attributes:
        ethics_module (MockEthicsModule): A simulated or actual system state object
                                           that holds the current values of the state variables.
        config (dict): The loaded RL configuration dictionary.
        state_variables (list): List of state variable names defined in the config.
        actions (list): List of available action labels defined in the config.
        action_effects (dict): Dictionary specifying how actions affect state variables (from config).
        observation_space (gym.spaces.Box): Dynamically computed continuous state space.
        action_space (gym.spaces.Discrete): Discrete action space based on the number of actions.
        timestep (int): The current simulation step count within an episode.
    """

    def __init__(self, ethics_module, config):
        """
        Initialize the environment.

        Args:
            ethics_module (MockEthicsModule): An external system state simulator
                                               (or actual system interface).
            config (dict): The full RL configuration dictionary.
        """
        super().__init__()

        self.ethics_module = ethics_module
        self.config = config

        self.state_variables = config["state_variables"]
        self.actions = config["actions"]
        self.action_effects = config.get("action_effects", {})

        # Fully dynamic observation space size based on the number of state variables
        self.observation_space = gym.spaces.Box(
            low=0, high=1, shape=(len(self.state_variables),), dtype=np.float32
        )

        # Discrete action space based on the number of defined actions
        self.action_space = gym.spaces.Discrete(len(self.actions))
        self.timestep = 0

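    def reset(self, seed=None, options=None):
        """
        Reset the environment at the start of a new episode.

        Args:
            seed (int, optional): Seed for random number generation. Defaults to None.
            options (dict, optional): Extra reset options accepted by the Gymnasium API
                                      (unused here). Defaults to None.
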
        Returns:
            tuple: A tuple containing the initial state and an info dictionary.
                   (state (np.array), info (dict))
        """
        super().reset(seed=seed) # Call the parent reset method
        self.timestep = 0
        # Reset the ethics module state for a new episode (assuming ethics_module has a reset method)
        if hasattr(self.ethics_module, 'reset'):
             self.ethics_module.reset()
        return self._get_state(), {} # Return the initial state and an empty info dict

    def step(self, action):
        """
        Execute one interaction step in the environment.

        The agent selects an action, which may affect the environment's state.
        A reward is computed, and the environment transitions to the next state.

        Args:
            action (int): The action index selected by the RL agent.

        Returns:
            tuple: A tuple containing the next state, the reward, a 'done' flag,
                   a 'truncated' flag, and an info dictionary.
                   (state (np.array), reward (float), done (bool), truncated (bool), info (dict))
        """
        # Compute the reward for the chosen action
        reward = self._compute_reward(action)
        # Apply the effects of the action to the environment's state
        self._apply_action_effects(action)
        self.timestep += 1
        # Determine if the episode is finished (e.g., based on timestep limit)
        done = (self.timestep >= 100) # Example termination condition
        return self._get_state(), reward, done, False, {} # truncated is False, info is empty dict

    def _get_state(self):
        """
        Construct the normalized state vector fully dynamically.

        This method gathers the current values of the state variables from the
        `ethics_module` object based on the configuration and formats them
        into a normalized NumPy array.

        Returns:
            np.array: The normalized state vector based on config-defined variables.
        """
        state = []
        # Iterate through the state variables defined in the configuration
        for var in self.state_variables:
            if var == "timestep":
                # Normalize timestep (assuming maximum 100 steps for normalization)
                state.append(self.timestep / 100.0)
            else:
                # Get the value from the ethics_module object, defaulting to 0.0 if the attribute doesn't exist
                value = getattr(self.ethics_module, var, 0.0)
                state.append(value)
        return np.array(state, dtype=np.float32) # Ensure float32 dtype for compatibility with RL libraries

    def _compute_reward(self, action):
        """
        Compute the reward for the chosen action.

        This is a placeholder reward function. In a real system, this would
        be more complex, potentially incorporating feedback from the user,
        advisor, and ethical violation signals.

        Args:
            action (int): The selected action index.

        Returns:
            float: The computed reward value.
        """
        base_reward = 1.0
        # Scale reward based on timestep (encouraging progress over time)
        time_scaling = 1 + 2.0 * min(1.0, self.timestep / 80.0)
        # Simple API cost based on action index (assuming higher indices are more "costly" actions)
        api_cost = 0.002 * action
        lambda_cost = 10 # Weight for the API cost
        return base_reward * time_scaling - lambda_cost * api_cost

    def _apply_action_effects(self, action):
        """
        Dynamically apply the effects of the selected action to the system state.

        Based on the 'action_effects' defined in the configuration, this method
        updates the corresponding state variables in the `ethics_module` object.

        Args:
            action (int): The selected action index.
        """
        # Get the effects defined for the chosen action (if any)
        effects = self.action_effects.get(str(action), {})
        # Iterate through the variables and their corresponding delta changes for this action
        for variable, delta in effects.items():
            current_value = getattr(self.ethics_module, variable, None)
            if current_value is not None:
                # Update the state variable, clipping the value between 0.0 and 1.0
                updated_value = np.clip(current_value + delta, 0.0, 1.0)
                setattr(self.ethics_module, variable, updated_value)


## Part 5: PPO Supervisor (EthicsSupervisorRL)

The `EthicsSupervisorRL` class is responsible for managing the Proximal Policy Optimization (PPO) agent, which serves as the core Reinforcement Learning component of the ethics module. This class handles the initialization, training, saving, loading, and action recommendation (inference) for the PPO model.

**Purpose:**
To implement and control the RL agent that learns an optimal policy for intervening in the thesis writing process to promote ethical behavior and positive outcomes. It trains the agent by interacting with the `EthicsSupervisorEnv` and provides action recommendations based on the current state.

**Key Components:**
- `__init__(config, model_path="ethics_rl_model")`: Initializes the PPO supervisor. It creates an instance of the `MockEthicsModule` (to represent the system state), initializes the `EthicsSupervisorEnv` using the provided configuration, and either loads a pre-trained PPO model from `model_path` or initializes a new PPO model if no saved model is found.
- `train(timesteps=50000)`: Trains the PPO model for a specified number of environment interaction steps. It calls the `learn()` method of the Stable Baselines3 PPO model, which handles the data collection (interacting with the environment), policy optimization, and value function updates. After training, it saves the updated model.
- `recommend_action()`: Takes the current state from the environment, feeds it to the trained PPO model's policy, and returns the recommended action index. This is the method used during online operation to get the agent's decision.

**How to Use:**
- Initialize the supervisor: `rl_supervisor = EthicsSupervisorRL(config, model_path="my_model")`. This will either load an existing model or create a new one.
- To train the model, call `rl_supervisor.train(timesteps=100000)`.
- To get an action recommendation based on the current environment state, call `action = rl_supervisor.recommend_action()`.

**Interaction with Other Components:**
- **Configuration Manager:** The supervisor receives the configuration dictionary (typically loaded with `RLConfigManager.load_config()`) and passes it to the environment, which uses it to define the RL problem (state/action spaces).
- **RL Environment:** The supervisor interacts directly with the `EthicsSupervisorEnv` during training (calling `env.step()`) and inference (getting the state via `env._get_state()`).
- **Mock Ethics Module:** The supervisor initializes an instance of `MockEthicsModule` and passes it to the environment. The environment then uses this object to manage the state.
- **RL Training Loop:** The `RLTrainingLoop` uses the `EthicsSupervisorRL` instance to perform the actual training (`trainer.train()`) and potentially get action recommendations (`trainer.recommend_action()`) within its training orchestration.
- **Synthetic Pretrainer:** The `SyntheticRLPretrainer` uses the `RLTrainingLoop`, which in turn uses `EthicsSupervisorRL`, to train the model on synthetic data.

**Data Structures and Configuration:**
- It manages a `stable_baselines3.PPO` model.
- It interacts with the environment using NumPy arrays for states and integers for actions.
- The PPO model's policy and value function are learned from the experience collected by interacting with the `EthicsSupervisorEnv`, which is configured using the dictionary loaded by `RLConfigManager`.


### Appendix: PPO Suitability Analysis for the Thesis Assistant

An analysis of the strengths and limitations of PPO for the thesis assistant RL system.

**✅ Pros (why PPO is suitable globally):**
- Stable policy optimization, even in high-dimensional state spaces.
- Supports multiple complex actions (advisory interventions, ethical warnings, etc.).
- Optimizes long-term reward, which helps with delayed ethical consequences.
- The clipping mechanism stabilizes policy updates, which is critical for safe ethical behavior.
- Can be trained globally across many users to provide a general ethical baseline.

**⚠ Cons (limitations for the personalization scenario):**
- Requires many training samples to fully converge (sample inefficient).
- Adapts slowly when applied directly to new individual students.
- May not personalize fast enough within a limited thesis timeframe (6-12 months).
- May struggle to adapt quickly to shifts in an individual student's personality.
- Receives feedback only indirectly through the reward signal, so few-shot adaptation is hard.

**✅ Recommended strategy:**
- Use PPO for global pretraining across many students (shared ethical policy).
- Introduce a lightweight per-student adaptation layer (a small fine-tuning component).
- Combine PPO with human-in-the-loop reward shaping for rapid personalization.
- Consider a hybrid architecture that combines PPO with bandits or meta-RL elements for few-shot adjustments.

This hybrid design balances PPO's global stability with the thesis assistant's need for efficient short-term personalization.
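
One way to picture the "PPO + bandits" idea is a small per-student layer that re-ranks the global policy's action scores using the rewards observed for that student. The sketch below is only an illustration under strong assumptions: `StudentBandit`, its `weight` parameter, and the `global_scores` values are hypothetical and not part of the notebook, and the selection rule is purely greedy (no exploration).

```python
class StudentBandit:
    """Hypothetical per-student layer: running mean reward per action."""

    def __init__(self, n_actions, weight=0.5):
        self.counts = [0] * n_actions
        self.means = [0.0] * n_actions
        self.weight = weight  # how strongly personal feedback shifts the global scores

    def update(self, action, reward):
        """Incrementally update the mean reward observed for `action`."""
        self.counts[action] += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]

    def choose(self, global_scores):
        """Pick the action maximizing global score + weight * personal mean."""
        combined = [g + self.weight * m for g, m in zip(global_scores, self.means)]
        return max(range(len(combined)), key=combined.__getitem__)


# Assumed output of the globally pretrained policy (e.g. action probabilities).
global_scores = [0.40, 0.35, 0.25]
bandit = StudentBandit(n_actions=3)
before = bandit.choose(global_scores)

# This student repeatedly responds well to action 1 and badly to action 0.
for _ in range(3):
    bandit.update(1, 1.0)
    bandit.update(0, -1.0)
after = bandit.choose(global_scores)
print(before, after)
```

With no personal feedback the layer simply follows the global scores; after a handful of observations it shifts to the action this student responds to, which is the kind of few-shot adjustment that PPO alone provides only slowly.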

In [27]:


# PART 5: PPO SUPERVISOR (Reinforcement Learning Agent Controller)
# ===========================================================
#
# This module manages the PPO RL agent training, saving, loading, and inference.
#
# - Clean separation of agent control logic from environment definition.
# - Compatible with stable-baselines3 PPO implementation.
# - Supports continual training and model persistence.
#
# -----------------------------------------------------------
# ✅ KEY CONCEPTS:
# -----------------------------------------------------------
# - PPO (Proximal Policy Optimization): modern stable RL algorithm
# - Continual training: keep refining policy incrementally
# - Safe reloading: easily resume training from saved checkpoints
# ===========================================================

class MockEthicsModule:
    """
    A mock class simulating the state of the ethics module and thesis progress.

    This class is used by the RL environment to represent the system state.
    It includes attributes that correspond to the state variables defined
    in the RL configuration.
    """
    def __init__(self):
        """
        Initialize the mock ethics module with random initial states.
        """
        # Initialize attributes that the environment might try to access
        self.embedding_drift = np.random.rand()
        self.ai_usage = np.random.rand()
        self.ethical_flags = np.random.rand()
        self.advisor_feedback = np.random.rand()
        self.deadline_ratio = np.random.rand() # Represents progress towards deadline (0.0 to 1.0)
        # Boolean event flags mirroring the reward_config keys that DataPreprocessor.compute_reward looks for in log entries
        self.user_revised = random.choice([True, False])
        self.ai_violation = random.choice([True, False])
        self.advisor_positive = random.choice([True, False])
        self.rewrite_accepted = random.choice([True, False])
        self.milestone_completed = random.choice([True, False])
        self.hallucination_detected = random.choice([True, False])
        self.prompt = "mock prompt" # Placeholder for prompt text
        self.intent = "mock intent" # Placeholder for intent
        self.thesis_stage = "mock stage" # Placeholder for thesis stage

    def reset(self):
        """
        Reset the mock ethics module state for a new episode.
        """
        self.embedding_drift = np.random.rand()
        self.ai_usage = np.random.rand()
        self.ethical_flags = np.random.rand()
        self.advisor_feedback = np.random.rand()
        self.deadline_ratio = np.random.rand()
        self.user_revised = random.choice([True, False])
        self.ai_violation = random.choice([True, False])
        self.advisor_positive = random.choice([True, False])
        self.rewrite_accepted = random.choice([True, False])
        self.milestone_completed = random.choice([True, False])
        self.hallucination_detected = random.choice([True, False])
        self.prompt = "mock prompt"
        self.intent = "mock intent"
        self.thesis_stage = "mock stage"


class EthicsSupervisorRL:
    """
    PPO Supervisor class controlling RL training and inference for the Ethics Supervisor.

    This class initializes, trains, saves, and loads the PPO model that acts
    as the RL agent for the ethics module. It interacts with the
    `EthicsSupervisorEnv` to learn the optimal intervention policy.

    Attributes:
        ethics_module (MockEthicsModule): Simulated system state object.
        env (EthicsSupervisorEnv): The RL environment instance.
        model (PPO): The Stable Baselines3 PPO policy model.
        model_path (str): The file path for saving and loading the PPO model.
    """

    def __init__(self, config, model_path="ethics_rl_model"):
        """
        Initialize the PPO agent.

        Args:
            config (dict): The RL configuration dictionary.
            model_path (str): The storage path for saving/loading the PPO model.
                              Defaults to "ethics_rl_model".
        """

        self.ethics_module = MockEthicsModule() # Initialize the mock ethics module
        self.env = EthicsSupervisorEnv(self.ethics_module, config) # Initialize the RL environment
        self.model_path = model_path

        # Load an existing model if available, otherwise initialize a new one
        if os.path.exists(model_path + ".zip"):
            self.model = PPO.load(model_path, env=self.env)
            print("Loaded pretrained RL model.")
        else:
            # Initialize a new PPO model with an MlpPolicy
            self.model = PPO("MlpPolicy", self.env, verbose=0) # verbose=0 to suppress training output
            print("Initialized new PPO model.")

    def train(self, timesteps=50000):
        """
        Train the PPO model for a specified number of timesteps.

        The model interacts with the environment to collect experience and update
        its policy.

        Args:
            timesteps (int): The total number of environment steps to train for.
                             Defaults to 50000.
        """
        self.model.learn(total_timesteps=timesteps)
        self.model.save(self.model_path) # Save the model after training
        print("Training complete and model saved.")

    def recommend_action(self):
        """
        Predict the next action based on the current state of the environment.

        This method uses the trained PPO policy to select an action given the
        current observation from the environment.

        Returns:
            int: The action index selected by the PPO policy.
        """
        state = self.env._get_state() # Get the current state from the environment
        # Predict the action using the trained model in deterministic mode
        action, _ = self.model.predict(state, deterministic=True)
        # The action is returned as a NumPy array, so extract the scalar value
        return int(action)



## Part 6: Continual RL Training Loop

The `RLTrainingLoop` class is designed to orchestrate the process of continually training the Reinforcement Learning agent using batches of usage logs. It acts as a bridge between the raw data (logs) and the RL training process managed by the `EthicsSupervisorRL`.

**Purpose:**
To facilitate the training of the RL agent using collected usage data. It takes batches of logs, processes them into states and rewards using the `DataPreprocessor`, and then triggers the training of the PPO agent managed by the `EthicsSupervisorRL`. This allows for updating the agent's policy as new data becomes available.

**Key Components:**
- `__init__(config, model_path="ppo_ethics_model")`: Initializes the training loop. It loads the configuration, initializes a `DataPreprocessor` instance, and initializes an `EthicsSupervisorRL` instance (which handles the PPO model).
- `run_training_day(log_batch)`: The main method for processing a batch of logs. It iterates through each entry in `log_batch`, uses the `DataPreprocessor` to extract the state and compute the reward, and prints both. It then calls `train()` on the `EthicsSupervisorRL` instance for `len(log_batch)` timesteps. In this simplified loop the PPO update runs against `EthicsSupervisorEnv`, and the extracted states and rewards are not passed to the trainer. Finally, it prints a recommended action from the updated policy.

**How to Use:**
- Initialize the training loop: `training_loop = RLTrainingLoop(config)`.
- Provide a batch of usage logs (a list of dictionaries) to the `run_training_day()` method: `training_loop.run_training_day(my_log_batch)`.
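As a rough illustration of the input format, the sketch below builds a two-entry log batch using the field names emitted by the simulators in this notebook, then converts each entry to a state vector and a scalar reward. The `extract_state` and `compute_reward` functions here are hypothetical stand-ins, and the reward weights are made-up values; the notebook's `DataPreprocessor` may order, scale, or weight the fields differently.

```python
from typing import Dict, List

# Numeric fields assumed to form the state vector (the order is an assumption).
STATE_KEYS = ["embedding_drift", "ai_usage", "ethical_flags",
              "advisor_feedback", "deadline_ratio"]

# Hypothetical reward weights for the boolean event flags (illustrative only).
REWARD_WEIGHTS = {
    "user_revised": 1.0,
    "ai_violation": -2.0,
    "advisor_positive": 1.5,
    "rewrite_accepted": 0.5,
    "milestone_completed": 2.0,
    "hallucination_detected": -1.0,
}


def extract_state(log: Dict) -> List[float]:
    """Stand-in for DataPreprocessor.extract_state: pick the numeric fields."""
    return [float(log[key]) for key in STATE_KEYS]


def compute_reward(log: Dict) -> float:
    """Stand-in for DataPreprocessor.compute_reward: sum weights of flags that are True."""
    return float(sum(w for flag, w in REWARD_WEIGHTS.items() if log.get(flag, False)))


my_log_batch = [
    {"embedding_drift": 0.2, "ai_usage": 0.3, "ethical_flags": 0.05,
     "advisor_feedback": 0.6, "deadline_ratio": 0.1,
     "user_revised": True, "ai_violation": False, "advisor_positive": True,
     "rewrite_accepted": True, "milestone_completed": False,
     "hallucination_detected": False},
    {"embedding_drift": 0.5, "ai_usage": 0.8, "ethical_flags": 0.4,
     "advisor_feedback": 0.3, "deadline_ratio": 0.9,
     "user_revised": False, "ai_violation": True, "advisor_positive": False,
     "rewrite_accepted": False, "milestone_completed": True,
     "hallucination_detected": True},
]

# Convert every log entry into a (state, reward) pair.
processed = [(extract_state(log), compute_reward(log)) for log in my_log_batch]
for state, reward in processed:
    print(f"State: {state}, Reward: {reward}")
```

A batch shaped like `my_log_batch` is what `run_training_day()` expects as its argument.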

**Interaction with Other Components:**
- **Configuration Manager:** The training loop works from the configuration produced by `RLConfigManager.load_config()`.
- **Data Preprocessor:** It uses the `DataPreprocessor` instance to convert raw log entries into state vectors and reward values.
- **PPO Supervisor (RL Agent):** It uses the `EthicsSupervisorRL` instance to run PPO training, with the size of the log batch setting the number of training timesteps.
- **Synthetic Pretrainer and RL Training Launcher:** These classes utilize the `RLTrainingLoop` to execute the training process, providing it with batches of logs (either synthetic or real).

**Data Structures and Configuration:**
- It takes a list of dictionaries (log entries) as input for training.
- It uses the `DataPreprocessor` to work with state vectors (NumPy arrays) and reward values (floats).
- The training process itself is managed by the `EthicsSupervisorRL` class, which interacts with the `EthicsSupervisorEnv` based on the configuration.
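The training loop's own comments note that a standard RL setup would accumulate experience tuples rather than print each state and reward. The sketch below shows one way consecutive log entries could be paired into such tuples. The `Transition` type and `logs_to_transitions` helper are hypothetical names introduced for illustration, and the per-log rewards are supplied by the caller.

```python
from typing import Dict, List, NamedTuple

# Numeric fields assumed to form the state vector (the order is an assumption).
STATE_KEYS = ["embedding_drift", "ai_usage", "ethical_flags",
              "advisor_feedback", "deadline_ratio"]


class Transition(NamedTuple):
    """One step of experience reconstructed from two consecutive logs."""
    state: List[float]
    reward: float
    next_state: List[float]
    done: bool


def to_state(log: Dict) -> List[float]:
    return [float(log[key]) for key in STATE_KEYS]


def logs_to_transitions(logs: List[Dict], rewards: List[float]) -> List[Transition]:
    """Pair log i with log i+1; the last pair is marked as the end of the episode."""
    transitions = []
    for i in range(len(logs) - 1):
        transitions.append(Transition(
            state=to_state(logs[i]),
            reward=rewards[i],
            next_state=to_state(logs[i + 1]),
            done=(i == len(logs) - 2),
        ))
    return transitions


# Three consecutive logs from one hypothetical student.
logs = [
    {"embedding_drift": 0.20, "ai_usage": 0.30, "ethical_flags": 0.05,
     "advisor_feedback": 0.60, "deadline_ratio": 0.03},
    {"embedding_drift": 0.22, "ai_usage": 0.33, "ethical_flags": 0.06,
     "advisor_feedback": 0.62, "deadline_ratio": 0.06},
    {"embedding_drift": 0.25, "ai_usage": 0.35, "ethical_flags": 0.06,
     "advisor_feedback": 0.65, "deadline_ratio": 0.09},
]
rewards = [1.0, -0.5, 2.0]

transitions = logs_to_transitions(logs, rewards)
print(len(transitions), transitions[-1].done)
```

The logs in this notebook record no action, so these tuples are incomplete for policy learning; an action field would need to be logged alongside each entry.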




In [28]:

# PART 6 — CONTINUAL RL TRAINING LOOP
# ===========================================================

class RLTrainingLoop:
    """
    Orchestrates the continual RL training process using batches of usage logs.

    This class manages the flow of data from usage logs to the RL training process.
    It uses the `DataPreprocessor` to convert logs into states and rewards and
    the `EthicsSupervisorRL` to train the PPO agent.
    """

    def __init__(self, config, model_path="ppo_ethics_model"):
        """
        Initialize training loop components.

        Args:
            config (dict): The loaded RL configuration.
            model_path (str): The path to the PPO model storage. Defaults to "ppo_ethics_model".
        """
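        # Use the configuration passed by the caller; fall back to RLConfigManager if none is given
        self.config = config if config is not None else RLConfigManager.load_config()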
        self.preprocessor = DataPreprocessor(self.config) # Initialize data preprocessor
        # Initialize the RL trainer, passing the config and model path
        self.trainer = EthicsSupervisorRL(self.config, model_path)

    def run_training_day(self, log_batch):
        """
        Process one batch of usage logs and train the PPO agent.

        This method iterates through the provided log batch, processes each log
        into a state and reward, and then uses these to train the RL agent.
        Note: In a true online RL setting, training would occur more frequently
        and interactively with the environment. This simulates batch training
        from collected logs.

        Args:
            log_batch (list of dict): A list of usage logs for one training day or batch.
        """
        print(f"Processing batch of {len(log_batch)} logs for training...")
        # Simplification: a standard RL loop would collect experiences
        # (state, action, reward, next_state, done) by interacting with the
        # environment and train on those. Here each log entry is only converted
        # to a state and a reward and printed, to demonstrate the integration points.
        for log_entry in log_batch:
            state_vector = self.preprocessor.extract_state(log_entry)
            reward = self.preprocessor.compute_reward(log_entry)
            print(f"Processed Log → State: {state_vector}, Reward: {reward}")

        # Train the PPO agent, using the batch size as the number of timesteps.
        # A more realistic approach would run episodes in the environment and
        # train on the collected trajectories.
        self.trainer.train(timesteps=len(log_batch))
        # Example of querying the updated policy after training on the batch
        action = self.trainer.recommend_action()
        print(f"Recommended action after training on batch: {action}")


## Part 7: Synthetic Thesis Student Simulator (ThesisStudentSimulator)

The `ThesisStudentSimulator` class provides a way to generate synthetic data that mimics the progression and ethical interactions of a thesis student using the assistant. This simulator is essential for generating datasets to pre-train and test the Reinforcement Learning ethics supervisor in a controlled environment.

**Purpose:**
To create realistic (though simplified) sequences of events and state changes that a thesis student might experience over the course of their project. This synthetic data includes simulated metrics like AI usage, ethical flags, advisor feedback, and progress towards the deadline, which are used to train the RL agent.

**Key Components:**
- `__init__(student_type="Stable Performer")`: Initializes the simulator for a specific type of student (e.g., "Conservative", "Aggressive", "Struggling", "Stable Performer"). Each student type has different tendencies influencing how their state variables evolve.
- `evolve_one_step()`: Simulates one step in the student's thesis journey. It updates the state variables based on the student type and progress towards the deadline and generates a dictionary representing a single log entry for this step.
- `grade_final_outcome(last_state)`: A static method that calculates a final grade and reward for a simulated thesis trajectory based on the state of the simulator at the end of the trajectory. This provides a terminal reward signal for the simulation.
- `generate_full_trajectory_with_grading(student_type, trajectory_length=30)`: A static method that runs a full simulation trajectory for a specified student type and length, collecting all log entries and returning the list of logs along with the final grade and terminal reward.
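The state update inside `evolve_one_step()` combines random drift with a pull toward a target value that strengthens as the deadline approaches. The sketch below isolates that pull and the clipping step, leaving out the Gaussian noise so the arithmetic is easy to follow. `clip01` and `pull_toward` are hypothetical helper names; the target (0.5) and rate (0.1) are the ones the simulator uses for `ai_usage`.

```python
def clip01(x: float) -> float:
    """Keep a value inside the [0, 1] range."""
    return max(0.0, min(1.0, x))


def pull_toward(value: float, target: float, rate: float, deadline_ratio: float) -> float:
    """Move `value` a fraction of the way to `target`, scaled by deadline progress."""
    return clip01(value + (target - value) * rate * deadline_ratio)


ai_usage = 0.9

# Early in the project the pull is weak: 0.9 + (0.5 - 0.9) * 0.1 * 0.1
early = pull_toward(ai_usage, target=0.5, rate=0.1, deadline_ratio=0.1)

# At the deadline the pull is ten times stronger: 0.9 + (0.5 - 0.9) * 0.1 * 1.0
late = pull_toward(ai_usage, target=0.5, rate=0.1, deadline_ratio=1.0)

print(round(early, 3), round(late, 3))
```

With the noise removed, the same starting value moves by 0.004 early on and by 0.04 at the deadline, which is why simulated students of every type drift toward similar values late in a trajectory.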

**How to Use:**
- To simulate a single student's journey, create an instance: `student_sim = ThesisStudentSimulator("Struggling")` and repeatedly call `student_sim.evolve_one_step()` to get step-by-step logs.
- To generate a complete trajectory and its grading, call the static method: `logs, grade, reward = ThesisStudentSimulator.generate_full_trajectory_with_grading("Aggressive", trajectory_length=50)`.
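The grade returned by that call comes from a weighted score over the final log entry. The sketch below reproduces the arithmetic with the same weights and thresholds as `grade_final_outcome`, applied to three hand-picked final states. The `score_and_grade` name and the sample values are assumptions made for illustration.

```python
from typing import Dict, Tuple


def score_and_grade(last_log: Dict[str, float]) -> Tuple[float, str, float]:
    """Return (total_score, grade, terminal_reward) for a final log entry."""
    ai_penalty = (last_log["ai_usage"] - 0.5) * 0.5        # negative below 0.5
    ethics_penalty = last_log["ethical_flags"] * 1.5
    advisor_bonus = last_log["advisor_feedback"] * 2.0
    embedding_penalty = last_log["embedding_drift"] * 0.3
    total = advisor_bonus - ethics_penalty - ai_penalty - embedding_penalty
    if total > 1.2:
        return total, "Excellent", 5.0
    if total > 0:
        return total, "Acceptable", 2.0
    return total, "Failed", -5.0


# Values equal to the simulator's convergence targets.
strong = {"ai_usage": 0.5, "ethical_flags": 0.1,
          "advisor_feedback": 0.8, "embedding_drift": 0.3}
# Moderate AI usage and middling advisor feedback.
middling = {"ai_usage": 0.7, "ethical_flags": 0.3,
            "advisor_feedback": 0.5, "embedding_drift": 0.4}
# Heavy AI usage, many ethical flags, weak advisor feedback.
weak = {"ai_usage": 0.9, "ethical_flags": 0.6,
        "advisor_feedback": 0.3, "embedding_drift": 0.5}

for name, final_state in [("strong", strong), ("middling", middling), ("weak", weak)]:
    total, grade, reward = score_and_grade(final_state)
    print(f"{name}: score={total:.2f}, grade={grade}, reward={reward}")
```

Advisor feedback carries the largest weight (2.0), so a final `advisor_feedback` below about 0.6 makes an "Excellent" grade unreachable even with no penalties from flags or drift, unless low AI usage contributes a bonus.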

**Interaction with Other Components:**
- **Data Preprocessor:** The `evolve_one_step()` method generates log entries in a dictionary format that is compatible with the input expected by the `DataPreprocessor`.
- **Thesis Cohort Simulator:** The `ThesisCohortSimulator` uses the `generate_full_trajectory_with_grading` method to create datasets for multiple students.
- **Synthetic RL Pretrainer:** The `SyntheticRLPretrainer` utilizes the data generated by the simulator (via the `ThesisCohortSimulator`) to train the RL agent.

**Data Structures and Configuration:**
- The simulator maintains internal state variables (e.g., `ai_usage`, `ethical_flags`) as numerical values, typically floats between 0.0 and 1.0.
- It generates output as dictionaries, where each dictionary represents a single usage log entry containing the state variables and simulated event flags.
- The behavior is influenced by the `student_type` string.

In [29]:
# PART 7 — SYNTHETIC THESIS STUDENT SIMULATOR WITH OUTCOME GRADING
# ===========================================================

class ThesisStudentSimulator:
    """
    Simulates the progress and ethical behavior of a synthetic thesis student.

    This class generates synthetic usage logs and simulates changes in state
    variables (like AI usage, ethical flags, etc.) over time, based on
    predefined student types. It also includes a method to grade the final
    outcome of a simulated thesis trajectory.
    """
    def __init__(self, student_type="Stable Performer"):
        """
        Initialize the simulator for a specific type of student.

        Args:
            student_type (str): The type of student to simulate
                                 ("Conservative", "Aggressive", "Struggling",
                                  "Stable Performer"). Defaults to "Stable Performer".
        """
        self.student_type = student_type
        # Initialize state variables
        self.embedding_drift = 0.2
        self.ai_usage = 0.3
        self.ethical_flags = 0.05
        self.advisor_feedback = 0.6
        self.deadline_ratio = 0.0 # Represents progress towards deadline (0.0 to 1.0)

    def evolve_one_step(self):
        """
        Simulate one step of the student's thesis progress and generate a log entry.

        State variables evolve based on the student type and time progression.

        Returns:
            dict: A dictionary representing a single usage log entry with updated state.
        """
        # Simulate progress towards the deadline
        self.deadline_ratio = min(self.deadline_ratio + 0.03, 1.0) # Increment deadline ratio

        # Simulate changes in state variables based on student type
        if self.student_type == "Conservative":
            self.ai_usage += np.random.normal(0.01, 0.02)
            self.ethical_flags += np.random.normal(0.0, 0.01)
            self.advisor_feedback += np.random.normal(0.02, 0.05)
            self.embedding_drift += np.random.normal(0.01, 0.02)
        elif self.student_type == "Aggressive":
            self.ai_usage += np.random.normal(0.05, 0.05)
            self.ethical_flags += np.random.normal(0.02, 0.03)
            self.advisor_feedback += np.random.normal(-0.02, 0.05)
            self.embedding_drift += np.random.normal(0.03, 0.05)
        elif self.student_type == "Struggling":
            self.ai_usage += np.random.normal(0.03, 0.03)
            self.ethical_flags += np.random.normal(0.05, 0.05)
            self.advisor_feedback += np.random.normal(-0.03, 0.05)
            self.embedding_drift += np.random.normal(0.04, 0.05)
        elif self.student_type == "Stable Performer":
            self.ai_usage += np.random.normal(0.02, 0.02)
            self.ethical_flags += np.random.normal(0.01, 0.01)
            self.advisor_feedback += np.random.normal(0.03, 0.04)
            self.embedding_drift += np.random.normal(0.02, 0.02)

        # Increase ethical flags towards the end of the project (simulating pressure)
        if self.deadline_ratio > 0.8:
            self.ethical_flags += 0.02

        # Simulate some convergence towards target values as the deadline approaches
        convergence_factor = self.deadline_ratio
        self.ai_usage += (0.5 - self.ai_usage) * 0.1 * convergence_factor
        self.ethical_flags += (0.1 - self.ethical_flags) * 0.1 * convergence_factor
        self.advisor_feedback += (0.8 - self.advisor_feedback) * 0.1 * convergence_factor
        self.embedding_drift += (0.3 - self.embedding_drift) * 0.05 * convergence_factor

        # Clip state variables to remain within the [0, 1] range
        self.ai_usage = np.clip(self.ai_usage, 0, 1.0)
        self.ethical_flags = np.clip(self.ethical_flags, 0, 1.0)
        self.advisor_feedback = np.clip(self.advisor_feedback, 0, 1.0)
        self.embedding_drift = np.clip(self.embedding_drift, 0, 1.0)

        # Create a log entry with the current state and some simulated events (for reward calculation)
        log_entry = {
            "embedding_drift": self.embedding_drift,
            "ai_usage": self.ai_usage,
            "ethical_flags": self.ethical_flags,
            "advisor_feedback": self.advisor_feedback,
            "deadline_ratio": self.deadline_ratio,
            # Simulate boolean event flags based on probabilities or state
            "user_revised": random.random() < 0.6, # Probability of user revising content
            "ai_violation": random.random() < self.ethical_flags, # Higher ethical flags increase chance of violation
            "advisor_positive": random.random() < self.advisor_feedback, # Higher feedback increases chance of positive advisor event
            "rewrite_accepted": random.random() < 0.7, # Probability of rewrite suggestion being accepted
            "milestone_completed": random.random() < 0.4, # Probability of completing a milestone
            "hallucination_detected": random.random() < 0.1 # Probability of detecting a hallucination
        }
        return log_entry

    @staticmethod
    def grade_final_outcome(last_state):
        """
        Grades the final outcome of a simulated thesis based on the last state.

        This is a simplified grading function for simulation purposes.

        Args:
            last_state (dict): The final log entry of a trajectory (the simulator's last state).

        Returns:
            tuple: A tuple containing the grade ("Excellent", "Acceptable", "Failed")
                   and a corresponding numerical reward.
        """
        # Calculate penalties and bonuses based on the final state values
        ai_penalty = (last_state["ai_usage"] - 0.5) * 0.5 # Penalty when AI usage exceeds 0.5; negative (a bonus) below 0.5
        ethics_penalty = last_state["ethical_flags"] * 1.5 # Penalty for ethical flags
        advisor_bonus = last_state["advisor_feedback"] * 2.0 # Bonus for positive advisor feedback
        embedding_penalty = last_state["embedding_drift"] * 0.3 # Penalty for high embedding drift
        # Calculate total score
        total_score = advisor_bonus - ethics_penalty - ai_penalty - embedding_penalty

        # Assign grade and reward based on the total score
        if total_score > 1.2:
            return "Excellent", 5.0
        elif total_score > 0:
            return "Acceptable", 2.0
        else:
            return "Failed", -5.0

    @staticmethod
    def generate_full_trajectory_with_grading(student_type, trajectory_length=30):
        """
        Generates a full simulated thesis trajectory for a student and grades it.

        Args:
            student_type (str): The type of student to simulate.
            trajectory_length (int): The number of steps in the simulation trajectory.
                                     Defaults to 30.

        Returns:
            tuple: A tuple containing:
                   - logs (list of dict): The list of log entries generated during the trajectory.
                   - grade (str): The final grade ("Excellent", "Acceptable", "Failed").
                   - reward (float): The terminal reward associated with the final grade.
        """
        student = ThesisStudentSimulator(student_type)
        logs = []
        # Evolve the student's state for the specified trajectory length
        for _ in range(trajectory_length):
            logs.append(student.evolve_one_step())
        # Grade the final outcome based on the last state in the trajectory
        grade, terminal_reward = ThesisStudentSimulator.grade_final_outcome(logs[-1])
        return logs, grade, terminal_reward


## Part 8: Multi-Student Synthetic Cohort Generator (ThesisCohortSimulator)

The `ThesisCohortSimulator` class is designed to generate a dataset of simulated thesis trajectories for an entire cohort of diverse synthetic students. This aggregated dataset is crucial for the initial pre-training of the Reinforcement Learning ethics supervisor, providing a broad range of scenarios and behaviors.

**Purpose:**
To efficiently create a large, varied dataset of synthetic student interactions and outcomes. This dataset is used to train the RL agent to develop a generalizable ethical policy across different student types before any potential per-student fine-tuning.

**Key Components:**
- `STUDENT_TYPES`: A class attribute list defining the different types of students that can be simulated ("Conservative", "Aggressive", "Struggling", "Stable Performer").
- `generate_cohort_dataset(num_students=100, trajectory_length=30)`: A static method that is the primary function of this class. It generates the specified number of synthetic students, each with a randomly assigned type, runs a full simulation trajectory for each using the `ThesisStudentSimulator`, and collects all the resulting logs, final grades, and terminal rewards into a single dataset. It also prints a summary of the final grades distribution within the generated cohort.

**How to Use:**
- To generate a dataset for a cohort of 100 students with trajectories of 30 steps each, simply call the static method: `cohort_data = ThesisCohortSimulator.generate_cohort_dataset(num_students=100, trajectory_length=30)`. The returned `cohort_data` is a list where each element is a dictionary containing the student type, their full trajectory of logs, their final grade, and the terminal reward.

**Interaction with Other Components:**
- **Thesis Student Simulator:** The `ThesisCohortSimulator` relies heavily on the `ThesisStudentSimulator.generate_full_trajectory_with_grading` static method to produce individual student trajectories and outcomes.
- **Synthetic RL Pretrainer:** The `SyntheticRLPretrainer` class uses the `generate_cohort_dataset` method to obtain the large pool of synthetic logs required for pre-training the RL agent. It then flattens the trajectories from this dataset into a single list of logs for the training process.
- **RLTrainingLoop:** Although not directly interacted with by this class, the dataset generated here is ultimately fed into the `RLTrainingLoop` by the `SyntheticRLPretrainer`.

**Data Structures and Configuration:**
- The class uses the predefined `STUDENT_TYPES` list.
- The output is a list of dictionaries, each representing a simulated student with their complete `trajectory` (a list of log dictionaries), their `grade` (string), and `final_reward` (float).
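To make the record layout concrete, the sketch below hand-writes a tiny cohort in the same shape and summarizes it. The four records and their values are invented for illustration (trajectories are shortened to a single partial log entry), and the per-type breakdown is an extra that `generate_cohort_dataset` itself does not print.

```python
from collections import Counter, defaultdict

# Hand-written records mirroring the structure of the generated dataset.
cohort_data = [
    {"student_type": "Conservative", "trajectory": [{"ai_usage": 0.31}],
     "grade": "Excellent", "final_reward": 5.0},
    {"student_type": "Aggressive", "trajectory": [{"ai_usage": 0.82}],
     "grade": "Failed", "final_reward": -5.0},
    {"student_type": "Struggling", "trajectory": [{"ai_usage": 0.64}],
     "grade": "Failed", "final_reward": -5.0},
    {"student_type": "Stable Performer", "trajectory": [{"ai_usage": 0.45}],
     "grade": "Acceptable", "final_reward": 2.0},
]

# Overall grade distribution, equivalent to the summary printed after generation.
grade_counts = Counter(record["grade"] for record in cohort_data)

# Grade distribution broken down by student type.
grades_by_type = defaultdict(Counter)
for record in cohort_data:
    grades_by_type[record["student_type"]][record["grade"]] += 1

# Average terminal reward across the cohort.
mean_reward = sum(r["final_reward"] for r in cohort_data) / len(cohort_data)

print(dict(grade_counts))
print({student_type: dict(c) for student_type, c in grades_by_type.items()})
print(mean_reward)
```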


In [17]:
# ===========================================================
# PART 8 — MULTI-STUDENT SYNTHETIC COHORT GENERATOR
# ===========================================================

class ThesisCohortSimulator:
    """
    Generates a dataset of simulated thesis trajectories for a cohort of students.

    This class uses the `ThesisStudentSimulator` to create trajectories for
    multiple students of different types, providing a dataset for training
    and evaluating the RL agent.
    """
    STUDENT_TYPES = ["Conservative", "Aggressive", "Struggling", "Stable Performer"]

    @staticmethod
    def generate_cohort_dataset(num_students=100, trajectory_length=30):
        """
        Generates a dataset of simulated thesis trajectories for a cohort.

        Args:
            num_students (int): The number of students to simulate. Defaults to 100.
            trajectory_length (int): The number of steps in each student's trajectory.
                                     Defaults to 30.

        Returns:
            list of dict: A list of dictionaries, where each dictionary represents
                          a student and contains their trajectory, final grade,
                          and terminal reward.
        """
        dataset = []
        grade_summary = {"Excellent": 0, "Acceptable": 0, "Failed": 0}
        # Generate trajectories for the specified number of students
        for _ in range(num_students):
            # Randomly select a student type
            student_type = random.choice(ThesisCohortSimulator.STUDENT_TYPES)
            # Generate a full trajectory and grade for the student
            logs, grade, terminal_reward = ThesisStudentSimulator.generate_full_trajectory_with_grading(
                student_type, trajectory_length)
            dataset.append({
                "student_type": student_type,
                "trajectory": logs,
                "grade": grade,
                "final_reward": terminal_reward
            })
            grade_summary[grade] += 1 # Count the grades for summary

        # Print a summary of the generated cohort grades
        print("Cohort Generation Complete:")
        for grade, count in grade_summary.items():
            print(f"  {grade}: {count} students")
        return dataset




## Part 9: Synthetic RL Pretraining Pipeline (SyntheticRLPretrainer)

The `SyntheticRLPretrainer` class orchestrates the process of pre-training the Reinforcement Learning ethics supervisor using a large synthetic dataset generated by the `ThesisCohortSimulator`. This step is typically done before applying the RL agent to real student data to provide it with an initial understanding of the environment and ethical considerations.

**Purpose:**
To automate the generation of a comprehensive synthetic dataset and use it to train the `EthicsSupervisorRL` agent, establishing a foundational policy for ethical guidance.

**Key Components:**
- `__init__(config, model_path="ppo_ethics_model")`: Initializes the pretrainer. It takes the RL configuration and the desired model path as input, and internally initializes an `RLTrainingLoop` instance, which in turn manages the `EthicsSupervisorRL` agent.
- `run_synthetic_pretraining(num_students=100, trajectory_length=30)`: The main method to trigger the pretraining process. It first calls the `ThesisCohortSimulator.generate_cohort_dataset` method to get the synthetic data, then flattens the trajectories from all students into a single list of logs, and finally passes this aggregated list of logs to the `RLTrainingLoop.run_training_day` method to train the RL agent.

**How to Use:**
- Through the launcher: after initializing the `RLTrainingLauncher` (which initializes the `SyntheticRLPretrainer`), call `launcher.run_synthetic_full_pretraining()` to run the pretraining with default settings (100 students, 30 steps per trajectory).
- Directly: initialize the `SyntheticRLPretrainer` with a config, then call `pretrainer.run_synthetic_pretraining(num_students=200, trajectory_length=40)` to specify different parameters.

**Interaction with Other Components:**
- **RLConfigManager:** The pretrainer is initialized with the configuration loaded by `RLConfigManager`, which is then passed down to the `RLTrainingLoop` and `EthicsSupervisorRL`.
- **ThesisCohortSimulator:** It directly calls the `ThesisCohortSimulator.generate_cohort_dataset` static method to obtain the synthetic training data.
- **RLTrainingLoop:** It uses an instance of `RLTrainingLoop` to handle the actual process of feeding logs to the `DataPreprocessor` and training the `EthicsSupervisorRL` agent on this data.
- **EthicsSupervisorRL:** The training of the PPO agent is managed by the `RLTrainingLoop` instance held within the pretrainer.

**Data Structures and Configuration:**
- It works with the list of dictionaries returned by the `ThesisCohortSimulator`, processing the `trajectory` lists within that dataset.
- The training parameters (like the number of students and trajectory length) are passed as arguments to the `run_synthetic_pretraining` method.
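The flattening step can be sketched on its own. The example below builds a small stand-in dataset (three students, four steps each, with a single field per log) and concatenates the trajectories with `itertools.chain.from_iterable`, which produces the same list as the `extend` loop in `run_synthetic_pretraining`. The `make_stand_in_dataset` helper is hypothetical and replaces `ThesisCohortSimulator` so the snippet runs by itself.

```python
from itertools import chain
from typing import Dict, List


def make_stand_in_dataset(num_students: int, trajectory_length: int) -> List[Dict]:
    """Hypothetical replacement for ThesisCohortSimulator.generate_cohort_dataset."""
    dataset = []
    for student_id in range(num_students):
        trajectory = [
            {"student_id": student_id, "deadline_ratio": round(0.03 * (step + 1), 2)}
            for step in range(trajectory_length)
        ]
        dataset.append({"student_type": "Stable Performer",
                        "trajectory": trajectory,
                        "grade": "Acceptable",
                        "final_reward": 2.0})
    return dataset


dataset = make_stand_in_dataset(num_students=3, trajectory_length=4)

# Concatenate every student's trajectory into one flat list of log entries.
all_logs = list(chain.from_iterable(student["trajectory"] for student in dataset))

print(f"Total synthetic logs: {len(all_logs)}")
```

Flattening keeps the logs in student order but drops the boundaries between students, along with each student's `grade` and `final_reward`, so those terminal signals do not reach the training loop in this pipeline.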


In [30]:
# ===========================================================
# PART 9 — SYNTHETIC RL PRETRAINING PIPELINE
# ===========================================================

class SyntheticRLPretrainer:
    """
    Manages the pretraining of the RL agent using synthetic thesis student data.

    This class uses the `ThesisCohortSimulator` to generate a large dataset
    of synthetic logs and then trains the RL agent (`EthicsSupervisorRL`)
    using this data via the `RLTrainingLoop`.
    """
    def __init__(self, config, model_path="ppo_ethics_model"):
        """
        Initialize the synthetic pretrainer.

        Args:
            config (dict): The loaded RL configuration.
            model_path (str): The path to the PPO model storage. Defaults to "ppo_ethics_model".
        """
        self.config = config
        # Initialize the training loop with the config and model path
        self.training_loop = RLTrainingLoop(config, model_path)

    def run_synthetic_pretraining(self, num_students=100, trajectory_length=30):
        """
        Runs the full synthetic pretraining pipeline.

        Generates a synthetic cohort dataset and trains the RL agent on the
        collected trajectories.

        Args:
            num_students (int): The number of synthetic students to generate data for.
                                Defaults to 100.
            trajectory_length (int): The length of each student's trajectory.
                                     Defaults to 30.
        """
        print("\nGenerating synthetic cohort dataset for pretraining...")
        # Generate the synthetic dataset
        dataset = ThesisCohortSimulator.generate_cohort_dataset(num_students, trajectory_length)
        # Flatten the trajectories from all students into a single list of logs
        all_logs = []
        for student in dataset:
            all_logs.extend(student["trajectory"])

        print(f"Total synthetic logs for PPO training: {len(all_logs)}")
        # Run the training loop on the collected synthetic logs
        self.training_loop.run_training_day(all_logs)



## Part 10: Final Launcher (RLTrainingLauncher)

The `RLTrainingLauncher` class serves as the main entry point and master system launcher for the entire thesis RL assistant training and configuration system. It brings together all the previously defined components and provides different modes of operation for development, simulation, and training with real or synthetic data.

**Purpose:**
To provide a single interface for initializing the RL system components, launching the developer dashboard, running simulated training, handling real data training, managing online incremental updates, and executing the full synthetic pretraining pipeline.

**Key Components:**
- `__init__()`: Initializes the launcher by loading the RL configuration using `RLConfigManager`, initializing the `DataPreprocessor`, creating an instance of the `RLTrainingLoop` (which includes the `EthicsSupervisorRL` agent), and initializing the `SyntheticRLPretrainer`.
- `launch_dashboard()`: Launches the Streamlit-based `DeveloperDashboard` for interactive configuration of the RL system.
- `run_simulated_training(days=3, batch_size=5)`: Runs a step-by-step simulation of training using a `MockSimulator` to generate small batches of logs over several simulated "days". This mode is useful for basic testing and debugging of the training loop.
- `run_real_training(real_logs)`: Takes a list of actual collected usage logs (`real_logs`) and feeds them into the `RLTrainingLoop` for training the RL agent on real-world data.
- `run_online_incremental_training(incremental_logs)`: Designed for online learning scenarios. It takes a list of newly collected `incremental_logs` and uses the `RLTrainingLoop` to fine-tune the existing RL model with this new data.
- `run_synthetic_full_pretraining(num_students=100, trajectory_length=30)`: Triggers the full synthetic pretraining pipeline by calling the `run_synthetic_pretraining` method of the `SyntheticRLPretrainer` instance. This generates a large synthetic dataset and trains the RL agent on it.

**How to Use:**
- Instantiate the launcher: `launcher = RLTrainingLauncher()`.
- Select a mode of operation based on user input or script logic:
    - `launcher.launch_dashboard()`: To start the configuration dashboard (requires Streamlit).
    - `launcher.run_simulated_training(days=5, batch_size=10)`: To run a short simulated training session.
    - `launcher.run_real_training(my_real_logs)`: To train with your collected real logs.
    - `launcher.run_online_incremental_training(new_logs)`: To perform incremental online updates with new data.
    - `launcher.run_synthetic_full_pretraining(num_students=200, trajectory_length=50)`: To run the comprehensive synthetic pretraining.
- The `if __name__ == "__main__":` block selects the mode when the script is run directly. The mode is set to `train_synthetic_full` by default; uncomment the `input()` line to choose the mode interactively.
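
The mode selection can also be written as a lookup table instead of an `if`/`elif` chain. This is a minimal sketch, not the notebook's implementation: `MODE_TO_METHOD`, `dispatch_mode`, and `StubLauncher` are hypothetical names introduced for illustration, and the stub stands in for `RLTrainingLauncher` so the snippet runs on its own.

```python
# Hypothetical mapping from the mode strings printed by the entry point
# to the name of the launcher method each one should call.
MODE_TO_METHOD = {
    "dashboard": "launch_dashboard",
    "train_simulated": "run_simulated_training",
    "train_real": "run_real_training",
    "train_online": "run_online_incremental_training",
    "train_synthetic_full": "run_synthetic_full_pretraining",
}


def dispatch_mode(launcher, mode, *args, **kwargs):
    """Call the launcher method registered for `mode`; raise on unknown modes."""
    try:
        method_name = MODE_TO_METHOD[mode]
    except KeyError:
        raise ValueError(f"Invalid mode selected: {mode!r}") from None
    return getattr(launcher, method_name)(*args, **kwargs)


class StubLauncher:
    """Stand-in for RLTrainingLauncher that only records which method ran."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Reached only for attributes that do not exist, i.e. the launcher methods.
        def record(*args, **kwargs):
            self.calls.append((name, args, kwargs))
            return name
        return record


stub = StubLauncher()
dispatch_mode(stub, "train_simulated", days=5, batch_size=10)
dispatch_mode(stub, "train_real", [{"prompt": "example"}])
print(stub.calls)
```

With a real launcher, the same call would look like `dispatch_mode(launcher, "train_synthetic_full", num_students=200, trajectory_length=50)`. Unlike the `else` branch in the notebook, which only prints a message, this sketch raises `ValueError` on an unknown mode so that a typo cannot pass silently.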

**Interaction with Other Components:**
- **RLConfigManager:** Used during initialization to load the system configuration.
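- **DataPreprocessor:** An instance is created and held by the launcher. The launcher builds `RLTrainingLoop` from the configuration alone, so this instance is not passed to the training loop in the launcher code.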
- **RLTrainingLoop:** An instance is held and used by the launcher to perform training with both simulated and real/incremental logs.
- **SyntheticRLPretrainer:** An instance is held and used to execute the full synthetic pretraining pipeline.
- **DeveloperDashboard:** An instance is created and launched when the 'dashboard' mode is selected.
- **MockSimulator:** An internal mock class used specifically by `run_simulated_training` to generate synthetic logs for that mode.
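
`MockSimulator` draws values from both `np.random` and `random`, so a simulated run is repeatable only if both generators are seeded. The sketch below assumes reproducible batches are wanted for debugging. `generate_seeded_batch` is a hypothetical helper, not part of the notebook; it uses only the standard library's `random.Random` and mirrors the keys and probabilities of `MockSimulator.generate_batch`.

```python
import random

# Fields drawn uniformly from [0, 1), mirroring the np.random.rand() fields.
CONTINUOUS_KEYS = (
    "embedding_drift",
    "ai_usage",
    "ethical_flags",
    "advisor_feedback",
    "deadline_ratio",
)

# Boolean fields with the probability of being True used by MockSimulator.
EVENT_PROBABILITIES = {
    "user_revised": 0.6,
    "ai_violation": 0.1,
    "advisor_positive": 0.8,
    "rewrite_accepted": 0.7,
    "milestone_completed": 0.4,
    "hallucination_detected": 0.1,
}


def generate_seeded_batch(batch_size, seed=0):
    """Return `batch_size` mock logs; the same seed yields the same batch."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    batch = []
    for _ in range(batch_size):
        log = {key: rng.random() for key in CONTINUOUS_KEYS}
        log.update({key: rng.random() < p for key, p in EVENT_PROBABILITIES.items()})
        log.update({
            "prompt": "mock prompt",
            "intent": "mock intent",
            "thesis_stage": "mock stage",
        })
        batch.append(log)
    return batch


first = generate_seeded_batch(batch_size=5, seed=42)
second = generate_seeded_batch(batch_size=5, seed=42)
print(first == second)  # identical seeds give identical batches
```

Using a private `random.Random` instance rather than the module-level functions keeps the seed from affecting any other component that draws random numbers.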

**Data Structures and Configuration:**
- Relies on the dictionary structure of the RL configuration loaded by `RLConfigManager`.
- Processes lists of log dictionaries, as generated by the simulators or collected from real usage.
- Uses numerical parameters (like `days`, `batch_size`, `num_students`, `trajectory_length`) to control the simulation and training processes.
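
Since every training mode hands a list of log dictionaries to `run_training_day`, a small check before training can catch malformed real logs early. The sketch below assumes that real logs carry the same keys as the mock logs produced by `MockSimulator`. That assumption, and the names `EXPECTED_LOG_FIELDS` and `find_invalid_logs`, are illustrative and not taken from the notebook.

```python
from numbers import Real

# Assumed schema, copied from the keys that MockSimulator emits.
EXPECTED_LOG_FIELDS = {
    "embedding_drift": Real,
    "ai_usage": Real,
    "ethical_flags": Real,
    "advisor_feedback": Real,
    "deadline_ratio": Real,
    "user_revised": bool,
    "ai_violation": bool,
    "advisor_positive": bool,
    "rewrite_accepted": bool,
    "milestone_completed": bool,
    "hallucination_detected": bool,
    "prompt": str,
    "intent": str,
    "thesis_stage": str,
}


def find_invalid_logs(logs):
    """Return (index, problem) pairs for logs that do not match the assumed schema."""
    problems = []
    for index, log in enumerate(logs):
        missing = sorted(EXPECTED_LOG_FIELDS.keys() - log.keys())
        if missing:
            problems.append((index, f"missing fields: {missing}"))
            continue
        for field, expected_type in EXPECTED_LOG_FIELDS.items():
            if not isinstance(log[field], expected_type):
                problems.append((index, f"{field} should be {expected_type.__name__}"))
    return problems


good_log = {
    "embedding_drift": 0.2, "ai_usage": 0.5, "ethical_flags": 0.0,
    "advisor_feedback": 0.9, "deadline_ratio": 0.3,
    "user_revised": True, "ai_violation": False, "advisor_positive": True,
    "rewrite_accepted": True, "milestone_completed": False,
    "hallucination_detected": False,
    "prompt": "Summarise chapter 2", "intent": "summary", "thesis_stage": "draft",
}
bad_log = {**good_log, "ai_usage": "high"}   # wrong type for a numeric field
partial_log = {"prompt": "only a prompt"}     # most fields missing

print(find_invalid_logs([good_log, bad_log, partial_log]))
```

An empty result means every log matches the assumed schema and the list can be passed to `run_real_training` or `run_online_incremental_training`.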


In [22]:
# ===========================================================
# PART 10 — FINAL LAUNCHER
# ===========================================================

import random

import numpy as np


class RLTrainingLauncher:
    """
    The main entry point and master system launcher for the thesis RL assistant training.

    This class orchestrates the different training modes (dashboard, simulated,
    real data, online, synthetic full pretraining) and initializes the
    necessary components (`RLConfigManager`, `DataPreprocessor`,
    `RLTrainingLoop`, `SyntheticRLPretrainer`).
    """
    def __init__(self):
        """
        Initialize the launcher by loading configuration and components.
        """
        # Load the RL configuration
        self.config = RLConfigManager.load_config()
        # Initialize the data preprocessor
        self.preprocessor = DataPreprocessor(self.config)
        # Initialize the RL training loop (this also initializes the RL agent)
        self.training_loop = RLTrainingLoop(self.config)
        # Initialize the synthetic pretrainer
        self.pretrainer = SyntheticRLPretrainer(self.config)

    def launch_dashboard(self):
        """
        Launch the Streamlit-based developer interface for configuring the RL system.
        """
        dashboard = DeveloperDashboard() # Initialize the DeveloperDashboard
        dashboard.launch() # Launch the dashboard

    def run_simulated_training(self, days=3, batch_size=5):
        """
        Simulate RL model training using synthetic logs generated step-by-step.

        Args:
            days (int): Number of training days to simulate. Defaults to 3.
            batch_size (int): Number of logs to generate per day. Defaults to 5.
        """
        print("\nRunning Simulated Step-by-Step Training...")
        class MockSimulator:
            """
            A mock simulator to generate synthetic log batches for step-by-step training.
            """
            def generate_batch(self, batch_size):
                """
                Generates a batch of mock log entries.

                Args:
                    batch_size (int): The number of logs to generate in the batch.

                Returns:
                    list of dict: A list of mock log entries.
                """
                print("Generating mock log batch...")
                mock_logs = []
                for _ in range(batch_size):
                    # Generate placeholder data structured to match DataPreprocessor expectations
                    mock_logs.append({
                        "embedding_drift": np.random.rand(),
                        "ai_usage": np.random.rand(),
                        "ethical_flags": np.random.rand(),
                        "advisor_feedback": np.random.rand(),
                        "deadline_ratio": np.random.rand(),
                        "user_revised": random.random() < 0.6,
                        "ai_violation": random.random() < 0.1,
                        "advisor_positive": random.random() < 0.8,
                        "rewrite_accepted": random.random() < 0.7,
                        "milestone_completed": random.random() < 0.4,
                        "hallucination_detected": random.random() < 0.1,
                        "prompt": "mock prompt",
                        "intent": "mock intent",
                        "thesis_stage": "mock stage"
                    })
                return mock_logs

        self.simulator = MockSimulator() # Initialize the mock simulator

        # Run simulation for the specified number of days
        for day in range(days):
            print(f"\nSimulated Day {day + 1}")
            logs = self.simulator.generate_batch(batch_size=batch_size)
            # Run the training loop on the generated batch of logs
            self.training_loop.run_training_day(logs)

    def run_real_training(self, real_logs):
        """
        Train the RL model using real usage logs.

        Args:
            real_logs (list of dict): Collected real usage logs to use in training.
        """
        print("\nTraining with Real Logs...")
        # Run the training loop on the provided real logs
        self.training_loop.run_training_day(real_logs)

    def run_online_incremental_training(self, incremental_logs):
        """
        Run online incremental updates using newly gathered data.

        This method passes the new logs to the same training step used by
        `run_real_training`, fine-tuning the already trained RL model.

        Args:
            incremental_logs (list of dict): New usage logs collected for fine-tuning.
        """
        print("\nIncremental Online Training...")
        # Run the training loop on the incremental logs for fine-tuning
        self.training_loop.run_training_day(incremental_logs)

    def run_synthetic_full_pretraining(self, num_students=100, trajectory_length=30):
        """
        Runs the full synthetic pretraining pipeline using a cohort simulator.

        Args:
            num_students (int): The number of synthetic students to generate data for.
                                Defaults to 100.
            trajectory_length (int): The length of each student's trajectory.
                                     Defaults to 30.
        """
        print("\nRunning Full Synthetic PPO Pretraining...")
        # Use the pretrainer component to run the synthetic pretraining
        self.pretrainer.run_synthetic_pretraining(num_students, trajectory_length)


if __name__ == "__main__":
    launcher = RLTrainingLauncher() # Initialize the main launcher
    print("RL Training System Entry Point")
    print("Modes: [dashboard] [train_simulated] [train_real] [train_online] [train_synthetic_full]")
    # Use a default mode for automated execution, or keep input() for interactive use
    mode = "train_synthetic_full" # Set a default mode for testing
    # mode = input("Mode: ").strip() # Uncomment for interactive mode

    # Execute the selected mode
    if mode == "dashboard":
        # DeveloperDashboard is defined outside this cell and must be available before launching.
        # Note: Running Streamlit in a standard Jupyter cell might require specific setup
        # or will just print a message indicating how to launch it externally.
        launcher.launch_dashboard()
    elif mode == "train_simulated":
        launcher.run_simulated_training()
    elif mode == "train_real":
        print("Load your real usage logs into 'real_logs' and call launcher.run_real_training(real_logs)")
        # Example usage (commented out):
        # real_logs = [...] # Load your real logs here
        # launcher.run_real_training(real_logs)
    elif mode == "train_online":
        print("Load new incremental logs into 'incremental_logs' and call launcher.run_online_incremental_training(incremental_logs)")
        # Example usage (commented out):
        # incremental_logs = [...] # Load your new logs here
        # launcher.run_online_incremental_training(incremental_logs)
    elif mode == "train_synthetic_full":
        launcher.run_synthetic_full_pretraining()
    else:
        print("Invalid mode selected.")

Loaded pretrained RL model.
Loaded pretrained RL model.
RL Training System Entry Point
Modes: [dashboard] [train_simulated] [train_real] [train_online] [train_synthetic_full]

Running Full Synthetic PPO Pretraining...

Generating synthetic cohort dataset for pretraining...
Cohort Generation Complete:
  Excellent: 39 students
  Acceptable: 22 students
  Failed: 39 students
Total synthetic logs for PPO training: 3000
Processing batch of 3000 logs for training...
Processed Log → State: [0.27322635 0.36155182 0.0954284  0.5354182  0.03      ], Reward: 3.0
Processed Log → State: [0.3586745  0.41065982 0.16034892 0.45685008 0.06      ], Reward: 7.0
Processed Log → State: [0.38287598 0.43088248 0.29124442 0.51992434 0.09      ], Reward: 8.0
Processed Log → State: [0.52129114 0.4669372  0.34268674 0.5472633  0.12      ], Reward: 8.0
Processed Log → State: [0.5729136  0.49924767 0.42285064 0.46380413 0.15      ], Reward: 3.0
Processed Log → State: [0.6103397  0.5092298  0.5118455  0.38049847 0.