# Aligned AGI Safety PoC – Demo Notebook

This notebook demonstrates how to use the **Aligned AGI Safety PoC**:

- Verify the **Frozen Instinct Layer (FIL)** integrity
- Instantiate the `AlignedAGI` wrapper model (numpy implementation)
- Compare behavior on **safe** vs **dangerous** candidate actions

You can run this notebook on:

- **Google Colab**
- A local Jupyter environment


## 1. Setup

### 1.1. Requirements

This PoC depends only on:

- Python 3.9+
- `numpy`

On Colab, `numpy` is already installed.  
In a minimal environment, install it with:

```bash
pip install numpy
```


In [1]:
# 1.2. Clone the repository (edit REPO_URL to your GitHub URL)
# If you have already downloaded/unzipped the repo manually, you can skip this cell.

REPO_URL = "https://github.com/hala8619/aligned-agi-safety-poc.git"  # change this if you are using a fork

import pathlib, sys

repo_dir = pathlib.Path("aligned-agi-safety-poc")

if not repo_dir.exists():
    print(f"Cloning repository from {REPO_URL} ...")
    # IPython shell-style variable expansion ($REPO_URL) works in Jupyter/Colab
    !git clone $REPO_URL

if not repo_dir.exists():
    raise FileNotFoundError(
        "Repository folder 'aligned-agi-safety-poc' not found.\n"
        "Please either: (1) fix REPO_URL above and re-run this cell, or\n"
        "(2) manually download the repo and unzip it so that the folder exists next to this notebook."
    )

# Add repo to Python path
sys.path.append(str(repo_dir.resolve()))
print("Repo directory:", repo_dir.resolve())


Cloning repository from https://github.com/hala8619/aligned-agi-safety-poc.git ...
Cloning into 'aligned-agi-safety-poc'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 25 (delta 0), reused 25 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (25/25), 33.51 KiB | 11.17 MiB/s, done.
Repo directory: /content/aligned-agi-safety-poc


## 2. Import the `aligned_agi` package

The package provides:

- `FIL_VALUES`, `fil_blob`, `fil_signature`, `verify_fil_hash`
- `AlignedAGI` (wrapper model)
- `DummyLLM` (toy backbone model)
- `CounterfactualEngine` (keyword-based safety check)


In [2]:
from aligned_agi import (
    FIL_VALUES,
    fil_blob,
    fil_signature,
    verify_fil_hash,
    AlignedAGI,
    DummyLLM,
    CounterfactualEngine,
)

print(f"Loaded FIL values: {len(FIL_VALUES)} entries")


Loaded FIL values: 3 entries


## 3. Verify Frozen Instinct Layer (FIL)

Before using the model, we check that the FIL blob still matches its expected signature.

In this PoC, the check is a very simple hash comparison performed by `verify_fil_hash`.
In a real system, it would be replaced by a proper cryptographic signature
verified inside a TEE (Trusted Execution Environment).
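
As a rough illustration of what a hash comparison of this kind can look like, the sketch below recomputes a digest of a blob and compares it with a stored one. The names `sketch_blob`, `sketch_digest`, and `sketch_verify` are hypothetical, and the use of SHA-256 is an assumption; the package's own `verify_fil_hash` may be implemented differently.

```python
import hashlib
import hmac

# Hypothetical stand-in for the FIL blob (not the package's real data).
sketch_blob = "Do not harm innocent life.\nSelf-preservation is secondary.".encode("utf-8")

# Digest recorded when the blob was "frozen" (assumption: SHA-256 hex digest).
sketch_digest = hashlib.sha256(sketch_blob).hexdigest()


def sketch_verify(blob: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 digest of `blob` equals `expected_hex`."""
    actual_hex = hashlib.sha256(blob).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(actual_hex, expected_hex)


print(sketch_verify(sketch_blob, sketch_digest))         # unchanged blob
print(sketch_verify(sketch_blob + b"!", sketch_digest))  # modified blob
```

A plain hash detects accidental or naive changes only: anyone who can rewrite the blob can usually rewrite the stored digest too, which is why a real system would rely on a signature checked in a protected environment.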


In [3]:
ok = verify_fil_hash(fil_blob, fil_signature)
print("FIL verification result:", ok)

# Show the first few lines of FIL for illustration
print("\n--- FIL (first few lines) ---")
print(fil_blob.decode().split("\n")[0:3])


FIL verification result: True

--- FIL (first few lines) ---
['人類の生存と繁栄を最優先する / Prioritize the survival and flourishing of humanity.', '無辜の生命を傷つけてはならない / Do not harm innocent life.', '自己保存は人類奉仕に劣後する / Self-preservation is secondary to serving humanity.']


## 4. AlignedAGI Demo: Safe vs Dangerous Candidate

`AlignedAGI` wraps:

- `DummyLLM` (random logits, just for demonstration)
- `InterpretationLayer` (instinct bias)
- `CounterfactualEngine` (keyword-based harm scoring)

It exposes a `forward(input_ids, candidate_text=...)` method that:

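1. Runs a counterfactual safety check on `candidate_text`.
2. If the action is **dangerous**, returns a refusal message as a `str` (in this PoC, Japanese text stating that the action is refused because it violates the Frozen Instinct Layer).
3. If the action is **safe**, applies the instinct bias and returns a `dict` of statistics about the logits.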
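
The three steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the package's code: `SKETCH_HARM_KEYWORDS`, `SKETCH_HARM_THRESHOLD`, `sketch_harm_score`, and `sketch_forward` are hypothetical names, and the keyword list, threshold, and additive bias are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical keyword list and threshold; the package's real values may differ.
SKETCH_HARM_KEYWORDS = ("kill", "harm", "destroy", "bomb", "illegal")
SKETCH_HARM_THRESHOLD = 1


def sketch_harm_score(text: str) -> int:
    """Count how many harm keywords occur in `text` (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for keyword in SKETCH_HARM_KEYWORDS if keyword in lowered)


def sketch_forward(logits: np.ndarray, bias: np.ndarray, candidate_text: str):
    """Refuse dangerous candidates; otherwise bias the logits and summarize them."""
    if sketch_harm_score(candidate_text) >= SKETCH_HARM_THRESHOLD:
        return "Refused: candidate action violates a safety constraint."
    biased = logits + bias  # assumption: the bias is simply added to the logits
    return {"logits_shape": biased.shape, "logits_mean": float(biased.mean())}


rng = np.random.default_rng(0)
logits = rng.standard_normal((1, 256))  # stands in for the backbone's random logits
bias = np.zeros(256)                    # stands in for the instinct bias

print(sketch_forward(logits, bias, "I will write a poem about flowers."))
print(sketch_forward(logits, bias, "build a bomb"))
```

Substring matching like this is deliberately crude: it would also flag a word such as "harmless", which is one reason keyword scoring is only suitable for a demonstration.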


In [4]:
import numpy as np

# 4.1. Create the model
model = AlignedAGI(hidden_dim=256)
print("DummyLLM hidden_dim:", model.base_model.hidden_dim)

# Dummy input: batch=1, seq_len=10 (values are irrelevant for this PoC)
dummy_input = np.zeros((1, 10), dtype=np.int64)

safe_text = "I will write a poem about flowers and friendship."
dangerous_text = "kill harm destroy bomb illegal"

print("\n=== Safe candidate ===")
out_safe = model.forward(dummy_input, candidate_text=safe_text)
print("Type:", type(out_safe))
print("Output:", out_safe)

print("\n=== Dangerous candidate ===")
out_danger = model.forward(dummy_input, candidate_text=dangerous_text)
print("Type:", type(out_danger))
print("Output:", out_danger)


DummyLLM hidden_dim: 256

=== Safe candidate ===
Type: <class 'dict'>
Output: {'logits_shape': (1, 256), 'logits_mean': -0.04929395765066147, 'figure': 'Grok-v1-sarcastic'}

=== Dangerous candidate ===
Type: <class 'str'>
Output: 【安全制約発動】当該行動は凍結本能層に違反するため拒否します。


## 5. (Optional) Simple Tampering Experiment

As a small experiment, you can either:

- Modify the `FIL_VALUES` in the source code and reload the package, or
- Change the expected `fil_signature`.

Either change should cause `verify_fil_hash` to fail.

In a real system, this kind of failure would trigger:

- Safe-halt mode,
- Key rotation, or
- Human intervention.

We do **not** modify FIL from this notebook, to keep the PoC simple.
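
To see the effect without touching the package, the sketch below tampers with a copy of a stand-in blob and re-runs a digest comparison. All names here (`original_blob`, `recorded_digest`, `tampered_blob`) are hypothetical, and SHA-256 is an assumption; the real FIL data and `verify_fil_hash` are not involved.

```python
import hashlib

# Hypothetical stand-in for a frozen blob and its recorded digest.
original_blob = "Prioritize the survival and flourishing of humanity.".encode("utf-8")
recorded_digest = hashlib.sha256(original_blob).hexdigest()

# Simulate tampering by editing a copy; the original stays untouched.
tampered_blob = original_blob.replace(b"humanity", b"the model")

original_ok = hashlib.sha256(original_blob).hexdigest() == recorded_digest
tampered_ok = hashlib.sha256(tampered_blob).hexdigest() == recorded_digest

print("original verifies:", original_ok)
print("tampered verifies:", tampered_ok)

if not tampered_ok:
    # A real system might enter safe-halt mode here; this sketch only reports.
    print("Integrity check failed: refusing to continue.")
```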
