# 1. Goal

The goal of this notebook is to describe the design of G-interpreted intelligence.

G-interpreted Intelligence is a black-box mathematical model of intelligence: it identifies intelligence through behavior interpretation and intention inference.

# 2. G-interpreted Intelligence

Many subfields of AI try to solve the same problem: how to build an intelligent agent. The objective of this paper is to bridge the gap between engineering approaches from RL and other fields and the search for a logic behind general intelligence. In this paper, we embrace the natural idea described in reinforcement learning: the assumption that an intelligent system is an autonomous system pursuing optimization over its input space.

In this case, we are trying to optimize according to the reward signal.

On the other hand, this raises concerns about the interpretability of such a complicated system. One question behind the scenes is whether we are able to understand all the different presentations of intelligence. What is the relationship between the observer and the intelligence it observes in another entity?

## 2.1 G function and G value

Recall the objective function from Reinforcement Learning, where $\theta^*$ is the parameter of the optimal policy and $E_{\tau \sim p_\theta(\tau)}[\cdot]$ is the expectation over trajectories $\tau = \{a_1, o_1,..., a_n, o_n\}$, here taken of the cumulative reward:

\begin{equation}
\theta^* = \underset{\theta}{\operatorname{argmax}} E_{\tau \sim p_\theta(\tau)}[ \sum_{t}^{\infty} r(s_t, a_t)]
\end{equation}

In our case, instead of assuming there is always a single special signal which represents the incentive, we believe that an intelligent system can drive itself towards the maximization of a goal value which is transformed from a sequence of inputs.

Thus, we can construct a function $G(o)$ to maximize. The goal function can be seen as the intention of an intelligent system. The identification of intelligence is equivalent to the search for the intention.

**Assumption**

**Textual Definition**: An intelligent system is a system which attempts to maximize its goal value.

\begin{equation}
\theta^* = \underset{\theta}{\operatorname{argmax}} E_{\tau \sim p_\theta(\tau)}[G(o_t)]
\end{equation}

Because we don't know the underlying structure, any agent structure can serve us well if it meets the requirement we set up.

Unlike reinforcement learning or optimal control, we do not make assumptions on the behavior 

## 2.2 W value

To evaluate the effort of an agent following policy $\pi$ up to time $t$, we can construct a $W$ value as:

\begin{equation}
W_{\{s_0,G\}}(\pi, t) = E\big(G(o_t) \sim \{ s_0 \overset{\pi}{\rightarrow} s_t \} \big) \\
= G(o_t) \prod \delta (o_i | s_0, a) s_i
\end{equation}

$s_0$ - the initial state of the environment before the first observation $o_1$

$\pi$ - the observed agent defined by its policy $\pi$

Basically, the $W$ function tells us the expected $G$ value an agent $\pi$ can achieve from state $s_0$ after time $t$. We use the policy $\pi$ to identify different agents in order to keep our model independent of the underlying structure of the agent.
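As a rough illustration of the $W$ value, the following sketch estimates it by Monte Carlo rollouts. The chain environment, the functions `step`, `observe` and `G`, and both policies are hypothetical choices made only for this example; they are not part of the model.

```python
import random

def step(state, action):
    """Hypothetical deterministic chain: positions 0..10, actions -1 or +1."""
    return max(0, min(10, state + action))

def observe(state):
    """Hypothetical observation: the agent sees its position directly."""
    return state

def G(observation):
    """Hypothetical goal function: larger positions are better."""
    return float(observation)

def estimate_W(policy, s0, t, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of W_{s0,G}(policy, t): the mean of G(o_t)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        state = s0
        for _ in range(t):
            state = step(state, policy(observe(state), rng))
        total += G(observe(state))
    return total / n_rollouts

def always_right(observation, rng):
    """Deterministic policy that always moves right."""
    return +1

def random_policy(observation, rng):
    """Uniform random policy over the two actions."""
    return rng.choice([-1, +1])

w_right = estimate_W(always_right, s0=5, t=3)
w_random = estimate_W(random_policy, s0=5, t=3)
print(w_right, w_random)
```

Only the policy is passed to the estimator, which mirrors the idea that the agent is identified by its behavior rather than by its internal structure.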

## 2.3 Behavior equivalence

Behavior equivalence tells us how we can define an agent, that is, when two agents $\pi$ and $\pi^\prime$ count as the same agent.

Behavior equivalence = policy equivalence + learning equivalence

Agents $\pi$ and $\pi^\prime$ are identical if their policies agree, $\pi = \pi^\prime$, both before any observation and after any history $H_t$:

\begin{equation}
P(a_j | s_0, \pi) = P(a_j | s_0, \pi^\prime) \\
P(a_j | s_0, \pi, H_t) = P(a_j | s_0, \pi^\prime, H_t)\\
H_t = \{o_1,...,o_t\}
\end{equation}
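The two conditions can be checked by brute force when the observation set is finite. The sketch below assumes a policy is a function from a history tuple to a dictionary of action probabilities, with $s_0$ held fixed; this representation and the three example policies are hypothetical.

```python
from itertools import product

def behavior_equivalent(policy_a, policy_b, observations, max_len, tol=1e-9):
    """Compare action distributions on every history of length 0..max_len.

    Length 0 is the empty history (policy equivalence at s_0); longer
    histories test whether both agents react to experience in the same way.
    """
    for length in range(max_len + 1):
        for history in product(observations, repeat=length):
            pa, pb = policy_a(history), policy_b(history)
            for action in set(pa) | set(pb):
                if abs(pa.get(action, 0.0) - pb.get(action, 0.0)) > tol:
                    return False
    return True

def greedy(history):
    return {"right": 1.0}

def greedy_copy(history):
    return {"right": 1.0, "left": 0.0}

def learner(history):
    # Hypothetical learning agent: switches direction once it has observed 0.
    return {"left": 1.0} if 0 in history else {"right": 1.0}

print(behavior_equivalent(greedy, greedy_copy, [0, 1], 2))  # same behavior
print(behavior_equivalent(greedy, learner, [0, 1], 2))      # differs after history
print(behavior_equivalent(greedy, learner, [0, 1], 0))      # same before any history
```

The last two calls show why both conditions are needed: `greedy` and `learner` agree before any observation but diverge once a history is available.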



## 2.4 Zero-point

One rarely discussed question in AI is how we can tell that something is not intelligent. In other words, the zero-point of intelligence is rarely discussed. How can we define its minimum if intelligence is a single-dimensional metric?

The randomness of the agent needs to be discussed when we know the environment dynamics. But for now, let's assume we have no background knowledge of the environment or the agent at all, and use a simple agent following a random policy $R$.

Therefore, our baseline can be constructed as

\begin{equation}
W_{\{s_0,G\}}(R, t) = E\big(G(o_t) \sim \{ s_0 \overset{R}{\rightarrow} s_t \} \big).
\end{equation}
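For a small deterministic environment the baseline does not need sampling: the state distribution under a uniform random policy can be propagated forward exactly. The sketch below assumes a finite action set and a deterministic `step` function; the chain environment is a hypothetical example.

```python
def random_baseline(s0, t, actions, step, observe, G):
    """Exact W_{s0,G}(R, t) for a uniform random policy R.

    Assumes deterministic transitions, so the distribution over states
    after each step can be tracked in a dictionary.
    """
    dist = {s0: 1.0}
    for _ in range(t):
        next_dist = {}
        for state, prob in dist.items():
            for action in actions:
                nxt = step(state, action)
                next_dist[nxt] = next_dist.get(nxt, 0.0) + prob / len(actions)
        dist = next_dist
    return sum(prob * G(observe(state)) for state, prob in dist.items())

def chain_step(state, action):
    """Hypothetical chain: positions 0..10, actions -1 or +1."""
    return max(0, min(10, state + action))

baseline = random_baseline(5, 3, [-1, +1], chain_step, lambda s: s, float)
print(baseline)
```

On this symmetric chain the random walk has no drift, so the baseline equals the starting goal value.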

## 2.5 Definition

The definition of intelligence can now be written in a mathematical form. To tell whether a system is intelligent or not, we only need to find one explanation of the agent's behavior, that is, an arbitrary goal function that the behavior serves.

**Theory**. For an observation that starts at state $s_0$ and lasts for time $T$, the observed system $\Pi$ is **Intelligent** if there exists a Goal function $G$

\begin{equation}
\text{s.t.    } W_{\{s_0,G\}}(\Pi, T) - W_{\{s_0,G\}}(R, T) > \epsilon
\end{equation}


where $\epsilon$ is the acceptance constant.
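The existence condition ranges over all goal functions, which cannot be enumerated in general. A minimal sketch, assuming the search is restricted to a finite, hand-picked set of candidate goals whose $W$ values have already been estimated, could look as follows. The goal names and numbers are illustrative only.

```python
def explaining_goals(w_agent, w_baseline, epsilon):
    """Return candidate goals whose W-gap over the random baseline exceeds epsilon."""
    return [goal for goal in w_agent
            if w_agent[goal] - w_baseline[goal] > epsilon]

def is_intelligent(w_agent, w_baseline, epsilon):
    """The system counts as intelligent if at least one candidate goal explains it."""
    return len(explaining_goals(w_agent, w_baseline, epsilon)) > 0

# Illustrative W values for an agent that always moves right on a chain,
# under two hypothetical candidate goals.
w_agent = {"reach_right": 8.0, "reach_left": -8.0}
w_baseline = {"reach_right": 5.0, "reach_left": -5.0}

print(explaining_goals(w_agent, w_baseline, epsilon=0.5))
print(is_intelligent(w_agent, w_baseline, epsilon=0.5))
```

A negative result from such a finite search does not show that the system is unintelligent; it only shows that none of the candidates explains the behavior.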



## 2.6 I-value

In Section 2.5, we described the minimum requirement for a system to be intelligent. But the more interesting problem is how we can compare two separate systems, or systems under different **metrics**.

If we follow the same logic as for the W-value, we can define a comparable scalar of intelligence with respect to the tuple $\{s_0, T, G, R\}$ as:

\begin{equation}
I_{s_0, T, G, R}(\pi) = (\frac {d\, W_{s_0, G}(\pi, t)}{d\, t})^+ \approx \frac{1}{T} \big( W_{s_0,G}(\pi, T) - W_{s_0,G}(R, T) \big)^+
\end{equation}

We call this scalar the I-value and the tuple ${\{s_0, T, G, R\}}$ the **Metric**. We apply the $()^+$ function to ignore the negative part. Clearly, there are some problems with this over-simplified model. We will try to address some of these concerns in later sections, which attempt to generalize this model of intelligence.
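The finite-difference form of the I-value is a one-line computation once the two $W$ values are available. The function name and the example numbers below are illustrative assumptions.

```python
def i_value(w_agent, w_baseline, T):
    """I-value: positive part of the W-gap over the baseline, per unit time."""
    if T <= 0:
        raise ValueError("T must be positive")
    return max(0.0, (w_agent - w_baseline) / T)

print(i_value(8.0, 5.0, 3))  # agent beats the baseline
print(i_value(2.0, 5.0, 3))  # agent is worse than the baseline, clipped to zero
```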

## 2.6.1 I-value with normalization

If we know the optimal policy $\pi^*$, we can normalize our I-value using the optimal policy:

\begin{equation}
I_{s_0, T, G, R}^{norm}(\pi) = \frac{I_{s_0, T, G, R}(\pi)}{I_{s_0, T, G, R}(\pi^*)} \approx \frac{\big( W_{s_0,G}(\pi, T) - W_{s_0,G}(R, T) \big)^+}{\big( W_{s_0,G}(\pi^*, T) - W_{s_0,G}(R, T) \big)^+}
\end{equation}
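Because both I-values share the same $T$, the factor $1/T$ cancels in the ratio. The sketch below keeps the $()^+$ clipping of the I-value in the numerator and assumes the optimal policy strictly beats the baseline; the numbers are illustrative.

```python
def normalized_i_value(w_agent, w_optimal, w_baseline):
    """Ratio of the agent's I-value to the optimal policy's I-value."""
    optimal_gap = w_optimal - w_baseline
    if optimal_gap <= 0:
        raise ValueError("optimal policy must beat the baseline")
    return max(0.0, w_agent - w_baseline) / optimal_gap

print(normalized_i_value(6.5, 8.0, 5.0))
```

Under these assumptions the result lies in $[0, 1]$ whenever the agent does not outperform the supplied optimal policy.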

## 2.7 Negative Intelligence

Our model of intelligence allows negative values when the $()^+$ operator is not applied. Intuitively, a negative value tells us how intelligent a system is towards the opposite goal function.

# 3. Optimization heuristics - Task-oriented I-value, Partial Crediting and Prove-through-Reasoning

Obtaining the I-value through direct sampling can be tedious. This section introduces some heuristics that aim to make identifying the I-value easier.

## 3.1 Task-oriented I-value

*Insert graph*

The I-value introduced previously uses a strict time constraint $T$. In some cases, we may want to give the agent more time until it reaches a certain milestone. Therefore, we can rewrite the I-value formula as:

\begin{equation}
I_{s_0, s_1, G, R}(\pi) = \big( \frac{G(o_1) - G(o_0) - W_{s_0,G}(R, T)}{E(T | s_0 \overset{\pi}{\rightarrow} s_1)}  \big)^+
\end{equation}

$G(o_1) - G(o_0)$ - the **achievement** of the agent between states $s_0$ and $s_1$; it is a constant determined before the observation.

$W_{s_0,G}(R, T)$ - the baseline using a random policy for time $T$; this is a variable we need to sample.

The good part of task-oriented intelligence is that we only need to sample the baseline and the expected time for finishing the task. These two samplings can be done separately.

For ${s_0 \rightarrow s_1}$, the achievement is a constant once we have decided on the states.
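A minimal sketch of the task-oriented I-value follows the formula term by term: a fixed achievement, a separately sampled baseline, and an expected completion time estimated from observed runs. The function name and all numbers are illustrative assumptions.

```python
def task_oriented_i_value(g_start, g_goal, w_baseline, completion_times):
    """Task-oriented I-value for reaching s_1 from s_0.

    g_start, g_goal: G(o_0) and G(o_1), fixed before the observation.
    w_baseline: sampled random-policy baseline W_{s0,G}(R, T).
    completion_times: sampled times the agent needed to reach s_1.
    """
    if not completion_times:
        raise ValueError("need at least one completion time")
    expected_time = sum(completion_times) / len(completion_times)
    achievement = g_goal - g_start
    return max(0.0, (achievement - w_baseline) / expected_time)

print(task_oriented_i_value(5.0, 8.0, 1.0, [3, 4, 5]))
```

The baseline and the completion times enter as separate arguments, reflecting that the two quantities can be sampled independently.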

## 3.2 Partial Crediting

Partial crediting is a method of evaluating intelligence by breaking the trajectory into several parts. The idea is that a trajectory may require a fixed sequence of changes in the environment. No matter how intelligent a system is, if it wants to reach state B, it has to pass through state A. Therefore, we can evaluate the I-value by evaluating known state intervals $\{s_i, s'_i\}$.

One way to set the weights is to make each weight proportional to the timespan of its interval in the optimal trajectory.

Given a **metric** tuple ${\{s_0, S, T, G, R\}}$ and an agent following policy $\pi$,
\begin{equation}
I_{\{s_0, S, T, G, R\}} (\pi) = \sum_{i}^{\sum T_i < T} \beta_{i} I_{\{s_i, T_i|s_i \rightarrow s_{i+1}, G, R\}}
\end{equation}

1. $S$ - $\{(s_0, s_1),(s_1, s_2),...,(s_i, s_{i+1})\}$

2. $\beta_i$ - weight of each subproblem $(s_i, s_{i+1})$
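The weighted sum can be sketched as below, with the weights $\beta_i$ taken proportional to the time each interval occupies in the optimal trajectory. The sketch assumes every listed interval was completed within the budget $T$; the per-interval I-values and durations are illustrative.

```python
def partial_credit_i_value(segment_i_values, optimal_durations):
    """Weighted sum of per-interval I-values.

    Weights are the optimal-trajectory durations, normalized to sum to 1.
    """
    if len(segment_i_values) != len(optimal_durations):
        raise ValueError("one duration is needed per interval")
    total = sum(optimal_durations)
    if total <= 0:
        raise ValueError("durations must sum to a positive value")
    weights = [duration / total for duration in optimal_durations]
    return sum(w * i for w, i in zip(weights, segment_i_values))

# Two intervals: the second takes three times longer on the optimal trajectory.
print(partial_credit_i_value([1.0, 0.5], [1, 3]))
```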


## 3.3 Prove-through-Reasoning

Prove-through-Reasoning (**PtR**) evaluates the quality of the **action** instead of the outcome $s_t$.

Suppose we have a transition tuple $\delta = \{(s_1, a_1, s_2),...,(s_i, a_i, s_{i+1})\}$

\begin{equation}
I_{\{s_0, \delta, T, G, R\}} = {\sum_{i}{\delta (o_i | s_0, a) \, G(o_i)}}
\end{equation}

# 4. Generalization of Intelligence

Now we understand the relationship of intelligence between two agents with the same configuration. The next step is to generalize our model. The G-I model can be generalized in the following directions:

Generalization through sampling, given the same environment $\delta$

1. Learning: $\pi_0 \rightarrow \pi_t$

1. How can we compare $I$ and $I^\prime$?

Generalization through inference

2. Given $I_{\{s_0, T, G, R\}}$, $\delta$ and $\delta^\prime$, are we able to infer $I^\prime_{\{s_0, T, G, R\}}$? If so, what can we infer about $I^\prime$?

3. Are we able to construct an agent with policy $\pi$ and goal $G$, which performs well in any $s_0 \sim E$? If so, how?

4. How do we know if the G-metric is aligned with the actual goal function of the agent?

## 4.0 Principles of comparison

We can compare $I$ values with the same setting but different $\pi$:

\begin{equation}
I_{\{s_0, T, G, R\}}(\pi) \leftrightarrow I^\prime_{\{s_0, T, G, R\}}(\pi^\prime)
\end{equation}

Comparing $I_{\{s_0, T, G, R\}}(\pi)$ and $I_{\{s_0^\prime, T^\prime, G^\prime, R^\prime\}}^\prime (\pi)$ amounts to finding the intersection $I^\phi_{\{s_0^\phi, T^\phi, G^\phi, R^\phi\}} (\pi^\phi)$

where $\{s_0^\phi, T^\phi, G^\phi, R^\phi\} = \{s_0, T, G, R\} \cap \{s_0^\prime, T^\prime, G^\prime, R^\prime\}$

The principle of comparison is to compare intelligent systems under the same metric. While this seems to block our way to generality, we can try to find equivalences.

## 4.1 State equivalence - environment equivalence, knowledge equivalence

$s_i = s^\prime_i$ if $(s_i)^+ = (s_i^{\prime})^+$ and $(s_i)^- = (s_i^{\prime})^-$

## 4.1.1 Environment equivalence 

$(s_i)^+ = (s_i^{\prime})^+$ if the following recursive condition is satisfied:


\begin{equation}
    \begin{cases}
      G(o_i) = G(o_i^\prime) \\
      \forall a_j \in A, \delta({s_{i+1, j} | s_i, a_j}) = \delta^\prime({s^\prime_{i+1, j} | s_i^{\prime}, a_j}) \\
      (s_{i+1, j})^+ = (s_{i+1, j}^\prime)^{+} \\
    \end{cases}       
\end{equation}


We can terminate this infinite recursion with an end-of-time term, so $(s_\infty)^+ = (s_\infty^{\prime})^+$ if

\begin{equation}
      G(o_\infty) = G(o_\infty^\prime)
\end{equation}

(Need to prove?? let's do it)

Environment equivalence ensures that $W(\pi, T) = W^\prime(\pi^\prime, T)$ if $\pi = \pi^\prime$.

## T-step environment equivalence

Just like our general version of environment equivalence, T-step environment equivalence ensures that $W(\pi, t) = W^\prime(\pi^\prime, t)$ if $\pi = \pi^\prime$ as long as $t \leq T$
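For deterministic environments the recursive condition can be checked directly up to depth $T$. The sketch below assumes deterministic transitions, a shared finite action set, and goal values given as functions of the state; the three chain environments are hypothetical.

```python
def t_step_equivalent(s, s_prime, step, step_prime, g, g_prime, actions, T):
    """Recursive T-step environment equivalence for deterministic environments.

    The states must have equal goal values, and every action must lead to
    successor states that are (T-1)-step equivalent.
    """
    if g(s) != g_prime(s_prime):
        return False
    if T == 0:
        return True
    return all(
        t_step_equivalent(step(s, a), step_prime(s_prime, a),
                          step, step_prime, g, g_prime, actions, T - 1)
        for a in actions
    )

def chain(s, a):
    return max(0, min(10, s + a))

def shifted_chain(s, a):
    # Same dynamics, but states are labeled 100..110.
    return max(100, min(110, s + a))

def short_chain(s, a):
    # Same as chain, but the right wall is at position 6.
    return max(0, min(6, s + a))

actions = [-1, +1]
same = t_step_equivalent(5, 105, chain, shifted_chain,
                         float, lambda s: float(s - 100), actions, 3)
one_step = t_step_equivalent(5, 5, chain, short_chain, float, float, actions, 1)
two_step = t_step_equivalent(5, 5, chain, short_chain, float, float, actions, 2)
print(same, one_step, two_step)
```

The `short_chain` case shows the role of $T$: the two environments are indistinguishable for one step and differ only once the agent can reach the wall. The recursion visits every action sequence, so its cost grows exponentially with $T$.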

## Aggregation of state

We can aggregate two states with the same $G(o_t)$ as follows.

$s_i^1$ can be aggregated with $s_i^2$ if

\begin{equation}
    \begin{cases}
        G(o_i^1) = G(o_i^2)\\
         \forall a_j \in A, \delta({s_{i+1, j}^1 | s_i^1, a_j}) = \delta({s_{i+1, j}^2 | s_i^2, a_j}) \\
         s_{i+1}^1 \equiv s_{i+1}^2 \\
         \forall a_j \in A, \delta({s_{i}^2 | s_i^1, a_j}) = \delta({s_{i}^1 | s_i^2, a_j}) = 0
    \end{cases}
\end{equation}

## 4.1.2 Knowledge equivalence

We use $()^-$ as the knowledge equivalence operator


**Informal Definition** Two states $s_i$ and $s_i^\prime$ are knowledge equivalent if previous experience gives neither an advantage nor a disadvantage to their future achievement $W_{s_i,G}(\pi_0, T)$.

**Absolute knowledge equivalence**:

Given $\{\pi_0, \pi_{s_i}, \pi_0^\prime, \pi_{s_i^\prime}^\prime, G, T\}$, then $(s_i)^- = (s_i^\prime)^-$ if

\begin{equation}
 W_{s_i,G}(\pi_0, T) =  W_{s_i,G}(\pi_{s_i}, T)
\end{equation}
\begin{equation}
 W_{s_i^\prime,G}(\pi_0^\prime, T) =  W_{s_i^\prime,G}(\pi_{s_i^\prime}^\prime, T)
\end{equation}

**Relative knowledge equivalence**

Given $G$, $T$, $(s_i)^- = (s_i^\prime)^-$ if

\begin{equation}
W_{s_i,G}(\pi_0, T) - W_{s_i,G}(\pi_{s_i}, T) =  W_{s_i^\prime,G}(\pi_0^\prime, T) - W_{s_i^\prime,G}(\pi_{s_i^\prime}^\prime, T)
\end{equation}
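Both forms of knowledge equivalence compare the gap between two $W$ values per state. The sketch below reads $\pi_0$ as the agent without prior experience and $\pi_{s_i}$ as the agent with the experience accumulated up to $s_i$; this reading, the function names, and the numbers are assumptions for illustration.

```python
def knowledge_gap(w_fresh, w_experienced):
    """W under pi_0 minus W under pi_{s_i}, both evaluated from the same state."""
    return w_fresh - w_experienced

def absolutely_equivalent(gap, gap_prime, tol=1e-9):
    """Absolute form: experience changes nothing in either state."""
    return abs(gap) <= tol and abs(gap_prime) <= tol

def relatively_equivalent(gap, gap_prime, tol=1e-9):
    """Relative form: experience changes W by the same amount in both states."""
    return abs(gap - gap_prime) <= tol

gap = knowledge_gap(5.0, 6.0)        # state s_i
gap_prime = knowledge_gap(2.0, 3.0)  # state s_i'
print(absolutely_equivalent(gap, gap_prime))
print(relatively_equivalent(gap, gap_prime))
```

Absolute equivalence implies relative equivalence, but not the other way around, as the example shows.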

## 4.2 Baseline

When the baseline policies $R$ and $R'$ differ, we need to create a new baseline.

As long as the baselines are equal, the comparison of $I$ and $I^\prime$ should remain valid.

## 4.3 G

There are two $G$ functions involved in our comparison, $G$ and $G^\prime$.

The comparison between $I_{s_0, T, G, R}$ and $I_{s_0, T, G^\prime, R}^\prime$ can be converted into the problem of comparing $I_{s_0, T, G, R}$ and $I_{s_0, T, G^\prime, R}(\Pi)$.

# 5. Division of layer

In previous sections, we introduced an intelligence model assuming direct information exchange between the agent and its environment.


Roughly speaking, we can outline the capability of an autonomous system as follows:

Capability = Sensitivity $\times$ Intelligence $\times$ Strength

To express the above formula in a mathematical way, we create a separate layer between the environment and the kernel of the agent, which we name the exchange layer.

1. $O$ - the space of raw observation
2. $O^P$ - the space of perceivable observation
3. $A$ - the space of logic output
4. $A^T$ - the space of actions which affect the environment

We want to minimize the influence of **Sensitivity** and **Strength**, and calculate the intelligence based only on the logic part.

\begin{equation}
o_i^p = P(o_i)
\end{equation}

\begin{equation}
W_{\{s_0,\Pi,G\}}(t) = E\big(G(P_1(o_t)) \mid a_{i1}, P_1(o_{i1}), \dots, a_{it}, P_1(o_{it})\big).
\end{equation}

\begin{equation}
W_{\{s_0,\Pi,G\}}(t) = E\big(G(P_2(o_t)) \mid a_{j1}, P_2(o_{j1}), \dots, a_{jt}, P_2(o_{jt})\big).
\end{equation}
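The effect of the perception map $P$ can be illustrated with a deterministic rollout in which both the policy and the goal function only see $P(o)$. The chain environment, the two perception functions, and the policy are hypothetical, and the observation is assumed to equal the state.

```python
def final_goal_value(policy, perceive, G, s0, t, step):
    """Roll out a deterministic policy that acts on perceived observations
    and return G of the final perceived observation."""
    state = s0
    for _ in range(t):
        state = step(state, policy(perceive(state)))
    return G(perceive(state))

def chain_step(state, action):
    """Hypothetical chain: positions 0..10, actions -1 or +1."""
    return max(0, min(10, state + action))

def fine_perception(o):
    # P_1: the raw observation is perceived exactly.
    return o

def coarse_perception(o):
    # P_2: positions are perceived in buckets of width 5.
    return o // 5

def always_right(perceived):
    return +1

fine_value = final_goal_value(always_right, fine_perception, float, 5, 3, chain_step)
coarse_value = final_goal_value(always_right, coarse_perception, float, 5, 3, chain_step)
print(fine_value, coarse_value)
```

The same behavior in the same environment yields different goal values under the two perception maps, which is why the exchange layer has to be fixed before two agents are compared.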


The division between the logic layer and the exchange layer is basically determined by the viewpoint of the observer. Different divisions can be applied to the same (agent, environment) combination when the observation point changes. A human can be viewed as , or a single logic layer with certain assumptions.

## 5.1 Exchange layer

Now we can write down the balancing condition as:

**Textual Definition**: In environment $E$, two agents starting at state $s_0$ are considered **Balanced** if their **Sensitivity** and **Strength** are equivalent.

# 6. Environment

# 7. Long-term planning and non-Markovian environment

\begin{equation}
\theta^* = \underset{\theta}{\operatorname{argmax}} E_{\tau \sim p_\theta(\tau)}[G\big(C_t(o_t)\big)]
\end{equation}

where $o_1,...,o_t$ is the history

$G_{test}(o_t) = G_{true}(C_\tau(o_t))$

# 8. Algorithm

# 9. General Intelligence

\begin{equation}
I = \sum_{i}^{} \beta_i I_i
\end{equation}

decompose intelligence to 

eigen intelligence vector

# Appendix

## Terminology

1. $o_i$ - observation
2. $G(o_{t})$ - Goal function

## TO DO

Extend our definition of I to include more fancy stuff

Why does an intelligent system need to be optimal?

No, it does not need to be optimal at all. That's why intelligence can be different.

Upper bound and lower bound of intelligence??

Compare different systems with the same problem setting

# intent inference????

optimal policy?
importance sampling

Long term planning???

## Tree Graph

Tree with colored path, each subtree is a subcase from previous state using different action, the color of the path is the $G(o_t)$