# FIT5226 Project Stage 1

## Group 8
- **Members:**
  - Jeeeun Kim
  - Jinxu Tao
  - Xiaolong Shen
  - Zhihan Ye

---

### 1. Introduction
This project aims to develop a table-based Q-learning agent to solve a simple transport task in a grid world environment. The agent's goal is to pick up an item located at a random position (A) and deliver it to a fixed goal location (B) with the minimum number of steps possible.

---

### 2. Environment Setup
The environment is implemented as an `n x n` grid world. The agent and the item are placed at random locations within this grid, while the goal location is fixed at the bottom-right corner of the grid. The main components of the environment are as follows:

- **Grid Size (n):** Initially set to 5x5 but can be configured to other sizes.
- **Agent's Position:** Randomly initialized starting position within the grid.
- **Item Location (A):** Randomly placed at a different location from both the agent's starting position and the goal location.
- **Goal Location (B):** Fixed at the bottom-right corner of the grid (coordinate `(n-1, n-1)` with zero-based indexing).
- **State Space:** Defined by the agent's position (`n x n`), the item's position (`n x n`), and whether the agent is carrying the item (2).
- **Action Space:** 'n', 's', 'e', 'w' (north, south, east, west).
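
To make the size of the state space concrete, the following minimal sketch counts the states and state-action pairs implied by the description above. The function names `state_space_size` and `q_table_upper_bound` are illustrative and do not appear in the project code; the result is an upper bound, since some combinations (for example, the item sitting on the goal) are never generated.

```python
def state_space_size(n: int) -> int:
    """Number of distinct (agent_position, item_position, has_item) states."""
    positions = n * n                  # cells the agent or the item can occupy
    return positions * positions * 2   # 2 = carrying flag (True / False)


def q_table_upper_bound(n: int, num_actions: int = 4) -> int:
    """Upper bound on Q-table entries: one per state-action pair."""
    return state_space_size(n) * num_actions


if __name__ == "__main__":
    print(state_space_size(5))      # 1250
    print(q_table_upper_bound(5))   # 5000
```

For the default 5x5 grid this stays small enough for a plain dictionary-based Q-table.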

---

### 3. Reward Structure
The agent receives rewards and penalties based on its proximity to the item and the goal. The rewards are structured as follows:

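- **Basic Penalty:** -1 for every move, so that shorter paths score higher.
- **Item Pickup Reward:** While the agent is not carrying the item, +5 for a move that brings it closer to the item and -5 otherwise (including moves that leave the distance unchanged, such as bumping into a wall). This is added to the basic penalty, giving a net step reward of +4 or -6.
- **Goal Proximity Reward:** While the agent is carrying the item, +10 for a move that brings it closer to the goal and -10 otherwise, again added to the basic penalty (net +9 or -11).
- **Item Delivery Reward:** A flat +50 for delivering the item to the goal, and a flat -50 for stepping onto the goal without the item. These replace the step reward rather than adding to it.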
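
The rules above can be sketched as a single standalone function. This is a simplified illustration, not the project's `reward_function`: the name `step_reward` and its parameters are hypothetical, and it assumes the shaping bonus is added to the -1 move penalty while the goal rewards are returned as flat values.

```python
def step_reward(prev_dist, new_dist, carrying, at_goal):
    """Reward for one move under the shaped reward scheme.

    prev_dist / new_dist: Manhattan distance to the current objective
    (the item while not carrying, the goal while carrying) before and
    after the move. All parameter names are illustrative.
    """
    if at_goal:
        # Flat terminal bonus with the item, flat penalty without it
        return 50 if carrying else -50
    reward = -1                      # basic per-move penalty
    bonus = 10 if carrying else 5    # shaping magnitude depends on the phase
    reward += bonus if new_dist < prev_dist else -bonus
    return reward


if __name__ == "__main__":
    print(step_reward(3, 2, carrying=False, at_goal=False))  # 4
    print(step_reward(2, 2, carrying=False, at_goal=False))  # -6
    print(step_reward(4, 3, carrying=True, at_goal=False))   # 9
    print(step_reward(1, 0, carrying=True, at_goal=True))    # 50
```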

---

### 4. Q-Learning Algorithm

- **Bellman Equation for Q-Learning:** 
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)]
$$

- **Initialization:** Initialize the Q-table to 0 for all state-action pairs.
- **Policy Selection:** Use an epsilon-greedy policy to balance exploration and exploitation.
- **Q-Value Update:** Update Q-values using the Bellman equation based on rewards and estimates of future rewards.
- **Alpha Decay (Learning Rate Decay):** Gradually decrease the learning rate (α) to enable faster learning initially and stabilize later. The learning rate decreases over episodes.
- **Epsilon Decay (Exploration Rate Decay):** Gradually decrease the exploration rate (ϵ) to explore various actions initially and exploit the learned optimal policy later. The exploration rate decreases over episodes.
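
As a worked instance of the update rule above, the sketch below applies one Q-learning step to a dictionary-based Q-table. The state labels `"s0"` and `"s1"`, the helper name `q_update`, and the numeric values are made up for illustration only.

```python
from collections import defaultdict

ACTIONS = ["n", "s", "e", "w"]


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)     # Q-table initialised to 0 for every pair
q[("s1", "e")] = 10.0      # pretend the next state already has some value
err = q_update(q, "s0", "e", reward=4, next_state="s1", alpha=0.5, gamma=0.9)
print(err, q[("s0", "e")])  # 13.0 6.5
```

Here the TD target is 4 + 0.9 * 10 = 13, the old estimate is 0, and with α = 0.5 the estimate moves halfway to the target.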

---

### 5. Training Procedure
The agent is trained over a specified number of episodes. In each episode, the agent starts from a random position and attempts to complete the task. Q-values are updated after each step, and the agent's policy is refined to minimize the number of steps needed to complete the task.

Metrics such as total reward, steps per episode, Q-value convergence, and policy stability are tracked to evaluate the agent's learning progress.
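
A useful reference point for the steps-per-episode metric is the shortest possible episode. Assuming an obstacle-free grid with the four moves described earlier, that minimum is the Manhattan distance from the agent to the item plus the Manhattan distance from the item to the goal. The helper names below are illustrative and not part of the project code.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def optimal_steps(agent, item, goal):
    """Fewest moves to fetch the item and deliver it on an obstacle-free grid."""
    return manhattan(agent, item) + manhattan(item, goal)


agent, item, goal = (0, 0), (2, 3), (4, 4)
print(optimal_steps(agent, item, goal))  # 5 + 3 = 8
```

Comparing the recorded step count of an episode with this value shows how far the learned behaviour is from the shortest route for that particular start configuration.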

---

### 6. Evaluation
The trained agent is evaluated over multiple episodes using the following metrics:

1. **Total reward**
2. **Steps per episode**
3. **Exploration vs. Exploitation Ratio:** Frequency of exploring new actions versus utilizing known strategies.
4. **Learning Curve Stability:** Variance in rewards over recent episodes to assess learning consistency.
5. **Q-Value Convergence:** Tracking average Q-values to determine if the agent converges to an optimal policy.
6. **Policy Stability:** Frequency of policy changes as learning progresses.
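
One way to measure policy stability directly is to compare the greedy action of every known state before and after an episode. The sketch below is an illustrative alternative, not the calculation used in this notebook; the function names are hypothetical, and ties are broken deterministically (first action in the list) rather than at random.

```python
def greedy_action(q, state, actions):
    """Greedy action for a state; ties go to the first action in `actions`."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


def policy_change_rate(old_q, new_q, actions):
    """Fraction of states seen in either table whose greedy action differs."""
    states = {s for (s, _) in old_q} | {s for (s, _) in new_q}
    if not states:
        return 0.0
    changed = sum(
        greedy_action(old_q, s, actions) != greedy_action(new_q, s, actions)
        for s in states
    )
    return changed / len(states)


actions = ["n", "s", "e", "w"]
old_q = {("s0", "n"): 1.0, ("s0", "e"): 0.5, ("s1", "s"): 2.0}
new_q = {("s0", "n"): 1.0, ("s0", "e"): 1.5, ("s1", "s"): 2.5}
print(policy_change_rate(old_q, new_q, actions))  # 0.5
```

In this toy example the greedy action of `s0` switches from `n` to `e`, while `s1` keeps `s` even though its Q-value changed, so half of the states count as changed.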

---

### 7. Visualization
Learning and evaluation results are visualized using Matplotlib. Key visualizations include:

1. **Dynamic Animations**
2. **Graphs of Steps and Total Rewards:** Displaying steps taken and total rewards obtained per episode.
3. **Exploration vs. Exploitation:** Showing the balance of exploration and exploitation over time.
4. **Learning Curve Stability:** Visualizing reward variance over episodes.
5. **Q-Value Convergence vs. Policy Stability:** A combined graph showing the relationship between Q-value convergence and policy stability.
6. **Final Q-Table**

---

### 8. Conclusion
The training results support the following observations:

1. **Steps per Episode and Total Rewards:** As episodes progress, the number of steps required to complete the task noticeably decreases, indicating that the agent is learning to navigate the environment more efficiently. The total rewards fluctuate initially but stabilize over time, suggesting that the agent is improving its policy to maximize rewards and minimize penalties.

2. **Q-Value Convergence and Policy Stability:** By default, the learning rate starts high (`alpha=0.9`) to allow rapid early learning and is multiplied by `alpha_decay=0.995` after every episode, so Q-values change less and less as training progresses. The floor `alpha_min=0.01` lets the agent keep making small adjustments late in training while relying mostly on what it has already learned. (The example run at the end of this notebook overrides the default with `alpha=0.1`.) As a result, the Q-value convergence curve flattens gradually, which suggests that the value estimates are settling. Policy stability fluctuates early on and then steadies, indicating that the agent makes fewer changes as it approaches a stable strategy.

3. **Learning Curve Stability:** The variance in rewards per episode is an indicator of learning curve stability. High variance in the early stages indicates inconsistent performance as the agent explores different actions. Over time, the variance decreases significantly, demonstrating that the agent is learning consistently and stabilizing its performance.

4. **Exploration vs. Exploitation Ratio:** By default, the exploration rate starts high (`epsilon=0.99`), so the agent tries many state-action pairs early in training and gathers information about the environment. After each episode, epsilon is multiplied by `epsilon_decay=0.995` until it reaches the floor `epsilon_min=0.01`. (The example run at the end of this notebook overrides these with `epsilon=1` and `epsilon_decay=0.9`.) The floor keeps a small amount of exploration alive while the agent acts on its learned Q-values most of the time, so later episodes mostly repeat effective actions instead of trying new ones.
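
To see how long these multiplicative decay schedules take to reach their floors, the following sketch solves `start * decay**k <= floor` for the smallest whole number of episodes `k`. The helper name `episodes_to_floor` is illustrative; the parameter values are the ones quoted in this report and in the example run.

```python
import math


def episodes_to_floor(start, decay, floor):
    """Smallest k with start * decay**k <= floor (multiplicative decay)."""
    if start <= floor:
        return 0
    return math.ceil(math.log(floor / start) / math.log(decay))


print(episodes_to_floor(0.99, 0.995, 0.01))  # epsilon, default settings
print(episodes_to_floor(0.9, 0.995, 0.01))   # alpha, default settings
print(episodes_to_floor(1.0, 0.9, 0.01))     # epsilon in the example run
```

A decay rate of 0.995 needs on the order of nine hundred episodes to reach the floor, whereas 0.9 gets there within a few dozen, so the choice of decay rate should be read together with the number of training episodes.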


---

In [7]:
import numpy as np
import random
import matplotlib
matplotlib.use('TkAgg')

import matplotlib.pyplot as plt
from matplotlib.widgets import Slider, Button
from matplotlib.patches import Patch

import threading

## Environment


In [8]:
''' 
Grid class:
    1. define the grid, set the grid size as 5 for demonstration
    2. set the position of agent and item; the flag of carrying
'''

class Grid:

    '''
    the initial attributes for Grid obj
    n: grid size
    target position: fixed at the bottom-right cell (n-1, n-1)
    reset(): set the position of agent and item, and the state of carrying

    ''' 
    def __init__(self, n=5):
        self.n = n
        self.target_position = (self.n-1, self.n-1)
        self.reset()


    '''
    generate a random position in the grid
    '''
    def reset_position(self):
        y, x = random.randint(0, self.n-1), random.randint(0, self.n-1)
        return y, x
        

    '''
    set the position of agent and item; 
    set the flag of has_item and at_target

    agent's position can't be same as item's position & 
    item's position can't be same as target's position

    '''
    def reset(self):
        self.agent_position = self.reset_position()
        self.item_position = self.reset_position()
        self.has_item = False
        self.at_target = False

        # Ensure the item is not placed at the agent's initial position or the target position
        while self.agent_position == self.item_position or self.item_position == self.target_position:
            self.item_position = self.reset_position()


        # previous distance between agent and item
        self.perv_dist_item = self._dist(self.agent_position, self.item_position)
        
        # previous distance between agent and target
        self.perv_dist_target = self._dist(self.agent_position, self.target_position)


        return self.get_state()
        

    '''
    the document says that 
        the agent knows its position, 
        the item's position and 
        item's state

    get the current state of agent:
        agent position
        item position
        has_item
    '''
    def get_state(self):
        return (self.agent_position, self.item_position, self.has_item)
        


    '''
    calculate the distance between position_1 and position_2 (agent and target/item)
    '''
    def _dist(self, loc1, loc2):
        return abs(loc1[0] - loc2[0]) + abs(loc1[1] - loc2[1])


    '''
    check whether the agent has arrived at the target with the item
    '''
    def is_at_target(self):
        return self.agent_position == self.target_position and self.has_item


    '''
    self.has_item = False (initial)

    refresh both stored distances after every step, so that the first
    comparison after pickup uses the previous step's distance to the
    target rather than the stale value from reset()
    '''
    def update_distances(self):
        self.perv_dist_item = self._dist(self.agent_position, self.item_position)
        self.perv_dist_target = self._dist(self.agent_position, self.target_position)



## Reward Structure

Rewards are learned with a one-step temporal-difference (TD) method: each update looks one step ahead.


In [9]:
'''
agent class:
    agent's movement
    reward principle 
    Q-learning process
    
'''
class Agent:
    '''
    parameters of q-learning
    grid
    agent action
    q-table
    '''
    def __init__(self, grid, actions, alpha=0.9, gamma=0.95, epsilon=0.99,
                 epsilon_decay=0.995, epsilon_min=0.01,
                 alpha_min=0.01, alpha_decay=0.995):
        self.grid = grid
        self.gamma = gamma


        # exploration parameters, decayed in update_parameters()
        # epsilon - probability of choosing a random (exploratory) action
        self.epsilon = epsilon
        # epsilon decay - multiplicative decay rate applied to epsilon each episode
        self.epsilon_decay = epsilon_decay
        # epsilon min - the minimum value of epsilon
        self.epsilon_min = epsilon_min

        # alpha - is used when updating the q value - learning rate
        self.alpha = alpha
        self.alpha_decay = alpha_decay
        self.alpha_min = alpha_min


        
        self.actions = actions  # ['n', 's', 'e', 'w']
        self.action_dir = {'n': (-1, 0), 's': (1, 0), 'e': (0, 1), 'w': (0, -1)}  # Move directions
        self.q_table = {}

    '''
    obtain the q-value
    '''
    def get_q_value(self, state, action):
        return self.q_table.get((state, action), 0.0)

    '''
    update the q-value of state S with action A
    '''
    def update_q_value(self, state, action, value):
        self.q_table[(state, action)] = value

    '''
    when a uniform random number is below epsilon, the agent explores and chooses an action at random

    else: 
        calculate the q-values of each action at state S
        find the best q-value
        find the best action at state S
    '''
    def choose_action(self, state):
        # exploring 
        if np.random.uniform(0, 1) < self.epsilon:
            return random.choice(self.actions)
        else:
            # q-values for all actions at state S
            q_values = [self.get_q_value(state, action) for action in self.actions]

            # the max q-value within these actions
            max_q_value = np.max(q_values)

            # based on the best q-value, find the best action
            best_actions = [action for action, q_value in zip(self.actions, q_values) if q_value == max_q_value]

            return random.choice(best_actions)


    '''
    We use the Q-learning temporal-difference (TD) update

    Q(S,A) = Q(S,A) + alpha * (Reward + gamma * max_a' Q(S',a') - Q(S,A))
    
    
    Features: 
        TD has low variance, some bias
        TD converges to V(s) (with value tables)
        TD converges faster than MC(Monte-Carlo)
        TD is more sensitive to initial value
    '''
    def update_q_table(self, state, action, reward, next_state):
        q_value = self.get_q_value(state, action)
        next_q_values = [self.get_q_value(next_state, next_action) for next_action in self.actions]
        max_next_q_value = max(next_q_values)

        # TD target = reward + discount factor * max Q-value of the next state
        td_target = reward + self.gamma * max_next_q_value
        # TD error = TD target - Q(S_t, A_t)
        td_error = td_target - q_value

        # calculate the new value of q at state S with action A
        new_q_value = q_value + self.alpha * td_error
        self.update_q_value(state, action, new_q_value)
    
    

    '''
    how the agent moves with the input action
    '''
    def move(self, action):
        # how the position of agent changes with the input action
        direction = self.action_dir[action]
        # how far the agent moves on x and y
        dy, dx = direction
        # the position cannot exceed the boundary 
        new_position = (min(max(self.grid.agent_position[0] + dy, 0), self.grid.n-1), 
                        min(max(self.grid.agent_position[1] + dx, 0), self.grid.n-1)) 

        # Update agent position and handle item pickup
        self.grid.agent_position = new_position
        if self.grid.agent_position == self.grid.item_position and not self.grid.has_item:
            self.grid.has_item = True



    '''
    define the reward rules 
        1. general: reward = -1 for every step the agent takes
        2. agent arrives at the target: 
            with the item (+50)
            without the item (-50)
        3. distance shaping, depending on whether the item has been picked up
        
    '''
    def reward_function(self):
        reward = -1

        # agent arrives at the target with item
        if self.grid.is_at_target():
            return 50
        
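        # agent arrives at the target without the item
        elif self.grid.agent_position == self.grid.target_position:
            # refresh stored distances before the early return so the next
            # step is not compared against a stale value
            self.grid.update_distances()
            return -50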

        # the agent has not picked the item
        if not self.grid.has_item:
            new_dist_item = self.grid._dist(self.grid.agent_position, self.grid.item_position)
            if new_dist_item < self.grid.perv_dist_item:
                reward += 5
            else:
                reward -= 5
        
        # the agent is carrying the item
        else:
            new_dist_target = self.grid._dist(self.grid.agent_position, self.grid.target_position)
            if new_dist_target < self.grid.perv_dist_target:
                reward += 10
            else:
                reward -= 10

        self.grid.update_distances()
        return reward
    

    '''
    update the epsilon and alpha with decay rate
    '''
    def update_parameters(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        self.alpha = max(self.alpha_min, self.alpha * self.alpha_decay)


    '''
    used for testing and visualizing the q table
    '''
    def print_q_table(self):
        for key in sorted(self.q_table.keys(), key=lambda x: str(x[0])):
            state, action = key
            print(f"State: {state}, Action: {action}, Q-value: {self.q_table[key]}")



## Training and Testing

In [10]:
class AgentTrainer:
    def __init__(self, grid, agent, episodes=1000, reward_smoothing_window=10, step_smoothing_window=10):
        self.grid = grid
        self.agent = agent
        self.episodes = episodes
        self.reward_smoothing_window = reward_smoothing_window
        self.step_smoothing_window = step_smoothing_window

        # Metrics storage
        self.step_counts = []
        self.total_rewards = []
        self.q_value_convergence = []
        self.policy_stability = []
        self.exploration_ratios = []
        self.learning_curve_stability = []

        # Recent metrics for smoothing
        self.recent_rewards = []
        self.recent_steps = []

    def train(self):
        for episode in range(self.episodes):
            state = self.grid.reset()
            done = False
            total_reward = 0
            steps = 0
            old_q_table = self.agent.q_table.copy()
            exploration_count = 0

            while not done:
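                # Decide explore vs. exploit once, so the counter reflects the
                # action actually taken (same epsilon-greedy rule as choose_action)
                if np.random.uniform(0, 1) < self.agent.epsilon:
                    exploration_count += 1
                    action = random.choice(self.agent.actions)
                else:
                    q_values = [self.agent.get_q_value(state, a) for a in self.agent.actions]
                    best = max(q_values)
                    action = random.choice([a for a, q in zip(self.agent.actions, q_values) if q == best])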
                self.agent.move(action)
                reward = self.agent.reward_function()

                next_state = self.grid.get_state()
                done = self.grid.is_at_target()
                self.agent.update_q_table(state, action, reward, next_state)
                state = next_state

                total_reward += reward
                steps += 1

            # Store metrics after the episode
            self.exploration_ratios.append(self.calculate_exploration_ratio(exploration_count, steps))
            self.total_rewards.append(self.calculate_smoothed_reward(total_reward))
            self.step_counts.append(self.calculate_smoothed_steps(steps))
            self.learning_curve_stability.append(self.calculate_learning_curve_stability())
            self.q_value_convergence.append(self.calculate_q_value_convergence())
            self.policy_stability.append(self.calculate_policy_stability(old_q_table))

            # Update parameters for smoother convergence
            self.agent.update_parameters()

        return self.step_counts, self.total_rewards, self.q_value_convergence, self.policy_stability, self.exploration_ratios, self.learning_curve_stability

    # Calculate exploration vs exploitation ratio
    def calculate_exploration_ratio(self, exploration_count, steps):
        return exploration_count / steps if steps > 0 else 0

    # Calculate smoothed reward
    def calculate_smoothed_reward(self, total_reward):
        self.recent_rewards.append(total_reward)
        if len(self.recent_rewards) > self.reward_smoothing_window:
            self.recent_rewards.pop(0)  # Maintain a fixed-size window
        return np.mean(self.recent_rewards)

    # Calculate smoothed steps
    def calculate_smoothed_steps(self, steps):
        self.recent_steps.append(steps)
        if len(self.recent_steps) > self.step_smoothing_window:
            self.recent_steps.pop(0)  # Maintain a fixed-size window
        return np.mean(self.recent_steps)

    # Calculate learning curve stability (variance of rewards)
    def calculate_learning_curve_stability(self):
        return np.var(self.recent_rewards)

    # Calculate Q-value convergence
    def calculate_q_value_convergence(self):
        q_values = list(self.agent.q_table.values())
        return np.mean(q_values)

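    # Calculate policy stability as the fraction of Q-table entries whose value
    # changed during the episode (a proxy: it does not compare greedy actions,
    # and lower values mean a more stable table)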
    def calculate_policy_stability(self, old_q_table):
        policy_changes = sum(
            1 for key in self.agent.q_table if key in old_q_table and self.agent.q_table[key] != old_q_table[key]
        )
        return policy_changes / (len(self.agent.q_table) if self.agent.q_table else 1)

    # Plot steps and rewards together
    def plot_reward_and_steps(self):
        fig, ax1 = plt.subplots(figsize=(10, 6))

        # Plot steps on the left Y-axis
        ax1.set_xlabel('Episode')
        ax1.set_ylabel('Steps', color='red')
        ax1.plot(range(len(self.step_counts)), self.step_counts, color='red', label='Steps')
        ax1.tick_params(axis='y', labelcolor='red')

        # Plot rewards on the right Y-axis
        ax2 = ax1.twinx()
        ax2.set_ylabel('Total Reward', color='blue')
        ax2.plot(range(len(self.total_rewards)), self.total_rewards, color='blue', label='Total Reward')
        ax2.tick_params(axis='y', labelcolor='blue')

        # Title and layout
        plt.title('Steps and Total Rewards per Episode')
        fig.tight_layout()

        # Show plot
        plt.show()

    # Plot exploration vs exploitation ratio
    def plot_exploration_vs_exploitation(self):
        fig, ax = plt.subplots(figsize=(10, 6))
        ax.plot(range(len(self.exploration_ratios)), self.exploration_ratios, color='purple')
        ax.set_xlabel('Episode')
        ax.set_ylabel('Exploration vs Exploitation Ratio')
        ax.set_title('Exploration vs Exploitation Over Episodes')
        plt.show()

    # Plot learning curve stability
    def plot_learning_curve_stability(self):
        fig, ax = plt.subplots(figsize=(10, 6))
        ax.plot(range(len(self.learning_curve_stability)), self.learning_curve_stability, color='blue')
        ax.set_xlabel('Episode')
        ax.set_ylabel('Learning Curve Stability (Variance of Rewards)')
        ax.set_title('Learning Curve Stability Over Episodes')
        plt.show()

    # Plot Q-value convergence vs policy stability
    def plot_q_convergence_vs_policy_stability(self):
        fig, ax1 = plt.subplots(figsize=(10, 6))

        # Plot Q-value convergence on the left Y-axis
        ax1.set_xlabel('Episode')
        ax1.set_ylabel('Q-value Convergence', color='green')
        ax1.plot(range(len(self.q_value_convergence)), self.q_value_convergence, color='green', label='Q-value Convergence')
        ax1.tick_params(axis='y', labelcolor='green')

        # Plot policy stability on the right Y-axis
        ax2 = ax1.twinx()
        ax2.set_ylabel('Policy Stability', color='red')
        ax2.plot(range(len(self.policy_stability)), self.policy_stability, color='red', label='Policy Stability')
        ax2.tick_params(axis='y', labelcolor='red')

        # Title and layout
        plt.title('Q-value Convergence and Policy Stability per Episode')
        fig.tight_layout()

        # Show plot
        plt.show()

    # Display the final Q-table
    def display_q_table(self):
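        # Collapse the Q-table onto the agent position only. Entries that differ
        # in item position or carrying flag overwrite each other, so each cell
        # shows the last value written for that agent position and action.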
        q_table_array = np.zeros((self.grid.n, self.grid.n, len(self.agent.actions)))
        
        for (state, action), q_value in self.agent.q_table.items():
            agent_pos, item_pos, has_item = state
            action_index = self.agent.actions.index(action)
            q_table_array[agent_pos[0], agent_pos[1], action_index] = q_value
        
        # Print Q-values for each state
        for i in range(self.grid.n):
            for j in range(self.grid.n):
                print(f"State ({i}, {j}): ", q_table_array[i, j])
        print("\nActions: ", self.agent.actions)    


## Visualization

In [11]:
def simulation(grid, agent, episodes=1000):
    global stop, fig, ax, mat, bnext, bstart, bstop, binit, sspeed, speed, time
    stop = True
    speed = 1.0
    time = 0

    def stopAnim(event):
        global stop
        stop = True

    def startAnim(event):
        global stop
        stop = False
        animate()

    def advance(event):
        global time, stop
        time += 1
        state = grid.get_state()
        action = agent.choose_action(state)
        agent.move(action)
        reward = agent.reward_function()
        next_state = grid.get_state()
        agent.update_q_table(state, action, reward, next_state)
        mat.set_data(render_grid(grid))
        plt.title(f't = {time}')
        plt.draw()

        # Check if agent has reached the target with the item
        if grid.has_item and grid.agent_position == grid.target_position:
            print("Agent reached the target with the item!")
            stop = True  # Stop the animation

    def initAnim(event):
        global time
        time = 0
        grid.reset()
        mat.set_data(render_grid(grid))
        plt.title(f't = {time}')
        plt.draw()

    def updateSpeed(val):
        global speed
        speed = 1 / sspeed.val

    def render_grid(grid):
        grid_display = np.zeros((grid.n, grid.n))
        ay, ax = grid.agent_position
        iy, ix = grid.item_position
        ty, tx = grid.target_position
        grid_display[ay, ax] = 1  # Agent
        grid_display[iy, ix] = 2  # Item
        grid_display[ty, tx] = 3  # Target
        if grid.has_item:
            grid_display[ay, ax] = 4  # Agent with item
        return grid_display

    def animate():
        global stop
        advance(None)
        if not stop:
            threading.Timer(speed, animate).start()

    fig, ax = plt.subplots()
    ax.axis('off')
    plt.title("GridWorld Q-learning Simulation")

    # Add legend
    legend_elements = [
        Patch(facecolor='blue', label='Agent'),
        Patch(facecolor='green', label='Item'),
        Patch(facecolor='yellow', label='Target'),
    ]
    ax.legend(handles=legend_elements, loc='upper right')

    axspeed = plt.axes([0.175, 0.05, 0.65, 0.03])
    sspeed = Slider(axspeed, 'Speed', 0.1, 10.0, valinit=1.0)
    sspeed.on_changed(updateSpeed)

    axnext = plt.axes([0.85, 0.15, 0.1, 0.075])
    axstart = plt.axes([0.85, 0.25, 0.1, 0.075])
    axstop = plt.axes([0.85, 0.35, 0.1, 0.075])
    axinit = plt.axes([0.85, 0.45, 0.1, 0.075])
    bnext = Button(axnext, 'Next')
    bnext.on_clicked(advance)
    bstart = Button(axstart, 'Start')
    bstart.on_clicked(startAnim)
    bstop = Button(axstop, 'Stop')
    bstop.on_clicked(stopAnim)
    binit = Button(axinit, 'Init')
    binit.on_clicked(initAnim)

    mat = ax.matshow(render_grid(grid), cmap=plt.cm.viridis)
    initAnim(None)
    plt.show()

## Example Execution

In [12]:
if __name__ == "__main__":
    # Initialize the grid and agent
    grid = Grid(n=5)
    agent = Agent(actions=['n', 's', 'e', 'w'], alpha=0.1, gamma=0.95, grid=grid, epsilon=1, epsilon_decay=0.9)
    
    # Initialize the trainer
    trainer = AgentTrainer(grid, agent, episodes=500)
    
    # Train the agent
    trainer.train()

    # Run the dynamic simulation
    simulation(grid, agent)
    
    # Plotting reward and steps on the same graph
    trainer.plot_reward_and_steps()

    # Plotting other metrics
    trainer.plot_exploration_vs_exploitation()
    trainer.plot_learning_curve_stability()
    trainer.plot_q_convergence_vs_policy_stability()

    # Display the final Q-table
    trainer.display_q_table()




State (0, 0):  [-0.12433649  0.04654218  0.16406426 -0.07456806]
State (0, 1):  [-0.06        0.20088081  0.47570478 -0.06      ]
State (0, 2):  [-0.20525348  0.23249948 -0.20080653  0.72918935]
State (0, 3):  [-0.09677265 -0.19513267 -0.20080653  0.05233552]
State (0, 4):  [-0.11707808  0.07805205 -0.11707808  0.26899031]
State (1, 0):  [-0.07612603  0.15748557  0.04       -0.13471914]
State (1, 1):  [-0.06        0.04794888  0.07932063 -0.35178659]
State (1, 2):  [-0.10957918  0.28553312  0.85398554  0.16989507]
State (1, 3):  [-0.4070374   0.30622577 -0.10120106 -0.41846627]
State (1, 4):  [ 0.11401653  0.16904924 -0.23033486  0.08335179]
State (2, 0):  [-0.2285379  -0.23483494  0.04319334 -0.24096388]
State (2, 1):  [ 0.03116514 -0.28126094 -0.30015671 -0.28586433]
State (2, 2):  [0.26966233 0.12356236 1.00941097 0.43823223]
State (2, 3):  [-0.11043487 -0.11021867 -0.34080068  0.08923928]
State (2, 4):  [ 0.0884723  -0.09872923  0.04776408  0.74339309]
State (3, 0):  [ 0.07711181 -