# Analysis of the Performance of a Q-learning Agent in the Iterated Prisoner's Dilemma #

## 1. Introduction ##
In this notebook I explore the behavior of a Q-learning agent in an Iterated Prisoner's Dilemma (IPD) environment. The analysis covers:
- How the agent learns to maximize its rewards
- How it interacts with different opponent strategies
- The impact of key parameters on the agent's performance

## 2. Configuration ##

We have to set up the environment and the agent before starting the analysis. The next cell defines the key parameters and creates the required instances.



In [1]:
import sys
import os

# Add the 'src' folder (relative to the repository root) to the import path
sys.path.append(os.path.abspath('./src'))

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from agent import QL_Agent
from ipd_env import IteratedPrisonersDilemmaEnv
from strats import tft, stft, gtft, imptft, always_defect, always_cooperate

#Initial Parameters
num_rounds = 200  #Rounds per episode
episodes = 500  #Number of episodes
alpha = 0.1  #Learning rate
gamma = 0.9  #Discount factor
epsilon = 0.1  #Exploration rate

#Initialize the environment and the agent
env = IteratedPrisonersDilemmaEnv(num_rounds=num_rounds, opponent_strategies=[tft])
agent = QL_Agent(env.action_space, env.observation_space, alpha=alpha, gamma=gamma, epsilon=epsilon)

#Save some metrics
total_rewards = []
cooperation_rates = []
cooperation_counts = []
defection_counts = []
cooperation_points = []
defection_points = []


## 3. Simulation ##

In this section we train the agent in the IPD environment, implemented with Gymnasium.
- The agent chooses its actions following an `epsilon-greedy` policy.
- After every round, the agent updates its Q-table based on the reward it received.

**Key Parameters:**
- `epsilon`: controls exploration (random actions) vs. exploitation (using the learned policy).
- `alpha`: learning rate.
- `gamma`: discount factor to value future rewards.
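
To make the role of these three parameters concrete, here is a minimal sketch of an epsilon-greedy choice and a single tabular Q-learning update. It is not the code of `QL_Agent`, whose internals are not shown in this notebook: the function names `epsilon_greedy` and `q_update`, the 2 x 2 toy table, and the reward value are assumptions made only for illustration.

```python
import numpy as np


def epsilon_greedy(q_table, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state]))


def q_update(q_table, state, action, reward, next_state, done, alpha, gamma):
    """Apply one tabular Q-learning update in place."""
    # Once the episode has ended there is no future value to bootstrap from.
    if done:
        target = reward
    else:
        target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])


# Toy table (hypothetical): 2 states x 2 actions (0 = cooperate, 1 = defect).
q = np.zeros((2, 2))
rng = np.random.default_rng(0)

# One update with an arbitrary illustrative reward of 3.0.
q_update(q, state=0, action=0, reward=3.0, next_state=1, done=False,
         alpha=0.1, gamma=0.9)

# With epsilon = 0 the choice is purely greedy.
greedy_action = epsilon_greedy(q, state=0, epsilon=0.0, rng=rng)
```

A larger `alpha` moves the stored value faster towards the new target, while a larger `gamma` gives more weight to the estimated value of the next state.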

We track the following metrics:
- **Accumulated reward:** measures the agent's overall performance.
- **Cooperation rate:** indicates how often the agent decides to cooperate.


In [2]:
for episode in range(episodes):
    state = env.reset()
    total_reward = 0
    cooperations = 0
    defections = 0

    for _ in range(num_rounds):
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        
        total_reward += reward
        if action == 0:  # Action "Cooperate"
            cooperations += 1
            cooperation_points.append((episode, len(cooperation_points) + 1))
        else:  # Action "Defect"
            defections += 1
            defection_points.append((episode, len(defection_points) + 1))
        if done:
            break
        state = next_state

    #Taking metrics
    total_rewards.append(total_reward)
    cooperation_rates.append(cooperations / num_rounds)
    cooperation_counts.append(cooperations)
    defection_counts.append(defections)

## 4. Performance analysis ##

### 4.1 Cumulative Reward Evolution ###
The following plot shows how the agent's cumulative reward evolves across episodes.
This helps us see whether the agent is learning to optimize its decisions.
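
Per-episode rewards can be noisy because of exploration, so a smoothed curve may make the trend easier to read. The helper below is a sketch of a simple moving average. The name `moving_average` and the default window of 20 episodes are assumptions, not part of the original analysis.

```python
import numpy as np


def moving_average(values, window=20):
    """Return the simple moving average of `values` over `window` points."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    # mode="valid" keeps only the positions where the full window fits,
    # so the result has len(values) - window + 1 entries.
    return np.convolve(values, kernel, mode="valid")


# Small illustrative input (not real training data).
smoothed = moving_average([1, 2, 3, 4, 5], window=2)
```

The result of `moving_average(total_rewards)` could be drawn on the same axes as the raw curve with another `plt.plot` call.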

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(total_rewards, label='Cumulative Reward', color='blue')
plt.xlabel('Episodes')
plt.ylabel('Total Reward')
plt.title('Evolution of Cumulative Reward')
plt.legend()
plt.grid()
plt.show()

### 4.2 Cooperation rate through time ###
We analyze the proportion of times that the agent decides to cooperate in each episode.
This is useful to understand if the agent is adopting a "generous" behaviour or a "greedy" one.

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(cooperation_rates, label='Cooperation rate', color='green')
plt.xlabel('Episodes')
plt.ylabel('Cooperation frequency')
plt.title('Evolution of cooperation rate')
plt.legend()
plt.grid()
plt.show()

### 4.3 Cooperation vs. Defection Actions ###

In [None]:
cooperation_x, cooperation_y = zip(*cooperation_points)
defection_x, defection_y = zip(*defection_points)

plt.figure(figsize=(12, 6))

plt.scatter(cooperation_x, cooperation_y, color='blue', label='Cooperations', alpha=0.6)

plt.scatter(defection_x, defection_y, color='red', label='Defections', alpha=0.6)

plt.title('Accumulation of Cooperations and Defections Over Episodes')
plt.xlabel('Episode')
plt.ylabel('Cumulative Count')
plt.legend()
plt.grid(alpha=0.5)
plt.show()

## 5. Visualization ##

### 5.1 Q-table visualization ###
The agent stores its knowledge in a **Q-table**, which contains a value for every (state, action) pair.
Each value is the expected reward if the agent takes a specific action from a given state.

The following heatmap shows the Q-values at the end of training.
This gives us an idea of which action the agent prefers in each situation.
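
The preferred action in each state can also be read directly as the column with the highest Q-value in that row. The sketch below assumes the Q-table is a 2-D array with one row per state and one column per action, which matches the axis labels used for the heatmap. The helper `greedy_policy` and the example values are hypothetical.

```python
import numpy as np

# Action encoding used in this notebook: 0 = cooperate, 1 = defect.
ACTION_NAMES = {0: "Cooperate", 1: "Defect"}


def greedy_policy(q_table):
    """Return the name of the highest-valued action for every state (row)."""
    q_table = np.asarray(q_table)
    best_actions = np.argmax(q_table, axis=1)  # ties resolve to the first action
    return [ACTION_NAMES[int(a)] for a in best_actions]


# Made-up Q-values for two states, only to show the output format.
example_q = np.array([[2.0, 1.0],
                      [0.5, 4.0]])
policy = greedy_policy(example_q)
```

Under the stated assumption, `greedy_policy(agent.q_table)` would list the agent's preferred action for every state.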

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(agent.q_table, annot=True, cmap='coolwarm', cbar=True, fmt='.2f')
plt.title('Final Q-table')
plt.xlabel('Action (0=Coop, 1=Defect)')
plt.ylabel('State')
plt.show()

### 5.2 Opponent Analysis ###

Here we compare how the agent performs against different opponent strategies: 

- **Tit-for-Tat:** Cooperates on the first round, then imitates the agent's previous move.
- **Suspicious Tit-for-Tat:** Defects on the first round, then imitates the agent's previous move.
- **Generous Tit-for-Tat:** Cooperates on the first round and whenever the agent cooperated. After a defection, it still cooperates with a certain probability.
- **Imperfect Tit-for-Tat:** Imitates the agent's last move with high (but less than one) probability.
- **Always Cooperate**
- **Always Defect**

This lets us see whether the agent adjusts its behaviour to the opponent's strategy.
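
As a rough illustration of the Tit-for-Tat variants described above, the sketch below implements them as functions of the agent's move history. These are not the functions from the `strats` module, whose signatures are not shown here. The function names, the `forgiveness` and `fidelity` parameters and their default values, and the cooperative first move of the imperfect variant are all assumptions.

```python
import random

C, D = 0, 1  # 0 = cooperate, 1 = defect, as in the rest of the notebook


def tit_for_tat(agent_history):
    """Cooperate first, then copy the agent's previous move."""
    return C if not agent_history else agent_history[-1]


def suspicious_tit_for_tat(agent_history):
    """Defect first, then copy the agent's previous move."""
    return D if not agent_history else agent_history[-1]


def generous_tit_for_tat(agent_history, forgiveness=0.1, rng=random):
    """Like Tit-for-Tat, but forgive a defection with probability `forgiveness`."""
    if not agent_history or agent_history[-1] == C:
        return C
    return C if rng.random() < forgiveness else D


def imperfect_tit_for_tat(agent_history, fidelity=0.9, rng=random):
    """Copy the agent's previous move with probability `fidelity`, else flip it."""
    if not agent_history:
        return C  # assumed first move; the description above does not specify it
    last = agent_history[-1]
    return last if rng.random() < fidelity else 1 - last
```

For example, `tit_for_tat([C, D])` answers the agent's latest defection with a defection, whereas `generous_tit_for_tat([C, D])` occasionally answers it with cooperation.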


In [None]:
opponent_strategies = {
    "Tit-for-Tat": tft,
    "Suspicious Tit-for-Tat": stft,
    "Generous Tit-for-Tat": gtft,
    "Imperfect Tit-for-Tat": imptft,
    "Always Defect": always_defect,
    "Always Cooperate": always_cooperate
}

results = {}
for strategy_name, strategy in opponent_strategies.items():
    env.opponent_strategies = [strategy]
    rewards = []
    for _ in range(episodes):
        state = env.reset()
        total_reward = 0
        for _ in range(num_rounds):
            action = agent.choose_action(state)
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
            state = next_state
        rewards.append(total_reward)
    results[strategy_name] = np.mean(rewards)

#Visualize results
plt.figure(figsize=(12, 8))
plt.bar(results.keys(), results.values(), color=['blue', 'red', 'green', 'yellow', 'purple', 'brown'])
plt.title('Agent performance vs. different strategies')
plt.xlabel('Opponent Strategies')
plt.ylabel('Average reward')
plt.xticks(rotation=45, ha='right')
plt.grid()
plt.show()

In [None]:
episode_rewards = {}

for strategy_name, strategy in opponent_strategies.items():
    env.opponent_strategies = [strategy]
    rewards = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for _ in range(num_rounds):
            action = agent.choose_action(state)
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
            state = next_state
        rewards.append(total_reward)
    episode_rewards[strategy_name] = rewards  #Save rewards per episode

#Visualize results in a scatter plot
plt.figure(figsize=(12, 8))

#Add each strategy data points
for strategy_name, rewards in episode_rewards.items():
    plt.scatter(range(1, episodes + 1), rewards, label=strategy_name, alpha=0.6, s=50)

plt.title('Reward evolution per episode vs different strategies')
plt.xlabel('Episode')
plt.ylabel('Cumulative reward')
plt.legend(title="Strategies")
plt.grid(alpha=0.5)
plt.show()


## 6. Overview ##

