In [1]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

Hi everyone! In our last lecture, we built ***YouBandit***, an agent that learned the single best thumbnail to show on YouTube. 

However, this approach has a glaring weakness: it's a **one-size-fits-all** model. It assumes that the best thumbnail is the same for your gamer little brother as it is for your grandmother. We are essentially assuming that every person likes the same thing. There is no personalisation.

What if the best thumbnail to choose actually depends on **who is watching**? *YouBandit* has no way of observing or storing such information, so it never learns anything about the individual user and can never pick the thumbnail best suited to the person watching.

Today, we therefore introduce the concept of <span style="color:blue">state</span>. The best action now depends on the situation.

So, we've introduced a new word, but what does it actually mean? It is one of the most important ideas in Reinforcement Learning, so it is worth spending a couple of minutes to build some intuition.

> <span style="color:blue">state</span> is the information used to determine what happens next

You can think of state as a *summary of all the history* we have seen so far.

Our new problem: We're personalising a YouTube channel's homepage.

- **State**: Is the user a new viewer or a subscriber?
- **Actions**: Should we show them a clickbaity viral thing, an AI tutorial, or a boring politics debate?

Our <span style="color:gold">goal</span> now is to show the right video to the right user. This kind of problem is known as a "Contextual Bandit". Solving it will force us to evolve our agent's brain. Let's go!

##### Environment: the YouTube home page

In [2]:
class HomepageEnv:
    def __init__(self):
        self.contexts = ["New viewer", "Subscriber"]
        self.actions = ["🔥 Viral thing", "🤖 AI tutorial", "💤 Boring politics debate"]
        self.num_contexts = len(self.contexts)
        self.num_actions = len(self.actions)

    def get_reward(self, context_idx, action_idx):
        # New viewers (context 0) are attracted by viral videos
        if context_idx == 0: # the index for a new viewer
            true_probs = [0.6, 0.1, 0.05] # gives the size of a spinning wheel, here "viral thing" has 60% of the space 
            return 1 if np.random.rand() < true_probs[action_idx] else 0 # like spinning a wheel that can land anywhere from 0.00 to 0.99
        # Subscribers (context 1) want deeper content
        elif context_idx == 1:
            true_probs = [0.1, 0.5, 0.7] # The subscribers are thoughtful and use YouTube a bit differently, they want boring debates
            return 1 if np.random.rand() < true_probs[action_idx] else 0

# Initialize environment
env = HomepageEnv()

In [3]:
env.actions

['🔥 Viral thing', '🤖 AI tutorial', '💤 Boring politics debate']

In [4]:
env.contexts

['New viewer', 'Subscriber']

So now we have created our new world, which has the concept of a state. The *hidden secrets* (the true click probabilities) also differ depending on which user is watching.

In [5]:
# We will now store our beliefs in a 2D table: (number of states x number of actions). here: 2x3 table
probs = np.zeros((env.num_contexts, env.num_actions))
probs.shape

(2, 3)


#### YouBandit: the old way without context to establish a baseline

Let's first prove that the old way is broken. We'll take our *YouBandit* from the last lecture, and see how it performs. This agent is *context-blind*. It doesn't know if a user is new or a subscriber; it just sees an action and a reward.

In [22]:
belief_table = np.zeros(env.num_actions)
action_counts = np.zeros(env.num_actions)

epsilon = 0.1
num_steps = 2000
history_blind_bandit = []

for step in range(num_steps):
    # A user arrives, they have a context...
    context_idx = np.random.randint(0, env.num_contexts)
    
    # ...but our agent ignores it! It just uses its single belief table.
    if np.random.rand() < epsilon:
        action_idx = np.random.randint(0, env.num_actions)
    else:
        # If multiple actions have the same max belief, this breaks ties randomly
        best_action = np.argmax(belief_table)
        action_idx = np.random.choice(np.where(belief_table == belief_table[best_action])[0])
        
    reward = env.get_reward(context_idx, action_idx)
    history_blind_bandit.append(reward)
    
    action_counts[action_idx] += 1 # increment the count
    alpha = 1 / action_counts[action_idx] # calculate the 'dynamic learning rate' with the sample average

    # Update our table using the incremental average approach we used in the last lecture
    belief_table[action_idx] += alpha * (reward - belief_table[action_idx])     

print("--- YouBandit ---")
print("Final CTR estimates (averaged over all users):")
for i, belief in enumerate(belief_table):
    print(f" - '{env.actions[i]}': {belief:.2%}")

avg_reward = np.mean(history_blind_bandit)
print(f"\nAverage CTR achieved: {avg_reward:.2%}")

# The optimal policy shows the viral thing to new viewers (0.6) and the politics debate to subscribers (0.7): (0.6 + 0.7) / 2 = 65%
print("Theoretical maximum average CTR (if we were smart): 65.00%")

--- YouBandit ---
Final CTR estimates (averaged over all users):
 - 'Viral thing': 34.51%
 - 'AI tutorial': 23.08%
 - 'Boring politics debate': 30.56%

Average CTR achieved: 33.85%
Theoretical maximum average CTR (if we were smart): 65.00%


Our simple, state-ignoring agent is failing. Its belief table blends both user types into one muddy average and is nowhere near optimal. It settled on the viral thing as the best on average, so it shows this regardless of the user. As a result, we keep annoying our loyal subscribers, who click the viral thing only 10% of the time.
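
To put numbers on "nowhere near optimal", here is a minimal sketch that computes the best CTR a context-blind agent could ever reach and compares it with the context-aware ceiling. It assumes the two user types arrive with equal probability (as in our loop) and copies the true probabilities from `HomepageEnv.get_reward`; the variable names are only illustrative.

```python
import numpy as np

# True click probabilities copied from HomepageEnv.get_reward
# rows: New viewer, Subscriber; columns: viral thing, AI tutorial, politics debate
true_probs = np.array([[0.6, 0.1, 0.05],
                       [0.1, 0.5, 0.7]])

# Both user types are equally likely, so a context-blind agent effectively
# sees the column-wise average of the two rows.
blind_ctr_per_action = true_probs.mean(axis=0)
best_blind_ctr = blind_ctr_per_action.max()

# A context-aware agent can pick the best action separately in each row.
best_contextual_ctr = true_probs.max(axis=1).mean()

print("Blind CTR per action:", blind_ctr_per_action)
print(f"Best a blind agent can do: {best_blind_ctr:.2%}")
print(f"Best a context-aware agent can do: {best_contextual_ctr:.2%}")
```

Under these assumptions the blind averages are 35%, 30% and 37.5%, so even a perfectly informed blind agent tops out at 37.5%. The viral thing and the politics debate are so close on average that a noisy run can easily lock onto either one, and both are far below the 65% that becomes reachable once the agent can see the context.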

So how do we approach this?



In [23]:
# use 2D table to represent the beliefs
belief_table_2d = np.zeros((env.num_contexts,env.num_actions))

In [24]:
belief_table_2d # each row now corresponds to a context, so the first row is 'New viewer'

array([[0., 0., 0.],
       [0., 0., 0.]])

In [25]:
# need the same for the counts
action_counts_2d = np.zeros((env.num_contexts,env.num_actions))
action_counts # we can look at the counts of the YouBandit. It thought the viral thing was the best to show, so it showed it to everyone

array([1814.,   78.,  108.])


##### PersonalBandit


In [26]:
belief_table_2d = np.zeros((env.num_contexts,env.num_actions))
action_counts_2d = np.zeros((env.num_contexts,env.num_actions))

history_aware_bandit = []

for step in range(num_steps):
    # A user arrives with a context
    context_idx = np.random.randint(0, env.num_contexts) # pick out one context at random

    # ... and then we run our epsilon-greedy strategy, looking up the beliefs for the user type that has arrived
    if np.random.rand() < epsilon:
        action_idx = np.random.randint(0, env.num_actions) # with probability ε, pick a random thumbnail to show
    else:
        action_idx = np.argmax(belief_table_2d[context_idx]) # else, choose the best thumbnail according to our beliefs

    reward = env.get_reward(context_idx, action_idx) # pass the thumbnail we want to show to the environment and get the reward back
    history_aware_bandit.append((context_idx, action_idx, reward)) # remember which user came (context_idx), what we showed (action_idx) and whether they clicked (reward)

    # ------------- Update our beliefs ---------------------
    action_counts_2d[context_idx, action_idx] += 1 # increment the count for this (user type, action) pair
    belief_table_2d[context_idx, action_idx] += (1/action_counts_2d[context_idx, action_idx]) * (reward - belief_table_2d[context_idx, action_idx])


print("--- PersonalBandit ---")
print("\nFinal beliefs for new viewers (row 0):")
for i, belief in enumerate(belief_table_2d[0]):
    print(f" - '{env.actions[i]}': {belief:.2%}")

print("\nFinal beliefs for subscribers (row 1):")
for i, belief in enumerate(belief_table_2d[1]):
    print(f" - '{env.actions[i]}': {belief:.2%}")
    
avg_reward_aware = np.mean([example[2] for example in history_aware_bandit])
print(f"\nAverage CTR achieved: {avg_reward_aware:.2%}")    

--- PersonalBandit ---

Final beliefs for new viewers (row 0):
 - 'Viral thing': 60.63%
 - 'AI tutorial': 3.45%
 - 'Boring politics debate': 12.50%

Final beliefs for subscribers (row 1):
 - 'Viral thing': 8.33%
 - 'AI tutorial': 48.28%
 - 'Boring politics debate': 71.37%

Average CTR achieved: 62.85%


WOHOOO! This is *way better* than the roughly 34% we got with *YouBandit*, which didn't know anything about states, users or the context of the problem.

By simply giving our agent a separate row of beliefs for each state, it learned an almost perfect personalised strategy, and actually showed content that the user clicked on.
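
Why 62.85% and not the full 65%? A minimal sketch, assuming the final beliefs printed above and the true probabilities from `HomepageEnv.get_reward`: it reads the greedy policy out of the table and estimates the CTR we should expect when the agent still explores with probability ε. The variable names are illustrative.

```python
import numpy as np

actions = ["Viral thing", "AI tutorial", "Boring politics debate"]
contexts = ["New viewer", "Subscriber"]

# Final beliefs copied from the PersonalBandit output above
learned_beliefs = np.array([[0.6063, 0.0345, 0.1250],
                            [0.0833, 0.4828, 0.7137]])
# True click probabilities from HomepageEnv.get_reward
true_probs = np.array([[0.6, 0.1, 0.05],
                       [0.1, 0.5, 0.7]])
epsilon = 0.1

# Greedy policy: the best-looking action in each row of the belief table
greedy_actions = learned_beliefs.argmax(axis=1)
for ctx, a in zip(contexts, greedy_actions):
    print(f"{ctx}: show '{actions[a]}'")

# Expected CTR of epsilon-greedy around that policy (both contexts equally likely):
# with probability 1 - epsilon we act greedily, with probability epsilon uniformly at random
greedy_ctr = true_probs[np.arange(len(contexts)), greedy_actions].mean()
random_ctr = true_probs.mean()
expected_ctr = (1 - epsilon) * greedy_ctr + epsilon * random_ctr
print(f"Expected CTR with epsilon={epsilon}: {expected_ctr:.2%}")
```

Under these assumptions the expected CTR comes out just under 62%, so the gap to 65% is mostly the price of continued exploration rather than a sign that the agent learned the wrong policy.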

In [27]:
# Let's see the history
action_counts_2d  # see that we show a lot more 'viral things' to the new users and mostly 'boring politics debates' to subscribers

array([[927.,  29.,  24.],
       [ 48.,  29., 943.]])

In [28]:
# Let's take a look at which users came, what we showed and whether they clicked or not
for context_idx, action_idx, reward in history_aware_bandit[5:15]:
    print(f"User: {env.contexts[context_idx]} showed up, we showed: {env.actions[action_idx]} and got reward: {reward}")

User: New viewer showed up, we showed: Viral thing and got reward: 0
User: Subscriber showed up, we showed: Viral thing and got reward: 0
User: New viewer showed up, we showed: Viral thing and got reward: 0
User: New viewer showed up, we showed: Viral thing and got reward: 1
User: New viewer showed up, we showed: Viral thing and got reward: 0
User: New viewer showed up, we showed: Viral thing and got reward: 0
User: Subscriber showed up, we showed: Viral thing and got reward: 0
User: New viewer showed up, we showed: Viral thing and got reward: 1
User: Subscriber showed up, we showed: Viral thing and got reward: 0
User: Subscriber showed up, we showed: Viral thing and got reward: 0



#### Bonus: feeling the pain of an exploding table

We are excited, and want even more context. So let's personalise with *everything* we can think of: *device type*, *country* and *time of day*. Notice that what we are doing now is <span style="color:purple">growing</span> our <span style="color:blue">state</span> space.

Exercise: How many rows do we need now?

In [29]:
num_user_types = 2
num_devices = 3     # assume mobile, desktop, tablet
num_countries = 20  # just assume 20 countries
num_time_of_day = 3 # assume that we use morning, afternoon, evening
num_actions = env.num_actions

In [30]:
# total unique contexts is the product of all context variables
total_unique_contexts = num_user_types * num_devices * num_countries * num_time_of_day
print(f"The total unique contexts are now: {total_unique_contexts}")

The total unique contexts are now: 360


In [33]:
table_shape = (total_unique_contexts, num_actions)
print(f"Our belief table would need the shape: {table_shape}")

Our belief table would need the shape: (360, 3)
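
A table with 360 rows also needs a rule for which row belongs to which user. Here is a minimal sketch of one way to do that, using `itertools.product` to enumerate every combination and `np.ravel_multi_index` to flatten a four-part context into a single row index. The category labels and the helper `context_to_row` are hypothetical and only chosen to match the sizes above.

```python
import itertools
import numpy as np

# Hypothetical category labels, chosen only to match the sizes above
user_types = ["New viewer", "Subscriber"]
devices = ["mobile", "desktop", "tablet"]
countries = [f"country_{i}" for i in range(20)]
times_of_day = ["morning", "afternoon", "evening"]

dims = (len(user_types), len(devices), len(countries), len(times_of_day))

# Every combination of the four variables is one context, i.e. one row
all_contexts = list(itertools.product(*(range(n) for n in dims)))
print(f"Number of contexts: {len(all_contexts)}")

def context_to_row(user_idx, device_idx, country_idx, time_idx):
    """Flatten a four-part context into a single row index of the belief table."""
    return int(np.ravel_multi_index((user_idx, device_idx, country_idx, time_idx), dims))

# Subscriber (1) on a tablet (2) in country_5 (5) in the evening (2)
row = context_to_row(1, 2, 5, 2)
print(f"Row index for that context: {row}")
```

Every extra context variable multiplies the number of rows, which is exactly why the table grows so quickly.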


This simple extension reveals two lurking <span style="color:red">problems</span>:
1. Data sparsity: We need thousands of views to get a good estimate for just **one row** (e.g., subscribers on a tablet in France in the evening). It could take years to show enough thumbnails to reliably fill out all 360 rows. Most of our rows could remain empty or have really unreliable estimates :(
2. No generalisation: Let's say our agent learns that 'Mobile users in the US' love viral videos. That knowledge is stuck in that one specific row of the table. It does not *leak* out to help when a 'Tablet user in the US' shows up for the first time. The agent can't see this pattern. It has to learn everything from scratch for every single context.
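
To get a feel for the data sparsity problem, here is a rough sketch that spreads the same 2000 steps we used earlier over the 360 x 3 table and counts how often each cell gets visited. It makes the simplifying assumption that contexts and actions are both drawn uniformly at random, which real traffic and an epsilon-greedy agent would not do, so treat it as an illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

num_contexts, num_actions, num_steps = 360, 3, 2000

# Simplifying assumption: contexts and actions are both drawn uniformly at random
context_ids = rng.integers(0, num_contexts, size=num_steps)
action_ids = rng.integers(0, num_actions, size=num_steps)

# Count how many times each (context, action) cell was visited
visit_counts = np.zeros((num_contexts, num_actions), dtype=int)
np.add.at(visit_counts, (context_ids, action_ids), 1)

print(f"Average visits per cell: {visit_counts.mean():.2f}")
print(f"Cells never visited: {(visit_counts == 0).sum()} of {visit_counts.size}")
print(f"Cells with at least 30 visits: {(visit_counts >= 30).sum()}")
```

With fewer than two visits per cell on average, most estimates in the table would rest on one or two clicks, or on none at all.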

---
This is the wall. Our simple, intuitive solution of creating **a separate belief table for every situation collapses** under *real-world complexity*.

The <span style="color:red">core problem</span> is that our agent's brain works by <span style="color:Violet">memorisation</span>. What we need is something that can generalise. We don't want to store separate values for *here* and *one millimeter to the right*. The values we get from these two positions *should be pretty similar*, and we need something that understands that. In other words, we need something that understands the relationship between context and results. A brain that could learn that the feature 'Device=Mobile' generally leads to a preference for shorter content, regardless of the country or the time of day, would be more like what we want.

In the next lecture we will see how to replace our giant table with machine learning (ML).


#### Bonus 2: Gym API 


So far, I have deliberately kept this very high level and hands-on, with very simple examples. Now I want to show you the first step to becoming an RL researcher, by showing you a paper released in 2016 which introduced a toolkit for reinforcement learning.

[OpenAI Gym paper](https://arxiv.org/pdf/1606.01540)


OpenAI Gym, which lives on today as [Gymnasium](https://gymnasium.farama.org/index.html) (maintained by the Farama Foundation), is a toolkit for reinforcement learning. As the paper describes it, Gym includes a collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms.


##### Mapping our simple homepage environment to the Gymnasium API
Let's create a clear side-by-side comparison:

| Gymnasium API (`env.step`) | Our `HomepageEnv` | Explanation |
|---|---|---|
| `observation` (next state) | Doesn't exist | Because our episode ends after one step, there is no "next state" to transition to. |
| `reward` | The 1 or 0 from `get_reward()` | This is a direct one-to-one mapping. This is the core feedback signal. |
| `terminated` | Always `True` | Our "episode" always terminates after a single action. |
| `truncated` | Always `False` | Our episode can't be cut short; it's already as short as possible. |
| `info` | Doesn't exist | We don't need extra debugging information. |

The `env.reset()` in Gymnasium gives you the first observation of an episode.  
In our case, this is equivalent to *“a new user arrives.”*


##### Rewrite our environment to the Gym API


In [35]:
# WARNING: This version contains a subtle bug, which will complicate the learning process. 
class GymHomepageEnv:
    def __init__(self):
        self.contexts = ["New viewer", "Subscriber"]
        self.actions = ["🔥 Viral thing", "🤖 AI tutorial", "💤 Boring politics debate"]
        self.num_contexts = len(self.contexts)
        self.num_actions = len(self.actions)

    def reset(self):
        """Called at the start of a new episode (a new user arrives)."""
        # Returns the first observation (the user's context)
        context_idx = np.random.randint(0, self.num_contexts)
        info = {} # empty info dict
        return context_idx, info

    def step(self, action_idx):
        """Called when an agent takes an action."""
        # For this to work, we need to know which context we are in.
        # A real Gym env would store the context from reset() as part of its internal state.
        # Here, we just generate it on the fly for the example.
        current_context = np.random.randint(0, self.num_contexts)  # <-- !!!! BUG !!!!

        # Get the reward using the logic from our simple env
        if current_context == 0: # a new viewer is watching
            true_probs = [0.6, 0.1, 0.05]
        else:                    # subscriber who is thoughtful and enjoys politics debates
            true_probs = [0.1, 0.5, 0.7]
        reward = 1 if np.random.rand() < true_probs[action_idx] else 0

        # Since the episode is always over, there is no "next_observation"
        # and terminated is always True
        next_observation = None
        terminated = True
        truncated = False
        info = {}

        return next_observation, reward, terminated, truncated, info

In [36]:
env = GymHomepageEnv()


##### The agent now interacts with the world using env.reset() and env.step()


In [66]:
belief_table_2d = np.zeros((env.num_contexts,env.num_actions))
action_counts_2d = np.zeros((env.num_contexts,env.num_actions))

for step in range(num_steps):
    # 1. A new episode starts (a user arrives)
    # env.reset() gives us the context for THIS episode
    context_idx, info = env.reset() # replaces our old 'context_idx = np.random.randint(...)'

    # 2. Agent makes a decision based on the context
    if np.random.rand() < epsilon:
        action_idx = np.random.randint(0, env.num_actions)
    else:
        action_idx = np.argmax(belief_table_2d[context_idx])

    # 3. Agent takes a step in the environment
    # note: in this version of the env, step() does not use the context that reset() returned,
    # which is a flaw in this analogy, but it highlights that our problem is not truly sequential.
    # In a real Gym env, the state from reset() would be used by step()
    observation, reward, terminated, truncated, info = env.step(action_idx)

    # 4. Agent learns from the experience
    action_counts_2d[context_idx, action_idx] += 1
    belief_table_2d[context_idx, action_idx] += 1 / (action_counts_2d[context_idx, action_idx]) * (reward - belief_table_2d[context_idx, action_idx])

    # 5. Check if the episode is over (it always is for us, because we are working with one-step MDPs)
    if terminated or truncated:
        pass

In [67]:
belief_table_2d
action_counts_2d

array([[110., 177., 684.],
       [308.,  49., 672.]])


##### Our subtle bug: debugging


1. The agent starts a new episode. `env.reset()` tells it, "The user is a Subscriber (`context_idx = 1`)."
2. The agent looks at its brain for subscribers and correctly decides to show the "Boring politics debate" video (`action_idx = 2`).
3. The agent calls `env.step(action_idx=2)`.
4. Inside `env.step`, our faulty code ignores the fact that the user is a subscriber. It randomly generates a new context: `current_context = np.random.randint(0, 2)`. Let's say it randomly picks 0 ("New viewer").
5. The environment then calculates the reward based on a new viewer seeing the boring politics debate. The CTR is only 5%, so the reward is almost certainly 0.
6. The `env.step` function returns this `reward=0` back to the agent.
7. The agent, still thinking it's dealing with a subscriber, takes this `reward=0` and updates its belief for `belief_table_2d[1, 2]`. It wrongly concludes that "Boring politics debate for subscribers" is a bad idea.


The reward signal is completely mismatched with the context. The agent is being told about one situation but is being graded on a totally different one. This injects so much noise into the learning process that the agent can't figure out the true values, leading to the confused `action_counts_2d` we observed.

So, basically the <span style="color:blue">state</span> is decoupled from the reward. 
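
We can also work out what the buggy environment actually teaches the agent. As a sketch, write $p(s, a)$ for the true click probability of action $a$ in context $s$ (this notation is introduced here only for illustration). Because `step` draws a fresh context uniformly at random, the expected reward no longer depends on the context $s$ the agent saw:

$$
\mathbb{E}[r \mid s, a] = \tfrac{1}{2}\,p(0, a) + \tfrac{1}{2}\,p(1, a)
$$

Plugging in the numbers from the environment:

$$
\begin{aligned}
a = 0 \text{ (viral thing)}: &\quad \tfrac{1}{2}(0.6 + 0.1) = 0.35 \\
a = 1 \text{ (AI tutorial)}: &\quad \tfrac{1}{2}(0.1 + 0.5) = 0.30 \\
a = 2 \text{ (politics debate)}: &\quad \tfrac{1}{2}(0.05 + 0.7) = 0.375
\end{aligned}
$$

Both rows of the belief table therefore drift towards the same three values, and the politics debate looks slightly best for *everyone*. That is consistent with the action counts above, where action 2 dominates in both rows (684 and 672 times).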

---

This is a classic RL debugging scenario. Our agent's performance is terrible, and the action counts show it isn't learning the right policy. So where is the bug? Is it in the environment or in the brain we've built?

To recap the flow of information for a single step:
1. **Inside `step`**: Here's the bug! The `step` function completely **ignores** the fact that the user was a subscriber (`context_idx = 1`). It re-rolls the dice with `current_context = np.random.randint(...)`. It might decide the reward should be based on a "new viewer" (`context_idx = 0`).
2. **Corrupted reward signal**: The agent chose an action based on its *beliefs about subscribers*, but it *received a reward that might have come from a new viewer's experience*. The feedback signal is now corrupted. The agent is being punished or rewarded for the wrong reasons.

In [None]:
# Corrected version: Gym-like version of our environment
class GymHomepageEnv:
    def __init__(self):
        self.contexts = ["New viewer", "Subscriber"]
        self.actions = ["🔥 Viral thing", "🤖 AI tutorial", "💤 Boring politics debate"]
        self.num_contexts = len(self.contexts)
        self.num_actions = len(self.actions)
        
        # The environment now has an internal memory of the current state
        self.current_context_idx = None 
    
    def reset(self):
        """A new user arrives. The environment sets its internal state and returns it as an observation."""
        self.current_context_idx = np.random.randint(0, self.num_contexts)
        info = {}
        return self.current_context_idx, info

    def step(self, action_idx):
        """The agent takes an action. The environment uses its INTERNAL state to calculate the outcome."""
        if self.current_context_idx is None:
            raise RuntimeError("Cannot call step before reset. Start a new episode.")

        # THE BUG IS FIXED: We use the stored internal state, not a new random one.
        if self.current_context_idx == 0:  # new viewer
            true_probs = [0.6, 0.1, 0.05]
        else:                              # thoughtful subscriber
            true_probs = [0.1, 0.5, 0.7]
        
        reward = 1 if np.random.rand() < true_probs[action_idx] else 0
        
        # After the step, the episode is over, so we can clear the internal state
        # until the next reset.
        self.current_context_idx = None
        
        next_observation = None 
        terminated = True
        truncated = False
        info = {}
        
        return next_observation, reward, terminated, truncated, info
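
Bugs like this are easier to catch with a small sanity check on the environment itself, before any agent is involved. The sketch below is one possible check, not part of Gym: it uses condensed stand-ins for the buggy and corrected classes (named `BuggyEnv` and `FixedEnv` here) and a hypothetical helper `empirical_ctr` that measures the click rate of one action, counting only the episodes where `reset()` reported a given context.

```python
import numpy as np

TRUE_PROBS = np.array([[0.6, 0.1, 0.05],   # new viewer
                       [0.1, 0.5, 0.7]])   # subscriber

class BuggyEnv:
    """Condensed stand-in for the buggy GymHomepageEnv: step() re-rolls the context."""
    def reset(self):
        return np.random.randint(0, 2), {}

    def step(self, action_idx):
        hidden_context = np.random.randint(0, 2)  # ignores what reset() returned
        reward = int(np.random.rand() < TRUE_PROBS[hidden_context, action_idx])
        return None, reward, True, False, {}

class FixedEnv:
    """Condensed stand-in for the corrected GymHomepageEnv: step() uses the stored context."""
    def reset(self):
        self.current_context_idx = np.random.randint(0, 2)
        return self.current_context_idx, {}

    def step(self, action_idx):
        reward = int(np.random.rand() < TRUE_PROBS[self.current_context_idx, action_idx])
        return None, reward, True, False, {}

def empirical_ctr(env, context_idx, action_idx, num_episodes=20000):
    """Average reward of one action, counting only episodes where reset() gave context_idx."""
    rewards = []
    for _ in range(num_episodes):
        observed_context, _info = env.reset()
        _obs, reward, _terminated, _truncated, _info = env.step(action_idx)
        if observed_context == context_idx:
            rewards.append(reward)
    return float(np.mean(rewards))

np.random.seed(0)  # fixed seed, only so the check is repeatable
# Subscriber (1) shown the politics debate (2): the true CTR is 0.7
buggy_ctr = empirical_ctr(BuggyEnv(), context_idx=1, action_idx=2)
fixed_ctr = empirical_ctr(FixedEnv(), context_idx=1, action_idx=2)
print(f"Buggy env: {buggy_ctr:.3f}, fixed env: {fixed_ctr:.3f}, expected: 0.700")
```

With the corrected logic the measured rate should land close to 0.7, while the buggy version should hover around the context-averaged 0.375, which makes the mismatch visible without training anything.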

We've just peeked behind the curtain and seen that the tools we use are built on a deep understanding of the challenges of RL research.

- We need freedom to design agents ('Environments, not Agents')
- We need our results to be comparable ('Strict versioning')
- We need to learn effectively ('Sample complexity')

We are now in the perfect position to see the limitations of our work so far. Our tabular methods had terrible *sample complexity* (i.e., they needed too much data) and they *couldn't generalise* (everything was hidden in the 'boxes' of our tables, with no leak between the different states).