In [29]:
from google.colab import output
output.enable_custom_widget_manager()

# Expert Notebook: Generalist Agents in Open-Ended Worlds

### A Deep Dive Inspired by Jim Fan's CS25 Lecture

**Objective:** This notebook provides a PhD-level deconstruction of the concepts presented in the lecture on building generalist AI agents. We will move beyond a surface-level summary to explore the core architectural paradigms of **MineCLIP**, **Voyager**, and **Eureka**. Through from-scratch code implementations, recreations of key lecture examples, mathematical deep dives, and interactive visualizations, our goal is to build an expert-level intuition for the challenges and breakthroughs in the quest for open-ended, embodied AI.

## Section 1: The Foundational Analogy - The Kitten Carousel

The lecture begins with the powerful 1963 Held and Hein experiment, which serves as a central metaphor for the entire field of embodied AI. Understanding this analogy is crucial for grasping the limitations of current models and the motivation for building generalist agents.

- **The Passive Kitten (Today's LLMs, e.g., ChatGPT):** This kitten was moved through the world but could not control its own motion. It received all the same visual stimuli as the other kitten but failed to develop a healthy visual-motor loop. This is analogous to today's large language models, which passively observe a massive, static dataset (the internet). They have no agency, cannot perform actions to test hypotheses, and receive no direct feedback from a physical world. Their knowledge, while vast, is abstract, ungrounded, and can lead to "hallucinations" that violate physical common sense.

- **The Active Kitten (The Generalist Agent):** This kitten actively controlled its own movement. By initiating actions and observing the resulting sensory changes, it formed a robust understanding of the world. This is the goal for generalist agents: to learn within a closed **perception-action loop**, continuously interacting with an environment. This active embodiment is what grounds abstract knowledge in physical reality.

**The Goal:** The future of AI is not about choosing one kitten over the other. It's about building an **Active Kitten** on top of the knowledge foundation of a **Passive Kitten**. The agent's ability to act and experiment in the world is a crucial layer built upon the vast knowledge pre-trained from internet-scale data.

In [7]:
import plotly.graph_objects as go

fig = go.Figure()

# Nodes
nodes_x = [0.2, 0.8, 0.5, 0.5]
nodes_y = [0.8, 0.8, 0.2, 0.6]
node_text = ['<b>Passive Kitten</b><br>(e.g., GPT-4)', '<b>Static World</b><br>(Internet Text/Images)', '<b>Active Kitten</b><br>(Generalist Agent)', '<b>Interactive World</b><br>(e.g., Minecraft, Robotics Sim)']
node_colors = ['#FF6F61', '#6B5B95', '#88B04B', '#F7CAC9']

fig.add_trace(go.Scatter(x=nodes_x, y=nodes_y, mode='markers+text', text=node_text,
                           marker=dict(size=[60, 60, 60, 60], color=node_colors, line=dict(width=2, color='black')),
                           textposition='middle center', textfont=dict(size=12, color='black')))

# Edges - Passive Loop (One-way Observation)
fig.add_annotation(x=0.75, y=0.8, ax=0.25, ay=0.8, xref='x', yref='y', axref='x', ayref='y',
                   showarrow=True, arrowhead=2, arrowsize=1.5, arrowwidth=2, arrowcolor='#FF6F61', text="<i>Passively Observes</i>")

# Edges - Active Loop (Closed Perception-Action Loop)
# The two arrows are offset horizontally so they do not draw on top of each other.
fig.add_annotation(x=0.47, y=0.25, ax=0.47, ay=0.55, xref='x', yref='y', axref='x', ayref='y',
                   showarrow=True, arrowhead=2, arrowsize=1.5, arrowwidth=2, arrowcolor='#88B04B',
                   xanchor='right', text="<b>Observes (Perception)</b>")
fig.add_annotation(x=0.53, y=0.55, ax=0.53, ay=0.25, xref='x', yref='y', axref='x', ayref='y',
                   showarrow=True, arrowhead=2, arrowsize=1.5, arrowwidth=2, arrowcolor='#88B04B',
                   xanchor='left', text="<b>Acts On (Action)</b>")

fig.update_layout(title='The Kitten Carousel Analogy: Passive vs. Active Learning',
                  showlegend=False, xaxis=dict(visible=False), yaxis=dict(visible=False), height=600)
fig.show()

### 1.1. The Limitations of Specialized Agents

The lecture highlights that even superhuman systems like AlphaGo, OpenAI Five (Dota 2), and AlphaStar (StarCraft) are fundamentally limited. They are **specialists**, not **generalists**.

- **Single Objective:** Their goal is simple and singular: win the game.
- **Closed World:** The rules and environment are fixed and known.
- **No Generalization:** The AlphaGo model cannot play StarCraft, let alone a creative game like Minecraft. Its intelligence is brittle and task-specific.

A **Generalist Agent** must overcome these limitations. It should:
1.  **Pursue Open-Ended Objectives:** Understand complex, language-based instructions.
2.  **Possess Broad World Knowledge:** Go beyond the specifics of a single task.
3.  **Be Massively Multitask:** Perform a wide variety of skills, as diverse as human language itself.

## Section 2: Case Study 1 - MineCLIP: Learning Rewards from Observation

MineCLIP is a foundational step that addresses the data problem. It shows how to extract a useful signal (a reward function) from passive, observational data (YouTube videos).

### 2.1. The Data Trove: Tapping into Human Creativity

The project, **MineDojo**, first creates the necessary ingredients:
- **The Environment:** A highly controllable Minecraft simulator with over 3,000 programmatic and creative tasks.
- **The Knowledge Base:** A massive dataset scraped from the internet, reflecting the collective intelligence of the Minecraft community.
    - **700,000+ YouTube Videos:** With 2 billion words of transcribed narration. This provides (video, text) pairs.
    - **7,000 Wiki Pages:** Multimodal documents with images, tables, and text explaining game mechanics.
    - **300,000+ Reddit Posts:** A Q&A forum showing problem-solving and community interaction.

### 2.2. Technical Deep Dive: Contrastive Video-Language Learning

MineCLIP's core mechanism is a **contrastive learning** model, similar to OpenAI's CLIP, but adapted for video. It learns to associate short video clips with their corresponding text descriptions from the YouTube transcripts.

**Mathematical Foundation: The InfoNCE Loss**

The goal is to learn embedding functions for video, $f_v$, and text, $f_t$, such that for a matched pair $(v_i, t_i)$, their embeddings are close in a shared vector space, while embeddings for mismatched pairs $(v_i, t_j)$ where $i \neq j$ are far apart.

1.  **Embedding:** Given a batch of $N$ (video, text) pairs, we get two sets of embeddings: $V = \{f_v(v_1), ..., f_v(v_N)\}$ and $T = \{f_t(t_1), ..., f_t(t_N)\}$.
2.  **Similarity Matrix:** We compute the cosine similarity between every video embedding and every text embedding, creating an $N \times N$ matrix $S$, where $S_{i,j} = \frac{f_v(v_i) \cdot f_t(t_j)}{\|f_v(v_i)\| \|f_t(t_j)\|}$.
3.  **The Loss Function:** The probability that text $t_j$ is the correct description for video $v_i$ is modeled using a softmax over the similarity scores, scaled by a temperature parameter $\tau$:
    $$ p(j|i) = \frac{\exp(S_{i,j} / \tau)}{\sum_{k=1}^{N} \exp(S_{i,k} / \tau)} $$
    The InfoNCE (Noise Contrastive Estimation) loss is the cross-entropy loss of predicting the correct pairing. For a single video $v_i$, the loss is:
    $$ \mathcal{L}_{v_i} = -\log p(i|i) = -\log \frac{\exp(S_{i,i} / \tau)}{\sum_{k=1}^{N} \exp(S_{i,k} / \tau)} $$
    The total loss is calculated symmetrically for both video-to-text and text-to-video and averaged over the batch.
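
Written out, the symmetric objective can be expressed as follows. Here $\mathcal{L}_{t_i}$ is a symbol introduced for illustration: it is the text-to-video counterpart of $\mathcal{L}_{v_i}$, obtained by normalizing over column $i$ of $S$ instead of row $i$.

$$ \mathcal{L}_{t_i} = -\log \frac{\exp(S_{i,i} / \tau)}{\sum_{k=1}^{N} \exp(S_{k,i} / \tau)}, \qquad \mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left( \mathcal{L}_{v_i} + \mathcal{L}_{t_i} \right) $$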

**Expert Insight:** What does MineCLIP become after this training? It becomes a **universal, language-prompted reward function**. You can give it *any* task description in English (e.g., "shear a sheep") and a video of an agent's behavior. MineCLIP will output a high score if the behavior in the video matches the description, and a low score otherwise. This is analogous to **Reinforcement Learning from Human Feedback (RLHF)**, except that the "feedback" is the alignment score from a pre-trained model instead of a live human click.
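
To make the "alignment score as reward" idea concrete, here is a minimal sketch. It assumes the embeddings have already been computed: the toy vectors and the `language_reward` helper are hypothetical stand-ins for illustration, not MineCLIP's actual encoders or API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def language_reward(video_embedding: np.ndarray, text_embedding: np.ndarray) -> float:
    """Hypothetical reward: alignment between the agent's recent frames and the task prompt."""
    return cosine_similarity(video_embedding, text_embedding)

# Toy stand-ins for encoder outputs (assumption: a real system would obtain
# these from trained video and text encoders).
task = np.array([1.0, 0.0, 0.0])           # embedding of "shear a sheep"
on_task_clip = np.array([0.9, 0.1, 0.0])   # clip that matches the prompt
off_task_clip = np.array([0.0, 1.0, 0.0])  # clip of unrelated behavior

r_on = language_reward(on_task_clip, task)
r_off = language_reward(off_task_clip, task)
print(f"on-task reward:  {r_on:.3f}")
print(f"off-task reward: {r_off:.3f}")
```

An RL algorithm could then maximize this score in place of a hand-written, task-specific reward, which is what makes the reward "language-prompted".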

In [32]:
# Code Implementation: Interactive Contrastive Loss Visualization

import torch
import torch.nn.functional as F
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
from IPython.display import display
import plotly.io as pio

# This helps ensure Plotly displays correctly in most notebook environments.
pio.renderers.default = "notebook"

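def contrastive_loss(v_embeds, t_embeds, temperature):
    """Symmetric InfoNCE loss over a batch of matched (video, text) embeddings."""
    # L2-normalize so the dot product below equals cosine similarity
    v_embeds = F.normalize(v_embeds, p=2, dim=-1)
    t_embeds = F.normalize(t_embeds, p=2, dim=-1)
    similarity_matrix = v_embeds @ t_embeds.T
    logits = similarity_matrix / temperature
    # The matched pair for row i sits on the diagonal, so the target class is i
    labels = torch.arange(len(v_embeds), device=v_embeds.device)
    # Average the video-to-text (rows) and text-to-video (columns) cross-entropies.
    # The logits are not detached, so gradients can flow if this is used for training.
    loss_v2t = F.cross_entropy(logits, labels)
    loss_t2v = F.cross_entropy(logits.T, labels)
    loss = (loss_v2t + loss_t2v) / 2
    return loss, logits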

# --- Interactive Visualization Setup ---
# 1. Create the widgets that will control the visualization
num_pairs_slider = widgets.IntSlider(min=4, max=16, step=2, value=8, description='Num Pairs')
temp_slider = widgets.FloatSlider(min=0.01, max=1.0, step=0.01, value=0.1, description='Temp (τ)')
noise_slider = widgets.FloatSlider(min=0.1, max=2.0, step=0.1, value=0.5, description='Noise')

# 2. Create the FigureWidget. This is essential for interactive updates.
fig_widget = go.FigureWidget(make_subplots(rows=1, cols=2,
                                           subplot_titles=("Embedding Space", "Softmax Attention (for V0)")))

def visualize_contrastive_space(num_pairs, temperature, noise):
    """
    This function generates the data and UPDATES the FigureWidget.
    It is called whenever a slider value changes.
    """
    # Generate mock embeddings as PyTorch Tensors
    v = torch.randn(num_pairs, 2)
    t = v + torch.randn(num_pairs, 2) * noise
    v = F.normalize(v, p=2, dim=-1)
    t = F.normalize(t, p=2, dim=-1)
    loss, logits = contrastive_loss(v, t, temperature)

    # Convert all PyTorch tensors to NumPy arrays or Python scalars before plotting
    v_np = v.detach().numpy()
    t_np = t.detach().numpy()
    probs_v0_np = F.softmax(logits[0], dim=0).detach().numpy()
    loss_val = loss.item()

    # Use a batch_update context manager for efficient plotting updates
    with fig_widget.batch_update():
        # Clear all existing data and shapes from the figure to prevent drawing on top of old plots
        fig_widget.data = []
        fig_widget.layout.shapes = []

        # Re-add all traces and shapes with the new NumPy data
        fig_widget.add_trace(go.Scatter(x=v_np[:, 0], y=v_np[:, 1], mode='markers', name='Video', marker=dict(symbol='circle', size=12, color='#88B04B')), row=1, col=1)
        fig_widget.add_trace(go.Scatter(x=t_np[:, 0], y=t_np[:, 1], mode='markers', name='Text', marker=dict(symbol='square', size=12, color='#FF6F61')), row=1, col=1)

        for i in range(num_pairs):
             fig_widget.add_shape(type='line', x0=v_np[i, 0], y0=v_np[i, 1], x1=t_np[i, 0], y1=t_np[i, 1],
                                 line=dict(color='grey', width=1, dash='dot'), row=1, col=1)

        fig_widget.add_trace(go.Bar(y=[f'T{i}' for i in range(num_pairs)], x=probs_v0_np, orientation='h', name='Prob(T_i | V_0)'), row=1, col=2)

        # Update layout properties
        fig_widget.layout.title = f'Contrastive Learning (Loss: {loss_val:.3f})'
        fig_widget.layout.xaxis1.title = 'Dimension 1'
        fig_widget.layout.yaxis1.title = 'Dimension 2'
        fig_widget.layout.xaxis2.title = 'Probability'
        fig_widget.layout.yaxis2.title = 'Text Candidates'
        fig_widget.layout.yaxis2.autorange = 'reversed'
        fig_widget.layout.yaxis1.scaleanchor = "x1"

def on_slider_change(change):
    # 3. Define a single callback that reads all current slider values and calls the update function
    visualize_contrastive_space(num_pairs_slider.value, temp_slider.value, noise_slider.value)

# 4. Manually link each slider's 'value' change event to our single callback function
for slider in [num_pairs_slider, temp_slider, noise_slider]:
    slider.observe(on_slider_change, names='value')

# 5. Perform the initial render of the plot with the default slider values
visualize_contrastive_space(num_pairs_slider.value, temp_slider.value, noise_slider.value)

# 6. Display the final layout using a VBox
controls = widgets.VBox([num_pairs_slider, temp_slider, noise_slider])
print("Instructions: Training pushes matching Video/Text pairs (connected by dotted lines) closer. A perfect model would have a huge bar for 'T0' and tiny bars for all others. Adjust 'Temp' and 'Noise' to see their effects.")
display(widgets.VBox([controls, fig_widget]))

Instructions: Training pushes matching Video/Text pairs (connected by dotted lines) closer. A perfect model would have a huge bar for 'T0' and tiny bars for all others. Adjust 'Temp' and 'Noise' to see their effects.



## Section 3: Learning from Interaction - The Voyager Paradigm

Voyager represents a completely different approach. It's a **zero-gradient** or **no-training** agent that leverages a powerful, pre-trained LLM (GPT-4) as its executive brain. It learns not by updating weights, but by generating code, observing the outcome, and building a library of successful programs.

### 3.1. The Architecture of Voyager
Voyager's intelligence emerges from the interaction of three core modules, all orchestrated by LLM prompts:
1.  **Automatic Curriculum:** An LLM instance that acts as a curiosity engine. Given the agent's current state (inventory, explored areas) and existing skills, it proposes a new, challenging but achievable task. Its high-level goal is `"discover as many unique items as possible"`, which naturally leads to exploration and tech tree progression.
2.  **Iterative Prompting Module:** The core execution engine. This LLM instance writes JavaScript code to accomplish the given task using the **Mineflayer API** (a crucial piece that provides high-level control functions). If the code fails (due to syntax errors, execution errors, or simply not achieving the goal), the error message or a self-critique is fed back into the prompt, and the LLM attempts to debug and refine its own code.
3.  **Skill Library:** The agent's long-term memory. This is a **vector database** that stores successful code snippets.
    - **Storage:** When a skill (e.g., `craft_wooden_pickaxe`) is successfully coded, the LLM first generates a natural language description of it ("crafts a wooden pickaxe from wood planks"). This description is converted into an embedding and stored as the *key*, with the raw code as the *value*.
    - **Retrieval:** When faced with a new task, the agent queries the library with a natural language description of what it needs to do. The most semantically similar skills are retrieved and included in the prompt as examples, allowing the agent to reuse and compose existing skills.
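
The generate, execute, and refine cycle of the Iterative Prompting Module can be sketched as a plain loop. This is an illustrative sketch only: `mock_llm` and `mock_env` are hypothetical stand-ins for the GPT-4 call and the Mineflayer-controlled game, and the code strings are placeholders rather than real Mineflayer API calls.

```python
from typing import Callable, Optional, Tuple

def iterative_prompting(
    task: str,
    generate_code: Callable[[str, Optional[str]], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 4,
) -> Optional[str]:
    """Generate code, run it, and feed any error back until it succeeds or rounds run out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code = generate_code(task, feedback)
        success, message = execute(code)
        if success:
            return code  # a caller could now store this in the skill library
        feedback = message  # the error text becomes part of the next prompt
    return None

# --- Hypothetical stand-ins for the LLM and the game environment ---
def mock_llm(task: str, feedback: Optional[str]) -> str:
    # The first attempt forgets to equip a tool; the retry "reads" the error and fixes it.
    if feedback is None:
        return "mine('stone')"
    return "equip('pickaxe'); mine('stone')"

def mock_env(code: str) -> Tuple[bool, str]:
    if "equip('pickaxe')" not in code:
        return False, "Error: cannot mine stone without a pickaxe"
    return True, "ok"

result = iterative_prompting("mine cobblestone", mock_llm, mock_env)
print(result)
```

No model weights change anywhere in this loop: all of the "learning" lives in the feedback string and in whatever code is eventually saved.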

### 3.2. Lecture Example: The Pig and the Sword

The lecture gives a brilliant example of the full loop in action. Let's trace the logic:
1.  **State:** Agent is hungry. `Hunger: 1/20`.
2.  **Perception (from Mineflayer):** `Nearby entities: [cat, villager, pig, wheat_seeds]`.
3.  **Automatic Curriculum:** Proposes task: `"Find food"`.
4.  **Iterative Prompting (Reasoning):** The LLM receives the task and state. It reasons:
    - *"Killing a cat or villager is bad."*
    - *"Growing wheat from seeds takes too long."*
    - *"The pig is the best source of food."*
    - **Refined Task:** `"Hunt the pig for porkchops."`
5.  **Skill Retrieval:** The agent queries the skill library with `"hunt a pig"`. Let's say it finds no direct skill. But it might find `"fight a zombie"` which uses a sword.
6.  **Code Generation (Sub-goal):** It checks its inventory. It has an `iron_ingot`. It queries the skill library with `"use iron to make a weapon"` and retrieves the `craft_iron_sword` skill. It executes this code.
7.  **Code Generation (Main Goal):** Now equipped with a sword, it generates a new program, `hunt_pig`, which involves approaching the pig and attacking it.
8.  **Skill Storage:** Once the pig is killed and porkchops are collected, the new `hunt_pig` code is summarized and stored in the skill library for future use.

In [14]:
# Code Implementation: Interactive Voyager Skill Library
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import ipywidgets as widgets
import pandas as pd
from IPython.display import display

class VoyagerSkillDB:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()
        self.skills = [] # {'description': str, 'code': str}
        self.skill_embeddings = None

    def _update_embeddings(self):
        if not self.skills:
            return
        descriptions = [s['description'] for s in self.skills]
        self.skill_embeddings = self.vectorizer.fit_transform(descriptions)

    def add_skill(self, description, code):
        self.skills.append({'description': description, 'code': code})
        self._update_embeddings()

    def retrieve_skills(self, query, top_k=3):
        if self.skill_embeddings is None or not self.skills: return []
        query_embedding = self.vectorizer.transform([query])
        sims = cosine_similarity(query_embedding, self.skill_embeddings)[0]
        # Ensure we don't request more items than exist
        k = min(top_k, len(self.skills))
        if k == 0: return []

        top_indices = np.argsort(sims)[::-1][:k]
        results = [{'description': self.skills[i]['description'], 'code': self.skills[i]['code'], 'similarity': sims[i]} for i in top_indices]
        return results

# --- Interactive Demo ---
db = VoyagerSkillDB()
db.add_skill("chop down an oak tree to get oak logs", "bot.equip('axe'); bot.chop('oak_tree');")
db.add_skill("craft a wooden pickaxe from wood planks", "bot.craft({item: 'wooden_pickaxe', ingredients: ['plank', 'plank', 'stick']});")
db.add_skill("mine for cobblestone using a pickaxe", "bot.equip('pickaxe'); bot.mine('stone_block');")
db.add_skill("craft a furnace from 8 cobblestone blocks", "bot.craft({item: 'furnace', ingredients: ['cobblestone']*8});")
db.add_skill("smelt iron ore in a furnace to get an iron ingot", "bot.smelt('iron_ore', 'furnace');")
db.add_skill("kill a zombie with a sword to get rotten flesh", "bot.equip('sword'); bot.attack('zombie');")

query_text = widgets.Text(value='make a stone tool', description='Task:', layout={'width': '400px'})
results_html = widgets.HTML()

def on_query_change(change):
    # `change` is dict-like in both cases: traitlets passes a Bunch (a dict subclass)
    # and the manual call below passes a plain dict, so key access works for both.
    query = change['new']

    results = db.retrieve_skills(query)
    if results:
        df = pd.DataFrame(results)
        results_html.value = df.to_html(index=False, escape=False)
    else:
        results_html.value = "No skills found or database is empty."

query_text.observe(on_query_change, names='value')
# Manually trigger the first update
on_query_change({'new': query_text.value})

print("### Voyager's Skill Memory (Vector Database) ###")
print("Type a task in the box. The agent retrieves the most semantically similar skills to include in its prompt.")
display(widgets.VBox([query_text, results_html]))

### Voyager's Skill Memory (Vector Database) ###
Type a task in the box. The agent retrieves the most semantically similar skills to include in its prompt.



## Section 4: Eureka: The Hybrid Gradient Architecture

Eureka applies the core ideas of LLM-driven discovery to a more challenging domain: low-level, dexterous robotic manipulation. It introduces a powerful **hybrid gradient architecture**.

**The Problem with Direct Control:** Generating low-level motor torques directly from an LLM is extremely difficult. The action space is continuous, high-dimensional, and requires high-frequency control (e.g., 30-60 Hz).

**Eureka's Solution:** Decouple the problem. Use the LLM for what it's good at (high-level reasoning, code) and a traditional algorithm for what *it's* good at (low-level optimization).

- **The Outer Loop (Gradient-Free, LLM-driven):** This is an **evolutionary search over reward functions**.
  1.  **Generation:** The LLM (GPT-4) is given the simulator's Python code as context and a task description (e.g., "spin the pen"). It generates a population of candidate reward functions as Python code.
  2.  **Evaluation:** Each candidate reward function is evaluated by running a full, expensive reinforcement learning training process (the inner loop). The final performance of the trained policy becomes the "fitness score" of the reward function.
  3.  **Mutation:** The top-performing reward functions and their scores are fed back into the LLM's prompt. The LLM is then asked to act as a "superhuman reward engineer" and propose mutations or entirely new functions that might improve performance.

- **The Inner Loop (Gradient-Based, RL-driven):** For each reward function generated by the outer loop, a standard RL algorithm (like PPO) trains a neural network policy from scratch. This is done in a highly parallelized simulator (like NVIDIA Isaac Sim) to make the evaluation step feasible.
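
The two nested loops can be sketched in a few lines. This is a toy illustration under stated assumptions: `mock_propose` stands in for the LLM, `mock_train_and_score` stands in for a full RL training run, and the reward strings and `TERMS` list are invented for the example rather than taken from Eureka.

```python
from typing import Callable, List, Tuple

def evolve_rewards(
    propose: Callable[[str, int], List[str]],
    train_and_score: Callable[[str], float],
    seed_reward: str,
    generations: int = 3,
    population: int = 4,
) -> Tuple[str, float]:
    """Outer loop: sample candidate rewards, score each with an inner-loop run, keep the best."""
    best_code, best_score = seed_reward, train_and_score(seed_reward)
    for _ in range(generations):
        for code in propose(best_code, population):
            score = train_and_score(code)  # stands in for a full RL training run
            if score > best_score:
                best_code, best_score = code, score
    return best_code, best_score

# --- Hypothetical stand-ins for the LLM and the RL inner loop ---
TERMS = ["- x_pos", "+ abs(y_vel)", "- abs(x_vel + 0.5)"]

def mock_propose(best_code: str, k: int) -> List[str]:
    # Mock "LLM": proposes the current best plus one term it does not yet contain.
    missing = [t for t in TERMS if t not in best_code]
    return [f"{best_code} {t}" for t in missing[:k]]

def mock_train_and_score(code: str) -> float:
    # Mock fitness: one point per useful term present in the reward.
    return float(sum(t in code for t in TERMS))

best_code, best_score = evolve_rewards(mock_propose, mock_train_and_score, "reward = 0")
print(best_code, best_score)
```

The outer loop never computes a gradient. It only compares fitness scores, which is why an LLM can drive it through prompts alone.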

### 4.1. Lecture Example: Coaching the Moonwalk

The power of this approach lies in its interactivity. The lecture shows an agent attempting a moonwalk and failing. A human can then provide corrective feedback *in natural language*.

1.  **Initial Prompt:** `"Make the robot do a moonwalk while standing up."`
2.  **LLM Generates Reward Code v1:** The agent moves backward but doesn't have the characteristic leg swing.
3.  **Human Feedback (Coaching):** `"You're being very patient. Moonwalk means a robot should walk backwards while the feet swing as if they are moving forward. Correct your answer and also make it work at a speed of 0.5 m/s."`
4.  **LLM Generates Reward Code v2:** The LLM incorporates this new, more detailed semantic information and the added constraints into its next reward function proposal, leading to a successful behavior.

This demonstrates that the LLM is not just a code generator, but a **reasoning engine** that can understand and integrate corrective feedback to refine its programmatic output.

In [11]:
# Code Implementation: Interactive Eureka-style Reward Evolution
import numpy as np
import ipywidgets as widgets
from IPython.display import display, HTML
import plotly.graph_objects as go

# --- The "Inner Loop": A mock fitness evaluation ---
def evaluate_fitness(reward_code):
    """Mocks a full RL run. A 'good' reward should mention moving backward ('-state['x_pos']') and swinging ('y_vel')."""
    fitness = 0
    if "-state['x_pos']" in reward_code: fitness += 50
    if "abs(state['y_vel'])" in reward_code: fitness += 30
    if "state['x_vel']" in reward_code: fitness += 20 # Speed term
    fitness += np.random.randint(-5, 6) # Add symmetric integer noise in [-5, 5] (upper bound is exclusive)
    return fitness

# --- The "Outer Loop": The Mock LLM ---
class MockEurekaLLM:
    def __init__(self):
        self.population = [("reward = -state['x_pos']**2", 0)]
        self._update_fitness()

    def _update_fitness(self):
        self.population = [(code, evaluate_fitness(code)) for code, _ in self.population]
        self.population.sort(key=lambda x: x[1], reverse=True)

    def get_prompt_for_llm(self, feedback=""):
        prompt = "# Environment has state: state['x_pos'], state['y_pos'], state['x_vel'], state['y_vel']\n"
        prompt += "# Goal: Make the agent moonwalk.\n"
        if feedback: prompt += f"# User Feedback: {feedback}\n"
        prompt += "# Here are the best performing reward functions so far:\n"
        for code, fitness in self.population[:2]:
            prompt += f"# Fitness: {fitness:.1f}\n# def get_reward(state):\n#    {code}\n"
        prompt += "# Propose a new, better reward function code:\ndef get_reward(state):\n   reward = "
        return prompt

    def generate_new_candidate(self, feedback=""):
        best_code, _ = self.population[0]
        new_code = best_code.replace("reward = ", "") # Start from the best code
        # Mock LLM logic based on feedback
        if "swing" in feedback and "y_vel" not in new_code:
            new_code += " + 0.5 * abs(state['y_vel'])"
        if "speed" in feedback and "x_vel" not in new_code:
            new_code += " - 20 * abs(state['x_vel'] + 0.5)" # Penalize deviation from target speed
        # Add a random mutation: a small, randomly sampled weight on a y-position term
        new_code += f" + {np.random.uniform(-0.1, 0.1):.3f} * state['y_pos']**2"

        self.population.append(("reward = " + new_code, 0))
        self._update_fitness()

# --- Interactive Demo ---
eureka_llm = MockEurekaLLM()

feedback_input = widgets.Textarea(description="Feedback:", layout={'width':'95%', 'height':'80px'})
generate_button = widgets.Button(description="Generate & Evaluate Next Generation")
prompt_output = widgets.HTML()
population_output = widgets.HTML()

# Use a FigureWidget so the bar chart can be updated in place on each click
fitness_plot = go.FigureWidget()
fitness_plot.update_layout(title="Reward Function Fitness", yaxis_title="Fitness Score")

def update_display():
    prompt_output.value = f"<b>Current Prompt to LLM:</b><pre>{eureka_llm.get_prompt_for_llm(feedback_input.value)}</pre>"
    # Update HTML display of population
    pop_html = "<b>Current Reward Population (Code -> Fitness):</b><br/>"
    for code, fitness in eureka_llm.population:
        pop_html += f"<code>{code}</code> -> {fitness}<br/>"
    population_output.value = pop_html

    # Update the plot
    with fitness_plot.batch_update():
        fitness_plot.data = [] # Clear old data
        codes = [p[0] for p in eureka_llm.population]
        scores = [p[1] for p in eureka_llm.population]
        fitness_plot.add_trace(go.Bar(x=[f"Cand. {i}" for i in range(len(codes))], y=scores, text=codes, hoverinfo='text'))
        fitness_plot.layout.yaxis.autorange = True


def on_generate_click(b):
    eureka_llm.generate_new_candidate(feedback_input.value)
    update_display()

generate_button.on_click(on_generate_click)
update_display()

print("### Eureka's Reward Evolution Loop ###")
print("Provide feedback to guide the LLM's reward design. Try typing: 'moonwalk means the feet should swing' and generating.")
display(widgets.VBox([feedback_input, generate_button, prompt_output, widgets.HBox([population_output, fitness_plot])]))

### Eureka's Reward Evolution Loop ###
Provide feedback to guide the LLM's reward design. Try typing: 'moonwalk means the feet should swing' and generating.



## Section 5: Future Directions and Final Takeaways

The lecture concludes by looking forward, emphasizing two key areas:

1.  **Learning from Internet-Scale Video:** Video is the richest source of data for learning about dynamics, physics, and human behavior. However, it's challenging due to:
    - **Distribution Shift:** YouTube videos look different from a robot's camera view.
    - **No Action Labels:** You see *what* happens, but not the motor commands that *caused* it. A key research direction here is training an **Inverse Dynamics Model (IDM)** to pseudo-label actions in unlabeled videos.
    - **Embodiment Mismatch (The "Boxing Cat" Problem):** An observer (the cat) may fundamentally misinterpret the purpose of an action (the boxer shaking his head) because it has a different body and context.

2.  **True Multimodal Foundation Models:** Moving beyond text and static images to models that can seamlessly process a mix of text, images, videos, and audio as input to produce actions as output. The **VIMA** and **RT-2** projects are early steps in this direction, using "multimodal prompting" to define tasks.
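
The IDM pseudo-labeling idea from the first point can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not a method from the lecture: the 1-D world in `simulate` and the linear least-squares model are invented for the example, whereas a real IDM would be a neural network over video frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n: int):
    """Toy 1-D world (assumption): next_state = state + action + small noise."""
    states = rng.normal(size=n)
    actions = rng.integers(-1, 2, size=n)  # actions in {-1, 0, 1}
    next_states = states + actions + rng.normal(scale=0.05, size=n)
    return states, actions, next_states

# 1. Small labeled set: transitions where the action is known.
s, a, s_next = simulate(200)
features = np.column_stack([s, s_next, np.ones_like(s)])
weights, *_ = np.linalg.lstsq(features, a.astype(float), rcond=None)

def idm_predict(state: np.ndarray, next_state: np.ndarray) -> np.ndarray:
    """Linear inverse dynamics model: infer the action from (state, next_state)."""
    x = np.column_stack([state, next_state, np.ones_like(state)])
    return np.rint(x @ weights).astype(int)

# 2. Large "unlabeled" set: pretend the actions are hidden, as in raw video.
u_s, u_a_hidden, u_s_next = simulate(1000)
pseudo_labels = idm_predict(u_s, u_s_next)

# 3. Only possible in a toy setting: compare against the hidden ground truth.
accuracy = float(np.mean(pseudo_labels == u_a_hidden))
print(f"pseudo-label accuracy: {accuracy:.3f}")
```

The pseudo-labeled transitions could then be used for ordinary imitation learning, which is the appeal of the approach: a small labeled set unlocks a much larger unlabeled one.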

### Final Takeaway

The journey towards generalist agents is a paradigm shift from building specialized problem-solvers to creating **autonomous, lifelong learners**. Scaling up existing models is not enough on its own: the architectures must also actively interact with their environment, learn from diverse data sources, and continuously improve through a closed perception-action loop, finally allowing our AI to become the active kitten.