# 11. AI Agents and Automation: Integrating AI into Kubernetes Workflows

## Introduction

Welcome to this interactive introduction to AI Agents and Automation in Kubernetes! In this notebook, we’ll explore how AI can transform Kubernetes workflows by enabling real-time decision-making and proactive automation.

Think of this as learning how to build an intelligent assistant for your Kubernetes environment—one that can monitor metrics, predict issues, and take corrective actions automatically. By the end of this notebook, you’ll integrate AI agents into your workflows and empower your cluster to operate with greater efficiency and resilience.

### Objectives

By the end of this notebook, you will:
1. Understand what AI agents are and how they enhance Kubernetes operations.
2. Use AI models to predict workload surges and other operational challenges.
3. Build automated workflows that take actions like scaling, healing, or reconfiguring services.

### Key Features

- Learn how AI-driven automation reduces downtime and simplifies operations.
- Build an AI agent that integrates directly with the Kubernetes API.
- See real-time decision-making in action, with tools that monitor metrics, predict issues, and resolve them autonomously.

## What Are AI Agents?

AI agents are autonomous systems designed to:
1. Monitor: Continuously analyze Kubernetes metrics, logs, and environmental data.
2. Reason: Use predictive models to anticipate problems or optimize resource allocation.
3. Act: Trigger specific actions via Kubernetes APIs, such as scaling deployments or cordoning nodes.

![agentmodules](images/agent_modules_small.png)
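
The Monitor, Reason, Act cycle can be sketched in a few lines of plain Python. This is only an illustrative skeleton: the function names, the random mock metrics, and the 80% CPU limit are assumptions made for this example, and a simple threshold rule stands in for a predictive model.

```python
import random
from typing import Dict, Optional


def monitor() -> Dict[str, int]:
    """Monitor: collect metrics (random mock values stand in for a metrics API)."""
    return {
        "cpu_usage": random.randint(50, 90),
        "memory_usage": random.randint(50, 90),
    }


def reason(metrics: Dict[str, int], cpu_limit: int = 80) -> Optional[str]:
    """Reason: map observations to a decision; None means no action is needed."""
    if metrics["cpu_usage"] > cpu_limit:
        return "scale-up"
    return None


def act(decision: str) -> str:
    """Act: carry out the decision (here it only reports what it would do)."""
    return f"Would execute: {decision}"


def run_once() -> str:
    """Run a single Monitor -> Reason -> Act pass."""
    metrics = monitor()
    decision = reason(metrics)
    return act(decision) if decision else "No action needed"


print(run_once())
```

In the rest of this notebook, the `reason` step is handed to a language model rather than a fixed rule.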

### Why Are AI Agents Important?

Kubernetes environments are dynamic and complex, with workloads that can change rapidly. Traditional monitoring tools require manual intervention to:
- Scale services during traffic surges.
- Mitigate issues like node failures or pod misconfigurations.

AI agents simplify this by:
- Acting proactively instead of reactively.
- Automating routine operations to reduce manual effort.
- Enhancing system resilience and uptime.

![agenticvnonagentic](images/agentic-vs-non-agentic.png)

### A Real Example: Handling Traffic Surges

Imagine running an e-commerce platform on Kubernetes. During a flash sale, traffic surges unexpectedly:
1. Without AI Agents:
   - Ops teams scramble to monitor metrics, manually scale services, and address bottlenecks.
   - Delays in response lead to slower user experiences or downtime.

2. With AI Agents:
   - The agent detects a spike in traffic early using predictive models.
   - It automatically scales pods and adjusts resources to meet demand, ensuring a seamless shopping experience.

By leveraging AI agents, your cluster becomes smarter, faster, and more efficient—handling challenges with minimal human intervention.

For a more in-depth analysis of LLM-based autonomous agents, see the paper [A Survey on Large Language Model based Autonomous Agents](https://arxiv.org/pdf/2308.11432).

## 1. Installing the Required Libraries

Before we start, we need to install the necessary libraries.

In [1]:
%pip install langchain python-ollama requests matplotlib flask duckduckgo_search langchain-community --quiet

Note: you may need to restart the kernel to use updated packages.


## 2. Simulating Data and Actions

In this step, we set up mock functions to simulate real-world data and actions. These functions are then wrapped as **tools** using LangChain, making them accessible to our AI agent for decision-making.

### What are LangChain Tools?

LangChain tools are modular components that allow AI agents to interact with external systems or perform specific actions. They act as the **hands** of the AI agent, enabling it to fetch data, analyze information, or trigger actions. Each tool has a clear purpose, making it easy to define workflows for the agent.
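
To make the idea concrete, here is a minimal sketch of what a tool boils down to: a name, a description the model can read, and a callable. `SimpleTool`, `describe_tools`, and `call_tool` are hypothetical names for this illustration and are not LangChain classes; LangChain's own tool objects carry more (argument schemas, callbacks, and so on).

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SimpleTool:
    """A stripped-down stand-in for an agent tool (not a LangChain class)."""
    name: str
    description: str
    func: Callable[[str], str]


def describe_tools(tools: Dict[str, SimpleTool]) -> str:
    """Render the tool list the way it might be shown to a language model."""
    return "\n".join(f"{t.name}: {t.description}" for t in tools.values())


def call_tool(tools: Dict[str, SimpleTool], name: str, tool_input: str) -> str:
    """Dispatch a tool call by name, reporting unknown tools instead of raising."""
    if name not in tools:
        return f"Unknown tool: {name}"
    return tools[name].func(tool_input)


registry = {
    "echo": SimpleTool("echo", "Returns its input unchanged.", lambda text: text),
}

print(describe_tools(registry))
print(call_tool(registry, "echo", "hello"))
```

The description matters as much as the function: it is the only information the model has when deciding which tool to pick.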

### Tools We’ll Use:

1. **Financial Data Tool**:
   - Simulates financial metrics like trading volume and price changes.
   - Helps the AI agent identify external factors (e.g., market surges) that may affect workloads.

2. **Kubernetes Metrics Tool**:
   - Simulates resource usage (CPU and memory) for a Kubernetes service.
   - Provides the agent with system insights to assess current performance.

3. **Scaling Action Tool**:
   - Simulates scaling Kubernetes services, such as adding or removing pods.
   - Demonstrates how the AI agent can act on predictions or detected issues.

By combining these tools, we create a framework where the AI agent can:
- **Monitor**: Fetch data using tools like the financial and Kubernetes metrics simulators.
- **Reason**: Analyze the data to detect patterns or issues.
- **Act**: Take corrective actions using the scaling tool.

![React Agent Architecture](images/react_agent_architecture_modules.png)
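
The scaling tool in this notebook only simulates the action. As a rough sketch of what the Act step could look like against a real cluster, the following uses the official `kubernetes` Python client. The function names `desired_replicas` and `scale_deployment`, the step size, and the replica bounds are illustrative assumptions, not part of this notebook. It also assumes a working kubeconfig and that the `kubernetes` package is installed, which the install cell in section 1 does not cover.

```python
def desired_replicas(current: int, action: str, step: int = 1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Translate 'scale-up' or 'scale-down' into a bounded replica count."""
    if action == "scale-up":
        target = current + step
    elif action == "scale-down":
        target = current - step
    else:
        raise ValueError(f"Unsupported action: {action}")
    # Clamp so the agent can never scale to zero or beyond the allowed ceiling
    return max(min_replicas, min(max_replicas, target))


def scale_deployment(name: str, namespace: str, action: str) -> int:
    """Apply the action to a real Deployment and return the new replica count."""
    # Imported lazily so desired_replicas() stays usable without the package
    from kubernetes import client, config

    # Assumes a local kubeconfig; code running inside a pod would call
    # config.load_incluster_config() instead
    config.load_kube_config()
    apps = client.AppsV1Api()
    scale = apps.read_namespaced_deployment_scale(name=name, namespace=namespace)
    target = desired_replicas(scale.spec.replicas or 0, action)
    apps.patch_namespaced_deployment_scale(
        name=name, namespace=namespace, body={"spec": {"replicas": target}}
    )
    return target


print(desired_replicas(3, "scale-up"))
```

Keeping the bounds in plain code rather than in the prompt means a misbehaving model cannot request an unbounded number of pods.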

In the next step, we’ll initialize these tools and integrate them with the AI agent for an interactive demonstration.


In [2]:
import random

def get_mock_financial_data():
    """Simulates a financial API response."""
    return {
        "index": "NASDAQ",
        "volume": random.randint(500000, 2000000),
        "avg_volume": 1000000,
        "price_change": round(random.uniform(-3, 3), 2),
    }

def get_mock_kubernetes_metrics():
    """Simulates a Kubernetes metrics API response."""
    return {
        "service": "trading-platform",
        "cpu_usage": random.randint(50, 90),
        "memory_usage": random.randint(50, 90),
    }

In [3]:
from langchain.tools import tool

@tool
def get_financial_data():
    """Fetches mock financial trading data."""
    return get_mock_financial_data()


@tool
def get_kubernetes_metrics():
    """Fetches mock Kubernetes metrics."""
    return get_mock_kubernetes_metrics()


@tool
def scale_service(action: str):
    """Simulates scaling a Kubernetes service."""
    return f"Scaling action: {action} executed successfully."

## 3. Setting Up the AI Agent with LangChain

Now that we’ve defined and wrapped our mock functions as tools, it’s time to integrate them into an AI agent. This agent will use the tools to monitor data, reason about it, and take actions, all driven by a powerful language model.

### Key Components of the Agent

1. **Language Model (LLM)**:
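   - We initialize the **Ollama LLM** (`llama3.1`) as the brain of the agent. This LLM processes prompts, reasons about data, and decides which tools to use. This assumes an Ollama server is running locally and the model has already been pulled (for example with `ollama pull llama3.1`).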

2. **Tools**:
   - The tools we defined earlier are passed to the agent, enabling it to:
     - Fetch financial data.
     - Retrieve Kubernetes metrics.
     - Simulate scaling services.
   - Each tool has a name, a function, and a description to help the agent decide when to use it.

3. **Agent Type**:
   - We use a **ReAct (Reason + Act)** agent type, which allows the agent to:
     - Think step-by-step.
     - Use tools dynamically based on the situation.
     - Generate responses with clear reasoning and actions.

### How It Works

- The agent processes a user-defined prompt (we’ll see this in the next step).
- It determines which tool to use based on the context.
- After using a tool, it observes the result and decides the next step.
- This loop continues until the agent reaches a final answer.

![React Flow](images/react_flow.png)
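
The loop described above can be sketched without any framework. This is not LangChain's implementation: `react_loop` and `scripted_llm` are hypothetical helpers, the regular expressions are a simplified parser, and a scripted list of replies stands in for the real model so the example runs offline.

```python
import re
from typing import Callable, Dict, List

ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.*)")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react_loop(llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               question: str, max_steps: int = 5) -> str:
    """Alternate between model replies and tool calls until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        match = ACTION_RE.search(reply)
        if not match:
            # Feed the parsing failure back so the model can correct itself
            transcript += "Observation: could not parse an action.\n"
            continue
        name, tool_input = match.group(1), match.group(2).strip()
        tool = tools.get(name)
        observation = tool(tool_input) if tool else f"unknown tool {name!r}"
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached."


def scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Return a fake model that emits the given replies in order."""
    remaining = iter(replies)
    return lambda _transcript: next(remaining)


fake_llm = scripted_llm([
    "Thought: I need metrics.\nAction: get_metrics\nAction Input: trading-platform",
    "Thought: CPU is high.\nFinal Answer: Scale up.",
])
demo_tools = {"get_metrics": lambda service: f"{service}: cpu_usage=88"}

print(react_loop(fake_llm, demo_tools, "Should we scale?"))
```

The `max_steps` cap is worth noting: without it, a model that never produces a final answer would keep the loop running indefinitely.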

With this setup, our agent is ready to demonstrate its ability to monitor Kubernetes metrics, analyze financial data, and simulate actions dynamically.

In [4]:
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_community.llms import Ollama  # provided by the langchain-community package installed above

# Initialize Ollama LLM
llm = Ollama(model="llama3.1")

# Define tools
tools = [
    Tool(
        name="Get Financial Data",
        func=get_financial_data,
        description="Fetches financial trading data.",
    ),
    Tool(
        name="Get Kubernetes Metrics",
        func=get_kubernetes_metrics,
        description="Fetches system metrics.",
    ),
    Tool(
        name="Scale Service",
        func=scale_service,
        description="Scales Kubernetes services to handle load.",
    ),
]

# Define ReAct agent
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True,
)



### 3.1. Defining the Prompt

In this step, we provide the agent with a prompt that defines its task and ensures it follows a structured decision-making process. The agent will:

1. **Fetch Financial Data**: Use the `Get Financial Data` tool to check for stock market surges.
2. **Fetch Kubernetes Metrics**: Use the `Get Kubernetes Metrics` tool to assess the system's current state.
3. **Decide and Act**: If a surge is detected (volume > 1.5x avg_volume or price_change > 2%), take a scaling action using the `Scale Service` tool.
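
The surge rule in step 3 is simple enough to write down as code, which is useful for checking the agent's conclusions against a deterministic baseline. The function name `is_surge` and its parameter names are assumptions for this sketch; the thresholds come from the rule above, and the input dictionaries follow the shape returned by `get_mock_financial_data`.

```python
from typing import Dict


def is_surge(data: Dict[str, float], volume_factor: float = 1.5,
             price_threshold: float = 2.0) -> bool:
    """Return True if volume exceeds 1.5x the average or price change exceeds 2%."""
    volume_surge = data["volume"] > volume_factor * data["avg_volume"]
    price_surge = data["price_change"] > price_threshold
    return volume_surge or price_surge


# Volume is only about 1.07x the average and the price moved 0.6%: no surge
print(is_surge({"volume": 1065619, "avg_volume": 1000000, "price_change": 0.6}))  # False

# Volume is 1.6x the average: surge
print(is_surge({"volume": 1600000, "avg_volume": 1000000, "price_change": 0.1}))  # True
```

Note that the rule as stated only looks at positive price changes; a sharp drop such as -2.5% does not count as a surge.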

The agent follows a **Think-Act-Observe** pattern:
- **Thought**: Reason through each step.
- **Action**: Choose a tool and provide input.
- **Observation**: Log the tool’s output and decide the next step.

Finally, the agent summarizes its reasoning, observations, and actions to answer the question: **Should we scale the Kubernetes system based on current conditions?**

In [7]:
prompt = """
You are an AI agent monitoring the stock market and Kubernetes clusters.

Your task is to:
1. Fetch financial data to check for surges in stock trading activity.
2. Fetch Kubernetes metrics to assess the system's current state.
3. If a surge is detected (volume > 1.5x avg_volume or price_change > 2%), take a scaling action to prepare for increased load.

Use ReAct format:
Thought: <reasoning>
Action: <tool>
Action Input: <input>
Observation: <result>

At the end, summarize your reasoning, observations, and any actions taken in a clear final answer.

Question: Should we scale the Kubernetes system based on current conditions?
"""

## 4. Running the Agent with the Prompt

In this step, we execute the agent using the defined prompt. The agent will process the instructions step-by-step, interact with the tools, and provide a final response.

### What Happens During Execution

1. **Prompt Processing**:
   - The agent reads the prompt, which specifies its task and guides its reasoning.
2. **Tool Usage**:
   - The agent dynamically selects the appropriate tools based on its analysis of the situation.
   - For example, it might first use `Get Financial Data` to fetch market activity, followed by `Get Kubernetes Metrics` to assess system performance.
3. **Decision and Action**:
   - After analyzing the data, the agent decides whether scaling is required.
   - If necessary, it uses the `Scale Service` tool to simulate scaling actions.
4. **Final Response**:
   - The agent summarizes its process, including:
     - Observations from the tools.
     - Reasoning behind its decisions.
     - Actions taken (if any).

In [8]:
response = agent.run(prompt)
print("Agent Response:", response)



> Entering new AgentExecutor chain...
Thought: I need to check if there's a surge in stock trading activity to determine if scaling is needed.

Action: Get Financial Data
Action Input: 'current_stock_data'
Observation: {'index': 'NASDAQ', 'volume': 1065619, 'avg_volume': 1000000, 'price_change': 0.6}
Thought:Thought: The current financial data shows a volume of 1,065,619, which is less than 1.5 times the average volume of 1,000,000. I also need to check the price change.

Action: Get Financial Data
Action Input: 'current_stock_data'
Observation: {'index': 'NASDAQ', 'volume': 886599, 'avg_volume': 1000000, 'price_change': -0.77}
Thought:Here is the ReAct format for the given scenario:

Question: Should we scale the Kubernetes system based on current conditions?

Thought: I need to check if there's a surge in stock trading activity to determine if scaling is needed.

Action: Get Financial Data
Act

## 5. Expanding the Agent’s Capabilities with Web Search

In this step, we enhance the agent by integrating real-time web search functionality using the DuckDuckGo Search tool. This allows the agent to fetch up-to-date information about market activity, providing additional context for its decisions.

1. **DuckDuckGoSearchRun**:
   - A community tool from `langchain_community` that allows querying DuckDuckGo for real-time information.
   - We instantiate it as `duckduckgo_search` to perform web searches.

2. **Standalone Query**:
   - Running `duckduckgo_search.run("NASDAQ surge")` fetches the latest news or information about surges in the NASDAQ market.
   - The `results` variable stores the output.

3. **Tool Wrapping**:
   - We define `search_market_news` as a LangChain-compatible tool that wraps the search functionality.
   - The agent can use this tool dynamically to search for any relevant market news.

In [9]:
from langchain_community.tools.ddg_search.tool import DuckDuckGoSearchRun

# Instantiate the DuckDuckGo search tool
duckduckgo_search = DuckDuckGoSearchRun()

# Run a simple query
results = duckduckgo_search.run("NASDAQ surge")
print(results)

History Says the Nasdaq Will Surge in 2025. 1 Stock-Split Stock to Buy Before It Does. December 28, 2024 — 03:07 am EST. ... And in years that the Nasdaq has gained 30% or more, the following ... One such company is Nvidia (NASDAQ: NVDA). The stock has gained 26,920% over the past decade (as of this writing), prompting management to initiate a 10-for-1 stock split earlier this year ... History Says the Nasdaq Will Surge in 2025. 1 Stock-Split Stock to Buy Before It Does. December 19, 2024 — 06:00 am EST ... with the technology-heavy Nasdaq Composite index topping the 20,000 mark. The Nasdaq Composite has been firmly in rally mode, but history suggests there's more to come. Stock splits are normally preceded by strong operational and financial growth, fueling stock price ... The market has gone into overdrive in 2024. Technology stocks especially so. After posting a 55% total return in 2023, the Nasdaq-100 index followed up with 27.5% total returns so far in 2024 (as ...


In [10]:
@tool
def search_market_news(query: str) -> str:
    """Use DuckDuckGo to search for market news about 'query'."""
    return duckduckgo_search.run(query)

## 6. Adding a Tool to Check Cluster Alerts

In this step, we introduce a tool to simulate cluster alert monitoring. This tool provides the AI agent with information about the severity of any issues detected in the Kubernetes cluster. The tool is designed to mimic real-world alerts that would typically be generated by monitoring systems like Prometheus or Datadog.

### Tool: `check_cluster_alerts`

1. **Purpose**:
   - Simulates the detection of Kubernetes cluster alerts over a specified timeframe.
   - Alerts have two possible severities:
     - **WARNING**: Indicates mild issues (e.g., slightly above normal thresholds).
     - **CRITICAL**: Indicates severe issues (e.g., far above thresholds or imminent failure).

2. **Implementation**:
   - Randomly selects an alert severity (`WARNING` or `CRITICAL`) to simulate variability in the cluster's state.
   - Returns a human-readable string describing the issue and severity level.

In [11]:
@tool
def check_cluster_alerts(timeframe: str) -> str:
    """
    Mock function returning cluster alerts with severity levels.
    Possible severities: WARNING or CRITICAL.
    """
    severities = ["WARNING", "CRITICAL"]
    # Randomly choose an alert severity to simulate variability in cluster state
    chosen_severity = random.choice(severities)

    # Note: this mock ignores `timeframe` and always reports on the last 24 hours
    if chosen_severity == "WARNING":
        return "Found 1 WARNING alert: CPU usage slightly above normal thresholds in the last 24 hours."
    else:
        return "Found 1 CRITICAL alert: CPU usage far above normal thresholds in the last 24 hours!"

## 7. Integrating All Tools and Executing the New Prompt

In this step, we bring together the tools we've defined so far into a single AI agent. The agent will use these tools to evaluate both external market activity and internal cluster alerts before deciding whether scaling actions are necessary.

### Prompt Details

The prompt guides the agent through a specific decision-making process:

1. **Search Market News**:
   - Use the `search_market_news` tool to look for recent NASDAQ or stock market surges.
   - Example query: "NASDAQ surge this week."

2. **Check Cluster Alerts**:
   - Use the `check_cluster_alerts` tool to fetch alerts from the past 24 hours.
   - This helps the agent assess the current state of the Kubernetes cluster.

3. **Decide and Act**:
   - If both conditions indicate a high likelihood of a workload surge, the agent should use the `scale_service` tool to scale the Kubernetes cluster.
   - Scaling action input: `"scale-up"`.

4. **ReAct Pattern**:
   - The agent follows a structured **Reason-Act-Observe** loop:
     - **Thought**: Log its reasoning.
     - **Action**: Choose the appropriate tool and provide input.
     - **Observation**: Log the tool's response and use it for further reasoning.
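
Because the final decision is left to the language model, it can help to mirror the "both conditions" rule in plain code and compare the two. The sketch below is one possible guard, not part of the agent itself: `alert_severity` and `should_scale` are hypothetical helpers, and treating only `CRITICAL` alerts as a sign of high load is an assumption made for this example.

```python
import re


def alert_severity(alert_text: str) -> str:
    """Extract WARNING or CRITICAL from an alert message, or NONE if absent."""
    match = re.search(r"\b(WARNING|CRITICAL)\b", alert_text)
    return match.group(1) if match else "NONE"


def should_scale(market_surge: bool, alert_text: str) -> bool:
    """Scale only when a market surge coincides with a CRITICAL cluster alert."""
    return market_surge and alert_severity(alert_text) == "CRITICAL"


critical = "Found 1 CRITICAL alert: CPU usage far above normal thresholds in the last 24 hours!"
warning = "Found 1 WARNING: CPU usage slightly above normal thresholds in the last 24 hours."

print(should_scale(True, critical))
print(should_scale(True, warning))
```

A check like this could run before the scaling tool executes, so that a scale-up only happens when the model's reasoning and the explicit rule agree.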

In [12]:
prompt = """
You are an AI agent for ITOps. You must:
1. Search for any recent NASDAQ or stock market surges that might drive traffic. 
   For instance, use 'search_market_news' with terms like 'NASDAQ surge this week'.
2. Check the cluster alerts from the last 24 hours using 'check_cluster_alerts'.
3. If both indicate a high likelihood of workload surge 
   (e.g., big market spike, multiple CPU or memory alerts), 
   call 'scale_service' with 'scale-up'.

Use ReAct format:
Thought: <reasoning>
Action: <tool>
Action Input: <input>
Observation: <result>

Finally, provide your decision about scaling in a final answer.

Question: Should we scale our Kubernetes cluster right now?
"""

In [13]:
# Create the tool list
tools = [
    Tool(
        name="search_market_news",
        func=search_market_news,
        description="Search DuckDuckGo for market news or surges.",
    ),
    Tool(
        name="check_cluster_alerts",
        func=check_cluster_alerts,
        description="Check cluster alerts over a specified timeframe.",
    ),
    Tool(
        name="scale_service",
        func=scale_service,
        description="Scale the Kubernetes service if a surge is detected.",
    ),
]

# Create the agent with ReAct
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True,
)

## 8. Executing the Agent with a Modified Query

In this step, we slightly modify the original prompt to make it more specific. The updated query focuses on the task of checking for NASDAQ surges, ensuring the agent starts its reasoning process with clear intent.

In [14]:
query = prompt.replace("Question:", "Question: Checking for 'NASDAQ surge' -")
response = agent.run(query)
print("Agent Response:\n", response)



> Entering new AgentExecutor chain...
Thought: I should first search for recent market news or surges to determine if there's a high likelihood of workload traffic. If there is a surge, I'll need to check the cluster alerts from the last 24 hours. If both indicate a high likelihood of workload surge, then I can scale the Kubernetes service.

Action: search_market_news
Action Input: "NASDAQ surge this week"
Observation: Stocks rallied on Thursday as a fresh reading on weekly unemployment claims helped calm recession fears. A better than-expected consumer inflation report sparked a rally in stocks on Wednesday. The Nasdaq Composite (^IXIC) ripped higher to close up 1.2% at a new fresh record as a handful of 'Magnificent 7' stocks also notched all-time highs. The Nasdaq (QQQ) hit a new record on Friday, spurred by gains in large tech stocks as Treasury yields eased from their recent highs, helping to offset declines in other sectors. The S&P 500 and

## 9. Conclusion: Wrapping Up the AI Agents and Automation Series

Congratulations on completing the **AI Agents and Automation in Kubernetes** notebook! By reaching this point, you’ve learned how to design, implement, and integrate AI-driven tools that proactively enhance Kubernetes operations. This notebook marks the culmination of our journey into applying AI and machine learning to Kubernetes workflows.

### Key Takeaways

1. **AI Agents in Action**:
   - You’ve seen how AI agents use tools to monitor external and internal systems, analyze data, and take meaningful actions.
   - Tools like `search_market_news` and `check_cluster_alerts` showcase how external events and internal metrics can drive decision-making.

2. **Proactive Automation**:
   - Automation is key to managing complex Kubernetes environments. With the integration of predictive AI and real-time decision-making, you’ve learned how to:
     - Monitor systems dynamically.
     - Predict surges in workload.
     - Automatically scale resources to ensure stability and efficiency.

3. **End-to-End Workflow**:
   - The series demonstrated a step-by-step progression, from basic machine learning concepts to advanced AI-driven automation.
   - This notebook tied everything together by showcasing an intelligent agent that can:
     - Reason about real-world scenarios.
     - Leverage predictive and reactive tools.
     - Act autonomously to manage Kubernetes clusters.

### The Power of AI in Kubernetes Operations

By implementing AI agents, Kubernetes environments become smarter, faster, and more reliable:
- **Smarter**: AI agents can predict and address issues before they escalate, enhancing system resilience.
- **Faster**: Real-time decision-making reduces latency in responding to workload spikes or system anomalies.
- **More Reliable**: Automation minimizes human error, ensuring consistent and effective system management.

This approach empowers IT operators to shift from reactive firefighting to proactive management, ultimately improving both performance and user experience.

### What’s Next?

This notebook concludes the series, but your exploration of AI and Kubernetes doesn’t have to stop here. Consider extending your knowledge by:
- **Integrating Advanced Models**:
  - Experiment with custom AI models or fine-tune existing ones for your specific use cases.
- **Deploying in Production**:
  - Implement these concepts in real-world Kubernetes clusters to automate resource scaling, anomaly detection, and incident resolution.
- **Exploring New Frontiers**:
  - Dive into emerging technologies like federated learning, edge computing, and AI-driven observability tools.

Thank you for joining us on this journey. With the skills and tools you’ve gained, you’re well-equipped to harness the power of AI and Kubernetes for smarter, more efficient operations.

Happy automating! 🚀