# 9. LLMs for Kubernetes Operations: Unlocking Insights from Logs and Metrics

## Introduction

Welcome to the 9th notebook in our series on **AI for Kubernetes operations**! In this notebook, we dive into the transformative capabilities of **Large Language Models (LLMs)** and explore how they can enhance the way Kubernetes operators analyze and interpret complex operational data. 

By leveraging LLMs, operators can automate time-consuming tasks such as log analysis, incident summarization, and generating actionable recommendations, freeing them to focus on strategic decisions rather than repetitive manual work.

### Objectives

By the end of this notebook, you will:

1. Understand the foundational principles behind **LLMs** and their transformer-based architecture.
2. Explore how LLMs handle language processing tasks using **self-attention** and parallel processing.
3. Interact with an LLM to perform tasks such as text generation, summarization, and classification.
4. Use **RAG (Retrieval-Augmented Generation)** techniques to ground LLM answers in external data.
5. Combine **LLMs and RAG** workflows to extract actionable insights from complex data, showcasing how these tools can automate and simplify problem-solving.

### What Are LLMs?

**Large Language Models (LLMs)** are advanced AI systems trained on massive datasets of text to understand and generate human-like language. They excel at tasks such as answering questions, summarizing text, generating content, and even reasoning. Famous LLMs include:
- **GPT**: A versatile model known for its fluency and wide range of capabilities.
- **Claude**: A model optimized for safety and conversational clarity.
- **DeepSeek**: An open-weight model family known for strong reasoning and coding performance.
- **LLaMA**: Lightweight and efficient, designed for fine-tuning on specific tasks.
- **Gemini**: A cutting-edge model that combines multimodal understanding with language generation.

LLMs are at the heart of modern AI applications because they can generalize across a wide range of domains and tasks with minimal additional training.

![LLM Evolutionary Tree](https://github.com/Mooler0410/LLMsPracticalGuide/blob/main/imgs/tree.jpg?raw=true)

<p><em>Source: Mooler0410, LLM Practical Guide</em></p>

### Why Were LLMs Created?

LLMs emerged to overcome limitations in earlier NLP models:
1. **Contextual Understanding**:
   - Models like RNNs and LSTMs struggled to grasp long-range dependencies in text. For example, they found it difficult to connect ideas across multiple sentences.
2. **Training Inefficiency**:
   - Sequential processing of input data made earlier models slow to train and scale.
3. **Static Representations**:
   - Traditional word embeddings (like Word2Vec) represented words without understanding their context, leading to ambiguity. For instance, the word "bank" could mean a financial institution or a riverbank.

## Open Source and LLMs

The term **open source** is widely used in the field of generative AI, but it often means different things depending on the model and context. In the world of **Large Language Models (LLMs)**, openness extends beyond simply releasing code or weights. It involves multiple aspects, such as transparency, accessibility, and documentation.

### Dimensions of Openness in LLMs

Openness in LLMs can be viewed as a **gradient** rather than a binary concept. Key dimensions of openness include:
- **Model Weights**: Availability of the trained model weights for fine-tuning or deployment.
- **Training Data Transparency**: Disclosure of the datasets used to train the model, ensuring reproducibility and fairness.
- **Documentation**: The extent to which technical information, such as architecture details, preprints, and datasheets, is made available.
- **Licensing and Access**: Whether the model is freely usable under open licenses and how accessible it is (e.g., via APIs or downloadable packages).

### Levels of Openness

Not all models claiming to be open source are truly open across all dimensions:
- **Fully Open**: Models that release their weights, training data, and comprehensive documentation.
- **Partially Open ("Open Weight")**: Models that release their weights but withhold details about the training data or fine-tuning processes.
- **Closed**: Proprietary models that only provide access via APIs or under restrictive licenses.

![Generative AI Openness Table](https://media.licdn.com/dms/image/v2/D4D22AQGMtO3uYxBJ0Q/feedshare-shrink_2048_1536/feedshare-shrink_2048_1536/0/1690891092800?e=1740614400&v=beta&t=RKa5tJSEuu46Yh7fwumbmslui8q-iwdy6EZMUxNJk2c)

<p><em>Source: <a href="https://pure.mpg.de/rest/items/item_3588217_2/component/file_3588218/content" target="_blank">"Rethinking Open Source Generative AI"</a></em></p>

## First Interaction with an LLM

Now that we’ve explored the foundational concepts behind Large Language Models (LLMs), let’s see them in action. In this section, we’ll interact with an LLM via an endpoint using a simple prompt.

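### Example: Asking the LLM About Kubernetes Logs

We’ll send a straightforward request to the LLM to demonstrate its ability to generate clear and concise responses. Note that this first prompt does not include any actual log data, so the model can only answer in general terms; the RAG sections later in this notebook show how to supply real context.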

## 1. Installing the Required Libraries

Before we start, we need to install the necessary libraries. These include **requests** for calling the LLM endpoint, **langchain** with its Ollama and community integrations for building the RAG pipeline, and **faiss-cpu** for vector similarity search. Run the following cell to install these libraries:

In [None]:
%pip install requests langchain faiss-cpu sentence-transformers langchain_community langchain-ollama neo4j tqdm langchainhub --quiet


## 2. Preparing the Request
First, we define the system and user prompts to guide the model’s behavior and input.

### What is a Prompt?

A prompt is the input we give to an LLM to guide its response. It consists of two main components:

- **System Prompt:** This defines the role, tone, and behavior of the model. It acts as a set of instructions or rules for how the LLM should respond. The system prompt sets the stage for the interaction by shaping the model's personality or context. For example, you can instruct the model to act as a teacher, assistant, or subject matter expert.

  - *Example:* "You are a helpful assistant that provides concise and factual answers to technical questions." 

- **User Prompt:** This is the actual input or question provided by the user. It is typically the main request or query for which the user seeks an answer or action. The quality of the user prompt is key, as clear and specific questions yield more accurate and relevant responses from the model.

   - *Example:* "What are the key features of Large Language Models?"

The way you craft your prompts significantly influences the quality and relevance of the LLM’s response.

### 2.1. Crafting the Prompts

In [None]:
def create_prompt(system_message: str, user_message: str) -> dict:
    """Bundle the system and user messages for the request payload."""
    # Ollama applies the model's chat template (including Llama 3 special
    # tokens such as <|start_header_id|>) on the server side unless the
    # request sets "raw": true, so the messages are passed as plain text.
    return {"system": system_message.strip(), "user": user_message.strip()}

prompts = create_prompt(
    system_message="You are an IT operations assistant. Provide concise and actionable recommendations based on log data and system metrics.",
    user_message="Summarize the last 10 critical errors in the Kubernetes logs and suggest potential fixes.",
)

print("System Prompt:", prompts["system"])
print("User Prompt:", prompts["user"])

### 2.2. Prepare the payload

In [None]:
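def prepare_payload(user_prompt: str, sys_prompt: str, model="llama3.1:latest", temperature=0.0, stop=None):
    # Ollama expects sampling settings under the nested "options" object
    options = {"temperature": temperature}
    if stop:
        options["stop"] = stop  # List of strings that end generation
    return {
        "model": model,
        "prompt": user_prompt,
        "system": sys_prompt,
        "options": options,
        "stream": False,  # Return one JSON object instead of a stream
    }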


# Prepare the payload
payload = prepare_payload(user_prompt=prompts["user"], sys_prompt=prompts["system"])
print("Prepared payload:", payload)

### Key Parameters in LLM Interaction

1. **Model**:
   - Specifies the version of the LLM to be used.
   - Example: `"llama3.1:latest"`, `"gpt-4"`.

2. **Temperature**:
   - Controls the randomness of token sampling.
     - **Low values (e.g., 0.2)**: Generate more predictable, focused answers; `0.0` makes the output close to deterministic.
     - **High values (e.g., 0.8)**: Produce more creative and diverse outputs.

3. **Stop Sequences**:
   - Specifies strings at which the model should stop generating.
   - Example: `["\n"]` ensures the model stops at the end of a line.
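
To make the effect of temperature concrete, the following sketch shows how dividing the model's raw scores (logits) by the temperature reshapes the sampling distribution. The logits and the function name `softmax_with_temperature` are made up for illustration; a real model computes scores over its whole vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.1]
for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass lands on the highest-scoring token, while higher temperatures flatten the distribution and make less likely tokens more probable.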


## 3. Sending the Request

Now that we have prepared the payload, let’s send it to the LLM endpoint and retrieve the response.

In [None]:
import requests
import json

# Endpoint for the LLM (a local Ollama server on its default port)
endpoint = "http://localhost:11434/api/generate"

# Send the request
response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    timeout=120,  # Local models can take a while to answer
)
response.raise_for_status()  # Fail early on HTTP errors

# Parse the response
llm_response = response.json()
print("LLM Response:", llm_response.get("response", "No response received"))

### LLM Interaction with Internal Processes

This diagram represents the flow of interaction with a Large Language Model (LLM), including both the external and internal processes involved when sending a request and receiving a response.

![image](images/llm_flow.png)

1. **Define Request**: The user defines their request to the LLM.
2. **Payload Preparation**
	* **System Prompt**: Sets the role or behavior of the LLM.
	* **User Prompt**: Specifies the question or task to be solved.
	* **Parameters**: Additional settings that shape the response (e.g., temperature, stop sequences).
3. **Send Request to LLM**: The prepared payload is sent to the LLM for processing.
4. **LLM Internal Processing**
	* **Tokenization**: Breaks down input into smaller parts (tokens).
	* **Inference/Computation**: Computes a response based on input tokens.
	* **Detokenization**: Converts output tokens back to human-readable text.
	* **Post-Processing**: Applies final adjustments based on parameters, such as cutting the output at a stop sequence.
5. **Receive & Process Response**: The client receives the generated response and parses it.
6. **Output Result**: The processed response is displayed to the user.
7. **End**: The process concludes once the result is delivered.
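
The tokenization and detokenization steps above can be illustrated with a toy example. This is only a sketch: it splits on whitespace and uses a tiny hypothetical vocabulary, whereas production LLMs use subword tokenizers (such as byte-pair encoding) with vocabularies of tens of thousands of entries.

```python
def build_vocab(corpus):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for unknown words
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab


def tokenize(text, vocab):
    """Map each word to its ID, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


def detokenize(token_ids, vocab):
    """Map IDs back to words and join them into text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in token_ids)


vocab = build_vocab("pod restarted after the node ran out of memory")
ids = tokenize("the pod ran out of memory", vocab)
print(ids)
print(detokenize(ids, vocab))
```

The model itself only ever sees the integer IDs; the text you send and receive is converted at the boundaries.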

## 4. Tuning the Parameters

To refine the model's responses, you can adjust the following parameters in the payload:

- **Temperature**: Controls the variability of the response.
  - Low values (e.g., `0.2`) produce more predictable, nearly deterministic answers.
  - High values (e.g., `0.8`) encourage creative and varied outputs.
- **Max Tokens**: Limits the length of the response to prevent overly long outputs.
- **Stop Sequences**: Defines when the LLM should stop generating text, useful for structured outputs.

Let’s experiment with different parameters:

In [None]:
# Experimenting with parameters
payload = prepare_payload(
    user_prompt=prompts["user"], sys_prompt=prompts["system"], temperature=0.8
)

# Send the request
response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    timeout=120,  # Local models can take a while to answer
)
response.raise_for_status()  # Fail early on HTTP errors

# Parse and display the response
llm_response = response.json()
print("Modified LLM Response:", llm_response.get("response", "No response received"))
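
The max-tokens and stop-sequence settings described above can be set in the same request. The sketch below assumes an Ollama endpoint, where sampling settings live in an `options` object and the token limit is called `num_predict`; the helper name `build_options` and the example prompt are hypothetical.

```python
def build_options(temperature=0.0, max_tokens=None, stop=None):
    """Build the `options` object for an Ollama generate request."""
    options = {"temperature": temperature}
    if max_tokens is not None:
        options["num_predict"] = max_tokens  # Ollama's name for the token limit
    if stop:
        options["stop"] = list(stop)  # Strings that end generation
    return options


payload_with_limits = {
    "model": "llama3.1:latest",
    "prompt": "List three common causes of CrashLoopBackOff.",
    "options": build_options(temperature=0.2, max_tokens=128, stop=["\n\n"]),
    "stream": False,
}
print(payload_with_limits)
```

Other providers use different field names for the same ideas, so check the API reference of the endpoint you are calling.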

## 5. What Is Retrieval-Augmented Generation (RAG)?

**Retrieval-Augmented Generation (RAG)** is a technique that combines the capabilities of Large Language Models (LLMs) with external knowledge sources to enhance responses. By retrieving relevant information from structured or unstructured data, RAG enables LLMs to:
- Provide more accurate, context-aware answers.
- Overcome limitations of static knowledge (e.g., missing recent events or domain-specific details).
- Handle large datasets without the need for retraining.

### How RAG Works
1. **Retrieval**:
   - Extract relevant information from a knowledge source, such as:
     - Graph databases (e.g., service dependencies, team assignments).
     - Structured files (e.g., CSVs for logs or metrics).
     - Unstructured documents (e.g., markdown files for incident reports).
2. **Augmentation**:
   - Combine the retrieved information with the user prompt to provide additional context.
3. **Generation**:
   - Use the LLM to generate a response that incorporates the augmented context.
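
The three steps can be sketched without any external services. In this minimal example, retrieval is a simple word-overlap ranking rather than the embedding search used later in this notebook, and the knowledge snippets, function names, and query are all hypothetical. The generation step is left out: the final string is what would be sent to the LLM.

```python
import re

# Hypothetical knowledge snippets standing in for real documents
KNOWLEDGE = [
    "The payments service depends on the orders database.",
    "Pods in the checkout namespace have a memory limit of 512Mi.",
    "Incident 42: the ingress controller restarted after a bad config push.",
]


def words(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=1):
    """Retrieval: rank documents by how many words they share with the query."""
    query_words = words(query)
    ranked = sorted(documents, key=lambda d: len(words(d) & query_words), reverse=True)
    return ranked[:k]


def augment(query, context_docs):
    """Augmentation: combine the retrieved context with the user question."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


query = "What is the memory limit for checkout pods?"
context_docs = retrieve(query, KNOWLEDGE)
augmented_prompt = augment(query, context_docs)
print(augmented_prompt)  # Generation: this string would be sent to the LLM
```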

### Why Use RAG for Kubernetes Operations?

In Kubernetes environments, operators deal with vast amounts of data from diverse sources. RAG can help:
- **Log Analysis**: Retrieve logs matching specific error codes or timestamps and summarize issues.
- **Service Dependencies**: Query graphs to identify which services might be impacted by a failing node.
- **Configuration Documentation**: Retrieve markdown snippets describing configuration policies to answer questions like, “What is the resource limit for Pod X?”

RAG transforms the LLM into a dynamic assistant that can access real-time, domain-specific knowledge, making it far more effective for IT operations.

```plaintext
          User Query
              ↓
       Data Retrieval
    (e.g., Graph, CSV, Markdown)
              ↓
       Data Augmentation
       (Combine Query + Data)
              ↓
     LLM Processes Input
              ↓
       Contextual Response
```
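
As an illustration of the service-dependency use case, the sketch below finds every service affected by a failing component. A plain dictionary stands in for a graph database here, and the service names and the `impacted_services` helper are hypothetical; in a real setup the same question would be answered by a graph query, and the result would be passed to the LLM as context.

```python
from collections import deque

# Hypothetical dependency graph: each service maps to the services it calls
DEPENDS_ON = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "orders-db"],
    "catalog": ["catalog-db"],
    "payments": ["orders-db"],
    "orders-db": [],
    "catalog-db": [],
}


def impacted_services(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph so edges point from a dependency to its dependents
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first search upward from the failed component
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return sorted(impacted)


print(impacted_services("orders-db", DEPENDS_ON))
```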

## 6. Embeddings and Querying Markdown Files

### What Are Embeddings?
Before we dive into querying markdown files, it's essential to understand how embeddings work. 

Embeddings are dense vector representations of text data. They map text into a numerical space where similar pieces of text are located close together. Think of embeddings as the "GPS coordinates" of text in a high-dimensional space.
  
- **Why Use Embeddings?**
  In Kubernetes operations, embeddings help in:
  - **Log Retrieval**: Finding similar log entries for faster troubleshooting.
  - **Configuration Matching**: Retrieving relevant sections of operational policies or resource limits.
  - **Incident Analysis**: Locating similar incidents from past reports to guide current remediation.
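
"Close together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors to show the idea; real embedding models produce vectors with hundreds or thousands of dimensions, and the values here are not the output of any actual model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three log messages
embeddings = {
    "pod OOMKilled": [0.9, 0.1, 0.0],
    "container out of memory": [0.8, 0.2, 0.1],
    "certificate expired": [0.0, 0.1, 0.9],
}

query = embeddings["pod OOMKilled"]
for text, vector in embeddings.items():
    print(f"{text}: {cosine_similarity(query, vector):.3f}")
```

The two memory-related messages score close to 1, while the unrelated certificate message scores near 0, which is the property retrieval relies on.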

### 6.1. Creating Embeddings from Markdown Files

1. **Loading Markdown Files**:
  Use the `langchain` library to load and preprocess markdown files, splitting them into manageable chunks. Each chunk represents a meaningful section of the document, ensuring the context is preserved.

In [None]:
import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1: Load Markdown files
markdown_dir = "./markdown_files"  # Directory containing markdown files
documents = []
for filename in sorted(os.listdir(markdown_dir)):
    if filename.endswith(".md"):  # Ensure only Markdown files are loaded
        loader = TextLoader(
            os.path.join(markdown_dir, filename), encoding="utf-8"
        )  # Use TextLoader for reading
        documents.extend(loader.load())  # Load and append each document

print(f"Loaded {len(documents)} documents.")

# Step 2: Split the content of documents into smaller chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,  # Maximum size of each chunk, in characters
    chunk_overlap=100,  # Overlap between chunks for better context
    length_function=len,
    add_start_index=True,  # Record each chunk's offset in its metadata
)

# split_documents() carries each source document's metadata over to its chunks
split_documents = splitter.split_documents(documents)

print(f"Split into {len(split_documents)} chunks.")

# Verify the split documents
for split_doc in split_documents[:5]:  # Display the first 5 chunks for verification
    print(f"Chunk metadata: {split_doc.metadata}")
    print(
        f"Chunk content:\n{split_doc.page_content[:500]}"
    )  # Display the first 500 characters
    print("-" * 50)
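
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window chunker. It is only a sketch of the overlap idea under the assumption of fixed-size character windows: `RecursiveCharacterTextSplitter` additionally tries to break on natural separators such as paragraphs and lines, so its chunks will not match this output exactly. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.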

### 6.2. Generating Embeddings

Use embeddings to convert text chunks into dense vectors that represent their semantic meaning.

1. **Choosing an Embedding Model**:
   Here, we use the `OllamaEmbeddings` module for generating embeddings.

In [None]:
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Define embedding model
embedding_model = OllamaEmbeddings(model="llama3.1")

# Generate embeddings
vector_store = FAISS.from_documents(split_documents, embedding_model)
print(f"Vector store created with {vector_store.index.ntotal} embeddings.")
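
Conceptually, querying a vector store is a nearest-neighbor search: the query is embedded and compared against every stored vector. The sketch below does this exhaustively with Euclidean distance over made-up 2-dimensional vectors; it illustrates the idea only and is not how FAISS is implemented internally. The chunk texts, vectors, and function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, indexed_vectors, k=2):
    """Return the k (distance, text) pairs closest to the query vector."""
    scored = [(l2_distance(query_vector, vec), text) for text, vec in indexed_vectors]
    return sorted(scored)[:k]


# Hypothetical chunk texts with made-up embeddings
index = [
    ("Resource limits for the API pods", [0.9, 0.1]),
    ("Steps to rotate TLS certificates", [0.1, 0.9]),
    ("Memory requests and limits policy", [0.8, 0.3]),
]

for distance, text in top_k([1.0, 0.2], index):
    print(f"{distance:.3f}  {text}")
```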

### 6.3. Querying Markdown Files with RAG

1. **Defining the Retrieval Workflow**:
   A retriever helps locate relevant chunks of text based on a user's query.

2. **Creating the Prompt**:
   Combine retrieved context with user queries to ensure the LLM generates accurate and contextual responses.

3. **Integrating the LLM**:
   Use a local model (e.g., `llama3.1`) to generate responses based on the augmented context.

4. **Building the Full Chain**:
   Combine the retriever, prompt, and LLM into a processing pipeline.

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama

retriever = vector_store.as_retriever()


def format_docs(docs):
    """Join the retrieved chunks into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model_local = ChatOllama(model="llama3.1", temperature=0.0)

# Retrieve chunks, format them as context, fill the prompt, then call the LLM
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model_local
    | StrOutputParser()
)

### 6.4. Example Query

Query the vector store to retrieve relevant sections from markdown files, then use the LLM to generate a contextual response.

1. **Query Example**:
   Retrieve insights about a Kubernetes service:

In [None]:
response = chain.invoke(
    "What mechanism does StockTraderX use for low-latency order processing?"
)

print("Response:", response)