<a href="https://colab.research.google.com/github/hamidb201214-svg/Lectures/blob/main/M3_3_NLG_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How Smart Are They? Understanding the Scale of GPT-3 and GPT-4


| Assumption                                  | Description                                                                                       |
|---------------------------------------------|---------------------------------------------------------------------------------------------------|
| **Average Tokens per Book**                 | Estimated at 135,000 tokens per book, based on an average book length of 80,000 to 100,000 words.  |
| **Average Reading Lifetime of an Individual** | Estimated at 510 books per lifetime, assuming a moderate reading habit of 5-12 books per year over 60 years. |
| **Tokens per Word**                         | Estimated at 1.5 tokens per word, accounting for spaces and punctuation.                          |



| Detail                             | GPT-3                                   | GPT-4                                   |
|------------------------------------|-----------------------------------------|-----------------------------------------|
| **Developed By**                   | OpenAI                                  | OpenAI                                  |
| **Approximate Training Data Size** | 45 terabytes of text data               | Larger than GPT-3 (exact size unknown)  |
| **Estimated Token Count**          | 300-400 billion tokens                  | Likely over 500 billion tokens          |
| **Equivalent Number of Books**     | 2,222,222 - 2,962,963 books             | >3,703,704 books                        |
| **Equivalent Knowledge of People** | 4,357 - 5,810 people                    | >7,262 people                           |
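
The book and reader equivalents follow directly from the assumptions table. A minimal sketch of the arithmetic, using only the figures stated in the two tables and rounding to the nearest whole number (the `equivalents` helper is illustrative):

```python
TOKENS_PER_BOOK = 135_000   # 90,000 words (midpoint of 80k-100k) x 1.5 tokens per word
BOOKS_PER_LIFETIME = 510    # 8.5 books per year (midpoint of 5-12) x 60 years


def equivalents(training_tokens: int) -> tuple[int, int]:
    """Return (books, lifetime readers) equivalent to a training-token count."""
    books = training_tokens / TOKENS_PER_BOOK
    readers = books / BOOKS_PER_LIFETIME
    return round(books), round(readers)


for label, tokens in [
    ("GPT-3 (low)", 300_000_000_000),
    ("GPT-3 (high)", 400_000_000_000),
    ("GPT-4 (lower bound)", 500_000_000_000),
]:
    books, readers = equivalents(tokens)
    print(f"{label}: {books:,} books, {readers:,} lifetime readers")
```

Changing either constant shows how sensitive the comparison is to the assumptions: the result is an order-of-magnitude illustration, not a measurement.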


![](https://raw.githubusercontent.com/aaubs/ds-master/main/data/Images/transformermodel_architecture.png)

# Why adapt the language model?

- LMs are trained in a task-agnostic way.
- Downstream tasks can be very different from language modeling on the Pile.
For example, consider the natural language inference (NLI) task (is the hypothesis entailed by the premise?):

      Premise: I have never seen an apple that is not red.
      Hypothesis: I have never seen an apple.
      Correct output: Not entailment (the reverse direction would be entailment)

- The format of such a task may not be very natural for the model.
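
One way to make NLI look more like language modeling is to render the premise and hypothesis as text and let the model continue with a label. The sketch below uses a hypothetical template: the wording, the `format_nli_prompt` helper, and the label strings are assumptions, not a standard format.

```python
def format_nli_prompt(premise: str, hypothesis: str) -> str:
    """Render an NLI example as plain text that a language model can continue."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer 'entailment' or 'not entailment'.\n"
        "Answer:"
    )


prompt = format_nli_prompt(
    premise="I have never seen an apple that is not red.",
    hypothesis="I have never seen an apple.",
)
print(prompt)
```

The prompt ends with `Answer:` so that the most natural continuation is the label itself.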

# Ways downstream tasks can be different

- **Formatting**: for example, NLI takes in two sentences and compares them to produce a single binary output. This is different from generating the next token or filling in MASKs. Another example is the presence of MASK tokens in BERT training vs. no MASKs in downstream tasks.
- **Topic shift**: the downstream task is focused on a new or very specific topic (e.g., medical records)
- **Temporal shift**: the downstream task requires knowledge that was unavailable during pre-training because 1) the knowledge is new (e.g., GPT-3 was trained before Biden became President), or 2) the knowledge is not publicly available.


# Optimizing Large Language Models

There are several options for optimizing large language models:

- Prompt engineering by providing examples (in-context learning)
- Prompt tuning
- Fine-tuning
  - Supervised fine-tuning (SFT): classic fine-tuning that updates all weights
  - Parameter-efficient fine-tuning (PEFT): transfer learning that updates only a small number of weights
  - Reinforcement learning from human feedback (RLHF)

An important question is which of these options is most effective and which can overwrite previous optimizations.

### Understanding Prompt Engineering, Prompt Tuning, and PEFT
These techniques are essential for efficiently adapting large, pre-trained models like GPT or BERT to specialized tasks or domains, optimizing resource usage and reducing training time.


1. **Prompt Engineering (In-Context Learning)**:
   - **Definition**: Crafting input prompts to guide a Large Language Model (LLM) for desired outputs.
   - **Application**: Uses natural language prompts to "program" the LLM, leveraging its contextual understanding.
   - **Model Change**: No alteration to the model's parameters; relies on the model's existing knowledge and interpretive abilities.

2. **Prompt Tuning**:
   - **Difference from Prompt Engineering**: Instead of editing the prompt text, prepends a trainable tensor (soft prompt tokens) to the LLM's input embeddings.
   - **Process**: Fine-tunes this tensor for a specific task and dataset, keeping other model parameters unchanged.
   - **Example**: Adapting a general LLM for specific tasks like sentiment classification by adjusting prompt tokens.

3. **Parameter-Efficient Fine-Tuning (PEFT)**:
   - **Overview**: A set of techniques to enhance model performance on specific tasks or datasets by tuning a small subset of parameters.
   - **Objective**: Targeted improvements without the need for full model retraining.
   - **Relation to Prompt Tuning**: Prompt tuning is one PEFT method: it trains only the added prompt embeddings and leaves the model's own weights frozen.
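
To get a feel for why prompt tuning counts as parameter-efficient, compare the number of trainable values. The sketch below assumes a hypothetical model with 4 billion parameters, a hidden size of 2,560, and 20 virtual prompt tokens. All three numbers are illustrative assumptions, not measurements of a specific model.

```python
def prompt_tuning_trainable_params(num_virtual_tokens: int, hidden_size: int) -> int:
    """Prompt tuning trains one embedding vector per virtual token and nothing else."""
    return num_virtual_tokens * hidden_size


total_params = 4_000_000_000   # assumed model size (full fine-tuning updates all of these)
hidden_size = 2_560            # assumed embedding width
num_virtual_tokens = 20        # assumed soft-prompt length

trainable = prompt_tuning_trainable_params(num_virtual_tokens, hidden_size)
fraction = trainable / total_params

print(f"Full fine-tuning: {total_params:,} trainable parameters")
print(f"Prompt tuning:    {trainable:,} trainable parameters ({fraction:.6%} of the model)")
```

Other PEFT methods train more parameters than this, but the principle is the same: the bulk of the model stays frozen.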



![](https://raw.githubusercontent.com/aaubs/ds-master/main/data/Images/PEFT_LLMs.png)

### Challenges

Fine-tuning models can certainly help to get models to do what you want them to do. However, there are some potential issues:

> - **Catastrophic forgetting**: Fine-tuning on new data can overwrite capabilities the model acquired during pre-training.
> - **Overfitting**: A model fine-tuned on a narrow task or a small dataset can fit that data too closely, so it generalizes poorly and performance on other tasks can suffer.

In general, fine-tuning should be used carefully and with best practices in mind. For example, data quality matters more than quantity, and multiple tasks should be fine-tuned together rather than one after another.
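
Training on several tasks together rather than sequentially can be approximated by shuffling examples from all tasks into a single training set. A minimal sketch with made-up toy examples (the task names, records, and `mix_tasks` helper are hypothetical):

```python
import random


def mix_tasks(task_datasets: dict[str, list[dict]], seed: int = 0) -> list[dict]:
    """Tag each example with its task and shuffle all tasks into one training set."""
    mixed = [
        {"task": task, **example}
        for task, examples in task_datasets.items()
        for example in examples
    ]
    random.Random(seed).shuffle(mixed)  # fixed seed keeps the order reproducible
    return mixed


task_datasets = {
    "sentiment": [
        {"input": "Great movie!", "target": "positive"},
        {"input": "Terrible plot.", "target": "negative"},
    ],
    "nli": [
        {"input": "Premise: A. Hypothesis: B.", "target": "not entailment"},
    ],
}

training_set = mix_tasks(task_datasets)
for row in training_set:
    print(row)
```

Because every batch drawn from `training_set` can contain examples from all tasks, no single task gets the last word on the weights.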

# Applications

Many tools and platforms can be used to build LLM applications:


| Tool                                                                                                    | Category                             | Best For                                                                         | Type        |
| :------------------------------------------------------------------------------------------------------ | :----------------------------------- | :------------------------------------------------------------------------------- | :---------- |
| **[LangChain](https://docs.langchain.com)**                                                             | Orchestration                        | Agents, tools, RAG, observability                                                | Open-source |
| **[Flowise](https://docs.flowiseai.com)**                                                               | App Builder / Orchestration (Visual) | Low-code drag-and-drop LLM apps (chatbots, RAG flows), rapid prototyping         | Open-source |
| **[CrewAI](https://docs.crewai.com)**                                                                   | Agent Orchestration (Multi-agent)    | Role-based multi-agent workflows, task delegation, coordinated tool-using agents | Open-source |
| **[Hugging Face](https://huggingface.co/docs)**                                                         | Model Hub                            | Open models, fine-tuning, hosting                                                | Platform    |
| **[vLLM](https://docs.vllm.ai)** / **[SGLang](https://github.com/sgl-project/sglang)**                  | Serving                              | High-throughput / Structured generation                                          | Open-source |
| **[Ollama](https://github.com/ollama/ollama)** / **[llama.cpp](https://github.com/ggml-org/llama.cpp)** | Local Run                            | Local inference & model management                                               | Open-source |
| **[bitsandbytes](https://huggingface.co/docs/transformers/en/quantization/bitsandbytes)**               | Quantization (4/8-bit)               | Fit models into less VRAM; decent speed/quality tradeoffs                        | Open-source |
| **[Pydantic](https://docs.pydantic.dev/)**                                                              | Validation / Schemas                 | Type-safe data validation; enforce structured outputs and tool I/O               | Open-source |
| **[LlamaIndex](https://docs.llamaindex.ai)**                                                            | Data / RAG                           | Ingestion, indexing, retrieval                                                   | Open-source |
| **[Haystack](https://haystack.deepset.ai)**                                                             | RAG Pipelines                        | Production pipelines, Doc QA                                                     | Open-source |
| **[Semantic Kernel](https://github.com/microsoft/semantic-kernel)**                                     | Orchestration                        | Enterprise workflows (C#/Python)                                                 | Open-source |


In [None]:
!pip install --upgrade transformers

In [None]:
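import gc

import torch

# Drop references to a previously loaded model and tokenizer, if any.
# globals().pop(..., None) avoids a NameError on a fresh runtime where
# these names have not been defined yet.
for name in ("model", "tokenizer"):
    globals().pop(name, None)

# Force garbage collection
gc.collect()

# Clear the PyTorch CUDA cache (only meaningful when a GPU is present)
if torch.cuda.is_available():
    torch.cuda.empty_cache()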


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content) # no opening <think> tag
print("content:", content)
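
The slicing above finds the last occurrence of the `</think>` token and splits the output there. The same logic as a small stand-alone helper, shown on toy token ids (the `split_at_last` name and the example ids are illustrative; 151668 is the id used in the cell above):

```python
def split_at_last(token_ids: list[int], marker_id: int) -> tuple[list[int], list[int]]:
    """Split token_ids just after the last marker_id.

    Returns (up_to_and_including_marker, after_marker). If the marker is
    absent, everything is treated as the final answer.
    """
    try:
        # Search the reversed list to locate the last occurrence.
        index = len(token_ids) - token_ids[::-1].index(marker_id)
    except ValueError:
        index = 0
    return token_ids[:index], token_ids[index:]


END_THINK = 151668  # id of </think> as used in the cell above

thinking_ids, answer_ids = split_at_last([11, 12, END_THINK, 21, 22], END_THINK)
print(thinking_ids, answer_ids)

# No marker: the whole output lands in the answer part.
print(split_at_last([21, 22], END_THINK))
```

If generation stops at `max_new_tokens` before the model emits `</think>`, the marker is missing and the unfinished reasoning ends up in the answer part.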


In [None]:
!nvidia-smi

In [None]:
!pip install -U "bitsandbytes>=0.46.1"

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen3-4B-Thinking-2507"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,   # <-- match fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,              # <-- match fp16
    quantization_config=bnb_config,
)
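
As a rough back-of-the-envelope check on why 4-bit loading helps, weight memory scales with bits per parameter. The sketch below assumes roughly 4 billion parameters (suggested by the model name) and ignores activations, the KV cache, quantization constants, and layers that stay in higher precision, so real usage reported by `nvidia-smi` will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 4e9  # assumption: about 4 billion parameters

for label, bits in [("fp16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gib(num_params, bits):.1f} GiB for weights")
```

Comparing this estimate with the `nvidia-smi` readings before and after quantization gives a sense of how much memory goes to things other than weights.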

In [None]:


# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content) # no opening <think> tag
print("content:", content)


In [None]:
!nvidia-smi

| Tool                                                                                                                     | Category                     | Best For                                                                         | Type        |
| :----------------------------------------------------------------------------------------------------------------------- | :--------------------------- | :------------------------------------------------------------------------------- | :---------- |
| **[Transformers](https://huggingface.co/docs/transformers)**                                                             | Python Inference (HF-native) | `pipeline()` / `generate()`, chat templates, quick prototyping                   | Open-source |
| **[Accelerate](https://huggingface.co/docs/accelerate)**                                                                 | Loading / Sharding           | `device_map="auto"`, CPU/GPU split, offload big models that don’t fit            | Open-source |
| **[huggingface_hub](https://huggingface.co/docs/huggingface_hub)**                                                       | Download / Caching           | `snapshot_download()` / caching / pinned revisions for reproducible loads        | Open-source |
| **[PEFT](https://huggingface.co/docs/peft)**                                                                             | Adapters                     | Load LoRA/adapters on top of a base model (cheap “fine-tune” deployments)        | Open-source |
| **[bitsandbytes](https://huggingface.co/docs/transformers/en/quantization/bitsandbytes)**                                | Quantization (4/8-bit)       | Fit models into less VRAM; decent speed/quality tradeoffs                        | Open-source |
| **[Transformers Quantization (AWQ/GPTQ)](https://huggingface.co/docs/transformers/en/main_classes/quantization)**        | Quantization (algos)         | Using AWQ/GPTQ paths supported by Transformers for inference workflows           | Open-source |
| **[AutoAWQ](https://github.com/casper-hansen/AutoAWQ)**                                                                  | Quantization + Kernels       | INT4 AWQ quantization and fast inference for AWQ checkpoints                     | Open-source |
| **[vLLM](https://docs.vllm.ai)**                                                                                         | Serving                      | High-throughput serving + OpenAI-compatible API server                           | Open-source |
| **[SGLang](https://github.com/sgl-project/sglang)**                                                                      | Serving / Structured Gen     | Low-latency serving + structured generation runtime patterns                     | Open-source |
| **[Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference/en/index)**                    | Serving                      | Classic HF inference server (now in maintenance mode)                            | Open-source |
| **[Ollama](https://github.com/ollama/ollama)**                                                                           | Local Run / Model Mgmt       | “Just run it locally” experience + simple local API                              | Open-source |
| **[llama.cpp](https://github.com/ggml-org/llama.cpp)**                                                                   | Local Run (GGUF)             | CPU-friendly / wide-hardware inference via GGUF models                           | Open-source |
| **[HF Inference Endpoints](https://huggingface.co/docs/inference-endpoints/en/index)**                                   | Managed Serving              | Deploy HF Hub models on managed infra (autoscaling, logs/metrics)                | Managed     |
| **[HF Inference Client / Providers](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client)** | Hosted Inference API         | Call hosted endpoints/providers (and also talk to local servers) with one client | Platform    |


In [None]:
from typing import List, Literal, Optional
from pydantic import BaseModel, Field, ValidationError

from transformers import AutoModelForCausalLM, AutoTokenizer

# -----------------------------
# 1) Pydantic models (schemas)
# -----------------------------
class ChatMessage(BaseModel):
    role: Literal["system", "user", "assistant"]
    content: str = Field(min_length=1)

class GenerationRequest(BaseModel):
    prompt: str = Field(min_length=1)
    max_new_tokens: int = Field(default=512, ge=1, le=4096)

class GenerationResult(BaseModel):
    thinking: str = ""
    answer: str = Field(min_length=1)

# -----------------------------
# 2) Generation code, with validation
# -----------------------------
model_name = "Qwen/Qwen3-4B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

req = GenerationRequest(prompt="Give me a short introduction to large language model.", max_new_tokens=512)

# validate messages
messages: List[ChatMessage] = [ChatMessage(role="user", content=req.prompt)]
messages_dicts = [m.model_dump() for m in messages]  # convert to plain dicts for HF

text = tokenizer.apply_chat_template(
    messages_dicts,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=req.max_new_tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse thinking content: split just after the last </think> token
try:
    index = len(output_ids) - output_ids[::-1].index(151668)  # </think> token id
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

# validate / package output
# An empty answer raises pydantic's ValidationError here,
# so bad data fails loudly instead of passing through silently.
result = GenerationResult(thinking=thinking_content, answer=content)

print("thinking content:", result.thinking)
print("content:", result.answer)


# LangChain
## Deep Agents overview



### Step 1: Install dependencies

In [None]:
!pip install deepagents tavily-python langchain-google-genai

### Step 2: Set up your API keys

In [None]:
import os
from getpass import getpass

os.environ["GEMINI_API_KEY"] = getpass("Enter GEMINI_API_KEY: ").strip()
os.environ["TAVILY_API_KEY"] = getpass("Enter TAVILY_API_KEY: ").strip()

# Confirm that the keys are set without printing their values
print("GEMINI_API_KEY and TAVILY_API_KEY are set for this session.")

### Step 3: Create a search tool

In [None]:
import os
from typing import Literal
from tavily import TavilyClient
from deepagents import create_deep_agent

tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def internet_search(
    query: str,
    max_results: int = 5,
    topic: Literal["general", "news", "finance"] = "general",
    include_raw_content: bool = False,
):
    """Run a web search"""
    return tavily_client.search(
        query,
        max_results=max_results,
        include_raw_content=include_raw_content,
        topic=topic,
    )

### Step 4: Create a deep agent

In [None]:
from langchain.chat_models import init_chat_model

# System prompt to steer the agent to be an expert researcher
research_instructions = """You are an expert researcher. Your job is to conduct thorough research and then write a polished report.

You have access to an internet search tool as your primary means of gathering information.

## `internet_search`

Use this to run an internet search for a given query. You can specify the max number of results to return, the topic, and whether raw content should be included.
"""

# Initialize the Gemini model; the API key is read from the environment variable set earlier.
# The model id uses the provider:model format accepted by init_chat_model.
model = init_chat_model(model="google_genai:gemini-2.5-flash-lite")

agent = create_deep_agent(
    model=model, # Explicitly pass the initialized model
    tools=[internet_search],
    system_prompt=research_instructions
)

### Step 5: Run the agent

In [None]:
result = agent.invoke({"messages": [{"role": "user", "content": "What is langgraph?"}]})

# Print the agent's response
print(result["messages"][-1].content)

In [None]:
result

In [None]:
# %pip -q install -U google-genai

import os
from google import genai

api_key = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)

available = []
for m in client.models.list():
    # Docs example uses m.supported_actions and checks for "generateContent"
    if "generateContent" in getattr(m, "supported_actions", []):
        available.append(m.name.replace("models/", ""))

print("Models that support generateContent:")
for name in available[:50]:
    print(" -", name)


In [None]:
!pip install langchain_huggingface

In [None]:
%pip -q install -U "protobuf>=5.26.1,<6" "grpcio-status>=1.71.2,<2" jedi


In [None]:
!pkill -f vllm || true
!nvidia-smi


In [None]:
!pip install -U langgraph deepagents "langchain[openai]" "langchain[google-genai]"

## Human-in-the-loop

Learn how to configure human approval for sensitive tool operations

Some tool operations may be sensitive and require human approval before execution. Deep agents support human-in-the-loop workflows through LangGraph’s interrupt capabilities. You can configure which tools require approval using the interrupt_on parameter.

In [None]:
import os
import json
import uuid
import getpass
import argparse
from pathlib import Path
from typing import Any, Dict, List, Optional

from langchain.tools import tool
from langchain.chat_models import init_chat_model
from deepagents import create_deep_agent
from langgraph.checkpoint.memory import MemorySaver
from langgraph.types import Command

# -----------------------------
# 0) Provider selection + API key prompting
# -----------------------------
OPENAI_DEFAULT_MODEL = "openai:gpt-4o-mini"
GEMINI_DEFAULT_MODEL = "google_genai:gemini-2.5-flash-lite"

def choose_provider(cli_value: Optional[str]) -> str:
    if cli_value in {"openai", "gemini"}:
        return cli_value

    # Interactive prompt if not provided
    while True:
        choice = input("Choose provider [openai/gemini] (default: openai): ").strip().lower()
        if choice == "":
            return "openai"
        if choice in {"openai", "gemini"}:
            return choice
        print("Please type 'openai' or 'gemini'.")

def ensure_api_key(provider: str) -> None:
    """
    Prompt for the provider's API key if missing, and store in env.
    - OpenAI: OPENAI_API_KEY
    - Gemini: GOOGLE_API_KEY (LangChain checks this first; GEMINI_API_KEY is also supported as fallback)
    """
    if provider == "openai":
        if not os.getenv("OPENAI_API_KEY"):
            key = getpass.getpass("Enter OPENAI_API_KEY (input hidden): ").strip()
            if not key:
                raise RuntimeError("OPENAI_API_KEY was not provided.")
            os.environ["OPENAI_API_KEY"] = key

    elif provider == "gemini":
        # Prefer GOOGLE_API_KEY because that's what LangChain docs show; GEMINI_API_KEY is also accepted.
        if not (os.getenv("GOOGLE_API_KEY") or os.getenv("GEMINI_API_KEY")):
            key = getpass.getpass("Enter GOOGLE_API_KEY (Gemini) (input hidden): ").strip()
            if not key:
                raise RuntimeError("GOOGLE_API_KEY was not provided.")
            os.environ["GOOGLE_API_KEY"] = key

    else:
        raise ValueError("Unknown provider. Use 'openai' or 'gemini'.")

def pick_model_id(provider: str, override: Optional[str]) -> str:
    if override:
        return override
    return OPENAI_DEFAULT_MODEL if provider == "openai" else GEMINI_DEFAULT_MODEL

# -----------------------------
# Storage layout: ./submissions/<student_id>/*
# -----------------------------
ROOT = Path("./submissions").resolve()
ROOT.mkdir(exist_ok=True)

def _student_dir(student_id: str) -> Path:
    p = (ROOT / student_id).resolve()
    if ROOT not in p.parents:
        raise ValueError("Invalid student_id (path traversal blocked).")
    p.mkdir(exist_ok=True)
    return p

def _list_files(student_id: str) -> List[str]:
    d = _student_dir(student_id)
    return sorted([p.name for p in d.iterdir() if p.is_file()])

def _read_file(student_id: str, filename: str) -> str:
    d = _student_dir(student_id)
    p = (d / filename).resolve()
    if d not in p.parents:
        raise ValueError("Invalid filename (path traversal blocked).")
    if not p.exists():
        return ""
    return p.read_text(encoding="utf-8")

def _append_outbox(text: str) -> None:
    outbox = ROOT / "OUTBOX.txt"
    # Append mode creates the file if needed and avoids re-reading it on every write.
    with outbox.open("a", encoding="utf-8") as f:
        f.write(text)

# -----------------------------
# Tools (LangChain)
# -----------------------------
@tool
def list_submission_files(student_id: str) -> List[str]:
    """List files in a student's submission folder."""
    return _list_files(student_id)

@tool
def read_submission_file(student_id: str, filename: str) -> str:
    """Read a file from a student's submission folder."""
    text = _read_file(student_id, filename)
    if text == "":
        return f"(empty or missing) {filename}"
    return text

@tool
def auto_validate(student_id: str) -> Dict[str, Any]:
    """
    Run simple validity checks and return a report + recommended verdict.
    Verdict: 'valid' or 'resubmit'
    """
    files = _list_files(student_id)
    required = {"report.md", "solution.py"}
    missing_files = sorted(list(required - set(files)))

    report = _read_file(student_id, "report.md")
    solution = _read_file(student_id, "solution.py")

    required_headings = ["# Problem", "# Method", "# Results"]
    missing_headings = [h for h in required_headings if h not in report]

    has_required_function = "def solve(" in solution

    issues = []
    if missing_files:
        issues.append(f"Missing required files: {missing_files}")
    if "report.md" in files and missing_headings:
        issues.append(f"Missing required headings in report.md: {missing_headings}")
    if "solution.py" in files and not has_required_function:
        issues.append("solution.py missing required function signature: def solve(...)")

    recommended_verdict = "valid" if not issues else "resubmit"

    recommended_message = (
        "✅ Your submission looks valid. Nice work!"
        if recommended_verdict == "valid"
        else "⚠️ Please fix the issues listed and resubmit."
    )

    return {
        "student_id": student_id,
        "files": files,
        "issues": issues,
        "recommended_verdict": recommended_verdict,
        "recommended_message": recommended_message,
    }

@tool
def record_verdict(student_id: str, verdict: str, notes: str) -> str:
    """
    Record the official verdict (sensitive).
    Writes to ./submissions/verdicts.json
    """
    out = ROOT / "verdicts.json"
    data = json.loads(out.read_text(encoding="utf-8")) if out.exists() else {}
    data[student_id] = {"verdict": verdict, "notes": notes}
    out.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return f"Recorded verdict for {student_id}: {verdict}"

@tool
def message_student(student_id: str, message: str) -> str:
    """
    Mock messaging (sensitive).
    Appends to ./submissions/OUTBOX.txt instead of actually sending email.
    """
    _append_outbox(f"\n=== TO {student_id} ===\n{message}\n")
    return f"Queued message to {student_id} (see submissions/OUTBOX.txt)"

# -----------------------------
# Console HITL "review UI"
# -----------------------------
def _prompt_decision(tool_name: str, args: Dict[str, Any], allowed: List[str]) -> Dict[str, Any]:
    print("\n--- HUMAN REVIEW REQUIRED ---")
    print(f"Tool: {tool_name}")
    print("Proposed args:")
    print(json.dumps(args, indent=2))
    print(f"Allowed decisions: {allowed}")

    while True:
        choice = input("Type approve / reject / edit: ").strip().lower()
        if choice == "approve" and "approve" in allowed:
            return {"type": "approve"}
        if choice == "reject" and "reject" in allowed:
            return {"type": "reject"}
        if choice == "edit" and "edit" in allowed:
            print(
                "Paste edited args as JSON "
                "(e.g. {\"student_id\": \"student_001\", \"verdict\": \"valid\", \"notes\": \"...\"})"
            )
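            try:
                edited_args = json.loads(input("> ").strip())
            except json.JSONDecodeError as exc:
                # Malformed JSON should not crash the review loop; ask again.
                print(f"Invalid JSON ({exc}). Try again.")
                continue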
            return {"type": "edit", "edited_action": {"name": tool_name, "args": edited_args}}
        print("Invalid choice for this tool. Try again.")

# -----------------------------
# Main runner
# -----------------------------
def run(student_id: str, provider: str, model_id: str) -> None:
    # Ensure correct key exists before model init
    ensure_api_key(provider)

    checkpointer = MemorySaver()

    # init_chat_model accepts provider:model identifiers like openai:... and google_genai:...
    model = init_chat_model(model_id)

    agent = create_deep_agent(
        model=model,
        tools=[
            list_submission_files,
            read_submission_file,
            auto_validate,
            record_verdict,
            message_student,
        ],
        system_prompt=(
            "You are a TA agent.\n"
            "Workflow:\n"
            "1) Call auto_validate(student_id).\n"
            "2) Summarize the issues (if any).\n"
            "3) Propose record_verdict(student_id, verdict, notes).\n"
            "4) If helpful, propose message_student(student_id, message).\n"
            "Keep notes short and factual."
        ),
        interrupt_on={
            # Sensitive: human must approve/edit/reject official verdict
            "record_verdict": True,  # default allows approve/edit/reject
            # Sensitive: outbound message needs approval (no edit allowed here)
            "message_student": {"allowed_decisions": ["approve", "reject"]},
            # Safe: no interrupts
            "auto_validate": False,
            "read_submission_file": False,
            "list_submission_files": False,
        },
        checkpointer=checkpointer,
    )

    config = {"configurable": {"thread_id": str(uuid.uuid4())}}
    user_prompt = (
        f"Validate {student_id}. "
        "Run auto checks, then record an official verdict, and message the student with next steps."
    )

    result = agent.invoke({"messages": [{"role": "user", "content": user_prompt}]}, config=config)

    while result.get("__interrupt__"):
        payload = result["__interrupt__"][0].value
        action_requests = payload["action_requests"]
        review_configs = {cfg["action_name"]: cfg for cfg in payload["review_configs"]}

        decisions = []
        for action in action_requests:
            name = action["name"]
            args = action["args"]
            allowed = review_configs[name]["allowed_decisions"]
            decisions.append(_prompt_decision(name, args, allowed))

        result = agent.invoke(Command(resume={"decisions": decisions}), config=config)

    print("\n=== FINAL ASSISTANT MESSAGE ===")
    print(result["messages"][-1].content)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--provider", choices=["openai", "gemini"], help="Model provider")
    parser.add_argument("--model", help="Override model id (e.g., openai:gpt-4o-mini or google_genai:gemini-2.5-flash-lite)")
    parser.add_argument("--student", default="student_001", help="Student submission folder name")
    parser.add_argument("--seed", action="store_true", help="Create a demo submission if missing")
    args, _unknown = parser.parse_known_args()  # ignore extra argv (e.g., flags injected by a Jupyter kernel)

    provider = choose_provider(args.provider)
    model_id = pick_model_id(provider, args.model)

    # Optional demo seed
    if args.seed:
        sid = args.student
        sdir = _student_dir(sid)
        if not (sdir / "report.md").exists():
            (sdir / "report.md").write_text("# Problem\n...\n# Method\n...\n# Results\n...\n", encoding="utf-8")
        if not (sdir / "solution.py").exists():
            (sdir / "solution.py").write_text("def solve(x):\n    return x\n", encoding="utf-8")
        print(f"Seeded demo submission in: {sdir}")

    run(args.student, provider, model_id)


In [None]:
!pip install deepagents langchain-openai

# LangChain

- Build a simple application with LangChain
- Trace your application with LangSmith
- Serve your application with LangServe

LangChain applications are assembled from a handful of building blocks:

- **Model/Chat (LLM) Wrappers**: The language model is the core reasoning engine here. In order to work with LangChain, you need to understand the different types of language models and how to work with them.

- **Prompt Template**: This provides instructions to the language model. This controls what the language model outputs, so understanding how to construct prompts and different prompting strategies is crucial.

- **Memory**: Provides a construct for storing and retrieving messages during a conversation which can be either short term or long term.

- **Indexes**: Help LLMs interact with documents by providing a way to structure them. LangChain provides Document Loaders to load documents, Text Splitters to split documents into smaller chunks, Vector Stores to store documents as embeddings, and Retrievers to fetch relevant documents.

- **Chain**: Probably the most important component of LangChain is the Chain class. It's a wrapper around the LLM that allows you to create a chain of actions.

- **Agents**: Agents are the most powerful feature of LangChain. They allow you to combine LLMs with external data and tools.

- **Callbacks**: The callbacks mechanism lets you hook into different stages of your LLM application through the `callbacks` argument of the API. It is used for logging, monitoring, streaming, and so on.



In this guide we'll cover the main components individually, and then go over how to combine them. Understanding these concepts will set you up well for using and customizing LangChain applications. Most LangChain applications let you configure the model and/or the prompt, so knowing how to take advantage of this will be a big enabler.

## Setup

Installing LangChain is easy. You can install it with pip:

In [None]:
%%time
!pip install langchain langchain_community -qqq

Note that we're also installing langchain_community, which provides the Hugging Face integration used in this tutorial.

## Model (LLM) Wrappers

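Using an open-weight model is as easy as using any other Hugging Face model. We'll use the HuggingFacePipeline wrapper (from LangChain) to make it even easier. The code below loads Qwen2.5-1.5B-Instruct in its default precision. For larger models, such as the 13B version of Llama 2, a GPTQ version (a post-training quantization method for generative pre-trained transformers) is a common way to fit the model on limited hardware: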

In [None]:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline  # integration package installed above
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Load the model and tokenizer from Hugging Face
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a Hugging Face text-generation pipeline with desired parameters
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
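    temperature=0.01,        # Very low temperature: output is almost deterministic
    top_p=0.95,              # Nucleus sampling: keep the smallest token set with cumulative probability >= 0.95
    do_sample=True,          # Sample from the distribution (needed for temperature/top_p to take effect)
    repetition_penalty=1.15  # Penalize tokens that have already appeared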
)

# Wrap the pipeline in LangChain's HuggingFacePipeline LLM wrapper
llm = HuggingFacePipeline(pipeline=pipe)

GPTQ has been shown to quantize GPT-style models down to 4-bit weights with minimal loss of accuracy. This means that GPTQ-quantized models can run on much smaller and cheaper hardware, such as consumer GPUs and laptops.

GPTQ is a promising technique that could make LLMs more accessible to a wider range of users.

Here are some of the benefits of using GPTQ:

- Smaller model size: 4-bit weights take roughly a quarter of the space of 16-bit weights, usually without sacrificing much accuracy. This makes it possible to deploy models on smaller and cheaper hardware.
- Faster inference: GPTQ can also speed up generation, by up to about 4x in favorable settings. This makes real-time applications more practical.
- Lower power consumption: smaller weights mean less memory traffic, which can reduce power draw and makes battery-powered deployment more feasible.
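The size benefit is easy to check with back-of-the-envelope arithmetic. The sketch below only counts bytes for the weights; it is not GPTQ itself, and it ignores the quantization scales, activations, and the KV cache, so real memory use will be somewhat higher. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for label, params in [("1.5B", 1.5e9), ("13B", 13e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{label}: {fp16:.2f} GB at 16-bit, {int4:.2f} GB at 4-bit")
```

Under these assumptions, a 13B model drops from about 26 GB to about 6.5 GB of weights, which is the difference between needing a data-center GPU and fitting on a single consumer card.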

Conveniently, the transformers library supports loading models in GPTQ format using the AutoGPTQ library. The model loaded above is not quantized, so no extra setup is needed here. Let's try out our LLM:

In [None]:
result = llm.invoke(
    "Explain the difference between ChatGPT and open source LLMs in a couple of lines."
)
print(result)

## Exercise 1:

Check the results of different settings for a prompt. You can change temperature, top_p, do_sample, and repetition_penalty in the model configuration and compare the results.
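To build intuition for these settings before running the model, here is a toy sketch of what temperature and top_p do to a next-token distribution. It is a simplified illustration in plain Python, not the transformers implementation, and the four logits are invented values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalize the survivors


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
for t in (1.0, 0.5, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

At a temperature of 0.01 nearly all probability lands on the highest-scoring token, which is why the pipeline above behaves almost deterministically even though sampling is enabled.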

## Prompts and Prompt Templates

One of the most useful features of LangChain is the ability to create prompt templates. A prompt template is a string that contains a placeholder for input variable(s). Let's see how we can use them:

In [None]:
from langchain.prompts import PromptTemplate

# Define the template for generating prompts
template = """
<s>[INST] <<SYS>>
Behave as a teacher and provide an explanation for the following query:
<</SYS>>

{text} [/INST]
"""

# Initialize the PromptTemplate with the specified variables and template
prompt = PromptTemplate(
    input_variables=["text"],  # Specify the variables to be included in the prompt
    template=template,  # Define the template structure
)

In [None]:
text = "How does attention mechanism work? Let's think step by step"

In [None]:
print(prompt.format(text=text))

In [None]:
result = llm.invoke(prompt.format(text=text))
print(result)

## Exercise 2: Basic Prompt Formatting for Sum Calculation

Define a PromptTemplate acting as a calculator that takes two input values and formats a prompt to calculate their sum.

## Exercise 3:

Modify the prompt or question to explore how we can improve the model's performance.

In [None]:
from langchain.prompts import PromptTemplate

template = """
<s>[INST] <<SYS>>
Act as a Machine Learning engineer who is teaching high school students.
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

Each variable in the template must be surrounded by {}. The input_variables argument is a list of the variable names that will be used to format the template. To fill in the template, call the format method of the PromptTemplate instance:

In [None]:
text = "Explain what are Deep Neural Networks in 2-3 sentences"
print(prompt.format(text=text))
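Under the hood, this style of templating is close to Python's own brace formatting. The sketch below uses only the standard library to show the idea: find the placeholder names, then substitute values. The helper `template_variables` is a made-up name for this illustration and is not part of LangChain.

```python
from string import Formatter


def template_variables(template):
    """Return the placeholder names found in a format-style template, in order."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


template = "Act as a {role}. Answer the question: {text}"
print(template_variables(template))

filled = template.format(role="teacher", text="What is a neural network?")
print(filled)
```

Listing the variables explicitly, as input_variables does, lets a missing or misspelled placeholder be caught before the prompt is sent to the model.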

The format method returns a plain string, which can be used directly as input to the LLM:

In [None]:
result = llm.invoke(prompt.format(text=text))
print(result)

## Chain

Probably the most important component of LangChain is the Chain class. It's a wrapper around the LLM that allows you to create a chain of actions. Here's how you can use the simplest chain:

In [None]:
from langchain.chains import LLMChain

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(text)
print(result)

The arguments to the LLMChain class are the LLM instance and the prompt template.

## Exercise 4: Use the LLMChain for Direct Response Generation
Task: Create a new PromptTemplate for a fitness coach explaining the benefits of regular exercise and use LLMChain to generate a response.

### Chaining Chains

On its own, the LLMChain is not that different from using the LLM directly. Chains become more useful when you combine them. We'll create a chain that first explains what Deep Neural Networks are and then gives a few examples of practical applications. Let's start by creating the second chain:

In [None]:
template = "<s>[INST] Use the summary {summary} and give 3 examples of practical applications with 1 sentence explaining each [/INST]"

examples_prompt = PromptTemplate(
    input_variables=["summary"],
    template=template,
)
examples_chain = LLMChain(llm=llm, prompt=examples_prompt)

Now we can reuse our first chain along with the examples_chain and combine them into a single chain using the SimpleSequentialChain class:

In [None]:
from langchain.chains import SimpleSequentialChain

# Create an instance of 'SimpleSequentialChain'. This chain will execute two other chains
# sequentially. The 'chains' parameter is a list of these chains - 'chain' and 'examples_chain'.
multi_chain = SimpleSequentialChain(chains=[chain, examples_chain], verbose=True)

# The 'run' method executes the chains in the order they are listed, passing the output
# of one chain as the input to the next. The final output is then stored in the variable 'result'.
result = multi_chain.run(text)

print(result.strip())
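The idea behind sequential chaining can be sketched without any LLM at all: each step is a function, and the output of one becomes the input of the next. The functions `explain` and `give_examples` below are hypothetical stand-ins that only format strings, so the sketch runs offline.

```python
def run_sequential(steps, first_input):
    """Feed the output of each step into the next one and return the last output."""
    value = first_input
    for step in steps:
        value = step(value)
    return value


# Stand-ins for the two LLM chains (hypothetical; they only format strings).
def explain(topic):
    return f"Summary of {topic}"


def give_examples(summary):
    return f"Three applications based on: {summary}"


result = run_sequential([explain, give_examples], "Deep Neural Networks")
print(result)
```

This single-input, single-output contract is why the second prompt template needs exactly one variable: it receives whatever text the first step produced.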

## Exercise 5: Chaining Multiple Chains Together
Task: Explain a scientific concept and then provide real-world applications.

## Chatbot

LangChain makes it easy to create chatbots. Let's see how we can create a simple chatbot that will answer questions about Deep Neural Networks. We'll use the ChatPromptTemplate class to create a template for the chatbot:

In [None]:
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import AIMessage, HumanMessage

template = "Act as an experienced high school teacher that teaches {subject}. Always give examples and analogies"
human_template = "{text}"

chat_prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate.from_template(template),
        HumanMessage(content="Hello teacher!"),
        AIMessage(content="Welcome everyone!"),
        HumanMessagePromptTemplate.from_template(human_template),
    ]
)

messages = chat_prompt.format_messages(
    subject="Artificial Intelligence", text="What is the most powerful AI model?"
)
messages

We start by creating a system message that will be used to initialize the chatbot. Then we create a human message that will be used to start the conversation. Next, we create an AI message that will be used to respond to the human message. Finally, we create a human message that will be used to ask the question. We can use the format_messages method to format the messages.

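To use our LLM with the messages, we'll pass them to the invoke method. Because HuggingFacePipeline is a text-completion wrapper rather than a chat model, the messages are flattened into a single prompt and the result comes back as a plain string:

In [None]:
result = llm.invoke(messages)
print(result)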
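A text-completion model sees one string, so role-tagged messages have to be rendered into a prompt before generation. Here is a minimal sketch of such rendering. The `Role: content` layout and the helper name `render_messages` are assumptions chosen for illustration; the exact text a given wrapper produces may differ.

```python
def render_messages(messages):
    """Flatten (role, content) pairs into one prompt string, one line per message."""
    return "\n".join(f"{role}: {content}" for role, content in messages)


conversation = [
    ("System", "Act as an experienced high school teacher that teaches Artificial Intelligence."),
    ("Human", "Hello teacher!"),
    ("AI", "Welcome everyone!"),
    ("Human", "What is the most powerful AI model?"),
]

prompt_text = render_messages(conversation)
print(prompt_text)
```

Models tuned on a specific chat format usually respond better when the prompt follows that format, so treat this layout as a readable default rather than the best choice for every model.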

In [None]:
# Requires the `llm` object created in the Model (LLM) Wrappers section.

# Define the initial template for the AI acting as a high school teacher.
teacher_template = "Act as an experienced high school teacher specializing in {subject}. Respond to the student's questions with informative answers, examples, and analogies."

# Set the subject that the teacher specializes in.
subject = "Artificial Intelligence"

# The loop for the interactive conversation.
while True:
    # Get user input.
    user_input = input("You: ")

    # Check for a quit condition.
    if user_input.lower() in ["exit", "quit"]:
        break

    # Construct the complete prompt for the AI model.
    # This includes the role description (teacher_template) and the user's question.
    complete_prompt = teacher_template.format(subject=subject) + "\nStudent asks: " + user_input + "\nTeacher:"

    # Use the language model to generate a response for the assembled prompt.
    ai_response = llm.invoke(complete_prompt)

    # Print the AI's response.
    print("Teacher:", ai_response)

# End the conversation loop.
print("Conversation ended.")
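The loop above sends each question on its own, so the teacher forgets earlier turns. A small short-term memory can be sketched with a bounded queue that keeps only the most recent exchanges. Everything here is illustrative: `ChatHistory` and `fake_llm` are hypothetical names, and `fake_llm` stands in for the real model call so the example runs offline.

```python
from collections import deque


class ChatHistory:
    """Keep the most recent turns and render them into a prompt."""

    def __init__(self, system_prompt, max_turns=3):
        self.system_prompt = system_prompt
        self.turns = deque(maxlen=max_turns)  # old turns drop off automatically

    def add_turn(self, user_text, ai_text):
        self.turns.append((user_text, ai_text))

    def build_prompt(self, user_text):
        lines = [self.system_prompt]
        for asked, answered in self.turns:
            lines.append(f"Student asks: {asked}")
            lines.append(f"Teacher: {answered}")
        lines.append(f"Student asks: {user_text}")
        lines.append("Teacher:")
        return "\n".join(lines)


def fake_llm(prompt):
    # Hypothetical stand-in for the real model call so the sketch runs offline.
    return f"(answer based on {prompt.count('Student asks:')} question(s))"


history = ChatHistory("Act as an experienced high school teacher.", max_turns=2)
for question in ["What is AI?", "What is a neuron?", "What is backpropagation?"]:
    prompt = history.build_prompt(question)
    answer = fake_llm(prompt)
    history.add_turn(question, answer)
    print("Teacher:", answer)
```

Capping the number of stored turns keeps the prompt within the model's context window, at the cost of forgetting the oldest exchanges.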

## Agents

Agents are the most powerful feature of LangChain. They allow you to combine LLMs with external data and tools. Let's see how we can create a simple agent that will use the Python REPL to calculate the square root of a number and divide it by 2:

In [None]:
# Requires: pip install langchain_experimental
from langchain_experimental.agents.agent_toolkits import create_python_agent
from langchain_experimental.tools import PythonREPLTool

agent = create_python_agent(llm=llm, tool=PythonREPLTool(), verbose=True)

result = agent.run("Calculate the square root of a number and divide it by 2")

Python REPL stands for "Read-Eval-Print Loop." It's an interactive environment where you can write Python code and execute it immediately.
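Conceptually, a Python REPL tool takes a code string written by the model, runs it, and hands back whatever was printed. The sketch below shows that idea with the standard library only. It is not LangChain's implementation, `run_python` is a made-up name, and `exec` runs arbitrary code, so this pattern should never be pointed at untrusted input.

```python
import contextlib
import io


def run_python(code):
    """Execute a code string and return whatever it printed.

    Illustration only: exec runs arbitrary code, so never use this on untrusted input.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # fresh globals so snippets do not leak names
    return buffer.getvalue()


output = run_python("from math import sqrt\nprint(sqrt(16) / 2)")
print(output)
```

The agent's job is then to write the code string, read the captured output, and decide whether it answers the question or another step is needed.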

Here's the final answer from our agent:

In [None]:
result

Let's run the code from the agent in a Python REPL:

In [None]:
from math import sqrt

x = 16
y = sqrt(x)
z = y / 2
z

So our agent runs, but it made a mistake in the calculation. This is important: you might hear great things about AI, but it's still not perfect. A more powerful LLM might get this right, so try one out and compare the results.

Here's the response to the same prompt but using ChatGPT:

     Enter a number: 16
     The square root of 16.0 divided by 2 is: 2.0

In [None]:
!pip install wikipedia

In [None]:
import wikipedia

class WikipediaAgent:
    def search(self, query):
        # Search Wikipedia and return the summary of the first result.
        try:
            # Get the page summary for the query
            summary = wikipedia.summary(query)
            return summary
        except wikipedia.exceptions.DisambiguationError as e:
            # If there's a disambiguation issue, return the options.
            return "Disambiguation Error: " + '; '.join(e.options)
        except wikipedia.exceptions.PageError:
            # If the page is not found, inform the user.
            return "Page not found for the query."

# Create an instance of the WikipediaAgent
wiki_agent = WikipediaAgent()

# Example use of the agent to search for a term
result = wiki_agent.search("Artificial Intelligence")
print(result)
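The WikipediaAgent class is really a tool: a function that an agent could call. In an agent setup, the model picks a tool name and an input, and the runtime dispatches the call. A minimal sketch of that dispatch step follows, with toy tools that work offline. The names `TOOLS`, `call_tool`, `sqrt_tool`, and `glossary_tool` are hypothetical and are not LangChain APIs.

```python
import math


def sqrt_tool(text):
    """Toy calculator tool: square root of the number in `text`."""
    return str(math.sqrt(float(text)))


def glossary_tool(text):
    """Toy lookup tool backed by a local dictionary instead of a web request."""
    glossary = {"llm": "A large language model trained to predict text."}
    return glossary.get(text.lower(), "No entry found.")


TOOLS = {"sqrt": sqrt_tool, "glossary": glossary_tool}


def call_tool(name, tool_input):
    """Dispatch to a registered tool and report unknown tool names."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"Unknown tool: {name}"
    return tool(tool_input)


print(call_tool("sqrt", "16"))
print(call_tool("glossary", "LLM"))
print(call_tool("weather", "Aalborg"))
```

Returning an error string for unknown tools, rather than raising, mirrors how the WikipediaAgent class reports missing pages: the caller always gets text it can show or feed back to the model.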