The two `ChatPromptTemplate` objects are quite similar, but they differ in how they are constructed and the structure of the prompt.

### 1. **Using `from_messages`**:

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful AI assistant. Please respond to the question asked."),
        ("user", "Question: {question}")
    ]
)
```

#### How it works:
- **`from_messages`** creates a prompt template using a list of message tuples.
- Each tuple contains two elements:
  1. **Role**: This defines the role of the speaker, such as `"system"` and `"user"`. These roles help the language model distinguish between who is speaking (the system or the user).
  2. **Message**: This is the text of the message that will be used. It can contain placeholders (like `{question}`) which will be filled at runtime with actual data.

#### Use Case:
- This approach is useful when you want to have a more structured conversation flow with clearly defined roles (e.g., system, user). It's often used when different messages need to come from different roles.
  
  For example:
  - The **system** message sets up the role of the assistant.
  - The **user** message provides the context or question dynamically using the `{question}` placeholder.
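
To make the mechanics concrete, here is a minimal plain-Python sketch of what filling a list of `(role, template)` tuples amounts to. It does not use LangChain: `fill_messages` is a hypothetical helper, and the real `ChatPromptTemplate` produces message objects rather than plain tuples.

```python
# Plain-Python illustration only; `fill_messages` is a hypothetical helper,
# not part of LangChain.
def fill_messages(templates, **values):
    """Fill each (role, template) tuple with the given values."""
    return [(role, text.format(**values)) for role, text in templates]


templates = [
    ("system", "You are a helpful AI assistant. Please respond to the question asked."),
    ("user", "Question: {question}"),
]

messages = fill_messages(templates, question="What is LangChain?")
for role, text in messages:
    print(f"{role}: {text}")
```

The system message has no placeholder, so it passes through unchanged; only the user message picks up the runtime value.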

### 2. **Using `from_template`**:

```python
prompt = ChatPromptTemplate.from_template(
    """
Answer the following question based only on provided context:
<context>
{context}
</context>
"""
)
```

#### How it works:
- **`from_template`** creates a prompt template from a single string and wraps it in one human (user) message.
- The string can be laid out however you like, and placeholders such as `{context}` are replaced with actual data at runtime.

#### Use Case:
- This approach is useful when you want to define the structure of the prompt in a more flexible, free-form way.
- It’s especially useful if you need to design complex prompt templates with multiple placeholders or specific formatting, such as showing the context to the model or asking a more detailed question based on provided information.

In this example, the prompt asks the model to answer the question based on the provided context, and the `{context}` placeholder will be replaced with actual context information when the prompt is invoked.

### Key Differences:

1. **Message Structure**:
   - `from_messages`: You explicitly define each message's role (e.g., system, user) and its content.
   - `from_template`: You define a single string template with embedded placeholders; LangChain wraps it in one human message.

2. **Use Case**:
   - `from_messages` is more useful when you need a structured prompt with different roles, such as in a dialogue-based system.
   - `from_template` is more suitable for free-form, single-message prompts, including ones with several placeholders.

3. **Flexibility**:
   - `from_messages`: More structured (you define the roles and the messages), and the only one of the two that lets you set a separate system message.
   - `from_template`: Less ceremony, since the whole prompt is one string, but it cannot express multiple roles.

In this example:

```python
prompt = ChatPromptTemplate.from_template(
    """
Answer the following question based only on provided context:
<context>
{context}
</context>
"""
)
```

The `{context}` placeholder will be dynamically replaced with the value of `context` when you invoke the prompt. It doesn't inherently "remember" the context, but when you provide it as an argument during the prompt invocation, it will be inserted into the prompt at runtime.

#### Key Points:
- **Dynamic Insertion**: The `{context}` placeholder is replaced with actual content (usually provided by you in the invocation of the prompt). For example, the context might be a document or text chunk containing information that the LLM will use to answer the question.
- **One-Time Context**: The context is provided on a per-request basis. It doesn't get stored or remembered between different invocations. If you want to remember context across different queries, you will need to manually manage and feed that context back into the prompt.
  
  In a practical scenario, you can store and append context after each interaction, or if you are working with a document-based context, you might append relevant parts of it to the prompt every time the model is called.

#### Example Usage:

```python
from langchain_core.prompts import ChatPromptTemplate

context = "LangChain is a framework for building applications powered by LLMs."
question = "What is LangChain?"

prompt = ChatPromptTemplate.from_template(
    """Answer the following question based only on the provided context:
<context>
{context}
</context>
Question: {question}"""
)

# Both placeholders are filled in at the moment the template is formatted.
final_prompt = prompt.format(context=context, question=question)
print(final_prompt)
```

This would produce a prompt like the following (the `Human:` prefix appears because `from_template` wraps the string in a human message):

```
Human: Answer the following question based only on the provided context:
<context>
LangChain is a framework for building applications powered by LLMs.
</context>
Question: What is LangChain?
```

Thus, **context is only available for that specific call** to the prompt, and you would need to handle the logic to keep track of context over multiple queries if needed (e.g., by appending previous context to new ones).

### 🚀 What is LCEL?

**LCEL (LangChain Expression Language)** is a *concise, chainable syntax* for composing LangChain components like prompts, models, retrievers, and output parsers in a pipeline. It simplifies building **modular**, **readable**, and **reusable** workflows for language model applications.

---

### 🧱 Example:

```python
chain = prompt | llm | output_parser
```

- `prompt`: prepares the prompt
- `llm`: runs the model (like OpenAI, Llama, Ollama, etc.)
- `output_parser`: formats the model output (e.g., just text)

This is LCEL in action: components connected using the pipe (`|`) operator like Unix pipelines.
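
As a rough illustration of how a pipe operator can chain components, the sketch below overloads `|` in plain Python. `Step` is a hypothetical stand-in, not LangChain's `Runnable`, and the "model" is a fake function so that nothing calls an API.

```python
# Conceptual sketch only: `Step` is a hypothetical stand-in, not LangChain's Runnable.
class Step:
    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


prompt = Step(lambda d: f"Question: {d['question']}")
llm = Step(lambda text: f"[model reply to: {text}]")  # fake model, no API call
output_parser = Step(lambda reply: reply.strip("[]"))

chain = prompt | llm | output_parser
result = chain.invoke({"question": "What is LCEL?"})
print(result)
```

Each `|` returns another `Step`, which is why a chain of any length can still be invoked like a single component.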

---

### ✅ Benefits of LCEL:

1. **Composable**: Easily plug different components together.
2. **Readable**: Clear, linear flow of data.
3. **Testable**: Each component can be tested independently.
4. **Reusable**: Swap out components (e.g., change LLMs or prompt templates) without rewriting logic.

---

### 🔄 Without LCEL vs With LCEL

**Without LCEL:**
```python
prompt_value = prompt.invoke({"question": "What is LCEL?"})  # fill in the template
model_output = llm.invoke(prompt_value)                      # call the model
response = output_parser.invoke(model_output)                # parse the model's message
```

**With LCEL:**
```python
response = (prompt | llm | output_parser).invoke({"question": "What is LCEL?"})
```

Same result, but cleaner and more modular using LCEL.

---

### 🌟 Summary:

**LCEL** makes LangChain development:
- more elegant ✨
- more composable 🔧
- and easier to maintain 📦

It's one of LangChain's most powerful features for chaining operations cleanly.

### 🧠 What is **Groq**?

**Groq** is a company that has built its own ultra-fast, low-latency AI inference engine — **not a model**, but the *hardware and software infrastructure* that runs LLMs and AI workloads **faster** than traditional GPUs and CPUs.

It’s similar to how NVIDIA provides GPUs, but Groq created a new architecture designed specifically for AI inference.

---

### ⚡ What is the **Groq API**?

The **Groq API** gives you access to **pre-loaded large language models** (LLMs) that run on Groq’s custom hardware — specifically their **LPU™ (Language Processing Unit)** chips — through a simple, fast API.

Right now, Groq offers models like:

- **Mixtral 8x7B** (MoE — Mixture of Experts)
- **Gemma**
- **LLaMA models**

They do **not train models** — they optimize how *existing open-source models* are **served at blazing speeds**.

---

### 🚀 What makes Groq special?

1. **Very high speed**: Per-token latency can be as low as a few milliseconds, which Groq positions as faster than typical GPU or TPU inference; actual figures depend on the model and load.
2. **Deterministic latency**: You get consistent response times.
3. **Built for inference only**: Optimized for serving, not training.

---

### 🧩 What is the **LPU Inference Engine**?

The **LPU (Language Processing Unit)** is Groq’s **custom chip architecture** — it’s like their own version of a GPU but optimized **only for AI inference** — especially for **language models**.

#### LPU Inference Engine highlights:

- Designed for **high token throughput** at low latency
- Efficient for **batch and real-time inference**
- Runs on their **GroqNode servers**
- Accessible through the Groq API and integrations such as LangChain

You can think of it like this:
> **LPU = AI Supercharger for LLMs**

---

### 🔗 Use cases for Groq:

- **Chatbots** that require super low-latency
- **RAG applications** with large context windows
- **Streaming responses** for UX like ChatGPT
- Any **high-throughput inference** scenario

---

### 💡 Summary

| Feature               | Groq |
|-----------------------|------|
| **Type**              | Hardware + AI Inference Engine |
| **API**               | Serves ultra-fast open-source LLMs |
| **Special Chip**      | LPU (Language Processing Unit) |
| **Use Case**          | Blazing-fast inference (chatbots, assistants, RAG) |
| **Model Ownership**   | Runs open-source models (not owned/trained by Groq) |

---

If you're building something with **LangChain**, **RAG**, or want to serve a chatbot **without GPU cost/latency**, Groq is a strong option.

# What is LangServe
**LangServe** is a deployment tool developed by the LangChain team to simplify serving LangChain applications as RESTful APIs. It allows developers to expose their LangChain chains, agents, or runnables over HTTP endpoints without writing extensive boilerplate code. Built on top of FastAPI and utilizing Pydantic for data validation, LangServe streamlines the process of transitioning from development to production.

### 🔧 Key Features of LangServe

- **Easy Deployment**: Quickly deploy LangChain components as REST APIs with minimal setup.

- **Automatic API Documentation**: Generates interactive API docs using Swagger and JSONSchema, facilitating easier testing and integration.

- **Efficient Endpoints**: Provides `/invoke`, `/batch`, and `/stream` endpoints to handle various request types efficiently.

- **Streaming Logs**: Offers a `/stream_log` endpoint to stream intermediate steps from your chain or agent, aiding in debugging and monitoring.

- **Client Integration**: Includes a JavaScript client (LangChain.js) to interact with deployed LangServe routes, enabling seamless frontend integration.

### 🚀 Getting Started with LangServe

1. **Installation**:

   Install LangChain and LangServe using pip:

   ```bash
   pip install langchain "langserve[all]"
   ```

2. **Define Your Chain**:

   Create your LangChain chain, agent, or runnable as you normally would.

3. **Deploy with LangServe**:

   Use LangServe to expose your chain as a REST API:

   ```python
   from langserve import add_routes
   from fastapi import FastAPI

   app = FastAPI()
   add_routes(app, your_chain, path="/your-endpoint")
   ```

4. **Run the Server**:

   Start your FastAPI server to serve the API:

   ```bash
   uvicorn your_app:app --reload
   ```
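
Once the server is running, a client calls the `/invoke` endpoint with a JSON body that wraps the chain input under an `"input"` key. The sketch below only builds that request with the standard library and does not send it. It assumes the server listens on uvicorn's default port 8000 and that the chain expects a dict with a `"question"` key; both are illustrative assumptions, and `build_invoke_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumptions: the server from the steps above runs locally on uvicorn's
# default port 8000, and the chain takes a dict with a "question" key.
BASE_URL = "http://localhost:8000/your-endpoint"


def build_invoke_request(payload):
    """Build a POST request for the /invoke endpoint (not sent here)."""
    body = json.dumps({"input": payload}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invoke_request({"question": "What is LangServe?"})
print(request.full_url)

# To actually call a running server:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["output"])
```

The same payload shape applies to `/batch`, except that it takes a list under `"inputs"`.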

#### What is **Message History** in an AI Chatbot?

In the context of AI chatbots, **message history** refers to the **persistent or temporary memory of the past exchanges** between the user and the chatbot — including both the user's inputs and the model's outputs.

This **conversation history** helps make the chatbot **stateful**, meaning it can **"remember" previous messages** and **respond in a more coherent and context-aware way**.

---

#### 🔁 Why is Message History Important?

Most LLMs (like GPT, Claude, etc.) are **stateless by default**, which means:
- Every prompt is treated **independently**, unless you **manually provide context**.
- Without message history, the bot would "forget" what was said earlier in the conversation.

So to **maintain context**, we pass prior messages along with the new user message in the prompt to the model.

---

#### 💡 Example Use Case

#### Without Message History:
```plaintext
User: What's the weather in Delhi?
Bot: It's 34°C and sunny in Delhi.

User: What about tomorrow?
Bot: (Without context, it doesn't know you're still talking about Delhi)
```

#### With Message History:
```plaintext
User: What's the weather in Delhi?
Bot: It's 34°C and sunny in Delhi.

User: What about tomorrow?
Bot: Tomorrow in Delhi, it's expected to be 36°C with some clouds.
```
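
The two transcripts above can be mimicked with a toy stateless "model" that only sees the messages passed in on each call. `fake_model` is a hypothetical stand-in that looks for the word "Delhi"; a real LLM is far more capable, but the statelessness works the same way.

```python
# Toy illustration: `fake_model` is a hypothetical stand-in for a stateless LLM call.
def fake_model(messages):
    """Reply using only the messages passed in on this call."""
    mentions_delhi = any("Delhi" in text for _, text in messages)
    return "Forecast for Delhi tomorrow." if mentions_delhi else "Which city do you mean?"


history = [
    ("user", "What's the weather in Delhi?"),
    ("assistant", "It's 34°C and sunny in Delhi."),
]
follow_up = ("user", "What about tomorrow?")

without_history = fake_model([follow_up])           # only the new message
with_history = fake_model(history + [follow_up])    # prior turns are resent
print(without_history)
print(with_history)
```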

---

#### 🧠 Optional: Store Message History in External Datastores

You can persist the message history in:
- **In-memory (short sessions)**
- **Vector Stores (semantic search)**
- **Databases (structured storage)**
- **Redis (fast key-value store)**

### 🔹 `ChatMessageHistory` (from `langchain_community.chat_message_histories`)
This is a **concrete implementation** of a message history class that stores messages **in memory** for a given session.

#### 🔍 Purpose:
- To **store and retrieve chat messages** (e.g., Human and AI messages) for a session.
- Acts like a log of conversations.

#### 💡 Used When:
You want to **track conversation history** **in RAM** (good for quick testing or non-persistent memory).

#### 📌 Example:
```python
from langchain_community.chat_message_histories import ChatMessageHistory

history = ChatMessageHistory()
history.add_user_message("Hi")
history.add_ai_message("Hello, how can I help?")
```

---

### 🔹 `BaseChatMessageHistory` (from `langchain_core.chat_history`)
This is an **abstract base class** (interface) that all chat message history implementations must inherit from.

#### 🔍 Purpose:
- Defines the **expected methods** like `add_user_message()`, `add_ai_message()`, `messages` (getter).
- Used so LangChain can accept any history backend (memory, Redis, DB, etc.) as long as it conforms to this interface.

You do not instantiate this class directly. You **inherit** from it to create your own history class.

---

### 🔹 `RunnableWithMessageHistory` (from `langchain_core.runnables.history`)
This is a **wrapper around a chain or model** to add **memory support** (i.e., chat history state tracking).

#### 🔍 Purpose:
- Wraps any `Runnable` (like an LLM, chain, or tool).
- Injects **past messages** from the message history into the chain's input before invocation.
- Automatically **appends new inputs/outputs** into the history after invocation.

#### ⚙️ Example:
```python
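# get_session_history(session_id) must return the BaseChatMessageHistory for that session.
wrapped = RunnableWithMessageHistory(my_llm_chain, get_session_history)
config = {"configurable": {"session_id": "abc"}}
response = wrapped.invoke([HumanMessage(content="Tell me a joke")], config=config)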
```

🧠 Internally:
- Before the call, it **loads messages** using the session ID.
- It **appends your new user input** to the message list.
- Then, **calls the chain/model** with the full message history.
- Finally, **stores the model's output** back to the history store.
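
The four internal steps listed above can be sketched in plain Python. None of the names below are LangChain APIs: `store`, `get_history`, and `with_history` are hypothetical, and the "model" is a fake function that reports how many messages it was shown.

```python
# Conceptual sketch only; none of these names are LangChain APIs.
store = {}  # session_id -> list of (role, text) messages


def get_history(session_id):
    """Return (and lazily create) the message list for a session."""
    return store.setdefault(session_id, [])


def with_history(model):
    """Wrap a stateless `model(messages) -> str` so it remembers each session."""
    def invoke(user_text, session_id):
        history = get_history(session_id)            # 1. load past messages
        messages = history + [("user", user_text)]   # 2. append the new input
        reply = model(messages)                      # 3. call the model with everything
        history.extend([("user", user_text), ("assistant", reply)])  # 4. store the turn
        return reply
    return invoke


# Fake model that just reports how many messages it was shown.
chat = with_history(lambda messages: f"I can see {len(messages)} message(s).")

first = chat("Hi", session_id="abc")
second = chat("Tell me a joke", session_id="abc")
other = chat("Hi", session_id="xyz")
print(first, second, other, sep="\n")
```

The second call in session `"abc"` sees three messages (the first turn plus the new input), while session `"xyz"` starts from scratch, which is the isolation the session ID provides.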

---

### 📌 Summary

| Class | Role | Scope |
|-------|------|-------|
| `ChatMessageHistory` | Stores messages in RAM for a session | Concrete implementation |
| `BaseChatMessageHistory` | Abstract interface for history | Foundation for all message memory |
| `RunnableWithMessageHistory` | Adds stateful memory to any chain/model | Wraps logic and history together |

## Difference between `HumanMessage` and `ChatPromptTemplate` invocation: how they work and when to use each

---

## 🔹 1. **Using `HumanMessage` for invocation**

### ✅ What it looks like:

```python
from langchain_core.messages import HumanMessage

response = model.invoke([
    HumanMessage(content="What's my name?")
])
```

### 🔍 How it works:

- You’re directly sending structured chat messages (e.g., `HumanMessage`, `AIMessage`, `SystemMessage`) to the model.
- This is **low-level** control — great when you're using something like `RunnableWithMessageHistory` that **automatically manages history**.
- No templating or formatting required.

### ✅ When to use:

- When you're using memory (`RunnableWithMessageHistory`) and don’t need to format prompts yourself.
- In simple stateful bots where message roles are more important than prompt templates.

---

## 🔹 2. **Using `ChatPromptTemplate` for invocation**

### ✅ What it looks like:

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant."),
    ("human", "{question}")
])

chain = prompt | model
response = chain.invoke({"question": "What's my name?"})
```

### 🔍 How it works:

- You define a **template for a conversation**, using placeholders (e.g., `{question}`).
- It gives you **full control over how prompts are structured** — useful for:
  - RAG (including context)
  - Custom chat flow
  - Advanced agent orchestration

### ✅ When to use:

- When you need **carefully tailored prompts**.
- When you integrate external context (like vector DB results).
- When building **agentic pipelines**.
- When system message plays a key role.

---

## 🥇 Which is the **best technique**?

| Use Case | Best Choice |
|----------|-------------|
| Simple stateful chatbot | `HumanMessage` with memory (`RunnableWithMessageHistory`) |
| Custom instructions, prompt logic, RAG | `ChatPromptTemplate` |
| Building agents or multi-step chains | `ChatPromptTemplate` |
| Quick prototyping | `HumanMessage` |

### 🔔 Rule of Thumb:

- **Need memory & simplicity?** → `HumanMessage` + memory wrapper.
- **Need precision & flexibility?** → `ChatPromptTemplate`.


## **What is "Context" in an LLM?**

**Context** in a Large Language Model (LLM) refers to:

> ✅ **The total amount of text (tokens) that the model can "see" or "remember" in a single interaction.**

This includes:
- The **current prompt** (user input + any system instructions)
- **Prior conversation history** (if any)
- The model’s **own responses**, if they're reused as context
- Any other **background knowledge** passed in the prompt (like retrieved documents in RAG)

---

### 📏 Example: ChatGPT Context Window

| Model | Context Window |
|-------|----------------|
| GPT-3.5 Turbo (original release) | 4K tokens |
| GPT-4 (8K variant) | 8K tokens |
| GPT-4 Turbo | 128K tokens |

---

### 🧮 Tokens ≠ Words  
LLMs don’t measure input by characters or words — they use **tokens**.

🧠 Rule of Thumb:
> 🔹 1 token ≈ 4 characters of English text  
> 🔹 1 token ≈ ¾ of a word (for English)

So:  
**"ChatGPT is awesome!"** ≈ 5 tokens
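
The rule of thumb can be turned into a quick estimate. This is only the 4-characters-per-token heuristic from above, not a real tokenizer, and `estimate_tokens` is a hypothetical helper; actual counts depend on the model's tokenizer.

```python
import math

# Rough heuristic only; real tokenizers give model-specific counts.
def estimate_tokens(text, chars_per_token=4):
    """Estimate the token count from the character count."""
    return math.ceil(len(text) / chars_per_token)


text = "ChatGPT is awesome!"
print(len(text), estimate_tokens(text))
```

The string has 19 characters, and 19 / 4 = 4.75, which rounds up to the estimate of 5 tokens quoted above.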

---

### 🚫 What Happens if You Exceed the Context Window?

If your input (history + prompt + system + retrieved docs) is **too long**, one of two things happens:
- The API **rejects the request** with a context-length error, or
- The application or framework **drops earlier parts of the conversation** to make it fit, which may lead to hallucination or irrelevant answers

---

### 💡 Why Context Matters

- LLMs **don’t have memory** by default (like a brain).  
- They can only **"think" based on what you give them right now**.
- Managing context = managing what the model can "remember".

That’s why tools like:
- **Message history**
- **Trimmers**
- **Summarization chains**
- **RAG (retrieval augmented generation)**  
...are used to **efficiently feed context** to the model.

## `trim_messages` in LangChain

`trim_messages` is a utility that helps **manage the size of the message history** being sent to the LLM. It ensures that the **total token count** of the messages remains **within a specified limit**, like `max_tokens=200`, as measured by the `token_counter` you pass in (a model or a counting function). You can configure:
- `strategy="last"` – Keep the most recent messages.
- `include_system=True` – Retain the system message even when older messages are dropped.
- `allow_partial=False` – Only complete messages are kept (no cutting in the middle).
- `start_on="human"` – Make sure the trimmed history begins with a human message.
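
To show how these options interact, here is a plain-Python sketch of a "keep the last messages" strategy. It is not LangChain's implementation: `trim_last` and `count_tokens` are hypothetical helpers, messages are simple `(role, text)` tuples, and one word counts as one token.

```python
# Plain-Python sketch of a "keep the last messages" strategy; `trim_last` and
# `count_tokens` are hypothetical helpers, not LangChain's implementation.
def count_tokens(text):
    return len(text.split())  # crude stand-in: one token per word


def trim_last(messages, max_tokens, include_system=True, start_on="human"):
    """Keep the newest whole messages that fit within max_tokens."""
    has_system = include_system and bool(messages) and messages[0][0] == "system"
    system = messages[:1] if has_system else []
    budget = max_tokens - sum(count_tokens(text) for _, text in system)

    kept = []
    for role, text in reversed(messages[len(system):]):
        cost = count_tokens(text)
        if cost > budget:
            break  # whole messages only, like allow_partial=False
        kept.insert(0, (role, text))
        budget -= cost

    # Drop leading messages until the history starts on the requested role.
    while kept and kept[0][0] != start_on:
        kept.pop(0)
    return system + kept


messages = [
    ("system", "You are a helpful assistant."),   # 5 tokens
    ("human", "Hi, I am Bob."),                    # 4 tokens
    ("ai", "Hello Bob!"),                          # 2 tokens
    ("human", "What is 2 + 2?"),                   # 5 tokens
    ("ai", "4"),                                   # 1 token
    ("human", "Thanks, and what is my name?"),     # 6 tokens
]

trimmed = trim_last(messages, max_tokens=15)
for role, text in trimmed:
    print(f"{role}: {text}")
```

With a budget of 15, the system message and the last two messages fit, but the kept history would then begin with an AI message, so that one is dropped too. The trimmed result no longer contains Bob's name, which is exactly the trade-off trimming makes.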

---

### ✅ Benefits of `trim_messages`

| 🔹 **Benefit** | 🔍 **Explanation** |
|---------------|--------------------|
| 🚀 **Avoids context overflow** | Keeps message history within the model's context limit. Prevents errors or degraded model performance. |
| 📏 **Token control** | Manages token usage per prompt, helping stay within API or memory limits (e.g., OpenAI GPT-4’s 8K/32K). |
| 🧠 **Keeps relevant history** | Uses strategies like "keep last" to retain only **recent and relevant** context for better coherence. |
| 🧼 **Improves model output** | Helps model focus on the most recent context, avoiding dilution with outdated or irrelevant history. |
| 💰 **Reduces cost** | Fewer tokens passed = Lower API cost on usage-based billing models like OpenAI, Anthropic, etc. |

---

### ✅ Should You Use `trim_messages`?

**Yes, you absolutely should.** Especially if:
- You’re building a **multi-turn chatbot** or assistant.
- You're using **memory via RunnableWithMessageHistory**.
- Your application may involve **long user sessions**.
- You want **predictable token usage** and **performance stability**.

---

### 🔁 Without `trim_messages`:
Your chatbot can grow a giant context, eventually hitting model limits and causing:
- Errors (`context_length_exceeded`)
- Incomplete responses
- Model hallucinations
- Cost inefficiency

---

So yes — `trim_messages` is **essential** for production-grade chatbot memory management. It's like garbage collection for your message history.

## 🔧 What is a `Runnable` in LangChain?

> A `Runnable` is **any object that can be invoked like a function** with `.invoke()` and composed with other Runnables using `|` in a LangChain pipeline.

Think of it as a standardized interface that says:

> “Hey, I know how to accept some input, process it, and return an output.”

It could be:
- An LLM
- A prompt template
- A retriever
- A memory wrapper
- A custom Python function

---

### 🧱 Basic Capabilities of a `Runnable`

All `Runnable`s implement methods like:

| Method             | What it does                                          |
|--------------------|-------------------------------------------------------|
| `.invoke(input)`   | Synchronously runs the logic and returns the output   |
| `.ainvoke(input)`  | Asynchronously runs the logic                         |
| `.batch(inputs)`   | Runs on a list of inputs                              |
| `.stream(input)`   | Streams tokens/output (e.g., from LLM)                |
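
The relationship between these four methods can be sketched with a small class in which everything is derived from `invoke`. `MiniRunnable` is hypothetical and far simpler than LangChain's `Runnable`; in particular, the word-by-word streaming here is just an illustration.

```python
import asyncio

# Conceptual sketch only: `MiniRunnable` is hypothetical and far simpler than
# LangChain's Runnable.
class MiniRunnable:
    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    async def ainvoke(self, value):
        # Run the synchronous logic in a worker thread.
        return await asyncio.to_thread(self.invoke, value)

    def batch(self, values):
        return [self.invoke(v) for v in values]

    def stream(self, value):
        # Yield the output piece by piece (here: word by word).
        for chunk in str(self.invoke(value)).split():
            yield chunk


shout = MiniRunnable(lambda text: text.upper())
print(shout.invoke("hello world"))
print(shout.batch(["a", "b"]))
print(list(shout.stream("hello world")))
print(asyncio.run(shout.ainvoke("hi")))
```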

---

### 🧪 Example:

```python
from langchain_core.runnables import RunnableLambda

# Custom Runnable that doubles a number
double = RunnableLambda(lambda x: x * 2)

print(double.invoke(5))  # Output: 10
```

This is now a component you can plug into a chain:

```python
pipeline = double | some_other_step | llm
```

---

### 🧬 Real Power: **Composable Chains**

Because everything is a Runnable:
```python
chain = prompt | llm | output_parser
```

All of those components (prompt, llm, parser) are `Runnable`s — chained together seamlessly using the pipe (`|`) operator.

---

### 🎯 Why It Matters

The `Runnable` concept is what **unifies** LangChain:
- You can treat LLMs, prompts, retrievers, memory, functions all the same.
- You get a **flexible and readable pipeline** (like UNIX pipes or functional chaining).
- Easier debugging, testing, and modularity.

---

### ✅ Summary

A **Runnable** in LangChain is:
- Any component that knows how to accept input and return output.
- A core building block in pipelines.
- Supports both sync and async workflows.
- Makes LangChain chains powerful, composable, and readable.