* https://docs.google.com/document/d/1b7DlLtvrVxihcO65dsBaIjUU-a6aLjdQfNHSOM2PhUM/edit?tab=t.0#heading=h.v0phdvfalpr2

## setup and tryout

In [1]:
%load_ext dotenv
%dotenv

```sh
pip install langchain-google-genai langchain langchain_openai beautifulsoup4 langchain-community lxml python-dotenv jupyter-black tqdm pandas nest_asyncio tiktoken --upgrade
```

In [23]:
import os

# from openai import OpenAI
import jupyter_black
import tqdm, tqdm.notebook

jupyter_black.load()

assert {"OPENAI_API_KEY", "GOOGLE_API_KEY"} <= set(os.environ)

GEMINI_MODEL_NAME_CLEVER = "gemini-2.0-flash"
GEMINI_MODEL_NAME_FAST = "gemini-2.0-flash-lite"

In [3]:
%%time
from langchain_google_genai import ChatGoogleGenerativeAI

# Make sure to set your GOOGLE_API_KEY environment variable
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite")

response = llm.invoke("What are the best practices for developing with LangChain?")

print(response.content)

Developing with LangChain effectively involves understanding its core components, best practices, and common pitfalls to avoid. Here's a breakdown of best practices, categorized for clarity:

**I. Foundational Principles & Design:**

*   **Understand LangChain's Core Components:**
    *   **Models:** LLMs (e.g., OpenAI, Cohere), Chat Models (e.g., ChatOpenAI, ChatAnthropic), Embeddings models (e.g., OpenAIEmbeddings).
    *   **Prompts:** Prompt templates, examples, variables, and how to structure them for optimal performance.
    *   **Chains:** Sequences of calls to LLMs, other utilities, or even other chains.  Understand different chain types (e.g., Sequential, Router, MapReduce).
    *   **Agents:**  Use LLMs as reasoning engines to select tools and decide actions.
    *   **Memory:**  Manage conversations and context over time. Types include:  `ConversationBufferMemory`, `ConversationBufferWindowMemory`, `ConversationSummaryMemory`.
    *   **Indexes (Document Loaders, Retrievers)

original:

```python
from langchain_openai.chat_models import ChatOpenAI
# chat = ChatOpenAI(openai_api_key="...")
# If you have an environment variable set for OPENAI_API_KEY, you can just do:
chat = ChatOpenAI()
chat.invoke("Hello, how are you?") 
```

In [4]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Make sure to set your GOOGLE_API_KEY environment variable.
# You can get one from Google AI Studio: https://aistudio.google.com/app/apikey

# Initialize the Gemini chat model
# You can also specify other models like "gemini-1.5-pro"
chat = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite")
# Invoke the model with a prompt
response = chat.invoke("Hello, how are you?")
print(response.content)

I am doing well, thank you for asking! As a large language model, I don't experience emotions like humans do, but I am functioning properly and ready to assist you. How can I help you today?


## 85. Chat Models -- Coding

In [None]:
from IPython.display import Markdown

response = chat.invoke("What is the capital of France?")
Markdown(response.content)

In [None]:
response.response_metadata

original:

```python
from langchain_core.messages import HumanMessage, SystemMessage

text = "What would be a good company name for a company that makes colorful socks?"
messages = [HumanMessage(content=text)]
result = chat.invoke(messages)
```

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage

# Ensure your GOOGLE_API_KEY environment variable is set
# 1. Initialize the Gemini chat model
chat = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite")

# 2. Define the message(s) for the model
text = "What would be a good company name for a company that makes colorful socks?"
messages = [
    SystemMessage(content="You are a helpful assistant that generates company names."),
    HumanMessage(content=text),
]

# 3. Invoke the model with the messages
result = chat.invoke(messages)

# 4. Print the AI's response content
print(result.content)

## 86. Chat Prompt Templates
* https://drive.google.com/file/d/1JoyxZlYfngmXnvrRyo7qqvUoB7qz6il0/view?usp=drive_link

In [None]:
from langchain_core.prompts.chat import ChatPromptTemplate

chat_prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant that generates company names"),
        ("human", "{text}"),
    ]
)

result = chat_prompt_template.invoke(
    {
        "text": "What would be a good company name for a company that makes colorful socks?"
    }
)

# model = Cha(model='gpt-4o-mini')

ai_llm_result = chat.invoke(result)
print(ai_llm_result.content)
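
As a rough illustration of what the template step does before the model is called, the sketch below fills `{placeholders}` in a list of role-tagged messages. The `render_messages` helper is hypothetical and relies on plain `str.format`; it is not LangChain's implementation, only a way to see the idea without an API key.

```python
# Hypothetical stand-in for prompt templating; not LangChain's implementation.
def render_messages(template, variables):
    """Fill {placeholders} in each (role, text) pair using str.format."""
    return [(role, text.format(**variables)) for role, text in template]


template = [
    ("system", "You are a helpful assistant that generates company names"),
    ("human", "{text}"),
]
rendered = render_messages(
    template, {"text": "Name a company that makes colorful socks."}
)
for role, text in rendered:
    print(f"{role}: {text}")
```

The rendered list plays the role of `result` in the cell above: it is what gets handed to the chat model.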

## 87. Streaming
* https://drive.google.com/file/d/18sGlOZ8AKwON1CXUMnqf9ONfj7bwjSiO/view?usp=drive_link

In [None]:
import sys
import tqdm, tqdm.notebook

chat = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-lite",
    # streaming=True,
)
for chunk in tqdm.notebook.tqdm(chat.stream("What is the capital of the moon?")):
    print(chunk.content, end="", flush=True)
    sys.stdout.flush()
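
The streaming loop can be tried offline with a fake generator. In this sketch `fake_stream` is a hypothetical stand-in for `chat.stream(...)` that yields plain strings rather than message chunks; the point is only that pieces are printed as they arrive and can be joined afterwards to rebuild the full reply.

```python
from typing import Iterator


def fake_stream(text: str, chunk_size: int = 8) -> Iterator[str]:
    """Hypothetical stand-in for chat.stream(): yields the text in small pieces."""
    for start in range(0, len(text), chunk_size):
        yield text[start : start + chunk_size]


pieces = []
for piece in fake_stream("The Moon has no capital city."):
    print(piece, end="", flush=True)  # show each piece as soon as it arrives
    pieces.append(piece)  # keep the pieces to rebuild the full reply
print()

full_reply = "".join(pieces)
```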

## 88. Output Parsers
* https://drive.google.com/file/d/1QWwi3AOCHEoMR83zR21sB7zzKdUxVdfO/view?usp=drive_link

In [None]:
from typing import List
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

In [None]:
class Joke(BaseModel):
    setup: str = Field(description="The setup to the joke")
    punchline: str = Field(description="The punchline to the joke")


class Jokes(BaseModel):
    jokes: List[Joke] = Field(description="A list of jokes")


parser = PydanticOutputParser(pydantic_object=Joke)

In [None]:
print(parser.get_format_instructions())

In [None]:
template = "Answer the user query.\n{format_instructions}\n{query}"
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt])


messages = chat_prompt.invoke(
    {
        "query": "What is a really funny joke about Python programming?",
        "format_instructions": parser.get_format_instructions(),
    }
)

In [None]:
chat = ChatOpenAI()
## does not work with Gemini
result = chat.invoke(messages)

In [None]:
try:
    joke_object = parser.parse(result.content)
    print(joke_object.setup)
    print(joke_object.punchline)
except Exception as e:
    print(e)
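
To see roughly what the parsing step has to cope with, here is a standard-library sketch. It assumes the model replies with a JSON object, possibly wrapped in a Markdown code fence. `JokeRecord` and `parse_joke` are hypothetical stand-ins for the `Joke` model and `parser.parse`, and the sample reply is made up.

```python
import json
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code fence marker


@dataclass
class JokeRecord:  # hypothetical stand-in for the Pydantic Joke model
    setup: str
    punchline: str


def parse_joke(raw: str) -> JokeRecord:
    """Strip an optional code fence, load the JSON, and check the required fields."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (for example the one tagged json) and the closing fence.
        text = text.split("\n", 1)[1].rsplit(FENCE, 1)[0]
    data = json.loads(text)
    missing = {"setup", "punchline"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return JokeRecord(setup=str(data["setup"]), punchline=str(data["punchline"]))


# Made-up model reply, wrapped in a fence the way chat models often do.
raw_reply = (
    FENCE
    + 'json\n{"setup": "Why do Python programmers wear glasses?", '
    + '"punchline": "Because they cannot C."}\n'
    + FENCE
)
print(parse_joke(raw_reply))
```

A reply that omits a field raises `ValueError`, which is the kind of failure the `try`/`except` in the cell above is there to surface.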

In [None]:
chat = ChatOpenAI(model="gpt-4.1-mini")
structured_llm = chat.with_structured_output(Joke)
result = structured_llm.invoke("What is a really funny joke about Python programming?")

In [None]:
result

In [None]:
class Joke(BaseModel):
    setup: str = Field(description="The setup to the joke")
    punchline: str = Field(description="The punchline to the joke")
    explanation: str = Field(
        description="A detailed explanation of why this joke is funny."
    )


class Jokes(BaseModel):
    jokes: List[Joke] = Field(description="A list of jokes")

In [None]:
chat = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
structured_llm = chat.with_structured_output(Joke)
result = structured_llm.invoke("What is a really funny joke about Python programming?")
result

## 89. Summarizing large amounts of text
* https://colab.research.google.com/drive/11t0e03SThhKRPq9T1M7xg6BcooBFaTkA

### crisp

In [None]:
# from langchain_openai.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain_community.document_loaders import WebBaseLoader
from langchain.chains.summarize import load_summarize_chain
from langchain_core.prompts import PromptTemplate

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

llm = ChatGoogleGenerativeAI(
    temperature=0,
    model=GEMINI_MODEL_NAME_CLEVER,
)
chain = load_summarize_chain(llm, chain_type="stuff")

res = chain.invoke(docs)

In [None]:
res["input_documents"]
Markdown(res["output_text"])

### map reduce

* Limitation: each chunk is summarized independently, so the result suffers when pages refer to each other.
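
The independence issue can be seen in a toy version of the pattern. In this sketch `summarize` is a hypothetical stand-in for an LLM call (it just keeps the first few words), and the two chunks are made up; the second one refers back to a page the map step never sees.

```python
def summarize(text: str, max_words: int = 5) -> str:
    """Hypothetical stand-in for an LLM call: keep the first few words."""
    return " ".join(text.split()[:max_words])


def map_reduce_summary(chunks: list[str]) -> str:
    # Map: each chunk is summarized on its own, with no view of the others.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: the partial summaries are joined and summarized once more.
    return summarize(" ".join(partial_summaries), max_words=12)


chunks = [
    "Agents combine planning, memory and tool use.",
    "As described on the previous page, memory can be short-term or long-term.",
]
print(map_reduce_summary(chunks))
```

The second partial summary keeps the dangling phrase "As described on the previous" because the map step had no access to the first chunk.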

In [None]:
from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import (
    ReduceDocumentsChain,
    MapReduceDocumentsChain,
    StuffDocumentsChain,
)

In [None]:
llm = ChatGoogleGenerativeAI(temperature=0, model=GEMINI_MODEL_NAME_CLEVER)

# Map
map_template = """The following is a set of documents
{docs}
Based on this list of docs, please identify the main themes
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)

# map_chain:
map_chain = LLMChain(llm=llm, prompt=map_prompt)

# Reduce
reduce_template = """The following is a set of summaries:
{doc_summaries}
Take these and distill it into a final, consolidated summary of the main themes.
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)

In [None]:
# Run chain
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="doc_summaries"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)

In [None]:
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)

In [None]:
%%time
# print()
res = map_reduce_chain.invoke(split_docs)

In [None]:
Markdown(res["output_text"])

### template

In [6]:
# These imports make the cell runnable on its own (it previously failed with
# "NameError: name 'PromptTemplate' is not defined" after a kernel restart).
from langchain.chains.summarize import load_summarize_chain
from langchain_core.prompts import PromptTemplate

prompt_template = """Write a concise summary of the following:
{text}
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

refine_template = (
    "Your job is to produce a final summary.\n"
    "We have provided an existing summary up to a certain point: {existing_answer}\n"
    "We have the opportunity to refine the existing summary "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{text}\n"
    "------------\n"
    "Given the new context, refine the original summary. "
    "If the context isn't useful, return the original summary."
)
refine_prompt = PromptTemplate.from_template(refine_template)

# `llm` and `split_docs` come from the map reduce cells above.
chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",
    question_prompt=prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
    input_key="input_documents",
    output_key="output_text",
)
# Refine walks the chunks in order: Page 1 --> Page 2 (Refine) --> Page 3 (Refine)
result = chain.invoke({"input_documents": split_docs}, return_only_outputs=True)

In [None]:
for i, v in enumerate(result["intermediate_steps"]):
    display(Markdown(f"## step {i+1}\n{v}"))

In [None]:
result["output_text"]
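
The refine strategy is essentially a fold over the chunks: each step receives the running summary plus the next chunk. The sketch below shows that control flow only. `refine` is a hypothetical stand-in for the LLM call (it appends the first sentence of each chunk), and the sample chunks are made up.

```python
def refine(existing_summary: str, new_text: str) -> str:
    """Hypothetical stand-in for the refine LLM call: append the chunk's first sentence."""
    first_sentence = new_text.split(".")[0].strip()
    if not existing_summary:
        return f"{first_sentence}."
    return f"{existing_summary} {first_sentence}."


def refine_summary(chunks: list[str]) -> tuple[str, list[str]]:
    summary = ""
    intermediate_steps = []
    # Each step sees the summary built so far, unlike the independent map step.
    for chunk in chunks:
        summary = refine(summary, chunk)
        intermediate_steps.append(summary)
    return summary, intermediate_steps


chunks = [
    "Agents plan. They decompose tasks.",
    "Memory stores context. It can be long-term.",
    "Tools extend the model.",
]
final_summary, steps = refine_summary(chunks)
for number, step in enumerate(steps, start=1):
    print(f"step {number}: {step}")
```

Because every step depends on the previous one, the calls cannot run in parallel, which is the usual trade-off against map reduce.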

## 91. Document Loaders, Text Splitting, Creating LangChain Documents

https://colab.research.google.com/drive/1YdtBCggWStErmFeP5GBSmEaw04kXeKqD

In [7]:
from bs4 import BeautifulSoup
from langchain_community.document_loaders import TextLoader
import requests

# Fetch the README page from GitHub.
# Note: this is the HTML "blob" page, not the raw Markdown file, so the extracted
# text also contains GitHub's navigation and footer text.
url = "https://github.com/hammer-mt/thumb/blob/master/README.md"
r = requests.get(url)

# Extract the text from the HTML:
soup = BeautifulSoup(r.text, "html.parser")
text = soup.get_text()

# Save it locally:
with open("README.md", "w") as f:
    f.write(text)

loader = TextLoader("README.md")
docs = loader.load()

In [9]:
len(docs)

1

In [10]:
from langchain_core.documents import Document

[Document(page_content="test", metadata={"test": "test"})]

[Document(metadata={'test': 'test'}, page_content='test')]

In [12]:
# Split the text into overlapping chunks of at most 300 characters:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)

final_docs = text_splitter.split_documents(loader.load())
len(final_docs)

22
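
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that one prefers to break on separators such as paragraphs and spaces); `split_with_overlap` is a hypothetical helper that only shows how consecutive chunks share characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-window split: each chunk repeats the last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With a window of 10 and an overlap of 3, each chunk starts 7 characters after the previous one, so `hij` ends the first chunk and begins the second.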

In [15]:
from langchain_openai.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

In [18]:
%%time
chain = load_summarize_chain(llm=chat, chain_type="map_reduce")
res = chain.invoke(
    {
        "input_documents": final_docs,
    }
)

CPU times: user 69.4 ms, sys: 50.5 ms, total: 120 ms
Wall time: 15.3 s


In [20]:
res.keys()
res["output_text"]

'This describes a comprehensive GitHub interface, outlining features for code management, collaboration, security, AI-powered tools (Copilot), automation (Actions), and learning. It includes project-specific views, enterprise-grade security and support options, and links to resources, documentation, and community features. The interface encompasses navigation menus, error messages, and user settings, including a feedback form, saved search management, and appearance customization.'

## 92. Tagging Documents
https://colab.research.google.com/drive/1Gn1IxMqz0RcOaDKVdY7cnzgik0JoQlqZ

In [5]:
# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()

In [6]:
from langchain_community.document_loaders.sitemap import SitemapLoader
from langchain_openai.chat_models import ChatOpenAI
from langchain.chains import create_tagging_chain, create_tagging_chain_pydantic
import pandas as pd



In [7]:
sitemap_loader = SitemapLoader(web_path="https://understandingdata.com/sitemap.xml")
sitemap_loader.requests_per_second = 5
docs = sitemap_loader.load()

Fetching pages: 100%|##########| 103/103 [00:03<00:00, 33.36it/s]


In [18]:
# Schema
schema = {
    "properties": {
        "sentiment": {"type": "string"},
        "aggressiveness": {"type": "integer"},
        "primary_topic": {
            "type": "string",
            "description": "The main topic of the document.",
        },
    },
    "required": ["primary_topic", "sentiment", "aggressiveness"],
}

# LLM
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
## does not work with Gemini
# llm = ChatGoogleGenerativeAI(temperature=0, model=GEMINI_MODEL_NAME_CLEVER)
chain = create_tagging_chain(schema, llm, output_key="output")

In [19]:
results = []

# Remove the 0:10 to run on all documents:
for index, doc in enumerate(docs[0:10]):
    print(f"Processing doc {index +1}")
    chain_result = chain.invoke({"input": doc.page_content})
    results.append(chain_result["output"])

Processing doc 1
Processing doc 2
Processing doc 3
Processing doc 4
Processing doc 5
Processing doc 6
Processing doc 7
Processing doc 8
Processing doc 9
Processing doc 10


In [21]:
pd.DataFrame(results)

  sentiment  aggressiveness                         primary_topic
0  positive               0                           AI products
1  positive               3                            technology
2  positive               0                               Contact
3  positive               2  Software & Data Engineering Services
4  positive               0           Software & Data Engineering
5   neutral               0                      Data Engineering
6  positive               0             Data Engineering Services
7  positive               3            React Software Development
8  positive               0           Python software development
9  positive               0           Python software development
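
Since the tagging chain returns plain dicts, it can be worth checking them against the schema before building the DataFrame. The sketch below is one possible check, not part of LangChain: `validate_tags` and `JSON_TYPES` are hypothetical helpers, and only the two JSON types used in the schema above are handled.

```python
# Hypothetical helper: checks one tagging result against the schema dict used above.
JSON_TYPES = {"string": str, "integer": int}

schema = {
    "properties": {
        "sentiment": {"type": "string"},
        "aggressiveness": {"type": "integer"},
        "primary_topic": {"type": "string"},
    },
    "required": ["primary_topic", "sentiment", "aggressiveness"],
}


def validate_tags(tags: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the result matches the schema."""
    problems = []
    for name in schema.get("required", []):
        if name not in tags:
            problems.append(f"missing: {name}")
    for name, spec in schema["properties"].items():
        expected = JSON_TYPES[spec["type"]]
        if name in tags and not isinstance(tags[name], expected):
            problems.append(f"wrong type: {name}")
    return problems


good = {"sentiment": "positive", "aggressiveness": 0, "primary_topic": "AI products"}
bad = {"sentiment": "positive", "aggressiveness": "high"}
print(validate_tags(good, schema))
print(validate_tags(bad, schema))
```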


### with Pydantic

In [29]:
# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()

from langchain_community.document_loaders.sitemap import SitemapLoader
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import create_tagging_chain_pydantic
from pydantic import BaseModel, Field
import pandas as pd


# 1. Pydantic Schema Definition
class DocumentTags(BaseModel):
    """Pydantic model for the tags to be extracted from the document."""

    sentiment: str = Field(
        description="The overall sentiment of the document (e.g., positive, negative, neutral)."
    )
    aggressiveness: int = Field(
        description="A rating from 1 to 10 of how aggressive the text is."
    )
    primary_topic: str = Field(description="The main topic of the document.")


# 2. Load Documents
# Note: This can take a moment to run
sitemap_loader = SitemapLoader(web_path="https://understandingdata.com/sitemap.xml")
sitemap_loader.requests_per_second = 5
docs = sitemap_loader.load()

Fetching pages: 100%|##########| 103/103 [00:02<00:00, 48.45it/s]


In [35]:
docs[0].metadata

{'source': 'https://understandingdata.com/',
 'loc': 'https://understandingdata.com/',
 'lastmod': '2025-08-17T12:52:13.527Z',
 'changefreq': 'monthly',
 'priority': '1.0'}

In [None]:
# 3. Initialize the LLM
# The Gemini variant is left commented out; OpenAI is used here,
# so make sure your OPENAI_API_KEY environment variable is set.
# llm = ChatGoogleGenerativeAI(temperature=0, model="gemini-2.0-flash-lite")
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

# 4. Create the Pydantic Tagging Chain
# This chain is specifically designed to work with Pydantic models
chain = create_tagging_chain_pydantic(DocumentTags, llm)

results = []

# 5. Process Documents
# Using only the first 10 documents for a quick demonstration
for index, doc in tqdm.notebook.tqdm(list(enumerate(docs[:10]))):
    print(f"--- Processing doc {index + 1} ---")

    # The input to invoke is the document content
    chain_result = chain.invoke({"input": doc.page_content})

    # The chain returns a dict; the parsed DocumentTags object is under the "text" key.
    # It is converted to a dict later, when building the DataFrame.
    tag_data = chain_result["text"]
    results.append(tag_data)

    print(tag_data)

In [34]:
# Optional: Display results in a DataFrame
df = pd.DataFrame(map(dict, results))
print("\n--- Final Results ---")
print(df)


--- Final Results ---
  sentiment  aggressiveness                         primary_topic
0  positive               3           Software & Data Engineering
1  positive               3                            technology
2   neutral               2                   Contact Information
3  positive               3  Software & Data Engineering Services
4  positive               3                  Software Engineering
5   neutral               5                      Data Engineering
6  positive               3             Data Engineering Services
7  positive               3            React Software Development
8  positive               3           Python Software Development
9  positive               3           Python Software Development


## 93. Tracing with LangSmith
https://colab.research.google.com/drive/1Sf-_1QP92iuJmFkykCufOYRkOB7tkliU