# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 9: LLM-based Apps with LangChain</font>

# <font color="#003660">Structured Outputs and Chains</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>

<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... will know how to implement structured outputs in LLMs. <br>
        ... will know how to apply this to solve a real-world task in LangChain.
    </font>
</div>
</p>

This content is heavily inspired by the following excellent sources:

* [LangChain Academy](https://academy.langchain.com/)
* [LangChain Docs (Python)](https://python.langchain.com/)

In [None]:
!pip install -U pymupdf4llm datasets transformers accelerate bitsandbytes langchain langchain-community langchain-huggingface

# Computer Scientists and JSON: A Love Story Written in Brackets

[JSON (JavaScript Object Notation)](https://www.json.org/json-en.html) is both loved and hated by computer scientists. The format is especially important in web applications and document databases such as [MongoDB](https://www.mongodb.com/de-de). LLMs are therefore often used to extract JSON from unstructured text [Liu et al. (2024)](https://doi.org/10.1145/3613905.3650756).

Let's try this by prompting the model as we learned in Session 06.

In [None]:
# packages
import os
import re
from typing import List, Optional

import torch
from pydantic import BaseModel, Field
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, set_seed
from langchain_huggingface import HuggingFacePipeline
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate

# pick the best available device: CUDA GPU, then Apple Silicon (MPS), then CPU
DEVICE = (
    "cuda:0" if torch.cuda.is_available()
    else "mps:0" if torch.backends.mps.is_available()
    else "cpu"
)

To run a model directly with Hugging Face, we can simply use the Hugging Face [Pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines).

In [None]:
# build a Hugging Face text-generation pipeline
LLM_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(
    LLM_NAME
)

model = AutoModelForCausalLM.from_pretrained(
    LLM_NAME
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    return_full_text=False,
)

Because the raw `pipe` does not offer LangChain's `invoke` interface, we wrap it in LangChain's `HuggingFacePipeline`.

In [None]:
hf = HuggingFacePipeline(pipeline=pipe)

Now let us add a simple instruction that describes the desired JSON format.

In [None]:
INSTRUCTION = """
Use the following json format for the answer:
{
    "setup":"<The setup of your joke>",
    "punchline":"<The punchline to your joke>",
    "rating":"<Optional rating of how funny your joke is, from 1 to 10>"
}"""

Now let us generate a joke about cats (yes, I couldn't come up with a better example), because we want to store it in MongoDB.

In [None]:
set_seed(1)
prompt = "Tell me a joke about cats"
hopefully_json_response = hf.invoke(prompt + INSTRUCTION)
print(hopefully_json_response)

The result looks good (though not really funny), but it needs some postprocessing to isolate the JSON. We could fine-tune the model to improve this [Escarda-Fernández et al. (2024)](https://ceur-ws.org/Vol-3729/d3_rev.pdf), but usually we want a solution that works out of the box.
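To make the postprocessing step concrete, here is a minimal sketch that pulls the first JSON object out of a free-text model answer using only the standard library. The helper name `extract_first_json_object` and the sample answer in `raw_output` are invented for illustration; they are not the actual output of the model above.

```python
import json
import re


def extract_first_json_object(text: str) -> dict:
    """Return the first parseable JSON object found in `text`."""
    decoder = json.JSONDecoder()
    # Try every "{" as a possible start; raw_decode tolerates trailing text.
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("No JSON object found in model output")


# Hypothetical model answer with chatter around the JSON (illustration only)
raw_output = (
    "Sure! Here is a joke:\n"
    '{"setup": "Why did the cat sit on the computer?", '
    '"punchline": "To keep an eye on the mouse.", "rating": "6"}\n'
    "Hope you like it!"
)

joke = extract_first_json_object(raw_output)
print(joke["punchline"])
```

Note that this only isolates syntactically valid JSON. It does not check that the expected keys are present or that `rating` is a number, which is what the parser introduced next takes care of.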

A simple way to do this is to use [output parsers](https://python.langchain.com/docs/how_to/#output-parsers). In our case, we will use the [`PydanticOutputParser`](https://python.langchain.com/docs/how_to/output_parser_structured/), because it returns a validated Pydantic object that can also be converted to a Python dictionary.

In [None]:
# the Pydantic model
class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline to the joke")
    rating: Optional[int] = Field(
        default=None, description="How funny the joke is, from 1 to 10"
    )
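Before wiring the model into a chain, it can help to see what the schema does on its own. The sketch below validates two JSON strings directly with Pydantic (v2 API assumed); both strings are made up for illustration, and the `Joke` class is repeated only so the block is self-contained.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Joke(BaseModel):
    """Same schema as above, repeated so this sketch runs on its own."""

    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline to the joke")
    rating: Optional[int] = Field(
        default=None, description="How funny the joke is, from 1 to 10"
    )


# Hypothetical JSON strings (illustration only)
valid_json = (
    '{"setup": "Why did the cat join the band?", '
    '"punchline": "It wanted to be the purr-cussionist.", "rating": 7}'
)
invalid_json = '{"setup": "A joke without a punchline", "rating": "very funny"}'

# A well-formed answer becomes a typed Python object
joke = Joke.model_validate_json(valid_json)
print(joke.rating)

# A malformed answer raises a ValidationError that names the offending fields
error_fields = []
try:
    Joke.model_validate_json(invalid_json)
except ValidationError as err:
    error_fields = sorted(e["loc"][0] for e in err.errors())
print(error_fields)
```

The `description` text alone does not enforce the 1 to 10 range; adding constraints such as `Field(ge=1, le=10)` would make Pydantic reject out-of-range ratings as well.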

In [None]:
# Set up a parser
parser = PydanticOutputParser(pydantic_object=Joke)

In [None]:
# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}",
        ),
        ("human", "{query}"),
    ]
).partial(format_instructions=parser.get_format_instructions())

In [None]:
# chain
chain = prompt | hf | parser

In [None]:
set_seed(1)
query = "Tell me a joke about cats"  # separate name so the `prompt` template is not overwritten
pydantic_response = chain.invoke({"query": query})
print(pydantic_response)

In [None]:
print(pydantic_response.model_dump(), type(pydantic_response.model_dump()))
print(pydantic_response.model_dump_json(), type(pydantic_response.model_dump_json()))

In [None]:
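# kill the current kernel process (signal 9 = SIGKILL) to release the memory held by the model;
# the kernel has to restart afterwards, which is why the task below repeats the imports
os.kill(os.getpid(), 9)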

# Your Task

**Exercise Description:**

You are tasked with creating a system that extracts structured information from semi-structured text data describing swim training drills. Your goal is to transform the input into a structured JSON dictionary format that adheres to a predefined schema. This exercise requires you to design a system prompt for an AI model and implement Python classes using Pydantic to validate the extracted data.

### Requirements:

1. **System Prompt Design**:
   - Design a system prompt for an AI model that clearly explains the task of extracting swim training data.
   - Ensure the prompt outlines how to identify key components in the input data and map them to a structured JSON format.

2. **Key Input Entities to Extract**:
   - **Drill-Level Information**:
     - `set_repetitions` (e.g., 4x, 8x, default to 1 if absent)
     - `set_distance` (e.g., 100, 200, ...)
     - `set_instructions` (a combination of specific swim styles and techniques, default to `""` if absent)
     - `form` (e.g., A, B, G, T)
     - `intensity` (e.g., 1-4)
     - `total_distance` (total meters)
     - `total_duration` (total minutes)
     - An optional `rest_period` in seconds (default to 0 if absent)
   - **Set-Level Information**:
     - A collection of **Segments** with:
       - `distance` in meters
       - `instructions`, for example: Ges, Arme, Beine, Tü, K, R, S, Br, Lg, S Beine, K Beine, K Arme, RK, Lgf, Lg25, SK, BrK, Torpedo, butterfly, freestyle, CU, Reißv, LongDog, Hundepd, Entenpd, Kombi, Kontrast, DPS, EBV, AT, HB, Fb, FS, BH, Ff, Pb, SN, Brett, PT, Kanal, PK

     
3. **Output Structure**:
   - Use a JSON dictionary format for the output, ensuring it aligns with the predefined schema.

4. **Implementation with Pydantic**:
   - Implement two classes:
     - **Segment**: Represents a single segment of the drill, including the distance and instructions.
     - **Drill**: Represents the overall drill, containing metadata and a list of segments.

5. **Task Deliverables**:
   - Develop a clear and concise system prompt that can instruct an AI assistant to extract the required entities from semi-structured text input.
   - Implement the **Segment** and **Drill** classes using Pydantic to validate the extracted data.

For example, the input

`"4x100 FB: 25 butterfly, 50 torpedo, 25 freestyle; A2; 400 m; 8 min"`

should be processed to

```
{
    "total_distance": 400,
    "total_duration": 8,
    "form": "A",
    "intensity": 2,
    "set_repetitions": 4,
    "set_distance": 100,
    "set_instructions": "FB",
    "set": [
        {
            "distance": 25,
            "instructions": "butterfly"
        },
        {
            "distance": 50,
            "instructions": "torpedo"
        },
        {
            "distance": 25,
            "instructions": "freestyle"
        }
    ],
    "rest_period": 0
}
```
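Extracted drills can be sanity-checked with simple arithmetic before they are stored. The sketch below is one possible check, not part of the required solution: the helper `check_drill_consistency` is hypothetical, and its two rules (repetitions times set distance equals the total, and the set distance is a multiple of the summed segment distances) are assumptions derived from the example above.

```python
def check_drill_consistency(drill: dict) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []

    # Assumed rule 1: repetitions x set distance should give the total distance
    expected_total = drill["set_repetitions"] * drill["set_distance"]
    if expected_total != drill["total_distance"]:
        problems.append(
            f"total_distance is {drill['total_distance']}, expected {expected_total}"
        )

    # Assumed rule 2: the segments should tile the set distance evenly
    segment_sum = sum(segment["distance"] for segment in drill["set"])
    if segment_sum == 0 or drill["set_distance"] % segment_sum != 0:
        problems.append(
            f"segments sum to {segment_sum}, which does not divide "
            f"set_distance {drill['set_distance']}"
        )
    return problems


# The expected output from the example above
example = {
    "total_distance": 400,
    "total_duration": 8,
    "form": "A",
    "intensity": 2,
    "set_repetitions": 4,
    "set_distance": 100,
    "set_instructions": "FB",
    "set": [
        {"distance": 25, "instructions": "butterfly"},
        {"distance": 50, "instructions": "torpedo"},
        {"distance": 25, "instructions": "freestyle"},
    ],
    "rest_period": 0,
}

print(check_drill_consistency(example))
```

A check like this could also live inside the Pydantic classes as a validator, so that inconsistent model output is rejected at parsing time.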

In [None]:
# packages
import os
import re
from typing import List, Optional

import torch
from pydantic import BaseModel, Field
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, set_seed
from langchain_huggingface import HuggingFacePipeline
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate

# pick the best available device: CUDA GPU, then Apple Silicon (MPS), then CPU
DEVICE = (
    "cuda:0" if torch.cuda.is_available()
    else "mps:0" if torch.backends.mps.is_available()
    else "cpu"
)

In [None]:
# build a Hugging Face text-generation pipeline
LLM_NAME = "Qwen/Qwen2.5-7B-Instruct"  # this task needs the larger 7B model

tokenizer = AutoTokenizer.from_pretrained(
    LLM_NAME
)

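# load the weights in 4-bit NF4 precision (with double quantization) to reduce GPU memory;
# computations still run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)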

model = AutoModelForCausalLM.from_pretrained(
    LLM_NAME,
    quantization_config=bnb_config,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    return_full_text=False,
)

hf = HuggingFacePipeline(pipeline=pipe)
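The cell above loads the 7B model with 4-bit quantization. A rough back-of-the-envelope calculation shows why this matters on a single GPU. The parameter count of `7e9` is the nominal figure from the model name, not an exact count, and the estimate covers the weights only: it ignores quantization constants, activations, and the KV cache, so it should be read as a lower bound.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7e9  # assumption: nominal "7B", not the exact parameter count

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions, the 4-bit weights need roughly a quarter of the memory of the bfloat16 weights.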

In [None]:
# Implement the Segment and Drill class
class Segment(BaseModel):
    # ToDo: Implement
    pass

class Drill(BaseModel):
    # ToDo: Implement
    pass

In [None]:
# Set up a parser
parser = PydanticOutputParser(pydantic_object=Drill)

In [None]:
# Set up a system prompt or prompt with prompt engineering
SYSTEM_PROMPT_TEMPLATE = """
YOUR SYSTEM PROMPT HERE (Optional based on the Prompting Method, if you want the standard system prompt use: You are an assistant that extracts JSON from a semi-structured language input about swim training.)
Do not delete the part below.
Wrap the output in `json` tags\n{format_instructions}
"""

PROMPT_TEMPLATE = """
YOUR PROMPT HERE (Optional based on the Prompting Method.)
{query}"""


# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            SYSTEM_PROMPT_TEMPLATE,
        ),
        ("human", PROMPT_TEMPLATE),
    ]
).partial(format_instructions=parser.get_format_instructions())

In [None]:
# complete the chain
chain = prompt | hf | parser

In [None]:
# test the results
exercise_strings = [
    "4x100: 25 butterfly, 50 torpedo, 25 freestyle; A2; 400 m; 8 min",
    "4x100: 25 Torpedo 50 Tü 25 DPS P15; B3; 400 m; 8 min",
    "4x100: 25 Senso 50 Kontrast 25 DPS; T1; 400 m; 8 min",
    "500: 25 Hundepd 25 KA BrB 25 Kontrast 25 K Faust; T2; 500 m; 9 min",
    "4x150: 50 Torpedo 50 RA SB 50 K DPS P15\"; T2; 600 m; 12 min",
    "4x300 Fb: 100 K CU 100 R Ges 100 K Ges P20\"; 2; 1200 m; 20 min"
]

for exercise in exercise_strings:
    set_seed(1)  # reset the seed so every drill is generated reproducibly
    pydantic_response = chain.invoke({"query": exercise})
    print(pydantic_response.model_dump_json(indent=4))
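When the model returns text that the parser cannot turn into a valid `Drill`, `chain.invoke` raises an exception (LangChain's parsers signal this with `OutputParserException`) and the loop stops at the first bad drill. The sketch below shows one way to keep going and collect the failures for later inspection. The helper `run_all` and the stand-in `fake_invoke` are hypothetical; in the notebook you would pass something like `lambda q: chain.invoke({"query": q})` instead.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def run_all(
    invoke: Callable[[str], Any], queries: Iterable[str]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Call `invoke` on every query and separate successes from failures."""
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for query in queries:
        try:
            results[query] = invoke(query)
        except Exception as err:  # deliberately broad in this sketch
            failures[query] = f"{type(err).__name__}: {err}"
    return results, failures


def fake_invoke(query: str) -> dict:
    """Stand-in for the real chain so the sketch runs without a model."""
    if ";" not in query:
        raise ValueError("could not parse drill")
    return {"raw": query}


results, failures = run_all(
    fake_invoke,
    ["4x100: 25 butterfly, 50 torpedo, 25 freestyle; A2; 400 m; 8 min", "garbage"],
)
print(f"{len(results)} parsed, {len(failures)} failed")
```

Inspecting the collected failures is a practical starting point for improving the system prompt.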