## Models in LangChain
LangChain provides two interfaces for interacting with any LLM API provider:
1. LLMs (base: BaseLLM): take a string in and output a string.
2. Chat models (base: BaseChatModel): newer forms of language models that take messages in and output a message.

### LLMs

In [1]:
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from transformers import Qwen2ForCausalLM, set_seed, TextGenerationPipeline
import torch

set_seed(42)

model_path = 'Qwen/Qwen2.5-1.5B-Instruct'

quantization_cfg = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             # torch_dtype=torch.bfloat16,
                                             quantization_config=quantization_cfg,
                                             low_cpu_mem_usage=True
                                             )
assert isinstance(model, Qwen2ForCausalLM)
hf_pipe = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device_map='auto',
                   max_new_tokens=500)
assert isinstance(hf_pipe, TextGenerationPipeline)
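# HuggingFacePipeline is a BaseLLM (string in, string out).
# Note: this rebinds the name `pipeline`, shadowing the transformers `pipeline` factory imported above.
pipeline = HuggingFacePipeline(pipeline=hf_pipe)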


Device set to use cuda:0


In [113]:
type(tokenizer)

transformers.models.qwen2.tokenization_qwen2_fast.Qwen2TokenizerFast

In [2]:
pipeline.invoke("The sky is ",
                pipeline_kwargs=dict(max_new_tokens=20))

'The sky is 120 feet above the base of a building. From a point on the ground, the angle'

In [3]:
# Chat-style prompting with a BaseLLM.
# Produces the same result as the hf_pipe(messages) call in the next cell.
user_input = "what is 1+1 equal to ? explain why in shortly."
messages = [
    # The default system message is added automatically by the chat template:
    # {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": user_input},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# HuggingFacePipeline is string-to-string: it does not call apply_chat_template itself.
# It leaves that to the underlying HF pipeline, which applies the template only when the input
# is not a str or list[str], and which does not pass tools to the template.
# Since HuggingFacePipeline hands the pipeline a string, the template is generally not applied.

# Because we pass a string here, it MUST be produced by apply_chat_template first.
# Otherwise (with the raw HF pipeline) we could pass the messages (list[dict]) directly.
print(pipeline.invoke(
    prompt,
    pipeline_kwargs=dict(top_k=1,
                         return_full_text=False)
))
print("===========================================================================")
# HuggingFacePipeline does not pass bound tools to the chat template, so the tools must be rendered into the templated input by hand.
# HuggingFacePipeline.invoke also returns a plain string; the chat structure is not preserved.
# If we pass structured input, it is converted to a string before it reaches the pipeline, so the pipeline does not apply the chat template.

The sum of one plus one equals two.

In mathematics, addition is the operation of combining two or more numbers into a single number. When you add one to another one, you get a total of two. This can be represented as:

1 + 1 = 2

This simple equation demonstrates the basic principle of counting and quantity. It's a fundamental concept that forms the basis for many other mathematical operations and calculations.


In [4]:
# Both have the same result
# HF pipeline will automatically apply_chat_template with add_generation_prompt = True
print(hf_pipe(messages, top_k=1)[0]['generated_text'][-1]['content'])
print("===========================================================================")

The sum of one plus one equals two.

In mathematics, addition is the operation of combining two or more numbers into a single number. When you add one to another one, you get a total of two. This can be represented as:

1 + 1 = 2

This simple equation demonstrates the basic principle of counting and quantity. It's a fundamental concept that forms the basis for many other mathematical operations and calculations.


In [5]:
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# For more detail on how tools are rendered, inspect the template itself:
# tokenizer.chat_template
# tokenizer.get_chat_template(None, tools)

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nwhat is 1+1 equal to ? explain why in shortly.<|im_end|>\n<|im_start|>assistant\n'
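
The string above follows Qwen's ChatML layout. As a rough illustration only, the hypothetical helper `render_chatml` below rebuilds that string in plain Python. It is a simplified stand-in for the tokenizer's real chat template, not the template itself: it ignores tools and every other template feature, and the default system prompt is copied from the output above.

```python
# Default system prompt, copied from the apply_chat_template output above.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Simplified, hypothetical re-implementation of a ChatML-style chat template."""
    # Prepend the default system message when the caller did not supply one.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    # Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>\n
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = render_chatml(
    [{"role": "user", "content": "what is 1+1 equal to ? explain why in shortly."}]
)
print(example)
```

Because the role is written into the prompt verbatim, a misspelled role such as "users" would end up in the prompt unchanged, so it is worth checking the rendered string before generating.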

### Chat models
The chat model (BaseChatModel) interface enables back-and-forth conversations between the user and the model.

It is a separate interface because popular LLM providers such as OpenAI distinguish the messages sent to and from the model by role:
user, assistant, and system. Chat models also offer additional capabilities:
1. Tool calling (role tool)
2. Structured Output
3. Multimodality

#### Message
Messages are the unit of communication in chat models. Each message contains:
1. Role
2. Content
3. Other fields: ID, name, metadata, tool calls, ...

Messages are based on BaseMessage and include:
1. SystemMessage
2. HumanMessage
3. AIMessage, AIMessageChunk
4. ToolMessage
5. Multimodal content (carried inside a message's content rather than as a separate message class)


In [4]:
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
from langchain_core.messages import HumanMessage

# ChatHuggingFace is BaseChatModel
chat_model = ChatHuggingFace(llm=pipeline)

prompt = [HumanMessage(user_input), ]

# chat_model.invoke DOES call apply_chat_template with add_generation_prompt=True,
# so unlike the LLM interface we don't need to call apply_chat_template ourselves.
# return_full_text=False is not passed here, so the printed content also contains the templated prompt.
print(chat_model.invoke(prompt,
                        pipeline_kwargs=dict(top_k=1)).content
      )

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
what is 1+1 equal to ? explain why in shortly.<|im_end|>
<|im_start|>assistant
The sum of one plus one equals two.

In mathematics, addition is the operation of combining two or more numbers into a single number. When you add one to another one, you get a total of two. This can be represented as:

1 + 1 = 2

This simple equation demonstrates the basic principle of counting and quantity. It's a fundamental concept that forms the basis for many other mathematical operations and calculations.


In [None]:
# Internal processing steps
cvt_input = chat_model._convert_input("tell me a joke").to_messages()
print(cvt_input,end='\n\n')
llm_input = chat_model._to_chat_prompt(cvt_input)
print(llm_input,end='\n\n')
llm_result = pipeline._generate(prompts=[llm_input])
print(llm_result,end='\n\n\n')
# Further processing


## Prompt template

In [5]:
from langchain_core.prompts import ChatPromptTemplate

# If the first message is not a system message, apply_chat_template fills in the model's default system prompt.
# If a system message is provided, it overrides the default.
template = ChatPromptTemplate.from_messages([
    ('system', 'Answer and explain the question based on the context.'),
    ('human', 'Context: {context}'),
    ('human', 'Question: {question}'),
])
context = 'A 5 years old child is learning math and ask the question. Need to explain for the kid to understand.'
question = 'what is 3+5 ?'
ques_prompt = template.invoke(dict(context=context, question=question))
print(ques_prompt)

messages=[SystemMessage(content='Answer and explain the question based on the context.', additional_kwargs={}, response_metadata={}), HumanMessage(content='Context: A 5 years old child is learning math and ask the question. Need to explain for the kid to understand.', additional_kwargs={}, response_metadata={}), HumanMessage(content='Question: what is 3+5 ?', additional_kwargs={}, response_metadata={})]


In [13]:
print(chat_model.invoke(ques_prompt,
                        pipeline_kwargs=dict(return_full_text=False,
                                             top_k=1)
                        ).content)

The answer is 8.
Explanation: When you add two numbers together, you combine their values. In this case, you're adding 3 and 5. You can think of it as counting up from 3 by 5 steps or starting at 5 and counting up by 3 steps. Either way, you end up with a total of 8. So, 3 + 5 equals 8.


In [23]:
# The output is an AIMessage, but its content contains both the prompt and the answer
# (return_full_text=False was not passed).
chat_model.invoke(ques_prompt)

AIMessage(content='<|im_start|>system\nAnswer and explain the question based on the context.<|im_end|>\n<|im_start|>user\nContext: A 5 years old child is learning math and ask the question. Need to explain for the kid to understand.<|im_end|>\n<|im_start|>user\nQuestion: what is 3+5 ?<|im_end|>\n<|im_start|>assistant\nThe answer is 8. Three plus five makes eight because you count up by fives from three, which gives you a total of eight.', additional_kwargs={}, response_metadata={}, id='run-c6bb1e4d-5ab1-4b30-8085-fd4be7fde2b4-0')

In [20]:
# Passing a raw string to a chat model: it is converted to a user message and the chat template is applied automatically.
print(chat_model.invoke("The sea is ").content)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
The sea is <|im_end|>
<|im_start|>assistant
the vast expanse of water that covers most of Earth's surface. It plays a crucial role in regulating the planet's climate and supporting marine life. The sea contains diverse ecosystems ranging from coral reefs to deep-sea trenches and has been instrumental in shaping human history through trade routes, transportation, and natural resources like fish and oil.


## Output parser
* There are two ways to parse output into a structured format such as JSON or a Pydantic object:
1. The built-in with_structured_output().
2. An output parser such as PydanticOutputParser, JsonOutputParser, or a custom parser. The parser instructs the model, via the prompt, to generate a given format, then
parses that output into JSON.

* with_structured_output() is implemented for models that provide native APIs for structuring outputs.
* Not all chat models support structured output, so the structured-output call below can return None when invoked.
* Chat models that support structured output are listed at https://python.langchain.com/docs/integrations/chat/
* Steps for creating a parser:
1. Define the schema.
2. Inject the parser's format instructions into the prompt.
3. Append the parser as the last step of the chain.

Note:
Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed JSON.
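
The three parser steps listed above can be sketched with the standard library alone. This is an illustrative stand-in, not LangChain's parser: `Joke`, `format_instructions`, and `parse_output` are hypothetical names, and `raw_reply` is a hand-written string standing in for a model reply.

```python
import json
from dataclasses import dataclass, fields


# Step 1: define the schema.
@dataclass
class Joke:
    setup: str
    punchline: str


# Step 2: build the instruction text that gets injected into the prompt.
def format_instructions(schema):
    keys = ", ".join(f'"{f.name}"' for f in fields(schema))
    return f"Return only a JSON object with the string keys {keys}."


# Step 3: parse the raw model text as the last step of the chain.
def parse_output(text, schema):
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    return schema(**{f.name: data[f.name] for f in fields(schema)})


prompt_text = "Tell me a joke.\n" + format_instructions(Joke)
# Hand-written stand-in for what a model might return.
raw_reply = '{"setup": "Why don\'t scientists trust atoms?", "punchline": "Because they make up everything!"}'
parsed = parse_output(raw_reply, Joke)
print(prompt_text)
print(parsed)
```

The parse step fails loudly when the reply is not valid JSON or lacks a key, which is exactly where a model with insufficient capacity tends to break the chain.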

### JSON

In [110]:
from pydantic import BaseModel, Field
from transformers import TextStreamer
from langchain_core.tools import tool
from langchain_huggingface import HuggingFaceEndpoint, HuggingFacePipeline
@tool
def multiply(a: int, b: int) -> int:
    """Multiply a and b."""
    return a * b

# A streamer can also be passed when the pipeline is created
streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True, skip_special_tokens=True)
joke_prompt = [HumanMessage("tell me a joke, return json format with 'setup' and 'punchline' keys ."), ]


class JokeOutput(BaseModel):
    """A joke format"""
    setup: str = Field(description='question to setup a joke')
    punchline: str = Field(description='answer to resolve the joke and make us laugh.')

# JokeOutput = {
#     'setup' :'question to setup a joke' ,
#     'punchline':  'answer to resolve the joke and make us laugh.'
# }
#
structured_chat = chat_model.bind_tools([JokeOutput])
tool_chat = chat_model.bind_tools([multiply], tool_choice = 'auto')


# LangChain sets tool_choice='any', but bind_tools in ChatHuggingFace only supports 'auto' or 'none'.
# structured_chat.first.kwargs['tool_choice'] =  structured_chat.first.kwargs['tools'][0]
joke = structured_chat.invoke(joke_prompt,
                             pipeline_kwargs=dict(
                                 return_full_text=False,
                                 top_k=1,
                                 streamer=streamer
    ))
type(joke)

{
  "setup": "Why don't scientists trust atoms?",
  "punchline": "Because they make up everything!"
}


langchain_core.messages.ai.AIMessage

In [111]:
joke

AIMessage(content='{\n  "setup": "Why don\'t scientists trust atoms?",\n  "punchline": "Because they make up everything!"\n}', additional_kwargs={}, response_metadata={}, id='run-7730852f-120c-4bd3-8226-4f15030285ae-0')

In [112]:
tool_chat.invoke("what is 3*12")

AIMessage(content='<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nwhat is 3*12<|im_end|>\n<|im_start|>assistant\nThe product of 3 and 12 is 36.', additional_kwargs={}, response_metadata={}, id='run-8f8cb2ff-769d-4a05-bc6b-f2b7ebd6c43e-0')

In [17]:
from langchain_core.output_parsers import JsonOutputParser

parser = JsonOutputParser(pydantic_object=JokeOutput)
print(parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"setup": {"description": "question to setup a joke", "title": "Setup", "type": "string"}, "punchline": {"description": "answer to resolve the joke and make us laugh.", "title": "Punchline", "type": "string"}}, "required": ["setup", "punchline"]}
```


In [18]:
template2 = ChatPromptTemplate.from_messages([
    # The first system message is not set, so the default is used when apply_chat_template.
    # ('system', 'Answer and explain the ques.'), #
    ('human', 'Instruction: {context}'),
    ('human', 'Command: {question}'),
])
ques_prompt2 = template2.invoke(dict(context=parser.get_format_instructions(), question="Tell me a joke"))
print(ques_prompt2)


messages=[HumanMessage(content='Instruction: The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"setup": {"description": "question to setup a joke", "title": "Setup", "type": "string"}, "punchline": {"description": "answer to resolve the joke and make us laugh.", "title": "Punchline", "type": "string"}}, "required": ["setup", "punchline"]}\n```', additional_kwargs={}, response_metadata={}), HumanMessage(content='Command: Tell me a joke', additional_kwargs={}, response_metadata={})]


In [19]:
# chain = chat_model|parser
# chain.invoke(ques_prompt2)
chat_out = chat_model.invoke(ques_prompt2,
                             pipeline_kwargs=dict(top_k=1, return_full_text=False))

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [21]:
print(chat_out)

content='{\n  "setup": "Why don\'t scientists trust atoms?",\n  "punchline": "Because they make up everything!"\n}' additional_kwargs={} response_metadata={} id='run-3e49a4ec-790c-4487-9d17-f9e1609bc3f8-0'


In [23]:
json_out = parser.invoke(chat_out)
json_out

{'setup': "Why don't scientists trust atoms?",
 'punchline': 'Because they make up everything!'}

## RunnableInterface
1. invoke
2. batch
3. stream
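
To make the three call styles concrete, here is a toy class with the same method names. It is a plain-Python illustration under that assumption, not LangChain's Runnable implementation; `UpperRunnable` is a hypothetical name.

```python
class UpperRunnable:
    """Toy stand-in exposing invoke, batch, and stream."""

    def invoke(self, text):
        # One input in, one output out.
        return text.upper()

    def batch(self, texts):
        # Many inputs in, a list of outputs out (here simply sequential).
        return [self.invoke(t) for t in texts]

    def stream(self, text):
        # One input in, output yielded piece by piece.
        for word in text.split():
            yield word.upper()


runnable = UpperRunnable()
single = runnable.invoke("hello world")
many = runnable.batch(["a b", "c"])
chunks = list(runnable.stream("hello world"))
print(single, many, chunks)
```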

### Streaming

### TextStreamer

In [38]:
# Streamer from HF
from transformers import TextStreamer

# A streamer can also be passed when the pipeline is created
streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True, skip_special_tokens=True)
a = chat_model.invoke(
    ques_prompt,
    pipeline_kwargs=dict(
        return_full_text=False,
        top_k=1,
        streamer=streamer
    ))

The answer is 8.
Explanation: When you add two numbers together, you combine their values. In this case, you're adding 3 and 5. You can think of it as counting up from 3 by 5 steps or starting at 5 and counting up by 3 steps. Either way, you end up with a total of 8. So, 3 + 5 equals 8.


### TextIteratorStreamer

In [50]:
from threading import Thread
from transformers import TextIteratorStreamer
from sys import stdout

istreamer = TextIteratorStreamer(tokenizer=tokenizer, skip_prompt=True, skip_special_tokens=True)

thread = Thread(
    target=chat_model.invoke,
    args=(ques_prompt,),
    kwargs=dict(
        pipeline_kwargs=dict(
            return_full_text=False,
            top_k=1,
            streamer=istreamer
        )
    )
)

thread.start()
print("Thread started")
gen_text = ""
for text in istreamer:
    gen_text += text
    # print(text,end="")
    stdout.write(text)
    stdout.flush()

Thread started
The answer is 8.
Explanation: When you add two numbers together, you combine their values. In this case, you're adding 3 and 5. You can think of it as counting up from 3 by 5 steps or starting at 5 and counting up by 3 steps. Either way, you end up with a total of 8. So, 3 + 5 equals 8.
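
The pattern used above (generation runs in a worker thread while the main thread iterates over the streamer) can be mimicked with a queue. The sketch below is a simplified illustration of that producer/consumer idea, not the transformers implementation; `QueueStreamer` and `fake_generate` are hypothetical names.

```python
from queue import Queue
from threading import Thread


class QueueStreamer:
    """Minimal iterator fed by a producer thread through a queue."""

    _END = object()  # sentinel marking the end of the stream

    def __init__(self):
        self._queue = Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer delivers something
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(streamer):
    # Stand-in for a model's generate loop: push pieces, then signal the end.
    for piece in ["The ", "answer ", "is ", "8."]:
        streamer.put(piece)
    streamer.end()


qstreamer = QueueStreamer()
worker = Thread(target=fake_generate, args=(qstreamer,))
worker.start()
collected = "".join(qstreamer)  # consume in the main thread while the worker produces
worker.join()
print(collected)
```

Without the worker thread, the blocking generate call would have to finish before the loop could read anything, which defeats the point of streaming.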

### AsyncTextIteratorStreamer

In [54]:
from transformers import AsyncTextIteratorStreamer
import asyncio as aio


async def delay(t=0.5):
    await aio.sleep(t)


async def main(s, t):

    aio.create_task(delay(t))
    async for text in s:
        # print(text,end="")
        stdout.write(text)
        stdout.flush()

if __name__ == '__main__':
    aistreamer = AsyncTextIteratorStreamer(tokenizer=tokenizer, skip_prompt=True, skip_special_tokens=True)
    thread = Thread(
        target=chat_model.invoke,
        args=(ques_prompt,),
        kwargs=dict(
            pipeline_kwargs=dict(
                return_full_text=False,
                top_k=1,
                streamer=aistreamer
            )
        )
    )
    thread.start()
    print("Thread started")

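    # asyncio.run() has to start its own event loop. A notebook already runs one, which causes the
    # RuntimeError below; in that environment call `await main(aistreamer, 1)` instead.
    aio.run(main(aistreamer,1))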

Thread started


RuntimeError: asyncio.run() cannot be called from a running event loop
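
The error above can be reproduced outside a notebook with the standard library alone. In this minimal sketch (`inner` and `outer` are hypothetical names), `asyncio.run()` is called from code that is already running inside an event loop, which is the situation a notebook cell is in. Awaiting the coroutine directly works.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # refuses to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # the coroutine was never started; close it to avoid a warning
        error = str(exc)
    # Awaiting directly reuses the loop that is already running.
    result = await inner()
    return error, result


error, result = asyncio.run(outer())
print(error)
print(result)
```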

## Tools

In [None]:
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
# ChatHuggingFace does not pass tools to apply_chat_template, so bind_tools() alone is not enough: the chat template has to be applied manually with the tools.
#
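
As a starting point for applying the template by hand, the sketch below builds a JSON-schema style tool description from a plain function. `function_to_tool_schema` is a hypothetical helper and its type mapping is deliberately minimal. The commented-out last line assumes a transformers version whose `apply_chat_template` accepts a `tools` argument, and is not run here.

```python
import inspect


def multiply(a: int, b: int) -> int:
    """Multiply a and b."""
    return a * b


# Minimal mapping from Python annotations to JSON-schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def function_to_tool_schema(fn):
    """Describe a function as a tool; every parameter is treated as required."""
    params = inspect.signature(fn).parameters
    properties = {
        name: {"type": _JSON_TYPES.get(p.annotation, "string")}
        for name, p in params.items()
    }
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),
            },
        },
    }


tool_schema = function_to_tool_schema(multiply)
print(tool_schema)
# Assumed usage, not executed here:
# prompt = tokenizer.apply_chat_template(messages, tools=[tool_schema], tokenize=False, add_generation_prompt=True)
```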

In [2]:
from transformers import TextGenerationPipeline, Text2TextGenerationPipeline

ImportError: 
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.


In [None]:
from langchain_huggingface import HuggingFaceEndpoint