# LLMs
source: https://python.langchain.com/docs/how_to/#llms

What LangChain calls LLMs are older forms of language models that take a string in and output a string.

Features:
- cache model responses
- create a custom LLM class
- stream a response back
- track token usage
- work with local models


## cache LLM responses

A caching layer for LLMs is useful for two reasons:
- it saves money by reducing the number of API calls made to the LLM provider
- it can speed up your application, because a cached response is returned without waiting on the LLM provider.
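
To make the idea concrete, here is a minimal pure-Python sketch of what a caching layer does. It is not LangChain's implementation: `fake_llm_call`, `cached_call`, `stats`, and the `(model, prompt)` cache key are hypothetical names chosen for illustration.

```python
from typing import Dict, Tuple

# Counts how many times the (hypothetical) provider is actually called.
stats = {"api_calls": 0}


def fake_llm_call(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a slow, billable call to an LLM provider."""
    stats["api_calls"] += 1
    return f"[{model}] response to: {prompt}"


# The cache maps (model, prompt) to the stored response.
_cache: Dict[Tuple[str, str], str] = {}


def cached_call(model: str, prompt: str) -> str:
    """Return the cached response if present, otherwise call the provider once."""
    key = (model, prompt)
    if key not in _cache:
        _cache[key] = fake_llm_call(model, prompt)  # cache miss: one real call
    return _cache[key]  # cache hit: no provider call


first = cached_call("demo-model", "Tell me a joke")
second = cached_call("demo-model", "Tell me a joke")
print(first == second, stats["api_calls"])
```

The key includes the model name because the same prompt sent to a different model should not be answered from another model's cached response.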

In [1]:
from langchain_core.globals import set_llm_cache
from langchain_openai import OpenAI
from langchain_ollama import ChatOllama
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

In [3]:
%%time
# To make the caching really obvious, let's use a slower and older model.
# llm = OpenAI(model="gpt-3.5-turbo-instruct", n=2, best_of=2)
llm = ChatOllama(base_url="http://localhost:11434", model="llama2:7b", temperature=0.)


set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")

CPU times: user 72.7 ms, sys: 8.5 ms, total: 81.2 ms
Wall time: 3.96 s


AIMessage(content="Sure, here's one:\n\nWhy don't scientists trust atoms?\nBecause they make up everything!\n\nI hope you found that amusing! Do you want to hear another one?", additional_kwargs={}, response_metadata={'model': 'llama2:7b', 'created_at': '2025-09-08T10:21:02.160104396Z', 'done': True, 'done_reason': 'stop', 'total_duration': 3892996750, 'load_duration': 3442655488, 'prompt_eval_count': 26, 'prompt_eval_duration': 160595979, 'eval_count': 46, 'eval_duration': 289122696, 'model_name': 'llama2:7b'}, id='run--a557b981-a7a3-4f82-bedd-d608099a451e-0', usage_metadata={'input_tokens': 26, 'output_tokens': 46, 'total_tokens': 72})

In [4]:
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")

CPU times: user 371 μs, sys: 95 μs, total: 466 μs
Wall time: 443 μs


AIMessage(content="Sure, here's one:\n\nWhy don't scientists trust atoms?\nBecause they make up everything!\n\nI hope you found that amusing! Do you want to hear another one?", additional_kwargs={}, response_metadata={'model': 'llama2:7b', 'created_at': '2025-09-08T10:21:02.160104396Z', 'done': True, 'done_reason': 'stop', 'total_duration': 3892996750, 'load_duration': 3442655488, 'prompt_eval_count': 26, 'prompt_eval_duration': 160595979, 'eval_count': 46, 'eval_duration': 289122696, 'model_name': 'llama2:7b'}, id='run--a557b981-a7a3-4f82-bedd-d608099a451e-0', usage_metadata={'input_tokens': 26, 'output_tokens': 46, 'total_tokens': 72, 'total_cost': 0})

In [5]:
# We can do the same thing with a SQLite cache
# !rm .langchain.db
from langchain_community.cache import SQLiteCache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

In [6]:
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")

CPU times: user 21.5 ms, sys: 1.57 ms, total: 23.1 ms
Wall time: 3.85 s


AIMessage(content="Sure, here's one:\n\nWhy don't scientists trust atoms?\nBecause they make up everything!\n\nI hope you found that amusing! Do you want to hear another one?", additional_kwargs={}, response_metadata={'model': 'llama2:7b', 'created_at': '2025-09-08T10:22:26.309343864Z', 'done': True, 'done_reason': 'stop', 'total_duration': 3832381337, 'load_duration': 3360830118, 'prompt_eval_count': 26, 'prompt_eval_duration': 178174755, 'eval_count': 46, 'eval_duration': 292686606, 'model_name': 'llama2:7b'}, id='run--2d33ecb3-cf54-455b-9c90-f7c19ba8fbed-0', usage_metadata={'input_tokens': 26, 'output_tokens': 46, 'total_tokens': 72})

In [7]:
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")

CPU times: user 1.35 ms, sys: 1.3 ms, total: 2.66 ms
Wall time: 2.01 ms


AIMessage(content="Sure, here's one:\n\nWhy don't scientists trust atoms?\nBecause they make up everything!\n\nI hope you found that amusing! Do you want to hear another one?", additional_kwargs={}, response_metadata={'model': 'llama2:7b', 'created_at': '2025-09-08T10:22:26.309343864Z', 'done': True, 'done_reason': 'stop', 'total_duration': 3832381337, 'load_duration': 3360830118, 'prompt_eval_count': 26, 'prompt_eval_duration': 178174755, 'eval_count': 46, 'eval_duration': 292686606, 'model_name': 'llama2:7b'}, id='run--2d33ecb3-cf54-455b-9c90-f7c19ba8fbed-0', usage_metadata={'input_tokens': 26, 'output_tokens': 46, 'total_tokens': 72, 'total_cost': 0})

## create a custom LLM class

Creating a custom LLM wrapper allows you to use your own LLM or a different wrapper than the ones supported in LangChain.

Wrapping your LLM with the standard LLM interface allows you to use it in existing LangChain programs with minimal code modifications.

As a bonus, your LLM will automatically become a LangChain Runnable and will benefit from some optimizations out of the box, async support, the astream_events API, etc.

**Warning:** This feature is for **text completion models**, but the latest and most popular models are **chat completion models**.

In [8]:
# create a custom LLM class
from typing import Any, Dict, Iterator, List, Mapping, Optional
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.outputs import GenerationChunk


class CustomLLM(LLM):
    """A custom LLM that echoes the first `n` characters of the input."""

    n: int
    """The number of characters from the prompt to be echoed."""

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Run the LLM on the given input.

        Args:
            prompt: The prompt to generate from.
            stop: Stop words to use when generating. Model output is cut off at the first occurrence of any of the stop substrings.
                If stop tokens are not supported consider raising NotImplementedError.
            run_manager: Callback manager for the run.
            **kwargs: Arbitrary additional keyword arguments. These are usually passed
                to the model provider API call.

        Returns:
            The model output as a string. Actual completions SHOULD NOT include the prompt.
        """
        if stop is not None:
            raise ValueError("stop kwargs are not permitted.")
        return prompt[: self.n]

    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        """Stream the LLM on the given prompt.

        If not implemented, calls to stream fall back to the non-streaming
        version of the model and return the output as a single chunk.

        Args:
            prompt: The prompt to generate from.
            stop: Stop words to use when generating. Model output is cut off at the
                first occurrence of any of these substrings.
            run_manager: Callback manager for the run.
            **kwargs: Arbitrary additional keyword arguments. These are usually passed
                to the model provider API call.

        Returns:
            An iterator of GenerationChunks.
        """
        for char in prompt[: self.n]:
            chunk = GenerationChunk(text=char)
            if run_manager:
                run_manager.on_llm_new_token(chunk.text, chunk=chunk)

            yield chunk

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        """Return a dictionary of identifying parameters."""
        return {
            # The model name allows users to specify custom token counting
            # rules in LLM monitoring applications (e.g., in LangSmith users
            # can provide per token pricing for their model and monitor
            # costs for the given LLM.)
            "model_name": "CustomChatModel",
        }

    @property
    def _llm_type(self) -> str:
        """Get the type of language model used by this LLM. Used for logging purposes only."""
        return "custom"

In [9]:
# Testing 

llm = CustomLLM(n=5)
print(llm)

CustomLLM
Params: {'model_name': 'CustomChatModel'}


In [10]:
llm.invoke("This is a foobar thing")

'This '

In [11]:
await llm.ainvoke("world")

'world'

In [12]:
llm.batch(["woof woof woof", "meow meow meow"])

['woof ', 'meow ']

In [13]:
await llm.abatch(["woof woof woof", "meow meow meow"])

['woof ', 'meow ']

In [14]:
async for token in llm.astream("hello"):
    print(token, end=">", flush=True)

h>e>l>l>o>
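
The `_stream` docstring above notes that, when `_stream` is not implemented, streaming falls back to the non-streaming path and returns the output as a single chunk. The following pure-Python sketch illustrates that behavior only; `BaseEcho` and `CharEcho` are hypothetical classes, not LangChain's, and the override check is one assumed way to detect a missing `_stream`.

```python
from typing import Iterator


class BaseEcho:
    """Hypothetical base class that echoes the first `n` characters."""

    def __init__(self, n: int) -> None:
        self.n = n

    def _call(self, prompt: str) -> str:
        return prompt[: self.n]

    def _stream(self, prompt: str) -> Iterator[str]:
        raise NotImplementedError

    def stream(self, prompt: str) -> Iterator[str]:
        if type(self)._stream is BaseEcho._stream:
            # No streaming override: emit the whole result as one chunk.
            yield self._call(prompt)
        else:
            yield from self._stream(prompt)


class CharEcho(BaseEcho):
    """Hypothetical subclass that streams one character at a time."""

    def _stream(self, prompt: str) -> Iterator[str]:
        for char in prompt[: self.n]:
            yield char


single_chunk = list(BaseEcho(n=5).stream("hello world"))
per_char = list(CharEcho(n=5).stream("hello world"))
print(single_chunk, per_char)
```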

In [15]:
# Integration with langchain prompt template and chain
from langchain_core.prompts.chat import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [("system", "you are a bot"), ("human", "{input}")]
)

llm = CustomLLM(n=7)
chain = prompt | llm

In [16]:
idx = 0
async for event in chain.astream_events({"input": "hello there!"}, version="v1"):
    print(event)
    idx += 1
    if idx > 7:
        # Truncate
        break

{'event': 'on_chain_start', 'run_id': '3ad7d496-5390-46e5-b3f6-4ce222a0d406', 'name': 'RunnableSequence', 'tags': [], 'metadata': {}, 'data': {'input': {'input': 'hello there!'}}, 'parent_ids': []}
{'event': 'on_prompt_start', 'name': 'ChatPromptTemplate', 'run_id': 'a36649ff-1239-4668-bc6d-a4baa26cc693', 'tags': ['seq:step:1'], 'metadata': {}, 'data': {'input': {'input': 'hello there!'}}, 'parent_ids': []}
{'event': 'on_prompt_end', 'name': 'ChatPromptTemplate', 'run_id': 'a36649ff-1239-4668-bc6d-a4baa26cc693', 'tags': ['seq:step:1'], 'metadata': {}, 'data': {'input': {'input': 'hello there!'}, 'output': ChatPromptValue(messages=[SystemMessage(content='you are a bot', additional_kwargs={}, response_metadata={}), HumanMessage(content='hello there!', additional_kwargs={}, response_metadata={})])}, 'parent_ids': []}
{'event': 'on_llm_start', 'name': 'CustomLLM', 'run_id': '8ec96921-4683-4de5-b760-499c56f63cbd', 'tags': ['seq:step:2'], 'metadata': {'ls_provider': 'custom', 'ls_model_type'

## track token usage for LLMs

Tracking token usage to calculate cost is an important part of putting your app into production. This section goes over how to obtain this information from your LangChain model calls.
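
As a rough illustration of how token counts translate into cost, here is a minimal sketch. The per-token prices and the `estimate_cost` helper are hypothetical; real prices depend on the provider and the model. The token counts are the ones reported in the `usage_metadata` of the Ollama responses earlier in this document.

```python
# Hypothetical prices in USD per 1,000 tokens (assumptions, not real pricing).
PRICE_PER_1K_INPUT = 0.0015
PRICE_PER_1K_OUTPUT = 0.002


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one call from its prompt and completion token counts."""
    input_cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT
    output_cost = (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return input_cost + output_cost


input_tokens, output_tokens = 26, 46
total_tokens = input_tokens + output_tokens
cost = estimate_cost(input_tokens, output_tokens)
print(f"Total Tokens: {total_tokens}")
print(f"Estimated Cost (USD): ${cost:.6f}")
```

Prompt and completion tokens are priced separately here because providers commonly charge different rates for each.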

In [None]:
# Single call 
from langchain_community.callbacks import get_openai_callback
from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct")

with get_openai_callback() as cb:
    result = llm.invoke("Tell me a joke")
    print(result)
    print("---")
print()

print(f"Total Tokens: {cb.total_tokens}")
print(f"Prompt Tokens: {cb.prompt_tokens}")
print(f"Completion Tokens: {cb.completion_tokens}")
print(f"Total Cost (USD): ${cb.total_cost}")