# Using LLMs locally on a laptop/PC

Quantized versions of LLMs can run locally on a laptop or PC. At least 16GB of RAM and a GPU with 8GB of memory help with generation speed, but neither is required: 8GB of RAM and a decent CPU are enough to run the small versions of some LLMs.
One of the easiest ways to get LLMs working on a laptop or PC is to install Ollama (https://ollama.com/). Download the version that is compatible with your machine and install it. Ollama then operates as an LLM server that can be used interactively, through the CLI or a compatible UI, or programmatically, through the Ollama Python API.

Once you have installed Ollama, you need to select a model and download it. Open a terminal and type the following command to download an LLM, replacing `model_name` with the name of the model you selected:

```shell
ollama pull model_name
```

The list of LLMs available in Ollama can be found on this page: https://ollama.com/search

There are plenty of models you can try, but generally speaking, models from the llama3.2, Qwen2.5 or gemma3 families are good options. Here are examples of LLMs sorted by size (number of parameters), from 1.5 to 12 billion parameters:

- Qwen2.5 1.5B: https://ollama.com/library/qwen2.5:1.5b
- llama3.2 3B:  https://ollama.com/library/llama3.2:3b
- gemma3 4B: https://ollama.com/library/gemma3:4b
- qwen3:4b-instruct: https://ollama.com/library/qwen3:4b-instruct
- Qwen2.5 7B: https://ollama.com/library/qwen2.5:7b
- gemma3 12B: https://ollama.com/library/gemma3:12b

Which LLM and which size to download depends on the available computer capabilities in terms of GPU, GPU memory, CPU and RAM size.
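
As a rough guide for choosing a size, the weights of a quantized model take about (number of parameters × bits per weight / 8) bytes. The sketch below applies this back-of-the-envelope rule. The function name and the 4-bit default are illustrative assumptions, and real model files are somewhat larger: for instance, the `qwen2.5:7b` entry listed later in this document (7.6B parameters, Q4_K_M quantization) occupies about 4.7 GB, above the 3.8 GB this rule gives. Memory for the context is also needed at run time.

```python
def estimate_weights_gb(n_params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 1e9


# Illustrative parameter counts taken from the model names listed above
for name, size_b in [("qwen2.5:1.5b", 1.5), ("gemma3:4b", 4.0), ("gemma3:12b", 12.0)]:
    print(f"{name}: ~{estimate_weights_gb(size_b):.1f} GB at 4 bits per weight")
```

If the estimate is close to or above your available RAM or GPU memory, a smaller model is probably the safer choice.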

Models are downloaded with the `ollama pull` command. For instance, to download gemma3 4B, open a terminal and type the following command:

```shell
ollama pull gemma3:4b
```

Once the model has finished downloading, you can chat with it directly in the CLI by typing in a terminal:

```shell
ollama run gemma3:4b
```


However, a chat user interface (UI) is much more convenient. Among those compatible with the Ollama server is `open-webui`, a feature-rich, open-source UI. A simpler alternative is `Page Assist`, an open-source browser extension available for Firefox-based browsers (https://addons.mozilla.org/en-US/firefox/addon/page-assist/) and Chrome-based browsers (https://chromewebstore.google.com/detail/page-assist-a-web-ui-for/jfgfiigpkhlkbnfnbobbkinehhfdhndo).



## Programmatic use of ollama

To use the Ollama API programmatically, you need to install and launch the Ollama server, and also install the ollama-python library (https://github.com/ollama/ollama-python):

```shell
pip install ollama
```
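
Before sending requests, it can help to check that the server is actually reachable. The following is a minimal sketch using only the standard library; the function name `is_ollama_running` is hypothetical, and the default URL assumes the server listens on the standard local port used throughout this document.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status...
        return False


if is_ollama_running():
    print("Ollama server is reachable")
else:
    print("No server answered: start it with `ollama serve` or launch the Ollama app")
```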

In Python, you first create an Ollama client, which you then use to send requests to the Ollama server.

The following example shows how to access the local Ollama server to get the list of available (installed) models:

In [2]:
from ollama import Client

# We assume that ollama has been installed and the ollama server already started (see https://ollama.com/)

ollama_url = 'http://localhost:11434'

# Get an ollama client
llmclient = Client(host=ollama_url)

# Print the list of available models:
llmclient.list()['models']

[Model(model='qwen2.5:7b', modified_at=datetime.datetime(2025, 10, 8, 17, 56, 16, 472887, tzinfo=TzInfo(7200)), digest='845dbda0ea48ed749caafd9e6037047aa19acfcfd82e704d7ca97d631a0b697e', size=4683087332, details=ModelDetails(parent_model='', format='gguf', family='qwen2', families=['qwen2'], parameter_size='7.6B', quantization_level='Q4_K_M')),
 Model(model='gemma3:4b', modified_at=datetime.datetime(2025, 10, 8, 16, 30, 55, 385871, tzinfo=TzInfo(7200)), digest='a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a', size=3338801804, details=ModelDetails(parent_model='', format='gguf', family='gemma3', families=['gemma3'], parameter_size='4.3B', quantization_level='Q4_K_M')),
 Model(model='gemma3:1b', modified_at=datetime.datetime(2025, 9, 23, 11, 18, 51, 513820, tzinfo=TzInfo(7200)), digest='8648f39daa8fbf5b18c7b4e6a8fb4990c692751d49917417b8842ca5758e7ffc', size=815319791, details=ModelDetails(parent_model='', format='gguf', family='gemma3', families=['gemma3'], parameter_si

### Generate() method

The simplest way to send a query and get an LLM response is the `generate()` method, as shown in this example:

In [4]:
from pprint import pprint
from ollama import Client

ollama_url = 'http://localhost:11434'
model_name = 'gemma3:4b'
# model_name = 'qwen2.5:72b'

# Get an ollama client
llmclient = Client(host=ollama_url)


model_options = {
    'num_predict': 200,  # max number of tokens to predict
    'temperature': 0.1,
    'top_p': 0.9,
}

result = llmclient.generate(model=model_name, prompt='Who is currently the French Prime Minister?', options=model_options)

# pprint(vars(result), compact=True, sort_dicts=False)

print("\n===================== LLM generated response:\n", result.response)



 As of today, November 2, 2023, the French Prime Minister is **Gabriel Attal**. 

You can always find the most up-to-date information on this and other government positions here:

*   **Official Website of the French Government:** [https://www.gouvernement.fr/](https://www.gouvernement.fr/)


### Chat() method

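It is also possible to use the `chat()` method, which takes a list of messages (each with a role and a content) instead of a single prompt string: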


In [3]:
from ollama import ChatResponse

response: ChatResponse = llmclient.chat(
    model=model_name, 
    messages=[ {'role': 'user', 'content': 'Why is the sky blue?'} ],
    options={'num_predict': 100}
)
print(response.message.content)

The blue color of the sky is a fascinating phenomenon caused by a process called **Rayleigh scattering**. Here's a breakdown of how it works:

**1. Sunlight and Colors:**

* Sunlight is actually made up of *all* the colors of the rainbow. We see this when light passes through a prism and separates into its different wavelengths.

**2. Rayleigh Scattering:**

* As sunlight enters the Earth's atmosphere, it collides with tiny air molecules (mostly nitrogen
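
Because `chat()` receives the whole list of messages on each call, a multi-turn conversation is obtained by appending every user message and every model reply to that list. The sketch below illustrates this bookkeeping; `chat_turn` and `fake_send` are hypothetical helpers, and `fake_send` stands in for the real server call so that the example runs without Ollama.

```python
from typing import Callable

Message = dict[str, str]


def chat_turn(history: list[Message], user_text: str,
              send: Callable[[list[Message]], str]) -> str:
    """Append the user message, obtain a reply via `send`, and record the reply in the history."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply


# Stand-in for the real call. With the Ollama client it could be something like:
#   lambda msgs: llmclient.chat(model=model_name, messages=msgs).message.content
def fake_send(messages: list[Message]) -> str:
    return f"(reply to: {messages[-1]['content']})"


history: list[Message] = [{"role": "system", "content": "You are a concise assistant."}]
chat_turn(history, "Why is the sky blue?", fake_send)
chat_turn(history, "And why are sunsets red?", fake_send)
print([m["role"] for m in history])
```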


### Stream mode
In the following code, the client retrieves the answer in streaming mode. Streaming is useful when you display LLM answers in an interactive UI: the user sees the answer as it is being generated, instead of waiting for the full answer to be retrieved and displayed.

In [4]:
from ollama import Client

ollama_url = 'http://localhost:11434'
model_name = 'gemma3:4b'
# Get an ollama client
llmclient = Client(host=ollama_url)

prompt = """Explain in a few sentences what is a transformer"""

options={
    'num_predict': 200,  # max number of tokens to predict
    'temperature': 0.1,
    'top_p': 0.9,
}

stream = llmclient.chat(
  model=model_name,
  messages=[{'role': 'user', 'content': prompt}],
  options=options,
  stream=True,
)

for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)


Okay, here's an explanation of a transformer in a few sentences:

A transformer is a device that uses electromagnetic induction to transfer electrical energy from one circuit to another with a different voltage. It consists of two or more coils of wire linked by a magnetic core.  By varying the number of turns in the coils, it can step up or step down the voltage efficiently, making it a crucial component in power distribution systems. 

---

Would you like me to delve into a specific aspect of transformers, like their operation, types, or applications?
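
In practice you often want both to display the chunks as they arrive and to keep the full answer for later use. Here is a small sketch of that pattern; `print_and_collect` is a hypothetical helper, and `fake_stream` is made-up data standing in for the iterator returned by `llmclient.chat(..., stream=True)`.

```python
from typing import Iterable


def print_and_collect(stream: Iterable[dict]) -> str:
    """Print each chunk as it arrives and return the concatenated answer."""
    parts = []
    for chunk in stream:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


# Made-up chunks standing in for a real streamed response
fake_stream = [
    {"message": {"content": "A transformer "}},
    {"message": {"content": "is a device..."}},
]
full_answer = print_and_collect(fake_stream)
```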

### Generic OpenAI Completion API

Instead of the Ollama-specific Python API, you can use the OpenAI API with the Ollama server, provided that you have installed the openai Python library.

The following example creates an OpenAI client for the Ollama server and runs a query with the `completions` API:

In [6]:
from openai import OpenAI

server_url = "http://localhost:11434/v1"
model_name = "qwen2.5:7b"

llmclient = OpenAI(
    base_url=server_url,
    api_key='EMPTY', # required, but not used
)

prompt = "Who is the president of France?"
resp = llmclient.completions.create(
            model=model_name,
            prompt=prompt,
            temperature=0.1, 
            top_p=0.1,
            max_tokens=300,
            extra_body=None,
            stream=False,
        )


print("RESPONSE:", resp.choices[0].text)


RESPONSE: As of my last update in October 2023, the President of France is Emmanuel Macron. He has been serving since May 14, 2017. However, please check for the most current information as presidential terms can be extended or changed over time.


### OpenAI Chat function

The same as above, but using the OpenAI `chat` API instead:

In [7]:
from openai import OpenAI

base_url = "http://localhost:11434/v1"
model_name = "gemma3:4b"

# Configure the client to use the local Ollama server
client = OpenAI(
    base_url=base_url,  # Ollama's OpenAI-compatible endpoint
    api_key="ollama"  # Dummy key; Ollama does not require auth
)

# Make a chat completion request
response = client.chat.completions.create(
    model=model_name,  # Or any other model you've pulled with `ollama pull`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"}
    ]
)

# Print the response
print(response.choices[0].message.content)

The capital of France is **Paris**. 😊 

Do you want to know anything more about Paris, or would you like to ask me another question?


## LLM-based text classification

In this example, we ask the selected LLM (in French) to classify a text into one or more possible categories among the ones provided. We then get and print the answer:

In [8]:
from openai import OpenAI

model_name = 'qwen2.5:7b'

# Get an OpenAI client
llmclient = OpenAI(
    base_url=base_url,  
    api_key="ollama"
)

prompt = """Considérez le texte suivant;

"Microsoft s'associe à G42 pour lancer un projet colossal d'1 milliard de dollars : un centre de données écologique au Kenya ! Alimenté par le potentiel inexploité des 10 gigawatts d'énergie géothermique du Kenya, le méga centre de données est le premier de son genre"


Le texte fait-il partie des catégories suivantes ? Si oui listez juste la ou les catégories concernées sous forme de liste Python sans explications, sinon répondez juste "non" :

Liste de catégories possibles : "Activisme écologique", "Comportement consommateur", "Energies renouvelables et nucléaires", "Engagement politique et entreprises", "Reforestation", "Solution écologique innovante", "Transport décarboné".
"""

messages = [
    {"role": "system", "content": "You are a helpful AI assistant that performs text classification."},
    {"role": "user", "content": prompt}
]

result = llmclient.chat.completions.create(
    model=model_name, 
    messages= messages,
    max_tokens=200,
    temperature=0.1,
    top_p=0.9)

print("LLM response:\n", result.choices[0].message.content)

LLM response:
 ["Energies renouvelables et nucléaires", "Engagement politique et entreprises", "Solution écologique innovante"]
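
Since the prompt asks for a Python list, the raw answer is still a string that has to be parsed and checked before it can be used. One possible way to do this, sketched below, is `ast.literal_eval` followed by a filter on the known categories; the helper name `parse_categories` is hypothetical, and treating any unparsable answer as "no category" is an assumption of this sketch.

```python
import ast

CATEGORIES = {
    "Activisme écologique", "Comportement consommateur",
    "Energies renouvelables et nucléaires", "Engagement politique et entreprises",
    "Reforestation", "Solution écologique innovante", "Transport décarboné",
}


def parse_categories(response: str) -> list[str]:
    """Turn the raw LLM answer into a list of known categories ([] for "non" or unparsable text)."""
    try:
        value = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError, TypeError):
        return []  # e.g. the answer was "non" or free text
    if not isinstance(value, (list, tuple)):
        return []
    # Keep only labels that belong to the predefined category list
    return [c for c in value if isinstance(c, str) and c in CATEGORIES]


raw = ' ["Energies renouvelables et nucléaires", "Engagement politique et entreprises", "Solution écologique innovante"]'
print(parse_categories(raw))
```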


## Opinion classification


We will use the Hugging Face `datasets` library to download an Amazon product review dataset (only the French part):

In [10]:
from datasets import load_dataset

ds = load_dataset("SetFit/amazon_reviews_multi_fr", split="validation")

ds



Dataset({
    features: ['id', 'text', 'label', 'label_text'],
    num_rows: 5000
})

An example of a review sample:

In [11]:
pprint(ds[0])

{'id': 'fr_0112905',
 'label': 0,
 'label_text': '0',
 'text': 'Colis bien reçus avec le boîtier du jeu ouvert, cassé et sans le cd. '
         'Précommandé pour recevoir ça le jour de la sortie sa fait plaisir... '
         'Et le service client Amazon qui cherche à mettre la faute sur le '
         'livreur qui a très bien fait sont travail en livrant un colis en '
         'excellant état... Cela mériterait même un 0!'}


Next we filter out the neutral reviews (i.e. reviews with label 2), shuffle the dataset, and print the first 3 data samples, both the review texts and the labels. The labels range from 0 to 4 and correspond to the rating score given by the user to the product:

In [12]:
ds = ds.filter(lambda sample: sample['label'] != 2)
ds = ds.shuffle(seed=42)
ds[0:3]

Filter: 100%|██████████| 5000/5000 [00:00<00:00, 51062.87 examples/s]


{'id': ['fr_0820252', 'fr_0736148', 'fr_0032774'],
 'text': ['J\'ai acheté ce câble display port pour mon nouvel écran 144hz 24" afin d\'en profiter au maximum. De prime abord le câble semblait de bonne qualité, une bonne tenue sur l\'écran et le pc, avec une longueur plus que correcte pour relier mon écran à ma tour. Seulement après quelques semaines, l\'affichage n\'arrête pas de "sauter", l\'écran n\'est plus détecté et n\'affiche plus rien. Plus récemment lorsque je regarde des vidéos sur Youtube ou Netflix par exemple, l\'image de la vidéo devient floue par endroits, avec des reflets brouillons... Je suis plutôt désappointé, je vais devoir en racheter un, à moins qu\'il y ait possibilité d\'obtenir un produit de remplacement pour ce câble défectueux.',
  "L'article reçu était noir et non gris. Le câble s'est abîmé très rapidement. Assez déçue !",
  "Je dénigre pas le produit mais je le déconseille aux personnes chez qui la perte de cheveux fait sui

The following code classifies the reviews as positive or negative, using an LLM via the ollama library. We consider labels of 3 or above to be positive and labels below 3 to be negative (neutral reviews, with label 2, were filtered out earlier). The script also computes the classification accuracy, i.e. the fraction of reviews classified correctly, which it reports as "Precision" in its output.

In [11]:
from jinja2 import Template
from tqdm import tqdm
from ollama import Client

ollama_url = 'http://localhost:11434'

# model_name = 'llama3.1:8b'   # acc=91.00%  [00:42<00:00,  4.72it/s]
# model_name = 'gemma2:2b'   # acc=91.00%   [00:33<00:00,  5.90it/s]
# model_name = 'gemma2:9b'   # acc=92.00%   [01:36<00:00,  2.08it/s]
model_name = 'qwen2.5:7b'    # acc=92.00% on first 200: [00:40<00:00,  4.90it/s]
# model_name = 'qwen2.5:32b'  # acc=94% on first 200 samples  200/200 [11:54<00:00,  3.57s/it]
# model_name = 'gemma3:4b'    # acc=92.50%  [00:38<00:00,  5.20it/s]
# model_name = 'gemma3:12b'    # acc=96.00%  [06:03<00:00,  1.82s/it]
# model_name = 'llama3.2:3b-instruct-fp16'  # acc=87.00% [01:02<00:00,  3.19it/s]
# model_name = 'qwen3:4b'

model_options = {
    'num_predict': 200,  # max number of tokens to predict
    'temperature': 0.0,
    'top_p': 0.0,
}

llmclient = Client(host=ollama_url)

prompt_template = """Considérez l'avis suivant:

"{{text}}"

Est-ce que ce texte exprime globalement un avis positif ou négatif ? Répondez seulement par "positif" ou "négatif" sans donner d'explications."""

jtemplate = Template(prompt_template)

n_correct = 0
n=0

# We will do that for only the first 20 examples:
# tqdm displays a progress bar
for i in tqdm(range(20)):
    sample = ds[i]
    opinion_label = "négatif" if sample['label'] < 3 else "positif"
    prompt = jtemplate.render(text=sample['text'])
    # print(sample['text'])
    result = llmclient.generate(model=model_name, prompt=prompt, options=model_options)
    predicted_label = result['response'].lower()
    predicted_label = "".join(c for c in predicted_label if c.isalpha())
    # print("LLM response:\n", predicted_label)
    n += 1
    if predicted_label == opinion_label:
        n_correct += 1

precision = round(100*float(n_correct)/n, 2)

print(f"LLM={model_name}, Classification Precision={precision:.2f}%  ({n_correct}/{n})")

100%|██████████| 20/20 [01:44<00:00,  5.20s/it]

LLM=qwen2.5:7b, Classification Precision=95.00%  (19/20)
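
The figure printed above is the fraction of correct predictions, which is usually called accuracy; precision and recall are computed per class. The sketch below shows how the three metrics differ on a tiny made-up example. The function `binary_metrics` and the sample lists are illustrative assumptions, not part of the original script; to use it there, you would collect the gold and predicted labels in two lists inside the loop.

```python
def binary_metrics(gold: list[str], predicted: list[str], positive: str = "positif") -> dict[str, float]:
    """Accuracy over all samples, plus precision and recall for the `positive` class."""
    assert len(gold) == len(predicted) and gold, "need two non-empty lists of equal length"
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    correct = sum(g == p for g, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels for illustration only
gold = ["positif", "négatif", "positif", "négatif"]
pred = ["positif", "positif", "positif", "négatif"]
metrics = binary_metrics(gold, pred)
print(metrics)
```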





## Aspect-based sentiment analysis in French

In [13]:
from jinja2 import Template
from ollama import Client


model_name = 'qwen2.5:7b'
ollama_url = 'http://localhost:11434'

llmclient = Client(host=ollama_url)

model_options = {
    'num_predict': 4000,
    'temperature': 0.1,
    'top_p': 0.9,
}


text_avis = """Bonne nourriture avec des produits frais
Service très aimable
Ambiance sonore élevée et service long en raison d'un nombre important de convives même en semaine."""

prompt_template = """Considérez l'avis suivant:

"{{text}}"

Quelle est la valeur de l'opinion exprimée sur chacun des aspects suivants : Prix, Cuisine, Service, Ambiance et Emplacement?

La valeur d'une opinion doit être une des valeurs suivantes: "Positive", "Négative", "Neutre", ou "Non exprimée".
La valeur neutre correspond au cas où le texte contient à la fois au moins une opinion positive sur l'aspect et au moins une opinion négative sur le même aspect.

La réponse doit se limiter au format json suivant:
{ "Prix": opinion, "Cuisine": opinion, "Service": opinion, "Ambiance": opinion, "Emplacement": opinion}.

"""
# With a reasoning model, append /no_think to the end of the prompt if you do not want it to reason

jtemplate = Template(prompt_template)


prompt = jtemplate.render(text=text_avis)

result = llmclient.generate(model=model_name, prompt=prompt, options=model_options)

print(result.response)

```json
{
  "Prix": "Non exprimée",
  "Cuisine": "Positive",
  "Service": "Positive",
  "Ambiance": "Négative",
  "Emplacement": "Non exprimée"
}
}
```
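
As the output above shows, the model may wrap its JSON answer in a Markdown code fence, so the string cannot be passed to `json.loads` directly. The following sketch shows one way to isolate and parse the JSON part; `extract_json` is a hypothetical helper, and the fence marker is built programmatically only to keep this example readable.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(response: str):
    """Parse the first JSON object or array found in an LLM answer, ignoring Markdown fences."""
    fenced = FENCED_BLOCK.search(response)
    candidate = fenced.group(1) if fenced else response
    # Position of the first '{' or '[' in the candidate text
    start = min((i for i in (candidate.find("{"), candidate.find("[")) if i != -1), default=-1)
    if start == -1:
        raise ValueError("no JSON structure found in the response")
    obj, _end = json.JSONDecoder().raw_decode(candidate[start:])
    return obj


sample = FENCE + 'json\n{"Prix": "Non exprimée", "Cuisine": "Positive"}\n' + FENCE
print(extract_json(sample))
```

Malformed JSON still raises an exception here (`json.JSONDecodeError` is a subclass of `ValueError`), so the caller can decide whether to retry or skip the sample.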


## Exercise: Topic classification in French

In [15]:
from datasets import load_dataset

ds_train = load_dataset("mteb/sib200", "fra_Latn", split='train', download_mode=None)


ds_train

Dataset({
    features: ['label', 'text', 'lang'],
    num_rows: 701
})

In [16]:
ds_train[0]

{'label': 1,
 'text': "La Turquie est entourée par des mers sur trois côtés : la mer Égée à l'ouest, la mer Noire au nord et la mer Méditerranée au sud.",
 'lang': 'fra_Latn'}

In [17]:
list(set(ds_train['label']))

[0, 1, 2, 3, 4, 5, 6]

In [18]:
ds_train[1]

{'label': 4,
 'text': "Au début de la guerre, ils voguaient essentiellement sur la mer. Mais au vu des progrès réalisés en matière de radars, et étant donné leur précision sans cesse grandissante, les sous-marins étaient obligé de passer sous l'eau pour ne pas être repérés.",
 'lang': 'fra_Latn'}

Complete this exercise by writing code to use LLM prompting with ollama in order to classify the texts of the dataset into topics. Compute the accuracy of such classification.

In [None]:
# Do this on the first 20 examples

## Enforced structured output with LLMs

To extract structured data from text, we can explicitly instruct the LLM (in the prompt) to produce the requested information in a structured format, for instance JSON. We then parse the response to isolate the JSON structure and load it into a Python object.

However, instructing the LLM about the output format does not guarantee that the model will always follow the instruction and produce the information in the same predefined structured format.

To solve this issue, we need `structured decoding`, also called `guided decoding`: we formally define the structure of the output using regular expressions, context-free grammars, or JSON schemas, and then provide this formal definition as an additional argument for generation. During decoding (generation), a parsing state is maintained, and at each token generation step, only tokens that are compatible with the current parsing state can be sampled. This guarantees that the generated output strictly matches the structure definition.
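
The mechanism can be illustrated with a toy example. The sketch below is not how Ollama implements guided decoding: the grammar is a hand-written state machine, the vocabulary has only a few tokens, and `fake_model_scores` is a made-up stand-in for an LLM that would prefer to answer with free text. It only shows the principle of discarding, at each step, the tokens that the current parsing state does not allow.

```python
# Toy grammar for outputs of the form {"type":"ORG"} or {"type":"LOC"}.
# Each parsing state maps every allowed token to the next state.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"type"': "colon"},
    "colon": {":": "value"},
    "value": {'"ORG"': "close", '"LOC"': "close"},
    "close": {"}": "done"},
}


def fake_model_scores(prefix: list[str]) -> dict[str, float]:
    """Stand-in for an LLM: fixed scores that favour free text over the JSON tokens."""
    return {"Sure": 5.0, "{": 1.0, '"type"': 0.5, ":": 0.4, '"LOC"': 0.3, '"ORG"': 0.2, "}": 0.1}


def constrained_greedy_decode(score_fn, grammar, start="start", end="done") -> str:
    """Greedy decoding restricted to the tokens allowed by the current parsing state."""
    state, output = start, []
    while state != end:
        allowed = grammar[state]
        scores = score_fn(output)
        # Only tokens allowed in this state compete; all others are masked out
        token = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        output.append(token)
        state = allowed[token]
    return "".join(output)


decoded = constrained_greedy_decode(fake_model_scores, GRAMMAR)
print(decoded)
```

Even though the stand-in model gives its highest score to the token `Sure`, that token is never allowed by the grammar, so the decoded string always has the required structure.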

The Ollama Python library offers this guided decoding functionality. Pass the structure definition to the `format` argument as a JSON schema. The best way to do this is to define Pydantic models (classes deriving from `BaseModel`) and pass the JSON schema of the top-level model, obtained with the `model_json_schema()` method.
More details at: https://ollama.com/blog/structured-outputs

In the following cells, we will show examples of structured decoding.

## Named Entity Recognition (NER) with LLMs

First, we perform LLM-based NER without structured decoding:

In [19]:
from jinja2 import Template
from ollama import Client

model_name = 'qwen2.5:7b'
ollama_url = 'http://localhost:11434'

llmclient = Client(host=ollama_url)

model_options = {
    'num_predict': 400,  # max number of tokens to predict
    'temperature': 0.1,
    'top_p': 0.9,
}

prompt_template = """Definition: A named entity mention is either a name that refers to an entity like a person (PERSON), organization (ORG), location (LOC) or an event (EVENT), or an expression denoting a date (DATE) or an amount of money (MONEY).

Extract all named entity mentions from the following text. The result should be a json list where each element is a json object of the form: {"entity": "the extracted named entity", "type": "the type of the entity"}:

Text: "{{text}}"
"""

text = """Korean cloud service provider Naver Cloud has partnered with AI chip giant Nvidia to establish a localized AI system targeting the East Asian market and expects to deliver “tangible results” within the year. The announcement was made at Nvidia's GTC 2025 in San Jose, California, during a keynote speech by Naver Cloud CEO Kim Yu-won on Thursday."""

jtemplate = Template(prompt_template)

prompt = jtemplate.render(text=text)

result = llmclient.generate(model=model_name, prompt=prompt, options=model_options)

print(result.response)

```json
[
    {"entity": "Korean", "type": "LOC"},
    {"entity": "Naver Cloud", "type": "ORG"},
    {"entity": "cloud service provider", "type": "ORG"},
    {"entity": "Nvidia", "type": "ORG"},
    {"entity": "AI chip giant", "type": "ORG"},
    {"entity": "East Asian", "type": "LOC"},
    {"entity": "California", "type": "LOC"},
    {"entity": "GTC 2025", "type": "EVENT"},
    {"entity": "San Jose", "type": "LOC"},
    {"entity": "Thursday", "type": "DATE"},
    {"entity": "Naver Cloud CEO Kim Yu-won", "type": "PERSON"}
]
```


Same as above, but with structured decoding to enforce the json structure:

In [24]:
from typing import Literal
from jinja2 import Template
from ollama import Client
from pydantic import BaseModel
from pprint import pprint

class NamedEntity(BaseModel):
  entity: str
  type: Literal["PERSON", "ORG", "LOC", "DATE", "EVENT", "MONEY"]
  

class NamedEntities(BaseModel):
  entities: list[NamedEntity]


model_name = 'qwen2.5:7b'
ollama_url = 'http://localhost:11434'

llmclient = Client(host=ollama_url)

model_options = {
    'num_predict': 400,  # max number of tokens to predict
    'temperature': 0.1,
    'top_p': 0.9,
}

prompt_template = """Definition: A named entity mention is either a name that refers to an entity like a person (PERSON), organization (ORG), location (LOC) or an event (EVENT), or an expression denoting a date (DATE) or an amount of money (MONEY).

Extract all named entity mentions from the following text. The result should be a json list where each element is a json object of the form: {"entity": "the extracted named entity", "type": "the type of the entity"}:

Text: "{{text}}"
"""

text = """Korean cloud service provider Naver Cloud has partnered with AI chip giant Nvidia to establish a localized AI system targeting the East Asian market and expects to deliver “tangible results” within the year. The announcement was made at Nvidia’s GTC 2025 in San Jose, California, during a keynote speech by Naver Cloud CEO Kim Yu-won on Thursday."""

jtemplate = Template(prompt_template)

prompt = jtemplate.render(text=text)

result = llmclient.generate(model=model_name, prompt=prompt, options=model_options, format=NamedEntities.model_json_schema())

print("LLM output string (guaranteed to be a valid json string):\n", result.response)

# We can automatically transform the json string output produced by the LLM into a
# Python object of class NamedEntities defined above
output_object = NamedEntities.model_validate_json(result.response)

print("Python object:")
pprint(output_object.entities, compact=True, sort_dicts=False)

LLM output string (guaranteed to be a valid json string):
 {
    "entities": [
        {"entity": "Korean", "type": "LOC"},
        {"entity": "Naver Cloud", "type": "ORG"},
        {"entity": "cloud", "type": "LOC"}, {"entity": "Nvidia", "type": "ORG"},
        {"entity": "AI chip", "type": "MONEY"}, {"entity": "California", "type": "LOC"},
        {"entity": "GTC 2025", "type": "EVENT"},
        {"entity": "San Jose", "type": "LOC"},
        {"entity": "Thursday", "type": "DATE"},
        {"entity": "Naver Cloud CEO Kim Yu-won", "type": "PERSON"}
    ]
}
Python object:
[NamedEntity(entity='Korean', type='LOC'),
 NamedEntity(entity='Naver Cloud', type='ORG'),
 NamedEntity(entity='cloud', type='LOC'),
 NamedEntity(entity='Nvidia', type='ORG'),
 NamedEntity(entity='AI chip', type='MONEY'),
 NamedEntity(entity='California', type='LOC'),
 NamedEntity(entity='GTC 2025', type='EVENT'),
 NamedEntity(entity='San Jose', type='LOC'),
 NamedEntity(entity='Thursday', type='DATE'),
 NamedEntity(ent

The same structured decoding can be requested through the OpenAI-compatible API, here by passing the Pydantic model as `response_format` to the `parse` method of the OpenAI client:

In [28]:
from typing import Literal
from jinja2 import Template
from openai import OpenAI
from pydantic import BaseModel
from pprint import pprint

class NamedEntity(BaseModel):
  entity: str
  type: Literal["PERSON", "ORG", "LOC", "DATE", "EVENT", "MONEY"]
  

class NamedEntities(BaseModel):
  entities: list[NamedEntity]


model_name = 'qwen2.5:7b'
ollama_url = 'http://localhost:11434/v1'

llmclient = OpenAI(base_url=ollama_url, api_key="EMPTY")

prompt_template = """Definition: A named entity mention is either a name that refers to an entity like a person (PERSON), organization (ORG), location (LOC) or an event (EVENT), or an expression denoting a date (DATE) or an amount of money (MONEY).

Extract all named entity mentions from the following text. The result should be a json list where each element is a json object of the form: {"entity": "the extracted named entity", "type": "the type of the entity"}:

Text: "{{text}}"
"""

text = """Korean cloud service provider Naver Cloud has partnered with AI chip giant Nvidia to establish a localized AI system targeting the East Asian market and expects to deliver “tangible results” within the year. The announcement was made at Nvidia’s GTC 2025 in San Jose, California, during a keynote speech by Naver Cloud CEO Kim Yu-won on Thursday."""

jtemplate = Template(prompt_template)

prompt = jtemplate.render(text=text)

messages = [
    {"role": "user", "content": prompt},
]

result = llmclient.chat.completions.parse(
    model=model_name, 
    messages= messages,
    max_tokens=400,
    temperature=0.1,
    top_p=0.9,
    response_format=NamedEntities)

print("LLM output string (guaranteed to be a valid json string):\n", result.choices[0].message.content)

# We can automatically transform the json string output produced by the LLM into a
# Python object of class NamedEntities defined above
output_object = NamedEntities.model_validate_json(result.choices[0].message.content)

print("Python object:")
pprint(output_object.entities, compact=True, sort_dicts=False)

LLM output string (guaranteed to be a valid json string):
 {
    "entities": [
        {"entity": "Korean", "type": "LOC"},
        {"entity": "Naver Cloud", "type": "ORG"},
        {"entity": "cloud", "type": "LOC"}, {"entity": "Nvidia", "type": "ORG"},
        {"entity": "AI chip", "type": "MONEY"}, {"entity": "California", "type": "LOC"},
        {"entity": "GTC 2025", "type": "EVENT"},
        {"entity": "San Jose", "type": "LOC"},
        {"entity": "Thursday", "type": "DATE"},
        {"entity": "Naver Cloud CEO Kim Yu-won", "type": "PERSON"}
    ]
}
Python object:
[NamedEntity(entity='Korean', type='LOC'),
 NamedEntity(entity='Naver Cloud', type='ORG'),
 NamedEntity(entity='cloud', type='LOC'),
 NamedEntity(entity='Nvidia', type='ORG'),
 NamedEntity(entity='AI chip', type='MONEY'),
 NamedEntity(entity='California', type='LOC'),
 NamedEntity(entity='GTC 2025', type='EVENT'),
 NamedEntity(entity='San Jose', type='LOC'),
 NamedEntity(entity='Thursday', type='DATE'),
 NamedEntity(ent

Note that even if we do not explicitly instruct the LLM to extract named entities and use only the text as the prompt (without any instruction), the output will still contain extracted named entities. This is because, with enforced structured decoding, the decoding algorithm can only output tokens compatible with the defined JSON structure. It is therefore forced to output the JSON keys `entities`, `entity` and `type`, which pushes the model towards selecting entity tokens. However, without the explicit instructions and entity definitions, the NER performance will probably be lower. See the following example:

In [22]:
from typing import Literal
from ollama import Client
from pydantic import BaseModel
from pprint import pprint

class NamedEntity(BaseModel):
  entity: str
  type: Literal["PERSON", "ORG", "LOC", "DATE", "EVENT", "MONEY"]
  

class NamedEntities(BaseModel):
  entities: list[NamedEntity]


model_name = 'qwen2.5:7b'
ollama_url = 'http://localhost:11434'

llmclient = Client(host=ollama_url)

model_options = {
    'num_predict': 400,  # max number of tokens to predict
    'temperature': 0.1,
    'top_p': 0.9,
}


text = """Korean cloud service provider Naver Cloud has partnered with AI chip giant Nvidia to establish a localized AI system targeting the East Asian market and expects to deliver “tangible results” within the year. The announcement was made at Nvidia’s GTC 2025 in San Jose, California, during a keynote speech by Naver Cloud CEO Kim Yu-won on Thursday."""


result = llmclient.generate(model=model_name, prompt=text, options=model_options, format=NamedEntities.model_json_schema())

print("LLM output string (guaranteed to be a valid json string):\n", result.response)

# We can automatically transform the json string output produced by the LLM into a
# Python object of class NamedEntities defined above
output_object = NamedEntities.model_validate_json(result.response)

print("Python object:")
pprint(output_object.entities, compact=True, sort_dicts=False)

LLM output string (guaranteed to be a valid json string):
 { "entities": [ { "entity": "Naver Cloud", "type": "ORG" }, { "entity": "Nvidia", "type": "ORG" }, { "entity": "AI system", "type": "ORG" }, { "entity": "East Asian market", "type": "LOC" }, { "entity": "GTC 2025", "type": "EVENT" }, { "entity": "San Jose, California", "type": "LOC" } ] }

Python object:
[NamedEntity(entity='Naver Cloud', type='ORG'),
 NamedEntity(entity='Nvidia', type='ORG'),
 NamedEntity(entity='AI system', type='ORG'),
 NamedEntity(entity='East Asian market', type='LOC'),
 NamedEntity(entity='GTC 2025', type='EVENT'),
 NamedEntity(entity='San Jose, California', type='LOC')]


Note also that the structure format is determined by the Pydantic model definition (the names of its fields/attributes and the types of values they take). This structure definition has an impact on the LLM task performance. For example:

In [37]:
from ollama import Client
from pydantic import BaseModel
from pprint import pprint

class NamedEntities(BaseModel):
  person_names_mentioned: list[str]
  organization_names_mentioned: list[str]
  location_names_mentioned: list[str]


model_name = 'qwen2.5:7b'
ollama_url = 'http://localhost:11434'

llmclient = Client(host=ollama_url)

model_options = {
    'num_predict': 400,  # max number of tokens to predict
    'temperature': 0.1,
    'top_p': 0.9,
}

text = """Korean cloud service provider Naver Cloud has partnered with AI chip giant Nvidia to establish a localized AI system targeting the East Asian market and expects to deliver “tangible results” within the year. The announcement was made at Nvidia’s GTC 2025 in San Jose, California, during a keynote speech by Naver Cloud CEO Kim Yu-won on Thursday."""

response = llmclient.chat(
  messages=[
    {
      'role': 'user',
      'content': text,
    }
  ],
  model=model_name,
  format=NamedEntities.model_json_schema(),
)

entities = NamedEntities.model_validate_json(response.message.content)
pprint(entities)

NamedEntities(person_names_mentioned=['Kim Yu-won'], organization_names_mentioned=['Naver Cloud', 'Nvidia'], location_names_mentioned=['San Jose, California', 'East Asian market'])


Another example without any explicit instruction, this time in French:

In [38]:
from ollama import Client
from pydantic import BaseModel
from pprint import pprint

class NamedEntities(BaseModel):
  person_names_mentioned: list[str]
  organization_names_mentioned: list[str]
  location_names_mentioned: list[str]


model_name = 'qwen2.5:7b'
ollama_url = 'http://localhost:11434'

llmclient = Client(host=ollama_url)

model_options = {
    'num_predict': 400,  # max number of tokens to predict
    'temperature': 0.1,
    'top_p': 0.9,
}

text = """Selon la diplomatie turque, une série de réunions trilatérales est prévue entre les Etats-Unis, la Turquie et l’Ukraine, ainsi qu’entre les deux premiers nommés et la Russie. Volodymyr Zelensky a décidé de ne pas prendre part aux négociations après avoir appris que son homologue russe, Vladimir Poutine, ne sera pas présent.
"""

response = llmclient.chat(
  messages=[
    {
      'role': 'user',
      'content': text,
    }
  ],
  model=model_name,
  format=NamedEntities.model_json_schema(),
)

entities = NamedEntities.model_validate_json(response.message.content)
pprint(entities)

NamedEntities(person_names_mentioned=['Volodymyr Zelensky', 'Vladimir Poutine'], organization_names_mentioned=['Turquie', 'États-Unis', 'Ukraine', 'Russie'], location_names_mentioned=[])


## Extraction of named entities and their attributes/properties

We can push this idea of enforcing pre-defined structures during LLM decoding further, for instance to extract not only entities but also their attributes or properties, as in this example (you can try adding other attributes/properties by adding the corresponding fields to the `Country` model):

In [36]:
from pydantic import BaseModel
from ollama import Client

# We assume that ollama has been installed and the ollama server already started (see https://ollama.com/)
ollama_url = 'http://localhost:11434'
llmclient = Client(host=ollama_url)

class Country(BaseModel):
  name: str
  capital: str
  languages: list[str]
  prime_minister: str
  population_in_millions: int
  surface_in_km2: int

model_name = 'gemma3:4b'

response = llmclient.chat(
  messages=[ {'role': 'user', 'content': 'Tell me about Canada.'} ],
  model=model_name,
  format=Country.model_json_schema(),
  options={'temperature': 0.5},  # Lower the temperature (e.g. 0) for more deterministic output
)

print("Model raw response:\n", response.message.content)

print("Structured output:")
country = Country.model_validate_json(response.message.content)
country

Model raw response:
 {
"name": "Canada",
"capital": "Ottawa",
"languages": ["English", "French"],
"prime_minister": "Justin Trudeau",
"population_in_millions": 40,
"surface_in_km2": 9984670
}

Structured output:


Country(name='Canada', capital='Ottawa', languages=['English', 'French'], prime_minister='Justin Trudeau', population_in_millions=40, surface_in_km2=9984670)