# Perform LLM Inference

This notebook demonstrates how to perform inference using LLMs.

Whereas the [previous RAG example](Perform-RAG-Inference.ipynb) used existing examples,
this notebook performs de novo inference using a schema.

Note that linkml-store is a data-first framework; the main emphasis is not on AI or LLMs. However, it does support a pluggable **Inference** framework, and one of the integrations is a simple LLM-based inference engine.

For this notebook, we will be using the command line interface, but the same can be done programmatically using the Python API.

## Loading the data into duckdb

In [50]:
%%bash
mkdir -p tmp
rm -rf tmp/up.ddb
linkml-store  -d duckdb:///tmp/up.ddb -c Entry insert ../../tests/input/uniprot/uniprot-comments.tsv

Inserted 2390 objects from ../../tests/input/uniprot/uniprot-comments.tsv into collection 'Entry'.


Let's check what this looks like by using `describe` and examining the first entry:

In [51]:
%%bash
linkml-store -d tmp/up.ddb describe

         count unique                                   top  freq
category  2390      1                                        2390
id        2390   2284                            EFC2_HUMAN     4
text      2390   1383  Could be the product of a pseudogene   259


## Load the controlled vocabulary for classification

We use a ready-made set of comment categories as the controlled vocabulary.

In [52]:
%%bash
linkml-store  -d tmp/up.ddb -c cv insert ../../tests/input/uniprot/uniprot-caution-cv.csv

Inserted 15 objects from ../../tests/input/uniprot/uniprot-caution-cv.csv into collection 'cv'.


In [53]:
%%bash
linkml-store  -d tmp/up.ddb::cv query --limit 3 -O yaml

TERM: FUNCTION_DISPUTED
DEFINITION: A caution indicating that a previously reported function has been challenged
  or disproven in subsequent studies; may warrant GO NOT annotation
EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,
  dAMP and dIMP. However, this was not confirmed in vivo
---
TERM: FUNCTION_PREDICTION_ONLY
DEFINITION: A caution indicating function is based only on computational prediction
  or sequence similarity
EXAMPLES: Predicted to be involved in X based on sequence similarity
---
TERM: FUNCTION_LACKS_EVIDENCE
DEFINITION: A caution indicating insufficient experimental evidence to support predicted
  function
EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose
  glycohydrolase activity



In [54]:
%%bash
linkml-store  -d tmp/up.ddb query --limit 3 -O yaml

TERM: FUNCTION_DISPUTED
DEFINITION: A caution indicating that a previously reported function has been challenged
  or disproven in subsequent studies; may warrant GO NOT annotation
EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,
  dAMP and dIMP. However, this was not confirmed in vivo
---
TERM: FUNCTION_PREDICTION_ONLY
DEFINITION: A caution indicating function is based only on computational prediction
  or sequence similarity
EXAMPLES: Predicted to be involved in X based on sequence similarity
---
TERM: FUNCTION_LACKS_EVIDENCE
DEFINITION: A caution indicating insufficient experimental evidence to support predicted
  function
EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose
  glycohydrolase activity



In [55]:
%%bash
linkml-store -v -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb schema

2025-02-04 11:08:15,383 - linkml_store.api.client - INFO - Initializing database: phenopackets, base_dir: /Users/cjm
2025-02-04 11:08:15,383 - linkml_store.api.client - INFO - Initializing database: mgi, base_dir: /Users/cjm
2025-02-04 11:08:15,383 - linkml_store.api.client - INFO - Initializing database: nmdc, base_dir: /Users/cjm
2025-02-04 11:08:15,383 - linkml_store.api.client - INFO - Initializing database: amigo, base_dir: /Users/cjm
2025-02-04 11:08:15,383 - linkml_store.api.client - INFO - Initializing database: gocams, base_dir: /Users/cjm
2025-02-04 11:08:15,383 - linkml_store.api.client - INFO - Initializing database: cadsr, base_dir: /Users/cjm
2025-02-04 11:08:15,384 - linkml_store.api.client - INFO - Initializing database: mixs, base_dir: /Users/cjm
2025-02-04 11:08:15,384 - linkml_store.api.client - INFO - Initializing database: mondo, base_dir: /Users/cjm
2025-02-04 11:08:15,384 - linkml_store.api.client - INFO - Initializing database: hpoa, base_dir: /Users/cjm
2025-02

name: uniprot-comments
id: http://example.org/uniprot-comments
imports:
- linkml:types
prefixes:
  linkml:
    prefix_prefix: linkml
    prefix_reference: https://w3id.org/linkml/
  up:
    prefix_prefix: up
    prefix_reference: http://example.org/tuniprot-comments
default_prefix: up
default_range: string
enums:
  CommentCategory:
    name: CommentCategory
    permissible_values:
      FUNCTION_DISPUTED:
        text: FUNCTION_DISPUTED
        description: A caution indicating that a previously reported function has
          been challenged or disproven in subsequent studies; may warrant GO NOT annotation
        examples:
        - value: FUNCTION_DISPUTED
          description: Originally described for its in vitro hydrolytic activity towards
            dGMP, dAMP and dIMP. However, this was not confirmed in vivo
      FUNCTION_PREDICTION_ONLY:
        text: FUNCTION_PREDICTION_ONLY
        description: A caution indicating function is based only on computational
          predict

## Inferring a specific field

In [58]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb -c Entry infer -t llm -T category --where "id: MOTSC_HUMAN"

2025-02-04 11:09:06,091 - linkml_store.api.client - INFO - Initializing database: phenopackets, base_dir: /Users/cjm
2025-02-04 11:09:06,092 - linkml_store.api.client - INFO - Initializing database: mgi, base_dir: /Users/cjm
2025-02-04 11:09:06,092 - linkml_store.api.client - INFO - Initializing database: nmdc, base_dir: /Users/cjm
2025-02-04 11:09:06,092 - linkml_store.api.client - INFO - Initializing database: amigo, base_dir: /Users/cjm
2025-02-04 11:09:06,092 - linkml_store.api.client - INFO - Initializing database: gocams, base_dir: /Users/cjm
2025-02-04 11:09:06,092 - linkml_store.api.client - INFO - Initializing database: cadsr, base_dir: /Users/cjm
2025-02-04 11:09:06,092 - linkml_store.api.client - INFO - Initializing database: mixs, base_dir: /Users/cjm
2025-02-04 11:09:06,092 - linkml_store.api.client - INFO - Initializing database: mondo, base_dir: /Users/cjm
2025-02-04 11:09:06,092 - linkml_store.api.client - INFO - Initializing database: hpoa, base_dir: /Users/cjm
2025-02

id: MOTSC_HUMAN
category: EXPRESSION_MECHANISM_UNCLEAR
text: This peptide has been shown to be biologically active but is the product of
  a mitochondrial gene. Usage of the mitochondrial genetic code yields tandem start
  and stop codons so translation must occur in the cytoplasm. The mechanisms allowing
  the production and secretion of the peptide remain unclear



In [63]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb -c Entry infer -t llm -T category --where "{}" -O csv -o tmp/up.predicted.csv

In [64]:
import pandas as pd

In [65]:
df = pd.read_csv("tmp/up.predicted.csv")
df

Unnamed: 0,id,category,text
0,MOTSC_HUMAN,EXPRESSION_MECHANISM_UNCLEAR,This peptide has been shown to be biologically...
1,POTB3_HUMAN,GENE_COPY_NUMBER,Maps to a duplicated region on chromosome 15; ...
2,MYO1C_HUMAN,NAMING_CONFUSION,Represents an unconventional myosin. This prot...
3,IMA4_HUMAN,NAMING_CONFUSION,Was termed importin alpha-4
4,S22A1_HUMAN,LOCALIZATION_DISPUTED,Cellular localization of OCT1 in the intestine...
...,...,...,...
95,POK9_HUMAN,,Truncated; frameshift leads to premature stop ...
96,UB2L3_HUMAN,PSEUDOGENE_STATUS,"PubMed:10760570 reported that UBE2L1, UBE2L2 a..."
97,CBX1_HUMAN,CLAIMS_RETRACTED,Was previously reported to interact with ASXL1...
98,H33_HUMAN,CLAIMS_RETRACTED,The original paper reporting lysine deaminatio...
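
The table above includes at least one row (`POK9_HUMAN`) with no predicted category, so it is worth tallying the predictions before using them. The following is a minimal sketch using only the standard library and a handful of rows copied from the output above (id and category only); on the full DataFrame, `df["category"].value_counts(dropna=False)` should give the equivalent summary.

```python
import csv
import io
from collections import Counter

# A few rows copied from the output above (id and category columns only).
sample = """id,category
MOTSC_HUMAN,EXPRESSION_MECHANISM_UNCLEAR
MYO1C_HUMAN,NAMING_CONFUSION
IMA4_HUMAN,NAMING_CONFUSION
POK9_HUMAN,
CBX1_HUMAN,CLAIMS_RETRACTED
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Count predictions per category, grouping empty predictions under one label.
counts = Counter(row["category"] or "(unclassified)" for row in rows)

# Collect the ids that the engine left without a category.
unclassified = [row["id"] for row in rows if not row["category"]]

print(counts.most_common())
print(unclassified)
```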


The RAG engine works by first indexing the countries collection by embedding each entry. The top N results matching the query are fetched and used as *context* for the LLM query.

Note that in this particular case, we have a very small collection of twenty entries, and it's not even necessary to perform RAG at all, as the entire collection can easily fit within the context window of the LLM query. However, this small set is useful for demo purposes.
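
The retrieval step described above can be sketched in a few lines. This is an illustration of the general technique (rank indexed entries by cosine similarity to the query embedding, keep the top N, and paste them into the prompt), not linkml-store's actual implementation. The three-dimensional vectors, the `Brazil` entry, and the prompt wording are hypothetical; real embeddings come from an embedding model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vec, indexed, n=2):
    """Return the n objects whose embeddings are most similar to the query."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [obj for obj, _ in ranked[:n]]

# Toy index of (object, embedding) pairs; the vectors are made up.
indexed = [
    ({"name": "Argentina", "continent": "South America"}, [0.9, 0.1, 0.0]),
    ({"name": "South Korea", "continent": "Asia"}, [0.0, 0.9, 0.1]),
    ({"name": "Brazil", "continent": "South America"}, [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded query "name: Uruguay"

# The retrieved objects become the context section of the LLM prompt.
context = top_n(query_vec, indexed, n=2)
prompt = "Examples:\n" + "\n".join(str(obj) for obj in context) + "\nComplete: {name: Uruguay}"
print(prompt)
```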

## Inferring a whole object

In [6]:
%%bash
linkml-store  -d duckdb:///tmp/countries.ddb -c countries infer -t rag -q "name: Uruguay"

predicted_object:
  capital: Montevideo
  code: UY
  continent: South America
  languages:
  - Spanish



## Inferring from multiple fields

In [27]:
%%bash
linkml-store  -d duckdb:///tmp/countries.ddb -c countries infer -t rag  -q "{continent: South America, languages: [Dutch]}"

predicted_object:
  capital: Paramaribo
  code: SR
  name: Suriname



## RAG configuration - using a different model

The datasette llm framework is used under the hood. This means that you can use the `llm` command to list the available models and configurations, as well as install new ones.

In [28]:
%%bash
llm models

OpenAI Chat: gpt-3.5-turbo (aliases: 3.5, chatgpt)
OpenAI Chat: gpt-3.5-turbo-16k (aliases: chatgpt-16k, 3.5-16k)
OpenAI Chat: gpt-4 (aliases: 4, gpt4)
OpenAI Chat: gpt-4-32k (aliases: 4-32k)
OpenAI Chat: gpt-4-1106-preview
OpenAI Chat: gpt-4-0125-preview
OpenAI Chat: gpt-4-turbo-2024-04-09
OpenAI Chat: gpt-4-turbo (aliases: gpt-4-turbo-preview, 4-turbo, 4t)
OpenAI Chat: gpt-4o (aliases: 4o)
OpenAI Chat: gpt-4o-mini (aliases: 4o-mini)
OpenAI Completion: gpt-3.5-turbo-instruct (aliases: 3.5-instruct, chatgpt-instruct)
OpenAI Chat: gpt-4-vision-preview (aliases: 4V, gpt-4-vision)
OpenAI Chat: litellm-mixtral
OpenAI Chat: litellm-llama3
OpenAI Chat: litellm-llama3-chatqa
OpenAI Chat: litellm-groq-mixtral
OpenAI Chat: litellm-groq-llama
OpenAI Chat: gpt-4o-2024-05-13 (aliases: 4o, gpt-4o)
OpenAI Chat: lbl/llama-3
OpenAI Chat: lbl/claude-opus
OpenAI Chat: lbl/claude-sonnet
OpenAI Chat: lbl/gpt-4o
OpenAI Chat: lbl/llama-3
Anthropic Messages: claude-3-opus-20240229 (aliases: claude-3-opus)
An

We'll try `claude-3-haiku`, a small model. This may not be powerful enough for extraction tasks, but general knowledge about countries should be within its capabilities.

In [29]:
%%bash
linkml-store  -d duckdb:///tmp/countries.ddb -c countries infer -t rag:llm_config.model_name=claude-3-haiku -q "name: Uruguay" 

predicted_object:
  capital: Montevideo
  code: UY
  continent: South America
  languages:
  - Spanish
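
The `-t` value in the command above combines an engine name (`rag`) with a dotted configuration override (`llm_config.model_name=claude-3-haiku`). As a rough illustration of that syntax only, here is one way such a string could be split into a name and a nested configuration dictionary. The function `parse_engine_spec` is hypothetical, and the comma-separated handling of multiple overrides is an assumption; this is not how linkml-store itself parses the option.

```python
def parse_engine_spec(spec):
    """Split 'name:a.b=value' into ('name', {'a': {'b': 'value'}})."""
    name, _, overrides = spec.partition(":")
    config = {}
    # Assumption: multiple overrides, if any, are separated by commas.
    for item in filter(None, overrides.split(",")):
        path, _, value = item.partition("=")
        *parents, leaf = path.split(".")
        node = config
        for key in parents:
            # Walk (and create) the nested dictionaries along the dotted path.
            node = node.setdefault(key, {})
        node[leaf] = value
    return name, config

name, config = parse_engine_spec("rag:llm_config.model_name=claude-3-haiku")
print(name, config)
```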



## Persisting the RAG model

In [44]:
%%bash
linkml-store  -d duckdb:///tmp/countries.ddb -c countries infer -t rag -q "name: Uruguay" -E tmp/countries.rag.json

predicted_object:
  capital: Montevideo
  code: UY
  continent: South America
  languages:
  - Spanish



In [45]:
%%bash
ls -l tmp/countries.rag.json

-rw-r--r--  1 cjm  staff  498212 Aug 21 16:05 tmp/countries.rag.json


In [46]:
%%bash
linkml-store  -d duckdb:///tmp/countries.ddb -c countries infer -t rag -q "name: Uruguay" -L tmp/countries.rag.json

predicted_object:
  capital: Montevideo
  code: UY
  continent: South America
  languages:
  - Spanish



## Evaluation

In [55]:
%%bash
linkml-store  -d duckdb:///tmp/countries.ddb -c countries infer -t rag  -T languages -T code -F name -n 5

Outcome: true_positive_count=5.0 total_count=5 // accuracy: 1.0
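
The reported accuracy is the true positive count divided by the total count (5 / 5 = 1.0 in the run above). A minimal sketch of that calculation follows. It assumes a prediction counts as a hit only when every target field matches; whether linkml-store scores per object or per field is not stated here, so treat that as an assumption. The `evaluate` function is hypothetical and the second prediction is deliberately wrong, for illustration.

```python
def evaluate(predicted, expected, target_fields):
    """Return (hits, total, accuracy), where a hit means all target fields match."""
    hits = sum(
        all(pred.get(field) == exp.get(field) for field in target_fields)
        for pred, exp in zip(predicted, expected)
    )
    total = len(expected)
    return hits, total, hits / total

# Held-out objects with known values for the target fields.
expected = [
    {"name": "Uruguay", "code": "UY", "languages": ["Spanish"]},
    {"name": "Suriname", "code": "SR", "languages": ["Dutch"]},
]
# Hypothetical predictions: the second one gets the language wrong on purpose.
predicted = [
    {"code": "UY", "languages": ["Spanish"]},
    {"code": "SR", "languages": ["English"]},
]

hits, total, acc = evaluate(predicted, expected, ["languages", "code"])
print(f"true_positive_count={hits} total_count={total} // accuracy: {acc}")
```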


## How RAG indexing works under the hood

Behind the scenes, whenever you use the RAG inference engine, a separate collection is automatically created to hold the training split (named `countries__rag_train` in the listing below), and an index over that collection is created in the same database. This is true regardless of the database backend (DuckDB, MongoDB, etc.).

(Note: if you are using an in-memory DuckDB instance, the index is discarded after each run, which
could get expensive if you have a large collection.)

Let's examine our database to see the new collection and index. We will use the Jupyter SQL magic to query the database.

In [38]:
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [39]:
%%bash
cp tmp/countries.ddb tmp/countries-copy.ddb

In [40]:
%sql duckdb:///tmp/countries-copy.ddb

In [41]:
%%sql
SELECT * FROM information_schema.tables

Unnamed: 0,table_catalog,table_schema,table_name,table_type,self_referencing_column_name,reference_generation,user_defined_type_catalog,user_defined_type_schema,user_defined_type_name,is_insertable_into,is_typed,commit_action,TABLE_COMMENT
0,countries-copy,main,countries,BASE TABLE,,,,,,YES,NO,,
1,countries-copy,main,countries__rag_train,BASE TABLE,,,,,,YES,NO,,
2,countries-copy,main,internal__index__countries__rag_train__llm,BASE TABLE,,,,,,YES,NO,,


In [42]:
%%sql
select * from internal__index__countries__rag_train__llm limit 5

Unnamed: 0,name,code,capital,continent,languages,__index__
0,Argentina,AR,Buenos Aires,South America,[Spanish],"[-0.009016353, 0.02336632, 0.007532564, -0.008..."
1,South Korea,KR,Seoul,Asia,[Korean],"[3.8781454e-05, 0.013463534, 0.017664365, -0.0..."
2,United States,US,"Washington, D.C.",North America,[English],"[-0.0077237985, 0.016569635, -0.0042663547, -0..."
3,Nigeria,NG,Abuja,Africa,[English],"[-0.0055540577, 0.0037728157, -0.003473751, -0..."
4,India,IN,New Delhi,Asia,"[Hindi, English]","[-0.0031975685, 0.025214365, 0.002862445, 0.00..."


In [43]:
%%sql
select count(*) from internal__index__countries__rag_train__llm

Unnamed: 0,count_star()
0,14


In [25]:
%%sql
select count(*) from countries

Unnamed: 0,count_star()
0,20


## Configuring the training/test split

By default, the `infer` command splits the data in your collection into a training set and a test set. This is useful for evaluation, but if you want to use the entire dataset, or you want to configure the split size, you can use `--training-test-data-split` (`-S`).


In [37]:
%%bash
linkml-store  -d duckdb:///tmp/countries.ddb -c countries infer -t rag -S 1.0 0.0 -q "name: Uruguay" 

predicted_object:
  capital: Montevideo
  code: UY
  continent: South America
  languages:
  - Spanish
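
To make the effect of the split concrete, here is a small sketch of a fractional train/test split. The index examined earlier held 14 of the 20 countries, which would be consistent with a 70/30 default, although that ratio is inferred from the counts rather than stated. The `split_train_test` function, the shuffling, and the seed are assumptions made for illustration, not linkml-store's implementation.

```python
import random

def split_train_test(rows, train_fraction, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) by the given fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = [{"id": i} for i in range(20)]  # stand-in for a 20-row collection

# A 70/30 split: 14 rows are indexed for retrieval, 6 are held out.
train, test = split_train_test(rows, 0.7)
print(len(train), len(test))

# The equivalent of "-S 1.0 0.0": every row goes into training, none are held out.
all_train, no_test = split_train_test(rows, 1.0)
print(len(all_train), len(no_test))
```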



## Extraction tasks

We can also use this engine for *extraction tasks* - this involves extracting structured data or knowledge from
textual or unstructured data.

In fact, we don't need any new capabilities here - extraction can just be seen as a special case of inference,
where the feature set includes or is restricted to text, and the target set is the whole object.

We can demonstrate this with a simple zero-shot example:

In [53]:
%%bash
echo '{text: I saw the cat sitting on the mat, subject: cat, predicate: sits-on, object: mat}' > tmp/extraction-examples.yaml

In [54]:
%%bash
linkml-store -i tmp/extraction-examples.yaml infer -t rag -q "text: the Earth rotates around the Sun"

predicted_object:
  object: Sun
  predicate: rotates-around
  subject: Earth
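
Framing extraction as inference amounts to choosing which fields are features and which are targets. The sketch below partitions the example object from `tmp/extraction-examples.yaml` with `text` as the only feature, so everything else becomes the target to be predicted. The helper `split_features_targets` is hypothetical and only illustrates the framing.

```python
def split_features_targets(obj, feature_fields):
    """Partition an object into (features, targets) by field name."""
    features = {k: v for k, v in obj.items() if k in feature_fields}
    targets = {k: v for k, v in obj.items() if k not in feature_fields}
    return features, targets

# The example object written to tmp/extraction-examples.yaml above.
example = {
    "text": "I saw the cat sitting on the mat",
    "subject": "cat",
    "predicate": "sits-on",
    "object": "mat",
}

# With text as the sole feature, the remaining fields are what the engine must infer.
features, targets = split_features_targets(example, {"text"})
print(features)
print(targets)
```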

