# Perform LLM Inference

This notebook demonstrates how to perform inference using LLMs.

Whereas the [previous RAG example](Perform-RAG-Inference.ipynb) used existing examples,
this one performs de novo inference using a schema.

Note that linkml-store is a data-first framework; its main emphasis is not on AI or LLMs. However, it does support a pluggable **Inference** framework, and one of the integrations is a simple LLM-based inference engine.

For this notebook, we will be using the command line interface, but the same can be done programmatically using the Python API.

## Loading the data into DuckDB

For this we will take all UniProt "caution" free-text comments for human proteins and load them into a DuckDB database.

In [1]:
%%bash
mkdir -p tmp
rm -rf tmp/up.ddb
linkml-store  -d duckdb:///tmp/up.ddb -c Entry insert ../../tests/input/uniprot/uniprot-comments.tsv

Inserted 2390 objects from ../../tests/input/uniprot/uniprot-comments.tsv into collection 'Entry'.


Let's check what this looks like by using `describe`, which prints summary statistics for each column:

In [2]:
%%bash
linkml-store -d tmp/up.ddb describe

         count unique                                   top  freq
category  2390      1                                        2390
id        2390   2284                            EFC2_HUMAN     4
text      2390   1383  Could be the product of a pseudogene   259
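The columns of this summary (`count`, `unique`, `top`, `freq`) appear to follow the convention of pandas' `describe` for non-numeric data: the number of values, the number of distinct values, the most frequent value, and how often it occurs. Read this way, the `id` row says that some entries have more than one caution comment (2284 distinct IDs over 2390 rows), and the `category` row has a single value, shown as blank. The following is a minimal sketch of how such statistics can be computed; the `summarize` helper and the toy IDs are hypothetical and are not part of linkml-store.

```python
from collections import Counter


def summarize(values):
    """Return count/unique/top/freq for a list of hashable values."""
    counts = Counter(values)
    # most_common(1) yields the single most frequent (value, frequency) pair
    top, freq = counts.most_common(1)[0]
    return {
        "count": len(values),
        "unique": len(counts),
        "top": top,
        "freq": freq,
    }


# Toy data (hypothetical IDs), with one ID repeated
toy_ids = ["A_HUMAN", "B_HUMAN", "A_HUMAN", "C_HUMAN"]
summary = summarize(toy_ids)
print(summary)
```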


## Introspecting the schema

Here we will use a ready-made LinkML schema in which the categories we want to assign are defined as a LinkML enum, with
examples (examples in schemas help humans *and* LLMs).

In [4]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb schema

name: uniprot-comments
id: http://example.org/uniprot-comments
imports:
- linkml:types
prefixes:
  linkml:
    prefix_prefix: linkml
    prefix_reference: https://w3id.org/linkml/
  up:
    prefix_prefix: up
    prefix_reference: http://example.org/tuniprot-comments
default_prefix: up
default_range: string
enums:
  CommentCategory:
    name: CommentCategory
    permissible_values:
      FUNCTION_DISPUTED:
        text: FUNCTION_DISPUTED
        description: A caution indicating that a previously reported function has
          been challenged or disproven in subsequent studies; may warrant GO NOT annotation
        examples:
        - value: FUNCTION_DISPUTED
          description: Originally described for its in vitro hydrolytic activity towards
            dGMP, dAMP and dIMP. However, this was not confirmed in vivo
      FUNCTION_PREDICTION_ONLY:
        text: FUNCTION_PREDICTION_ONLY
        description: A caution indicating function is based only on computational
          predict

In [52]:
%%bash
linkml-store  -d tmp/up.ddb -c cv insert ../../tests/input/uniprot/uniprot-caution-cv.csv

Inserted 15 objects from ../../tests/input/uniprot/uniprot-caution-cv.csv into collection 'cv'.


In [53]:
%%bash
linkml-store  -d tmp/up.ddb::cv query --limit 3 -O yaml

TERM: FUNCTION_DISPUTED
DEFINITION: A caution indicating that a previously reported function has been challenged
  or disproven in subsequent studies; may warrant GO NOT annotation
EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,
  dAMP and dIMP. However, this was not confirmed in vivo
---
TERM: FUNCTION_PREDICTION_ONLY
DEFINITION: A caution indicating function is based only on computational prediction
  or sequence similarity
EXAMPLES: Predicted to be involved in X based on sequence similarity
---
TERM: FUNCTION_LACKS_EVIDENCE
DEFINITION: A caution indicating insufficient experimental evidence to support predicted
  function
EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose
  glycohydrolase activity
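The `TERM`, `DEFINITION`, and `EXAMPLES` columns of this controlled vocabulary line up with the `text`, `description`, and `examples` fields of the enum's permissible values shown in the schema above. As a rough sketch of that correspondence (the helper name `cv_rows_to_permissible_values` is hypothetical, and this is not necessarily how the schema was actually produced), the rows could be mapped like this:

```python
def cv_rows_to_permissible_values(rows):
    """Map CV rows (TERM/DEFINITION/EXAMPLES) to LinkML-style permissible values."""
    permissible_values = {}
    for row in rows:
        term = row["TERM"]
        permissible_values[term] = {
            "text": term,
            "description": row["DEFINITION"],
            # Mirrors the schema layout above: one example per term
            "examples": [{"value": term, "description": row["EXAMPLES"]}],
        }
    return permissible_values


# One row copied from the query output above
cv_rows = [
    {
        "TERM": "FUNCTION_PREDICTION_ONLY",
        "DEFINITION": "A caution indicating function is based only on "
        "computational prediction or sequence similarity",
        "EXAMPLES": "Predicted to be involved in X based on sequence similarity",
    }
]

pvs = cv_rows_to_permissible_values(cv_rows)
print(pvs["FUNCTION_PREDICTION_ONLY"]["examples"][0]["description"])
```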



In [54]:
%%bash
linkml-store  -d tmp/up.ddb query --limit 3 -O yaml

TERM: FUNCTION_DISPUTED
DEFINITION: A caution indicating that a previously reported function has been challenged
  or disproven in subsequent studies; may warrant GO NOT annotation
EXAMPLES: Originally described for its in vitro hydrolytic activity towards dGMP,
  dAMP and dIMP. However, this was not confirmed in vivo
---
TERM: FUNCTION_PREDICTION_ONLY
DEFINITION: A caution indicating function is based only on computational prediction
  or sequence similarity
EXAMPLES: Predicted to be involved in X based on sequence similarity
---
TERM: FUNCTION_LACKS_EVIDENCE
DEFINITION: A caution indicating insufficient experimental evidence to support predicted
  function
EXAMPLES: In contrast to other Macro-domain containing proteins, lacks ADP-ribose
  glycohydrolase activity




## Inferring a specific field

In [5]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb -c Entry infer -t llm -T category --where "id: MOTSC_HUMAN"

id: MOTSC_HUMAN
category: EXPRESSION_MECHANISM_UNCLEAR
text: This peptide has been shown to be biologically active but is the product of
  a mitochondrial gene. Usage of the mitochondrial genetic code yields tandem start
  and stop codons so translation must occur in the cytoplasm. The mechanisms allowing
  the production and secretion of the peptide remain unclear
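Because the target field is an enum, it can be worth checking that an inferred value is actually one of the permissible values. The sketch below is one way to do that in plain Python; the `check_prediction` helper is hypothetical, and `KNOWN_CATEGORIES` lists only the category names that appear in this notebook, which is assumed to be a subset of the full enum.

```python
# Category names that appear in this notebook (assumed to be a subset of the enum)
KNOWN_CATEGORIES = {
    "FUNCTION_DISPUTED",
    "FUNCTION_PREDICTION_ONLY",
    "FUNCTION_LACKS_EVIDENCE",
    "EXPRESSION_MECHANISM_UNCLEAR",
    "GENE_COPY_NUMBER",
    "NAMING_CONFUSION",
    "LOCALIZATION_DISPUTED",
    "PSEUDOGENE_STATUS",
    "CLAIMS_RETRACTED",
}


def check_prediction(row, allowed):
    """Classify a row's inferred category as 'ok', 'missing', or 'unexpected'."""
    value = row.get("category")
    if value is None or value == "":
        return "missing"
    return "ok" if value in allowed else "unexpected"


# The inferred row shown above (text omitted for brevity)
inferred = {"id": "MOTSC_HUMAN", "category": "EXPRESSION_MECHANISM_UNCLEAR"}
status = check_prediction(inferred, KNOWN_CATEGORIES)
print(status)
```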



## Inferring all rows

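Here we pass an empty `--where` clause (`{}`), which matches every row in the collection, and send the results through the inference engine. The predictions are written to a CSV file, which we then load with pandas.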

In [63]:
%%bash
linkml-store -S ../../tests/input/uniprot/schema.yaml  -d tmp/up.ddb -c Entry infer -t llm -T category --where "{}" -O csv -o tmp/up.predicted.csv

In [6]:
import pandas as pd

In [7]:
df = pd.read_csv("tmp/up.predicted.csv")
df

| Unnamed: 0 | id | category | text |
|---|---|---|---|
| 0 | MOTSC_HUMAN | EXPRESSION_MECHANISM_UNCLEAR | This peptide has been shown to be biologically... |
| 1 | POTB3_HUMAN | GENE_COPY_NUMBER | Maps to a duplicated region on chromosome 15; ... |
| 2 | MYO1C_HUMAN | NAMING_CONFUSION | Represents an unconventional myosin. This prot... |
| 3 | IMA4_HUMAN | NAMING_CONFUSION | Was termed importin alpha-4 |
| 4 | S22A1_HUMAN | LOCALIZATION_DISPUTED | Cellular localization of OCT1 in the intestine... |
| ... | ... | ... | ... |
| 95 | POK9_HUMAN | | Truncated; frameshift leads to premature stop ... |
| 96 | UB2L3_HUMAN | PSEUDOGENE_STATUS | PubMed:10760570 reported that UBE2L1, UBE2L2 a... |
| 97 | CBX1_HUMAN | CLAIMS_RETRACTED | Was previously reported to interact with ASXL1... |
| 98 | H33_HUMAN | CLAIMS_RETRACTED | The original paper reporting lysine deaminatio... |
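With the predictions in a DataFrame, a natural next step is to see how the categories are distributed and which rows received no category (row 95 in the output above has an empty `category`). The sketch below uses a handful of `id`/`category` pairs copied from that output rather than the full file, so the counts are illustrative only; the same calls should work on `df`.

```python
import io

import pandas as pd

# A few (id, category) pairs copied from the output above; not the full file
sample_csv = """id,category
MOTSC_HUMAN,EXPRESSION_MECHANISM_UNCLEAR
MYO1C_HUMAN,NAMING_CONFUSION
IMA4_HUMAN,NAMING_CONFUSION
POK9_HUMAN,
CBX1_HUMAN,CLAIMS_RETRACTED
H33_HUMAN,CLAIMS_RETRACTED
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Rows for which no category was assigned (empty CSV fields are read as NaN)
unassigned = sample[sample["category"].isna()]

# Frequency of each predicted category; missing values are excluded by default
category_counts = sample["category"].value_counts()

print(unassigned["id"].tolist())
print(category_counts.to_dict())
```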
