# Tutorial: Using LinkML-Store via the Command Line Interface

This tutorial walks through usage of LinkML-Store via the Command Line Interface (CLI).

This tutorial is a Jupyter notebook: it can be executed in a notebook environment,
or you can try it for yourself by running the commands directly on the command line.

Note that `%%bash` is a directive for Jupyter itself; you don't need to type it when running the commands in a terminal.

## Top level command

The top level command is `linkml-store`. This command doesn't do anything by itself; instead, it provides various *subcommands*.

The `linkml-store` command has a few *global options* for specifying the configuration, database, and collection.

In [67]:
%%bash
linkml-store --help

Usage: linkml-store [OPTIONS] COMMAND [ARGS]...

  A CLI for interacting with the linkml-store.

Options:
  -d, --database TEXT             Database name
  -c, --collection TEXT           Collection name
  -C, --config PATH               Path to the configuration file
  -v, --verbose
  -q, --quiet / --no-quiet
  --stacktrace / --no-stacktrace  If set then show full stacktrace on error
                                  [default: no-stacktrace]
  --help                          Show this message and exit.

Commands:
  fq                Query facets from the specified collection.
  index             Create an index over a collection.
  indexes
  insert            Insert objects from files (JSON, YAML, TSV) into the...
  list-collections
  query             Query objects from the specified collection.
  schema            Show the schema for a database
  search            Search objects in the specified collection.
  validate          Validate objects in the specified collection.


## Inserting objects from a file

Next we'll explore the `insert` command:

In [68]:
%%bash
linkml-store --stacktrace insert --help

Usage: linkml-store insert [OPTIONS] [FILES]...

  Insert objects from files (JSON, YAML, TSV) into the specified collection.

Options:
  -f, --format [json|jsonl|yaml|tsv|csv]
                                  Input format
  -i, --object TEXT               Input object as YAML
  --help                          Show this message and exit.


We'll insert a small test file (in JSON Lines format) into a fresh database.
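
For readers unfamiliar with JSON Lines: the format stores one JSON object per line. The sketch below writes and re-reads such a file using only the Python standard library. The two records are shaped like the country records used later in this tutorial, and the file name `example.jsonl` is hypothetical; it is not the test file itself.

```python
import json
import tempfile
from pathlib import Path

# Two illustrative records shaped like the country objects in this tutorial.
records = [
    {"name": "United Kingdom", "code": "GB", "capital": "London",
     "continent": "Europe", "languages": ["English"]},
    {"name": "Canada", "code": "CA", "capital": "Ottawa",
     "continent": "North America", "languages": ["English", "French"]},
]

# Write one JSON object per line (hypothetical file in a temp directory).
path = Path(tempfile.mkdtemp()) / "example.jsonl"
with path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back, skipping any blank lines.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), "objects loaded")
```

Because each line is an independent JSON document, such files can be appended to and streamed without parsing the whole file at once.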

To make sure we have a fresh setup, we'll create a temporary directory `tmp` (if it doesn't already exist)
and remove any existing copy of the database we intend to create.

We'll then insert the objects:

In [69]:
%%bash
mkdir -p tmp
rm -rf tmp/countries.db
linkml-store --database duckdb:///tmp/countries.db --collection countries insert ../../tests/input/countries/countries.jsonl

Inserted 20 objects from ../../tests/input/countries/countries.jsonl into collection 'countries'.


Note that the `--database` and `--collection` options come *before* the `insert` subcommand.
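
The ordering rule is typical of CLIs where global options belong to the top-level command and each subcommand has its own options. As an illustration only, here is a minimal sketch of that layout using Python's standard `argparse`; the program name `example-store` is hypothetical, and this is not how LinkML-Store's own CLI is implemented.

```python
import argparse

# Global options are declared on the top-level parser ...
parser = argparse.ArgumentParser(prog="example-store")
parser.add_argument("-d", "--database")
parser.add_argument("-c", "--collection")

# ... while each subcommand gets its own parser and arguments.
subcommands = parser.add_subparsers(dest="command", required=True)
insert = subcommands.add_parser("insert")
insert.add_argument("files", nargs="*")

# Global options are given before the subcommand name.
args = parser.parse_args([
    "--database", "duckdb:///tmp/countries.db",
    "--collection", "countries",
    "insert", "countries.jsonl",
])
print(args.command, args.database, args.collection, args.files)
```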

With LinkML-Store, every object must go into a collection, so we specified `countries` as the collection name.

## Querying

Let's query for all objects that have `code="GB"` and get the results back as CSV:

In [70]:
%%bash
linkml-store --database duckdb:///tmp/countries.db -c countries query -w "code: GB" -O csv

name,code,capital,continent,languages
United Kingdom,GB,London,Europe,['English']
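
Conceptually, a where clause such as `code: GB` is an exact-match filter over the stored objects. The following sketch mimics that idea over an in-memory list and renders the hits as CSV. It assumes simple equality matching and is not LinkML-Store's query engine; the helper name `matches` is made up for this example.

```python
import csv
import io

countries = [
    {"name": "United Kingdom", "code": "GB", "capital": "London",
     "continent": "Europe", "languages": ["English"]},
    {"name": "Canada", "code": "CA", "capital": "Ottawa",
     "continent": "North America", "languages": ["English", "French"]},
]

def matches(obj, where):
    """Return True if every key in `where` has an equal value in `obj`."""
    return all(obj.get(key) == value for key, value in where.items())

hits = [c for c in countries if matches(c, {"code": "GB"})]

# Render the hits as CSV; list values are written using their str() form.
buffer = io.StringIO()
fieldnames = ["name", "code", "capital", "continent", "languages"]
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(hits)
print(buffer.getvalue())
```

Note how the multivalued `languages` slot ends up as a single stringified list cell, which is why CSV is convenient for viewing results but lossy for nested data.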


## Facet Counts

You can combine any query (including an empty query, for fetching the whole database) with a *facet query*, which fetches
counts of objects broken down by one or more specified slots.

In [71]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries fq -S continent

{
  "continent": {
    "Asia": 5,
    "Europe": 5,
    "Africa": 3,
    "North America": 3,
    "Oceania": 2,
    "South America": 2
  }
}


Remember that this is a deliberately reduced test dataset, so we don't expect to see every country here!
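
A facet count is essentially a group-and-count over one slot. As a rough sketch (not the store's implementation, which can push this work down to the backend), the same idea over a handful of in-memory records looks like this; the records are a small subset chosen for illustration.

```python
import json
from collections import Counter

# A small illustrative subset of country records (name and continent only).
countries = [
    {"name": "Canada", "continent": "North America"},
    {"name": "United States", "continent": "North America"},
    {"name": "Mexico", "continent": "North America"},
    {"name": "South Africa", "continent": "Africa"},
    {"name": "Argentina", "continent": "South America"},
    {"name": "United Kingdom", "continent": "Europe"},
]

def facet_counts(objects, slot):
    """Count objects per distinct value of `slot`, most frequent first."""
    counts = Counter(obj[slot] for obj in objects if obj.get(slot) is not None)
    return dict(counts.most_common())

facets = {"continent": facet_counts(countries, "continent")}
print(json.dumps(facets, indent=2))
```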

## Search

LinkML-Store is intended to allow for a flexible range of *search strategies*. Some of these may come from the underlying data store
(for example, Solr and Elasticsearch are both backed by Lucene indexing), while others may be integrated orthogonally.

A key search mechanism that is supported is *text embedding* via *Large Language Models (LLMs)*. Note that LLM-based embeddings are not enabled by default.

Currently the default mechanism (which works regardless of the underlying store) is a highly naive trigram-based vector embedding, which requires
no external model. It is intended primarily for demonstration purposes and should be swapped out for something more capable in real use.

In [72]:
%%bash
linkml-store -d duckdb:///tmp/countries.db -c countries index
linkml-store -d duckdb:///tmp/countries.db -c countries search "countries in the North where both english and french spoken" --limit 5 -O csv

score,name,code,capital,continent,languages
0.15670402880167877,Canada,CA,Ottawa,North America,"['English', 'French']"
0.14806601565681218,South Africa,ZA,Pretoria,Africa,"['Zulu', 'Xhosa', 'Afrikaans', 'English', 'Northern Sotho', 'Tswana', 'Southern Sotho', 'Tsonga', 'Swazi', 'Venda', 'Southern Ndebele']"
0.13749236361227862,United States,US,"Washington, D.C.",North America,['English']
0.09860812114511587,Argentina,AR,Buenos Aires,South America,['Spanish']
0.09765536333140983,Mexico,MX,Mexico City,North America,['Spanish']


By default, all fields in the object are indexed. Canada comes out on top because the strings for English and French are present (or rather, trigrams from those words). But remember that the default method is just for illustration!
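
To give a feel for why trigram overlap produces this kind of ranking, here is a minimal sketch of one possible trigram embedding with cosine similarity. It is an assumption-laden illustration, not LinkML-Store's actual indexer: the padding, the way each object is flattened into text, and the function names are all choices made for this example.

```python
import math
from collections import Counter

def trigrams(text):
    """Count overlapping character trigrams of the lower-cased, padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)

# Each object flattened into a single string (an assumption for this sketch).
documents = {
    "Canada": "Canada CA Ottawa North America English French",
    "Argentina": "Argentina AR Buenos Aires South America Spanish",
}

query = "countries in the North where both english and french spoken"
query_vector = trigrams(query)
scores = {name: cosine(query_vector, trigrams(text)) for name, text in documents.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The match is purely lexical: shared character sequences such as those in "north", "english" and "french" raise the score, with no understanding of meaning, which is why this approach is only suitable for demonstration.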

## Indexing using an LLM

Note for this to work, you need to have installed this package with the `llm` extra, like this:

```bash
pip install "linkml-store[llm]"
```

Or if you have this repo checked out and are using Poetry:

```bash
poetry install --all-extras
```

You will also need an OpenAI account.

If this is too much, you can just skip this section!

__TODO__

## Introspecting schemas

Note that in the above we did not explicitly specify a schema; instead, it was *induced* from the data.

We can use the `schema` command to see the induced schema in LinkML YAML:

In [73]:
%%bash
linkml-store -d duckdb:///tmp/countries.db schema

name: test-schema
id: http://example.org/test-schema
imports:
- linkml:types
prefixes:
  linkml:
    prefix_prefix: linkml
    prefix_reference: https://w3id.org/linkml/
  test_schema:
    prefix_prefix: test_schema
    prefix_reference: http://example.org/test-schema/
default_prefix: test_schema
default_range: string
classes:
  countries:
    name: countries
    attributes:
      name:
        name: name
        multivalued: false
        range: string
        required: false
      code:
        name: code
        multivalued: false
        range: string
        required: false
      capital:
        name: capital
        multivalued: false
        range: string
        required: false
      continent:
        name: continent
        multivalued: false
        range: string
        required: false
      languages:
        name: languages
        multivalued: true
        range: string
        required: false
  internal__index__countries__test:
    name: internal__index__countries__tes
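
The induced schema above marks `languages` as multivalued and gives every attribute a `string` range. A simplified sketch of how such attributes could be inferred from example objects is shown below. The rules used here (lists imply `multivalued`, Python types map to LinkML type names) are assumptions for illustration and do not reproduce LinkML-Store's actual induction logic.

```python
def range_of(value):
    """Map a Python scalar to a LinkML built-in type name (simplified)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    return "string"

def induce_attributes(objects):
    """Infer `range` and `multivalued` for each key seen in the objects."""
    attributes = {}
    for obj in objects:
        for key, value in obj.items():
            attr = attributes.setdefault(key, {"multivalued": False, "range": "string"})
            if isinstance(value, list):
                attr["multivalued"] = True
                value = value[0] if value else None
            if value is not None:
                attr["range"] = range_of(value)
    return attributes

example = {"name": "United Kingdom", "code": "GB", "capital": "London",
           "continent": "Europe", "languages": ["English"]}
induced = induce_attributes([example])
for name, attr in induced.items():
    print(name, attr)
```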

## Configuration Files and Explicit Schemas

Rather than repeat `--database` and `--collection` each time, we can make use of YAML config files.

These can also package useful information and schemas.

First we will create a fresh copy of a directory with both configuration files and schemas:

In [74]:
%%bash
cp -pr ../../tests/input/countries tmp
rm -f tmp/countries/countries.db

The configuration YAML is fairly minimal: it specifies a single database with a single collection and a pointer to a schema.

In [75]:
%%bash
cat tmp/countries/countries.config.yaml

databases:
  countries_db:
    handle: "duckdb:///{base_dir}/countries.db"
    schema_location: "{base_dir}/countries.linkml.yaml"
    collections:
      countries:
        type: Country
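
The `{base_dir}` placeholder lets the configuration refer to files relative to its own location. The sketch below shows one way such a placeholder could be expanded, under the assumption that `base_dir` means the directory containing the configuration file; the variable names are hypothetical and this is not the library's own loading code.

```python
from pathlib import Path

# Path to the configuration file used in this tutorial.
config_path = Path("tmp/countries/countries.config.yaml")

# The two templated values from the configuration shown above.
database_config = {
    "handle": "duckdb:///{base_dir}/countries.db",
    "schema_location": "{base_dir}/countries.linkml.yaml",
}

# Assumption: base_dir is the directory that contains the config file.
base_dir = config_path.parent.as_posix()
resolved = {key: value.format(base_dir=base_dir) for key, value in database_config.items()}

for key, value in resolved.items():
    print(f"{key}: {value}")
```

Under that assumption, the database file and the schema both sit next to the configuration file, which makes the whole directory easy to copy around.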


The schema itself is fairly basic: a single class (whose name matches the `type` in the configuration)
with some slots. Note that the slots have some constraints, e.g. regular-expression patterns.

In [76]:
%%bash
cat tmp/countries/countries.linkml.yaml

id: https://example.org/countries
name: countries
description: A schema for representing countries
license: https://creativecommons.org/publicdomain/zero/1.0/

prefixes:
  countries: https://example.org/countries/
  linkml: https://w3id.org/linkml/

default_prefix: countries
default_range: string

imports:
  - linkml:types

classes:
  Country:
    description: A sovereign state
    slots:
      - name
      - code
      - capital
      - continent
      - languages

slots:
  name:
    description: The name of the country
    required: true
    identifier: true
  code:
    description: The ISO 3166-1 alpha-2 code of the country
    required: true
    pattern: '^[A-Z]{2}$'
  capital:
    description: The capital city of the country
    required: true
  continent:
    description: The continent where the country is located
    required: true
  languages:
    description: The main languages spoken in the country
    range: Language
    multivalued: true

types:
  Language:
    typeof: stri

In [77]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml insert tmp/countries/countries.jsonl

Inserted 20 objects from tmp/countries/countries.jsonl into collection 'countries'.


In [78]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml list-collections

countries
name: countries
alias: null
type: Country
metadata: null
attributes: null
indexers: null
hidden: false
is_prepopulated: false


In [79]:
%%bash
linkml-store --stacktrace -C tmp/countries/countries.config.yaml -c countries query -w "code: GB" 

[
  {
    "name": "United Kingdom",
    "code": "GB",
    "capital": "London",
    "continent": "Europe",
    "languages": [
      "English"
    ]
  }
]


## Validation

LinkML-Store is designed to allow for rich validation, regardless of the underlying database store used.

For validation to work, we need to specify an explicit schema, as we have done with the configuration above.

To test it, we will insert some fake data:

In [80]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml insert --object '{name: Foolandia, code: "X Y", languages: ["Fooish"]}'

Inserted 3 objects from {name: Foolandia, code: "X Y", languages: ["Fooish"]} into collection 'countries'.


In [81]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml list-collections

countries
name: countries
alias: null
type: Country
metadata: null
attributes: null
indexers: null
hidden: false
is_prepopulated: false


In [82]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml query -w 'name: Foolandia'

[
  {
    "name": "Foolandia",
    "code": "X Y",
    "capital": null,
    "continent": null,
    "languages": [
      "Fooish"
    ]
  }
]


In [83]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml schema

name: countries
description: A schema for representing countries
id: https://example.org/countries
imports:
- linkml:types
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  countries:
    prefix_prefix: countries
    prefix_reference: https://example.org/countries/
  linkml:
    prefix_prefix: linkml
    prefix_reference: https://w3id.org/linkml/
default_prefix: countries
default_range: string
types:
  Language:
    name: Language
    description: A human language
    typeof: string
slots:
  name:
    name: name
    description: The name of the country
    identifier: true
    required: true
  code:
    name: code
    description: The ISO 3166-1 alpha-2 code of the country
    required: true
    pattern: ^[A-Z]{2}$
  capital:
    name: capital
    description: The capital city of the country
    required: true
  continent:
    name: continent
    description: The continent where the country is located
    required: true
  languages:
    name: languages
    descrip

In [84]:
%%bash
linkml-store -C tmp/countries/countries.config.yaml validate -O csv

type,severity,message,instance,instance_index,instantiates
jsonschema validation,ERROR,'X Y' does not match '^[A-Z]{2}$' in /code,"{'name': 'Foolandia', 'code': 'X Y', 'capital': None, 'continent': None, 'languages': ['Fooish']}",0,Country
jsonschema validation,ERROR,None is not of type 'string' in /capital,"{'name': 'Foolandia', 'code': 'X Y', 'capital': None, 'continent': None, 'languages': ['Fooish']}",0,Country
jsonschema validation,ERROR,None is not of type 'string' in /continent,"{'name': 'Foolandia', 'code': 'X Y', 'capital': None, 'continent': None, 'languages': ['Fooish']}",0,Country


Here we can see 3 issues with the data we added:

* the code doesn't match the regexp we provided (it has a space)
* the capital is missing
* the continent is missing
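
To make the three findings concrete, here is a small standalone sketch that applies the same two kinds of constraint from the schema (`required` and `pattern`) to the Foolandia object. It is a hand-written approximation: the real `validate` command delegates to JSON Schema validation, as the `type` column in the output indicates, and the function name `check_country` is made up for this example.

```python
import re

# Constraints copied from the countries schema shown earlier.
CODE_PATTERN = re.compile(r"^[A-Z]{2}$")
REQUIRED_SLOTS = ["name", "code", "capital", "continent"]

def check_country(obj):
    """Return a list of human-readable problems found in one object."""
    problems = []
    for slot in REQUIRED_SLOTS:
        if obj.get(slot) is None:
            problems.append(f"{slot} is missing")
    code = obj.get("code")
    if code is not None and not CODE_PATTERN.search(code):
        problems.append(f"{code!r} does not match {CODE_PATTERN.pattern!r}")
    return problems

foolandia = {"name": "Foolandia", "code": "X Y", "capital": None,
             "continent": None, "languages": ["Fooish"]}
problems = check_country(foolandia)
for problem in problems:
    print(problem)
```

A well-formed record, such as the United Kingdom entry queried earlier, yields an empty list of problems with this check.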
   