# OntoWeaver Vignette
OntoWeaver is a tool for constructing Semantic Knowledge Graphs (SKGs) from tabular data, such as CSV files. It lets users define mappings between the data and an ontology, producing a graph that can be queried and analyzed. This notebook is a step-by-step guide to building an SKG with OntoWeaver from synthetic clinical and genomic data, including single nucleotide variants (SNVs) and copy number alterations (CNAs).

## Semantic Knowledge Graphs


## Description of the data

TODO: Subset databases so that same alterations are present in all three databases, so we can connect the information in the graph.

The examples in this vignette are based on anonymized, shuffled and subsetted data from three databases:
- **Single Nucleotide Variants (SNVs)**: A database containing information about single nucleotide variants found in ovarian cancer patients.
- **Copy Number Alterations (CNAs)**: A database containing information about copy number alterations found in ovarian cancer patients.
- **Treatments (OncoKB)**: A [public database](https://www.oncokb.org/) that contains biological and clinical information about genomic alterations in cancer.

### Single Nucleotide Variants (SNVs)

### Copy Number Alterations (CNAs)

### Treatments (OncoKB)


## Set-up

### Installing dependencies

We use *Poetry* to manage dependencies and virtual environments. With Poetry available on your system, install the project's dependencies from the lock file with:

In [8]:
! poetry install

Installing dependencies from lock file

No dependencies to install or update


So far, the OntoWeaver package works with Python 3.12. If you have multiple Python versions installed, you can direct *Poetry* to use the correct one with the following command:
`! poetry env use $(which python3.12)`

### Starting the *poetry* environment

In [9]:
! eval $(poetry env activate) # new implementation of `poetry shell`

## SKG construction

### 1. Simple mapping using SNVs

We want to build a KG with a simple schema, encompassing patient IDs, the IDs of the samples they provided, the sequence variants they have, and the genes those variants are in. We first start by defining the schema of the desired graph, which would in our case look like this:

After defining the schema, we need to define the mappings between the data and the ontology. The mappings will specify how each column in the CSV files corresponds to a node or edge in the graph. This is defined in the OntoWeaver mapping files, which are YAML files that describe the structure of the data and how it should be transformed into nodes and edges in the graph.

Below we display the mapping file used to build the first example graph from the SNV database.

In [29]:
import yaml
from IPython.display import display, JSON

# Read the file content.
with open("jobim/1_Simple_mapping/snv.yaml", "r") as file:
    content = yaml.safe_load(file)

# Display the content.
display(JSON(content))

<IPython.core.display.JSON object>

OntoWeaver maps each database row by row, so the mapping file first specifies how the subject node of each row is created. Here, the `columns` keyword states that the subject ID is taken from the `patient_id` column, and the `to_subject` keyword states that the node is of type `patient`.

For each column we want to map, we must define how the value of each cell is extracted, since that value serves as the ID of the created node. This is the role of `transformers`. OntoWeaver provides a set of transformers that extract and transform data from the columns of the CSV files, for example by combining several column values or splitting concatenated values. A detailed description of the transformers can be found in the [OntoWeaver documentation](https://ontoweaver.readthedocs.io/en/latest/readme_sections/mapping_api.html#available-transformers). Users can also program their own transformers to suit their specific needs.

For simplicity, this first section uses only the `map` transformer, which extracts the data from the cells of the given column as it is. For each transformer we define the `columns` from which the data is extracted and the `to_object` keyword, which sets the type of node created from the extracted data, such as `gene` or `sample`. We also define the edge that connects the created node to the subject node, using the `via_relation` keyword.

In some cases we do not want an edge to start from the default subject of the row, but from a node created from a different column. The `from_subject` keyword specifies the node type from which the edge should start. For example, the mapping file above uses `from_subject` to create an edge of type `alteration_affects_gene` from the `alteration` node to the `gene` node, thus avoiding an edge that starts from the `patient` node, which is the default subject type for each row.
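To make the row-by-row logic concrete, here is a minimal Python sketch of what a `map`-style mapping does conceptually. It is not OntoWeaver's implementation. The column names other than `patient_id`, the relation names other than `alteration_affects_gene`, and all sample values are assumptions made for illustration.

```python
import csv
import io

# Hypothetical SNV table; column names and values are illustrative only.
SNV_CSV = """patient_id,sample_id,alteration,gene
P1,S1,V600E,BRAF
P2,S2,R175H,TP53
"""

# Each entry: (column, node type, relation, column the edge starts from).
# `None` means "start from the row's subject", which mimics the default;
# a column name mimics the effect of `from_subject`.
TRANSFORMERS = [
    ("sample_id", "sample", "patient_has_sample", None),
    ("alteration", "alteration", "patient_has_alteration", None),
    ("gene", "gene", "alteration_affects_gene", "alteration"),
]


def map_rows(text):
    """Turn each CSV row into typed nodes and (source, relation, target) edges."""
    nodes, edges = set(), []
    for row in csv.DictReader(io.StringIO(text)):
        subject = row["patient_id"]
        nodes.add((subject, "patient"))
        for column, node_type, relation, source_column in TRANSFORMERS:
            target = row[column]
            nodes.add((target, node_type))
            source = row[source_column] if source_column else subject
            edges.append((source, relation, target))
    return nodes, edges


nodes, edges = map_rows(SNV_CSV)
for edge in edges:
    print(edge)
```

In this sketch, the `gene` entry is the only one whose edge does not start at the patient, which is the behavior the `from_subject` keyword is described as providing.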

Below we display the first few rows of our initial dataset of Single Nucleotide Variants. The database contains patient IDs, sample IDs, gene names, and the SNVs found in those genes. The data is anonymized and contains a subset of the original data for the first example use case.

After having identified the general structure of the graph and the definition of the mappings, we must now define the BioCypher schema, which is a YAML file that describes the structure of the graph. The schema defines the nodes and edges in the graph, their properties, and how they are related to each other. You can find more information about the BioCypher schema in the [BioCypher documentation](https://biocypher.org/BioCypher/learn/tutorials/pandas_tutorial/#schema-configuration).

In [30]:
import yaml
from IPython.display import display, JSON

# Read the file content.
with open("jobim/1_Simple_mapping/biocypher_schema.yaml", "r") as file:
    content = yaml.safe_load(file)

# Display the content.
display(JSON(content))

<IPython.core.display.JSON object>

The mapping process is run with the OntoWeaver command-line tool, `ontoweave`. It takes the input CSV file and the mapping file as a `FILE:MAPPING` pair. The `--biocypher-config` option points to a YAML file that contains the configuration for BioCypher, while the `--biocypher-schema` option points to a YAML file that defines the schema of the graph.

More information about the OntoWeaver CLI can be found by running:

In [5]:
! ontoweave --help

INFO -- This is BioCypher v0.9.1.
INFO -- Logging into `biocypher-log/biocypher-20250625-123840.log`.
usage: ontoweave [-h] [-c FILE] [--print_config[=flags]] [-C FILE] [-s FILE]
                 [-p NB_CORES] [-i] [-r [PYTHON_MODULE ...]] [-S CHARACTER]
                 [-a {suffix,prefix,none}] [-A CHARACTER] [-E]
                 [-Ds CHARACTER] [-D]
                 FILE:MAPPING [FILE:MAPPING ...]

A command line tool to run OntoWeaver mapping adapters on a set of tabular data, and call the created BioCypher export scripts.

default config file locations:
  ['/etc/xdg/ontoweave/ontoweave.yaml', '/Users/mbaric/.config/ontoweave/ontoweave.yaml', '../ontoweaver/ontoweave.yaml'], Note: no existing default config file found.

positional arguments:
  FILE:MAPPING          Run the given YAML MAPPING to extract data from the tabular FILE (usually a CSV). Several mappings can be passed to ontoweave. You may also use the same mapping on different data files. If set to `STDIN`, 

Below is the command that runs the mapping process for the first example graph of the SNV database. `DATABASE` stands for the path to the input CSV file; as the output below shows, running the command with the placeholder left in place fails with a "file not found" error. The `-a suffix` option adds a suffix to the ID of each generated node. Each suffix represents the ontological type of the node.

In [31]:
! ontoweave DATABASE:./jobim/1_Simple_mapping/snv_1.yaml --biocypher-config ./jobim/1_Simple_mapping/biocypher_config.yaml --biocypher-schema ./jobim/1_Simple_mapping/biocypher_schema.yaml -a suffix

INFO -- This is BioCypher v0.9.1.
INFO -- Logging into `biocypher-log/biocypher-20250625-182023.log`.
ERROR:root:File `DATABASE` not found.
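The effect of the `-a` option can be pictured with a small Python sketch. The three modes (`suffix`, `prefix`, `none`) come from the help output above; the `:` separator and the function name are assumptions of this sketch, not OntoWeaver's actual code.

```python
def add_type_affix(node_id, node_type, mode="suffix", separator=":"):
    """Attach the node type to an ID; the ':' separator is an assumption."""
    if mode == "suffix":
        return f"{node_id}{separator}{node_type}"
    if mode == "prefix":
        return f"{node_type}{separator}{node_id}"
    if mode == "none":
        return node_id
    raise ValueError(f"unknown affix mode: {mode}")


print(add_type_affix("TP53", "gene"))
print(add_type_affix("TP53", "gene", mode="prefix"))
```

Affixing the type this way keeps two nodes apart when the same raw cell value appears under two different node types.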


### 2. Adding properties

It is often useful to attach additional metadata to the nodes and edges of the graph, such as the source of the data, the date of creation, or any other information that helps in understanding the context of the data. Such metadata is attached as properties.

Let's look at an extended version of the database we used in the previous section. Imagine that it has an additional column called `mutationEffectDescription`, which contains a description of the effect of a given alteration. We can use this column to add a property to the `alteration` nodes in the graph. The `map` transformer extracts the data from the `mutationEffectDescription` column, the `to_property` keyword specifies the name of the property to be created, and the `for_objects` keyword specifies the type of node or edge to which the property is attached, here `alteration`.

In [32]:
# Read the file content.
with open("jobim/2_Properties/snv.yaml", "r") as file:
    content = yaml.safe_load(file)

# Display the content.
display(JSON(content))

<IPython.core.display.JSON object>
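The property mapping described above can be pictured with the following Python sketch. It only illustrates the idea of attaching a column value as a property of a node; it is not how OntoWeaver is implemented, and the table contents are made up.

```python
import csv
import io

# Hypothetical extended SNV table; values are illustrative only.
SNV_CSV = """patient_id,alteration,mutationEffectDescription
P1,V600E,Example description of the effect
P2,R175H,Another example description
"""


def alteration_nodes_with_properties(text):
    """Return {node_id: {"type": ..., "properties": {...}}} for alteration nodes."""
    nodes = {}
    for row in csv.DictReader(io.StringIO(text)):
        node = nodes.setdefault(
            row["alteration"], {"type": "alteration", "properties": {}}
        )
        # Conceptual counterpart of `to_property` with `for_objects` set to `alteration`.
        node["properties"]["mutationEffectDescription"] = row[
            "mutationEffectDescription"
        ]
    return nodes


nodes = alteration_nodes_with_properties(SNV_CSV)
print(nodes["V600E"])
```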

In addition to defining the correct properties in the mapping file, we also need to define the properties in the BioCypher schema. The schema defines the properties that can be attached to the nodes and edges in the graph, as well as their data types. In this case, we want to add a property called `mutationEffectDescription` to the `alteration` nodes, which will be of type `string`. We achieve this by adding a `properties` section to the schema file, which specifies the name of the property and its data type.

In [33]:
# Read the file content.
with open("jobim/2_Properties/biocypher_schema.yaml", "r") as file:
    content = yaml.safe_load(file)

# Display the content.
display(JSON(content))

<IPython.core.display.JSON object>
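Because a property has to be declared in both the mapping and the schema, it can help to check that the two agree. The sketch below shows one way to do such a check in plain Python. The in-memory schema layout, the type name `str`, and the helper function are assumptions made for illustration; this is not BioCypher's own validation, and the exact schema syntax should be taken from the BioCypher documentation.

```python
# Hypothetical in-memory version of a schema entry (layout and type names
# are assumptions of this sketch).
SCHEMA = {
    "alteration": {
        "represented_as": "node",
        "properties": {"mutationEffectDescription": "str"},
    }
}

# Map schema type names to Python types for the check below.
TYPE_NAMES = {"str": str, "int": int, "float": float, "bool": bool}


def property_problems(node_type, properties, schema=SCHEMA):
    """List properties that are undeclared or have an unexpected type."""
    declared = schema.get(node_type, {}).get("properties", {})
    problems = []
    for name, value in properties.items():
        if name not in declared:
            problems.append(f"{name}: not declared for {node_type}")
        elif not isinstance(value, TYPE_NAMES[declared[name]]):
            problems.append(f"{name}: expected {declared[name]}")
    return problems


print(property_problems("alteration", {"mutationEffectDescription": "some text"}))
print(property_problems("alteration", {"source": "lab"}))
```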

### 3. Multiple databases and additional transformers


Often, the information you wish to integrate is spread across several databases. In this case, you can use OntoWeaver to map multiple databases into a single graph. This is done by defining one mapping file per database and then running the OntoWeaver CLI once, with all the adapters together.
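As the help output states, several `FILE:MAPPING` pairs can be passed to `ontoweave` in one call. The following Python sketch assembles such a command line. The options are the ones used earlier in this vignette, while every file name is a hypothetical placeholder to be replaced with your own paths.

```python
import shlex

# Hypothetical data and mapping file names; replace them with your own paths.
pairs = [
    ("snv.csv", "snv.yaml"),
    ("oncokb.csv", "oncokb.yaml"),
]

command = ["ontoweave"]
# One FILE:MAPPING argument per database.
command += [f"{data}:{mapping}" for data, mapping in pairs]
command += [
    "--biocypher-config", "biocypher_config.yaml",
    "--biocypher-schema", "biocypher_schema.yaml",
    "-a", "suffix",
]

# Print the command as a single shell-ready string.
print(shlex.join(command))
```

The resulting list can also be handed to `subprocess.run(command)` if you prefer to launch the tool from Python rather than from a notebook shell cell.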

In this example we will be using the same database as in the previous sections, containing SNVs, but we will expand our graph schema to include actionable drugs for the identified alterations. We will use the OncoKB database, which contains information about actionable drugs for alterations in cancer.
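The benefit of mapping both databases into one graph is that a node with the same ID and type in both sources becomes a single node, so paths can cross from one database to the other. The sketch below illustrates this with plain Python tuples. The relation name `patient_has_alteration`, the IDs, and the helper function are hypothetical; only `alteration_biomarker_for_drug` is taken from this vignette.

```python
# Hypothetical (source, relation, target) edges from two separate mappings.
snv_edges = [("P1", "patient_has_alteration", "V600E")]
oncokb_edges = [("V600E", "alteration_biomarker_for_drug", "DrugA")]


def drugs_for_patient(patient, edges):
    """Follow patient -> alteration -> drug across the combined edge list."""
    alterations = {
        target
        for source, relation, target in edges
        if source == patient and relation == "patient_has_alteration"
    }
    return sorted(
        target
        for source, relation, target in edges
        if source in alterations and relation == "alteration_biomarker_for_drug"
    )


# Only the combined graph connects the patient to a drug.
print(drugs_for_patient("P1", snv_edges))
print(drugs_for_patient("P1", snv_edges + oncokb_edges))
```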

Below we show the mapping file for the OncoKB database.

In [34]:
# Read the file content.
with open("jobim/3_Multiple_databases/oncokb.yaml", "r") as file:
    content = yaml.safe_load(file)

# Display the content.
display(JSON(content))

<IPython.core.display.JSON object>

In this mapping file, the subject node is the `alteration` node, created from the `alteration` column as in the SNV example. The `drug` node is created from the `treatment` column. Here we meet our first more complex transformer, `replace`, which replaces every special character (TODO MENTION SPECIAL CHARACTERS PRESENT IN THE DATA) with an underscore so that the node ID is valid; the replacement character is set with the transformer-specific `substitute` keyword. Finally, we define an edge of type `alteration_biomarker_for_drug` between the `alteration` and `drug` nodes.
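The kind of clean-up performed by `replace` can be sketched in a few lines of Python. Which characters count as "special" is an assumption here (everything except ASCII letters and digits), and the function is an illustration rather than OntoWeaver's transformer; the input string is a made-up value.

```python
import re


def replace_special(value, substitute="_"):
    """Replace every character that is not an ASCII letter or digit.

    The set of "special" characters is an assumption of this sketch.
    """
    return re.sub(r"[^A-Za-z0-9]", substitute, value)


print(replace_special("Drug A + Drug-B"))
```

Each offending character is replaced one for one, so a run of three special characters yields three underscores.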