# Datapath Example 3

This notebook gives an example of how to build relatively simple data paths.
It assumes that you understand the concepts presented in the example 2
notebook.

## Example Data Model
The examples require that you understand a little bit about the example
catalog data model, which is based on the FaceBase project.

### Key tables
- `'dataset'` : represents a unit of data, usually a `'study'` or `'experiment'`
- `'sample'` : a biosample
- `'assay'` : a bioassay (typically RNA-seq or ChIP-seq assays)

### Relationships
- `dataset <- sample`: A dataset may have one to many samples. I.e., there 
  is a foreign key reference from sample to dataset.
- `sample <- assay`: A sample may have one to many assays. I.e., there is a
  foreign key reference from assay to sample.
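
To make the direction of the foreign keys concrete, here is a minimal in-memory sketch using plain Python lists. The rows, ids, and the helper `children` are made up for illustration; they are not data or functions from the FaceBase catalog or the deriva library.

```python
# Hypothetical rows; the ids and foreign key column names are illustrative only.
datasets = [{'id': 1}, {'id': 2}]
samples = [
    {'id': 10, 'dataset': 1},   # sample -> dataset foreign key
    {'id': 11, 'dataset': 1},
    {'id': 12, 'dataset': 2},
]
assays = [
    {'id': 100, 'sample': 10},  # assay -> sample foreign key
    {'id': 101, 'sample': 10},
    {'id': 102, 'sample': 12},
]


def children(rows, fk_column, parent_id):
    """Return the rows whose foreign key column points at parent_id."""
    return [row for row in rows if row[fk_column] == parent_id]


# One dataset has many samples; one sample has many assays.
samples_of_dataset_1 = children(samples, 'dataset', 1)
assays_of_sample_10 = children(assays, 'sample', 10)
print(len(samples_of_dataset_1), len(assays_of_sample_10))
```

The "one" side of each relationship never stores anything about the "many" side; the reference always lives in the child row.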

In [1]:
# Import deriva modules
from deriva_common import ErmrestCatalog, get_credential

In [2]:
# Connect with the deriva catalog
protocol = 'https'
hostname = 'www.facebase.org'
catalog_number = 1
credential = None
# If you need to authenticate, use Deriva Auth agent and get the credential
# credential = get_credential(hostname)
catalog = ErmrestCatalog(protocol, hostname, catalog_number, credential)

In [3]:
# Get the path builder interface for this catalog
pb = catalog.getPathBuilder()

## Building a DataPath
Build a data path by linking together related tables. To make things a little easier, we will use Python variables to reference the tables. This is not necessary, but it simplifies the examples.

In [4]:
dataset = pb.isa.dataset
sample = pb.isa.sample
assay = pb.isa.assay

### Initiate a path from a table object
As in the example 2 notebook, begin by initiating a `path` instance from a `Table` object. This path will be "rooted" at the table it was initiated from, in this case, the `dataset` table. `DataPath` objects have URIs that identify the resource in the catalog.

In [5]:
path = dataset.path
print(path.uri)

https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset


### Link other related tables to the path
In the catalog's model, tables are _related_ by foreign key references. Related tables may be linked together in a `DataPath`. Here we link the following tables based on their foreign key references (i.e., `dataset <- sample <- assay`).

In [6]:
path.link(sample).link(assay)
print(path.uri)

https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay
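
The shape of the URI above can be reproduced with a small standalone sketch. The function `entity_uri` is a hypothetical helper written for this illustration, not part of the deriva API; it only shows that each linked table instance contributes one `alias:=schema:table` segment to the path.

```python
def entity_uri(base, instances):
    """Join (alias, schema, table) triples into an entity path URI."""
    segments = ['{}:={}:{}'.format(alias, schema, table)
                for alias, schema, table in instances]
    return base + '/entity/' + '/'.join(segments)


base = 'https://www.facebase.org/ermrest/catalog/1'
uri = entity_uri(base, [('dataset', 'isa', 'dataset'),
                        ('sample', 'isa', 'sample'),
                        ('assay', 'isa', 'assay')])
print(uri)
```

Reading the segments left to right gives the order in which the tables were linked, with the last segment being the path context.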


### Path context
By default, `DataPath` objects return entities for the _last_ linked entity set in the path. The `path` from the prior step ended in `assay` which is therefore the `context` for this path.

In [7]:
path.context.name

'assay'

### Get entities for the current context
The following `DataPath` will fetch `assay` entities, not `dataset` entities.

In [8]:
entities = path.entities()
len(entities)

171

### Get entities for a different path context
Suppose we want to fetch the entities for the `dataset` table rather than for the current context, which is the `assay` table. We can do that by referencing the table as a property of the path object. **Note** that these are known as "table instances" rather than tables when used within a path expression. We will discuss table instances later in this notebook.

In [9]:
path.table_instances['dataset']
# or
path.dataset

Table name: 'dataset'
List of columns:
  Column name: 'id'	Type: serial4	Comment: 'None'
  Column name: 'accession'	Type: text	Comment: 'None'
  Column name: 'title'	Type: text	Comment: 'None'
  Column name: 'project'	Type: int8	Comment: 'None'
  Column name: 'funding'	Type: text	Comment: 'None'
  Column name: 'summary'	Type: text	Comment: 'None'
  Column name: 'description'	Type: markdown	Comment: 'None'
  Column name: 'view_gene_summary'	Type: text	Comment: 'None'
  Column name: 'view_related_datasets'	Type: text	Comment: 'None'
  Column name: 'mouse_genetic'	Type: text	Comment: 'None'
  Column name: 'human_anatomic'	Type: text	Comment: 'None'
  Column name: 'study_design'	Type: markdown	Comment: 'None'
  Column name: 'release_date'	Type: date	Comment: 'None'
  Column name: 'status'	Type: int4	Comment: 'None'
  Column name: 'gene_summary'	Type: int4	Comment: 'None'
  Column name: 'thumbnail'	Type: int4	Comment: 'None'
  Column name: 'show_in_jbrowse'	Type: boolean	Comment: 'None'
  ...

From that table instance we can fetch entities, add a filter specific to that table instance, or even link another table. Here we will get the `dataset` entities from the path.

In [10]:
entities = path.dataset.entities()
len(entities)

7

Notice that we fetched fewer entities this time. The count is now the number of `dataset` entities in the path rather than the number of `assay` entities that we fetched previously.

## Filtering a DataPath

Building on the path, a filter can be added. Like fetching entities, linking and filtering are performed _relative to the current context_. In this filter, the assay's attributes are referenced in the expression.

Currently, _binary comparisons_ and _logical operators_ are supported. _Unary operators_ have not yet been implemented. In binary comparisons, the left operand must be an attribute (column name) while the right operand must be a literal value.

In [11]:
path.filter(assay.molecule_type == 'mRNA')
print(path.uri)

https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA
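
An expression such as `assay.molecule_type == 'mRNA'` can produce a filter rather than a plain boolean because Python allows a class to overload `==`. The following toy sketch shows the general technique. The classes `Column` and `Comparison` are hypothetical and are not the deriva implementation, and the percent-encoding of the literal is an assumption made for this sketch.

```python
from urllib.parse import quote


class Comparison:
    """Toy binary comparison: column name on the left, literal on the right."""

    def __init__(self, column_name, operator, literal):
        self.column_name = column_name
        self.operator = operator
        self.literal = literal

    def to_uri_segment(self):
        # Assumption: the literal is percent-encoded before it joins the URI.
        return '{}{}{}'.format(self.column_name, self.operator,
                               quote(str(self.literal), safe=''))


class Column:
    """Toy column whose == operator builds a Comparison instead of a bool."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, literal):
        return Comparison(self.name, '=', literal)


molecule_type = Column('molecule_type')
segment = (molecule_type == 'mRNA').to_uri_segment()
print(segment)
```

This also illustrates why the left operand must be a column: the overloaded operator is looked up on the column object, and the literal is simply carried along as data.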


In [12]:
entities = path.entities()
len(entities)

6

Let's see it rendered as a Pandas DataFrame.

In [13]:
entities.dataframe

Unnamed: 0,alignment_id,cell_count,dataset,fragmentation_method,id,isolation_protocol,library_id,markers,molecule_type,pretreatment,...,reagent_batch_number,reagent_catalog_number,reagent_source,replicate,sample,sample_composition,sample_purification,sample_type,selection,tracks_id
0,41,,14068,Fragmentation Buffer from Illumina,1,,61,histology,mRNA,Trizol,...,,15032619.0,Illumina,5,1,medial nasal process,excision,RNA-seq,totalRNA,21
1,46,,14068,Fragmentation Buffer from Illumina,6,,66,histology,mRNA,Trizol,...,,15032619.0,Illumina,5,3,latero nasal process,excision,RNA-seq,totalRNA,26
2,55,,14068,Fragmentation Buffer from Illumina,15,,75,histology,mRNA,Trizol,...,,15032619.0,Illumina,5,2,maxillary process,excision,RNA-seq,totalRNA,35
3,60,,14068,Fragmentation Buffer from Illumina,20,,80,histology,mRNA,Trizol,...,,15032619.0,Illumina,5,4,mandibular process,excision,RNA-seq,totalRNA,40
4,62,,14130,Fragmentation Buffer from Illumina,25,,85,Histology,mRNA,Trizol,...,,15032619.0,Illumina,5,1088,face,Excision,RNA-seq,totalRNA,43
5,64,,14130,Fragmentation Buffer from Illumina,30,,90,Histology,mRNA,Trizol,...,,15032619.0,Illumina,5,1089,face,Excision,RNA-seq,totalRNA,43


# Table Instances
So far we have discussed _base_ tables. A _base_ table is a representation of the table as it is stored in the ERMrest catalog. A table _instance_ is a usage or reference of a table _within the context_ of a data path. As demonstrated above, we may link together multiple tables and thus create multiple table instances within a data path.

For example, in `dataset.path.link(sample).link(assay)` the table instance `sample` is no longer the same as the original base table `sample` because _within the context_ of this data path the `sample` entities must satisfy the constraints of the data path. The `sample` entities must reference a `dataset` entity, and they must be referenced by an `assay` entity. Thus, within this path, the entity set for `sample` may be quite different from the entity set for the base table on its own.
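
The difference between a base table and a table instance can be sketched with in-memory rows. Everything below is hypothetical illustration data, not FaceBase content, and the list comprehension only mimics the two constraints described above.

```python
# Hypothetical rows for illustration only.
datasets = [{'id': 1}, {'id': 2}, {'id': 3}]
samples = [
    {'id': 10, 'dataset': 1},
    {'id': 11, 'dataset': 1},     # has no assay
    {'id': 12, 'dataset': 2},
    {'id': 13, 'dataset': None},  # references no dataset
]
assays = [
    {'id': 100, 'sample': 10},
    {'id': 101, 'sample': 12},
    {'id': 102, 'sample': 13},
]

dataset_ids = {d['id'] for d in datasets}
assayed_sample_ids = {a['sample'] for a in assays}

# The "sample" table instance within dataset <- sample <- assay keeps only
# samples that reference a dataset AND are referenced by an assay.
sample_instance = [s for s in samples
                   if s['dataset'] in dataset_ids
                   and s['id'] in assayed_sample_ids]

print(len(samples), len(sample_instance))
```

The base table holds four rows here, while the table instance within the path holds only the two that satisfy both constraints.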

## Table instances are bound to the path
Whenever you initiate a data path (e.g., `table.path`) or link a table to a path (e.g., `path.link(table)`) a table instance is created and bound to the DataPath object (e.g., `path`). These table instances can be referenced via the `DataPath`'s `table_instances` container or directly as a property of the `DataPath` object itself.

In [14]:
dataset_instance = path.table_instances['dataset']
# or
dataset_instance = path.dataset

## Aliases for table instances
Whenever a table instance is created and bound to a path, it is given a name. If no name is specified, it is named after its base table. For example, a table named "My Table" will result in a table instance also named "My Table". A table may appear _more than once_ in a path (as separate table instances); if the table name is already taken, the instance is named with the base name followed by a number (e.g., "My Table2").
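
The default naming rule can be sketched as follows. The function `instance_name` is a hypothetical helper, and the assumption that numbering starts at 2 and keeps incrementing is inferred only from the "My Table2" example; the library's actual scheme may differ.

```python
def instance_name(base_name, taken):
    """Pick a unique instance name: the base name, else base name + number."""
    if base_name not in taken:
        return base_name
    # Assumption: numbering starts at 2 and increments until a free name is found.
    number = 2
    while '{}{}'.format(base_name, number) in taken:
        number += 1
    return '{}{}'.format(base_name, number)


taken = set()
for _ in range(3):
    name = instance_name('My Table', taken)
    taken.add(name)
    print(name)
```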

You may wish to specify the name of your table instance. In conventional database terms, an alternate name is called an "alias". Here we give the `dataset` table instance an alias of 'D', though longer strings are also valid as long as they do not contain special characters.

In [15]:
path.link(dataset.alias('D'))

<deriva_common.datapath.DataPath at 0x1073900f0>

In [16]:
path.D.uri

'https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA/D:=isa:dataset'

You'll notice that in this path we added an additional _instance_ of the `dataset` table from our catalog model. In addition, we linked it to the `isa.assay` table. This was possible because in this model there is a foreign key reference from the base table `assay` to the base table `dataset`. The table instance named `dataset` and the instance named `D` will likely consist of different entities because the constraints for each are different.

## Selecting Attributes From Linked Entities

Returning to the initial example, if we want to include additional attributes
from other table instances in the path, we need to be able to reference the
table instances at any point in the path. First, we will build our original path.

In [17]:
path = dataset.path.link(sample).link(assay).filter(assay.molecule_type == 'mRNA')
print(path.uri) 

https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA


Now let's fetch an entity set with attributes pulled from each of the table instances in the path.

In [18]:
entities = path.entities(path.dataset.accession, 
                         local_sample_id=path.sample.local_identifier, 
                         assay_molecule=path.assay.molecule_type)
print(entities.uri)

https://www.facebase.org/ermrest/catalog/1/attribute/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA/dataset:accession,local_sample_id:=sample:local_identifier,assay_molecule:=assay:molecule_type


**Notice** that the `EntitySet` also has a `uri` property. This URI may differ from the original path URI because the attribute projection does not get appended to the path URI.

In [19]:
path.uri != entities.uri

True
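
The relationship between the two URIs can be sketched with a standalone helper. The function `attribute_uri` is hypothetical and written only for this illustration; it is not how the library builds the URI internally. It swaps the `entity` API segment for `attribute` and appends the projection, with keyword arguments standing in for renamed columns.

```python
def attribute_uri(path_uri, *columns, **renamed_columns):
    """Build an attribute URI from an entity path URI and projected columns.

    Each column is a (table_instance, column_name) pair. Keyword arguments
    give the projected column a new name in the result.
    """
    projection = ['{}:{}'.format(instance, column)
                  for instance, column in columns]
    projection += ['{}:={}:{}'.format(new_name, instance, column)
                   for new_name, (instance, column) in renamed_columns.items()]
    return (path_uri.replace('/entity/', '/attribute/', 1)
            + '/' + ','.join(projection))


path_uri = ('https://www.facebase.org/ermrest/catalog/1/entity/'
            'dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/'
            'molecule_type=mRNA')
projected_uri = attribute_uri(path_uri,
                              ('dataset', 'accession'),
                              local_sample_id=('sample', 'local_identifier'),
                              assay_molecule=('assay', 'molecule_type'))
print(projected_uri)
```

The input `path_uri` is left unchanged, which mirrors the observation above that the projection is not appended to the path's own URI.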

As usual, `fetch(...)` the entities from the catalog.

In [20]:
entities.fetch(limit=5)
for e in entities:
    print(e)

{'accession': 'FB00000806.2', 'local_sample_id': 'E11.5_MNP', 'assay_molecule': 'mRNA'}
{'accession': 'FB00000806.2', 'local_sample_id': 'E11.5_LNP', 'assay_molecule': 'mRNA'}
{'accession': 'FB00000806.2', 'local_sample_id': 'E11.5_MX', 'assay_molecule': 'mRNA'}
{'accession': 'FB00000806.2', 'local_sample_id': 'E11.5_MD', 'assay_molecule': 'mRNA'}
{'accession': 'FB00000807.2', 'local_sample_id': 'CS22_11865', 'assay_molecule': 'mRNA'}
