# Datapath example 3

This notebook gives an example of how to build relatively simple data paths.
It assumes that you understand the concepts presented in the example 2
notebook.

## Example Data Model
The examples require that you understand a little bit about the example
catalog data model, which is based on the FaceBase project.

### Key tables
- `'dataset'` : represents a unit of data, usually a `'study'` or `'experiment'`
- `'sample'` : a biosample
- `'assay'` : a bioassay (typically RNA-seq or ChIP-seq assays)

### Relationships
- `dataset <- sample`: A dataset may have one to many samples. I.e., there 
  is a foreign key reference from sample to dataset.
- `sample <- assay`: A sample may have one to many assays. I.e., there is a
  foreign key reference from assay to sample.
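
The two relationships above can be sketched outside the catalog with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the column names (`id`, `dataset`, `sample`) and the rows are hypothetical, and the real catalog schema has many more columns.

```python
import sqlite3

# Hypothetical, simplified schema: only keys and a few columns.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE sample  (id INTEGER PRIMARY KEY,
                      dataset INTEGER REFERENCES dataset(id));
CREATE TABLE assay   (id INTEGER PRIMARY KEY,
                      sample INTEGER REFERENCES sample(id),
                      molecule_type TEXT);
""")

# Made-up rows: one dataset, two samples, three assays.
conn.execute("INSERT INTO dataset VALUES (1, 'DS1')")
conn.executemany("INSERT INTO sample VALUES (?, ?)", [(10, 1), (11, 1)])
conn.executemany("INSERT INTO assay VALUES (?, ?, ?)",
                 [(100, 10, 'mRNA'), (101, 10, 'DNA'), (102, 11, 'mRNA')])

# dataset <- sample: count the samples that reference dataset 1.
samples_per_dataset = conn.execute(
    "SELECT COUNT(*) FROM sample WHERE dataset = 1").fetchone()[0]

# sample <- assay: count the assays that reference each sample.
assays_per_sample = dict(conn.execute(
    "SELECT sample, COUNT(*) FROM assay GROUP BY sample").fetchall())

print(samples_per_dataset, assays_per_sample)
```

The foreign key always lives on the "many" side: `sample` rows point at their dataset, and `assay` rows point at their sample.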

In [None]:
# Import deriva modules
from deriva_common import ErmrestCatalog, get_credential

In [None]:
# Connect with the deriva catalog
protocol = 'https'
hostname = 'www.facebase.org'
catalog_number = 1
credential = None
# If you need to authenticate, use Deriva Auth agent and get the credential
# credential = get_credential(hostname)
catalog = ErmrestCatalog(protocol, hostname, catalog_number, credential)

In [None]:
# Get the path builder interface for this catalog
pb = catalog.getPathBuilder()

## Building a Datapath
We will build a data path by linking tables from the catalog. To make things a little easier, we will use Python variables to reference the tables. This is not necessary, but it simplifies the examples.

In [None]:
dataset = pb.isa.dataset
sample = pb.isa.sample
assay = pb.isa.assay

Build a data path by linking together different tables that are related.
By default, a data path returns entities for the _last_ linked entity set
in the path. The following data path will therefore return assays, not
datasets.

In [None]:
path = dataset.path            # a new path rooted at the "dataset" table
path.link(sample).link(assay)  # extended path dataset<-sample<-assay
print(path.uri)                # URI for this path

Get the entity set for this linked data path.

In [None]:
entities = path.entities()
len(entities)
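
To see why the entity set contains assays rather than datasets, here is a rough plain-Python sketch of the linking idea. The rows, the key names, and the `link` helper are all hypothetical; this is not how the library evaluates a path (the catalog server does that), only a model of the result.

```python
# Toy rows (made up for illustration), keyed like the example data model.
datasets = [{"id": 1, "accession": "DS1"}]
samples = [
    {"id": 10, "dataset": 1},
    {"id": 11, "dataset": 1},
    {"id": 12, "dataset": 2},   # belongs to a dataset outside the path
]
assays = [
    {"id": 100, "sample": 10},
    {"id": 101, "sample": 11},
    {"id": 102, "sample": 11},
    {"id": 103, "sample": 12},  # unreachable from dataset 1
]


def link(parents, children, fkey):
    """Hypothetical helper: keep children whose foreign key hits a parent."""
    parent_ids = {p["id"] for p in parents}
    return [c for c in children if c[fkey] in parent_ids]


# dataset <- sample <- assay: each link narrows the next table.
linked_samples = link(datasets, samples, "dataset")
linked_assays = link(linked_samples, assays, "sample")

# The result holds rows of the last linked table only.
print([a["id"] for a in linked_assays])  # [100, 101, 102]
```

Each returned row has only assay columns; the earlier tables in the path constrain which assays qualify but contribute no columns of their own.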

## Filtering a Datapath

Building on the path, a filter can be added. In this filter, the assay's
attributes may be referenced in the expressions. We did not have to split this
step from the prior step.

**Note**:
In these binary comparisons,
the left operand must be an attribute while the right operand must be a literal
value.

In [None]:
path.filter(assay.molecule_type == 'mRNA')
print(path.uri)
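
An expression such as `assay.molecule_type == 'mRNA'` does not evaluate to `True` or `False`; it produces a filter object that becomes part of the path URI. The sketch below shows one way such behavior can be obtained through operator overloading. The `Column` and `Comparison` classes are hypothetical stand-ins written for this illustration, not the library's classes, and the `column=value` rendering is an assumption about how the filter appears in the URI.

```python
from urllib.parse import quote


class Comparison:
    """A filter expression: <column> <operator> <literal>."""

    def __init__(self, column, operator, literal):
        self.column = column
        self.operator = operator
        self.literal = literal

    def __str__(self):
        # URL-encode the literal so it is safe inside a URI path segment.
        value = quote(str(self.literal), safe="")
        return f"{self.column.name}{self.operator}{value}"


class Column:
    """Hypothetical stand-in for a table column (not the deriva class)."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Build an expression object instead of returning True/False.
        if isinstance(other, Column):
            raise TypeError("the right operand must be a literal value")
        return Comparison(self, "=", other)

    # Defining __eq__ disables default hashing; restore identity hashing.
    __hash__ = object.__hash__


molecule_type = Column("molecule_type")
expr = molecule_type == "mRNA"
print(expr)  # molecule_type=mRNA
```

In this toy version, comparing one column with another raises a `TypeError`, which mirrors the rule that the right operand must be a literal.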

In [None]:
entities = path.entities()  # entity set for the filtered path built above
len(entities)

## Slicing EntitySets
Any entity set can be sliced too.

In [None]:
print(entities[2:4])

Let's see it rendered as a Pandas DataFrame.

In [None]:
entities.dataframe

# Table Instances
A "table instance" is a key concept when working with DataPaths. A table _instance_ is a table that is in use _within the context_ of a DataPath. Such a table _instance_ should not be confused with the _base_ table. A table _instance_ may be constrained by the _linking_ relationships expressed in the DataPath and the _filters_ on its attributes.

## Projecting Attributes From Linked Entities

Returning to the initial example, if we want to project additional attributes
from other entities in the DataPath, we need to be able to reference the
"table instances" at any point in the path. 

**Table Instance**: A "table instance" is a Table within the context of a DataPath.

To do so, we first need to define a few table "aliases" that we can use in
the paths.

### Define a table alias
Start by defining an alias for the `dataset` table. Any table can be aliased.
The argument to the `as_(...)` method is a string without special characters.

In [None]:
D = dataset.as_('D')

### Access columns of an aliased table
Like the original table, an alias may be used to reference the columns of the
original table.

In [None]:
D.columns['accession']

Now rebuild the path, using the alias in place of the original table.

In [None]:
datapath = D.link(sample).link(assay)

Project attributes from the last referenced table and any aliased tables.

In [None]:
datapath = datapath.attributes(D.accession, assay.molecule_type, assay.sample_type)
print(datapath.uri)

In [None]:
entities = datapath.entities()
for e in entities[0:10]:
    print(e)

### Alias a table anywhere in the data path
Now define another alias so that sample's columns may be projected as well.

In [None]:
S = sample.as_('S')

The following builds an entirely new datapath instance. When linking the
sample table, we first indicate which table is being linked and then which
alias to link it "as". This is similar in spirit to the SQL concept of joining
tables and renaming them "as" a given table instance name.

In [None]:
datapath = D.link(sample, as_=S).link(assay).attributes(D.accession, S.stage, assay.sample_type)
print(datapath.uri)

In [None]:
for e in datapath.entities(limit=5):
    print(e)
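
The SQL analogy mentioned above can be sketched with Python's built-in `sqlite3` module. The schema and rows here are hypothetical and heavily simplified; only the aliasing pattern (`dataset AS D`, `sample AS S`) is the point, and it mirrors the projection of `D.accession`, `S.stage`, and `assay.sample_type`.

```python
import sqlite3

# Hypothetical, simplified tables with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE sample  (id INTEGER PRIMARY KEY,
                      dataset INTEGER REFERENCES dataset(id),
                      stage TEXT);
CREATE TABLE assay   (id INTEGER PRIMARY KEY,
                      sample INTEGER REFERENCES sample(id),
                      sample_type TEXT);
INSERT INTO dataset VALUES (1, 'DS1');
INSERT INTO sample  VALUES (10, 1, 'E10.5'), (11, 1, 'E11.5');
INSERT INTO assay   VALUES (100, 10, 'RNA'), (101, 11, 'DNA');
""")

# Join the three tables, naming two of them "as" D and S so that their
# columns can be projected next to the columns of the last table.
rows = conn.execute("""
    SELECT D.accession, S.stage, assay.sample_type
    FROM dataset AS D
    JOIN sample AS S ON S.dataset = D.id
    JOIN assay ON assay.sample = S.id
    ORDER BY assay.id
""").fetchall()

for row in rows:
    print(row)
```

As in the datapath, the alias names a table as it is used within this particular join, so `D` and `S` refer to the joined rows rather than to the base tables as a whole.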