# The INDRA Database: Description and Demos

This notebook walks through the basic structure of the INDRA Database and then works through some use-case examples. Unless otherwise stated, it assumes that the user has direct access to the database.

--------------------------------------

## The Need-to-knows of INDRA

As the name suggests, this database is built using the tools of INDRA, and in turn it can be used to help with many uses of INDRA. It is thus valuable to go over some key features of the INDRA toolbox.

### The INDRA Statement
The bread and butter of the INDRA Database, and of INDRA itself, is the INDRA Statement, which is described extensively [here](file:///home/patrick/Workspace/indra/doc/_build/html/modules/statements.html). These Statements provide a robust and fairly extensible format for representing mechanistic interactions as Python objects. For the purposes of this tutorial, it is essential to know that Statements:
- Have a **type**, for example:
    - Phosphorylation
    - Complex
- Have **agents**, which in turn have some **db refs**, for example:
    - MEK has the Famplex db ref id MEK
    - Vemurafenib is an agent with the db refs for a CHEBI id "CHEBI:63637" and a ChEMBL id "ChEMBL1229517"
    
Most Statements have two agents, a subject and an object, for example:
- `Phosphorylation(MEK(), ERK())`
- `Inhibition(Vemurafenib(), BRAF())`

but there are some types of Statement that are notable exceptions:
- Complexes (any number of agents)
- Autophosphorylations (one agent)
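
The structure described above can be sketched with a toy model in plain Python. These are stand-in classes written for illustration, not INDRA's own Statement and Agent classes, and the db ref namespace keys (`FPLX`, `CHEBI`, `CHEMBL`) are assumptions about how the ids mentioned above might be keyed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyAgent:
    """Stand-in for an agent: a name plus database references."""
    name: str
    db_refs: Dict[str, str] = field(default_factory=dict)


@dataclass
class ToyStatement:
    """Stand-in for a Statement: a type plus an ordered list of agents."""
    type: str
    agents: List[ToyAgent]

    def __str__(self):
        args = ", ".join(f"{a.name}()" for a in self.agents)
        return f"{self.type}({args})"


# Agents with the db refs mentioned in the text (namespace keys are assumed).
mek = ToyAgent("MEK", {"FPLX": "MEK"})
erk = ToyAgent("ERK")
vem = ToyAgent("Vemurafenib",
               {"CHEBI": "CHEBI:63637", "CHEMBL": "ChEMBL1229517"})
braf = ToyAgent("BRAF")

# The common case: a subject and an object.
two_agent = ToyStatement("Phosphorylation", [mek, erk])
inhibition = ToyStatement("Inhibition", [vem, braf])

# The exceptions: any number of agents, or exactly one.
complex_stmt = ToyStatement("Complex", [mek, erk, braf])
auto_phos = ToyStatement("Autophosphorylation", [braf])

for stmt in (two_agent, inhibition, complex_stmt, auto_phos):
    print(stmt, "->", len(stmt.agents), "agent(s)")
```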

### Sources of INDRA Statements
INDRA has implemented tools for loading and generating these Statements from several sources. Here, the key points to recall are that:
- INDRA can draw both from **machine reading systems**, such as REACH, and from **mechanism databases**, such as Pathway Commons.
- For readings, INDRA also provides the groundwork for **running certain readers at massive scale** fairly easily, using AWS Batch.
- The results from these sources, especially when combined, **contain a lot of duplicate and closely related information**.

### Preassembly of INDRA Statements
To build useful models from all these sources, INDRA supplies tools to perform what is called "preassembly" (what you do before "assembling" your model), in which:
- grounding is regularized (fixing agent db refs), as are protein sites and agent names.
- the **redundant information between sources is merged, *with the original source information and evidence preserved*, into a distilled set of unique mechanisms**
- the relationships between similar pieces of mechanistic information are recorded, so that a more general Statement, such as `Phosphorylation(MEK(), ERK())`, can be identified as generalizing `Phosphorylation(MAP2K1(), MAPK1())`.
- **Such Preassembled Statements can be uniquely identified by a hash generated from their contents**.
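
The merging step can be illustrated with a minimal sketch: duplicates are collapsed under a hash of their mechanistic content while every piece of evidence is kept. The hash function, the dictionary layout, and the example records below are all made up for illustration; INDRA's actual hashing and preassembly are more involved.

```python
import hashlib
import json
from collections import defaultdict


def toy_hash(stmt):
    """Hash only the mechanism (type and agent names), not the evidence."""
    key = json.dumps({"type": stmt["type"], "agents": stmt["agents"]},
                     sort_keys=True)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# Hypothetical raw statements: two sources report the same mechanism.
raw = [
    {"type": "Phosphorylation", "agents": ["MEK", "ERK"],
     "evidence": {"source": "reader", "text": "MEK phosphorylates ERK."}},
    {"type": "Phosphorylation", "agents": ["MEK", "ERK"],
     "evidence": {"source": "database", "text": None}},
    {"type": "Inhibition", "agents": ["Vemurafenib", "BRAF"],
     "evidence": {"source": "reader", "text": "Vemurafenib inhibits BRAF."}},
]

unique = {}                   # hash -> one distilled statement
evidence = defaultdict(list)  # hash -> all evidence supporting it
for stmt in raw:
    h = toy_hash(stmt)
    unique.setdefault(h, {"type": stmt["type"], "agents": stmt["agents"]})
    evidence[h].append(stmt["evidence"])

for h, stmt in unique.items():
    print(h, stmt["type"], stmt["agents"], len(evidence[h]), "evidence")
```

Because the hash depends only on the content of the mechanism, the same key can be recomputed later to look up a distilled statement and all of its supporting evidence.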


----------

## The Structure of the Database

<img src="db_basic_structure.png">

The INDRA Database is made up of several tables. There are four core groups, shown as the three cylinders and one box above:
- **Sources:** Keep track of the content that we read, and the readings of that content, including titles, abstracts, and full texts from various sources. Also keep some metadata on the databases we import.
    - `text_refs`
    - `text_content`
    - `reading`
    - `db_info`
- **Raw Statements:** Store all the statements extracted from all the sources, as-is.
    - `raw_statements`
- **Preassembled Statements:** Store the cleaned, distilled, and relation-mapped statements.
    - `raw_unique_links`
    - `pa_statements`
    - `pa_agents`
    - `pa_support_links`
- **Materialized Views:** Pre-calculate certain queries for rapid retrieval.
    - `pa_meta`
    - `fast_raw_pa_link`
    - `pa_stmt_src`
    - `reading_ref_link`

There are many more tables; however, they are generally not essential for this demo.

------

## Demos

What follows are some demonstrations of the ways you can access the database, at various levels.

### Low level access

To access and manage the database at the lowest level, use the `DatabaseManager` class from `indra_db.managers.database_manager`. This requires access to the database, which is hosted on AWS RDS and configured in a config file (documented elsewhere). Here is an example of getting a piece of content from the database:

In [7]:
from indra_db.util import get_db, unpack

# Get a handle to the database
db = get_db('primary')

# Get a piece of text content that is an abstract. Everything after the first argument is a condition.
tc = db.select_one(db.TextContent, db.TextContent.text_type == 'abstract')
print(tc)

text_content:
	content: [not shown]
	format: text
	text_ref_id: 28416337
	insert_date: 2018-05-18 17:45:23.406707
	text_type: abstract
	source: pubmed
	id: 20202368
	last_updated: None



The actual content is not shown so that the metadata stays readable. To look at the content, unpack and print it:

In [8]:
print(unpack(tc.content))

Visual expertise induces changes in neural processing for many different domains of expertise. However, it is unclear how expertise effects for different domains of expertise are related. In the present fMRI study, we combine large-scale univariate and multi-voxel analyses to contrast the expertise-related neural changes associated with two different domains of expertise, bird expertise (ornithology) and mineral expertise (mineralogy). Results indicated distributed expertise-related neural changes, with effects for both domains of expertise in high-level visual cortex and effects for bird expertise even extending to low-level visual regions and the frontal lobe. Importantly, a multivariate generalization analysis showed that effects in high-level visual cortex were specific to the domain of expertise. In contrast, the neural changes in the frontal lobe relating to expertise showed significant generalization, signaling the presence of domain-independent expertise effects. In conclusion,

Note that the content must be `unpack`ed because it is stored in the database as compressed binary.
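
To illustrate why an unpacking step is needed, here is a minimal round trip using only the standard library. It assumes gzip-style compression of UTF-8 text, which is an assumption made for this sketch; `toy_pack` and `toy_unpack` are hypothetical helpers, not the `indra_db` functions.

```python
import gzip


def toy_pack(text):
    """Encode text as UTF-8 and compress it into a binary blob."""
    return gzip.compress(text.encode("utf-8"))


def toy_unpack(blob):
    """Reverse toy_pack: decompress the blob and decode it back to text."""
    return gzip.decompress(blob).decode("utf-8")


# Repetitive placeholder text, so the size reduction is easy to see.
abstract = "Visual expertise induces changes in neural processing. " * 20
blob = toy_pack(abstract)

print(type(blob).__name__, len(abstract), "characters ->", len(blob), "bytes")
print(toy_unpack(blob) == abstract)
```

Storing the compressed bytes saves space for long full texts, at the cost of one extra call whenever the content is read back.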

You can get a raw statement from a pmcid by using the `db.link` feature, which uses a networkx graph to construct the necessary joins on your behalf.

In [17]:
raw_stmts = db.select_all(db.RawStatements, db.TextRef.pmcid == 'PMC4055958', *db.link(db.RawStatements, db.TextRef))
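
The idea of using a graph to work out joins can be sketched as a shortest-path search over a table-relationship graph. The sketch below uses a plain breadth-first search rather than networkx, and the links between tables are assumed for illustration; it is not the implementation behind `db.link`.

```python
from collections import deque

# Hypothetical relationship graph: each table lists the tables it can be
# joined to directly. The table names follow the Sources and Raw Statements
# groups listed earlier; the links themselves are assumptions.
TOY_LINKS = {
    "raw_statements": ["reading"],
    "reading": ["raw_statements", "text_content"],
    "text_content": ["reading", "text_refs"],
    "text_refs": ["text_content"],
}


def toy_join_path(start, end, links=TOY_LINKS):
    """Return the shortest chain of tables from start to end, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = toy_join_path("raw_statements", "text_refs")
print(" -> ".join(path))

# Each adjacent pair in the path corresponds to one join condition.
for left, right in zip(path, path[1:]):
    print(f"join condition needed between {left} and {right}")
```

Each hop in the path stands for one join condition that would otherwise have to be written out by hand in the query.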

Let's look at some of the objects that were returned. The `repr` of the object is not especially informative:

In [19]:
raw_stmts[0]

<indra_db.managers.database_manager.DatabaseManager.__init__.<locals>.RawStatements at 0x7f34a469ee10>

However, as shown above, you can `print` the object. Again, the most verbose column, the `json` encoding of the Statement, is not printed in this display.

In [20]:
print(raw_stmts[0])

raw_statements:
	type: IncreaseAmount
	db_info_id: None
	text_hash: -3758986799612051399
	batch_id: 533420918
	id: 10341408
	create_date: 2019-05-31 14:06:53.451841
	indra_version: 1.12.0-8d138ebe7e70fefdb7edde1769c0c8bd8cb91526
	reading_id: 10100019060322
	source_hash: 1446941550084421822
	mk_hash: -35673697574246703
	uuid: 6f59cf8d-0210-448b-89de-f5363479e116
	json: [not shown]



In [22]:
raw_stmts[0].json

b'{"type": "IncreaseAmount", "subj": {"name": "MDMA", "db_refs": {"PUBCHEM": "1615", "TEXT": "MDMA"}}, "obj": {"name": "Ca", "db_refs": {"PUBCHEM": "271", "TEXT": "Ca"}}, "belief": 1, "evidence": [{"source_api": "reach", "text": "MDMA induced an increase in basal cytosolic Ca 2+ levels, measured after drug washout.", "annotations": {"found_by": "amount_1", "agents": {"coords": [[0, 4], [44, 46]]}}, "epistemics": {"direct": false, "section_type": null}, "text_refs": {"PMID": "18050169"}, "source_hash": 1446941550084421822}], "id": "6f59cf8d-0210-448b-89de-f5363479e116"}'
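
Since the `json` column holds UTF-8 encoded JSON bytes, the standard library is enough to inspect it. The sketch below uses an abbreviated copy of the value shown above (several fields are omitted for brevity) and pulls out the statement type, the agent names, and the evidence text.

```python
import json

# Abbreviated copy of the bytes shown above; some fields are left out.
raw_json = (
    b'{"type": "IncreaseAmount", '
    b'"subj": {"name": "MDMA", "db_refs": {"PUBCHEM": "1615", "TEXT": "MDMA"}}, '
    b'"obj": {"name": "Ca", "db_refs": {"PUBCHEM": "271", "TEXT": "Ca"}}, '
    b'"evidence": [{"source_api": "reach", '
    b'"text": "MDMA induced an increase in basal cytosolic Ca 2+ levels, '
    b'measured after drug washout.", '
    b'"text_refs": {"PMID": "18050169"}}]}'
)

# Decode the bytes and parse them into ordinary Python dicts and lists.
stmt_json = json.loads(raw_json.decode("utf-8"))

print(stmt_json["type"], ":",
      stmt_json["subj"]["name"], "->", stmt_json["obj"]["name"])

# Each evidence entry records where the statement came from and the sentence.
for ev in stmt_json["evidence"]:
    print(ev["source_api"], ev["text_refs"])
    print(ev["text"])
```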

In [24]:
from indra_db.util import get_statement_object

raw_stmt_objs = [get_statement_object(row) for row in raw_stmts]

The details of the following code are not essential; however, its output shows that we get a lot of statements from this full text, and that there are two different readings producing this content.

In [40]:
rids = set()
for row in raw_stmts:
    if row.reading_id not in rids:
        print(row.reading_id, end='  ')
        rids.add(row.reading_id)
    else:
        print(' '*len(str(row.reading_id)), end='  ')
    stmt = get_statement_object(row)
    print(stmt)

10100019060322  IncreaseAmount(MDMA(), Ca())
                Activation(NO(), peroxynitrite())
                Inhibition(MEM(), alpha7 nAChR())
                Activation(nAChR(), MDMA())
                Inhibition(memantine(), METH())
                Inhibition(METH(), dopamine())
                Activation(voltage(), calcium())
                Activation(alpha-bungarotoxin(), ROS())
                DecreaseAmount(NO(), CYCS())
                Inhibition(METH(), ROS())
                Inhibition(MEM(), NMDA receptors())
                Activation(MDMA(), calcium())
                Activation(MDMA(), calcium())
                Activation(MDMA(), cell death())
                Activation(METH(), ROS())
                Activation(METH(), ROS())
                Activation(METH(), ROS())
                Activation(METH(), ROS())
                Activation(METH(), ROS())
                Inhibition(MDMA(), dopamine())
                Activation(MDMA(), nAChR())
                Activation(MDM

- show evidence text
- pa statement example
- show loopback to other papers


## Higher Level
- from agents
- from hash

## Even higher 
- Describe split between UI and API
- Rest api from agents
- from hash
- submit curation

## UI
- From Agents

## indrabot
- a few screenshots

## Plug for INDRA Google
- simple one-bar interface
- ability to ask basic follow-up filtering/extending questions