## Trying out extracting causal statements from papers

This notebook follows the main steps of the INDRA [gene network](https://indra.readthedocs.io/en/latest/tutorials/gene_network.html#) tutorial.

INDRA is a very cool project, but it seems better suited to assembling information around a fairly small number of molecular species.

In [1]:
from indra.tools.gene_network import GeneNetwork

gn = GeneNetwork(['H2AX'])
biopax_stmts = gn.get_biopax_stmts()
bel_stmts = gn.get_bel_stmts()

Processing OWL elements: 100%|██████████| 132k/132k [00:04<00:00, 26.5kit/s] 


In [2]:
# pathways database entries
biopax_stmts[0:3]

[Ubiquitination(RNF4(), MDC1(mods: (phosphorylation, T, 699), (phosphorylation, T, 765), (phosphorylation, T, 752), (phosphorylation, T, 719), (sumoylation, K, 1840), (phosphorylation, T, 4)), K),
 Deubiquitination(CHEK2(mods: (phosphorylation, T, 68), (phosphorylation, T, 383), (phosphorylation, T, 387), (phosphorylation, S, 379)), BARD1(mods: (phosphorylation, T, 734), (phosphorylation, T, 714)), K),
 Phosphorylation(CHEK2(mods: (phosphorylation, T, 68), (phosphorylation, T, 383), (phosphorylation, T, 387), (phosphorylation, S, 379)), BRCA1(mods: (ubiquitination, K), (phosphorylation, S, 1524), (phosphorylation, S, 1457), (phosphorylation, S, 1423), (phosphorylation, S, 1387)), S, 988)]

In [3]:
# literature curations
bel_stmts[0:3]

[DecreaseAmount(trichostatin A(), H2AX()),
 Phosphorylation(etoposide(), H2AX(), S, 140),
 Phosphorylation(etoposide(), H2AX(), S, 140)]

In [4]:
from indra import literature

pmids = literature.pubmed_client.get_ids_for_gene('H2AX')
len(pmids)

570

In [8]:
from indra import literature

paper_contents = {}
for pmid in pmids:
    content, content_type = literature.get_full_text(pmid, 'pmid')
    if content_type == 'abstract':
        paper_contents[pmid] = content
    if len(paper_contents) == 10:
        break

# nice: abstracts keyed by PMID
paper_contents

{'12447390': 'DNA damage-induced G2-M checkpoint activation by histone H2AX and 53BP1. Activation of the ataxia telangiectasia mutated (ATM) kinase triggers diverse cellular responses to ionizing radiation (IR), including the initiation of cell cycle checkpoints. Histone H2AX, p53 binding-protein 1 (53BP1) and Chk2 are targets of ATM-mediated phosphorylation, but little is known about their roles in signalling the presence of DNA damage. Here, we show that mice lacking either H2AX or 53BP1, but not Chk2, manifest a G2-M checkpoint defect close to that observed in ATM(-/-) cells after exposure to low, but not high, doses of IR. Moreover, H2AX regulates the ability of 53BP1 to efficiently accumulate into IR-induced foci. We propose that at threshold levels of DNA damage, H2AX-mediated concentration of 53BP1 at double-strand breaks is essential for the amplification of signals that might otherwise be insufficient to prevent entry of damaged cells into mitosis.',
 '16872365': "Extent of co

In [19]:
# fails; probably because reach.local_text_url expects a REACH server running locally
#from indra.sources import reach

#literature_stmts = []
#for pmid, content in paper_contents.items():
#    rp = reach.process_text(content, url=reach.local_text_url)
#    literature_stmts += rp.statements
#print('Got %d statements' % len(literature_stmts))

In [9]:
from indra.tools import assemble_corpus as ac

stmts = biopax_stmts + bel_stmts

stmts = ac.map_grounding(stmts)
stmts = ac.map_sequence(stmts)
stmts = ac.run_preassembly(stmts)

Finding refinement relations: 100%|██████████| 1480/1480 [00:00<00:00, 6629.84it/s]
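
One thing the preassembly step does, as far as I understand it, is collapse duplicate statements and pool their evidence (note that `Phosphorylation(etoposide(), H2AX(), S, 140)` shows up twice in `bel_stmts` above). The sketch below is only a toy version of that idea using plain strings as keys; `merge_duplicates` and the `ev1`..`ev3` labels are made up for illustration and this is not how INDRA implements it.

```python
from collections import defaultdict


def merge_duplicates(records):
    """Group (statement_key, evidence) pairs so each key appears once.

    Hypothetical helper: statements are represented by their printed
    form here, which is an assumption made for this illustration.
    """
    pooled = defaultdict(list)
    for key, evidence in records:
        pooled[key].append(evidence)
    return dict(pooled)


# Statement strings copied from the bel_stmts output; evidence labels are placeholders
toy_records = [
    ("DecreaseAmount(trichostatin A(), H2AX())", "ev1"),
    ("Phosphorylation(etoposide(), H2AX(), S, 140)", "ev2"),
    ("Phosphorylation(etoposide(), H2AX(), S, 140)", "ev3"),
]

merged = merge_duplicates(toy_records)
for key, evidence in merged.items():
    print(key, "<-", evidence)
```

After merging, the two etoposide records become a single entry that carries both pieces of evidence, which is roughly why the statement count drops after assembly.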


In [10]:
from indra.assemblers.indranet import IndraNetAssembler
indranet_assembler = IndraNetAssembler(statements=stmts)
indranet = indranet_assembler.make_model()

In [11]:
import networkx as nx
paths = nx.single_source_shortest_path(G=indranet, source='H2AX', cutoff=1)
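
The `paths` result is never displayed, so here is a rough picture of its shape: a dict keyed by every node reachable within `cutoff` hops, each mapped to the list of nodes on the path from the source (the source maps to itself). The sketch below is a plain-Python stand-in on a toy directed graph, not networkx itself; `paths_within` and the node names other than `H2AX` are hypothetical.

```python
from collections import deque


def paths_within(graph, source, cutoff):
    """Breadth-first search returning {target: [source, ..., target]}.

    Hypothetical helper mimicking the return shape of
    nx.single_source_shortest_path on a dict-of-lists directed graph.
    """
    paths = {source: [source]}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # number of edges used to reach this node
        if len(paths[node]) - 1 >= cutoff:
            continue
        for neighbour in graph.get(node, []):
            if neighbour not in paths:
                paths[neighbour] = paths[node] + [neighbour]
                queue.append(neighbour)
    return paths


# Toy edges, made up for illustration only
toy_graph = {
    "upstream": ["H2AX"],
    "H2AX": ["target_1", "target_2"],
    "target_1": ["far_target"],
}

toy_paths = paths_within(toy_graph, "H2AX", cutoff=1)
downstream = sorted(t for t in toy_paths if t != "H2AX")
print(downstream)
```

With `cutoff=1` only direct successors come back, so in the real notebook `paths.keys()` should be H2AX plus whatever it points to in the graph; nodes that only point at H2AX are not included because the edges are directed.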

In [12]:
# statements are the primary unit in INDRA
# they include a set of agents
stmts[1].agent_list()

[SIRT6:Nucleosome(H3K9ac):NOTCH1 gene(), H3C15()]

In [13]:
# nice annotations of provenance
stmts[1].evidence

[Evidence(source_api='biopax',
          source_id='pc14:reactome:Catalysis4074',
          annotations={
                       "source_sub_id": "reactome",
                       "agents": {
                        "raw_text": [
                         null,
                         null
                        ],
                        "raw_grounding": [
                         {},
                         {
                          "UP": "Q71DI3",
                          "EGID": "333932",
                          "HGNC": "20505"
                         }
                        ]
                       },
                       "prior_uuids": [
                        "907e7bd3-ef3b-417a-8f6b-95a710b5c2c4"
                       ],
                       "indranet_edge": {
                        "residue": "K",
                        "position": "10",
                        "stmt_type": "Deacetylation",
                        "evidence_count": 1,
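
Since every statement carries a list of evidence objects with a `source_api` field (as in the output above), one quick sanity check is to tally where the assembled statements came from. This is a minimal sketch; `count_evidence_sources` is a hypothetical helper, and the toy objects below are stand-ins built with `SimpleNamespace` rather than real INDRA statements.

```python
from collections import Counter
from types import SimpleNamespace


def count_evidence_sources(statements):
    """Count evidence entries per source_api across statements.

    Hypothetical helper: relies only on each statement having an
    `.evidence` list whose items have a `.source_api` attribute.
    """
    counts = Counter()
    for stmt in statements:
        for ev in stmt.evidence:
            counts[ev.source_api] += 1
    return counts


# Stand-in objects for illustration; not real INDRA statements
toy_stmts = [
    SimpleNamespace(evidence=[SimpleNamespace(source_api="biopax")]),
    SimpleNamespace(
        evidence=[
            SimpleNamespace(source_api="bel"),
            SimpleNamespace(source_api="bel"),
        ]
    ),
]

source_counts = count_evidence_sources(toy_stmts)
print(source_counts)
```

In the notebook, the same function could presumably be called as `count_evidence_sources(stmts)` to see the split between BioPAX and BEL evidence.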
                       

In [15]:
a_participant = stmts[1].agent_list()[0]
# participants are Agent objects; note that isa is a method,
# so referencing it without calling it only shows the bound method
a_participant.isa

<bound method Agent.isa of SIRT6:Nucleosome(H3K9ac):NOTCH1 gene()>

## Specific sources

INDRA ingests a range of data sources; let's have a look at one of them.

In [16]:
from indra.sources import signor

SIGNOR_DATA_FILE = "/tmp/signor_data_file.csv"
SIGNOR_COMPLEX_FILE = "/tmp/signor_complex_file.csv"

signor_expressions = signor.api.process_from_web(
    signor_data_file=SIGNOR_DATA_FILE,
    signor_complexes_file=SIGNOR_COMPLEX_FILE
    )

Processing SIGNOR rows: 100%|██████████| 39169/39169 [00:16<00:00, 2417.59it/s] 
Processing SIGNOR complexes: 100%|██████████| 520/520 [00:00<00:00, 33365.53it/s]


In [None]:
# statements can be serialized to json
signor_expressions.statements[0].to_json(use_sbo=True)

In [None]:
# participants carry systematic identifiers as cross-references (xrefs)
x = signor_expressions.statements[0].agent_list()[1]
x.db_refs
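
`db_refs` is a plain dict from namespace to identifier, like the `{"UP": ..., "EGID": ..., "HGNC": ...}` entry in the evidence output earlier. When one identifier per participant is needed, a simple approach is to pick the first namespace present from a priority list. This is a sketch only; `preferred_id` and the priority order are assumptions, not something the tutorial prescribes.

```python
def preferred_id(db_refs, priority=("HGNC", "UP", "CHEBI")):
    """Return (namespace, identifier) for the first namespace found.

    Hypothetical helper; the default priority order is an assumption.
    Returns None when none of the namespaces is present.
    """
    for namespace in priority:
        if namespace in db_refs:
            return namespace, db_refs[namespace]
    return None


# Identifiers copied from the raw_grounding shown in the evidence output
example_refs = {"UP": "Q71DI3", "EGID": "333932", "HGNC": "20505"}

best = preferred_id(example_refs)
print(best)
```

Applied to the notebook, `preferred_id(x.db_refs)` would give a single cross-reference for the SIGNOR participant, falling back to `None` for ungrounded agents.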

## SBML export

In [30]:
from indra.assemblers.pysb import assembler

SBML_OUT_PATH = "/tmp/indra_signor.sbml"
SBGN_OUT_PATH = "/tmp/indra_signor.sbgn"
JSON_OUT_PATH = "/tmp/indra_signor.json"

In [51]:
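pysb_model = assembler.PysbAssembler(signor_expressions.statements)
# the assembler's .model attribute stays None until make_model() is called
pysb_model.make_model()
pysb_model.model

In [53]:
# with the model assembled, the export call can be tried
# (left commented out: it has not been run against the full SIGNOR set)
# pysb_model.export_model("sbml", SBML_OUT_PATH)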