# Working with `metabolomics.info` data
We are just getting started with Jupyter Lab, which is like Jupyter Notebooks but with more features.  For more about Jupyter Lab, see [Introduction to Jupyter Lab](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).

In this notebook we try out Python tooling for VIVO data.  Some of the cells are VIVO-specific; others are not.

Many of the cells use `rdflib`, a very handy library for manipulating RDF graphs using Python.  See the [rdflib docs](https://rdflib.readthedocs.io/en/stable/).

## Query the data with Fuseki

Fuseki is a stand-alone SPARQL server from the [Apache Jena Project](http://apache.org/jena).
Triples can be loaded into a Fuseki server and then queried.  For the cells below, the triples have been loaded into
an in-memory (and therefore very fast) dataset called `people-portal` on a local Fuseki server.

### Count triples in the database

In [11]:
import requests
q = """
SELECT (COUNT(*) AS ?ntriples)
WHERE {
  ?s ?p ?o .
}
"""
%time response = requests.post('http://localhost:3030/people-portal/sparql', data={'query': q})
print(response.json()['results']['bindings'][0]['ntriples']['value'])

CPU times: user 3.12 ms, sys: 1.6 ms, total: 4.73 ms
Wall time: 675 ms
1232155
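
The `results` / `bindings` structure indexed above follows the standard SPARQL JSON results layout, in which every row maps a variable name to a small object holding its `value`. As a minimal sketch (the helper name `flatten_bindings` and the sample payload are illustrative and not part of the notebook), the rows can be flattened into plain dictionaries before printing:

```python
def flatten_bindings(result_json):
    """Turn SPARQL JSON results into a list of {variable: value} dicts.

    Hypothetical helper: it assumes the standard layout
    {"results": {"bindings": [{var: {"value": ...}, ...}, ...]}}.
    Variables left unbound in a row are simply absent from that row's dict.
    """
    rows = []
    for binding in result_json["results"]["bindings"]:
        rows.append({var: cell["value"] for var, cell in binding.items()})
    return rows


# Sample payload shaped like the response to the count query above.
sample = {
    "head": {"vars": ["ntriples"]},
    "results": {
        "bindings": [
            {"ntriples": {"type": "literal",
                          "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                          "value": "1232155"}}
        ]
    },
}

rows = flatten_bindings(sample)
print(int(rows[0]["ntriples"]))  # values arrive as strings, so convert explicitly
```

With a live server, `flatten_bindings(response.json())` would take the place of the sample payload.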


### Count people, papers, organizations

In [17]:
import requests
q = """
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?npeople ?narticles ?norgs
WHERE {
    
    {
    SELECT (COUNT(DISTINCT ?person) AS ?npeople)
    WHERE { 
        ?person a foaf:Person .
        }
    }

    {
    SELECT (COUNT(DISTINCT ?article) AS ?narticles)
    WHERE { 
        ?article a bibo:Article .
        }
    }

    {
    SELECT (COUNT(DISTINCT ?org) AS ?norgs)
    WHERE { 
        ?org a foaf:Organization .
        }
    }
    
}
"""
%time response = requests.post('http://localhost:3030/people-portal/sparql', data={'query': q})
vals = response.json()['results']['bindings'][0]
print("People", vals['npeople']['value'], "Articles", vals['narticles']['value'], "Orgs", vals['norgs']['value'])

CPU times: user 2.9 ms, sys: 1.52 ms, total: 4.42 ms
Wall time: 35 ms
People 1372 Articles 17790 Orgs 1446


### People by count of papers

In [27]:
import requests
q = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT (MAX(?lname) as ?name) (COUNT (DISTINCT ?a) AS ?npubs)
WHERE {
  ?person a foaf:Person .
  ?person rdfs:label ?lname .
  ?person vivo:relatedBy ?a .
  ?a a vivo:Authorship .
}
GROUP BY ?person
ORDER BY DESC(?npubs)
"""
%time response = requests.post('http://localhost:3030/people-portal/sparql', data={'query': q})
for row in response.json()['results']['bindings'][0:10]:
    print(row['name']['value'], row['npubs']['value'])

CPU times: user 2.93 ms, sys: 1.57 ms, total: 4.5 ms
Wall time: 72 ms
Ronald Petersen 588
Dean Jones 320
Stefan Kaufmann 270
Theodore Lawrence 263
Quiying Chen 258
Stefan H.E. Kaufmann 258
Eva Feldman 253
Brett Finlay 248
Michelle Mielke 214
John Meeker 198


### Find co-authors

For social network analysis, we want to find pairs of co-authors and count how many papers each pair has written together.
For each person, we look at their papers and the other authors of those papers, and create a pair for the person and each of their co-authors.
We then count the pairs.  The results are symmetric: if a is a co-author of b, then b also appears as a co-author of a, with the same count.
Note that the query only considers papers that have a DOI.

In [40]:
import requests
q = """
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX vitro: <http://vitro.mannlib.cornell.edu/ns/vitro/0.7#>
PREFIX vitrop: <http://vitro.mannlib.cornell.edu/ns/vitro/public#>
PREFIX m3c: <http://www.metabolomics.info/ontologies/2019/metabolomics-consortium#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX fao: <http://aims.fao.org/aos/geopolitical.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT (MAX(?name1) AS ?name) ?p (MAX(?name2) AS ?coauthor) ?auth (COUNT(DISTINCT ?paper) AS ?npapers)
WHERE {
  ?p a foaf:Person .
  ?p rdfs:label ?name1 .
  ?p vivo:relatedBy ?a .
  ?a a vivo:Authorship .
  ?a vivo:relates ?paper .
  ?paper bibo:doi ?doi .
  ?paper vivo:relatedBy ?a2 .
  FILTER(?a2 != ?a)
  ?a2 vivo:relates ?auth .
  ?auth a foaf:Person .
  ?auth rdfs:label ?name2 .
}
GROUP BY ?p ?auth
ORDER BY DESC(?npapers)
"""
%time response = requests.post('http://localhost:3030/people-portal/sparql', data={'query': q})
for row in response.json()['results']['bindings'][0:10]:
    print(row['name']['value'], row['coauthor']['value'], row['npapers']['value'])

CPU times: user 3.95 ms, sys: 2.17 ms, total: 6.12 ms
Wall time: 248 ms
Stefan H.E. Kaufmann Stefan Kaufmann 257
Stefan Kaufmann Stefan H.E. Kaufmann 257
Michelle Mielke Ronald Petersen 138
Ronald Petersen Michelle Mielke 138
Dean Jones Karan Uppal 60
Karan Uppal Dean Jones 60
Pramod P Wangikar Pramod Wangikar 47
Pramod Wangikar Pramod P Wangikar 47
Dean Jones Douglas Walker 46
Douglas Walker Dean Jones 46
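
Because each pair shows up twice in these results, once from each side, a network analysis would usually keep one row per unordered pair. A minimal sketch of that step, using rows copied from the output above and a hypothetical helper named `unique_pairs`:

```python
def unique_pairs(rows):
    """Collapse (name, coauthor, npapers) rows into one entry per unordered pair.

    Hypothetical helper: names are used as keys only for illustration.
    With real data the person URIs (?p and ?auth) would be safer keys,
    because two different people can share a label.
    """
    pairs = {}
    for name, coauthor, npapers in rows:
        key = tuple(sorted((name, coauthor)))  # (a, b) and (b, a) share a key
        pairs[key] = int(npapers)
    # Most frequent collaborations first, ties broken alphabetically
    return sorted(pairs.items(), key=lambda item: (-item[1], item[0]))


rows = [
    ("Michelle Mielke", "Ronald Petersen", "138"),
    ("Ronald Petersen", "Michelle Mielke", "138"),
    ("Dean Jones", "Karan Uppal", "60"),
    ("Karan Uppal", "Dean Jones", "60"),
]

for (a, b), n in unique_pairs(rows):
    print(a, "--", b, n)
```

Sorting the two names before using them as a key is what makes the pair order-independent; any other canonical ordering would work equally well.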


## Read the data into a graph

In [28]:
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, FOAF, RDFS
from lookup import term_lookup, VIVO, OBO, BIBO, UFL, VCARD, M3C

g = Graph()
%time g = g.parse("../m3c-r/m3c-20200316.nt", format="nt")
print(len(g))

CPU times: user 1min 5s, sys: 688 ms, total: 1min 5s
Wall time: 1min 6s
1232155


## Type list

In [29]:
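# Count how many times each class appears as the object of an rdf:type triple
types = dict()
for s, p, o in g.triples((None, RDF.type, None)):
    type_uri = str(o)
    types[type_uri] = types.get(type_uri, 0) + 1
# Print the classes from most to least frequent
for v, k in sorted(((v, k) for k, v in types.items()), reverse=True):
    print(v, '\t', term_lookup(k))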

130292 	 http://www.w3.org/2002/07/owl#Thing
84744 	 http://www.metabolomics.info/ontologies/2019/metabolomics-consortium#Domain_entity
82658 	 http://www.metabolomics.info/ontologies/2019/metabolomics-consortium#Dataset
41006 	 http://purl.obolibrary.org/obo/BFO_0000002 (continuant)
41006 	 http://purl.obolibrary.org/obo/BFO_0000001 (entity)
20075 	 http://vivoweb.org/ontology/core#Relationship
20075 	 http://vivoweb.org/ontology/core#Authorship
20075 	 http://purl.obolibrary.org/obo/BFO_0000020 (specifically dependent continuant)
17790 	 http://purl.org/ontology/bibo/Document
17790 	 http://purl.org/ontology/bibo/Article
17790 	 http://purl.obolibrary.org/obo/IAO_0000030 (information content entity)
17790 	 http://purl.obolibrary.org/obo/BFO_0000031 (generically dependent continuant)
3141 	 http://purl.obolibrary.org/obo/BFO_0000004 (independent continuant)
2818 	 http://xmlns.com/foaf/0.1/Agent
2529 	 http://www.w3.org/2006/vcard/ns#Identification
2529 	 http://www.w3.org/2006/vcard

## Predicate List

Count the occurrences of predicates in the triples.  Loop over the triples, counting occurrences,
then sort the resulting table of predicates and their counts by the counts.  Look up each term
to convert the term to something more readable if needed.

In [30]:
# Count how many triples use each predicate
predicates = dict()
for s, p, o in g:
    predicate_uri = str(p)
    predicates[predicate_uri] = predicates.get(predicate_uri, 0) + 1
# Print the predicates from most to least frequent
for v, k in sorted(((v, k) for k, v in predicates.items()), reverse=True):
    print(v, '\t', term_lookup(k))

542717 	 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
131157 	 http://vitro.mannlib.cornell.edu/ns/vitro/0.7#mostSpecificType
83728 	 http://www.metabolomics.info/ontologies/2019/metabolomics-consortium#subjectSpecies
82658 	 http://www.metabolomics.info/ontologies/2019/metabolomics-consortium#sampleId
68100 	 http://www.metabolomics.info/ontologies/2019/metabolomics-consortium#developedFrom
68100 	 http://www.metabolomics.info/ontologies/2019/metabolomics-consortium#dataFor
40150 	 http://vivoweb.org/ontology/core#relates
40150 	 http://vivoweb.org/ontology/core#relatedBy
22698 	 http://www.w3.org/2000/01/rdf-schema#label
18364 	 http://purl.org/ontology/bibo/doi
17790 	 http://www.metabolomics.info/ontologies/2019/metabolomics-consortium#citation
17790 	 http://vivoweb.org/ontology/core#dateTimeValue
17790 	 http://vivoweb.org/ontology/core#dateTimePrecision
17790 	 http://vivoweb.org/ontology/core#dateTime
17790 	 http://purl.org/ontology/bibo/pmid
2503 	 http://www.metabolomics.
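
The same tally can be written with `collections.Counter`. The sketch below runs on a few hand-made triples rather than the real graph, and `shorten` with its `PREFIXES` table is a hypothetical stand-in for `term_lookup`, whose implementation lives in the local `lookup` module and is not shown in this notebook:

```python
from collections import Counter

# Hypothetical prefix table; the real term_lookup may work differently.
PREFIXES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf:",
    "http://www.w3.org/2000/01/rdf-schema#": "rdfs:",
    "http://vivoweb.org/ontology/core#": "vivo:",
}


def shorten(uri):
    """Replace a known namespace with its prefix, else return the URI unchanged."""
    for namespace, prefix in PREFIXES.items():
        if uri.startswith(namespace):
            return prefix + uri[len(namespace):]
    return uri


# Hand-made (subject, predicate, object) triples standing in for the graph.
triples = [
    ("ex:p1", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "foaf:Person"),
    ("ex:p1", "http://www.w3.org/2000/01/rdf-schema#label", "A Person"),
    ("ex:p2", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "foaf:Person"),
    ("ex:p1", "http://vivoweb.org/ontology/core#relatedBy", "ex:a1"),
]

# Count each predicate, then list them from most to least frequent.
predicate_counts = Counter(p for _, p, _ in triples)
for predicate, count in predicate_counts.most_common():
    print(count, "\t", shorten(predicate))
```

On the real graph, `Counter(str(p) for _, p, _ in g)` would build the same counts. Predicates with equal counts may print in a different order than in the cell above, because `most_common` does not sort ties by URI.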

## List the people in the graph

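Some of the triples in the graph look like:

     <uri>  rdf:type foaf:Person

Each of these triples asserts that `<uri>` identifies a person.  Find those triples and print them.

To find them, ask the graph for the triples matching the pattern shown.  `None` is a wildcard that matches any value in its position, so the pattern in the cell below matches any subject.  The cell prints the first 11 matches along with each person's `rdfs:label`.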

In [32]:
n = 0
for s,p,o in g.triples( (None, RDF.type, FOAF.Person) ):
    print ("%s is a person named %s"% (s,g.value(s, RDFS.label)))
    n += 1
    if n > 10:
        break

https://vivo.metabolomics.info/individual/p404 is a person named Johann Gudjonsson
https://vivo.metabolomics.info/individual/p188 is a person named Warren Kruger
https://vivo.metabolomics.info/individual/p1351 is a person named Xue Shi
https://vivo.metabolomics.info/individual/p325 is a person named Hiroshi Tsugawa
https://vivo.metabolomics.info/individual/p1016 is a person named Murray R Badger
https://vivo.metabolomics.info/individual/p1181 is a person named David P Enot
https://vivo.metabolomics.info/individual/p786 is a person named Reda A Ammar
https://vivo.metabolomics.info/individual/p902 is a person named Frank Madeo
https://vivo.metabolomics.info/individual/p211 is a person named LaiFang Zhou
https://vivo.metabolomics.info/individual/p1273 is a person named Mami Okamoto
https://vivo.metabolomics.info/individual/p1257 is a person named Mohammad R Nezami Ranjbar
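
To make the wildcard behaviour concrete, here is a plain-Python sketch of the matching rule. It only illustrates the semantics and is not how rdflib implements `Graph.triples`; the function name `match` and the sample triples are invented for the example:

```python
def match(triples, pattern):
    """Yield the triples that fit an (s, p, o) pattern; None matches anything."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


# Invented sample data using readable strings instead of URIs.
triples = [
    ("p404", "rdf:type", "foaf:Person"),
    ("p404", "rdfs:label", "Johann Gudjonsson"),
    ("org1", "rdf:type", "foaf:Organization"),
    ("p188", "rdf:type", "foaf:Person"),
]

# Any subject, fixed predicate and object: the people in the sample.
people = [s for s, _, _ in match(triples, (None, "rdf:type", "foaf:Person"))]
print(people)

# Fixed subject, everything else left open: all statements about p404.
about_p404 = list(match(triples, ("p404", None, None)))
print(about_p404)
```

A pattern of `(None, None, None)` matches every triple, which is equivalent to iterating over the whole collection.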


## List names with special characters

In [34]:
qres = g.query(
    """
    SELECT ?name
    WHERE {
      ?p a foaf:Person .
      ?p rdfs:label ?name .
      FILTER(regex(?name, "[^A-Za-z ]"))
    }
    ORDER BY ?name
    """, initNs = { "vivo": VIVO, "foaf": FOAF, "rdfs": RDFS, "bibo": BIBO, "m3c": M3C})

for row in qres:
    print("%s" % row)

A Ben-Hur
Alejandro Soto-Gutierrez
Alejandro Villar-Briones
Alexandre Perera-Lluna
Alison Berent-Spillson
Allison O'Kell
Alvaro Cuadros-Inostroza
Amro, Monte Ilaiwy, WIllis
Amro, Monte Ilaiwy, Willis
Ana Belén Lozano
Antonio Julià
Archana Sharma-Oates
Aurélie Roux
Brian O'Neill
Brigitte Wägele
Bruce McClenathan, M.D.
Carole Sztalryd-Woodle
Carolina Gonzalez-Riano
Chang Su-youne
Charmion Cruickshank-Quinn
Ching-Tai Chen
Chiun-Gung Juo
Christian Jäger
Christoph Böttcher
Coriness Piñeyro-Ruiz
Cristina Andres-Lacueva
Cristina Andrés-Lacueva
Cristina López-Hidalgo
Dennis Mook-Kanamori
Domingo Barber Hernández
Eric B. Haura
Etienne A Thévenot
Florence Nicolè
Fong-Fu Hsu
Francesc Fernández-Albert
Gabi Kastenmüller
Georg Langenkämper
Gildas Le Corguillé
Gonçalo Amarante Guimarães Pereira
Grace O'Maille
Grégory Genta-Jouve
Hans-Ulrich Häring
Harald C Köfeler
Henk Henk W.M. Hilhorst
Hugo Peña-Cortés
Hui-Yin Chang
Hyung-Ok Lee
Ivana Blaženović
Javier Lopez-Lbanez
Javier López-Ibáñez
Jean-Baptiste

## Tabulate Species of Studies

In [38]:
qres = g.query("""
    SELECT (COUNT (DISTINCT ?s) AS ?nstudies) ?o
    WHERE {
      ?s a m3c:Study .
      ?s m3c:subjectSpecies ?o .
    }
    GROUP BY ?o
    ORDER BY DESC(?nstudies)
    """, initNs = { "vivo": VIVO, "foaf": FOAF, "rdfs": RDFS, "bibo": BIBO, "m3c": M3C})

for row in qres:
    print("%s %s" % row)

454 Homo sapiens
314 Mus musculus
49 Rattus norvegicus
17 Mycobacterium tuberculosis H37Rv
11 Ovis aries
7 Macaca mulatta
7 Escherichia coli
7 Caenorhabditis elegans
7 Plasmodium falciparum
7 Arabidopsis thaliana
7 None
6 Drosophila melanogaster
6 Saccharomyces cerevisiae
6 Bos taurus
5 Rhesus Macaque
5 Canis lupus familiaris
5 Synechocystis sp. PCC6803
5 Megalobrama amblycephala
4 Metacarcinus magister
4 Plasmodium falciparum;Homo sapiens
4 Rattus
4 Staphylococcus aureus
4 Zea mays
3 Trypanosoma brucei brucei
3 Chlamydomonas reinhardtii
3 Sus scrofa
3 Equus caballus
3 Homo sapiens | Mus musculus
3 Danio rerio
3 Salmonella enterica/Escherichia coli
3 Felis catus
2 Mus musculus/Homo sapiens
2 Enterococcus faecalis
2 Homo sapiens/Rattus norvegicus
2 Cricetinae
2 Plasma
2 Xenopus laevis
2 Auxenochlorella protothecoides
2 Bacteria
2 Mus
2 Bacteroides tethaiotamicron
2 Phoenix dactylifera
2 Gallus gallus
2 Streptococcus mutans
2 Sinorhizobium meliloti
2 N/A
2 Synechococcus elongatus PCC7942

## List the first 25 publications (sorted by title) with DOI and title

In [18]:
qres = g.query(
    """
    SELECT ?doi ?label
    WHERE {
      ?p bibo:doi ?doi .
      ?p rdfs:label ?label .
    }
    ORDER BY ?label
    LIMIT 25
    """, initNs = { "vivo": VIVO, "foaf": FOAF, "rdfs": RDFS, "bibo": BIBO})

for row in qres:
    print("%s %s" % row)

10.20960/nh.591 
10.1093/gerona/gln062 "Dividends" from research on aging--can biogerontologists, at long last, find something useful to do?
http://dx.doi.org/10.21228/M8997F "Evaluating lipid mediator structural complexity using ion mobility spectrometry combined with mass spectrometry"
10.1016/j.semarthrit.2013.12.007 "Generalized osteoarthritis": a systematic review.
10.1093/jbcr/irz210 "Geospatial Mapping as a Guide for Resource Allocation Among Burn Centers in India".
10.1016/j.chembiol.2011.06.001 "Going KiNativ": probing the Native Kinome.
10.1021/pr400710q "Out-gel" tryptic digestion procedure for chemical cross-linking studies with mass spectrometric detection.
10.1126/science.aal4573 "Pheno"menal value for human health.
10.1016/j.hrthm.2014.10.039 "Power-on resets" in cardiac implantable electronic devices during magnetic resonance imaging.
10.1371/journal.pone.0010936 "Topological significance" analysis of gene expression and proteomic profiles from prostate cancer cells rev

## Write Graph to a file

In [None]:
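# Let rdflib write the Turtle file itself, so the file is closed properly and the
# code does not depend on whether serialize() returns bytes or str.
g.serialize(destination="out.ttl", format="turtle")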