# VIVO Pipeline

Given a spreadsheet, create VIVO data and load to VIVO.  Do everything responsibly:

1. Use PythonAnywhere APIs to fetch elements from the "world" that are identified -- MeSH terms, dates, PubMed IDs,
DOIs, and so on.  Convert to JSON and build a hierarchical document as needed -- a person may have many papers.
Support some standard entity spreadsheets -- person, org -- perhaps everything starts with these two.
1. Transform the JSON to TTL using RMLMapper.
1. Check the TTL using SHACL constraints.  Add inferred helper properties (authorOf, counts of things).
1. VIVO IRI rewriting.  Query a VIVO to get good IRIs.  Rewrite the TTL with the good IRIs.
1. Robot reasoning and reduce for sound ontological assertions.
1. Load the data to VIVO.

*Voila*.  VIVO data.

Some principles:

1. Don't over-engineer (don't assume, don't limit, don't require, don't force).
1. Show the human where the error(s) are.

Some must haves:
1. Spreadsheet must be a TSV (tab separated values).  Commas often appear in data values and cannot be used as a field separator.
1. Allow multiple values in a single cell with an in-cell separator (semicolon).  That way we only need one spreadsheet cell for a person's pubs,
and one for their research interests.
1. Allow multi-valued values using a secondary separator (pipe "|").  An example is education.  Using the semicolon, we can separate one degree from another.  Using the pipe,
we can separate the values for a degree: date, degree, topic, awarder.
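
The two separators described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual parser: the function name `split_cell`, its parameters, and the sample education values are all made up for the example. Only the separators and the field order (date, degree, topic, awarder) come from the text above.

```python
def split_cell(cell, fields=None, value_sep=";", field_sep="|"):
    """Split one spreadsheet cell into a list of values.

    Hypothetical helper. Without `fields`, each value stays a string.
    With `fields`, each value is split again on `field_sep` and paired
    with the field names to form a dict.
    """
    if cell.strip() == "":
        return []
    values = [value.strip() for value in cell.split(value_sep)]
    if fields is None:
        return values
    return [
        dict(zip(fields, (part.strip() for part in value.split(field_sep))))
        for value in values
    ]


# Made-up sample cell: two degrees, four sub-values each.
education = split_cell(
    "1998|BS|Biology|University of Florida;2004|PhD|Genetics|Duke University",
    fields=["date", "degree", "topic", "awarder"],
)
print(education[0]["degree"])  # BS
print(education[1]["awarder"])  # Duke University
```

An empty cell yields an empty list, so later steps can treat "no value" and "many values" the same way.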

    
## Todo
1. Simplify -- does the JSON make sense?  Can we remove the upper layer and just get to many small blocks about the person?
1. Go deeper -- add more processing steps.  Fetch steps, RMLMapper steps.
1. Go wider -- add more attributes and sub-attributes
1. Review approach with M3C group.

    
## Attributes

Here's what we have so far.  Some attributes have sub-attributes.

1. orcid
1. department
1. name
1. position
    1. start
    1. end
    1. title
    1. unit
1. education
    1. date
    1. degree
    1. topic
    1. awarder
1. overview
1. photo
1. email
1. local_id
1. comment
1. grant
    1. date
    1. id
    1. awarder
    1. amount
1. teaching
    1. date
    1. course_id
1. address
    1. kind
    1. address1
    1. address2
    1. city
    1. state
    1. postal code
    1. country
1. topic
1. title
1. phone
    1. kind
    1. number
1. website
    1. kind
    1. url
1. language
    1. capability
    1. iso639

## 1. From a spreadsheet, fetch public data from open APIs; return JSON

From a spreadsheet of people -- one row per person -- create a JSON object for further processing.  Any cell in the spreadsheet may hold multiple values -- a person
may have more than one name, more than one phone number, etc.  Further processing may flag some multiple values as errors.  Values may have sub-values.  Teaching,
for example, has two sub-values -- the start date of the course, and the (local) identifier of the course.

Then the JSON is processed, looking up each of the values using a standard set of APIs for VIVO.  These APIs provide details about the things referenced in the 
spreadsheet data.  The VIVO APIs hide processing details of the data sources, and standardize each, 
producing simple JSON for subsequent processing.
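
The notebook cell that follows passes each value to a function in the local `extract` module that has the same name as the spreadsheet column. That module is not shown here, so the following is only a guess at its shape, inferred from the JSON output. The helper `simple`, the error message, and the use of the pipe as the sub-value separator for teaching are assumptions.

```python
def simple(key):
    """Build a parser that wraps a plain value as {key: value} (hypothetical helper)."""
    def parse(value):
        return {key: value}
    return parse


# Single-valued columns, one parser per column name.
orcid = simple("orcid")
name = simple("name")
department = simple("department")


def teaching(value):
    """Parse 'date|course_id' into the nested form seen in the output.

    Assumes the pipe separates the two sub-values, as described for education.
    """
    parts = value.split("|")
    if len(parts) != 2:
        # Name the offending value so a human can find it in the spreadsheet.
        raise ValueError(f"teaching expects 'date|course_id', got {value!r}")
    date, course_id = parts
    return {"teaching": {"date": date, "course_id": course_id}}


print(teaching("201506|ENC3114"))
```

Keeping one function per column means a new attribute only needs a new function. No change to the loop that walks the rows is required.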

In [5]:
%load_ext autoreload
%autoreload 2
import csv
import json

import extract  # local module: called below with one function per column name

input_filename = "people.tsv"

# Read the TSV: the first row is the header, each later row becomes a dict.
with open(input_filename, newline="") as tsvfile:
    reader = csv.reader(tsvfile, delimiter="\t", quotechar='"')
    header = next(reader)
    data = [dict(zip(header, row)) for row in reader]

# Split each cell on ";" and parse every value with the extract function
# named after its column.  Empty cells become empty lists.
for row in data:
    for key, cell in row.items():
        if cell == "":
            row[key] = []
        else:
            row[key] = [getattr(extract, key)(val) for val in cell.split(";")]

print(json.dumps(data[2], indent=4))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
{
    "orcid": [
        {
            "orcid": "1234-3342-2221-2024"
        }
    ],
    "name": [
        {
            "name": "Amir Amarbad"
        }
    ],
    "department": [
        {
            "department": "French"
        }
    ],
    "pmid": [
        {
            "pmid": "00400"
        },
        {
            "pmid": "300033"
        }
    ],
    "position": [],
    "education": [],
    "teaching": [
        {
            "teaching": {
                "date": "201506",
                "course_id": "ENC3114"
            }
        },
        {
            "teaching": {
                "date": "20159",
                "course_id": "ENC3114"
            }
        },
        {
            "teaching": {
                "date": "201601",
                "course_id": "ENC2021"
            }
        },
        {
            "teaching": {
                "date": "201601",
                "c

## 2. RMLMapper makes first version of triples in TTL

In [4]:
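import os
# Map the JSON to Turtle with RMLMapper, using the rules in person-map.ttl.
os.system("rmlmapper -m person-map.ttl -o people.ttl -s turtle")
# sed <people.ttl >temp.ttl '10,$s/<http:\/\/vivo.ufl.edu\/individual\/\(.*\)>/data:\1/'
# Feed the Turtle to improve_rdf.py on standard input.
os.system("< people.ttl python3 improve_rdf.py")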

0
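
The commented-out `sed` line in the cell above replaces full `http://vivo.ufl.edu/individual/` IRIs with `data:` prefixed names, starting at line 10 of the file. What `improve_rdf.py` actually does is not shown, so the following is only a Python rendering of that `sed` expression. The function name `shorten_iris` and its `first_line` parameter are made up for the example. Unlike the `sed` pattern, the dots in the host name are escaped here.

```python
import re

# Same pattern as the sed expression; (.*) is greedy, as it is in sed.
IRI_PATTERN = re.compile(r"<http://vivo\.ufl\.edu/individual/(.*)>")


def shorten_iris(ttl_text, first_line=10):
    """Rewrite the first full VIVO IRI on each line as data:<local name>.

    Lines before `first_line` (1-based) are left alone, mirroring the
    '10,$' address in the sed command.
    """
    out = []
    for number, line in enumerate(ttl_text.splitlines(keepends=True), start=1):
        if number >= first_line:
            line = IRI_PATTERN.sub(r"data:\1", line, count=1)
        out.append(line)
    return "".join(out)
```

Because the match is greedy, a line holding two full IRIs would be collapsed into a single `data:` name. That is the same behavior the `sed` expression has, and may be one reason it was set aside.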

## 3. SHACL is used to test the TTL against a series of validations

## 4. Query VIVO to rewrite IRIs as needed

## 5. Robot is used to infer triples and reduce triples

## 6. Final triples are loaded to VIVO