# ROR testing

This notebook is a simple test made up of two parts.

## Part 1  Graph

In this section a simple SPARQL query collects all the unique names of resources typed as `schema:Organization` in the Ocean InfoHub graph.  The query will be refined later; for now it serves the demonstration purpose.

## Part 2 ROR Retriever

This section uses the [ROR Retriever](https://github.com/Metadata-Game-Changers/RORRetriever) code base.  The code from that repo is modified slightly to fit the workflow here.  Converting that code base into a library that can be installed with pip would be a useful pull request.


### Imports

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from SPARQLWrapper import SPARQLWrapper, JSON
import requests                     # for making web requests
import json                         # json reading and access
import pandas as pd                 # pandas for dataframe processing
from urllib.parse import quote      # URL encoding
import logging

sparqlep = "https://ts.collaborium.io/blazegraph/namespace/development/sparql"

## Part 1 Graph 

### Support Functions

In [3]:
#@title
def get_sparql_dataframe(service, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)

## Queries

What follows is a set of queries designed to provide a feel for the OIH graph.

### Simple Count

How many triples are there?

In [4]:
rq_count = """SELECT (COUNT(*) as ?Triples) 
WHERE 
  {
      { ?s ?p ?o } 
  }
"""

In [5]:
dfsc = get_sparql_dataframe(sparqlep, rq_count)
dfsc.head()

| | Triples |
|---|---|
| 0 | 5546932 |


### Organization Query


In [33]:
rq_pcount = """prefix prov: <http://www.w3.org/ns/prov#>
PREFIX con: <http://www.ontotext.com/connectors/lucene#>
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
PREFIX con-inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
PREFIX schemaold: <http://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?name
WHERE
{
  ?s a schema:Organization .
  ?s schema:name ?name 

}     
"""

dfc = get_sparql_dataframe(sparqlep, rq_pcount)


In [35]:
dfc.head(20)

| | name |
|---|---|
| 0 | aquadocs |
| 1 | marineie |
| 2 | marinetraining |
| 3 | obis |
| 4 | obps |
| 5 | oceanexperts |
| 6 | The Institute for Marine and Antarctic Studies... |
| 7 | MARine Litter in Europe Seas: Social AwarenesS... |
| 8 | Copernicus Marine Environmental Monitoring Ser... |
| 9 | Unesco |


In [36]:
dfl = dfc['name'].values.tolist()
print(dfl[:10])

['aquadocs', 'marineie', 'marinetraining', 'obis', 'obps', 'oceanexperts', 'The Institute for Marine and Antarctic Studies, University of Tasmania', 'MARine Litter in Europe Seas: Social AwarenesS and CO Responsibility', 'Copernicus Marine Environmental Monitoring Service (CMEMS)/Mercator Ocean', 'Unesco']


## Part 2 ROR Retriever


Look at https://github.com/Metadata-Game-Changers/RORRetriever/blob/main/RORRetriever.py


In [27]:
lggr = logging.getLogger('RORRetriever')

def outputResults(tl: list,             # affiliation list input (list of dicts)
                  output: str,          # name of output file
                  cnt: int,             # number of RORs
                  writeHeader: bool):   # write header flag (true for first write)
    '''
        Create a dataframe from a list of dictionaries (tl) and output it to a file.
        Rows are output in sets with length = -o outputInterval. This prevents output
        from being lost if something goes wrong in long processing runs.
    '''
    #
    # The items that are defined in the dictionaries (in tl) will be used to create a dataframe.
    # The columns that are defined depend on whether or not RORs have been discovered. Defining the
    # complete set of columns here avoids an error caused by outputting data without all of the
    # columns defined.
    #
    c_names = ['affiliation', 'searchString_Affiliation',
               'ROR_Affiliation', 'organizationLookupName_Affiliation',
               'country_Affiliation', 'match_Affiliation',
               'chosen_Affiliation', 'score',
               'numberOfResults_Affiliation', 'valid']

    df = pd.DataFrame(tl)                       # create the dataframe from the list of dicts

    if len(df.columns) < 9:                     # add columns if needed
        for c in ['searchString_Affiliation', 'ROR_Affiliation',
                  'organizationLookupName_Affiliation', 'country_Affiliation',
                  'chosen_Affiliation']:
            df.insert(0, c, value='')           # insert empty text ('') column named c at position 0
        df.insert(0,'score',0)                  # insert numeric column for scores (default = 0)

    df.to_csv(output, sep='\t', index=False,                # output dataframe to tab delimited file
              encoding='utf-8', header=writeHeader,         # in append mode with header on first write
              mode = 'a', columns=c_names)

    lggr.info("{} new RORs written to {}".format(cnt,output))

def printResponse(df:pd.core.frame.DataFrame):
    '''
        Show response table on terminal with following fields:
        substring:      Search string (can be substring of complete affiliation)
        score:          Match score between 0 and 1 (1 is the best match chosen by algorithm)
        matchingType:   Method that found the match (provided by algorithm)
        chosen:         True for chosen ROR False for others
        organization:   Name of organization for ROR (should match substring)
        country:        Country of organization
    '''
    out_df = df[['substring','score','matching_type','chosen']].copy()  # copy so the column assignments below do not trigger SettingWithCopyWarning
    out_df['ror'] = df['organization'].apply(lambda x: x['id'])
    out_df['organization'] = df['organization'].apply(lambda x: x['name'])
    out_df['country'] = df['organization'].apply(lambda x: x['country']['country_name'])
    pd.set_option('display.width', 1000)
    print(out_df.to_string(index=False))


def retrieveData(url:str                            # URL to search
                 )->requests.models.Response:        # requests response
    '''
        read data for url, return response
    '''
    lggr.debug(f"Retrieving Data URL: {url}")

    try:
        response = requests.get(url)                # use the url parameter, not a global
        response.raise_for_status()
    except (requests.exceptions.HTTPError,
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            requests.exceptions.TooManyRedirects,
            requests.exceptions.MissingSchema) as err:
        lggr.warning(f'URL: {url} Error: {err}')    # log the failure and signal it with None
        return None

    lggr.debug(f'Response length: {len(response.text)}')
    return response
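
The docstring of `outputResults` describes writing rows in fixed-size batches so that a failure late in a long run does not lose earlier results. The cell below is a minimal, self-contained sketch of that pattern, not the RORRetriever implementation: `flush_batch` and `process_in_batches` are hypothetical names, the record written for each item is a placeholder, and `reindex` is swapped in as one possible way to guarantee a fixed column set.

```python
import pandas as pd


def flush_batch(rows, path, columns, write_header):
    """Append one batch of dict rows to a tab-delimited file."""
    # reindex guarantees every expected column exists (missing ones become NaN)
    df = pd.DataFrame(rows).reindex(columns=columns)
    df.to_csv(path, sep="\t", index=False, header=write_header, mode="a")


def process_in_batches(items, path, columns, interval=2):
    """Hypothetical driver: flush every `interval` items and once at the end."""
    batch, write_header = [], True
    for item in items:
        # placeholder record; a real run would append the ROR lookup result here
        batch.append({"affiliation": item, "valid": False})
        if len(batch) >= interval:
            flush_batch(batch, path, columns, write_header)
            batch, write_header = [], False     # header only on the first write
    if batch:                                   # flush whatever is left over
        flush_batch(batch, path, columns, write_header)
```

With this shape, an exception raised while processing item N still leaves every previously completed batch on disk.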

### ROR call

#### Notes

* need to add parallel execution at this level, but worry if the API is rate or call limited
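
One way to reconcile parallel execution with a possibly rate-limited API is a small thread pool whose workers share a throttle. The sketch below is only an illustration under assumptions: `MinIntervalLimiter` and `lookup_all` are hypothetical names, `lookup` stands in for whatever function performs a single ROR request, and the `max_workers` and `min_interval` defaults are placeholders because this notebook does not establish what limits the ROR API actually enforces.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class MinIntervalLimiter:
    """Allow at most one call every `min_interval` seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def lookup_all(affiliations, lookup, max_workers=4, min_interval=0.1):
    """Run `lookup` over affiliations in a small thread pool, throttled."""
    limiter = MinIntervalLimiter(min_interval)

    def throttled(affiliation):
        limiter.wait()
        return lookup(affiliation)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(throttled, affiliations))  # preserves input order
```

Because `pool.map` preserves input order, the results line up with the input list and can be zipped back onto the affiliations.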

In [37]:
# input_l = ["UNESCO", "IFREMER"]
# input_l = ["The University of Alabama Libraries","Arizona State University Library", "UC Berkeley Library", "MIT Libraries"]
input_l = dfl[:40]  # just do a subset for now

lggr = logging.getLogger('RORRetriever')

ror_list = []
newRORCount = 0                                 # counter of RORs found so far

for i, affiliation in enumerate(input_l):       # loop over affiliations in input_l

    if not isinstance(affiliation, str):        # skip non-strings (NaN)
        continue
    affiliation = affiliation.replace('\n','').strip()

    # if (i % args.outputInterval == 0) & (len(ror_list) > 0):              # output current results
    #     lggr.info("{} processed affiliation: {}".format(i,affiliation))
    #     outputResults(ror_list, outputFileName,newRORCount,writeHeader)
    #     writeHeader = False
    #     ror_list = []

    URL = 'https://api.ror.org/organizations?affiliation=' + quote(affiliation.encode('utf-8'))
    r = retrieveData(URL)

    if (r is None) or (r.status_code != 200):
        lggr.warning(f'****************** HTTP Error: {URL}')
        continue
    #
    # convert response to json
    #
    response = r.json()


   ## -----------------------------------------------------------------------------------------------------

    response_df = pd.DataFrame(response.get('items'))           # create response dataframe

    if (response['number_of_results'] == 0):                    # ROR search had no results
        ror_list.append({'affiliation':affiliation,             # set affiliation, numberOfResults, and match = 'No Result'
                         'numberOfResults_Affiliation':0,
                         'match_Affiliation':'No Result',
                         'valid':False})
        continue                                                # next affiliation

    flags_noAcronyms = True

    acronymCount = len(response_df[response_df['matching_type']=='ACRONYM'])                # count Acronyms
    if (acronymCount == response['number_of_results']) and (flags_noAcronyms is True):       # search result is all acronyms:
        ror_list.append({'affiliation':affiliation,              # set affiliation, numberOfResults, and match = 'No Result'
                         'numberOfResults_Affiliation':0,
                         'match_Affiliation':'No Result',
                         'valid':False})
        continue                                                # next affiliation

    # TODO make the following two items flags

    # if (len(input_l) == 1):  # or args.showDetails:                # if only one affiliation is being tested
    #     printResponse(response_df)                              # or --response is set, print results

    # if (args.noAcronyms is True):                           # remove acronym matches from response _df
    #     response_df = response_df[response_df['matching_type'] != 'ACRONYM']

    #
    # search for item chosen as best match by affiliation API
    #

    flag_matchMax = False

    maxScore = response_df.score.max()                               # find maximum score
    if flag_matchMax is False:
        chosen_df = response_df[response_df.chosen == True]          # create chosen dataframe where chosen = True
    else:
        chosen_df = response_df[response_df.score == maxScore]       # create chosen dataframe where score = maxScore

        # items can be chosen (chosenScore = 1) or,
        # if --max is set the item with the max score is chosen
        # even if the ROR algorithm did not choose the item
    if len(chosen_df) > 0:
        newRORCount += len(chosen_df)                                   # count new RORs
        for idx in chosen_df.index:                                     # add new RORs to ror_list (idx avoids shadowing the outer loop's i)
            ror_list.append({'affiliation':affiliation,
                             'searchString_Affiliation':chosen_df.loc[idx,'substring'].replace('"',''),
                             'ROR_Affiliation':chosen_df.loc[idx,'organization']['id'],
                             'organizationLookupName_Affiliation':chosen_df.loc[idx,'organization']['name'],
                             'chosen_Affiliation':chosen_df.loc[idx,'chosen'],
                             'score':chosen_df.loc[idx,'score'],
                             'match_Affiliation': chosen_df.loc[idx,'matching_type'],
                             'country_Affiliation':chosen_df.loc[idx,'organization']['country']['country_name'],
                             'numberOfResults_Affiliation':response['number_of_results'],
                             'valid': True})
            lggr.debug("%s : %s <%s> %s", newRORCount, affiliation,    # logging takes a format string plus arguments
                       chosen_df.loc[idx,'substring'], chosen_df.loc[idx,'organization']['name'])
    else:           # No item chosen
        ror_list.append({'affiliation': affiliation,
                         'numberOfResults_Affiliation': response['number_of_results'],
                         'match_Affiliation': 'No Match', 'valid': False})
    #
# output final results
#
# outputResults(ror_list, outputFileName, newRORCount, writeHeader)
# lggr.info("{} {} RORs Found".format(outputFileName, newRORCount))

# print(ror_list)
#     printResponse(response_df)                              # or --response is set, print results
# response_df.head()
# print(ror_list)
df = pd.DataFrame(ror_list)

In [38]:
df.head(20)

| | affiliation | numberOfResults_Affiliation | match_Affiliation | valid | searchString_Affiliation | ROR_Affiliation | organizationLookupName_Affiliation | chosen_Affiliation | score | country_Affiliation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aquadocs | 0 | No Result | False | | | | | | |
| 1 | marineie | 0 | No Result | False | | | | | | |
| 2 | marinetraining | 0 | No Result | False | | | | | | |
| 3 | obis | 0 | No Result | False | | | | | | |
| 4 | obps | 0 | No Result | False | | | | | | |
| 5 | oceanexperts | 0 | No Result | False | | | | | | |
| 6 | The Institute for Marine and Antarctic Studies... | 3 | No Match | False | | | | | | |
| 7 | MARine Litter in Europe Seas: Social AwarenesS... | 10 | No Match | False | | | | | | |
| 8 | Copernicus Marine Environmental Monitoring Ser... | 1 | No Match | False | | | | | | |
| 9 | Unesco | 0 | No Result | False | | | | | | |


## Closing Thoughts / Future Directions

At this point the process of querying the graph for organizations and then doing searches for RORs is fleshed out.  What remains is assessing the ROR matches.

* how do we validate a match?  By score only?  By score and some other factor(s)?  By a heuristic?
* if a match is made, then what?
    * generate triples in what vocabulary
    * generate PROV (in what manner)
    * load triples into default graph or some other graph  (how to leverage quads in a given approach)
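
As one possible answer to the validation question, a match could be accepted only when the API score and an independent name comparison both pass. The cell below is a rough sketch of such a heuristic, not an evaluated method: `name_similarity` and `accept_match` are hypothetical helpers, the thresholds are arbitrary placeholders, and `difflib.SequenceMatcher` is just one readily available similarity measure. The record keys are the ones produced in `ror_list` above.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def accept_match(record, min_score=0.9, min_similarity=0.6):
    """Hypothetical rule: the API score AND the name similarity must both pass."""
    if not record.get("valid"):
        return False                     # no chosen ROR for this affiliation
    similarity = name_similarity(record["affiliation"],
                                 record["organizationLookupName_Affiliation"])
    return record["score"] >= min_score and similarity >= min_similarity
```

Applied to the dataframe, something like `df[df.apply(accept_match, axis=1)]` would keep only the accepted rows, assuming every valid row carries a score and a lookup name.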
    
What is the value of this or any other such KG completion process?
Is it:

* to improve the hosted graph
* build a support graph
* provide feedback to the source to improve their original source material
* other
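
To make the "support graph" option and the earlier question about quads more concrete, the cell below sketches how accepted matches could be serialized as N-Quads in a dedicated named graph with a single PROV timestamp. Everything specific here is an assumption for illustration: the choice of `schema:sameAs` as the linking property, the use of `prov:generatedAtTime` on the graph IRI, and the hypothetical `ror_quads` helper. It also assumes the organization query is extended to return the subject IRI (`?s`) alongside `?name`, and that all IRIs are already valid and need no escaping.

```python
from datetime import datetime, timezone

SCHEMA_SAME_AS = "https://schema.org/sameAs"
PROV_GENERATED_AT = "http://www.w3.org/ns/prov#generatedAtTime"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def ror_quads(matches, graph_iri, generated_at=None):
    """Build N-Quads lines linking organization IRIs to ROR IRIs in one named graph.

    matches is an iterable of (organization_iri, ror_iri) pairs.
    """
    generated_at = generated_at or datetime.now(timezone.utc)
    # normalize to UTC so the trailing "Z" is accurate
    stamp = generated_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [f"<{org}> <{SCHEMA_SAME_AS}> <{ror}> <{graph_iri}> ."
             for org, ror in matches]
    # one provenance statement about the named graph itself
    lines.append(f'<{graph_iri}> <{PROV_GENERATED_AT}> '
                 f'"{stamp}"^^<{XSD_DATETIME}> <{graph_iri}> .')
    return lines
```

Keeping the generated statements in their own named graph means they can be dropped or regenerated without touching the harvested source triples.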
