# Ocean Acidification

### References

* [OceanExpert metadata document](https://oceanexpert.org/document/26001)
* [OA ERDDAP server](https://erddap.oa.iode.org/erddap/index.html)
* An example query (used as guide for below): https://github.com/iodepo/odis-arch/blob/schema-dev-df/code/SPARQL/baseQuery.rq
* SHACL shapes for potential reference: https://github.com/iodepo/odis-arch/tree/schema-dev-df/code/SHACL

### Need to

* Find datasets that have a distribution
* Connect those datasets to their provenance (prov)
* Validate the variable measured with SHACL


In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
# import dask, boto3
# import dask.dataframe as dd
import numpy as np
import json



In [2]:
sparqlep = "http://graph.oceaninfohub.org/blazegraph/namespace/oih/sparql"


In [3]:
def get_sparql_dataframe(service, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    # convert() parses the JSON response body into a Python dict
    processed_results = sparql.query().convert()
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)
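
The helper above relies on `row.get(c, {}).get('value')` so that variables left unbound by an `OPTIONAL` clause become `None` rather than raising a `KeyError`. A minimal sketch of that flattening step, run against a hand-made response dict, is shown below. The function name `bindings_to_rows` and the `example.org` values are hypothetical and only for illustration; they are not real graph output.

```python
def bindings_to_rows(results):
    """Flatten a SPARQL JSON results dict into (columns, rows).

    Variables missing from a binding (for example, unmatched OPTIONAL
    patterns) are returned as None.
    """
    cols = results["head"]["vars"]
    rows = [
        [binding.get(col, {}).get("value") for col in cols]
        for binding in results["results"]["bindings"]
    ]
    return cols, rows


# Hand-made response for illustration only (assumed values, not real data).
sample = {
    "head": {"vars": ["s", "name", "dist"]},
    "results": {
        "bindings": [
            {
                "s": {"type": "uri", "value": "https://example.org/dataset/1"},
                "name": {"type": "literal", "value": "Example dataset"},
            },
            {
                "s": {"type": "uri", "value": "https://example.org/dataset/2"},
                "dist": {"type": "uri", "value": "https://example.org/dataset/2/data.csv"},
            },
        ]
    },
}

cols, rows = bindings_to_rows(sample)
for row in rows:
    print(dict(zip(cols, row)))
```

Because every missing binding becomes `None`, the resulting data frame reports those cells as null, which is what the non-null counts from `dfsc.info()` reflect.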

In [4]:
# Despite the variable name, this query returns the matching resources, not a count.
rq_count = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>
PREFIX bds: <http://www.bigdata.com/rdf/search#>

SELECT DISTINCT  ?s ?url ?dist ?g ?type ?score ?name ?lit ?description ?headline
WHERE
{
   ?lit bds:search "ocean acidification" .
   ?lit bds:matchAllTerms "false" .
   ?lit bds:relevance ?score .
   graph ?g {
    ?s ?p ?lit .
    ?s rdf:type ?type .
    OPTIONAL { ?s schema:distribution ?dist .   }
    OPTIONAL { ?s schema:name ?name .   }
    OPTIONAL { ?s schema:headline ?headline .   }
    OPTIONAL { ?s schema:url ?url .   }
    OPTIONAL { ?s schema:description ?description .    }
  }

}
ORDER BY DESC(?score)
OFFSET 0
"""
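
Since the first goal is to find datasets with a distribution, one option is to make the `schema:distribution` pattern mandatory instead of `OPTIONAL`. The sketch below builds such a variant of the query above. The function `build_search_query` and its parameters are hypothetical, and the query has not been run against the endpoint. It also assumes the `https://schema.org/` prefix; the sample output lists `http://schema.org/Dataset` types, so records in the other namespace may need a second prefix, which this sketch does not handle.

```python
def build_search_query(term, require_distribution=False, limit=None):
    """Build a Blazegraph full-text search query (illustrative sketch).

    term: search phrase passed to bds:search
    require_distribution: if True, only return subjects with schema:distribution
    limit: optional maximum number of rows
    """
    # Escape backslashes and double quotes so the term stays a valid literal.
    safe_term = term.replace("\\", "\\\\").replace('"', '\\"')
    dist_pattern = (
        "?s schema:distribution ?dist ."
        if require_distribution
        else "OPTIONAL { ?s schema:distribution ?dist . }"
    )
    limit_clause = f"LIMIT {int(limit)}" if limit is not None else ""
    return f"""PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>
PREFIX bds: <http://www.bigdata.com/rdf/search#>

SELECT DISTINCT ?s ?dist ?g ?type ?score ?name
WHERE
{{
   ?lit bds:search "{safe_term}" .
   ?lit bds:matchAllTerms "false" .
   ?lit bds:relevance ?score .
   graph ?g {{
    ?s ?p ?lit .
    ?s rdf:type ?type .
    {dist_pattern}
    OPTIONAL {{ ?s schema:name ?name . }}
  }}
}}
ORDER BY DESC(?score)
{limit_clause}
"""


rq_with_dist = build_search_query(
    "ocean acidification", require_distribution=True, limit=100
)
print(rq_with_dist)
```

The resulting string can be passed to `get_sparql_dataframe(sparqlep, rq_with_dist)` in the same way as `rq_count`.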

In [5]:
dfsc = get_sparql_dataframe(sparqlep, rq_count)

In [6]:
dfsc.head(30)

| | s | url | dist | g | type | score | name | lit | description | headline |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://catalogue.cioos.ca/dataset/ca-cioos_94... | | | urn:gleaner.oih:cioos:06ad273673d73ea121b46ccb... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 1 | https://catalogue.cioos.ca/dataset/ca-cioos_6d... | | | urn:gleaner.oih:cioos:07b8dbe98aa969748be4ccd9... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 2 | https://catalogue.cioos.ca/dataset/ca-cioos_28... | | | urn:gleaner.oih:cioos:470274a756f88f69ab10ba8b... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 3 | https://catalogue.cioos.ca/dataset/ca-cioos_46... | | | urn:gleaner.oih:cioos:5b269267eff9816d38de6111... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 4 | https://catalogue.cioos.ca/dataset/ca-cioos_fe... | | | urn:gleaner.oih:cioos:9da9e91c55f1e3a24526fd7c... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 5 | https://catalogue.cioos.ca/dataset/ca-cioos_80... | | | urn:gleaner.oih:cioos:a93ce67b3674406c3128f45f... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 6 | https://catalogue.cioos.ca/dataset/ca-cioos_17... | | | urn:gleaner.oih:cioos:b1784e3376afc7eb9f4b713c... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 7 | https://catalogue.cioos.ca/dataset/ca-cioos_f7... | | | urn:gleaner.oih:cioos:b923332b130b7d5115695f55... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 8 | https://catalogue.cioos.ca/dataset/ca-cioos_fe... | | | urn:gleaner.oih:cioos:fd138ecf934ae5539fe11454... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |
| 9 | https://catalogue.cioos.ca/dataset/ca-cioos_b6... | | | urn:gleaner.oih:cioos:ffbf74e046ba420f7dfbd6c6... | http://schema.org/Dataset | 0.8838834764831843 | | ocean-acidification | | |


In [7]:
dfsc.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98168 entries, 0 to 98167
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   s            98168 non-null  object
 1   url          28786 non-null  object
 2   dist         2835 non-null   object
 3   g            98168 non-null  object
 4   type         98168 non-null  object
 5   score        98168 non-null  object
 6   name         23708 non-null  object
 7   lit          98168 non-null  object
 8   description  29983 non-null  object
 9   headline     0 non-null      object
dtypes: object(10)
memory usage: 7.5+ MB
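
The summary shows that every column, including `score`, has dtype `object`, and that only 2835 of the 98168 rows carry a `dist` value. A minimal sketch of a post-processing step is given below: it converts `score` to a number and keeps only dataset rows that have a distribution. The function `datasets_with_distribution` and the demo frame are hypothetical, with made-up `example.org` values, and matching on the `schema.org/Dataset` suffix is an assumption meant to cover both the `http` and `https` namespace forms.

```python
import pandas as pd


def datasets_with_distribution(df):
    """Return Dataset rows that have a distribution, highest score first."""
    out = df.copy()
    # SPARQL JSON values arrive as strings; make the score numeric for sorting.
    out["score"] = pd.to_numeric(out["score"], errors="coerce")
    # Assumption: match both http:// and https:// schema.org type IRIs.
    is_dataset = out["type"].str.endswith("schema.org/Dataset", na=False)
    has_dist = out["dist"].notna()
    return (
        out[is_dataset & has_dist]
        .drop_duplicates(subset=["s", "dist"])
        .sort_values("score", ascending=False)
        .reset_index(drop=True)
    )


# Made-up rows for illustration only; not taken from the graph.
demo = pd.DataFrame(
    {
        "s": [
            "https://example.org/a",
            "https://example.org/b",
            "https://example.org/c",
            "https://example.org/d",
        ],
        "dist": [
            None,
            "https://example.org/b.csv",
            "https://example.org/c.nc",
            "https://example.org/d.pdf",
        ],
        "type": [
            "http://schema.org/Dataset",
            "http://schema.org/Dataset",
            "https://schema.org/Dataset",
            "https://schema.org/CreativeWork",
        ],
        "score": ["0.88", "0.5", "0.7", "0.9"],
    }
)

result = datasets_with_distribution(demo)
print(result)
```

On the notebook's own frame, the call would be `datasets_with_distribution(dfsc)`.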
