# Tutorial

In this tutorial, we create a database that interacts with two custom stores:
- A CSV data store: `CSVDatabase`
- An RDF metadata store: `InMemoryRDFDatabase`

Both stores are implemented in this tutorial.

In [1]:
#!pip install gldb[tutorial]

In [2]:
import gldb

from typing import Union, List, Dict
import pandas as pd
import pathlib
import rdflib

## Data Stores

The database interacts with "**data stores**" (used here as a more generic term for databases). Data is uploaded to these stores and queried from them.

A store can hold raw data or metadata.

Let's write concrete implementations for a CSV store and an in-memory RDF store.

In [3]:
class CSVQuery(gldb.query.Query):
    
    def __init__(self, tables: Dict[str, pd.DataFrame]):
        self.tables = tables
        
    def execute(self, query: str, description=None, *args, **kwargs):
        # `query` is a pandas expression string that is passed to DataFrame.query
        table = kwargs.get("table")
        return gldb.query.QueryResult(query=self, data=self.tables[table].query(query))

In [4]:
class CSVDatabase(gldb.stores.DataStore):

    def __init__(self, filenames=None):
        self._filenames = filenames or []
        self.tables = {}

    @property
    def query(self):
        return CSVQuery(self.tables)

    def upload_file(self, filename: pathlib.Path) -> bool:
        if filename.resolve().absolute() in self._filenames:
            return True
        self._filenames.append(filename.resolve().absolute())
        self.tables[filename.stem] = pd.read_csv(filename)
        return True

In [5]:
class InMemoryRDFDatabase(gldb.stores.RDFStore):

    def __init__(self):
        self._filenames = []
        self._graphs = {}

    def upload_file(self, filename: pathlib.Path) -> bool:
        self._filenames.append(filename.resolve().absolute())
        return True

    @property
    def graph(self) -> rdflib.Graph:
        """Return graph for the metadata store."""
        combined_graph = rdflib.Graph()
        for filename in self._filenames:
            g = self._graphs.get(filename)
            # Compare against None: an empty rdflib.Graph is falsy
            if g is None:
                g = rdflib.Graph()
                g.parse(filename)
                # Iterate over a snapshot because the graph is modified in the loop
                for s, p, o in list(g):
                    if isinstance(s, rdflib.BNode):
                        new_s = rdflib.URIRef(f"https://example.org/{s}")
                    else:
                        new_s = s
                    if isinstance(o, rdflib.BNode):
                        new_o = rdflib.URIRef(f"https://example.org/{o}")
                    else:
                        new_o = o
                    g.remove((s, p, o))
                    g.add((new_s, p, new_o))
                self._graphs[filename] = g
            combined_graph += g
        return combined_graph

## The Database instance

The central class is `GenericLinkedDatabase`. It is instantiated with a dictionary that maps store names to store instances:

In [6]:
db = gldb.GenericLinkedDatabase(
    {
        "csv": CSVDatabase(),
        "rdf": InMemoryRDFDatabase()
    }
)

Populate the stores with data:

In [7]:
for filename in pathlib.Path("data").glob('*.jsonld'):
    db.stores.rdf.upload_file(filename)

In [8]:
for filename in pathlib.Path("data").glob('*.csv'):
    db.stores.csv.upload_file(filename)
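The two loops above expect a `data` directory that contains `*.jsonld` and `*.csv` files. If no such files are at hand, a minimal sketch like the following can create a pair for experimenting. All values, the creator name, and the use of a temporary directory are placeholders chosen for illustration; they are not the files used in this tutorial.

```python
import csv
import json
import pathlib
import tempfile

# Hypothetical sample data, written to a temporary directory
data_dir = pathlib.Path(tempfile.mkdtemp()) / "data"
data_dir.mkdir()

# Raw data: one CSV file; its stem ("temperature") becomes the table name
csv_path = data_dir / "temperature.csv"
with csv_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "temperature"])
    writer.writerow(["2024-01-01 00:00", 21.0])
    writer.writerow(["2024-01-01 01:00", 23.5])

# Metadata: a JSON-LD document describing the CSV file as a dcat:Dataset.
# The creator has no "@id", so an RDF parser treats it as a blank node.
metadata = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dcterms": "http://purl.org/dc/terms/",
        "foaf": "http://xmlns.com/foaf/0.1/",
    },
    "@id": "file:temperature.csv",
    "@type": "dcat:Dataset",
    "dcterms:creator": {"@type": "foaf:Person", "foaf:name": "Jane Doe"},
}
jsonld_path = data_dir / "temperature.jsonld"
jsonld_path.write_text(json.dumps(metadata, indent=2))

print(sorted(p.name for p in data_dir.iterdir()))
# ['temperature.csv', 'temperature.jsonld']
```

To use these files with the loops above, point `pathlib.Path("data")` at `data_dir` instead.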

## Query the RDF store

Every store has a `query` property, which returns the query object of the store. For the `RDFStore`, which `gldb` already implements, this is a `SparqlQuery` (also provided by `gldb`):

In [9]:
db.stores.rdf.query

SparqlQuery(graph=[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory']].)
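The pattern of a store that exposes a callable query object can be sketched without `gldb`. The classes below are hypothetical stand-ins written for illustration, not the `gldb` API. In particular, the sketch assumes that calling the query object forwards to `execute`, which this tutorial does not show explicitly.

```python
from abc import ABC, abstractmethod


class SketchQuery(ABC):
    """Hypothetical stand-in for a query base class (not the gldb API)."""

    @abstractmethod
    def execute(self, query, *args, **kwargs):
        ...

    def __call__(self, query, *args, **kwargs):
        # Assumption: calling the query object forwards to execute()
        return self.execute(query, *args, **kwargs)


class ListQuery(SketchQuery):
    """Returns all stored items that contain the query string."""

    def __init__(self, items):
        self.items = items

    def execute(self, query, *args, **kwargs):
        return [item for item in self.items if query in item]


class ListStore:
    """Hypothetical store that keeps its items in a plain list."""

    def __init__(self):
        self.items = []

    def upload(self, item):
        self.items.append(item)

    @property
    def query(self):
        # A new query object bound to the current content of the store
        return ListQuery(self.items)


store = ListStore()
store.upload("temperature.csv")
store.upload("pressure.csv")
matches = store.query("temp")
print(matches)  # ['temperature.csv']
```

Because `query` is a property, `store.query` yields the query object and `store.query(...)` runs it in one step, which mirrors the call style used in this tutorial.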

Let's formulate the SPARQL query string:

In [10]:
query_str = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterm: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT * WHERE {
    ?file a dcat:Dataset .
    ?file dcterm:creator ?person .
    ?person a foaf:Person .
}
"""

Perform the query:

In [11]:
res = db.stores.rdf.query(
    query=query_str,
    description="Selecting all persons")
res

QueryResult(
  query=SparqlQuery(graph=[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory']].),
  data=<rdflib.plugins.sparql.processor.SPARQLResult object at 0x0000015953EE8BB0>,
  description=Selecting all persons
)

In [12]:
for row in res.data.bindings:
    print(row)

{rdflib.term.Variable('file'): rdflib.term.URIRef('file:temperature.csv'), rdflib.term.Variable('person'): rdflib.term.URIRef('https://example.org/Nbf69432ebd3d49e3abdcfd104b91ca08')}


## Query the CSV store

In [19]:
csv_res = db.stores.csv.query("temperature > 23.0", table="temperature")
csv_res.data

| | timestamp | temperature |
|---|---|---|
| 4 | 2024-01-01 04:00 | 23.2 |
| 5 | 2024-01-01 05:00 | 23.8 |
| 6 | 2024-01-01 06:00 | 24.9 |
| 7 | 2024-01-01 07:00 | 25.5 |
| 8 | 2024-01-01 08:00 | 24.0 |
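The query string is handed to pandas' `DataFrame.query`, so any expression that pandas accepts there can be used. A minimal sketch with made-up values shows that a query string selects the same rows as a boolean mask and that conditions can be combined with `and`:

```python
import pandas as pd

# Hypothetical values for illustration only
df = pd.DataFrame(
    {
        "timestamp": ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"],
        "temperature": [22.1, 23.2, 24.9],
    }
)

# A query string and the equivalent boolean mask select the same rows
by_query = df.query("temperature > 23.0")
by_mask = df[df["temperature"] > 23.0]
print(by_query.equals(by_mask))  # True

# Conditions can be combined inside the expression
in_range = df.query("temperature > 23.0 and temperature < 24.0")
print(in_range["temperature"].tolist())  # [23.2]
```

The rows keep their original index, which is why the result above starts at index 4 rather than 0.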


## Federated queries

The raw temperature data is stored in a different store than its metadata, so the results from both stores need to be combined.

Currently, this must be done with custom functions that return a `FederatedQueryResult`:

In [48]:
def fetch_temperature_dataset(query, table) -> gldb.query.FederatedQueryResult:
    csv_res = db.stores.csv.query(query, table=table)

    query_all_metadata_of_temperature = f"""
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    SELECT * WHERE {{
        <file:{table}.csv> a dcat:Dataset ;
            ?p ?o .
    }}
    """
    rdf_res = db.stores.rdf.query(query=query_all_metadata_of_temperature)
    fed_res = gldb.query.FederatedQueryResult(
        data=csv_res,
        metadata=rdf_res
    )
    return fed_res

In [50]:
federated_result = fetch_temperature_dataset(query="temperature > 23.0", table="temperature")

In [51]:
federated_result

FederatedQueryResult(data=QueryResult(
  query=<__main__.CSVQuery object at 0x00000159554C6350>,
  data=          timestamp  temperature
4  2024-01-01 04:00         23.2
5  2024-01-01 05:00         23.8
6  2024-01-01 06:00         24.9
7  2024-01-01 07:00         25.5
8  2024-01-01 08:00         24.0,
  description=None
), metadata=QueryResult(
  query=SparqlQuery(graph=[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory']].),
  data=<rdflib.plugins.sparql.processor.SPARQLResult object at 0x00000159554C69E0>,
  description=None
))