**Exploring the data - json side**

Starting from the CSV and JSON files provided as exemplar data, we decided to study in more detail the configuration and the structure of the IIIF (International Image Interoperability Framework) format.

https://iiif.io/api/presentation/3.0/#1-introduction


Here we can find more information on the hierarchical structure of the JSON files (collection -> manifest -> canvas), where each level has an id, a type, a label and items.

In our case, the file "collection-1.json" contains ONE collection with label "Works of Dante Alighieri", which in turn contains only ONE manifest with label "Il Canzoniere", holding 239 canvases.

The file "collection-2.json" contains ONE collection with label "Fondo Giuseppe Raimondi", which contains TWO manifests. The first one, with label "Raimondi, Giuseppe. Quaderno manoscritto, "Caserma Scalo : 1930-1968"" (https://dl.ficlit.unibo.it/s/lib/item/19428), has 16 canvases. The second one, with label "Raimondi, Giuseppe. Quaderno manoscritto, "La vecchia centrale termica. Aprile 965"" (https://dl.ficlit.unibo.it/s/lib/item/19425), has another 16 canvases. The whole collection therefore contains 32 canvases.

One of the first things I analysed was the structure of the JSON files, which look like dictionaries containing lists of dictionaries.
Following this cascade of nested lists, one arrives at the canvases contained in each manifest.
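The nesting described above can be walked with a simple loop. What follows is only a minimal sketch on a hand-written, heavily shortened dictionary: the ids and labels in `sample_collection` and the helper `count_canvases` are invented for illustration and are not taken from the project files.

```python
# Hypothetical, shortened IIIF-like collection (ids and labels are made up).
sample_collection = {
    "id": "https://example.org/iiif/collection/1",
    "type": "Collection",
    "label": {"none": ["Sample collection"]},
    "items": [
        {
            "id": "https://example.org/iiif/manifest/1",
            "type": "Manifest",
            "label": {"none": ["Sample manifest"]},
            "items": [
                {"id": "https://example.org/iiif/canvas/1", "type": "Canvas",
                 "label": {"none": ["p. 1"]}},
                {"id": "https://example.org/iiif/canvas/2", "type": "Canvas",
                 "label": {"none": ["p. 2"]}},
            ],
        }
    ],
}


def count_canvases(collection: dict) -> dict:
    """Return the number of canvases in each manifest, keyed by manifest id."""
    counts = {}
    # the "items" of a collection are its manifests
    for manifest in collection.get("items", []):
        # the "items" of a manifest are its canvases
        counts[manifest["id"]] = len(manifest.get("items", []))
    return counts


canvas_counts = count_canvases(sample_collection)
total_canvases = sum(canvas_counts.values())
print(canvas_counts, total_canvases)
```

Run on the real files, the same kind of loop is one way to double-check the canvas counts reported above.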

Continuing my observation of the data, I looked up the attribute 'label' in the IIIF documentation to better understand its meaning.

https://iiif.io/api/presentation/3.0/#language-of-property-values

The label holds a dictionary whose key is 'none' and whose value is a list containing a string with a name or a description. The key is where the language of the label should be indicated, using a BCP 47 language code (https://www.rfc-editor.org/info/bcp47), for example 'en', 'it' or 'fr'. In our case "none" is used because the language is not reported or is unknown.
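As an illustration of this structure, here is a small sketch of a helper that reads the first string out of such a language map. The name `first_label` and its fallback rule are assumptions made for this example, not part of the project code, and the 'it'/'en' values are invented.

```python
def first_label(label_map: dict, preferred_language: str = "none") -> str:
    """Return the first string of a IIIF language map.

    The preferred language key is tried first; if it is missing, the first
    non-empty list found in the map is used. An empty string is returned
    when the map holds no value at all.
    """
    values = label_map.get(preferred_language)
    if not values:
        # fall back to the first non-empty list of any language
        values = next((v for v in label_map.values() if v), [])
    return values[0] if values else ""


label_unknown_language = first_label({"none": ["Il Canzoniere"]})
label_italian = first_label(
    {"it": ["Quaderno manoscritto"], "en": ["Handwritten notebook"]}, "it"
)
label_fallback = first_label({"en": ["Handwritten notebook"]})
label_missing = first_label({})
```

Looking the key up by name, rather than taking whatever value comes first, keeps the result predictable if a file ever provides more than one language.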


**Building the CollectionProcessor**

It is a subclass of Processor.

The CollectionProcessor has one method:

uploadData: it takes as input the path of a JSON file containing collections (with manifests and canvases) and uploads them to the database. This method can be called every time collections need to be uploaded to the database.



In [None]:
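from json import load

from rdflib import Graph, Namespace
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

# Processor and create_Graph are defined elsewhere in the project
class CollectionProcessor(Processor):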

    def __init__(self):
        super().__init__()

    def uploadData(self, path: str):
        try: 
            base_url = "https://github.com/n1kg0r/ds-project-dhdk/"  
            my_graph = Graph()

            # define namespaces 
            nikCl = Namespace("https://github.com/n1kg0r/ds-project-dhdk/classes/")
            nikAttr = Namespace("https://github.com/n1kg0r/ds-project-dhdk/attributes/")
            nikRel = Namespace("https://github.com/n1kg0r/ds-project-dhdk/relations/")
            dc = Namespace("http://purl.org/dc/elements/1.1/")

            my_graph.bind("nikCl", nikCl)
            my_graph.bind("nikAttr", nikAttr)
            my_graph.bind("nikRel", nikRel)
            my_graph.bind("dc", dc)
            
            with open(path, mode='r', encoding="utf-8") as jsonfile:
                json_object = load(jsonfile)
            
            #CREATE GRAPH
            if isinstance(json_object, list):  # TODO: double-check this case
                for collection in json_object:
                    create_Graph(collection, base_url, my_graph)
            
            else:
                create_Graph(json_object, base_url, my_graph)
            
                    
            # DB UPDATE
            store = SPARQLUpdateStore()

            endpoint = self.getDbPathOrUrl()

            store.open((endpoint, endpoint))

            for triple in my_graph.triples((None, None, None)):
                store.add(triple)
            store.close()
            
            with open('Graph_db.ttl', mode='a', encoding='utf-8') as f:
                f.write(my_graph.serialize(format='turtle'))

            return True
        
        except Exception as e:
            print(str(e))
            return False


I wrapped the method in a try/except block: it returns True if the upload completes successfully and False if an error occurs, in which case it also prints the error to clarify why the method did not work.

I used rdflib to create the RDF graph and the RDF statements.
From this library I used the classes Graph and Namespace in CollectionProcessor; I use further classes and properties in create_Graph, a helper function I wrote to populate the graph.

Class Graph() -> https://rdflib.readthedocs.io/en/stable/intro_to_parsing.html -> to create a new (set as empty) RDF Graph

Class Namespace -> 
https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.namespace.html#rdflib.namespace.Namespace -> In RDF, namespaces are used to provide an abbreviation or prefix for URIs (Uniform Resource Identifiers) that identify resources in the RDF graph. By using a namespace, a prefix or keyword can be defined to represent the complete URI. This simplifies writing and reading RDF graphs, as URIs can be quite long and complex.

I created four different namespaces: one for classes, one for attributes, one for relations, and one for Dublin Core (used for its property "identifier").
As you can see, I chose to define my own URIs for almost everything. The only property I reuse from an existing standard is "identifier" from Dublin Core -> http://purl.org/dc/elements/1.1/identifier
This choice was made in order to remain faithful to the vocabulary required by the UML.

Before making this decision I also read the documentation of the ontology proposed by IIIF, the Shared Canvas Data Model ->
https://iiif.io/api/model/shared-canvas/1.0/#Namespaces
It provides some classes and properties useful for describing a IIIF document, such as sc:Canvas and sc:Manifest, or the properties "sc:hasSequences" and "sc:hasCanvases", where "sc:hasSequences" represents the set of sequences within a manifest, while "sc:hasCanvases" represents the set of canvases within a sequence or manifest.
After studying the documentation, this IIIF-specific ontology did not seem suitable for our project: there was no specific class for representing "Collection", and the relationships between Collection and Manifest or between Collection and Canvas were not covered. I therefore preferred to create custom URIs, to have more freedom in building the graph.

Then I use the bind method (.bind) to bind the prefixes to the graph -> https://rdflib.readthedocs.io/en/stable/namespaces_and_bindings.html

Then I open the JSON file and read it.

Then I wrote an if-else statement that checks whether the top-level object of the JSON file is a list.
This is because, as mentioned above, each of our starting JSON files holds a dictionary expressing a single collection, but I also considered the possible case in which a JSON file holds several collections (each expressed as a dictionary) inside a list.
With this if-else statement it is possible to check for the presence of a list and, if there is one, iterate over it and create the triples for each collection.
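The list-or-dictionary check can also be isolated in a small helper, so that the rest of the code always iterates over a list. This is only a sketch: `as_collection_list` is a hypothetical name, and the two JSON strings are made-up minimal inputs.

```python
import json


def as_collection_list(json_object) -> list:
    """Wrap a single collection dictionary in a list; return a list unchanged."""
    if isinstance(json_object, list):
        return json_object
    if isinstance(json_object, dict):
        return [json_object]
    # anything else is not a valid top-level value for this use case
    raise TypeError(
        f"Expected a dict or a list of dicts, got {type(json_object).__name__}"
    )


# made-up minimal inputs: one file with a single collection, one with a list
single = json.loads('{"id": "c1", "type": "Collection"}')
several = json.loads(
    '[{"id": "c1", "type": "Collection"}, {"id": "c2", "type": "Collection"}]'
)

ids_single = [c["id"] for c in as_collection_list(single)]
ids_several = [c["id"] for c in as_collection_list(several)]
```

Using isinstance rather than comparing type(...) directly also accepts subclasses of list and dict, which is the usual Python idiom for this kind of check.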

Then I use the class SPARQLUpdateStore, imported from rdflib.plugins.stores.sparqlstore, to upload the graph persistently to our triplestore. We specify the endpoint of our triplestore (using the method getDbPathOrUrl from Processor) and we store all our triples there using a for loop. Then we close the connection.

After that, I added a "with open" statement to store the RDF graph as a Turtle file. The mode is "a", which stands for "append", because I wanted to store each update of the graph in the same file.




**Create Graph Function**

In [None]:
from rdflib import Graph, URIRef, RDF, Literal, Namespace

def create_Graph(json_object:dict, base_url, my_graph:Graph):
    
    # create an internal id for the collections using an external counter
    # .strip() removes any surrounding whitespace
    with open('collection_counter.txt', 'r', encoding='utf-8') as a:
        collection_counter = int(a.read().strip())

    # create an internal id for the manifest using an external counter
    with open('manifest_counter.txt', 'r', encoding='utf-8') as b:
        manifest_counter = int(b.read().strip())

    # create an internal id for the canvases using an external counter
    with open('canvas_counter.txt', 'r', encoding='utf-8') as c:
        canvas_counter = int(c.read().strip())

    # define namespaces 
    nikCl = Namespace("https://github.com/n1kg0r/ds-project-dhdk/classes/")
    nikAttr = Namespace("https://github.com/n1kg0r/ds-project-dhdk/attributes/")
    nikRel = Namespace("https://github.com/n1kg0r/ds-project-dhdk/relations/")
    dc = Namespace("http://purl.org/dc/elements/1.1/")

    # classes
    Collection = nikCl["Collection"]
    Manifest = nikCl["Manifest"]
    Canvas = nikCl["Canvas"]

    # attributes related to classes
    label = nikAttr["label"]

    # relations among classes
    items = nikRel["items"]
    has_id = dc["identifier"]

    # create a variable for the id
    collection_id = json_object['id'] 

    # create an internal id with an external counter
    collection_counter += 1
    collection_IntId = json_object['type'] + f"_{collection_counter}"
    Coll_internalId = URIRef(base_url + collection_IntId)

    # create a list from the dictionary of label and catch the value of the key "none" (language) in a variable
    label_list = list(json_object['label'].values())  
    value_label = label_list[0][0]

    # make sure the label value is a plain string
    value_label = str(value_label)

    # create the graph with the triples
    my_graph.add((Coll_internalId, has_id, Literal(collection_id)))
    my_graph.add((Coll_internalId, RDF.type, Collection))
    my_graph.add((Coll_internalId, label, Literal(str(value_label))))

    
    # second step is entering the collection items list (enter the manifest) -> entering a list of dictionaries
    # here i take the id and I store it in a variable
    for manifest in json_object["items"]:
        manifest_id = manifest['id']

        # here i raise the counter for the manifest internal id
        manifest_counter += 1
        manifest_IntId = manifest['type'] + f"_{manifest_counter}" 
        Man_internalId = URIRef(base_url + manifest_IntId)

        #add the "has Item" to connect Collection to Manifest
        my_graph.add((Coll_internalId, items, Man_internalId))

        # create a list from the dictionary of label and catch the value of the key "none" (language) in a variable
        M_label_list = list(manifest['label'].values())  
        M_value_label = M_label_list[0][0]

        # make sure the label value is a plain string
        M_value_label = str(M_value_label)
        

        # create the graph with the triples
        my_graph.add((Man_internalId, has_id, Literal(manifest_id)))
        my_graph.add((Man_internalId, RDF.type, Manifest))
        my_graph.add((Man_internalId, label, Literal(str(M_value_label))))

        # third step is entering the manifest items list (enter the canvases) -> entering a list of dictionaries
        # here i take the id and I store it in a variable
        for canvas in manifest["items"]:
            canvas_id = canvas['id']

            # here I raise the counter for the canvas internal id
            canvas_counter += 1
            canvas_IntId = canvas['type'] + f"_{canvas_counter}" 
            Can_internalId = URIRef(base_url + canvas_IntId)

            # add the "items" relation to connect Manifest to Canvas
            my_graph.add((Man_internalId, items, Can_internalId))

            # create a list from the dictionary of label and catch the value of the key "none" (language) in a variable
            C_label_list = list(canvas['label'].values())  
            C_value_label = C_label_list[0][0]

            # make sure the label value is a plain string
            C_value_label = str(C_value_label)


            # create the graph with the triples
            my_graph.add((Can_internalId, has_id, Literal(canvas_id)))
            my_graph.add((Can_internalId, RDF.type, Canvas))
            my_graph.add((Can_internalId, label, Literal(str(C_value_label))))


    # update the counter text files
    with open('collection_counter.txt', 'w') as a:
        a.write(str(collection_counter))

    with open('manifest_counter.txt', 'w') as b:
        b.write(str(manifest_counter))

    with open('canvas_counter.txt', 'w') as c:
        c.write(str(canvas_counter))

I created this separate function to build the structure of the graph.
The first thing this function does is open three text files that I called "counters": one for collections, one for manifests and one for canvases.
Then I declare the namespaces again and create the specific classes, attributes and relations that I needed.

I increase the collection counter by 1 every time create_Graph is called, that is, when entering the top-level dictionary.

Here I also use URIRef, Literal and RDF from rdflib.

With two for loops I enter first the Manifest and then the Canvas lists of dictionaries.

At the end of the function I update the text files of the counters.
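The counter mechanism can be sketched in isolation as follows. The helper names (`read_counter`, `write_counter`, `next_internal_id`) are hypothetical, and treating a missing or empty file as 0 is an assumption added for this example. The sketch writes to a temporary directory so that it does not touch the real counter files.

```python
import os
import tempfile


def read_counter(path: str) -> int:
    """Read an integer counter from a text file; a missing or empty file counts as 0."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            content = f.read().strip()
    except FileNotFoundError:
        return 0
    return int(content) if content else 0


def write_counter(path: str, value: int) -> None:
    """Overwrite the counter file with the new value."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(value))


def next_internal_id(path: str, entity_type: str) -> str:
    """Increment the stored counter and return an id such as 'Canvas_3'."""
    value = read_counter(path) + 1
    write_counter(path, value)
    return f"{entity_type}_{value}"


# demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    counter_path = os.path.join(tmp_dir, "canvas_counter.txt")
    first_id = next_internal_id(counter_path, "Canvas")
    second_id = next_internal_id(counter_path, "Canvas")
    stored_value = read_counter(counter_path)
```

Because the value lives in a file, the numbering continues across separate runs, which is what allows several uploads to produce distinct internal ids.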

**Cleaning functions**

I also developed two extra functions to reset the counters and to clear the Blazegraph database.

They were useful during the testing phase of our code and are a very easy way to start the creation of the graph from zero.

In [None]:
import requests

def clear_counter():
    with open('collection_counter.txt', 'w') as a:
        a.write('0')

    with open('manifest_counter.txt', 'w') as b:
        b.write('0')

    with open('canvas_counter.txt', 'w') as c:
        c.write('0')

def clear_blazegraph_database(db_path):
    url = f"{db_path}/sparql"
    query = "DELETE WHERE { ?s ?p ?o }"

    response = requests.post(url, data=query, headers={"Content-Type": "application/sparql-update"})

    if response.status_code == 200:
        print("Database cleared successfully.")
    else:
        print("Failed to clear the database.")

# RUN
blazegraph_path = "http://127.0.0.1:9999/blazegraph"
clear_blazegraph_database(blazegraph_path)
clear_counter()

The first function writes "0" into all three counter text files.

The second one takes as input the path of the database that we want to clear and sends a SPARQL Update query via an HTTP POST request.
The response from the server is stored in the response variable.
If the status code of the response is 200 (indicating a successful request), it prints "Database cleared successfully." Otherwise, it prints "Failed to clear the database."

**TriplestoreQueryProcessor**

In [None]:
class TriplestoreQueryProcessor(QueryProcessor):

    def __init__(self):
        super().__init__()

    def getAllCanvases(self):

        endpoint = self.getDbPathOrUrl()
        query_canvases = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/> 

        SELECT ?canvas ?id ?label
        WHERE {
            ?canvas a nikCl:Canvas;
            dc:identifier ?id;
            nikAttr:label ?label.
        }
        """

        df_sparql_getAllCanvases = get(endpoint, query_canvases, True)
        return df_sparql_getAllCanvases

    def getAllCollections(self):

        endpoint = self.getDbPathOrUrl()
        query_collections = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/> 

        SELECT ?collection ?id ?label
        WHERE {
           ?collection a nikCl:Collection;
           dc:identifier ?id;
           nikAttr:label ?label .
        }
        """

        df_sparql_getAllCollections = get(endpoint, query_collections, True)
        return df_sparql_getAllCollections

    def getAllManifests(self):

        endpoint = self.getDbPathOrUrl()
        query_manifest = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/>

        SELECT ?manifest ?id ?label
        WHERE {
           ?manifest a nikCl:Manifest ;
           dc:identifier ?id ;
           nikAttr:label ?label .
        }
        """

        df_sparql_getAllManifest = get(endpoint, query_manifest, True)
        return df_sparql_getAllManifest

    def getCanvasesInCollection(self, collectionId: str):

        endpoint = self.getDbPathOrUrl()
        query_canInCol = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/>

        SELECT ?canvas ?id ?label 
        WHERE {
            ?collection a nikCl:Collection ;
            dc:identifier "%s" ;
            nikRel:items ?manifest .
            ?manifest a nikCl:Manifest ;
            nikRel:items ?canvas .
            ?canvas a nikCl:Canvas ;
            dc:identifier ?id ;
            nikAttr:label ?label .
        }
        """ % collectionId

        df_sparql_getCanvasesInCollection = get(endpoint, query_canInCol, True)
        return df_sparql_getCanvasesInCollection

    def getCanvasesInManifest(self, manifestId: str):

        endpoint = self.getDbPathOrUrl()
        query_canInMan = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/> 

        SELECT ?canvas ?id ?label
        WHERE {
            ?manifest a nikCl:Manifest ;
            dc:identifier "%s" ;
            nikRel:items ?canvas .
            ?canvas a nikCl:Canvas ;
            dc:identifier ?id ;
            nikAttr:label ?label .
        }
        """ % manifestId

        df_sparql_getCanvasesInManifest = get(endpoint, query_canInMan, True)
        return df_sparql_getCanvasesInManifest


    def getManifestsInCollection(self, collectionId: str):

        endpoint = self.getDbPathOrUrl()
        query_manInCol = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/>  

        SELECT ?manifest ?id ?label
        WHERE {
            ?collection a nikCl:Collection ;
            dc:identifier "%s" ;
            nikRel:items ?manifest .
            ?manifest a nikCl:Manifest ;
            dc:identifier ?id ;
            nikAttr:label ?label .
        }
        """ % collectionId

        df_sparql_getManifestInCollection = get(endpoint, query_manInCol, True)
        return df_sparql_getManifestInCollection
    

    def getEntitiesWithLabel(self, label: str): 
            

        endpoint = self.getDbPathOrUrl()
        query_entityLabel = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/>

        SELECT ?entity ?type ?label ?id
        WHERE {
            ?entity nikAttr:label "%s" ;
            a ?type ;
            nikAttr:label ?label ;
            dc:identifier ?id .
        }
        """ % remove_special_chars(label)

        df_sparql_getEntitiesWithLabel = get(endpoint, query_entityLabel, True)
        return df_sparql_getEntitiesWithLabel
    

    def getEntitiesWithCanvas(self, canvasId: str): 
            
        endpoint = self.getDbPathOrUrl()
        query_entityCanvas = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/> 

        SELECT ?id ?label ?type
        WHERE {
            ?entity dc:identifier "%s" ;
            dc:identifier ?id ;
            nikAttr:label ?label ;
            a ?type .
        }
        """ % canvasId

        df_sparql_getEntitiesWithCanvas = get(endpoint, query_entityCanvas, True)
        return df_sparql_getEntitiesWithCanvas
    
    def getEntitiesWithId(self, id: str): 
            
        endpoint = self.getDbPathOrUrl()
        query_entityId = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/> 

        SELECT ?id ?label ?type
        WHERE {
            ?entity dc:identifier "%s" ;
            dc:identifier ?id ;
            nikAttr:label ?label ;
            a ?type .
        }
        """ % id

        df_sparql_getEntitiesWithId = get(endpoint, query_entityId, True)
        return df_sparql_getEntitiesWithId
    

    def getAllEntities(self): 
            
        endpoint = self.getDbPathOrUrl()
        query_AllEntities = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        PREFIX nikAttr: <https://github.com/n1kg0r/ds-project-dhdk/attributes/> 
        PREFIX nikCl: <https://github.com/n1kg0r/ds-project-dhdk/classes/> 
        PREFIX nikRel: <https://github.com/n1kg0r/ds-project-dhdk/relations/> 

        SELECT ?entity ?id ?label ?type
        WHERE {
            ?entity dc:identifier ?id ;
                    nikAttr:label ?label ;
                    a ?type .
        }
        """ 

        df_sparql_getAllEntities = get(endpoint, query_AllEntities, True)
        return df_sparql_getAllEntities

Here I use the library sparql_dataframe, which lets us query a SPARQL endpoint provided by an RDF triplestore (in our case Blazegraph) and returns the answer to the query as a Pandas DataFrame.
We use its function "get", which takes three input parameters: the URL of the SPARQL endpoint to contact, the query to execute, and a boolean that specifies whether to contact the SPARQL endpoint using the HTTP POST method.

For some of these queries I used "%s" and "%", which belong to Python's string formatting. They are used to dynamically insert the input value into the query.

"%s" is placed inside the query string as a placeholder for the input value. Then, right after the closing quotes of the query string, the "%" operator is followed by the actual input value that replaces "%s".
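A minimal sketch of this substitution, on a shortened query. The manifest id is an invented example value, and the PREFIX lines are left out for brevity.

```python
# "%s" marks the position where the input value will be inserted
query_template = """
SELECT ?canvas ?id ?label
WHERE {
    ?manifest dc:identifier "%s" ;
              nikRel:items ?canvas .
}
"""

manifest_id = "https://example.org/iiif/manifest/1"  # made-up identifier

# the "%" operator after the string performs the replacement
query = query_template % manifest_id

# a literal percent sign inside such a template must be written as "%%"
percent_example = "100%% of %s" % "canvases"

print(query)
```

Note that the value is inserted exactly as it is, so any double quote it contains would end the SPARQL string early; this is why input coming from labels needs escaping first.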

In "getEntitiesWithLabel" I used also an external function to clean the string and escape some problematic characters.

**Clean String Function**

In [None]:
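def remove_special_chars(s: str) -> str:
    # In Python source, '\"' and '"' are the same one-character string,
    # so a single replacement covers both spellings.
    # Each double quote becomes \" so it cannot close the SPARQL string early.
    return s.replace('"', '\\"')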

With this function I escape the double quotes that can appear inside the labels of the JSON files, in order to produce strings that can be safely inserted in the SPARQL queries. Note that in Python source '\"' and '"' denote the same one-character string, so a single replacement covers both spellings.
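A slightly broader variant is sketched below. It is a suggestion rather than the project's function: `escape_sparql_literal` is a hypothetical name, and it additionally escapes backslashes and line breaks, which SPARQL also treats specially inside a double-quoted string.

```python
def escape_sparql_literal(value: str) -> str:
    """Escape a string so it can be placed between double quotes in a SPARQL query."""
    return (
        value.replace("\\", "\\\\")  # backslashes first, so later escapes are not doubled
        .replace('"', '\\"')         # double quotes would otherwise end the literal
        .replace("\n", "\\n")        # line breaks are not allowed in a "..." literal
        .replace("\r", "\\r")
    )


# label taken from collection-2.json, which contains double quotes
raw_label = 'Raimondi, Giuseppe. Quaderno manoscritto, "Caserma Scalo : 1930-1968"'
escaped_label = escape_sparql_literal(raw_label)

print(escaped_label)
```

The order of the replacements matters: if the backslashes were escaped last, the backslashes introduced for the quotes would be doubled as well.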


**Last 4 methods of the GenericQueryProcessor**

In [None]:
def getEntitiesWithLabel(self, label):
        
        graph_db = DataFrame()
        relation_db = DataFrame()
        result = list()

        for processor in self.queryProcessors:
            if isinstance(processor, TriplestoreQueryProcessor):
                graph_to_add = processor.getEntitiesWithLabel(label)
                graph_db = concat([graph_db, graph_to_add], ignore_index= True)
            elif isinstance(processor, RelationalQueryProcessor):
                relation_to_add = processor.getEntities()
                relation_db = concat([relation_db, relation_to_add], ignore_index=True)
            else:
                break
        
        
        if not graph_db.empty: #check if the call got some result
            df_joined = merge(graph_db, relation_db, left_on="id", right_on="id") #create the merge with the two db
            grouped = df_joined.groupby("id").agg({
                                                        "label": "first",
                                                        "title": "first",
                                                        "creator": lambda x: "; ".join(x)
                                                    }).reset_index() #this is to avoid duplicates when we have more than one creator
            grouped_fill = grouped.fillna('')
            sorted_df = grouped_fill.sort_values("id")  # sort by id (name avoids shadowing the built-in sorted)

            if not sorted_df.empty:
                for _, row in sorted_df.iterrows():
                    id = row["id"]
                    title = row["title"]
                    creators = row['creator'].split('; ')
                    entities = EntityWithMetadata(id, label, title, creators)
                    result.append(entities)            
            
                return result

            else:
                for _, row in graph_db.iterrows():
                    id = row["id"]
                    title = ""
                    creators = ""
                    entities = EntityWithMetadata(id, label, title, creators)
                    result.append(entities)
                return result
        return result
                

The first is "getEntitiesWithLabel": it returns a list of objects of class EntityWithMetadata, included in the databases accessible via the query processors, for the entities whose label matches the input label.

I create two empty DataFrames and iterate over every processor in the list "queryProcessors", distinguishing them by type (either TriplestoreQueryProcessor or RelationalQueryProcessor), that is, linked either to a graph database or to a relational database.

If a processor is an instance of TriplestoreQP, I call the method "getEntitiesWithLabel" already developed in the TriplestoreQP and concatenate its result with one of the empty DataFrames, so that data coming from different processors is collected in the same DataFrame.
The method "getEntitiesWithLabel" returns the internalId, the type, the label and the id of each entity whose label is the input label.
I do the same for RelationalQP, but here, instead of "getEntitiesWithLabel", I use "getEntities" from the RelationalQP, which returns internalId, id, creators and title of all the entities.

If graph_db contains some results (it is the DataFrame that holds "label", so it is the most important one to check), we can proceed with merging the two DataFrames on the column "id".

After that, I proceed with an operation of grouping.
- "grouped" is the new DataFrame that is created to contain the results of the grouping and aggregation operation.
- groupby("id") -> performs the grouping of rows in the DataFrame df_joined based on the column "id". This means that groups of rows with the same "id" value will be created.
- agg({...}) -> This method is called on the grouped DataFrame to perform data aggregation within each group. The arguments passed to "agg" specify the columns to be aggregated and the aggregation operations to be applied.
- "label": "first": This line specifies that the first value of the group will be selected as the aggregate value for the "label" column of the grouped DataFrame. In other words, the first value of the "label" column within each group is taken.
- the same applies to "title".
- "creator": lambda x: '; '.join(x) -> This line specifies that a lambda function is defined for the column "creator" and applied to each group. The lambda function takes the entire set of values of the "creator" column as input and uses '; '.join(x) to join the values into a single string separated by '; '.
- .reset_index() -> At the end of the aggregation operation, the reset_index() method is called to reset the index of the grouped DataFrame, so that the "id" column is returned as an ordinary column instead of an index.

Finally I use "fillna('')" to fill in any empty cells and sort_values("id") to sort the result by the ids.
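The merge, grouping and clean-up steps described above can be tried on toy data. The two small DataFrames below are invented for illustration; only the column names ("id", "label", "title", "creator") follow the ones used in the method.

```python
import pandas as pd

# made-up result of the triplestore query
graph_df = pd.DataFrame({
    "id": ["canvas-1", "manifest-1"],
    "label": ["Page 1", "Sample manifest"],
})

# made-up result of the relational query: one row per creator
relation_df = pd.DataFrame({
    "id": ["manifest-1", "manifest-1", "canvas-1"],
    "title": ["Sample title", "Sample title", None],
    "creator": ["Author A", "Author B", "Author C"],
})

# join the two sources on the shared "id" column
df_joined = pd.merge(graph_df, relation_df, on="id")

# one row per id: keep the first label and title, join all creators
grouped = (
    df_joined.groupby("id")
    .agg({"label": "first", "title": "first", "creator": lambda x: "; ".join(x)})
    .reset_index()
)

# replace missing values with empty strings and sort by id
result_df = grouped.fillna("").sort_values("id")
rows = result_df.to_dict("records")
print(result_df)
```

Keep in mind that merge performs an inner join by default, so ids that appear in only one of the two DataFrames are dropped from the joined result.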

Then I check whether the sorted DataFrame contains anything. If it does, I iterate over its rows with iterrows() and read the values I need into variables. For "creator" I split the values that I previously joined into a single string separated by "; ".
The split returns a list of strings. With these variables I create the EntityWithMetadata objects and append them to the result list.

If the sorted DataFrame is empty (so the data comes only from the graph database), we run the same iterrows() loop, this time over graph_db only, and pass empty strings for title and creators.


In [None]:
def getEntitiesWithTitle(self, title):

        graph_db = DataFrame()
        relation_db = DataFrame()
        result = list()

        for processor in self.queryProcessors:
            if isinstance(processor, TriplestoreQueryProcessor):
                graph_to_add = processor.getAllEntities()
                graph_db = concat([graph_db ,graph_to_add], ignore_index= True)
            elif isinstance(processor, RelationalQueryProcessor):
                relation_to_add = processor.getEntitiesWithTitle(title)
                relation_db = concat([relation_db, relation_to_add], ignore_index=True)
            else:
                break        
        

        if not graph_db.empty:
            df_joined = merge(graph_db, relation_db, left_on="id", right_on="id")
            grouped = df_joined.groupby("id").agg({
                                                        "label": "first",
                                                        "title": "first",
                                                        "creator": lambda x: "; ".join(x)
                                                    }).reset_index() #this is to avoid duplicates when we have more than one creator
            grouped_fill = grouped.fillna('')
            # sorted = grouped_fill.sort_values("id")

            for _, row in grouped_fill.iterrows():
                id = row["id"]
                label = row["label"]
                title = row['title']
                creators = row['creator'].split('; ')
                entities = EntityWithMetadata(id, label, title, creators)
                result.append(entities)

        return result

The second is "getEntitiesWithTitle" -> it returns a list of objects having class EntityWithMetadata, included in the databases accessible via the query processors, related to the entities having, as title, the input title.

The very first part is the same as in the previous method.
Here, however, we use the method "getAllEntities" from the TriplestoreQP, which returns internalId, id, label and type of all the entities,
and the method "getEntitiesWithTitle(title)" from the RelationalQP, which returns internalId, id, creator and title of the entities that match the input title.

Then I check whether graph_db contains any rows, and I merge the two DataFrames on the "id" column.
Next I use the same "groupby" function as before, so that an entity with several creators ends up in a single row.
Finally, I run an iterrows for loop to collect the values for the EntityWithMetadata arguments, and I append each object to the result list.
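
The merge and groupby steps can be sketched on toy data as follows. The ids, labels, titles and creator names are invented for illustration; only the pandas calls mirror the method above.

```python
from pandas import DataFrame, merge

# Toy data standing in for the two query-processor results (invented values)
graph_db = DataFrame({
    "id": ["entity-1", "entity-2"],
    "label": ["Label one", "Label two"],
})
relation_db = DataFrame({
    "id": ["entity-1", "entity-1"],
    "title": ["A title", "A title"],
    "creator": ["Creator A", "Creator B"],
})

# Inner join on "id": only entities present in both DataFrames survive
df_joined = merge(graph_db, relation_db, left_on="id", right_on="id")

# One row per entity: several creators collapse into one "; "-separated string
grouped = df_joined.groupby("id").agg({
    "label": "first",
    "title": "first",
    "creator": lambda x: "; ".join(x),
}).reset_index()

# Splitting the joined string gives back the list of creators
creators = grouped.loc[0, "creator"].split("; ")
```

In this toy case "entity-2" has no row in relation_db, so the inner join drops it and only "entity-1" remains, with its two creators in one row.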


In [None]:
def getImagesAnnotatingCanvas(self, canvasId):

        graph_db = DataFrame()
        relation_db = DataFrame()
        result = list()

        for processor in self.queryProcessors:

            if isinstance(processor, TriplestoreQueryProcessor):
                graph_to_add = processor.getEntitiesWithCanvas(canvasId)
                graph_db = concat([graph_db,graph_to_add], ignore_index= True)
            elif isinstance(processor, RelationalQueryProcessor):
                relation_to_add = processor.getAllAnnotations()
                relation_db = concat([relation_db, relation_to_add], ignore_index=True)
            else:
                continue  # skip unsupported processors instead of stopping the loop

        if not graph_db.empty:
            df_joined = merge(graph_db, relation_db, left_on="id", right_on="target")

            for _, row in df_joined.iterrows():
                id = row["body"]
                images = Image(id)
                # append inside the loop so that every image is kept, not only the last one
                result.append(images)

        return result

The third is "getImagesAnnotatingCanvas" -> it returns a list of Image objects, taken from the databases accessible via the query processors, that are the body of the annotations targeting the canvas specified by the input identifier.

Here I use the method "getEntitiesWithCanvas" from the TriplestoreQP, which returns id, label and type of the canvas corresponding to the input canvasId.
And I use "getAllAnnotations()" from the RelationalQP, which returns everything from the Annotation table, that is id, body, target and motivation.

Then I merge the two DataFrames, matching the column "id" of graph_db with the column "target" of relation_db.

I run an iterrows for loop and create the variable "id" from the column "body" to fill the argument of the class Image. Inside the loop, I append each Image object to the result list.
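
The join between the canvas and its annotations can be sketched on toy data like this. All ids and values are invented, and the sketch only collects the "body" values instead of building Image objects. Since both DataFrames have an "id" column that is not the shared join key, pandas renames them "id_x" and "id_y" in the result.

```python
from pandas import DataFrame, merge

# Invented canvas row (graph side) and annotation rows (relational side)
graph_db = DataFrame({
    "id": ["canvas-1"],
    "label": ["Page 1"],
    "type": ["Canvas"],
})
relation_db = DataFrame({
    "id": ["annotation-1", "annotation-2", "annotation-3"],
    "body": ["image-1", "image-2", "image-3"],
    "target": ["canvas-1", "canvas-1", "canvas-2"],
    "motivation": ["painting", "painting", "painting"],
})

# Keep only the annotations whose target is the requested canvas
df_joined = merge(graph_db, relation_db, left_on="id", right_on="target")

# One image identifier per matching annotation, collected inside the loop
image_ids = []
for _, row in df_joined.iterrows():
    image_ids.append(row["body"])
```

Here two annotations target "canvas-1", so the list holds two image identifiers, and the annotation on "canvas-2" is left out.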

In [None]:
def getManifestsInCollection(self, collectionId):

        graph_db = DataFrame()
        relation_db = DataFrame()
        result = list()
        
        for processor in self.queryProcessors:
            if isinstance(processor, TriplestoreQueryProcessor):
                graph_to_add = processor.getManifestsInCollection(collectionId)
                graph_db = concat([graph_db,graph_to_add], ignore_index= True)
            elif isinstance(processor, RelationalQueryProcessor):
                relation_to_add = processor.getEntities()
                relation_db = concat([relation_db, relation_to_add], ignore_index=True)
            else:
                continue  # skip unsupported processors instead of stopping the loop
        

        if not graph_db.empty:
            df_joined = merge(graph_db, relation_db, left_on="id", right_on="id") 

            for _, row in df_joined.iterrows():

                graph_db_canvas = DataFrame()

                for processor in self.queryProcessors:
                    if isinstance(processor, TriplestoreQueryProcessor):
                        graph_to_add_canvas = processor.getCanvasesInManifest(row['id'])
                        graph_db_canvas = concat([graph_db_canvas,graph_to_add_canvas], ignore_index= True).drop_duplicates()

                        df_joined_canvas = merge(graph_db_canvas, 
                                                relation_db,
                                                how='left',
                                                left_on='id',
                                                right_on='id'
                                                ).fillna('')
        
        
                        df_joined_canvas['creator'] =  df_joined_canvas.groupby(['canvas','id','label', 'entityId', 'title'])['creator'].transform(lambda x: '; '.join(x))
                        df_joined_canvas = df_joined_canvas.drop_duplicates()

                        canvases_list = [
                                Canvas(row1['id'], 
                                row1['label'], 
                                row1['title'],
                                row1['creator'].split('; ')) 
                                for _, row1 in df_joined_canvas.iterrows()
                            ]
                            
                        result.append(
                                Manifest(row["id"],
                                    row["label"], 
                                    canvases_list,
                                    row['title'], 
                                    row['creator'].split('; ')
                                    ) 
                            )

        return result
            

The last is the method "getManifestsInCollection" -> it returns a list of Manifest objects, taken from the databases accessible via the query processors, that are contained in the collection identified by the input identifier.

Here I use the method "getManifestsInCollection(collectionId)" from the TriplestoreQP, which returns internalId, id and label. And I use "getEntities()" from the RelationalQP, which returns internalId, id, creator and title of all the entities.

If the graph_db DataFrame is not empty, the code merges (inner join) the graph_db and relation_db DataFrames on their "id" columns.

The code iterates over each row of the df_joined DataFrame.
A new DataFrame, "graph_db_canvas", is initialized for each iteration to store the results specific to canvases.
Another loop iterates over each query processor.
If the processor is an instance of TriplestoreQP, it calls the "getCanvasesInManifest" method of that processor with the "id" value from the current row. The result is stored in "graph_to_add_canvas". The "graph_db_canvas" DataFrame is concatenated with "graph_to_add_canvas" to accumulate the results, and duplicates are dropped.

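Then, the "graph_db_canvas" DataFrame is merged with the relation_db DataFrame on the common column "id". This is a left join, so canvases without metadata are kept, and their missing values are filled with empty strings. The creators of each canvas are then joined into a single string with "groupby" and "transform", and the duplicate rows are dropped.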
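
A minimal sketch of this canvas step on toy data, with invented ids, labels and creators. The grouping columns are reduced to the ones present in the toy data, so they are an assumption and not the exact list used in the method.

```python
from pandas import DataFrame, merge

# Invented canvases (graph side); only the first one has metadata
graph_db_canvas = DataFrame({
    "id": ["canvas-1", "canvas-2"],
    "label": ["Page 1", "Page 2"],
})
relation_db = DataFrame({
    "id": ["canvas-1", "canvas-1"],
    "title": ["A title", "A title"],
    "creator": ["Creator A", "Creator B"],
})

# Left join: every canvas is kept, missing metadata becomes an empty string
df_joined_canvas = merge(
    graph_db_canvas, relation_db, how="left", left_on="id", right_on="id"
).fillna("")

# Write the joined creators on every row of the group, then drop the duplicates
df_joined_canvas["creator"] = df_joined_canvas.groupby(
    ["id", "label", "title"]
)["creator"].transform(lambda x: "; ".join(x))
df_joined_canvas = df_joined_canvas.drop_duplicates().reset_index(drop=True)

creators_first = df_joined_canvas.loc[0, "creator"].split("; ")
creators_second = df_joined_canvas.loc[1, "creator"].split("; ")
```

One thing to keep in mind: for a canvas without creators, splitting the empty string returns a list containing one empty string, not an empty list.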

Finally, the canvases_list is passed as the items argument of the class Manifest, and each Manifest object is appended to the result list.