# How to Explore an Unknown Dataset - Quickstart

When you explore a linked dataset through a SPARQL endpoint for the first time, the barrier to entry can be high. Unless you already know the structure of the underlying ontology, or detailed documentation is available, it is hard to know where to start. Luckily, a few simple strategies help with a first exploration of a dataset, and presenting them is precisely the purpose of this workshop.

## The Structure of a Dataset
In general, the triples of a Linked Data source can be divided into two groups: A-Box and T-Box.
- The T-Box (Terminological Box) contains information related to the definition of classes, properties and more generally to the structure of the dataset. 
- The A-Box (Assertional Box), on the other hand, contains information about instances, relationships between instances and more generally information about the main content of the dataset.
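
To make the distinction concrete, here is a minimal sketch that splits a handful of triples into the two groups. It relies on a rough heuristic (a small, non-exhaustive set of RDFS/OWL terms), and the `http://example.org/` namespace and the sample triples are made up for illustration; they do not come from any real dataset.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Predicates that usually signal a schema-level (T-Box) statement.
TBOX_PREDICATES = {
    RDFS + "subClassOf",
    RDFS + "subPropertyOf",
    RDFS + "domain",
    RDFS + "range",
}
# Objects of rdf:type that mark the subject as a class or property definition.
TBOX_TYPES = {
    RDFS + "Class",
    OWL + "Class",
    OWL + "ObjectProperty",
    OWL + "DatatypeProperty",
}


def is_tbox(triple):
    """Heuristic: True if the triple looks like a schema definition."""
    _, p, o = triple
    return p in TBOX_PREDICATES or (p == RDF_TYPE and o in TBOX_TYPES)


def split_boxes(triples):
    """Return (tbox, abox) lists, preserving the input order."""
    tbox = [t for t in triples if is_tbox(t)]
    abox = [t for t in triples if not is_tbox(t)]
    return tbox, abox


EX = "http://example.org/"  # hypothetical namespace, for illustration only
triples = [
    (EX + "Painting", RDF_TYPE, OWL + "Class"),              # T-Box
    (EX + "Painting", RDFS + "subClassOf", EX + "Artwork"),  # T-Box
    (EX + "monaLisa", RDF_TYPE, EX + "Painting"),            # A-Box
    (EX + "monaLisa", RDFS + "label", "Mona Lisa"),          # A-Box
]
tbox, abox = split_boxes(triples)
print(len(tbox), len(abox))  # 2 2
```

In a real dataset the boundary is less sharp than this heuristic suggests, which is why the queries in this workshop inspect the predicates and types that are actually in use.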

### Note on the method
We will begin by exploring the T-Box. This way we can understand the size of the dataset and its descriptive structure. To do this, we can look at the number of triples present and the types of classes and properties used. We'll then examine in more detail the nature of the information contained in the dataset in relation to these classes and properties (A-Box).

The reference resource is the Zeri Foundation's SPARQL endpoint.
First, I import a Python library called [sparql-dataframe](https://pypi.org/project/sparql-dataframe/), which allows me to display the results cleanly. Then I store the URL of the SPARQL endpoint in a variable. I do the same with the query I want to run (I recommend trying it first directly through the web interface of the SPARQL endpoint, if one is available). Finally, I create a data frame using the library's *.get()* method and print it. The same process is repeated for all the SPARQL queries we'll see.

### The T-Box
#### The Number of Triples

In [2]:
import sparql_dataframe

endpoint = 'http://data.fondazionezeri.unibo.it/sparql'

query_triple_count = '''
    SELECT (COUNT (*) AS ?tripleCount) 
    WHERE {
        ?s ?p ?o .
    }
'''

df = sparql_dataframe.get(endpoint, query_triple_count)
print(f'The total number of triples is:\n {df}')

The total number of triples is:
    tripleCount
0     11827416


#### The List of Predicates
Listing the predicates is the quickest way to get an idea of the kind of data available, since many of them are self-explanatory.
An RDF dataset may or may not have an explicit class hierarchy; the use of a property such as rdfs:subClassOf indicates that it does. In that case, a follow-up query can ask which classes are subclasses of which, to get an overview of the structure of the dataset. Otherwise, you can simply look for the classes that are in use.

In [4]:
query_predicates = '''
    SELECT DISTINCT ?p
    WHERE { 
    ?s ?p ?o .
    }
'''

df = sparql_dataframe.get(endpoint, query_predicates)
print(f'The list of predicates:\n {df}')

The list of predicates:
                                                      p
0      http://www.w3.org/1999/02/22-rdf-syntax-ns#type
1           http://www.w3.org/2000/01/rdf-schema#label
2         http://www.w3.org/2000/01/rdf-schema#comment
3                 http://purl.org/dc/terms/description
4    http://www.essepuntato.it/2014/03/fentry/descr...
..                                                 ...
119                 http://purl.org/dc/terms/publisher
120                   http://purl.org/dc/terms/license
121                   http://purl.org/dc/terms/subject
122                    http://rdfs.org/ns/void#feature
123             http://rdfs.org/ns/void#sparqlEndpoint

[124 rows x 1 columns]
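
If rdfs:subClassOf does show up among the predicates, the follow-up query on the class hierarchy could be sketched as below. This is an assumption-laden example: it has not been run against the Zeri endpoint, and it may well return an empty frame if the dataset declares no subclass relations. `SUBCLASS_QUERY` and `fetch_subclass_pairs` are names introduced here for illustration.

```python
# Query listing every (subclass, superclass) pair declared in the dataset.
SUBCLASS_QUERY = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?subclass ?superclass
    WHERE {
        ?subclass rdfs:subClassOf ?superclass .
    }
    ORDER BY ?superclass ?subclass
'''


def fetch_subclass_pairs(endpoint_url):
    """Run SUBCLASS_QUERY against the given endpoint and return a data frame."""
    # Imported here so the query string can be inspected without the library.
    import sparql_dataframe
    return sparql_dataframe.get(endpoint_url, SUBCLASS_QUERY)
```

Calling `fetch_subclass_pairs(endpoint)` with the endpoint variable defined earlier would follow the same pattern as the other cells.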


#### The Number of Predicates
It is also interesting to find out which properties are used most often. This can be done by counting the triples per predicate with COUNT, grouping them with GROUP BY, and sorting the results in descending order with ORDER BY DESC.

In [5]:
query_predicate_repetition = '''
    SELECT ?p (COUNT(?p) AS ?predicate)
    WHERE { 
    ?s ?p ?o .
    }
    GROUP BY ?p
    ORDER BY DESC(?predicate)
'''

df = sparql_dataframe.get(endpoint, query_predicate_repetition)
print(f'The number of times each predicate is used:\n {df}')

The number of times each predicate is used:
                                                      p  predicate
0           http://www.w3.org/2000/01/rdf-schema#label    3305175
1      http://www.w3.org/1999/02/22-rdf-syntax-ns#type    1897432
2         http://www.w3.org/2000/01/rdf-schema#comment     743296
3       http://www.cidoc-crm.org/cidoc-crm/P2_has_type     327750
4       http://purl.org/spar/pro/isRelatedToRoleInTime     293977
..                                                 ...        ...
119                    http://rdfs.org/ns/void#feature          1
120             http://rdfs.org/ns/void#sparqlEndpoint          1
121    http://www.cidoc-crm.org/cidoc-crm/P128_carries          1
122  http://www.cidoc-crm.org/cidoc-crm/P128i_is_ca...          1
123                 http://xmlns.com/foaf/0.1/homepage          1

[124 rows x 2 columns]


It will certainly be worth looking more closely at the four or five most frequent properties later on. Even a quick look at the full list, however, shows at first glance that the information is mainly about the description of cultural objects and their interaction with other entities (artists, institutions).
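
Because the printed frame is truncated, one quick way to get that overview is to group the predicate URIs by namespace. The sketch below uses only the five most frequent predicates shown above as sample input; `namespace_of` is a hypothetical helper that cuts a URI at its last `#` or `/`, which is a common convention but not a guarantee.

```python
from collections import Counter


def namespace_of(uri):
    """Return the URI up to and including its last '#' or '/'."""
    cut = max(uri.rfind('#'), uri.rfind('/'))
    return uri[:cut + 1] if cut >= 0 else uri


# Sample input: the five most frequent predicates from the output above.
predicates = [
    'http://www.w3.org/2000/01/rdf-schema#label',
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
    'http://www.w3.org/2000/01/rdf-schema#comment',
    'http://www.cidoc-crm.org/cidoc-crm/P2_has_type',
    'http://purl.org/spar/pro/isRelatedToRoleInTime',
]

# Count how many predicates come from each vocabulary.
namespace_counts = Counter(namespace_of(p) for p in predicates)
for namespace, count in namespace_counts.most_common():
    print(count, namespace)
```

With the real results, the column `df['p']` could be passed in place of the sample list to see which vocabularies dominate the dataset.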

#### The Classes
With classes, the situation is a little more complicated. A class may be declared with the type (rdf:type, abbreviated a) rdfs:Class or, more often, owl:Class. However, searches of this kind commonly return no results, because a dataset may have its own way of defining classes. Let's check with Zeri:

In [6]:
query_classes = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?c
    WHERE {
        ?c a rdfs:Class .
    }
    ORDER BY ?c
'''
df = sparql_dataframe.get(endpoint, query_classes)
print(f'The list of classes:\n {df}')

The list of classes:
 Empty DataFrame
Columns: [c]
Index: []


In [7]:
query_classes = '''
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT DISTINCT ?c
    WHERE {
        ?c a owl:Class .
    }
    ORDER BY ?c
'''
df = sparql_dataframe.get(endpoint, query_classes)
print(f'The list of classes:\n {df}')

The list of classes:
 Empty DataFrame
Columns: [c]
Index: []


#### The List of Classes
What we can do instead is look at the concepts that are actually used to describe subjects, by searching for the values of their type (rdf:type, abbreviated a).

In [8]:
query_concept = '''
    SELECT DISTINCT ?concept 
    WHERE {
    ?s a ?concept .
    }
'''
df = sparql_dataframe.get(endpoint, query_concept)
print(f'The list of Classes types:\n {df}')

The list of Classes types:
                                                concept
0      http://www.essepuntato.it/2014/03/fentry/FEntry
1         http://www.cidoc-crm.org/cidoc-crm/E35_Title
2    http://www.cidoc-crm.org/cidoc-crm/E15_Identif...
3    http://www.cidoc-crm.org/cidoc-crm/E42_Identifier
4          http://www.cidoc-crm.org/cidoc-crm/E55_Type
..                                                 ...
100       http://purl.org/spar/fabio/InstructionalWork
101    http://purl.org/spar/fabio/ExpressionCollection
102                  http://purl.org/spar/fabio/Thesis
103  http://www.essepuntato.it/2012/04/tvc/ValueInTime
104                    http://rdfs.org/ns/void#Dataset

[105 rows x 1 columns]


Moreover, we can check how many distinct predicates are associated with each class.

In [9]:
query_property_per_type = '''
    SELECT DISTINCT ?type (COUNT(DISTINCT ?p) AS ?count)
    WHERE {
    ?s a ?type . 
    ?s ?p ?o . 
    }
    GROUP BY ?type
    ORDER BY DESC(?count)
'''

df = sparql_dataframe.get(endpoint, query_property_per_type)
print(f'The number of properties per type in descending order:\n {df}')

The number of properties per type in descending order:
                                                   type  count
0    http://www.cidoc-crm.org/cidoc-crm/E28_Concept...     20
1      http://www.cidoc-crm.org/cidoc-crm/E31_Document     19
2              http://purl.org/spar/fabio/ArtisticWork     17
3                http://purl.org/spar/fabio/AnalogItem     16
4    http://www.cidoc-crm.org/cidoc-crm/E22_Man-Mad...     16
..                                                 ...    ...
100  http://www.cidoc-crm.org/cidoc-crm/E58_Measure...      2
101       http://www.cidoc-crm.org/cidoc-crm/E74_Group      2
102  http://www.cidoc-crm.org/cidoc-crm/E78_Collection      2
103  http://www.ontologydesignpatterns.org/cp/owl/t...      2
104                    http://www.w3.org/ns/prov#Agent      2

[105 rows x 2 columns]
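
A natural next step is to list which predicates are used with one particular class. The sketch below builds such a query for E31_Document, one of the classes in the output above; `predicates_of_class_query` is a hypothetical helper, the LIMIT is an arbitrary precaution, and the query has not been run against the Zeri endpoint.

```python
def predicates_of_class_query(class_uri, limit=50):
    """Build a query counting how often each predicate is used with a class."""
    return f'''
    SELECT ?p (COUNT(*) AS ?uses)
    WHERE {{
        ?s a <{class_uri}> .
        ?s ?p ?o .
    }}
    GROUP BY ?p
    ORDER BY DESC(?uses)
    LIMIT {int(limit)}
'''


query_document_predicates = predicates_of_class_query(
    'http://www.cidoc-crm.org/cidoc-crm/E31_Document')
print(query_document_predicates)
```

The resulting string can be passed to `sparql_dataframe.get(endpoint, query_document_predicates)` like any other query in this workshop.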


### The A-Box
#### Instances per Concept
With classes, and in particular the instances that belong to them, we get to the heart of the dataset's content. First we can look at how many instances each class has. The previous reasoning applies here as well: by checking the list, it is possible to see which concepts occur most often.

In [10]:
query_instance_per_concept = '''
    SELECT ?concept (COUNT (?s) AS ?instanceCount) 
    WHERE {
    ?s a ?concept . 
    }
    GROUP BY ?concept
    ORDER BY DESC(?instanceCount)
'''

df = sparql_dataframe.get(endpoint, query_instance_per_concept)
print(f'The number of instances per class are:\n {df}')

The number of instances per class are:
                                                concept  instanceCount
0                  http://purl.org/spar/pro/RoleInTime         272862
1    http://www.cidoc-crm.org/cidoc-crm/E42_Identifier         163774
2          http://purl.org/spar/fabio/MetadataDocument         146088
3        http://purl.org/emmedi/hico/InterpretationAct         144561
4    http://www.cidoc-crm.org/cidoc-crm/E15_Identif...         115169
..                                                 ...            ...
100               http://purl.org/spar/fabio/Newspaper              1
101                  http://purl.org/spar/fabio/Thesis              1
102                    http://rdfs.org/ns/void#Dataset              1
103     http://www.cidoc-crm.org/cidoc-crm/E45_Address              1
104                    http://www.w3.org/ns/prov#Agent              1

[105 rows x 2 columns]


If we go back and look at the properties that recur most often, we see rdfs:label in first place. This is a very useful property that lets us retrieve the names of instances in natural language.
In this regard, one thing to pay attention to is the repetition of the same name, very often due to typos ("Federico Zeri" and " Federico Zeri" are the same thing, but the extra space makes them count as different labels). One way to get around this problem is the SAMPLE aggregate, which returns a single, arbitrarily chosen label for each group, here one label per instance.

In [12]:
query_instance_label = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
    SELECT ?instance 
        (SAMPLE(?label) AS ?instanceLabel) 
        (COUNT(?instance) AS ?instanceCount) 
    WHERE {
        ?instance a ?class .
        OPTIONAL { ?instance rdfs:label ?label . }
    }
    GROUP BY ?instance
    ORDER BY DESC(?instanceCount)
'''

df = sparql_dataframe.get(endpoint, query_instance_label)
print(f'The list of instances with labels and repetitions:\n {df}')

HTTPError: HTTP Error 504: Gateway Time-out
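
The time-out suggests that the query is too heavy for the endpoint as written. A lighter variant, restricted to a single class and capped with LIMIT, might look like the sketch below. It has not been run against the Zeri endpoint, so there is no guarantee that it completes. Both `labels_for_class_query` and `normalize_label` are hypothetical helpers: the second one shows how whitespace variants of a label could be merged on the Python side, since SAMPLE only picks one label and does not clean it.

```python
def labels_for_class_query(class_uri, limit=100):
    """Build a query returning one sampled label per instance of a class."""
    return f'''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?instance (SAMPLE(?label) AS ?instanceLabel)
    WHERE {{
        ?instance a <{class_uri}> .
        OPTIONAL {{ ?instance rdfs:label ?label . }}
    }}
    GROUP BY ?instance
    LIMIT {int(limit)}
'''


def normalize_label(label):
    """Collapse runs of whitespace and trim both ends of a label."""
    return ' '.join(label.split())


# E35_Title is one of the classes listed earlier; any other class would do.
query_title_labels = labels_for_class_query(
    'http://www.cidoc-crm.org/cidoc-crm/E35_Title')

# Whitespace variants of the same name collapse into a single label.
raw_labels = ['Federico Zeri', ' Federico Zeri', 'Federico  Zeri ']
unique_labels = sorted({normalize_label(label) for label in raw_labels})
print(unique_labels)  # ['Federico Zeri']
```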

Try a similar search and explore the results related to these properties:
- P2_has_type
- P14_carried_out_by

or any other that you find interesting.
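
As a possible starting point for this exercise, the sketch below builds a query that counts the most frequent values of a given property. `value_counts_query` is a hypothetical helper, and the full URI for P14_carried_out_by is assumed to follow the same CIDOC-CRM namespace as P2_has_type, which appears in the predicate list above; check it against the list of predicates before running the query.

```python
CRM = 'http://www.cidoc-crm.org/cidoc-crm/'


def value_counts_query(property_uri, limit=20):
    """Build a query listing the most frequent objects of a property."""
    return f'''
    SELECT ?value (COUNT(?s) AS ?uses)
    WHERE {{
        ?s <{property_uri}> ?value .
    }}
    GROUP BY ?value
    ORDER BY DESC(?uses)
    LIMIT {int(limit)}
'''


query_has_type = value_counts_query(CRM + 'P2_has_type')
# Assumed URI: verify that this predicate exists in the dataset first.
query_carried_out_by = value_counts_query(CRM + 'P14_carried_out_by')
print(query_has_type)
```

Each string can then be passed to `sparql_dataframe.get(endpoint, ...)` in the same way as the earlier queries.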

# References

- DuCharme, Bob. «Exploring a SPARQL endpoint», 24 August 2014. https://www.bobdc.com/blog/exploring-a-sparql-endpoint/.
- DuCharme, Bob. «Queries to explore a dataset», 30 April 2022. https://www.bobdc.com/blog/exploringadataset/.