# General instructions

The goal of the project is to materialize a set of **exploratory workloads** over a real-world, large-scale, open-domain KG: [WikiData](https://www.wikidata.org/wiki/Wikidata:Main_Page).

An exploratory workload is composed of a set of queries, where each query builds on the information obtained previously.

An exploratory workload starts with a usually vague, open-ended question, and does not assume the person issuing the workload has a clear understanding of the data contained in the target database or its structure.

Remember that:

1. All the queries must run in the Python notebook
2. You can use classes and properties only if you find them via a SPARQL query that must be present in the notebook
3. You do not delete useless queries. Keep everything that is syntactically valid

```
?p <http://schema.org/name> ?name .
```
    
is the BGP returning a human-readable name of a property or a class in Wikidata.
    
    

In [1]:
## SETUP used later

from SPARQLWrapper import SPARQLWrapper, JSON


prefixString = """
##-bd8f39b93a-##
PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/> 
PREFIX sc: <http://schema.org/>
"""

# select and construct queries
def run_query(queryString):
    to_run = prefixString + "\n" + queryString

    sparql = SPARQLWrapper("http://a256-gc1-02.srv.aau.dk:5820/sparql")
    sparql.setTimeout(300)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(to_run)

    try :
       results = sparql.query()
       json_results = results.convert()
       if len(json_results['results']['bindings'])==0:
          print("Empty")
          return 0
    
       for bindings in json_results['results']['bindings']:
          print( [ (var, value['value'])  for var, value in bindings.items() ] )

       return len(json_results['results']['bindings'])

    except Exception as e :
        print("The operation failed", e)
    
# ASK queries
def run_ask_query(queryString):
    to_run = prefixString + "\n" + queryString

    sparql = SPARQLWrapper("http://a256-gc1-02.srv.aau.dk:5820/sparql")
    sparql.setTimeout(300)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(to_run)

    try :
        return sparql.query().convert()

    except Exception as e :
        print("The operation failed", e)
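
For reference, `run_query` walks the standard SPARQL JSON results layout, where each row in `results.bindings` maps a variable name to a dictionary with a `value` key. The sketch below reproduces that flattening step on a hand-written sample, so it runs without the endpoint. The names `sample` and `flatten_bindings` are illustrative and not part of the original setup.

```python
# Hand-written sample in the SPARQL JSON results layout (not real endpoint output).
sample = {
    "head": {"vars": ["p", "pName"]},
    "results": {
        "bindings": [
            {
                "p": {"type": "uri", "value": "http://www.wikidata.org/prop/direct/P17"},
                "pName": {"type": "literal", "value": "country"},
            }
        ]
    },
}


def flatten_bindings(json_results):
    """Return one list of (variable, value) pairs per result row."""
    return [
        [(var, cell["value"]) for var, cell in row.items()]
        for row in json_results["results"]["bindings"]
    ]


rows = flatten_bindings(sample)
for row in rows:
    print(row)
```

Each printed row has the same shape as the lines shown under "Results" in the cells that follow.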


# GEO Workflow Series ("archaeological sites") 

Consider the following exploratory information need:

> Search for archaeological sites in the world, across countries, continents, and in reference to their culture

## Useful URIs for the current workflow


The following are given:

| IRI           | Description   | Role      |
| -----------   | -----------   |-----------|
| `wdt:P1647`   | subproperty   | predicate |
| `wdt:P31`     | instance of   | predicate |
| `wdt:P279`    | subclass      | predicate |
| `wdt:P17`     | country       | predicate |
| `wd:Q38`      | Italy  | node      |
| `wd:Q641556`  | Verona Arena  | node      |
| `wd:Q1747689` | Ancient Rome  | node |
| `wd:Q46`      | Europe        | node |
| `wd:Q173527`  | Knossos       | node |
| `wd:Q839954`  | archaeological site | node |


Also consider

```
?p wdt:P17 wd:Q38  . 
?p wdt:P31 wd:Q839954  . 
```

is the BGP to retrieve all **archaeological sites in Italy**.

## Workload Goals

1. Identify the BGP that connects an archaeological site to the country, the continent, and the culture

2. Identify the BGP to retrieve other types of an archaeological site, e.g., human settlement or theatre

3. Is there any relevant numerical attribute that describes these sites, e.g., number of visitors?

4. Analyze the number of archaeological sites per type and country
 
   4.1 Which country has the most archaeological sites? Which country has the most human settlements?
   
   4.2 Which countries have Ancient Rome sites, which other "archaeological cultures" are described?
   
   4.3 Which country has the most diverse set of civilizations or cultures across its sites?
   
   4.4 If you are interested in visiting some sites, which country would you pick? Based on what criteria?


In [2]:
# start your workflow here

## 1. Identify the BGP that connects an archaeological site to the country, the continent, and the culture

First, I can look for the desired properties for archaeological sites.

In [3]:
queryString = """
select distinct ?p ?pName where {
    ?arch wdt:P31 wd:Q839954 ;
          ?p ?o .
    
    ?p <http://schema.org/name> ?pName .
    
    filter (regex(?pName, "country|culture", "i"))
}
"""

print("Results")
run_query(queryString)

Results
[('p', 'http://www.wikidata.org/prop/direct/P17'), ('pName', 'country')]
[('p', 'http://www.wikidata.org/prop/direct/P2596'), ('pName', 'culture')]
[('p', 'http://www.wikidata.org/prop/direct/P3569'), ('pName', 'Cultureel Woordenboek ID')]
[('p', 'http://www.wikidata.org/prop/direct/P8698'), ('pName', "Turkey's Culture Portal ID")]
[('p', 'http://www.wikidata.org/prop/direct/P205'), ('pName', 'basin country')]
[('p', 'http://www.wikidata.org/prop/direct/P4702'), ('pName', 'Google Arts & Culture partner ID')]
[('p', 'http://www.wikidata.org/prop/direct/P495'), ('pName', 'country of origin')]
[('p', 'http://www.wikidata.org/prop/direct/P8047'), ('pName', 'country of registry')]
[('p', 'http://www.wikidata.org/prop/direct/P9013'), ('pName', 'Encyclopedia of Saami Culture ID')]


9

So, I can retrieve the country from the property *country (**P17**)* and the culture from property *culture (**P2596**)*.

As for the continent, I need to inspect the properties of a country, for example Italy.

In [4]:
queryString = """
select distinct ?p ?pName where {
    wd:Q38 ?p ?o .
    
    ?p <http://schema.org/name> ?pName .
} order by ?pName
"""

print("Results")
run_query(queryString)

Results
[('p', 'http://www.wikidata.org/prop/direct/P8061'), ('pName', 'AGROVOC ID')]
[('p', 'http://www.wikidata.org/prop/direct/P5198'), ('pName', 'ASC Leiden Thesaurus ID')]
[('p', 'http://www.wikidata.org/prop/direct/P6150'), ('pName', 'Academy Awards Database nominee ID')]
[('p', 'http://www.wikidata.org/prop/direct/P8895'), ('pName', 'All the Tropes identifier')]
[('p', 'http://www.wikidata.org/prop/direct/P7870'), ('pName', 'Analysis & Policy Observatory term ID')]
[('p', 'http://www.wikidata.org/prop/direct/P8785'), ('pName', 'AniDB tag ID')]
[('p', 'http://www.wikidata.org/prop/direct/P9629'), ('pName', 'Armeniapedia ID')]
[('p', 'http://www.wikidata.org/prop/direct/P6200'), ('pName', 'BBC News topic ID')]
[('p', 'http://www.wikidata.org/prop/direct/P1617'), ('pName', 'BBC Things ID')]
[('p', 'http://www.wikidata.org/prop/direct/P9037'), ('pName', 'BHCL UUID')]
[('p', 'http://www.wikidata.org/prop/direct/P2581'), ('pName', 'BabelNet ID')]
[('p', 'http://www.wikidata.org/prop/d

230

Countries are connected to their continent through the property *continent (**P30**)*. So, the final BGP to get the country, the continent and the culture of an archaeological site is:

```
?arch wdt:P17 ?country ;
      wdt:P2596 ?culture .
?country wdt:P30 ?continent .
```

I can test it on the Verona Arena.

In [5]:
queryString = """
select ?archName ?countryName ?continentName ?cultureName where {
    values ?arch { wd:Q641556 } .
    
    ?arch wdt:P17 ?country ;
          wdt:P2596 ?culture .
    ?country wdt:P30 ?continent .
    
    ?arch <http://schema.org/name> ?archName .
    ?country <http://schema.org/name> ?countryName .
    ?continent <http://schema.org/name> ?continentName .
    ?culture <http://schema.org/name> ?cultureName .
}
"""

print("Results")
run_query(queryString)

Results
[('archName', 'Verona Arena'), ('countryName', 'Italy'), ('continentName', 'Europe'), ('cultureName', 'Ancient Rome')]


1

## 2. Identify the BGP to retrieve other types of an archaeological site, e.g., human settlement or theatre

To answer this question, I can retrieve the other classes that archaeological sites are instances of.

In [6]:
queryString = """
select distinct ?type ?typeName where {
    ?arch wdt:P31 wd:Q839954 ;
          wdt:P31 ?type .
    
    ?type <http://schema.org/name> ?typeName .
}
"""

print("Results")
run_query(queryString)

Results
[('type', 'http://www.wikidata.org/entity/Q14947481'), ('typeName', 'wierde')]
[('type', 'http://www.wikidata.org/entity/Q505963'), ('typeName', 'artificial dwelling hill')]
[('type', 'http://www.wikidata.org/entity/Q14601169'), ('typeName', 'human fortified settlement')]
[('type', 'http://www.wikidata.org/entity/Q325017'), ('typeName', 'causewayed enclosure')]
[('type', 'http://www.wikidata.org/entity/Q5069563'), ('typeName', 'chambered long barrow')]
[('type', 'http://www.wikidata.org/entity/Q30504603'), ('typeName', 'Maya site in Mexico')]
[('type', 'http://www.wikidata.org/entity/Q6581615'), ('typeName', 'thermae')]
[('type', 'http://www.wikidata.org/entity/Q95463156'), ('typeName', 'Ancient Roman dam')]
[('type', 'http://www.wikidata.org/entity/Q1128906'), ('typeName', 'medina quarter')]
[('type', 'http://www.wikidata.org/entity/Q19979289'), ('typeName', 'birth house')]
[('type', 'http://www.wikidata.org/entity/Q29702995'), ('typeName', 'chalkotheke')]
[('type', 'http://ww

1117

Using a regex filter on the type name, I can narrow the list down to specific kinds of site, such as theatres and settlements.

In [7]:
queryString = """
select distinct ?type ?typeName where {
    ?arch wdt:P31 wd:Q839954 ;
          wdt:P31 ?type .
    
    ?type <http://schema.org/name> ?typeName .
    
    filter (regex(?typeName, "theatre|settlement", "i"))
}
"""

print("Results")
run_query(queryString)

Results
[('type', 'http://www.wikidata.org/entity/Q14601169'), ('typeName', 'human fortified settlement')]
[('type', 'http://www.wikidata.org/entity/Q1708422'), ('typeName', 'settlement site')]
[('type', 'http://www.wikidata.org/entity/Q7362268'), ('typeName', 'Roman amphitheatre')]
[('type', 'http://www.wikidata.org/entity/Q486972'), ('typeName', 'human settlement')]
[('type', 'http://www.wikidata.org/entity/Q2860319'), ('typeName', 'Greek theatre')]
[('type', 'http://www.wikidata.org/entity/Q19757'), ('typeName', 'Roman theatre')]
[('type', 'http://www.wikidata.org/entity/Q24354'), ('typeName', 'theatre')]
[('type', 'http://www.wikidata.org/entity/Q106468588'), ('typeName', 'Jewish settlement in the land of Israel')]
[('type', 'http://www.wikidata.org/entity/Q2844395'), ('typeName', 'Avenches amphitheatre')]
[('type', 'http://www.wikidata.org/entity/Q100268926'), ('typeName', 'iberian settlement')]
[('type', 'http://www.wikidata.org/entity/Q2264924'), ('typeName', 'port settlement')]

18
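
To see why the filter also returns entries such as *Roman amphitheatre*, the same pattern can be tried offline with Python's `re` module. This is only an approximation of the SPARQL `regex(..., "i")` call (the two regex dialects differ in details, although plain alternation behaves the same way). The names are copied from the two result lists above, and `type_names` and `matches` are illustrative variables.

```python
import re

# A few type names taken from the outputs above.
type_names = [
    "wierde",
    "thermae",
    "human settlement",
    "Roman amphitheatre",
    "Greek theatre",
    "medina quarter",
]

# Case-insensitive, unanchored search: any name containing either word matches.
pattern = re.compile(r"theatre|settlement", re.IGNORECASE)
matches = [name for name in type_names if pattern.search(name)]
print(matches)
```

Because the search is unanchored, "amphitheatre" matches through its "theatre" suffix, which explains the amphitheatre rows in the filtered output.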

I can also use property paths to follow instance and subclass links transitively.
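
A minimal sketch of what such a query could look like: the path `wdt:P31/wdt:P279*` follows one *instance of* edge and then zero or more *subclass* edges, so sites typed with a subclass of the target class are matched as well. The helper name `sites_of_type_query` and its `limit` parameter are hypothetical; the resulting string is meant to be passed to `run_query` from the setup cell, which supplies the `wd:` and `wdt:` prefixes.

```python
def sites_of_type_query(class_id, limit=10):
    """Build a query for sites whose type is class_id or one of its subclasses."""
    return f"""
select distinct ?arch ?archName where {{
    # one 'instance of' step, then any number of 'subclass' steps
    ?arch wdt:P31/wdt:P279* wd:{class_id} .
    ?arch <http://schema.org/name> ?archName .
}}
limit {limit}
"""


queryString = sites_of_type_query("Q839954")
print(queryString)
# In the notebook: run_query(queryString)
```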

## 3. Is there any relevant numerical attribute that describes these sites, e.g., number of visitors?

To find relevant numerical attributes, I can list the properties of archaeological sites whose values are numeric.

In [8]:
queryString = """
select distinct ?p ?pName where {
    ?arch wdt:P31 wd:Q839954 ;
          ?p ?o .
    
    ?p <http://schema.org/name> ?pName .
    filter (isNumeric(?o))
}
"""

print("Results")
run_query(queryString)

Results
[('p', 'http://www.wikidata.org/prop/direct/P1082'), ('pName', 'population')]
[('p', 'http://www.wikidata.org/prop/direct/P2043'), ('pName', 'length')]
[('p', 'http://www.wikidata.org/prop/direct/P2044'), ('pName', 'elevation above sea level')]
[('p', 'http://www.wikidata.org/prop/direct/P2046'), ('pName', 'area')]
[('p', 'http://www.wikidata.org/prop/direct/P2048'), ('pName', 'height')]
[('p', 'http://www.wikidata.org/prop/direct/P2049'), ('pName', 'width')]
[('p', 'http://www.wikidata.org/prop/direct/P2109'), ('pName', 'installed capacity')]
[('p', 'http://www.wikidata.org/prop/direct/P4511'), ('pName', 'vertical depth')]
[('p', 'http://www.wikidata.org/prop/direct/P8687'), ('pName', 'social media followers')]
[('p', 'http://www.wikidata.org/prop/direct/P1174'), ('pName', 'visitors per year')]
[('p', 'http://www.wikidata.org/prop/direct/P2130'), ('pName', 'cost')]
[('p', 'http://www.wikidata.org/prop/direct/P1538'), ('pName', 'number of households')]
[('p', 'http://www.wikida

41

Some attributes are relevant to archaeological sites, such as *visitors per year (**P1174**)*, *cost (**P2130**)*, *area (**P2046**)*, and *maximum capacity (**P1083**)*.

## 4.1. Which country has the most archaeological sites? Which country has the most human settlements?

For archaeological sites, I can reuse the BGP defined in the first section and count the sites per country.

In [9]:
queryString = """
select ?country ?countryName (count(?arch) as ?numSites) where {
    ?arch wdt:P31+/wdt:P279* wd:Q839954 ;
          wdt:P17 ?country .
    
    ?country <http://schema.org/name> ?countryName .
} group by ?country ?countryName
order by desc(?numSites)
limit 1
"""

print("Results")
run_query(queryString)

Results
[('country', 'http://www.wikidata.org/entity/Q34'), ('countryName', 'Sweden'), ('numSites', '80859')]


1

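Sweden is the country with the highest number of archaeological sites. Note that this query counts `?arch` without `distinct`, so a site matched through more than one type path may be counted more than once. As for human settlements, point 2 showed that *human settlement* is described by the entity **Q486972**.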

In [10]:
queryString = """
select ?country ?countryName (count(distinct ?settlement) as ?numSettlements) where {
    ?settlement wdt:P31+/wdt:P279* wd:Q486972 ;
                wdt:P17 ?country .
    
    ?country <http://schema.org/name> ?countryName .
} group by ?country ?countryName
order by desc(?numSettlements)
limit 1
"""

print("Results")
run_query(queryString)

Results
[('country', 'http://www.wikidata.org/entity/Q148'), ('countryName', "People's Republic of China"), ('numSettlements', '704508')]


1

The People's Republic of China (PRC) is the country with the highest number of human settlements.

## 4.2. Which countries have Ancient Rome sites, which other "archaeological cultures" are described?

I can find the countries with Ancient Rome sites by filtering on the culture of archaeological sites, using the BGP defined in the first section. I also list the continent of each country to spot incongruous data, since it would be strange to find Ancient Rome sites in America, for example.

In [11]:
queryString = """
select ?country ?countryName (group_concat(distinct ?continentName; separator=", ") as ?continents) (count(distinct ?site) as ?numSites) where {
    ?site wdt:P31+/wdt:P279* wd:Q839954 ;
          wdt:P2596 wd:Q1747689 ;
          wdt:P17 ?country .
    
    ?country wdt:P30 ?continent .
    
    ?country <http://schema.org/name> ?countryName .
    ?continent <http://schema.org/name> ?continentName .
} group by ?country ?countryName
order by desc(?numSites)
"""

print("Results")
run_query(queryString)

Results
[('country', 'http://www.wikidata.org/entity/Q38'), ('countryName', 'Italy'), ('continents', 'Europe'), ('numSites', '502')]
[('country', 'http://www.wikidata.org/entity/Q142'), ('countryName', 'France'), ('continents', 'Europe'), ('numSites', '58')]
[('country', 'http://www.wikidata.org/entity/Q29'), ('countryName', 'Spain'), ('continents', 'Europe'), ('numSites', '53')]
[('country', 'http://www.wikidata.org/entity/Q948'), ('countryName', 'Tunisia'), ('continents', 'Africa'), ('numSites', '29')]
[('country', 'http://www.wikidata.org/entity/Q145'), ('countryName', 'United Kingdom'), ('continents', 'Europe'), ('numSites', '27')]
[('country', 'http://www.wikidata.org/entity/Q183'), ('countryName', 'Germany'), ('continents', 'Europe'), ('numSites', '21')]
[('country', 'http://www.wikidata.org/entity/Q219'), ('countryName', 'Bulgaria'), ('continents', 'Europe'), ('numSites', '9')]
[('country', 'http://www.wikidata.org/entity/Q1747689'), ('countryName', 'Ancient Rome'), ('continents

33

The other archaeological cultures recorded for the same sites as Ancient Rome are:

In [12]:
queryString = """
select distinct ?culture ?cultureName where {
    ?site wdt:P31+/wdt:P279* wd:Q839954 ;
          wdt:P2596 wd:Q1747689 ;
          wdt:P2596 ?culture ;
          wdt:P17 ?country .
    
    ?culture <http://schema.org/name> ?cultureName .
}
"""

print("Results")
run_query(queryString)

Results
[('culture', 'http://www.wikidata.org/entity/Q1747689'), ('cultureName', 'Ancient Rome')]
[('culture', 'http://www.wikidata.org/entity/Q530894'), ('cultureName', 'Talaiotic Culture')]
[('culture', 'http://www.wikidata.org/entity/Q5011445'), ('cultureName', 'Celtiberians')]
[('culture', 'http://www.wikidata.org/entity/Q45315'), ('cultureName', 'Berbers')]
[('culture', 'http://www.wikidata.org/entity/Q4383747'), ('cultureName', 'Punics')]
[('culture', 'http://www.wikidata.org/entity/Q11772'), ('cultureName', 'Ancient Greece')]
[('culture', 'http://www.wikidata.org/entity/Q12544'), ('cultureName', 'Byzantine Empire')]
[('culture', 'http://www.wikidata.org/entity/Q1200427'), ('cultureName', 'culture of ancient Rome')]
[('culture', 'http://www.wikidata.org/entity/Q22633'), ('cultureName', 'Germanic peoples')]
[('culture', 'http://www.wikidata.org/entity/Q220379'), ('cultureName', 'Adriatic Veneti')]
[('culture', 'http://www.wikidata.org/entity/Q582861'), ('cultureName', 'Marinid Dyn

14

## 4.3. Which country has the most diverse set of civilizations or cultures across its sites?

To answer this question, I can retrieve the cultures related to the sites, group them by country, and count the distinct cultures in each group.

In [13]:
queryString = """
select ?country ?countryName (count(distinct ?culture) as ?numCultures) where {
    ?site wdt:P31+/wdt:P279* wd:Q839954 ;
          wdt:P2596 ?culture ;
          wdt:P17 ?country .
    
    ?country <http://schema.org/name> ?countryName .
} group by ?country ?countryName
order by desc(?numCultures)
limit 1
"""

print("Results")
run_query(queryString)

Results
[('country', 'http://www.wikidata.org/entity/Q38'), ('countryName', 'Italy'), ('numCultures', '62')]


1

Italy is the country with the most diverse set of civilizations and cultures across its sites. In the next query I also list these cultures.

In [14]:
queryString = """
select distinct ?culture ?cultureName where {
    ?site wdt:P31+/wdt:P279* wd:Q839954 ;
          wdt:P2596 ?culture ;
          wdt:P17 wd:Q38 .
    
    ?culture <http://schema.org/name> ?cultureName .
}
"""

print("Results")
run_query(queryString)

Results
[('culture', 'http://www.wikidata.org/entity/Q1747689'), ('cultureName', 'Ancient Rome')]
[('culture', 'http://www.wikidata.org/entity/Q130900'), ('cultureName', 'Lombards')]
[('culture', 'http://www.wikidata.org/entity/Q941821'), ('cultureName', 'Paeligni')]
[('culture', 'http://www.wikidata.org/entity/Q2277'), ('cultureName', 'Roman Empire')]
[('culture', 'http://www.wikidata.org/entity/Q1421436'), ('cultureName', 'nuragic civilization')]
[('culture', 'http://www.wikidata.org/entity/Q500272'), ('cultureName', 'Samnites')]
[('culture', 'http://www.wikidata.org/entity/Q17161'), ('cultureName', 'Etruscans')]
[('culture', 'http://www.wikidata.org/entity/Q11772'), ('cultureName', 'Ancient Greece')]
[('culture', 'http://www.wikidata.org/entity/Q202335'), ('cultureName', 'Latins')]
[('culture', 'http://www.wikidata.org/entity/Q2341546'), ('cultureName', 'Vestini')]
[('culture', 'http://www.wikidata.org/entity/Q841576'), ('cultureName', 'Iapyges')]
[('culture', 'http://www.wikidata.o

62

## 4.4. If you are interested in visiting some sites, which country would you pick? Based on what criteria?

If I want to pick a country to visit based on its archaeological sites, I can consider different criteria. For example, I can consider the number of cultures in the country, as in point 4.3, or the number of archaeological sites, as in point 4.1. I can also consider the number of visitors per year, which gives an idea of how popular a site is.

I can also combine these parameters, assigning each country a score with the following formula:

```
a * avg(numVisitorsPerYear) + b * numArchaeologicalSites + c * numCultures
```

Here `a`, `b`, and `c` are the weights of the three criteria.

For example, with a = 0.7, b = 1 and c = 0.5 (the values used in the query that follows), I can find which country is preferred under these preferences.
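
As a sanity check on the formula, it can be evaluated by hand in Python. The sketch below uses the weights that appear in the query that follows (0.7, 1 and 0.5). The function name `country_score` and the input numbers are made up for illustration and are not taken from Wikidata.

```python
def country_score(avg_visitors, num_sites, num_cultures, a=0.7, b=1.0, c=0.5):
    """Weighted sum of the three criteria: a*visitors + b*sites + c*cultures."""
    return a * avg_visitors + b * num_sites + c * num_cultures


# Hypothetical inputs, chosen only to show the relative size of each term.
example = country_score(avg_visitors=1000, num_sites=500, num_cultures=60)
print(example)
```

Since yearly visitor counts are typically much larger than the number of sites or cultures, the first term is likely to dominate the score unless the inputs are normalised or the weights are adjusted.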

In [15]:
queryString = """
select ?country ?countryName ?score where {
    {
        select ?country (count(distinct ?site) as ?numSites) (count(distinct ?culture) as ?numCultures) (avg(?visitorsPerYear) as ?avgVisitors) where {
            ?site wdt:P31+/wdt:P279* wd:Q839954 ;
                  wdt:P17 ?country .

            optional { ?site wdt:P2596 ?culture } .
            optional { ?site wdt:P1174 ?visitorsPerYear filter (isNumeric(?visitorsPerYear)) } .
        } group by ?country
    } .
    
    bind (if(?avgVisitors, ?avgVisitors, 0) as ?avgVisitorsInt) .
    bind ((?avgVisitorsInt * 0.7 + ?numSites + ?numCultures * 0.5) as ?score) .
    
    ?country <http://schema.org/name> ?countryName .
} order by desc(?score)
limit 1
"""

print("Results")
run_query(queryString)

Results
[('country', 'http://www.wikidata.org/entity/Q38'), ('countryName', 'Italy'), ('score', '12336301.616666666666667')]


1

With these weights, Italy is the best country to visit when combining the number of archaeological sites, the number of visitors, and the number of cultures.

If I am only interested in some types of archaeological site, such as amphitheatres, I can add a filter or use a different BGP. For amphitheatres, I can use:

In [16]:
queryString = """
select ?country ?countryName ?score where {
    {
        select ?country (count(distinct ?site) as ?numSites) (count(distinct ?culture) as ?numCultures) (avg(?visitorsPerYear) as ?avgVisitors) where {
            ?site wdt:P31+/wdt:P279* wd:Q54831 ;
                  wdt:P17 ?country .

            optional { ?site wdt:P2596 ?culture } .
            optional { ?site wdt:P1174 ?visitorsPerYear filter (isNumeric(?visitorsPerYear)) } .
        } group by ?country
    } .
    
    bind (if(?avgVisitors, ?avgVisitors, 0) as ?avgVisitorsInt) .
    
    bind ((?avgVisitorsInt * 0.7 + ?numSites + ?numCultures * 0.5) as ?score) .
    
    ?country <http://schema.org/name> ?countryName .
} order by desc(?score)
limit 1
"""

print("Results")
run_query(queryString)

Results
[('country', 'http://www.wikidata.org/entity/Q38'), ('countryName', 'Italy'), ('score', '5180120.5')]


1

And also in this case, Italy is the best choice.

Finally, as another example, suppose I am interested in mosques (**Q32815**).

In [17]:
queryString = """
select ?country ?countryName ?score where {
    {
        select ?country (count(distinct ?site) as ?numSites) (count(distinct ?culture) as ?numCultures) (avg(?visitorsPerYear) as ?avgVisitors) where {
            ?site wdt:P31+/wdt:P279* wd:Q32815 ;
                  wdt:P17 ?country .

            optional { ?site wdt:P2596 ?culture } .
            optional { ?site wdt:P1174 ?visitorsPerYear filter (isNumeric(?visitorsPerYear))} .
        } group by ?country
    } .
    
    bind (if(?avgVisitors, ?avgVisitors, 0) as ?avgVisitorsInt) .
    
    bind ((?avgVisitorsInt * 0.7 + ?numSites + ?numCultures * 0.5) as ?score) .
    
    ?country <http://schema.org/name> ?countryName .
} order by desc(?score)
limit 1
"""

print("Results")
run_query(queryString)

Results
[('country', 'http://www.wikidata.org/entity/Q29'), ('countryName', 'Spain'), ('score', '980072')]


1

And in this case the best choice is Spain.