# Task: Find corresponding data in DBpedia and Wikidata

## Overview:
Find instances in DBpedia and Wikidata that are equivalent to (i.e., co-refer with) the movies in your RDF data set. In other words, find resources that you can link to via the _owl:sameAs_ property.


## Task Details 

1. Use the SPARQL Endpoints of DBpedia (https://dbpedia.org/sparql) and Wikidata (https://query.wikidata.org) to get the data
> __Hint__: Try a few queries in the SPARQL endpoints before incorporating the query into your code
2. If using Python, you can query the SPARQL endpoints with RDFLib or SPARQLWrapper. To find the correct match, you can use the title of the movie, its publication date (or year), and/or the directors' names. 
> __Hint__: Exact matches do not always get the desired results, as labels might be different across Knowledge bases, e.g., “Charles Chaplin” vs. “Charlie Chaplin”
3. If a match is found, add exactly one _owl:sameAs_ link to the matching DBpedia resource and exactly one to the matching Wikidata resource to your dataset from Task 1, e.g.,
`<https://firstname-lastname.org/resource/the_godfather> owl:sameAs  <http://dbpedia.org/resource/The_Godfather>`
> __Hint__: Use RDFLib to load the data you have saved in Task 1 and add the links to the corresponding movies
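
The hint about differing labels can be illustrated with a small fuzzy comparison. The following is a minimal sketch using only the standard library's `difflib`; the helper names and the threshold of 0.85 are assumptions made for illustration and are not part of the task.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lower-case a label and collapse whitespace so trivial differences are ignored."""
    return " ".join(label.lower().split())


def label_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two labels as the same name if they are similar enough.

    The threshold is an assumption and would need tuning on real data.
    """
    return label_similarity(a, b) >= threshold


print(is_probable_match("Charles Chaplin", "Charlie Chaplin"))  # similar names
print(is_probable_match("Charlie Chaplin", "Ridley Scott"))     # unrelated names
```

A score like this could be used to compare director names returned by an endpoint with the names in the local graph, instead of requiring an exact string match.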

<br>

## Submission 2: 

Save the new dataset containing the _owl:sameAs_ statements in N3 format in the output folder under the name __movies_task_2.n3__.


<br>

## Your code

In [19]:
from rdflib import URIRef, Literal, Graph, Namespace
from rdflib.namespace import FOAF, RDF, RDFS, XSD, DC, OWL
import urllib
from datetime import datetime
from SPARQLWrapper import SPARQLWrapper, JSON, N3
import numpy as np
from time import sleep

In [20]:
EX = Namespace("https://ex1.org/")
DBO = Namespace("http://dbpedia.org/ontology/")
RSC = Namespace("http://philip-broehl.org/resource/")
WD = Namespace("http://www.wikidata.org/entity/")
WDT = Namespace("http://www.wikidata.org/prop/direct/")
SCH = Namespace("http://www.schema.org/")

# TODO:
Add film date (if in film name in DBpedia: OPTIONAL)

E.g., film Alien, David Scott: not correct
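
One way the release year from the TODO could enter the matching is as an extra filter in the Wikidata query. The sketch below only builds the query string; the function name is hypothetical, and using `wdt:P577` (publication date) together with `year()` is an assumption about how the date would be compared, not something the notebook does yet.

```python
def build_wikidata_year_query(title: str, year: int) -> str:
    """Build a Wikidata query for films whose English title contains `title`
    and whose publication date (P577) falls in `year`."""
    # Escape backslashes and quotes so the title stays a valid SPARQL literal.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT DISTINCT ?movie ?title ?date WHERE {{
        ?movie wdt:P31 wd:Q11424 ;
            wdt:P1476 ?title ;
            wdt:P577 ?date .
        FILTER (lang(?title) = "en" && contains(str(?title), "{safe_title}") && year(?date) = {int(year)})
    }}
    """


print(build_wikidata_year_query("Alien", 1979))
```

The resulting string could be passed to `SPARQLWrapper.setQuery` in the same way as the queries in the cell below.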

In [21]:
g = Graph()
g.parse(source = "../output_data/movies_task_1.n3", format = 'n3')  # Graph.load() was removed in rdflib 6

sparql_query_db = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    SELECT DISTINCT ?movie ?title ?director ?director_name ?sameAsWD WHERE {
        ?movie rdf:type dbo:Film ;
            rdfs:label ?title ;
            dbo:director ?director .
        ?director foaf:name ?director_name .
    OPTIONAL { ?movie owl:sameAs ?sameAsWD . FILTER (contains(str(?sameAsWD), "wikidata"))}
    FILTER ((lang(?title) = "en") && ("""

sparql_query_wd = """
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    Select distinct ?movie ?title ?director ?director_name WHERE {
        ?movie wdt:P31 wd:Q11424 ;
            wdt:P1476 ?title ;
            wdt:P57 ?director .
        ?director wdt:P1559 ?director_name .
    FILTER (lang(?title) = "en" && ("""

distinct_film_uris = []
results_db = []
results_wd = []
filter_data = ""
i = 0
for s in g.subjects(object = SCH.Movie):
    #  print(g.value(subject = URIRef(s), predicate = RDF.type))
    # if s not in distinct_film_uris and g.value(subject = URIRef(s), predicate = RDF.type) == SCH.Movie:
    i += 1
    print(f"\rFilms processed after current batch: {i}", end = '')
    distinct_film_uris.append(s)

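    # add film title to filters in query
    # (backslashes and double quotes are escaped so the SPARQL string literal stays valid)
    title = str(list(g.objects(subject = URIRef(s), predicate = RDFS.label))[0])
    safe_title = title.replace("\\", "\\\\").replace("\"", "\\\"")
    filter_data += "((contains(?title, \"" + safe_title + "\")) &&" 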

    # get the directors of the current film
    #directors = list(g.objects(subject = URIRef(s), predicate = DBO.director))

    # add director names to filters in query, at least one director has to be a match
    #for director in directors:
    #    director_name = list(g.objects(URIRef(director), predicate = FOAF.name))[0]
    #    filter_data += "(contains(?director_name, \"" + director_name + "\")) ||"

    filter_data = filter_data[0:-3]
    filter_data += ") || "

    # make a SPARQL query every 40 films ("too long query" error if too many films at once)
    # or "too many requests" if query too short
    if i % 40 == 0 or i == 250:
        filter_data = filter_data[0:-4]
        filter_data += ")) .\n    }\n"

        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setQuery(sparql_query_db + filter_data)
        sparql.setReturnFormat(JSON)
        results_db.append(sparql.query().convert())

        sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
        sparql.setQuery(sparql_query_wd + filter_data)
        sparql.setReturnFormat(JSON)
        results_wd.append(sparql.query().convert())            

        filter_data = ""
        sleep(10)
print(f"\rDone! {i} films in total.               ", end = '')

Done! 250 films in total.               
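
The batching in the cell above can also be expressed with small helpers. This is a minimal sketch, not the notebook's implementation: the function names and the sample titles are made up, and only the filter string is built (no endpoint is contacted).

```python
from typing import Iterable, Iterator, List


def escape_sparql_string(text: str) -> str:
    """Escape backslashes and double quotes for use inside a SPARQL "..." literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def title_filter(titles: Iterable[str]) -> str:
    """Join one contains() test per title with ||, similar to the cell above."""
    tests = [f'contains(?title, "{escape_sparql_string(t)}")' for t in titles]
    return "(" + " || ".join(tests) + ")"


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items; the last one may be shorter."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


titles = ["Alien", "Die Hard", 'A "Quoted" Title']  # made-up sample titles
for batch in batches(titles, 2):
    print(title_filter(batch))
```

Because the last batch is simply shorter, this form does not need to know the total number of films in advance.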

In [22]:
# first, we combine all the results from the last step, where we queried in multiple steps
combined_results_db = []
combined_results_wd = []

# results in one list each
for result in results_db:
    combined_results_db += result['results']['bindings']

for result in results_wd:
    combined_results_wd += result['results']['bindings']
    
# establish sameAs links for dbpedia results
for entry in combined_results_db:
    movie_name = entry['title']['value']
    director_name = entry['director_name']['value']
    for s in list(g.subjects(predicate = RDF.type, object = SCH.Movie)):
        
        # match film in Graph with result film
        match_movie_name = g.value(subject = URIRef(s), predicate = RDFS.label)
        if match_movie_name in movie_name:
            g.add((URIRef(s), OWL.sameAs, URIRef(entry['movie']['value'])))
            if 'sameAsWD' in entry.keys():
                g.add((URIRef(s), OWL.sameAs, URIRef(entry['sameAsWD']['value'])))
        
            # link the film's directors to the result director
            # (the Task 1 data stores directors under sch:director, so DBO.director would match nothing)
            directors = list(g.objects(subject = URIRef(s), predicate = SCH.director))
            for director in directors:
                g.add((URIRef(director), OWL.sameAs, URIRef(entry['director']['value'])))
                
# establish sameAs links for wikidata results
for entry in combined_results_wd:
    movie_name = entry['title']['value']
    director_name = entry['director_name']['value']
    for s in list(g.subjects(predicate = RDF.type, object = SCH.Movie)):
        
        # match film in Graph with result film
        match_movie_name = g.value(subject = URIRef(s), predicate = RDFS.label)
        if match_movie_name in movie_name:
            g.add((URIRef(s), OWL.sameAs, URIRef(entry['movie']['value'])))
        
            # link the film's directors to the result director
            directors = list(g.objects(subject = URIRef(s), predicate = SCH.director))
            for director in directors:
                g.add((URIRef(director), OWL.sameAs, URIRef(entry['director']['value'])))

for s in list(g.subjects(predicate = RDF.type, object = SCH.Movie)):
    proposed_db = []
    proposed_wd = []
    proposed_db_wd = []
    winner_uri_db_wd = None
    for sameAs in g.objects(subject = URIRef(s), predicate = OWL.sameAs):
        if "http://dbpedia.org/resource/" in sameAs:
            proposed_db.append(sameAs)
        elif "http://www.wikidata.org/entity/" in sameAs:
            proposed_wd.append(sameAs)
        else:
            proposed_db_wd.append(sameAs)
    # For DBpedia, we take the shortest URI, since the name of the movie is part of it. Say we have the film "Die Hard".
    # Then the proposed URIs end with "Die_Hard" and "Die_Hard_with_a_Vengeance", where the latter is the
    # sequel of the former, so the correct film is obviously the first. However, if we had the film
    # "Die Hard with a Vengeance", we would not even have a proposed URI ending with "Die_Hard". This applies to most
    # films with sequels.
    if len(proposed_db) != 0:
        winner_uri_db = proposed_db[0]
        if len(proposed_db_wd) != 0:
            winner_uri_db_wd = proposed_db_wd[0]
        for i in range(1, len(proposed_db)):
            if len(proposed_db[i]) < len(winner_uri_db):
                winner_uri_db = proposed_db[i]
                if i < len(proposed_db_wd):
                    winner_uri_db_wd = proposed_db_wd[i]
                
        # (hopefully) remove wrong sameAs links from dbpedia
        for uri in proposed_db:
            if uri != winner_uri_db:
                g.remove((URIRef(s), OWL.sameAs, URIRef(uri)))
        if winner_uri_db_wd is not None:
            # some uris from sameAs links from dbpedia do not resolve
            winner_uri_db_wd = winner_uri_db_wd.replace("http://wikidata.dbpedia.org/resource/", "http://www.wikidata.org/entity/")
            for uri in proposed_wd:
                if uri != winner_uri_db_wd:
                    g.remove((URIRef(s), OWL.sameAs, URIRef(uri)))
            for uri in proposed_db_wd:
                if uri != winner_uri_db_wd:
                    g.remove((URIRef(s), OWL.sameAs, URIRef(uri)))
            g.add((URIRef(s), OWL.sameAs, URIRef(winner_uri_db_wd)))
            
    # for Wikidata, we use the sameAs link from DBpedia if it exists; otherwise the
    # URI with the lowest Q-number is usually the best choice
    if len(proposed_wd) != 0 and winner_uri_db_wd is None:
        numbers_wd = []
        for uri in proposed_wd:
            # retrieves the number after "Q" in every URI (named `part` to avoid shadowing the film URI `s`)
            numbers_wd += [int(part) for part in uri.split("Q") if part.isdigit()]
        lowest_index = np.argmin(numbers_wd)
        winner_uri_wd = proposed_wd[lowest_index]

        # (hopefully) remove wrong sameAs links from wikidata
        for uri in proposed_wd:
            if uri != winner_uri_wd:
                g.remove((URIRef(s), OWL.sameAs, URIRef(uri)))

# serialize Graph
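n3_text = g.serialize(format="n3")
# rdflib < 6 returns bytes here, rdflib >= 6 returns str
if isinstance(n3_text, bytes):
    n3_text = n3_text.decode("utf-8")
print(n3_text)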
g.serialize(destination='../output_data/movies_task_2.n3', format='n3')

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rsc: <http://philip-broehl.org/resource/> .
@prefix sch: <http://www.schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

rsc:12_angry_men a sch:Movie ;
    rdfs:label "12 Angry Men"@en ;
    sch:datePublished "1957-01-01"^^xsd:gYear ;
    sch:director rsc:sidney_lumet ;
    = <http://dbpedia.org/resource/12_Angry_Men_(1957_film)>,
        <http://www.wikidata.org/entity/Q386042> .

rsc:12_years_a_slave a sch:Movie ;
    rdfs:label "12 Years a Slave"@en ;
    sch:datePublished "2013-01-01"^^xsd:gYear ;
    sch:director rsc:steve_mcqueen ;
    = <http://dbpedia.org/resource/12_Years_a_Slave_(film)>,
        <http://www.wikidata.org/entity/Q3023357> .

rsc:2001:_a_space_odyssey a sch:Movie ;
    rdfs:label "2001: A Space Odyssey"@en ;
    sch:datePublished "1968-01-01"^^xsd:gYear ;
    sch:director rsc:stanley_kubrick ;
    = <http://dbpedia.org/resource/2001:_A_Spac
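
The two selection heuristics used in the cell above (shortest DBpedia URI, lowest Wikidata Q-number) can be isolated into small functions, which makes them easier to test on their own. This is a sketch under assumptions: the function names are hypothetical, and the Wikidata IDs in the example are placeholders, not the real IDs of these films.

```python
import re
from typing import List, Optional


def shortest_uri(candidates: List[str]) -> Optional[str]:
    """Return the shortest candidate URI, or None if there are no candidates."""
    return min(candidates, key=len) if candidates else None


def lowest_q_number(candidates: List[str]) -> Optional[str]:
    """Return the Wikidata URI with the smallest numeric Q-id.

    URIs that do not end in /Q<digits> are ignored.
    """
    numbered = []
    for uri in candidates:
        match = re.search(r"/Q(\d+)$", uri)
        if match:
            numbered.append((int(match.group(1)), uri))
    return min(numbered)[1] if numbered else None


dbpedia_candidates = [
    "http://dbpedia.org/resource/Die_Hard_with_a_Vengeance",
    "http://dbpedia.org/resource/Die_Hard",
]
wikidata_candidates = [
    "http://www.wikidata.org/entity/Q200",  # placeholder ID
    "http://www.wikidata.org/entity/Q15",   # placeholder ID
]
print(shortest_uri(dbpedia_candidates))
print(lowest_q_number(wikidata_candidates))
```

Both functions return `None` for an empty candidate list, so a caller can skip films without any proposal.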

In [24]:
one_sameAs = 0
two_sameAs = 0
db_sameAs = 0
wd_sameAs = 0
three_sameAs = 0
for s in list(g.subjects(predicate = RDF.type, object = SCH.Movie)):
    i = 0
    for sameAs in g.objects(subject = URIRef(s), predicate = OWL.sameAs):
        i += 1
        if "dbpedia" in sameAs:
            db_sameAs += 1
        else:
            wd_sameAs += 1
    if i >= 1:
        one_sameAs += 1
    if i == 2:
        two_sameAs += 1
    if i >= 3:
        three_sameAs += 1
print(f'Relative amount of films in Graph with at least one sameAs link: {round(one_sameAs / 250 * 100, 2)}%')
print(f'Relative amount of films in Graph with dbpedia sameAs link: {round(db_sameAs / 250 * 100, 2)}%')
print(f'Relative amount of films in Graph with wikidata sameAs link: {round(wd_sameAs / 250 * 100, 2)}%')
print(f'Relative amount of films in Graph with two sameAs links: {round(two_sameAs / 250 * 100, 2)}%')
print(f'Relative amount of films in Graph with at least three sameAs links: {round(three_sameAs / 250 * 100, 2)}%')

Relative amount of films in Graph with at least one sameAs link: 93.6%
Relative amount of films in Graph with dbpedia sameAs link: 93.2%
Relative amount of films in Graph with wikidata sameAs link: 93.6%
Relative amount of films in Graph with two sameAs links: 93.2%
Relative amount of films in Graph with at least three sameAs links: 0.0%
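
The percentages above divide by a hard-coded 250 and count links rather than films for the per-source figures. A per-film variant could look like the following sketch; it works on a plain dictionary from film to its sameAs URIs (an assumed intermediate structure, not something the notebook builds), and the sample URIs are placeholders.

```python
from typing import Dict, List

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
WIKIDATA_PREFIX = "http://www.wikidata.org/entity/"


def coverage(links: Dict[str, List[str]]) -> Dict[str, float]:
    """Return the percentage of films with any link, a DBpedia link, and a Wikidata link."""
    total = len(links)
    if total == 0:
        return {"any": 0.0, "dbpedia": 0.0, "wikidata": 0.0}
    # Each film is counted at most once per category, however many links it has.
    counts = {
        "any": sum(1 for uris in links.values() if uris),
        "dbpedia": sum(1 for uris in links.values()
                       if any(u.startswith(DBPEDIA_PREFIX) for u in uris)),
        "wikidata": sum(1 for uris in links.values()
                        if any(u.startswith(WIKIDATA_PREFIX) for u in uris)),
    }
    return {name: round(count / total * 100, 2) for name, count in counts.items()}


sample_links = {  # placeholder films and URIs
    "film_a": ["http://dbpedia.org/resource/A", "http://www.wikidata.org/entity/Q1"],
    "film_b": ["http://www.wikidata.org/entity/Q2"],
    "film_c": [],
    "film_d": ["http://dbpedia.org/resource/D"],
}
print(coverage(sample_links))
```

Dividing by `len(links)` keeps the figures correct if the number of films in the graph changes.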
