**Federated data query**

As I mentioned before, so far we have used Wikidata URLs just as identifiers, but now we are going to use Wikidata's data to enrich our own.

Our triplestore is limited to basic information about film work, director, item and base.

Now we are asked a question: "show me all your film works by the age of the director when the film was released".

To calculate the age of the director we need to know two things: when the director was born and when the film was released.

Wikidata often contains both of these data points.

In [1]:
import pandas
import pydash
import requests

def value_extract(row, col):

    """Extract dictionary values."""

    return pydash.get(row[col], "value")

def sparql_query(query):

    """Send a SPARQL request and return the result bindings as a dataframe."""

    r = requests.post('http://138.197.180.196:3030/test-data/sparql', data={'query': query})
    # Fail early on an HTTP error rather than trying to parse an error page as JSON.
    r.raise_for_status()
    data = pydash.get(r.json(), "results.bindings")
    data = pandas.DataFrame.from_dict(data)
    # Each cell is a dict such as {"type": ..., "value": ...}; keep only the value.
    for x in data.columns:
        data[x] = data.apply(value_extract, col=x, axis=1)
    return data
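
To make the flattening step easier to follow, here is a minimal sketch that runs without a network connection. The `sample_response` dictionary is a hand-written stand-in shaped like the SPARQL JSON results format, and `flatten_bindings` is a hypothetical helper name, not part of the notebook above.

```python
# Hand-written stand-in for the JSON an endpoint returns (an assumption for
# illustration; a real response is produced by the triplestore).
sample_response = {
    "head": {"vars": ["film_work_url", "film_director_url"]},
    "results": {
        "bindings": [
            {
                "film_work_url": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2316927"},
                "film_director_url": {"type": "uri", "value": "http://www.wikidata.org/entity/Q446294"},
            },
            {
                "film_work_url": {"type": "uri", "value": "http://www.wikidata.org/entity/Q455552"},
                "film_director_url": {"type": "uri", "value": "http://www.wikidata.org/entity/Q446294"},
            },
        ]
    },
}


def flatten_bindings(response):
    """Reduce each binding cell to its "value", giving one plain dict per result row."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in response["results"]["bindings"]
    ]


rows = flatten_bindings(sample_response)
for row in rows:
    print(row)
```

Passing a list of dictionaries like `rows` to `pandas.DataFrame` gives a frame with one column per variable, which is the shape `sparql_query` returns.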

To begin, we need to query our own film work and director data.

In [2]:
dataframe = sparql_query(
    '''
    select ?film_work_url ?film_director_url
    where { 
        ?film_work_url <http://www.wikidata.org/entity/P57>  ?film_director_url .
    }
    '''
    ) 

dataframe.head()

Unnamed: 0,film_work_url,film_director_url
0,http://www.wikidata.org/entity/Q2316927,http://www.wikidata.org/entity/Q446294
1,http://www.wikidata.org/entity/Q455552,http://www.wikidata.org/entity/Q446294


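Using the "service" keyword and the Wikidata SPARQL endpoint address, we can pull data out of Wikidata. The patterns inside the service block are evaluated by Wikidata, and the results are joined to our own data on the shared variable, here ?film_director_url.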

In [3]:
dataframe = sparql_query(
    '''
    select ?film_work_url ?film_director_url ?film_director_birth
    where { 
        ?film_work_url <http://www.wikidata.org/entity/P57>  ?film_director_url .
        service <https://query.wikidata.org/sparql> {
            ?film_director_url <http://www.wikidata.org/prop/direct/P569> ?film_director_birth .
        }
    }
    '''
    ) 

dataframe.head()

Unnamed: 0,film_work_url,film_director_url,film_director_birth
0,http://www.wikidata.org/entity/Q2316927,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z
1,http://www.wikidata.org/entity/Q455552,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z
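
One way to picture the result above is as a join between two sets of rows on the variable they share. The sketch below is only a conceptual model, written in plain Python with values copied from the output above. It does not describe how the triplestore actually executes a federated query, and `join_on_shared` is a hypothetical helper name.

```python
# Rows our own triplestore would match for the P57 pattern.
local_rows = [
    {
        "film_work_url": "http://www.wikidata.org/entity/Q2316927",
        "film_director_url": "http://www.wikidata.org/entity/Q446294",
    },
    {
        "film_work_url": "http://www.wikidata.org/entity/Q455552",
        "film_director_url": "http://www.wikidata.org/entity/Q446294",
    },
]

# Rows the remote service would match for the P569 pattern.
remote_rows = [
    {
        "film_director_url": "http://www.wikidata.org/entity/Q446294",
        "film_director_birth": "1959-11-03T00:00:00Z",
    },
]


def join_on_shared(left, right):
    """Combine every pair of rows that agree on all the variables they share."""
    joined = []
    for left_row in left:
        for right_row in right:
            shared = set(left_row) & set(right_row)
            if all(left_row[name] == right_row[name] for name in shared):
                joined.append({**left_row, **right_row})
    return joined


joined = join_on_shared(local_rows, remote_rows)
for row in joined:
    print(row)
```

Both films share the same director, so the single remote row is attached to each of them, which is why the birth date is repeated in the output.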


We can also pull the film release date. There may be more than one per film, in which case we get one row per release date.

In [4]:
dataframe = sparql_query(
    '''
    select ?film_work_url ?film_director_url ?film_director_birth ?film_release_date
    where { 
        ?film_work_url <http://www.wikidata.org/entity/P57>  ?film_director_url .
        service <https://query.wikidata.org/sparql> {
            ?film_director_url <http://www.wikidata.org/prop/direct/P569> ?film_director_birth .
            ?film_work_url <http://www.wikidata.org/prop/direct/P577> ?film_release_date .
        }
    }
    '''
    ) 

dataframe.head()

Unnamed: 0,film_work_url,film_director_url,film_director_birth,film_release_date
0,http://www.wikidata.org/entity/Q2316927,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z,1992-01-01T00:00:00Z
1,http://www.wikidata.org/entity/Q2316927,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z,1993-01-14T00:00:00Z
2,http://www.wikidata.org/entity/Q455552,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z,1994-01-01T00:00:00Z
3,http://www.wikidata.org/entity/Q455552,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z,1994-11-10T00:00:00Z


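We can use basic math within our SPARQL query to subtract the birth year from the release year, which gives us the director's approximate age. Because only the years are compared, the result can be one year too high when the release date falls before the director's birthday in that year.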

In [5]:
dataframe = sparql_query(
    '''
    select ?film_work_url ?film_director_url ?film_director_birth ?film_release_date ?director_age
    where { 
        ?film_work_url <http://www.wikidata.org/entity/P57>  ?film_director_url .
        service <https://query.wikidata.org/sparql> {
            ?film_director_url <http://www.wikidata.org/prop/direct/P569> ?film_director_birth .
            ?film_work_url <http://www.wikidata.org/prop/direct/P577> ?film_release_date .
        }
        bind( year(?film_release_date)-year(?film_director_birth) as ?director_age )
    }
    '''
    ) 

dataframe.head()

Unnamed: 0,film_work_url,film_director_url,film_director_birth,film_release_date,director_age
0,http://www.wikidata.org/entity/Q2316927,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z,1992-01-01T00:00:00Z,33
1,http://www.wikidata.org/entity/Q2316927,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z,1993-01-14T00:00:00Z,34
2,http://www.wikidata.org/entity/Q455552,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z,1994-01-01T00:00:00Z,35
3,http://www.wikidata.org/entity/Q455552,http://www.wikidata.org/entity/Q446294,1959-11-03T00:00:00Z,1994-11-10T00:00:00Z,35
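
If an exact age in whole years is needed, it can be computed from the full timestamps after the query has run. The following is a minimal sketch using only the standard library. The function names are hypothetical, and it assumes every timestamp has the same form as the values in the output above.

```python
from datetime import datetime


def parse_timestamp(text):
    """Parse a timestamp such as 1959-11-03T00:00:00Z into a datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def age_at(birth_text, release_text):
    """Return the director's age in whole years on the release date."""
    birth = parse_timestamp(birth_text)
    release = parse_timestamp(release_text)
    years = release.year - birth.year
    # Take one year off if the birthday had not yet come round in the release year.
    if (release.month, release.day) < (birth.month, birth.day):
        years -= 1
    return years


print(age_at("1959-11-03T00:00:00Z", "1992-01-01T00:00:00Z"))  # 32, where the year difference gives 33
print(age_at("1959-11-03T00:00:00Z", "1994-11-10T00:00:00Z"))  # 35, same as the year difference
```

A release date of 1 January may stand for a date that is only known to the year, so the exact figure is not necessarily more trustworthy than the year difference for those rows.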


Add labels as before. The labels come from our own triplestore, so these patterns sit outside the service block.

In [6]:

dataframe = sparql_query(
    '''
    select ?film_work_url ?film_work ?film_director_url ?film_director ?film_director_birth ?film_release_date ?director_age
    where { 
        ?film_work_url <http://www.wikidata.org/entity/P57>  ?film_director_url .
        ?film_work_url <http://www.w3.org/2000/01/rdf-schema#label> ?film_work .
        ?film_director_url <http://www.w3.org/2000/01/rdf-schema#label> ?film_director .
        service <https://query.wikidata.org/sparql> {
            ?film_director_url <http://www.wikidata.org/prop/direct/P569> ?film_director_birth .
            ?film_work_url <http://www.wikidata.org/prop/direct/P577> ?film_release_date .
        }
        bind( year(?film_release_date)-year(?film_director_birth) as ?director_age )
    }
    '''
    ) 

dataframe.head()

Unnamed: 0,film_work_url,film_work,film_director_url,film_director,film_director_birth,film_release_date,director_age
0,http://www.wikidata.org/entity/Q2316927,Simple Men,http://www.wikidata.org/entity/Q446294,Hal Hartley,1959-11-03T00:00:00Z,1992-01-01T00:00:00Z,33
1,http://www.wikidata.org/entity/Q2316927,Simple Men,http://www.wikidata.org/entity/Q446294,Hal Hartley,1959-11-03T00:00:00Z,1993-01-14T00:00:00Z,34
2,http://www.wikidata.org/entity/Q455552,Amateur,http://www.wikidata.org/entity/Q446294,Hal Hartley,1959-11-03T00:00:00Z,1994-01-01T00:00:00Z,35
3,http://www.wikidata.org/entity/Q455552,Amateur,http://www.wikidata.org/entity/Q446294,Hal Hartley,1959-11-03T00:00:00Z,1994-11-10T00:00:00Z,35


Reduce the results to one row per film by taking the minimum age, which corresponds to the earliest release date, and remove the columns we no longer need.

In [7]:

dataframe = sparql_query(
    '''
    select  ?film_work  ?film_director (min(?director_age) as ?age)
    where { 
        ?film_work_url <http://www.wikidata.org/entity/P57>  ?film_director_url .
        ?film_work_url <http://www.w3.org/2000/01/rdf-schema#label> ?film_work .
        ?film_director_url <http://www.w3.org/2000/01/rdf-schema#label> ?film_director .
        service <https://query.wikidata.org/sparql> {
            ?film_director_url <http://www.wikidata.org/prop/direct/P569> ?film_director_birth .
            ?film_work_url <http://www.wikidata.org/prop/direct/P577> ?film_release_date .
        }
        bind( year(?film_release_date)-year(?film_director_birth) as ?director_age )
    } group by ?film_work ?film_director
    '''
    ) 

dataframe.head()

Unnamed: 0,film_work,film_director,age
0,Simple Men,Hal Hartley,33
1,Amateur,Hal Hartley,35
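
The same reduction could also be done after the query rather than inside it. The sketch below mirrors the `group by` and `min` steps in plain Python, using rows copied from the labelled output above. `minimum_age_per_film` is a hypothetical helper name.

```python
# (film_work, film_director, director_age) rows taken from the labelled output.
rows = [
    ("Simple Men", "Hal Hartley", 33),
    ("Simple Men", "Hal Hartley", 34),
    ("Amateur", "Hal Hartley", 35),
    ("Amateur", "Hal Hartley", 35),
]


def minimum_age_per_film(rows):
    """Group rows by (film, director) and keep the smallest age in each group."""
    ages = {}
    for film, director, age in rows:
        key = (film, director)
        ages[key] = min(age, ages.get(key, age))
    return ages


result = minimum_age_per_film(rows)
for (film, director), age in result.items():
    print(film, director, age)
```

With a dataframe, a `groupby` on the two label columns followed by `min` on the age column would be the closest equivalent. Doing it in the query keeps the result set small before it reaches Python.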
