# Data-Driven Research Assignment 3: Linked Open Data
This notebook contains the third, collaborative, graded assignment of the 2023 Data-Driven Research course. In this assignment you'll use Linked Open Data tools in order to search for information on the Web in a more thorough way than with Google.

To complete the assignment, complete the highlighted **Part 1, Part 2, Part 3 and Part 4**.

This is a collaborative assignment. In the text cell below, please include all the names of your group members.

If you used code or a solution from the internet (such as StackOverflow) or another external resource, please make reference to it (in any format). Unattributed copied code will be considered plagiarism and therefore fraud.


**Authors of this answer:**

# 1. Introduction

In this exercise, you'll experiment with a very explicit approach to semantics, and experience how powerful a little semantics can be when searching.

You'll use the DBpedia knowledge base, essentially the content of Wikipedia in machine-readable form, and explore what explicit semantics enable using [SPARQL](http://en.wikipedia.org/wiki/SPARQL) queries, which let you query and search the Web of Linked Open Data. We will use the Python SPARQLWrapper library to access the DBpedia endpoint.

## 1.1. RDF
Linked Open Data consists of a huge number of small facts, in the form of RDF triples, <*Subject*, *Predicate*, *Object*>, each of which consists of a pair of concepts or entities and a relationship between them, such as `Rembrandt birthPlace Leiden`. We work specifically with DBpedia's machine-readable information from Wikipedia, in this case http://dbpedia.org/page/Rembrandt:
`dbr:Rembrandt dbo:birthPlace dbr:Leiden`
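
To make the triple idea concrete, here is a minimal sketch that holds a few triples in a plain Python list and looks them up by pattern. It is a toy illustration only: the `match` helper and the `TRIPLES` list are hypothetical names made up for this example, and nothing here talks to DBpedia.

```python
# Toy illustration only: a few triples held in memory, not real DBpedia access.
TRIPLES = [
    ("dbr:Rembrandt", "dbo:birthPlace", "dbr:Leiden"),
    ("dbr:Rembrandt", "rdf:type", "dbo:Person"),
    ("dbr:Queens_Logic", "dbo:starring", "dbr:Kevin_Bacon"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every position that is not None.

    A None argument plays the role of a variable such as ?p or ?o.
    """
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Everything we know about Rembrandt: <dbr:Rembrandt, ?p, ?o>
rembrandt_facts = match(TRIPLES, subject="dbr:Rembrandt")
# Who was born in Leiden: <?s, dbo:birthPlace, dbr:Leiden>
born_in_leiden = match(TRIPLES, predicate="dbo:birthPlace", obj="dbr:Leiden")

print(rembrandt_facts)
print(born_in_leiden)
```

Leaving a position open is exactly what a variable does in the SPARQL queries later in this notebook.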

It's hard to guess upfront how information is encoded in DBpedia, and linked data is all about having unique identifiers for every entity or concept. The best way is to look at examples, and to use Google to kickstart the search for a particular DBpedia entity.

For example, Google "dbpedia rembrandt", which will give you a neat page with DBpedia facts about him (https://dbpedia.org/page/Rembrandt). If you look at the link behind "About: Rembrandt", you find that the unique identifier of this entity is http://dbpedia.org/resource/Rembrandt; entering this link/ID in your browser will generate the overview page.

Inside DBpedia, you can use a shorthand `dbr:Rembrandt` (which is defined to unfold to http://dbpedia.org/resource/Rembrandt) as the unique ID, but it also works if you use the long URL! Recall that DBpedia does not have pages, but only countless RDF triples, and the overview page is just the output of all triples <`dbr:Rembrandt`, *Predicate*, *Object*>, allowing you also to see what further relations and concepts to explore.
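
The unfolding of a shorthand into a full URI can be sketched in a few lines of Python. This is an illustration, not how the endpoint does it: the `expand` helper and the `PREFIXES` table are hypothetical names, and apart from `dbr:` (given in the text) the namespaces listed are the commonly used ones for these prefixes, included here as assumptions.

```python
# Prefix table. "dbr" comes from the text; the other entries are the usual
# namespaces for these prefixes and are assumptions of this sketch.
PREFIXES = {
    "dbr": "http://dbpedia.org/resource/",
    "dbo": "http://dbpedia.org/ontology/",
    "dbc": "http://dbpedia.org/resource/Category:",
    "dct": "http://purl.org/dc/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(short_name):
    """Unfold a prefixed name such as dbr:Rembrandt into its full URI.

    Anything without a known prefix (for example a full URL) is returned unchanged.
    """
    prefix, sep, local = short_name.partition(":")
    if not sep or prefix not in PREFIXES:
        return short_name
    return PREFIXES[prefix] + local


print(expand("dbr:Rembrandt"))
print(expand("dbo:birthPlace"))
print(expand("http://dbpedia.org/resource/Rembrandt"))  # already a full URI
```

Both the short and the long form therefore name the same entity, which is why either can be used in a query.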

## 1.2. SPARQL
SPARQL is the designated query language for RDF, modelled on the SQL query language for relational databases. It uses natural-language words like SELECT, DISTINCT, WHERE, ORDER BY, and LIMIT in a very specific way. We introduce it by example, but feel free to backtrack to one of the many tutorials and introductions on the web.

# 2. Working with SPARQL queries

There is a large database of facts derived from Wikipedia, called [DBpedia](http://dbpedia.org/About), which contains information about everything and the rest. Let's look at American Films in the first part of this assignment. You will access a so-called SPARQL endpoint for DBpedia through Python. Each film in the category American Films has a fact in DBpedia stating that it has a relationship with that category, namely through the `dct:subject` property. That is, the film Pulp Fiction has an RDF triple
`dbr:Pulp_Fiction dct:subject dbc:American_films`

If you enter the identifier of the entity `Pulp_Fiction` in your browser, that is, http://dbpedia.org/resource/Pulp_Fiction (for which `dbr:Pulp_Fiction` is shorthand), you get a page http://dbpedia.org/page/Pulp_Fiction with a selection of the facts <`dbr:Pulp_Fiction`, ?, ?> in the database.

Now, let's access this database through Python. We will need the SPARQLWrapper, which does not come with Google Colab by default, so we should install it:

In [48]:
!pip install sparqlwrapper



Then, we are able to import it and choose the DBpedia endpoint to access that database:

In [49]:
from SPARQLWrapper import SPARQLWrapper
import pandas as pd
from io import StringIO
from IPython.display import display

# Specify the DBPedia endpoint
sparql = SPARQLWrapper("http://dbpedia.org/sparql")

# Specify that we want results in the CSV format
sparql.setReturnFormat('csv')

Now, we can try some SPARQL queries. First we set the query as a multi-line string, and then we use the query function to actually run the query.

Note that in the query below, ?film is a variable name, just like a Python variable - we could rename it to anything else, and the result would be the same.

In [50]:
sparql.setQuery("""
    SELECT ?film 
    WHERE {?film dct:subject dbc:2010s_American_films} 
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
print(result[:200])

"film"
"http://dbpedia.org/resource/Cabin_Fever:_Patient_Zero"
"http://dbpedia.org/resource/Cabin_Fever_(2016_film)"
"http://dbpedia.org/resource/Caesar_and_Otto's_Deadly_Xmas"
"http://dbpedia.org/res


The result variable now contains a list in CSV format of 1000 links to 2010s American films on DBpedia (we limited the number of results to 1000, but it can be increased). To make it easier to visualize, let's turn it into a Pandas dataframe (this will become more useful as we retrieve more properties):

In [51]:
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film
0,http://dbpedia.org/resource/Cabin_Fever:_Patie...
1,http://dbpedia.org/resource/Cabin_Fever_(2016_...
2,http://dbpedia.org/resource/Caesar_and_Otto's_...
3,http://dbpedia.org/resource/Café_(2010_film)
4,http://dbpedia.org/resource/Café_Society_(2016...
...,...
995,http://dbpedia.org/resource/Johnny_English_Str...
996,http://dbpedia.org/resource/Johnny_Frank_Garre...
997,http://dbpedia.org/resource/Joint_Body
998,http://dbpedia.org/resource/Jojo_Rabbit


We see that the entities, e.g. `dbc:2010s_American_films`, are URIs referring to a unique entity in the linked open data cloud. So the name is a unique ID, in this case shorthand for http://dbpedia.org/resource/Category:2010s_American_films.
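
Since every result is a full URI, it can help to shorten the URIs into readable names once you have the CSV text. The sketch below uses only the standard library and a two-row sample copied from the output above; `local_name` and `sample_csv` are hypothetical names for this example.

```python
import csv
from io import StringIO

# A tiny stand-in for the decoded CSV text returned by the endpoint;
# the two rows are copied from the output shown above.
sample_csv = (
    '"film"\n'
    '"http://dbpedia.org/resource/Cabin_Fever:_Patient_Zero"\n'
    '"http://dbpedia.org/resource/Cabin_Fever_(2016_film)"\n'
)

RESOURCE_NS = "http://dbpedia.org/resource/"


def local_name(uri):
    """Strip the DBpedia resource namespace and turn underscores into spaces."""
    if uri.startswith(RESOURCE_NS):
        return uri[len(RESOURCE_NS):].replace("_", " ")
    return uri


# DictReader uses the header line ("film") as the key for each row.
rows = list(csv.DictReader(StringIO(sample_csv)))
titles = [local_name(row["film"]) for row in rows]
print(titles)
```

The same helper could be applied to a dataframe column, for instance with `df["film"].map(local_name)`, assuming the column is named after the query variable as it is here.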

Take a closer look at that SPARQL query. Can you figure out how it works? If you're familiar with structured query languages like SQL, you'll recognize many aspects. If this is completely new to you, there are still many recognizable words to help you interpret this query. Let's get back to this later.

To see the power of this form of searching, let's try a slightly more complex query, where you add a second RDF-like condition to the query, separated from the first by a dot `.` representing a `join` or `AND`:

In [52]:
sparql.setQuery("""
    SELECT ?film ?actor
    WHERE {?film dct:subject dbc:2010s_American_films . ?film dbo:starring ?actor } 
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actor
0,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Brando_Eaton
1,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Ryan_Donowho
2,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Lydia_Hearst
3,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Jillian_Murray
4,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Sean_Astin
...,...,...
995,http://dbpedia.org/resource/Scary_Movie_5,http://dbpedia.org/resource/J._P._Manoux
996,http://dbpedia.org/resource/Scary_Movie_5,http://dbpedia.org/resource/Terry_Crews
997,http://dbpedia.org/resource/Scary_Movie_5,http://dbpedia.org/resource/Ashley_Tisdale
998,http://dbpedia.org/resource/Scary_Movie_5,http://dbpedia.org/resource/Jerry_O'Connell


You should now get a table with American films and the actors that have played in them. This list is not complete. The content of DBpedia is based on the Infoboxes of Wikipedia pages, which have a standard format. The knowledge in DBpedia is as good as the encyclopedic information on Wikipedia. Not all actors per film are listed and not all films and actors have their own Wikipedia article.
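
What the dot between the two conditions does can be mimicked in plain Python: both conditions must hold for the same value of `?film`. The sketch below is a toy version over a handful of hand-picked triples (taken from outputs in this notebook), not a description of how the endpoint evaluates queries.

```python
# Toy data: (subject, predicate, object) triples, hand-picked for illustration.
triples = [
    ("dbr:Scary_Movie_5", "dct:subject", "dbc:2010s_American_films"),
    ("dbr:Scary_Movie_5", "dbo:starring", "dbr:Terry_Crews"),
    ("dbr:Scary_Movie_5", "dbo:starring", "dbr:Ashley_Tisdale"),
    # This film has a starring triple but no 2010s category triple in the toy data.
    ("dbr:Queens_Logic", "dbo:starring", "dbr:Tom_Waits"),
]

# First condition: ?film dct:subject dbc:2010s_American_films
films = {
    s for (s, p, o) in triples
    if p == "dct:subject" and o == "dbc:2010s_American_films"
}

# Second condition, joined on the shared variable: ?film dbo:starring ?actor
pairs = [
    (s, o) for (s, p, o) in triples
    if p == "dbo:starring" and s in films
]

print(pairs)
```

Only films that satisfy both conditions survive, so the Queens Logic row drops out even though it has a `dbo:starring` triple.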

Try a few more SPARQL queries, given below. See if you can figure out beforehand what results they will give:

In [53]:
sparql.setQuery("""
    SELECT DISTINCT ?actor
    WHERE {?film dct:subject dbc:2010s_American_films . ?film dbo:starring ?actor } 
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,actor
0,http://dbpedia.org/resource/Brando_Eaton
1,http://dbpedia.org/resource/Ryan_Donowho
2,http://dbpedia.org/resource/Lydia_Hearst
3,http://dbpedia.org/resource/Jillian_Murray
4,http://dbpedia.org/resource/Sean_Astin
...,...
995,http://dbpedia.org/resource/Chris_Zylka
996,http://dbpedia.org/resource/Louisa_Krause
997,http://dbpedia.org/resource/Dianna_Agron
998,http://dbpedia.org/resource/Scott_Speedman


In [54]:
sparql.setQuery("""
    SELECT DISTINCT COUNT(?actor) ?actor
    WHERE {?film dct:subject dbc:2010s_American_films . ?film dbo:starring ?actor } 
    ORDER BY DESC(COUNT(?actor))
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0,actor
0,49,http://dbpedia.org/resource/James_Franco
1,44,http://dbpedia.org/resource/Danny_Trejo
2,34,http://dbpedia.org/resource/Nicolas_Cage
3,31,http://dbpedia.org/resource/Danny_Glover
4,31,http://dbpedia.org/resource/Eric_Roberts
...,...,...
995,7,http://dbpedia.org/resource/Cody_Horn
996,7,http://dbpedia.org/resource/Lawrence_Michael_L...
997,7,http://dbpedia.org/resource/Sean_Paul_Lockhart
998,7,http://dbpedia.org/resource/Joel_David_Moore
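
The counting query above groups the matching rows by actor and counts the rows in each group. A rough Python equivalent, on made-up rows, looks like this. The film and actor pairs are hypothetical placeholders, not DBpedia data. Note also that the query relies on the endpoint grouping by `?actor` implicitly; to our understanding, standard SPARQL would normally spell this out with `GROUP BY ?actor`.

```python
from collections import Counter

# Hypothetical (film, actor) rows standing in for the query result.
rows = [
    ("film_A", "James_Franco"),
    ("film_B", "James_Franco"),
    ("film_B", "Danny_Trejo"),
    ("film_C", "James_Franco"),
]

# Group by actor and count the rows per actor, like COUNT(?actor) per ?actor.
films_per_actor = Counter(actor for _film, actor in rows)

# Sort from most to fewest films, like ORDER BY DESC(COUNT(?actor)).
ranking = films_per_actor.most_common()
print(ranking)
```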


**Part 1**: Adapt the SPARQL query above to count per film how many actors it has. This requires a minimal change to the query.   

In [55]:
#Changing the query to count the number of actors per film
sparql.setQuery("""
    SELECT DISTINCT COUNT(?actor) ?film
    WHERE {?film dct:subject dbc:2010s_American_films . ?film dbo:starring ?actor }
    ORDER BY DESC(COUNT(?actor))
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0,film
0,36,http://dbpedia.org/resource/Holidays_(2016_film)
1,22,http://dbpedia.org/resource/Movie_43
2,21,http://dbpedia.org/resource/Isle_of_Dogs_(film)
3,21,http://dbpedia.org/resource/I_Am_Comic
4,20,http://dbpedia.org/resource/Muscle_Shoals_(film)
...,...,...
995,7,http://dbpedia.org/resource/Motherless_Brookly...
996,7,http://dbpedia.org/resource/The_Last_Rescue
997,7,http://dbpedia.org/resource/The_Pretenders_(20...
998,7,http://dbpedia.org/resource/The_Thinning


# 3. Six Degrees of Kevin Bacon

There is a game related to the notion of [Six Degrees of Separation](http://en.wikipedia.org/wiki/Six_degrees_of_separation). This involves the network of actors who have played in a film together with Kevin Bacon, and actors who have played with those actors, etc. The goal is to figure out the shortest path, between actors who have co-starred in a film, between any Hollywood actor and the actor Kevin Bacon. One of the research aspects is whether the network of actors in Hollywood form a so-called [Small World](http://en.wikipedia.org/wiki/Small-world_experiment) network.
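
The shortest-path idea behind the game can be sketched with a breadth-first search over a co-star network. The cast lists below are invented placeholders (`film_1`, `actor_A`, and so on), and `bacon_numbers` is a hypothetical helper. With real data you would first have to collect the cast lists, for example with the queries in the steps below.

```python
from collections import deque

# Hypothetical cast lists; film and actor names are placeholders, not DBpedia data.
casts = {
    "film_1": {"Kevin_Bacon", "actor_A"},
    "film_2": {"actor_A", "actor_B"},
    "film_3": {"actor_B", "actor_C"},
    "film_4": {"actor_D"},  # nobody here is connected to Kevin Bacon
}


def bacon_numbers(casts, start="Kevin_Bacon"):
    """Breadth-first search: number of co-starring steps from `start` to each actor."""
    # Build the co-star network: actor -> set of actors sharing at least one film.
    costars = {}
    for cast in casts.values():
        for actor in cast:
            costars.setdefault(actor, set()).update(cast - {actor})

    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in costars.get(current, ()):
            if neighbour not in distance:  # first visit is the shortest path
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance


distances = bacon_numbers(casts)
print(distances)
```

Actors who cannot be reached at all, like `actor_D` here, simply do not appear in the result.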

**Part 2** (some questions in the steps below)

Steps:

1. Let's first query:

In [56]:
sparql.setQuery("""
    SELECT ?film
    WHERE {?film dbo:starring dbr:Kevin_Bacon}
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film
0,http://dbpedia.org/resource/Queens_Logic
1,http://dbpedia.org/resource/Quicksilver_(film)
2,http://dbpedia.org/resource/Murder_in_the_Firs...
3,http://dbpedia.org/resource/Beauty_Shop
4,http://dbpedia.org/resource/Beverly_Hills_Cop:...
...,...
64,http://dbpedia.org/resource/R.I.P.D.
65,http://dbpedia.org/resource/Rails_&_Ties
66,http://dbpedia.org/resource/She's_Having_a_Baby
67,http://dbpedia.org/resource/X-Men:_First_Class


2. How many films are in the list? We can count using a COUNT statement:

In [57]:
sparql.setQuery("""
    SELECT COUNT(?film)
    WHERE {?film dbo:starring dbr:Kevin_Bacon}
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0
0,69


3. Is this list complete? Compare this number with, for instance, the number of films listed on Kevin Bacon's IMDB page.

In [58]:
# Import requests for the HTTP call and BeautifulSoup for HTML parsing
import requests
from bs4 import BeautifulSoup
# Set a browser user agent to avoid 403 errors
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
# Get the movies from Kevin Bacon's IMDb page
url = "https://www.imdb.com/name/nm0000102/"
# Pass the headers along; otherwise the user agent defined above is never used
html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html, 'html.parser')
# A count of 0 may simply mean that the "filmo-row" class does not match the page's markup
movies = soup.find_all("div", {"class": "filmo-row"})
print(len(movies))

0


**(Your answer here)**

4. To see the power of this form of searching, let's gradually make this into a more complex query. Let's ask for the list of actors co-starring with Kevin Bacon. You can add a second RDF-like condition to the query, separated from the first by a dot:

In [59]:
sparql.setQuery("""
    SELECT ?film ?actor
    WHERE { ?film dbo:starring dbr:Kevin_Bacon . ?film dbo:starring dbr:Madeline_Brewer}
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actor
0,http://dbpedia.org/resource/Space_Oddity_(film),


## We have added Madeline Brewer to the query as she has acted with Kevin Bacon in the movie "Space Oddity" (2022)
https://www.imdb.com/title/tt6317762/fullcredits?ref_=tt_ov_st_sm
### We can see that the query returns the movie "Space Oddity" (2022); the actor column stays empty because `?actor` no longer appears in the WHERE clause

You should now get a table with films that Kevin Bacon played in, and the actors who played with him in those films. This list also includes Kevin Bacon in each of those films.

5. To remove Kevin Bacon himself from the list of co-actors, you can use a Regular Expression (as discussed in [Coding the Humanities](https://github.com/bloemj/2023-coding-the-humanities/blob/main/notebooks/2_Text.ipynb)) to remove any actor containing the name `Kevin Bacon`, like this:

In [60]:
sparql.setQuery("""
    SELECT ?film ?actor
    WHERE { ?film dbo:starring dbr:Kevin_Bacon . ?film dbo:starring ?actor .
            FILTER (!regex(?actor, "Kevin_Bacon"))} 
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actor
0,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Linda_Fiorentino
1,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tom_Waits
2,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tony_Spiridakis
3,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Jamie_Lee_Curtis
4,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Chloe_Webb
...,...,...
323,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/January_Jones
324,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Jennifer_Lawrence
325,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Michael_Fassbender
326,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Oliver_Platt


Or even better, using the semantic entity name directly:

In [61]:
sparql.setQuery("""
    SELECT ?film ?actor
    WHERE { ?film dbo:starring dbr:Kevin_Bacon . ?film dbo:starring ?actor .
            FILTER ( ?actor != dbr:Kevin_Bacon )}
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actor
0,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Linda_Fiorentino
1,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tom_Waits
2,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tony_Spiridakis
3,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Jamie_Lee_Curtis
4,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Chloe_Webb
...,...,...
323,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/January_Jones
324,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Jennifer_Lawrence
325,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Michael_Fassbender
326,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Oliver_Platt
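
Why comparing against the entity is preferable to a regular expression can be shown in a few lines of Python. The second URI below is hypothetical, invented only to show how a substring match can remove more than intended.

```python
import re

BACON = "http://dbpedia.org/resource/Kevin_Bacon"

actors = [
    BACON,
    # Hypothetical URI, made up to show an over-broad regex match.
    "http://dbpedia.org/resource/Kevin_Bacon_(hypothetical_namesake)",
    "http://dbpedia.org/resource/Tom_Waits",
]

# Like FILTER (!regex(?actor, "Kevin_Bacon")): drops anything containing the text.
kept_by_regex = [a for a in actors if not re.search("Kevin_Bacon", a)]

# Like FILTER (?actor != dbr:Kevin_Bacon): drops exactly one entity.
kept_by_identity = [a for a in actors if a != BACON]

print(kept_by_regex)
print(kept_by_identity)
```

The regex version silently drops the namesake as well, while the identity comparison removes only the one entity we meant.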


6. To get a list of only the actors, remove `?film` from the selection. Use `SELECT DISTINCT` if needed to avoid duplicates.  Now can you get a count of the number of actors in the list?

In [62]:
sparql.setQuery("""
    SELECT DISTINCT ?actor
    WHERE { ?film dbo:starring dbr:Kevin_Bacon . ?film dbo:starring ?actor .
            FILTER ( ?actor != dbr:Kevin_Bacon )}
""")
result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,actor
0,http://dbpedia.org/resource/Linda_Fiorentino
1,http://dbpedia.org/resource/Tom_Waits
2,http://dbpedia.org/resource/Tony_Spiridakis
3,http://dbpedia.org/resource/Jamie_Lee_Curtis
4,http://dbpedia.org/resource/Chloe_Webb
...,...
292,http://dbpedia.org/resource/James_McAvoy
293,http://dbpedia.org/resource/January_Jones
294,http://dbpedia.org/resource/Jennifer_Lawrence
295,http://dbpedia.org/resource/Michael_Fassbender


7. We'll expand the SPARQL query to get actors who co-starred with actors who co-starred with Kevin Bacon, i.e. actors who are two steps away from Kevin Bacon. First get all films that the co-stars of Kevin Bacon played in:

In [63]:
sparql.setQuery("""
    SELECT DISTINCT ?film1 ?actor1 ?film2 
    WHERE { ?film1 dbo:starring dbr:Kevin_Bacon . ?film1 dbo:starring ?actor1 
               . ?film2 dbo:starring ?actor1 .
               FILTER ( ?actor1 != dbr:Kevin_Bacon )  }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film1,actor1,film2
0,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/The_Last_Shot
1,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/Things_You_Can_Tel...
2,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/Brothers_&_Sisters...
3,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/A_Midsummer_Night'...
4,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/Ally_McBeal
...,...,...,...
9995,http://dbpedia.org/resource/JFK_(film),http://dbpedia.org/resource/Michael_Rooker,http://dbpedia.org/resource/Afterburn_(film)
9996,http://dbpedia.org/resource/JFK_(film),http://dbpedia.org/resource/Michael_Rooker,http://dbpedia.org/resource/DC_Showcase:_Jonah...
9997,http://dbpedia.org/resource/JFK_(film),http://dbpedia.org/resource/Michael_Rooker,http://dbpedia.org/resource/F9_(film)
9998,http://dbpedia.org/resource/JFK_(film),http://dbpedia.org/resource/Michael_Rooker,http://dbpedia.org/resource/Fantasy_Island_(film)


8. Next we add the co-stars of the co-stars of Kevin Bacon as `?actor2`:

In [64]:
sparql.setQuery("""
    SELECT ?film1 ?actor1 ?film2 ?actor2 
    WHERE { ?film1 dbo:starring dbr:Kevin_Bacon . ?film1 dbo:starring ?actor1 
           . ?film2 dbo:starring ?actor1 . ?film2 dbo:starring ?actor2 .
               FILTER (?actor1 != dbr:Kevin_Bacon && ?actor2 != dbr:Kevin_Bacon ) }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film1,actor1,film2,actor2
0,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Rodney_Dangerfield
1,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Bill_Murray
2,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Ted_Knight
3,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Chevy_Chase
4,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Michael_O'Keefe
...,...,...,...,...
9995,http://dbpedia.org/resource/A_Few_Good_Men,http://dbpedia.org/resource/Demi_Moore,http://dbpedia.org/resource/We're_No_Angels_(1...,http://dbpedia.org/resource/Robert_De_Niro
9996,http://dbpedia.org/resource/A_Few_Good_Men,http://dbpedia.org/resource/Demi_Moore,http://dbpedia.org/resource/We're_No_Angels_(1...,http://dbpedia.org/resource/Bruno_Kirby
9997,http://dbpedia.org/resource/A_Few_Good_Men,http://dbpedia.org/resource/Demi_Moore,http://dbpedia.org/resource/We're_No_Angels_(1...,http://dbpedia.org/resource/James_Russo
9998,http://dbpedia.org/resource/A_Few_Good_Men,http://dbpedia.org/resource/Demi_Moore,http://dbpedia.org/resource/We're_No_Angels_(1...,http://dbpedia.org/resource/Ray_McAnally


9. How many actors are within two steps of Kevin Bacon?  Are there no duplicates? Note the difference between `COUNT(DISTINCT ?film)` and `COUNT(?film)`.

In [65]:
sparql.setQuery("""
    SELECT COUNT(DISTINCT ?actor2)
    WHERE { ?film1 dbo:starring dbr:Kevin_Bacon . ?film1 dbo:starring ?actor1
           . ?film2 dbo:starring ?actor1 . ?film2 dbo:starring ?actor2 .
               FILTER (?actor1 != dbr:Kevin_Bacon && ?actor2 != dbr:Kevin_Bacon ) }
""")
result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0
0,11962


In [66]:
sparql.setQuery("""
    SELECT COUNT(?actor2)
    WHERE { ?film1 dbo:starring dbr:Kevin_Bacon . ?film1 dbo:starring ?actor1
           . ?film2 dbo:starring ?actor1 . ?film2 dbo:starring ?actor2 .
               FILTER (?actor1 != dbr:Kevin_Bacon && ?actor2 != dbr:Kevin_Bacon ) }
""")
result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0
0,72987


The first query gives the number of distinct actors, while the second counts every result row, i.e. every path film1 → actor1 → film2 → actor2. This creates a huge difference, 11962 vs. 72987. The difference arises because the same actor can be reached along many different paths: through several co-stars of Kevin Bacon, or through several films shared with the same co-star.
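
The gap between the two counts can be reproduced on a toy scale. In the sketch below each row is one path (film1, actor1, film2, actor2). All names are hypothetical placeholders.

```python
# Hypothetical result rows (film1, actor1, film2, actor2), one per matching path.
paths = [
    ("film_1", "actor_A", "film_2", "actor_X"),
    ("film_1", "actor_A", "film_3", "actor_X"),  # same actor2 via another film2
    ("film_4", "actor_B", "film_2", "actor_X"),  # same actor2 via another actor1
    ("film_1", "actor_A", "film_2", "actor_Y"),
]

# COUNT(?actor2): one count per row, duplicates included.
row_count = len([path[3] for path in paths])

# COUNT(DISTINCT ?actor2): each actor counted once.
distinct_count = len({path[3] for path in paths})

print(row_count, distinct_count)
```

Here `actor_X` appears on three paths but is only one person, so the row count is twice the distinct count.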

# 4. Wanderlust

DBpedia is great to wander around, just like browsing through Wikipedia, but with powerful aggregation tools at your fingertips. Follow this walk, and make your own walks.

## Example Walk

1. Starting is always hard, so let's start with Google "dbpedia rembrandt", as we did in the lecture.

This gives you quite some info, and reveals that Rembrandt is a DBpedia resource, hence `dbr:Rembrandt` (shorthand for http://dbpedia.org/resource/Rembrandt) is the unique ID inside DBpedia.

2. Let's see what information there is with `dbr:Rembrandt`

In [67]:
sparql.setQuery("""
    SELECT ?p ?o WHERE { dbr:Rembrandt ?p ?o }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df.head(20) #Show the first 20 results

Unnamed: 0,p,o
0,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.w3.org/2002/07/owl#Thing
1,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://xmlns.com/foaf/0.1/Person
2,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://dbpedia.org/ontology/Person
3,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.ontologydesignpatterns.org/ont/dul/...
4,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q19088
5,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q215627
6,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q483501
7,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q5
8,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q729
9,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://dbpedia.org/class/yago/WikicatArtistsFr...


3. The list is partial, but one relation to explore could be the "type" of entity.

In [68]:
sparql.setQuery("""
    SELECT ?o WHERE { dbr:Rembrandt rdf:type ?o }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df.head(30) #Show the first 30 results

Unnamed: 0,o
0,http://www.w3.org/2002/07/owl#Thing
1,http://xmlns.com/foaf/0.1/Person
2,http://dbpedia.org/ontology/Person
3,http://www.ontologydesignpatterns.org/ont/dul/...
4,http://www.wikidata.org/entity/Q19088
5,http://www.wikidata.org/entity/Q215627
6,http://www.wikidata.org/entity/Q483501
7,http://www.wikidata.org/entity/Q5
8,http://www.wikidata.org/entity/Q729
9,http://dbpedia.org/class/yago/WikicatArtistsFr...


4. That's a lot, but you probably already knew a lot about him. Let's move to another entity that is well represented in Wikipedia.

In [69]:
sparql.setQuery("""
    SELECT ?o WHERE { dbr:Darth_Vader rdf:type ?o }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df.head(30) #Show the first 30 results

Unnamed: 0,o
0,http://www.w3.org/2002/07/owl#Thing
1,http://www.ontologydesignpatterns.org/ont/dul/...
2,http://dbpedia.org/ontology/Agent
3,http://www.wikidata.org/entity/Q24229398
4,http://www.wikidata.org/entity/Q95074
5,http://dbpedia.org/class/yago/Amputee109789566
6,http://dbpedia.org/class/yago/Assassin109813696
7,http://dbpedia.org/class/yago/Aviator109826204
8,http://dbpedia.org/class/yago/BadPerson109831962
9,http://dbpedia.org/class/yago/CausalAgent10000...


5. With so many things, what is his actual occupation?

In [70]:
sparql.setQuery("""
    SELECT ?occupation WHERE { dbr:Darth_Vader dbo:occupation ?occupation }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,occupation
0,http://dbpedia.org/resource/Sith
1,http://dbpedia.org/resource/Slave
2,http://dbpedia.org/resource/Jedi


6. Wow, but who else then is a Jedi?

In [71]:
sparql.setQuery("""
    SELECT ?person WHERE { ?person dbo:occupation dbr:Jedi }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,person
0,http://dbpedia.org/resource/Rey_(Star_Wars)
1,http://dbpedia.org/resource/Mara_Jade
2,http://dbpedia.org/resource/Qui-Gon_Jinn
3,http://dbpedia.org/resource/Count_Dooku
4,http://dbpedia.org/resource/Quinlan_Vos
5,http://dbpedia.org/resource/General_Grievous
6,http://dbpedia.org/resource/Luke_Skywalker
7,http://dbpedia.org/resource/Mace_Windu
8,http://dbpedia.org/resource/Starkiller
9,http://dbpedia.org/resource/Ahsoka_Tano


7. And who is actually a Sith (spoiler alert)?  

In [72]:
sparql.setQuery("""
    SELECT ?person WHERE { ?person dbo:occupation dbr:Sith }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,person
0,http://dbpedia.org/resource/Count_Dooku
1,http://dbpedia.org/resource/Starkiller
2,http://dbpedia.org/resource/Darth_Plagueis
3,http://dbpedia.org/resource/Darth_Maul
4,http://dbpedia.org/resource/Darth_Vader
5,http://dbpedia.org/resource/Palpatine
6,http://dbpedia.org/resource/Asajj_Ventress


8. For those who don't want to know, how many are there?

In [73]:
sparql.setQuery("""
    SELECT count(?person) WHERE { ?person dbo:occupation dbr:Sith }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0
0,7


9. With Jedi becoming a career opportunity, what occupations are there anyway (incomplete list)?

In [74]:
sparql.setQuery("""
    SELECT ?person ?occupation WHERE { ?person dbo:occupation ?occupation } ORDER BY ?occupation
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,person,occupation
0,http://dbpedia.org/resource/Everett_Alvarez_Jr.,http://alvarezassociates.com/
1,http://dbpedia.org/resource/Fred_Hassan,http://caretgroup.com/
2,http://dbpedia.org/resource/Lisa_Poppaw,http://childsafecolorado.org/Staff.html
3,http://dbpedia.org/resource/David_Brailer,http://cigna.com/
4,http://dbpedia.org/resource/Geoffrey_Cowan,http://communicationleadership.usc.edu/
...,...,...
9995,http://dbpedia.org/resource/Jack_Perrin,http://dbpedia.org/resource/Actor
9996,http://dbpedia.org/resource/Jack_Plotnick,http://dbpedia.org/resource/Actor
9997,http://dbpedia.org/resource/Jack_Raymond,http://dbpedia.org/resource/Actor
9998,http://dbpedia.org/resource/Jack_Ryder_(actor),http://dbpedia.org/resource/Actor


10. Impressive, but how many people are there in DBpedia anyway, when looking at nationality?

In [75]:
sparql.setQuery("""
    SELECT count(?nationality) ?nationality 
       WHERE { ?person dbo:nationality ?nationality } ORDER BY ?nationality
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0,nationality
0,1,http://dbpedia.org/resource/18th-century_histo...
1,1,http://dbpedia.org/resource/1931_French_Grand_...
2,1,http://dbpedia.org/resource/1951_French_Grand_...
3,1,http://dbpedia.org/resource/1952_Italian_Grand...
4,1,http://dbpedia.org/resource/1956_Argentine_Gra...
...,...,...
3611,1,http://dbpedia.org/resource/Åland
3612,7,http://dbpedia.org/resource/Úrvalsdeild_karla_...
3613,1,http://dbpedia.org/resource/Úrvalsdeild_kvenna...
3614,2,http://dbpedia.org/resource/Đại_Cồ_Việt


11. Hmm. Some interesting nationalities there... Also, this may be a somewhat biased world view. Let's sort that by "impact on Wikipedia"?

In [76]:
sparql.setQuery("""
    SELECT count(?nationality) AS ?count ?nationality 
       WHERE { ?person dbo:nationality ?nationality } ORDER BY DESC(?count)
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,count,nationality
0,20971,http://dbpedia.org/resource/United_States
1,7260,http://dbpedia.org/resource/United_Kingdom
2,4818,http://dbpedia.org/resource/India
3,4764,http://dbpedia.org/resource/Americans
4,3912,http://dbpedia.org/resource/Canadians
...,...,...
3611,1,http://dbpedia.org/resource/Fijian_people
3612,1,http://dbpedia.org/resource/Limbu_people
3613,1,http://dbpedia.org/resource/Dubrovnik
3614,1,http://dbpedia.org/resource/Teochew_people


12. Etcetera.
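
One encoding issue visible in the nationality ranking is that the same nationality shows up under different identifiers, e.g. `United_States` and `Americans`. The sketch below merges such entries using counts copied from the output above. The `DEMONYM_TO_COUNTRY` mapping is a hypothetical, hand-made table for illustration, not something DBpedia provides in this form.

```python
from collections import Counter

# Counts copied from the top rows of the output above.
raw_counts = {
    "United_States": 20971,
    "United_Kingdom": 7260,
    "India": 4818,
    "Americans": 4764,
    "Canadians": 3912,
}

# Hypothetical, hand-made mapping from demonym resources to country resources.
DEMONYM_TO_COUNTRY = {
    "Americans": "United_States",
    "Canadians": "Canada",
}

merged = Counter()
for label, count in raw_counts.items():
    # Fall back to the label itself when no mapping is known.
    merged[DEMONYM_TO_COUNTRY.get(label, label)] += count

print(merged.most_common())
```

On these five rows alone, merging raises the United States total from 20971 to 25735, which suggests the raw ranking understates some nationalities more than others.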

**Part 3**

Are there issues with completeness of encoding (does Rembrandt have an occupation or nationality?) or with selection and bias/representation that you observed in this example walk?

In [77]:
#Checking if Rembrandt has an occupation
sparql.setQuery("""
    SELECT ?occupation WHERE { dbr:Rembrandt dbo:occupation ?occupation }
""")
result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,occupation


In [78]:
#Checking if Rembrandt has a nationality
sparql.setQuery("""
    SELECT ?nationality WHERE { dbr:Rembrandt dbo:nationality ?nationality }
""")
result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,nationality


### According to the code, there is no nationality or occupation for Rembrandt. This is consistent with the DBpedia page, which lists no nationality or occupation attributes.

## Make your own walk!

In a similar way to the example walk, make your own walk through DBpedia/Wikipedia. Explore some of the power that SPARQL queries give you, while staying aware of the limitations and bias of the collection and its encoding. We reward creativity as much as technical skill: some of the most interesting queries are very simple SPARQL statements! Just like in Assignment 1, please pick a topic that you are interested in, not one of our boring example ones.

**(Your answers with query code blocks here)**

# 5. Annotating a corpus

Up to now, we have only queried DBpedia/Wikipedia, but the true power of linked open data is the ability to connect any corpus to the entities in DBpedia.  Manually annotating a corpus is very laborious, but automatic tools for entity linking can potentially annotate any DBpedia entity found in any text.

In Python, spaCy has a DBpedia Spotlight-based entity recognizer. spaCy is a very useful Python tool that can handle a large variety of text processing tasks, including named entity recognition, sentiment analysis, part-of-speech tagging and text categorization, for various languages. It is definitely worth exploring further in your project or some other time.

As this assignment is already long enough, we will only briefly show how it works and ask you to experiment with it in a free format.

Let's install it:

In [79]:
!pip install spacy-dbpedia-spotlight

Collecting spacy-dbpedia-spotlight
  Downloading spacy_dbpedia_spotlight-0.2.6.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting spacy<4.0.0,>=3.0.0 (from spacy-dbpedia-spotlight)
  Downloading spacy-3.5.2-cp310-cp310-macosx_11_0_arm64.whl (6.6 MB)
Collecting loguru (from spacy-dbpedia-spotlight)
  Downloading loguru-0.7.0-py3-none-any.whl (59 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy<4.0.0,>=3.0.0->spacy-dbpedia-spotlight)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl (29 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy<4.0.0,>=3.0.0->spacy-dbpedia-spotlight)
  Downloading spacy_loggers-1.0.4-py3-none-any.whl (11 kB)
Collecting murmurhash<1.1.0,>=0.28.

We will make a text processing pipeline that only includes the DBpedia entity linker, and nothing else:

In [80]:
import spacy_dbpedia_spotlight
# a new blank model will be created, with the language code provided in the parameter
nlp = spacy_dbpedia_spotlight.create('en')
# in this case, the pipeline will only contain the EntityLinker
print(nlp.pipe_names)
# ['dbpedia_spotlight']

['dbpedia_spotlight']


If you are able to experiment with a different language than English, we encourage you to try it! For that, you can change the 'en' language code above to a different language code.

Now, we simply need to call `nlp` on a string, and it will attempt to recognize DBpedia entities in it:

In [81]:
doc = nlp('The University Of Amsterdam is a Dutch higher education institution located in Amsterdam.')

print(doc.ents)  # This just prints the entities that were found

# This prints some more details, including the DBpedia identifier and the similarity score:
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])

(University Of Amsterdam, Dutch, Amsterdam)
[('University Of Amsterdam', 'http://dbpedia.org/resource/University_of_Amsterdam', '1.0'), ('Dutch', 'http://dbpedia.org/resource/Netherlands', '0.7255664488349297'), ('Amsterdam', 'http://dbpedia.org/resource/Amsterdam', '0.9999436227871693')]


Let's just make a tiny change to the input and see if the output changes:

In [82]:
doc = nlp('The University of Amsterdam is a Dutch higher education institution located in Amsterdam.')
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])

[('Amsterdam', 'http://dbpedia.org/resource/Amsterdam', '0.9999436227871693'), ('Dutch', 'http://dbpedia.org/resource/Netherlands', '0.7255664488349297'), ('Amsterdam', 'http://dbpedia.org/resource/Amsterdam', '0.9999436227871693')]


Interesting. With a lowercase "of", it no longer recognizes the university as a single entity in my case; only "Amsterdam" is linked.
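
The similarity scores in the outputs above are strings, so they need to be converted to numbers before they can be compared. As a minimal sketch, the code below filters the tuples from the first output by score. The helper name `confident_entities` and the 0.9 threshold are arbitrary assumptions for illustration, not recommended settings.

```python
# Tuples copied from the first output above: (text, DBpedia URI, similarity score).
linked = [
    ("University Of Amsterdam", "http://dbpedia.org/resource/University_of_Amsterdam", "1.0"),
    ("Dutch", "http://dbpedia.org/resource/Netherlands", "0.7255664488349297"),
    ("Amsterdam", "http://dbpedia.org/resource/Amsterdam", "0.9999436227871693"),
]


def confident_entities(entities, threshold=0.9):
    """Keep (text, URI) pairs whose similarity score is at least `threshold`."""
    # Scores arrive as strings, so convert them to float before comparing.
    return [(text, uri) for text, uri, score in entities if float(score) >= threshold]


for text, uri in confident_entities(linked):
    print(text, "->", uri)
```

With this threshold the "Dutch" link to `Netherlands` is dropped, which shows how a cut-off trades recall for precision.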

**Part 4**

Now, it's your turn to experiment with entity linking. Find a paragraph of text anywhere in your target language and try it for yourself. Text with concrete names of persons, organizations, locations, events, ... (like news) may have more entities than abstract philosophy. How well does this work? Did it find all or most entities? Do you see "errors"?

In [83]:
# Your experimentation here

**(Your answer here)**

Now assume you have a large text corpus annotated in this way (think up your own corpus of interest, or otherwise think about the corpus of movie reviews you explored earlier).   

Can you think of something you could explore using these annotations? For example, in the movie reviews collection from Assignment 2 we could look at particular actors, but we could also compare male and female actors as two groups, running aggregated queries about male and female actor mentions across the whole corpus of negative and positive movie reviews (similar to the SPARQL queries earlier).
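
To give an idea of the shape such an aggregation could take, here is a minimal sketch that counts entity mentions per review label. The corpus below is made up: the labels and the linked URIs are placeholders, not real annotation output, and `mentions_by_label` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical annotated corpus: (review label, DBpedia URIs linked in that review).
# Labels and URIs are made-up placeholders, not real annotation output.
annotated_reviews = [
    ("pos", ["http://dbpedia.org/resource/Amsterdam", "http://dbpedia.org/resource/Netherlands"]),
    ("neg", ["http://dbpedia.org/resource/Amsterdam"]),
    ("pos", ["http://dbpedia.org/resource/Amsterdam"]),
]


def mentions_by_label(reviews):
    """Count how often each entity URI is mentioned, separately per label."""
    counts = {}
    for label, uris in reviews:
        counts.setdefault(label, Counter()).update(uris)
    return counts


counts = mentions_by_label(annotated_reviews)
for label, counter in counts.items():
    print(label, counter.most_common())
```

Grouping the URIs further (for instance by a property looked up in DBpedia) would turn these per-entity counts into the kind of group comparison described above.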

**(Your answer here)**
