# COLX 561 Lab Assignment 2: Predicates (Cheat sheet)

## Assignment objectives

In this assignment you will:
- Write formulas in first-order logic corresponding to the semantics of simple sentences
- Access semantic information contained in DBpedia and identify new semantic relationships 

## Getting started

For this assignment, you will need the "Mapping Based Objects" from the 2016 version of DBpedia, which can be downloaded directly from the DBpedia site [here](http://downloads.dbpedia.org/2016-10/core-i18n/en/mappingbased_objects_en.ttl.bz2). This file is too large to be included in your GitHub repo, so for this file you should ignore the lab instructions and store it elsewhere. Do **NOT** unzip the file; keep it as is.
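The compressed file can be read directly, without unpacking it on disk. The following is a minimal sketch using only the standard library; the temporary file and its two lines are made up for illustration and merely stand in for the real download.

```python
import bz2
import os
import tempfile

# Hypothetical stand-in for the DBpedia dump: a tiny bz2 file with two lines.
sample_text = "# started 2017-05-25T05:44:10Z\n<s> <p> <o> .\n"
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.ttl.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as out:
    out.write(sample_text)

# "rt" mode decompresses on the fly and yields one decoded line at a time,
# so the compressed file never has to be unpacked on disk.
with bz2.open(sample_path, "rt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)
```

Because the file object is iterated lazily, memory use stays small even for the full dump, as long as the lines themselves are not all kept in a list as they are in this toy example.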

Run the code below to access relevant modules (you can add to this as needed):

In [29]:
#provided code
# !python3 -m pip install --user rdflib
# !python3 -m pip install --user owlrl

import nltk
read_expr = nltk.sem.Expression.fromstring

import bz2
from collections import defaultdict, Counter
from nltk.sem import Valuation, Model, Assignment

import rdflib
from rdflib import URIRef, Literal
from rdflib.namespace import RDF, RDFS, FOAF, OWL
import owlrl

## Tidy submission

rubric={mechanics:1}

To get the marks for tidy submission:

- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)
- The one exception: do **NOT** include the data in your repo. Put it elsewhere on your hard drive and modify the path below

In [4]:
#provided code
path_to_data = "/Users/your-path-to/" # change this path!

# wget https://downloads.dbpedia.org/2016-10/core-i18n/en/mappingbased_objects_en.ttl.bz2
# MDS-CL % du -h mappingbased_objects_en.ttl.bz2 
# 176M	mappingbased_objects_en.ttl.bz2

### Exercise 1: First order logic for sentences

For each of the sentences below, write an expression in first order logic which represents the semantics of the sentence. Assign it a variable `pn` (e.g. `p0`, `p1`, `p2`, etc.) where `n` is the subexercise number. An example is given for you (which you should use as a premise for **Exercise 1.7**).

*John is a person*

In [33]:
#provided code
p0 = read_expr("person(John)")

Some examples:

- _every undergraduate student is a student_: $\forall_x.(\text{undergraduate\_student}(x) \rightarrow \text{student}(x))$
- _all cubes are small_: $\forall_x.(\text{cube}(x) \rightarrow \text{small}(x))$
- _Every player for Tottenham Hotspur F.C. is a good player_: $\forall_x.(\text{play\_for}(x, \text{Tottenham}) \rightarrow \text{good}(x))$
- _All blue cars are fast_: $\forall_x.((\text{blue}(x) \land \text{car}(x)) \rightarrow \text{fast}(x))$
- _John owns a house_: $\exists_x.(\text{owns}(\text{John}, x) \land \text{house}(x))$

#### Exercise 1.1

rubric={raw:1}

*Fluffy is a black cat and Rover is a black dog*

In [7]:
p1 = 

#### Exercise 1.2
rubric={raw:1}

*Dogs and cats are both kinds of pets*

In [8]:
p2 = 

#### Exercise 1.3

rubric={raw:1}

*People like pets*

In [9]:
p3 = 

#### Exercise 1.4

rubric={raw:2}

*Black cats don't like black dogs*

In [10]:
p4 = 

#### Exercise 1.5

rubric={raw:2}

*Cats are happy as long as someone likes them*

In [11]:
p5 = 

#### Exercise 1.6

rubric={raw:2}

*Dogs aren't happy if there is anyone (or anything) that dislikes them*

In [12]:
p6 = 

#### Exercise 1.7

rubric={raw:2}

Convert each of the two sentences below into first order logic and then prove they follow from (are implied by) the sentences above. In each case you should **only** include relevant premises (if you include irrelevant premises your code may crash!).

*Fluffy is happy*

*Rover is not happy*

In [13]:
c = read_expr("happy(Fluffy)")
nltk.TableauProver().prove(c, [which logic ... p0,p1,p2,p3,p4,p5,p6])

True

In [14]:
c = read_expr("-happy(Rover)")
nltk.TableauProver().prove(c, [which logic ...])

True

### Exercise 2: Accessing the RDF triples in DBpedia

rubric={accuracy:2}

For **Exercises 3-5**, you will be using RDF triples derived from the infoboxes of English Wikipedia. You need to load them and convert them into a readable format. Some starter code is provided, which loads the file assuming it is in the `path_to_data` directory (still compressed in bz2 format) and iterates over it line by line. The lines are formatted as follows:

```
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_States> .
```

We will refer to the first element of each triple as the *subject*, the second as the *predicate*, and the third as the *object*. We are only interested in triples where all three elements are `dbpedia.org` URLs; skip any other triple. For each triple you keep, remove the nondistinctive parts of the URL from each element and add the triple to a Python dict which maps from a `(subject, object)` tuple to a list of all the predicates the pair participates in. (This format is less efficient than storing mappings from predicates to `(subject, object)` pairs, but very useful for Exercise 3.)

For example, if you only read in the line above, you would create a dictionary like this:

```
predicates = {("Alabama", "United States"): ["country"]}
```

At the same time, create a mapping from each of your identifiers (e.g. "Alabama", "United States", and "country" in the example above) to its original URI (e.g. ```http://dbpedia.org/resource/Alabama```) in a Python dict called `uri_lookup`. You'll use this in Exercise 4.
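One possible way to turn lines into these two structures is sketched below on three sample lines taken from the file. The filter `startswith("<http://dbpedia.org/")` and the whitespace split are assumptions about what counts as a usable triple, not the official solution, so the totals on the full dataset may differ from the expected counts if your filter differs.

```python
from collections import defaultdict

def extract_name(uri):
    """Return the last path segment of '<http://.../Some_Name>' with spaces."""
    return uri[uri.rfind("/") + 1:-1].replace("_", " ")

sample_lines = [
    "# started 2017-05-25T05:44:10Z",
    "<http://dbpedia.org/resource/Anarchism> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://dbpedia.org/resource/France> .",
    "<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_States> .",
]

predicates = defaultdict(list)
uri_lookup = {}

for line in sample_lines:
    parts = line.split()
    # A usable line looks like '<subject> <predicate> <object> .';
    # anything else (e.g. the leading comment line) is skipped.
    if len(parts) != 4 or parts[3] != ".":
        continue
    triple = parts[:3]
    # Assumption: "a dbpedia.org URL" means the URI starts with this prefix.
    if not all(el.startswith("<http://dbpedia.org/") for el in triple):
        continue
    subj, pred, obj = (extract_name(el) for el in triple)
    predicates[(subj, obj)].append(pred)
    for name, el in zip((subj, pred, obj), triple):
        uri_lookup[name] = el[1:-1]  # strip the surrounding angle brackets

print(dict(predicates))
print(uri_lookup["Alabama"])
```

The `seeAlso` line is dropped because its predicate lives under `www.w3.org`, which matches the KO/OK markings in the listing of the file's first lines.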

```
MDS-CL % gzcat mappingbased_objects_en.ttl.bz2 | head
# started 2017-05-25T05:44:10Z
<http://dbpedia.org/resource/Anarchism> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://dbpedia.org/resource/Anarchist_terminology> . <-- KO
<http://dbpedia.org/resource/Anarchism> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://dbpedia.org/resource/Anarchism> . <-- KO
<http://dbpedia.org/resource/Anarchism> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://dbpedia.org/resource/France> . <-- KO
<http://dbpedia.org/resource/Anarchism> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://dbpedia.org/resource/Violence> . <-- KO
<http://dbpedia.org/resource/Anarchism> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://dbpedia.org/resource/Education> . <-- KO
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_States> . <-- OK
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/language> <http://dbpedia.org/resource/English_American> . <-- OK
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/capital> <http://dbpedia.org/resource/Montgomery,_Alabama> . <-- OK
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/largestCity> <http://dbpedia.org/resource/Birmingham,_Alabama> . <-- OK
```

Infobox: 

![Alabama](alabama.jpg)


- The **Semantic Web** aims to make Internet data machine-readable (source: https://en.wikipedia.org/wiki/Semantic_Web)
- **Linked Data**  is structured data which is interlinked with other data so it becomes more useful through semantic queries (source: https://en.wikipedia.org/wiki/Linked_data)
- **SPARQL** is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in Resource Description Framework (RDF) format (source: https://en.wikipedia.org/wiki/SPARQL)
- **Entity linking**, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN), is the task of assigning a unique identity to entities (such as famous individuals, locations, or companies) mentioned in text.



![Paris](paris.png)



- *Paris* (mythology), a prince of Troy in Greek mythology
- *Paris* (city), the capital of France
- *Paris, Texas* (film), a 1984 drama directed by Wim Wenders
- *Paris* Hilton (person), an American media personality

- **DBpedia** is a project aiming to extract structured content from the information created in the Wikipedia project (source: https://en.wikipedia.org/wiki/DBpedia / https://wiki.dbpedia.org)

- **Freebase**  was a large collaborative knowledge base consisting of data composed mainly by its community members (source: https://en.wikipedia.org/wiki/Freebase_(database) / https://developers.google.com/freebase)

- **YAGO (Yet Another Great Ontology)** is an open source knowledge base developed at the Max Planck Institute for Computer Science in Saarbrücken. It is automatically extracted from Wikipedia and other sources (source: https://en.wikipedia.org/wiki/YAGO_(database) / https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago)

- SemEval-2021 Task > Information in scientific & clinical text > Task 11: **NLPContributionGraph** by Jennifer D’Souza, Sören Auer, and Ted Pedersen (https://ncg-task.github.io)

In [15]:
## takes long....

#my code here
def extract_name(url):
    return url[url.rfind("/")+1:-1].replace("_", " ")
#my code here

predicates = defaultdict(list)
uri_lookup = {}
f = bz2.open(path_to_data + "mappingbased_objects_en.ttl.bz2", "rt", encoding="utf-8")

for line in f:
    # your code here

    # your code here
f.close()

In [1]:

assert len(predicates) == 17320526
assert ('Animalia (book)', 'Graeme Base') in predicates
assert predicates[('Animalia (book)', 'Graeme Base')] == ['author', 'illustrator']
assert uri_lookup["Animalia (book)"] == "http://dbpedia.org/resource/Animalia_(book)"
assert len(uri_lookup) == 5381326
print("Success!")

# # assert len(predicates_pickle) == 17320526
# # assert ('Animalia (book)', 'Graeme Base') in predicates_pickle
# # assert predicates_pickle[('Animalia (book)', 'Graeme Base')] == ['author', 'illustrator']
# # assert url_lookup_pickle["Animalia (book)"] == "http://dbpedia.org/resource/Animalia_(book)"
# # assert len(url_lookup_pickle) == 5381326
# # print("Success!")


Success!


### Exercise 3: Implied predicates

As with *pet* and *dog*, one predicate will often imply another. Your first task using the DBpedia data is to identify such predicates. In a bottom-up approach, we can identify them by noting that, if Pred1 implies Pred2, any `(subject, object)` argument pair (from the `predicates` dict built above) which has Pred1 in its list of predicates will also have Pred2. This in turn means that the count of argument pairs which have both Pred1 and Pred2 as valid predicates will be equal to the count of argument pairs which have Pred1.

#### Exercise 3.1

rubric={accuracy:2,efficiency:1}

The first thing you need to do to derive your implications is to complete the `get_pred_counts` function, which counts how often individual predicates appear in `predicates`, and how often pairs of predicates appear together (with the same arguments). You should avoid creating duplicate counts for pairs in opposite order (i.e. you should have `("officialLanguage","language")` or `("language","officialLanguage")`, but not both!).

`pred_counts["officialLanguage"] = 561`

`pred_counts["language"] = 108030`


```
('Canada', 'O_Canada')	anthem
('Canada', 'God_Save_the_Queen')	anthem
('Canada', 'Federalism')	governmentType
('Canada', 'Parliamentary_system')	governmentType
('Canada', 'Representative_democracy')	governmentType
('Canada', 'Constitutional_monarchy')	governmentType
('Canada', 'Canadian_dollar')	currency
('Canada', 'Ottawa')	capital
('Canada', 'Toronto')	largestCity
('Canada', 'English_language') officialLanguage         <--
('Canada', 'French_language') officialLanguage         
('Canada', 'English_language') language                 <--
('Canada', 'French_language')  language                 


```

`pred_pair_counts[("language","officialLanguage")] = 561` where `language` > `officialLanguage`

```
('Canada', 'English_language') officialLanguage     <---|
('Canada', 'French_language') officialLanguage          |
('Canada', 'English_language') language             <---|
('Canada', 'French_language') language
('Canada', 'Cree_language') language
```
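The counting described above can be sketched on toy data as follows. The dictionary `toy_predicates` is illustrative only, and sorting and de-duplicating each predicate list before pairing is an assumption, one of several ways to make sure each unordered pair is counted under a single key.

```python
from collections import Counter
from itertools import combinations

# Toy data in the same shape as `predicates`; the entries are illustrative.
toy_predicates = {
    ("Canada", "English language"): ["officialLanguage", "language"],
    ("Canada", "French language"): ["officialLanguage", "language"],
    ("Canada", "Cree language"): ["language"],
    ("Canada", "Ottawa"): ["capital"],
}

pred_counts = Counter()
pred_pair_counts = Counter()
for pred_list in toy_predicates.values():
    unique_preds = sorted(set(pred_list))
    pred_counts.update(unique_preds)
    # combinations() over a sorted list yields each unordered pair once,
    # always in the same (alphabetical) order.
    pred_pair_counts.update(combinations(unique_preds, 2))

print(pred_counts["language"], pred_counts["officialLanguage"])
print(pred_pair_counts[("language", "officialLanguage")])
```

With this ordering the key is always `("language", "officialLanguage")` and never the reverse, which is the property the provided assertions check.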

In [17]:
def get_pred_counts(predicates):
    '''This function counts how many times predicates appear individually (pred_counts) and 
    how often they appear together with the same (subject, object) pair (pred_pair_counts)
    It returns a tuple with each of these counts, which are Counters'''
    pred_counts = Counter()
    pred_pair_counts = Counter()
    # your code here
 
    # your code here
    return pred_counts,pred_pair_counts

In [5]:
pred_counts, pred_pair_counts = get_pred_counts(predicates)
assert pred_counts["officialLanguage"] == 561
assert ("officialLanguage","language") in pred_pair_counts or ("language","officialLanguage") in pred_pair_counts
if ("officialLanguage","language") in pred_pair_counts:
    assert ("language","officialLanguage") not in pred_pair_counts
    assert pred_pair_counts[("officialLanguage","language")] == 561
else:
    print("language > officialLanguage")
    assert pred_pair_counts[("language","officialLanguage")] == 561    
    
print("Success!")

language > officialLanguage
Success!


#### Exercise 3.2

rubric={accuracy:2}

Now use the output of `get_pred_counts` to identify predicates which imply one another, based on the logic discussed above. Print out a list of the predicate pairs you find in the form of a logical implication (i.e. `predicate_1 -> predicate_2`). Some predicates are equivalent based on the data (the implication goes both ways); for those you should print out both implications (`predicate_1 -> predicate_2` and `predicate_2 -> predicate_1`).

`officialLanguage -> language` where `pred_counts['officialLanguage'] = 561` $<$ `pred_counts['language'] = 108030`
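The test "the pair count equals the count of the implying predicate" can be sketched as below. The numbers are hypothetical toy counts in the shape returned by `get_pred_counts`, not values from the real data.

```python
from collections import Counter

# Hypothetical counts, in the shape returned by get_pred_counts.
pred_counts = Counter(
    {"language": 3, "officialLanguage": 2, "mouthMountain": 4, "mouthPlace": 4, "capital": 5}
)
pred_pair_counts = Counter(
    {
        ("language", "officialLanguage"): 2,
        ("mouthMountain", "mouthPlace"): 4,
        ("capital", "language"): 1,
    }
)

implications = []
for (pred_a, pred_b), together in pred_pair_counts.items():
    # pred_a -> pred_b holds in the data when every argument pair that has
    # pred_a also has pred_b, i.e. the joint count equals pred_a's own count.
    if together == pred_counts[pred_a]:
        implications.append((pred_a, pred_b))
    # The same check in the other direction; both may hold (equivalence).
    if together == pred_counts[pred_b]:
        implications.append((pred_b, pred_a))

for lhs, rhs in implications:
    print(f"{lhs} -> {rhs}")
```

Very rare predicates can pass this test by coincidence, so it may be worth inspecting the counts behind each implication before trusting it.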

In [20]:
pred_counts, pred_pair_counts = get_pred_counts(predicates)

# your code here


officialLanguage -> language
associatedMusicalArtist -> associatedBand
mouthMountain -> mouthPlace
mouthPlace -> mouthMountain
stateOfOrigin -> nationality
sourceMountain -> sourcePlace
sourcePlace -> sourceMountain
capitalMountain -> capitalPlace
capitalPlace -> capitalMountain
regionalLanguage -> language
sourceConfluenceMountain -> sourceConfluencePlace
sourceConfluencePlace -> sourceConfluenceMountain
lowestMountain -> lowestPlace
capitalCountry -> state
highestMountain -> highestPlace
distributingCompany -> distributingLabel
distributingLabel -> distributingCompany
governmentMountain -> governmentPlace
governmentPlace -> governmentMountain
managementMountain -> managementPlace
managementPlace -> managementMountain
legalForm -> type
firstLaunchRocket -> associatedRocket
lastLaunchRocket -> associatedRocket
firstLaunchRocket -> lastLaunchRocket
lastLaunchRocket -> firstLaunchRocket
projectCoordinator -> projectParticipant


#### Exercise 3.3

rubric={accuracy:3,efficiency:1}

Pick one of the pairs of predicates you found above, and (programmatically) build an NLTK semantic model of the data you have for those two predicates (and just those two predicates). The code from lecture for doing that is provided, but you must create a representation that can be turned into a valuation (call it `v`). Then prove (using the `m.evaluate()` method) that one implies the other in the model you've built. Hint: your final logical formula should involve universal quantification.

```
>>> import nltk
>>> from nltk.sem import Valuation, Model
>>> v = [('adam', 'b1'), ('betty', 'g1'), ('fido', 'd1'),
... ('girl', set(['g1', 'g2'])), ('boy', set(['b1', 'b2'])),
... ('dog', set(['d1'])),
... ('love', set([('b1', 'g1'), ('b2', 'g2'), ('g1', 'b1'), ('g2', 'b1')]))]
>>> val = Valuation(v)
>>> dom = val.domain
>>> m = Model(dom, val)
>>> g = nltk.sem.Assignment(dom)
>>> m.evaluate('all x.(boy(x) -> - girl(x))', g)
True
```

See http://www.nltk.org/howto/semantics.html (nltk Semantics).
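Before writing the NLTK formula, it can help to see what a universally quantified implication over a two-place predicate amounts to. The sketch below is a plain-Python analogue of that check, not the NLTK API; the `relations` dict and the `implies` helper are hypothetical names, and the data is a toy sample.

```python
# Toy relations, each a set of (subject, object) pairs; entries are illustrative.
relations = {
    "officialLanguage": {
        ("Canada", "English language"),
        ("Canada", "French language"),
    },
    "language": {
        ("Canada", "English language"),
        ("Canada", "French language"),
        ("Canada", "Cree language"),
    },
}

# The domain is every individual mentioned in any relation.
domain = {entity for pairs in relations.values() for pair in pairs for entity in pair}

def implies(rel_a, rel_b):
    """Check 'all x. all y. (rel_a(x, y) -> rel_b(x, y))' by brute force."""
    return all(
        (x, y) not in relations[rel_a] or (x, y) in relations[rel_b]
        for x in domain
        for y in domain
    )

print(implies("officialLanguage", "language"))
print(implies("language", "officialLanguage"))
```

The implication `A -> B` is written as `not A or B`, and both variables range over the whole domain, which mirrors how a model evaluates a formula with two universal quantifiers.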

In [32]:
wanted_preds = {"language": set(), "officialLanguage": set()}
# your code here

...
# your code here

val = Valuation(v)
dom = val.domain
m = Model(dom, val)
g = Assignment(dom)

# your code here
m.evaluate('all x.(boy(x) -> - girl(x))', g) #  officialLanguage -> language
# your code here

True

`print(wanted_preds['language'])`:
```
{('He Died Fifteen Years Ago', 'Spanish language'),
 ('Achha Bura', 'Standard Hindi'),
 ('Murut people', 'Malaysian language'),
 ('The Hunt (1963 film)', 'Cinema of Portugal'),
 ("The Captain's Ship", 'Spanish language'),
 ('Khmer people', 'Khmer language'),
 ('Granma (newspaper)', 'Spanish language'),
... }
```

`print(wanted_preds['officialLanguage'])`:
```
{('Abkhazia', 'Abkhaz language'),
 ('Abkhazia', 'Russian language'),
 ('Abyei', 'Arabic language'),
 ('Adjara', 'Georgian language'),
 ('Adélie Land', 'French language'),
...}
```

### Exercise 4: Adding a new DBpedia predicate

DBpedia has two predicates that connect authors and their works: `author`, which has the book as the subject and the author as the object (it is derived from the field in the infobox of the Wikipedia page associated with a book), and `notableWork`, which has the author as the subject and the work as the object (it is derived from the Wikipedia page associated with the author). The `notableWork` predicate, however, only mentions the author's most notable works. You're going to create a new predicate `work` which is like `notableWork` in argument structure but includes the information from both the `notableWork` and `author` predicates (if Y is a `notableWork` of X, then Y is also a `work` of X; if X is an `author` of Y, then Y is a `work` of X), and then place class restrictions on the subject and object of your new predicate.


- `(Haruki Murakami, notableWork, 1Q84)` $\rightarrow$ `(Haruki Murakami, work, 1Q84)`
- `(1Q84, author, Haruki Murakami)` $\rightarrow$ `(Haruki Murakami, work, 1Q84)`
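The two rules above can be sanity-checked in plain Python before involving a reasoner. This sketch only illustrates which `work` triples should come out; it is not a substitute for the OWL deduction the exercise asks for, and the triples are a toy sample using names that appear in this notebook.

```python
# Toy triples (subject, predicate, object); illustrative only.
triples = {
    ("Haruki Murakami", "notableWork", "1Q84"),
    ("1Q84", "author", "Haruki Murakami"),
    ("Animalia (book)", "author", "Graeme Base"),
}

inferred = set()
for subj, pred, obj in triples:
    if pred == "notableWork":
        # notableWork is a subproperty of work: keep the argument order.
        inferred.add((subj, "work", obj))
    elif pred == "author":
        # work is the inverse of author: swap subject and object.
        inferred.add((obj, "work", subj))

for triple in sorted(inferred):
    print(triple)
```

Both Murakami triples lead to the same `work` triple, so the set contains it only once. This is why the number of deduced `work` edges can be smaller than the number of input edges.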

#### 4.1 
rubric={accuracy:2}

First, create an rdflib graph which contains the information associated with the `notableWork` and `author` predicates in DBpedia (and only that information). You should use the Python data structures you built in Exercise 2 (`predicates` and `uri_lookup`), rather than loading the RDF from disk. Since later parts of this exercise can take a lot of time with the full dataset, it's a good idea to load only part of the data while you're setting things up and debugging.

In [7]:
g = rdflib.Graph()

# your code here

# add "notableWork" in pred_list to g with its sbj and obj; 
# g should contain: 
#   URIRef("http://dbpedia.org/resource/Graeme_Base"),      <- sbj
#   URIRef("http://dbpedia.org/ontology/notableWork"),      <- predicate
#   URIRef("http://dbpedia.org/resource/Animalia_(book)")   <- obj

# add "author" in pred_list to g with its sbj and obj
    
# your code here

<Graph identifier=N0badbf399d6a47c0bd95958eb10315f4 (<class 'rdflib.graph.Graph'>)>


In [23]:
assert len(g) == 68295 # 36228
assert (URIRef("http://dbpedia.org/resource/Animalia_(book)"), URIRef("http://dbpedia.org/ontology/author"), URIRef("http://dbpedia.org/resource/Graeme_Base")) in g
assert (URIRef("http://dbpedia.org/resource/Graeme_Base"), URIRef("http://dbpedia.org/ontology/notableWork"), URIRef("http://dbpedia.org/resource/Animalia_(book)")) in g
print("Success!")

Success!


#### 4.2
rubric={accuracy:2}

**Define a `work` predicate** (pretend it is an existing DBpedia predicate) and then **add two additional edges to the graph which indicate the OWL relationship to `author` ('inverse of') and `notableWork` ('subproperty of')**. Then deduce the individual relationships for `work` using `owlrl`. This takes a while for the full dataset. You will get no points on this problem if you manually add the edges corresponding to individual instances of the `work` predicate to the graph; you must deduce them from the other predicates.

In [24]:
## takes long......

notable_work_ref = URIRef(uri_lookup["notableWork"])
author_ref = URIRef(uri_lookup["author"])
work_ref = URIRef("http://dbpedia.org/ontology/work")

# your code here
# add: 'work' is the inverse of 'author'
# add: 'notableWork' is a subproperty of 'work'

# your code here

owlrl.DeductiveClosure(owlrl.CombinedClosure.RDFS_OWLRL_Semantics).expand(g)

In [25]:
count = 0
for subj,pred,obj in g:
    if "work" in pred:
        count += 1
assert count == 65563 # 34683
print("Success!")

Success!


#### 4.3
rubric={accuracy:2, quality:1}

Add appropriate class restrictions on the subjects and objects of your new `work` predicate, i.e. that the subject must be a person and the object must be a book. **Recall that `RDFS.domain` is used to limit the subject ('person'), and `RDFS.range` the object ('book')**. You can just make up URIs for your classes, like we did in lecture. Again, do not assign your classes to particular authors/books manually; you must deduce them!

After you have deduced the classes, create a test which shows that all the books from the original `author` relation have been identified as books (assigned your new book class) via the restrictions on the `work` predicate. 
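The effect of domain and range restrictions, and the shape of the requested test, can be illustrated without a reasoner. In RDFS, a triple `(s, p, o)` where `p` has domain `C` entails that `s` has type `C`, and likewise for range and `o`. The sketch below applies that rule by hand to toy triples; the variable names and the string labels `"person"` and `"book"` are hypothetical.

```python
# Toy data: `work` triples as they would look after deduction, plus the
# original `author` triples. Entries are illustrative.
work_triples = [
    ("Graeme Base", "work", "Animalia (book)"),
    ("Haruki Murakami", "work", "1Q84"),
]
author_triples = [
    ("Animalia (book)", "author", "Graeme Base"),
    ("1Q84", "author", "Haruki Murakami"),
]

types = {}
for subj, _, obj in work_triples:
    types.setdefault(subj, set()).add("person")  # effect of rdfs:domain
    types.setdefault(obj, set()).add("book")     # effect of rdfs:range

# The test: every subject of an `author` triple should now be typed as a book.
all_books_typed = all(
    "book" in types.get(book, set()) for book, _, _ in author_triples
)
print(all_books_typed)
```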

In [26]:
## takes long..

book_ref = URIRef("http://example.org/thing/book")
person_ref = URIRef("http://example.org/thing/person")

# your code here
# add: the RDFS.domain of work is person
# add: the RDFS.range of work is book

# your code here

owlrl.DeductiveClosure(owlrl.CombinedClosure.RDFS_OWLRL_Semantics).expand(g)

In [27]:
# tests here
for sbj_obj_pair, pred_list in predicates.items():
    sbj, obj = sbj_obj_pair
    if "author" in pred_list:
        assert (URIRef(uri_lookup[sbj]),RDF.type,book_ref) in g
print("Success!")

Success!


### Exercise 5: Properties (optional)
rubric={accuracy:1,reasoning:1}

There are no explicit properties, in the sense of one-argument predicates, in the DBpedia data (or the RDF framework generally); everything is expressed in terms of two-argument predicates. In this exercise, you will try to automatically identify potential properties contained in the data. Once you have identified some likely candidates, write about how you found them, whether you think your method always identifies good properties (is there something better you might do?), and provide names for a few distinct properties.
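One candidate heuristic, sketched here on toy data, is to count how many subjects share the same `(predicate, object)` combination. A frequent combination such as `('genre', 'Punk rock')` then behaves like a one-argument property of its subjects. This is only one possible approach, and the entries in `toy_predicates` are made up.

```python
from collections import Counter

# Toy data in the same shape as `predicates`; the entries are made up.
toy_predicates = {
    ("Band A", "Punk rock"): ["genre"],
    ("Band B", "Punk rock"): ["genre"],
    ("Band C", "Folk music"): ["genre"],
    ("Person A", "Band A"): ["associatedBand"],
}

pred_obj_count = Counter()
for (subj, obj), pred_list in toy_predicates.items():
    for pred in set(pred_list):
        # Each subject contributes one count to its (predicate, object) pair.
        pred_obj_count[(pred, obj)] += 1

# The most frequent pairs are the property candidates.
top_pair, top_count = pred_obj_count.most_common(1)[0]
print(top_pair, top_count)
```

A raw frequency ranking tends to favour very broad objects (countries, for instance), so it may be worth discussing whether every high-count pair really reads as a natural property.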

In [28]:
pred_obj_count = {} # predicate-object pair, for example; 

# your code here

# your code here
# print some 100 examples...

('type', 'Town')
('family', 'Arctiidae')
('deathPlace', 'United States')
('family', 'Tortricidae')
('occupation', 'Actress')
('birthPlace', 'India')
('family', 'Noctuidae')
('genre', 'Soul music')
('kingdom', 'Fungi')
('genre', 'Heavy metal music')
('genre', 'Contemporary R&B')
('family', 'Geometridae')
('genre', 'Punk rock')
('position', 'Defender (football)')
('genre', 'Folk music')
('birthPlace', 'Italy')
('position', 'Midfielder (association football)')
('type', 'Census-designated place')
('position', 'Striker (association football)')
('birthPlace', 'France')
('birthPlace', 'Japan')
('type', 'City')
('isPartOf', 'Masovian Voivodeship')
('recordLabel', 'Columbia Records')
('position', 'Pitcher')
('genre', 'Pop rock')
('nationality', 'United Kingdom')
('country', 'Spain')
('mediaType', 'Hardcover')
('country', 'Turkey')
('genre', 'Hard rock')
('family', 'Crambidae')
('genre', 'Indie rock')
('position', 'Goalkeeper (association football)')
('battle', 'World War II')
('birthPlace', 'Ca