## 1. Function to convert Wikipedia revision ID to Wikidata ID

For example, if you pass the revision ID 935784560 (corresponding with this revision of the English Wikipedia article for N. K. Jemisin: https://en.wikipedia.org/w/index.php?oldid=935784560) to revid_to_qid, then the function should return Q2427544.

* You can use the MediaWiki pageprops API to get this information: https://www.mediawiki.org/wiki/API:Pageprops
* Note that the API accepts a revids parameter: https://www.mediawiki.org/w/api.php?action=help&modules=query
* The sample code here might help you with getting started: https://www.mediawiki.org/wiki/API:Pageprops#Sample_code

In [5]:
import requests 

def revid_to_qid(revid, lang):
    """Takes a Wikipedia article revision ID and returns the corresponding Wikidata ID
    
    Args:
        revid: integer revision ID associated with an article in the provided language
        lang: the Wikipedia language version -- e.g., 'en' corresponds with English Wikipedia
        
    Returns:
        qid: Wikidata ID associated with the article corresponding to the revision ID
    
    """
    return qid

In [59]:
# first try: using requests library

import requests

S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "prop": "pageprops",
    "revids": "935784560",
    "format": "json"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()


print(DATA)

{'batchcomplete': '', 'query': {'pages': {'29828568': {'pageid': 29828568, 'ns': 0, 'title': 'N. K. Jemisin', 'pageprops': {'defaultsort': 'Jemisin, N. K.', 'page_image_free': 'N._K._Jemisin_(cropped).jpg', 'wikibase-shortdesc': 'American science fiction and fantasy writer', 'wikibase_item': 'Q2427544'}}}}}


In [60]:
# second try: indexing through the nested dictionary keys to reach the wanted ID
# note: the page ID "29828568" is hardcoded here, so this only works for this one article

import requests

S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "prop": "pageprops",
    "revids": "935784560",
    "format": "json"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

print(DATA["query"]["pages"]["29828568"]["pageprops"]["wikibase_item"])    

Q2427544
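
The lookup above depends on knowing the page ID in advance. A minimal sketch of one way around that, assuming the response has the same shape as the one printed in the first try: iterate over whatever pages come back instead of naming the key. The helper name `extract_qid` and the trimmed `sample` dictionary are illustrative only, not part of the assignment.

```python
def extract_qid(data):
    """Return the first wikibase_item found in a pageprops response, or None.

    `extract_qid` is a hypothetical helper name; `data` is assumed to be the
    parsed JSON returned by action=query&prop=pageprops.
    """
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid is not None:
            return qid
    return None


# Trimmed copy of the response printed in the first try
sample = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "29828568": {
                "pageid": 29828568,
                "ns": 0,
                "title": "N. K. Jemisin",
                "pageprops": {"wikibase_item": "Q2427544"},
            }
        }
    },
}

print(extract_qid(sample))
```

Returning `None` when no item is present is a design choice made for this sketch; raising an exception would work just as well if a missing QID should stop the notebook.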


In [65]:
# set qid

import requests

S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "prop": "pageprops",
    "revids": "935784560",
    "format": "json"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

qid = DATA["query"]["pages"]["29828568"]["pageprops"]["wikibase_item"]
print(qid)

Q2427544


In [67]:
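def revid_to_qid(revid, lang):
    S = requests.Session()

    # Build the endpoint from lang instead of hardcoding English Wikipedia
    URL = f"https://{lang}.wikipedia.org/w/api.php"

    PARAMS = {
        "action": "query",
        "prop": "pageprops",
        # Pass the variable itself. The literal string "revid" is not a valid
        # revision ID; the response then contained no "query" key, which is
        # what raised the earlier KeyError: 'query'.
        "revids": revid,
        "format": "json"
    }

    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()

    # The page ID key differs for every article, so read it from the response
    # rather than hardcoding "29828568"
    pages = DATA.get("query", {}).get("pages", {})
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid is not None:
            return qid
    return None

revid_to_qid(935784560, "en")

'Q2427544'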

## 2. Function to gather Wikidata claims for a Wikidata item

For example, if you pass the Wikidata ID Q2427544, then the claims and identifiers -- i.e. the properties and values listed on the page here (https://www.wikidata.org/wiki/Q2427544) -- should be returned as a sequence of (property, value) tuples as follows:

[(P31, Q5), (P18, ), (P21, Q6581072), ... , (P7400, ), (P7704, )]

A complete example for the item associated with the Shintokyo Maru https://www.wikidata.org/wiki/Q52329548:

[(P31, Q11446), (P458, ), (P373, ), (P729, ), (P176, Q11309941), (P18, ), (P7782, Q83904766), (P2043, ), (P1093, )]

Notes:

* Order of claims does not matter
* When there are multiple values for a property, they should all be included -- e.g., "languages spoken, written or signed" under https://www.wikidata.org/wiki/Q2427544 would be represented as ..., (P1412, Q7976), (P1412, Q1860), ...
* You don't need to include references
* When the value for a property is not a Wikidata item -- i.e. not entity-type of 'wikibase-entityid' -- then you can leave the value slot in the tuple blank

You should try to use the mwbase Python package for gathering this information (https://pypi.org/project/mwbase/)

Alternatively, you can use the Wikidata API to get this information: https://www.mediawiki.org/wiki/Wikibase/API and more specifically https://www.wikidata.org/w/api.php?action=help&modules=wbgetclaims
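
For the Wikidata API route, the sketch below shows only the parsing step. It assumes the JSON has already been fetched with `action=wbgetclaims`, `entity=<qid>` and `format=json` (for instance with `requests`, as in section 1) and follows the usual `claims -> property -> statements -> mainsnak -> datavalue` layout. The function name `parse_claims` and the `fake_response` payload are made up for illustration.

```python
def parse_claims(response):
    """Turn a wbgetclaims JSON response into (property, value) tuples.

    `parse_claims` is a hypothetical helper name. Values that are not
    Wikidata entities are left out, giving a one-element tuple (property,).
    """
    claims = []
    for prop, statements in response.get("claims", {}).items():
        for statement in statements:
            datavalue = statement.get("mainsnak", {}).get("datavalue", {})
            entity_id = None
            if datavalue.get("type") == "wikibase-entityid":
                entity_id = datavalue.get("value", {}).get("id")
            if entity_id is not None:
                claims.append((prop, entity_id))
            else:
                claims.append((prop,))
    return claims


# Hand-written payload in the assumed response shape (not real API output)
fake_response = {
    "claims": {
        "P31": [
            {"mainsnak": {"datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "id": "Q5"}}}}
        ],
        "P18": [
            {"mainsnak": {"datavalue": {
                "type": "string",
                "value": "Example.jpg"}}}
        ],
    }
}

print(parse_claims(fake_response))
```

Every statement under a property is emitted separately, so a property with several values produces several tuples, as the notes above require.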

In [None]:
def qid_to_claims(qid):
    """Takes a Wikidata ID and returns a sequence of claims.
    
    Args:
        qid: Wikidata ID
    Returns:
        claims: Sequence of claims tuples of form (property, value) or (property, ) when the value does not have a QID
    """
    # your code here
    return claims

## 3. Function to convert the claims in a Wikidata item into a document embedding

For example, given a Wikidata item's sequence of claims, return an embedding that represents that item. For the purpose of this application, individual embeddings for properties and values on Wikidata will be provided and the document embedding will be defined as the average of these individual embeddings.

Toy example: if a Wikidata item has the following claims [(P1, Q1), (P1, Q2), (P2, )] and these properties and values are associated with the following 3-dimensional embeddings:

* P1: [1, 2, 3]
* P2: [4, 5, 6]
* Q1: [3, 2, 1]
* Q2: [1, 0, 1]

Then, the document embedding for this Wikidata item would be:

* (P1 + Q1 + P1 + Q2 + P2) / 5, which ends up being:
* ([1, 2, 3] + [3, 2, 1] + [1, 2, 3] + [1, 0, 1] + [4, 5, 6]) / 5, which ends up being:
* [10, 11, 14] / 5 = [2, 2.2, 2.8]

If there is no embedding for a given property (P#) or entity (Q#) -- i.e. out-of-vocabulary -- an embedding of 0s ([0.0, 0.0, 0.0] in the above example) should be used as part of the averaging.
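
The averaging rule can be sketched in plain Python as follows. This assumes `embeddings` is a dictionary from property/entity ID to a list of floats and that the dimension is passed in explicitly; the name `average_claim_embeddings` and the `dim` parameter are illustrative choices, not part of the assignment's function signature.

```python
def average_claim_embeddings(claims, embeddings, dim):
    """Average the embeddings of every property and value in `claims`.

    Hypothetical helper: out-of-vocabulary IDs count as a vector of zeros,
    and blank value slots, as in (P2, ), are skipped.
    """
    zero = [0.0] * dim
    total = [0.0] * dim
    count = 0
    for claim in claims:
        for token in claim:
            if not token:
                continue  # blank value slot, nothing to average
            vector = embeddings.get(token, zero)
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    if count == 0:
        return list(zero)
    return [t / count for t in total]


# The toy example from the text
toy_embeddings = {
    "P1": [1, 2, 3],
    "P2": [4, 5, 6],
    "Q1": [3, 2, 1],
    "Q2": [1, 0, 1],
}
toy_claims = [("P1", "Q1"), ("P1", "Q2"), ("P2",)]

print(average_claim_embeddings(toy_claims, toy_embeddings, dim=3))
```

For the toy claims this reproduces the [2, 2.2, 2.8] worked out above, up to floating-point rounding.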

In [None]:
def claims_to_doc_embedding(claims, embeddings):
    """Takes a sequence of Wikidata claims and produces a document embedding.
    
    Args:
        claims: sequence of Wikidata claims.
        embeddings: look-up for the embeddings associated with each property/entity in the claims.
    Returns:
        document embedding: sequence of floats that is the average of the claims embeddings
    """
    # your code here
    return doc_embedding

## 4. Evaluation

At this stage, for topic classification, we would train a machine learning model that predicts a Wikidata item's topics based on the document embedding. We will leave that step off for now and instead focus on evaluating the document embeddings that the preceding code generates:

* Choose some similar entities that have Wikipedia articles and compare how similar their document embeddings are. Also compare them to entities that are somewhat similar and entities that are very different.
* Do the embeddings capture your beliefs about how similar given entities are?
* Can you find examples where the similarity of document embeddings does not match what you would expect? Why?

For example, you might imagine that the document embedding for Ernest Shackleton is similar to that of Antarctica, or even Greenland because it is quite arctic, but quite different from that of Tanzania. However, Antarctica and Tanzania should still be relatively similar to each other given that they are both geographic regions.

For similarity, you can use the cosine similarity of two document embeddings. With cosine similarity, values close to 1 indicate that the two documents are very similar and values close to 0 indicate that the documents have little in common. Feel free to compute it yourself or find a Python package that can help.
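
If you compute it yourself, cosine similarity is the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch, assuming both embeddings are equal-length sequences of numbers; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors.

    Returns 0.0 if either vector is all zeros, which is a convention chosen
    for this sketch to avoid dividing by zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.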

For embeddings, you may use the embeddings found within the model.bin file here: https://github.com/geohci/wikidata-topic-model-api/blob/master/models/model.bin or https://drive.google.com/file/d/1YAniioZAtMHMMRWbA7HrbuSZVUi5QEpQ/view?usp=sharing (both files are the same, just hosted on different platforms). This is a trained fastText model that contains embeddings for Wikidata claims. You can download that file and upload it to your PAWS directory so that it is available from this notebook. You can then use the functions that you wrote earlier to convert the Wikidata items you are comparing to document embeddings that can be evaluated.



In [None]:
!pip install fasttext

import fasttext

# See: https://fasttext.cc/docs/en/python-module.html#model-object
model = fasttext.load_model('model.bin')