# Exercise #2: Indexing DBpedia data

Build a fielded Elasticsearch index of Norwegian athletes.

## Get a list of Norwegian athletes

Identify the list of people based on Wikipedia categories or DBpedia types.

We'll use the [List of Norwegian sportspeople](https://en.wikipedia.org/wiki/List_of_Norwegian_sportspeople) Wikipedia list page. From this page, we need to extract all outgoing links.

Note that not all linked pages correspond to athletes, so we'll need to filter those out later (at indexing time).

In [2]:
import wikipediaapi

wiki_wiki = wikipediaapi.Wikipedia('en')

In [8]:
list_page = wiki_wiki.page("List_of_Norwegian_sportspeople")
pages = list(list_page.links.keys())

In [43]:
print(pages[:10], "...")

['Ailo Gaup (motocross rider)', 'Alf Hansen', 'Anders Bardal', 'Anders Jacobsen (ski jumper)', 'Andreas Thorkildsen', 'André Jørgensen', 'Ane Eidem', 'Anja Hammerseng-Edin', 'Anne Sophie Hunstad', 'Are Grongstad'] ...


## Index the data

Build a fielded Elasticsearch index. Fields should include:

  - Names
  - Description
  - Attributes
  - Related entities
  - Types/categories
  - Catch-all

In [11]:
from elasticsearch import Elasticsearch

es = Elasticsearch()

### Index configuration 

For each field, store the term vectors in the index so that they do not have to be computed on the fly. See the [`term_vector` documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/term-vector.html).
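
Because every field in this exercise shares the same configuration, the mapping can also be generated instead of written out by hand. The following is a minimal sketch; `FIELDS` and `build_index_settings` are hypothetical names introduced only for illustration.

```python
# Hypothetical helper: builds a mapping in which every field shares one configuration.
FIELDS = ["names", "description", "attributes", "related_entities", "types", "catch_all"]

def build_index_settings(fields, analyzer="english", term_vector="yes"):
    """Return an index body where each field is an analyzed text field with stored term vectors."""
    field_config = {"type": "text", "term_vector": term_vector, "analyzer": analyzer}
    # Copy the config per field so that later edits to one field do not affect the others.
    return {"mappings": {"properties": {f: dict(field_config) for f in fields}}}

settings = build_index_settings(FIELDS)
print(sorted(settings["mappings"]["properties"]))
```

With `term_vector` set to `"yes"`, only the terms are stored; other values such as `"with_positions_offsets"` additionally store positions and character offsets, at the cost of a larger index.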

In [13]:
INDEX_NAME = "norwegian_athletes"

INDEX_SETTINGS = {
    'mappings': {
            'properties': {
                'names': {
                    'type': "text",
                    'term_vector': "yes",
                    'analyzer': "english"
                },
                'description': {
                    'type': "text",
                    'term_vector': "yes",
                    'analyzer': "english"
                },
                'attributes': {
                    'type': "text",
                    'term_vector': "yes",
                    'analyzer': "english"
                },
                'related_entities': {
                    'type': "text",
                    'term_vector': "yes",
                    'analyzer': "english"
                },
                'types': {
                    'type': "text",
                    'term_vector': "yes",
                    'analyzer': "english"
                },
                'catch_all': {
                    'type': "text",
                    'term_vector': "yes",
                    'analyzer': "english"
                },
            }
        }
    }

In [14]:
if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)
    
es.indices.create(index=INDEX_NAME, body=INDEX_SETTINGS)

{'acknowledged': True,
 'index': 'norwegian_athletes',
 'shards_acknowledged': True}

### Indexing athletes 

In [15]:
import requests

Method to decide whether an entity belongs to a certain type (i.e., whether `target_type` is one of the many types assigned to the entity).

In [28]:
TYPE_PREDICATE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def has_type(properties, target_type):
    if TYPE_PREDICATE not in properties:
        return False
    for p in properties[TYPE_PREDICATE]:
        if p['value'] == target_type:
            return True
    return False
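
To make the expected input shape concrete, here is a small sketch on hand-written toy data. It uses `has_any_type`, a hypothetical variant that accepts several target types at once; it is not part of the exercise code, and the toy dictionary only mimics the structure of the DBpedia JSON.

```python
TYPE_PREDICATE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def has_any_type(properties, target_types):
    """Hypothetical variant of has_type: True if the entity has at least one of the given types."""
    assigned = {p["value"] for p in properties.get(TYPE_PREDICATE, [])}
    return not assigned.isdisjoint(target_types)

# Hand-written toy data mimicking the shape of the DBpedia JSON (not real DBpedia output).
toy_properties = {
    TYPE_PREDICATE: [
        {"type": "uri", "value": "http://dbpedia.org/ontology/Person"},
        {"type": "uri", "value": "http://dbpedia.org/ontology/Athlete"},
    ]
}

print(has_any_type(toy_properties, {"http://dbpedia.org/ontology/Athlete"}))  # True
print(has_any_type({}, {"http://dbpedia.org/ontology/Athlete"}))  # False
```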

Method to resolve a given URI. In principle, we should look up the name of the resource that the URI points to. Here, we'll simply take the segment of the URI after the last slash and replace underscores with spaces. Note that this will not work for CamelCase names (like DBpedia type names), among many other things.

In [26]:
def resolve_uri(uri):
    return uri.split("/")[-1].replace("_", " ")

In [27]:
print(resolve_uri("http://dbpedia.org/resource/Category:Norwegian_speed_skaters"))

Category:Norwegian speed skaters
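
One way to handle the CamelCase limitation is to insert a space at each lowercase-to-uppercase boundary. The sketch below uses a hypothetical `resolve_uri_camel` function and the type URI `http://dbpedia.org/ontology/SpeedSkater` purely as an example; it only covers ASCII letters and does not treat acronyms specially.

```python
import re

def resolve_uri_camel(uri):
    """Hypothetical variant of resolve_uri that also splits CamelCase words."""
    local_name = uri.split("/")[-1].replace("_", " ")
    # Insert a space wherever a lowercase letter or digit is directly followed by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local_name)

print(resolve_uri_camel("http://dbpedia.org/ontology/SpeedSkater"))
print(resolve_uri_camel("http://dbpedia.org/resource/Category:Norwegian_speed_skaters"))
```

Names that already contain spaces after the underscore replacement pass through unchanged, so the function can be applied to resource and type URIs alike.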


Method to create a fielded document representation for a given entity.

In [34]:
NAME_PREDICATES = set(["http://www.w3.org/2000/01/rdf-schema#label", 
                       "http://xmlns.com/foaf/0.1/name", 
                       "http://xmlns.com/foaf/0.1/givenName",
                       "http://xmlns.com/foaf/0.1/surname"])
TYPE_PREDICATES = set([TYPE_PREDICATE, 
                       "http://purl.org/dc/terms/subject"])
COMMENT_PREDICATE = "http://www.w3.org/2000/01/rdf-schema#comment"

def get_entity_doc(properties):
    doc = {}
    
    for predicate, values in properties.items():
        for value in values:
            # Get indexable text from the object
            value_str = str(value['value']) if value['type'] == "literal" else resolve_uri(value['value'])
            
            # Mapping to different fields based on predicate and type of value        
            if predicate in NAME_PREDICATES:  # names
                doc['names'] = doc.get("names", "") + " " + value_str
            elif predicate in TYPE_PREDICATES:  # types
                doc['types'] = doc.get("types", "") + " " + value_str
            elif predicate == COMMENT_PREDICATE:  # description
                doc['description'] = value_str
            elif value['type'] == "literal":  # attributes
                doc['attributes'] = doc.get("attributes", "") + " " + value_str                
            elif value['type'] == "uri":  # related entities
                doc['related_entities'] = doc.get("related_entities", "") + " " + value_str                
                
            # Always add to catch_all field
            doc['catch_all'] = doc.get("catch_all", "") + " " + value_str
    
    return doc
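
Note that `get_entity_doc` keeps whichever `rdfs:comment` value comes last, regardless of its language, so the description may end up in a language other than English. A possible refinement is to select the literal by language tag. The sketch below assumes that language-tagged literals carry a `lang` key, as in the RDF/JSON convention; `pick_literal` is a hypothetical helper and the toy values are hand-written.

```python
def pick_literal(values, lang="en"):
    """Return the first literal value tagged with `lang`, or None if there is none."""
    for value in values:
        if value.get("type") == "literal" and value.get("lang") == lang:
            return str(value["value"])
    return None

# Hand-written toy data (not real DBpedia output).
toy_comments = [
    {"type": "literal", "value": "Voorbeeldtekst in het Nederlands.", "lang": "nl"},
    {"type": "literal", "value": "Example text in English.", "lang": "en"},
]
print(pick_literal(toy_comments))        # the English literal
print(pick_literal(toy_comments, "de"))  # None, no German literal present
```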

In [37]:
num_entities = 0
for page in pages:
    url_name = page.replace(" ", "_")
    data = requests.get("http://dbpedia.org/data/{}.json".format(url_name)).json()
    dict_key = "http://dbpedia.org/resource/{}".format(url_name)
    if dict_key not in data:
        continue
    properties = data[dict_key]
    # Filter out non-athletes (as well as entities without any type)
    if not has_type(properties, "http://dbpedia.org/ontology/Athlete"):
        continue

    print(page)
    es.index(index=INDEX_NAME, id=page, body=get_entity_doc(properties))
    num_entities += 1
    
print("{} entities indexed".format(num_entities))

Alf Hansen
Anders Bardal
Anders Jacobsen (ski jumper)
Andreas Thorkildsen
André Jørgensen
Ane Eidem
Anja Hammerseng-Edin
Are Grongstad
Atle Maurud
Bente Skari
Birger Ruud
Bjørg Eva Jensen
Bjørn Dæhlie
Bjørn Einar Romøren
Bjørn Helge Riise
Bjørn Wirkola
Cecilia Brækhus
Christina Vukicevic
Daniel Braaten
Daniel Hrcka Brøndberg
Edvald Boasson Hagen
Edwin Kjeldner
Egil Danielsen
Endre Hansen
Erik Noppi
Eskil Ervik
Espen Bredesen
Espen Knutsen
Even Blakstad
Frank Hansen (rower)
Fred Anton Maier
Frode Estil
Geir Karlstad
Goaltender
Grete Waitz
Gro Hammerseng-Edin
Gunn-Rita Dahle Flesjå
Gøran van den Burgt
Hans Anton Aalien
Hege Riise
Henning Solberg
Henrik Bjørnstad
Hjalmar Andersen
Inga Berit Svestad
Ingrid Kristiansen
Isabell Herlovsen
Ivar Ballangrud
Jan Egil Storholt
Jan Stenerud
Joachim Hansen (fighter)
Johan Martin Lianes
Johan Remen Evensen
Johann Olav Koss
John Arne Riise
John Carew
Jon Ludvig Hammer
Jon Rønningen
Jørn Goldstein
Kari Traa
Karl Erik Rimfeldt
Kenneth Trones
Kim-Roar Ha
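
The loop above sends one index request per entity. For larger collections, the documents could instead be collected and sent in a single bulk request. The following sketch only builds the bulk actions; `make_actions` is a hypothetical helper, the toy documents are hand-written, and the commented-out call assumes the `bulk` helper from `elasticsearch.helpers` and a running Elasticsearch instance.

```python
def make_actions(docs_by_id, index_name):
    """Yield one bulk action per document, as dicts with _index, _id and _source."""
    for doc_id, doc in docs_by_id.items():
        yield {"_index": index_name, "_id": doc_id, "_source": doc}

# Hand-written toy documents (not real DBpedia data).
toy_docs = {
    "Example Athlete": {"names": "Example Athlete", "catch_all": "Example Athlete"},
    "Another Athlete": {"names": "Another Athlete", "catch_all": "Another Athlete"},
}
actions = list(make_actions(toy_docs, "norwegian_athletes"))
print(len(actions), actions[0]["_id"])

# With a running Elasticsearch instance, the actions could then be sent in one call:
#   from elasticsearch import helpers
#   helpers.bulk(es, make_actions(toy_docs, "norwegian_athletes"))
```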

## Test

Test the indexed content by fetching the term vectors for a given entity from the index. See the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html) for the meaning of the different values.

In [40]:
tv = es.termvectors(index=INDEX_NAME, id="Eskil Ervik", fields="names,description")

In [41]:
tv

{'_id': 'Eskil+Ervik',
 '_index': 'norwegian_athletes',
 '_type': '_doc',
 '_version': 1,
 'found': True,
 'term_vectors': {'description': {'field_statistics': {'doc_count': 134,
    'sum_doc_freq': 5112,
    'sum_ttf': 6535},
   'terms': {'11': {'term_freq': 1},
    '1975': {'term_freq': 1},
    'cba': {'term_freq': 1},
    'een': {'term_freq': 1},
    'en': {'term_freq': 1},
    'ervik': {'term_freq': 1},
    'eskil': {'term_freq': 1},
    'huidig': {'term_freq': 1},
    'januari': {'term_freq': 1},
    'langebaanschaats': {'term_freq': 1},
    'manag': {'term_freq': 1},
    'noor': {'term_freq': 1},
    'noorwegen': {'term_freq': 1},
    'oud': {'term_freq': 1},
    'team': {'term_freq': 1},
    'trondheim': {'term_freq': 1},
    'van': {'term_freq': 1}}},
  'names': {'field_statistics': {'doc_count': 134,
    'sum_doc_freq': 732,
    'sum_ttf': 2378},
   'terms': {'ervik': {'term_freq': 6},
    'eskil': {'term_freq': 6},
    'эрвик': {'term_freq': 1},
    'эскил': {'term_freq': 1}}
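
As a rough guide to reading this response: according to the term vectors documentation, `doc_count` is the number of documents that contain the field, `sum_doc_freq` is the sum of the document frequencies of all terms in the field, and `sum_ttf` is the sum of the total term frequencies. The sketch below, using hypothetical helper functions and the `names` values copied from the output above, derives the average field length and the most frequent terms.

```python
def average_field_length(field_statistics):
    """Average number of tokens per document for a field: sum_ttf / doc_count."""
    return field_statistics["sum_ttf"] / field_statistics["doc_count"]

def top_terms(field_vector, k=3):
    """Return the k terms with the highest term_freq, ties broken alphabetically."""
    terms = field_vector["terms"]
    return sorted(terms, key=lambda t: (-terms[t]["term_freq"], t))[:k]

# Values copied from the `names` field in the output above.
names_vector = {
    "field_statistics": {"doc_count": 134, "sum_doc_freq": 732, "sum_ttf": 2378},
    "terms": {"ervik": {"term_freq": 6}, "eskil": {"term_freq": 6},
              "эрвик": {"term_freq": 1}, "эскил": {"term_freq": 1}},
}
print(round(average_field_length(names_vector["field_statistics"]), 2))
print(top_terms(names_vector, k=2))
```

Statistics of this kind are what length-normalized retrieval models rely on, which is one reason to store term vectors per field.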

## Feedback

Please give (anonymous) feedback on this exercise by filling out [this form](https://forms.gle/22o3ursi5YsR1Ztb8).