<h2>Inspecting the Aligned Datasets</h2>

This notebook inspects the two corpora of Wikipedia summaries aligned with knowledge base triples (i.e. D1 and D2).

In [1]:
import os
import cPickle as pickle
import operator

Choose which dataset to load by setting the `selected_dataset` variable below.
* `selected_dataset = 'D1'`: will load the dataset of DBpedia triples aligned with Wikipedia biographies
* `selected_dataset = 'D2'`: will load the dataset of Wikidata triples aligned with Wikipedia biographies

In [2]:
selected_dataset = 'D1'
assert (selected_dataset == 'D1' or selected_dataset == 'D2'), "selected_dataset can be set to either 'D1' or 'D2'."

The cell below loads every `pickle` file that resides in the selected dataset's directory and merges their contents into a single dictionary.

In [3]:
dataset_dir = './%s/data/' % selected_dataset
dataset = {'item': [], 'original_summary': [], 'summary_with_URIs': [], 'summary_with_surf_forms': [], 'triples': []}

for filename in os.listdir(dataset_dir):
    if filename.endswith('.p'):
        tempDatasetFileLocation = os.path.join(dataset_dir, filename)
        with open(tempDatasetFileLocation, 'rb') as tempDatasetFile:
            tempDataset = pickle.load(tempDatasetFile)
        # Append this file's entries to the corresponding merged lists.
        for key in dataset:
            dataset[key].extend(tempDataset[key])
        print('Successfully loaded dataset file: %s' % tempDatasetFileLocation)
assert(len(dataset['item']) == len(dataset['original_summary']))
assert(len(dataset['item']) == len(dataset['triples']))
assert(len(dataset['item']) == len(dataset['summary_with_URIs']))
assert(len(dataset['item']) == len(dataset['summary_with_surf_forms']))
print('Total items that have been loaded in the dataset: %d' % (len(dataset['item'])))

Successfully loaded dataset file: ./D1/data/22.p
Successfully loaded dataset file: ./D1/data/18.p
Successfully loaded dataset file: ./D1/data/8.p
Successfully loaded dataset file: ./D1/data/6.p
Successfully loaded dataset file: ./D1/data/13.p
Successfully loaded dataset file: ./D1/data/14.p
Successfully loaded dataset file: ./D1/data/24.p
Successfully loaded dataset file: ./D1/data/10.p
Successfully loaded dataset file: ./D1/data/20.p
Successfully loaded dataset file: ./D1/data/15.p
Successfully loaded dataset file: ./D1/data/21.p
Successfully loaded dataset file: ./D1/data/4.p
Successfully loaded dataset file: ./D1/data/1.p
Successfully loaded dataset file: ./D1/data/11.p
Successfully loaded dataset file: ./D1/data/2.p
Successfully loaded dataset file: ./D1/data/17.p
Successfully loaded dataset file: ./D1/data/23.p
Successfully loaded dataset file: ./D1/data/25.p
Successfully loaded dataset file: ./D1/data/3.p
Successfully loaded dataset file: ./D1/data/7.p
Successfully loaded dataset
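
`os.listdir` returns entries in arbitrary order, as the log above shows (`22.p`, `18.p`, `8.p`, ...). Since the position of an item in the merged lists depends on the order in which the files are loaded, index-based lookups may not be reproducible across machines. Below is a minimal sketch of one way to make the order deterministic, assuming the files are named with integer stems as in the log. `sorted_shards` is a hypothetical helper, not part of the original notebook.

```python
import os

def sorted_shards(filenames):
    """Return the '.p' files from `filenames`, ordered by their integer stem."""
    shards = [name for name in filenames if name.endswith('.p')]
    # Assumes every stem is an integer, e.g. '22.p' -> 22.
    return sorted(shards, key=lambda name: int(os.path.splitext(name)[0]))

# File names taken from the loading log above; 'README.txt' is made up.
example = ['22.p', '18.p', '8.p', '6.p', 'README.txt']
print(sorted_shards(example))
```

In the loading cell, this would mean iterating over `sorted_shards(os.listdir(dataset_dir))`. Note that doing so may change which item a given index refers to, compared with the outputs recorded in this notebook.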

The selected dataset is loaded as a dictionary of lists. The lists are aligned with each other: the same index in every list refers to the same item. For example, to print all entries for the $15$-th item (zero-based index 14), we simply do:

In [4]:
index = 14
for key in dataset:
    print(key)
    print(dataset[key][index])

original_summary
George Yull Mackie, Baron Mackie of Benshie CBE DSO DFC (10 July 1919 – 17 February 2015) was a Scottish Liberal Party politician. After World War II in which he served as a decorated airman with RAF Bomber Command, Mackie took over a farm at Benshie, Angus, and subsequently set up a cattle ranch at Braeroy, Inverness-shire, near Spean Bridge.
item
http://dbpedia.org/resource/George_Mackie,_Baron_Mackie_of_Benshie
triples
['<item> http://dbpedia.org/ontology/birthPlace http://dbpedia.org/resource/Tarves', '<item> http://dbpedia.org/ontology/citizenship http://dbpedia.org/resource/Scotland', '<item> http://dbpedia.org/ontology/deathPlace http://dbpedia.org/resource/Dundee', '<item> http://dbpedia.org/ontology/hometown http://dbpedia.org/resource/Tarves', '<item> http://dbpedia.org/ontology/occupation http://dbpedia.org/resource/George_Mackie,_Baron_Mackie_of_Benshie__1', '<item> http://dbpedia.org/ontology/occupation http://dbpedia.org/resource/RAF', '<item> http://dbpe
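
Because the lists are aligned, the same position in every list describes the same item. Below is a minimal sketch of regrouping such a dictionary of lists into one record per item. The toy values and the `to_records` helper are illustrative assumptions, not part of the dataset.

```python
def to_records(dict_of_lists):
    """Turn {'a': [1, 2], 'b': [3, 4]} into [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]."""
    keys = list(dict_of_lists)
    lengths = set(len(dict_of_lists[key]) for key in keys)
    # Aligned lists must all have the same length.
    assert len(lengths) <= 1, 'lists are not aligned'
    return [dict(zip(keys, values))
            for values in zip(*(dict_of_lists[key] for key in keys))]

# Toy stand-in for the loaded dataset (only two of the five keys).
toy_dataset = {
    'item': ['http://example.org/A', 'http://example.org/B'],
    'triples': [['<item> p1 o1'], ['<item> p2 o2', '<item> p3 o3']],
}
records = to_records(toy_dataset)
print(records[1]['item'])
print(records[1]['triples'])
```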

The keys of the dictionary are described below:
* `item`: refers to the main entity of each Wikipedia summary.
* `original_summary`: refers to the original Wikipedia summary, prior to any pre-processing.
* `triples`: refers to the list of triples that are associated with the Wikipedia summary.
* `summary_with_URIs`: refers to the Wikipedia summary after pre-processing. The entities that have been annotated in the original summary are represented as URIs.
* `summary_with_surf_forms`: refers to the Wikipedia summary after pre-processing. The entities that have been annotated in the original summary are represented as surface form tuples.

Any reference to the main entity in the `triples`, `summary_with_URIs`, and `summary_with_surf_forms` is represented with the special `<item>` token. 
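
A minimal sketch of resolving the `<item>` token in a list of triples, using two of the triples from the example output above. `resolve_item` is a hypothetical helper name, and the split into subject, predicate and object assumes the space-separated layout seen in those examples.

```python
def resolve_item(triples, item_uri):
    """Replace the '<item>' token in each triple string with the main entity's URI."""
    return [triple.replace('<item>', item_uri) for triple in triples]

item_uri = 'http://dbpedia.org/resource/George_Mackie,_Baron_Mackie_of_Benshie'
triples = [
    '<item> http://dbpedia.org/ontology/birthPlace http://dbpedia.org/resource/Tarves',
    '<item> http://dbpedia.org/ontology/deathPlace http://dbpedia.org/resource/Dundee',
]
resolved = resolve_item(triples, item_uri)
for triple in resolved:
    # Split into at most three parts: subject, predicate, object.
    subject, predicate, obj = triple.split(' ', 2)
    print(predicate, '->', obj)
```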

Tokens such as `#surFormToken101` and `#surFormToken103` are used as temporary placeholders for the annotated entities' URIs in the case of `summary_with_URIs`, or for the entities' surface form tuples in the case of `summary_with_surf_forms`. They can be replaced by using the supporting dictionaries stored in the following `pickle` files.

In [5]:
surf_forms_tokens_location = './%s/utils/Surface-Form-Tokens.p' % selected_dataset
with open(surf_forms_tokens_location, 'rb') as f:
    surf_forms_tokens = pickle.load(f)
    print('Successfully loaded surface form tokens file at: %s' % surf_forms_tokens_location)

uri_tokens_location = './%s/utils/URI-Tokens.p' % selected_dataset
with open(uri_tokens_location, 'rb') as f:
    uri_tokens = pickle.load(f)
    print('Successfully loaded surface form tokens file at: %s' % uri_tokens_location)

surf_form_counts_location = './%s/utils/Surface-Forms-Counts.p' % selected_dataset
with open(surf_form_counts_location, 'rb') as f:
    surf_form_counts = pickle.load(f)
    print('Successfully loaded surface forms file at: %s' % surf_form_counts_location)

uri_counts_location = './%s/utils/URI-Counts.p' % selected_dataset
with open(uri_counts_location, 'rb') as f:
    uri_counts = pickle.load(f)
    print('Successfully loaded surface forms file at: %s' % uri_counts_location)

Successfully loaded surface form tokens file at: ./D1/utils/Surface-Form-Tokens.p
Successfully loaded surface form tokens file at: ./D1/utils/URI-Tokens.p
Successfully loaded surface forms file at: ./D1/utils/Surface-Forms-Counts.p
Successfully loaded surface forms file at: ./D1/utils/URI-Counts.p


In [6]:
# We are inverting the dictionaries of interest.
inv_uri_tokens = {v: k for k, v in uri_tokens.iteritems()}
inv_surf_forms_tokens = {v: k for k, v in surf_forms_tokens.iteritems()}

The structure of each supporting dictionary is illustrated below. The first two dictionaries map URIs and surface form tuples, respectively, to their temporary placeholders (e.g. `#surFormToken103`).
```python
uri_tokens = {u'http://dbpedia.org/resource/Snyder_Rini': '#surFormToken77050',
              u'http://dbpedia.org/resource/Mountain_West_Conference': '#surFormToken77051',
              ...}
surf_forms_tokens = {(u'http://dbpedia.org/resource/Album', u'studio album'): '#surFormToken352',
                     (u'http://dbpedia.org/resource/Album', u'studio albums'): '#surFormToken697',
                     (u'http://dbpedia.org/resource/Actor', u'actor'): '#surFormToken693',
                     (u'http://dbpedia.org/resource/Actor', u'stage'): '#surFormToken622'}
```

Examples of the inverses of the above dictionaries, which map the temporary placeholders (e.g. `#surFormToken103`) to their respective URIs and surface form tuples, are shown below:
```python
inv_uri_tokens = {'#surFormToken77050': u'http://dbpedia.org/resource/Snyder_Rini',
                  '#surFormToken77051': u'http://dbpedia.org/resource/Mountain_West_Conference', 
                  ...}
inv_surf_forms_tokens = {'#surFormToken77057': (u'http://dbpedia.org/resource/Snyder_Rini', u'Snyder Rini'),       
                         '#surFormToken77051': (u'http://dbpedia.org/resource/Richard_Webb_(actor)', u'Richard Webb'), 
                         ...}
```
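
The inversion and the placeholder substitution can be sketched on toy data as follows. The mapping reuses two entries from the example above; the summary string and the `restore_uris` helper are made up for illustration. The sketch assumes that each placeholder appears only once as a value, since inverting a dictionary with duplicate values would silently drop entries.

```python
uri_tokens = {
    'http://dbpedia.org/resource/Snyder_Rini': '#surFormToken77050',
    'http://dbpedia.org/resource/Mountain_West_Conference': '#surFormToken77051',
}
# Invert the mapping: placeholder -> URI.
inv_uri_tokens = {token: uri for uri, token in uri_tokens.items()}

def restore_uris(summary, inverse_mapping):
    """Replace every whitespace-delimited placeholder token with its URI."""
    return ' '.join(inverse_mapping.get(token, token) for token in summary.split())

# Hypothetical pre-processed summary.
summary = '#surFormToken77050 spoke at the #surFormToken77051 .'
print(restore_uris(summary, inv_uri_tokens))
```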

Below are examples of the structure of the dictionaries that track how often each surface form has been associated with an entity URI.
```python
uri_counts = {'http://dbpedia.org/resource/Actor': {u'actor': 19014, u'actress': 14941, ...},
              'http://dbpedia.org/resource/Europe': {u'Europe': 3169, u'European': 1446, ...}, 
              ...}
surf_form_counts = {'http://dbpedia.org/resource/Albert_Einstein': {'Albert Einstein': 1, 'Einstein': 2},
                    'http://dbpedia.org/resource/Artist': {'artist': 1, 'artists': 1}, 
                    ...}
```
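
These counts make it possible to pick the most frequent surface form for a URI. Below is a minimal sketch using the counts shown above, with the elided entries left out. `most_frequent_surface_form` is a hypothetical helper, and returning `None` for an unknown URI is an assumption of this sketch rather than behaviour of the notebook.

```python
import operator

uri_counts = {
    'http://dbpedia.org/resource/Actor': {'actor': 19014, 'actress': 14941},
    'http://dbpedia.org/resource/Europe': {'Europe': 3169, 'European': 1446},
}

def most_frequent_surface_form(uri, counts):
    """Return the surface form with the highest count for `uri`, or None if unknown."""
    forms = counts.get(uri)
    if not forms:
        return None
    # Each item is a (surface_form, count) pair; compare on the count.
    return max(forms.items(), key=operator.itemgetter(1))[0]

print(most_frequent_surface_form('http://dbpedia.org/resource/Actor', uri_counts))
```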

In [7]:
def getAligned(index, use_surface_forms = False):
    if index < len(dataset['item']):
        
        # Printing the summary by representing the annotated entities as URIs.
        print ('Wikipedia Summary w/ URIs:')
        tokens = dataset['summary_with_URIs'][index].split()
        for j in range (0, len(tokens)):
            if tokens[j] in inv_uri_tokens:
                tempURI = inv_uri_tokens[tokens[j]]
                if use_surface_forms:              
                    tokens[j] = max(uri_counts[tempURI].iteritems(), key=operator.itemgetter(1))[0]
                else:
                    tokens[j] = tempURI
            elif tokens[j] == '<item>':
                tempURI = dataset['item'][index].decode('utf-8')
                print(tempURI)
                if use_surface_forms:              
                    tokens[j] = max(uri_counts[tempURI].iteritems(), key=operator.itemgetter(1))[0]
                else:
                    tokens[j] = tempURI
        print(' '.join(tokens))
        
        # Printing the summary by representing the annotated entities as surface form tuples.
        print ('\nWikipedia Summary w/ Surf. Form Tuples:')
        tokens = dataset['summary_with_surf_forms'][index].split()
        for j in range (0, len(tokens)):
            if tokens[j] in inv_surf_forms_tokens:
                tempTuple = inv_surf_forms_tokens[tokens[j]]
                if use_surface_forms:              
                    tokens[j] = tempTuple[1]
                else:
                    tokens[j] = str(tempTuple)
            elif tokens[j] == '<item>':
                tempURI = dataset['item'][index].decode('utf-8')
                if use_surface_forms:              
                    tokens[j] = max(surf_form_counts[tempURI].iteritems(), key=operator.itemgetter(1))[0]
                else:
                    tokens[j] = tempURI
        print(' '.join(tokens))
        
        # Printing the knowledge base triples allocated to the summary.
        print ('\nTriples:')
        for j in range(0, len(dataset['triples'][index])):
            print(dataset['triples'][index][j].replace('<item>', dataset['item'][index]))
    else:
        print('Pick an index between 0 and %d.' % (len(dataset['item']) - 1))

Running the `getAligned(i, use_surface_forms)` function prints the Wikipedia summary at index $i$, both with URIs and with surface form tuples, along with its corresponding triples.

When the `use_surface_forms` argument is set to `True`:
* In the Wikipedia summaries with URIs, each entity URI in the text is replaced by its most frequent surface form.
* In the Wikipedia summaries with surface form tuples, each tuple is replaced by its second element (the surface form).
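
The effect of the flag on a summary with surface form tuples can be sketched in isolation as follows. `render_token` is a hypothetical helper that returns a string instead of printing; the placeholder number and the sentence are made up, while the tuple follows the `surf_forms_tokens` example shown earlier.

```python
def render_token(token, inv_surf_forms_tokens, use_surface_forms=False):
    """Render one token of a summary with surface form tuples."""
    if token not in inv_surf_forms_tokens:
        return token  # ordinary word, keep unchanged
    uri, surface_form = inv_surf_forms_tokens[token]
    # Keep only the surface form, or show the whole (URI, surface form) tuple.
    return surface_form if use_surface_forms else str((uri, surface_form))

toy_inverse = {
    '#surFormToken1': ('http://dbpedia.org/resource/Album', 'studio album'),
}
tokens = 'her first #surFormToken1'.split()
with_forms = ' '.join(render_token(t, toy_inverse, True) for t in tokens)
with_tuples = ' '.join(render_token(t, toy_inverse, False) for t in tokens)
print(with_forms)
print(with_tuples)
```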

In [8]:
getAligned(4020, False)

Wikipedia Summary w/ URIs:
http://dbpedia.org/resource/Kiyoshi_Kuromiya
http://dbpedia.org/resource/Kiyoshi_Kuromiya ( May 0 , <year> – May 0 , <year> ) was a http://dbpedia.org/resource/Japanese_American_internment author and http://dbpedia.org/resource/Civil_and_political_rights http://dbpedia.org/resource/Activism . He was born in a http://dbpedia.org/resource/Japanese_American_internment camp on May 0 , <year> in http://dbpedia.org/resource/Heart_Mountain_Relocation_Center , http://dbpedia.org/resource/University_of_Wyoming .

Wikipedia Summary w/ Surf. Form Tuples:
http://dbpedia.org/resource/Kiyoshi_Kuromiya ( May 0 , <year> – May 0 , <year> ) was a (u'http://dbpedia.org/resource/Japanese_American_internment', u'Japanese American') author and (u'http://dbpedia.org/resource/Civil_and_political_rights', u'civil rights') (u'http://dbpedia.org/resource/Activism', u'activist') . He was born in a (u'http://dbpedia.org/resource/Japanese_American_internment', u'Japanese American internme

In [9]:
getAligned(72017, True)

Wikipedia Summary w/ URIs:
http://dbpedia.org/resource/Frederick_Hugh_Sherston_Roberts
Frederick Hugh Sherston Roberts VC (8 January <year> – 0 December <year> ) , son of the famous Victoria commander Field Marshal Lord Roberts , was born in Ambala , India , and received the VC , the highest and most prestigious award for gallantry in the face of the enemy that can be awarded to British and Commonwealth forces .

Wikipedia Summary w/ Surf. Form Tuples:
Frederick Hugh Sherston Roberts VC (8 January <year> – 0 December <year> ) , son of the famous Victorian commander Field Marshal Frederick Roberts, 1st Earl Roberts , was born in Umballa , India , and received the Victoria Cross , the highest and most prestigious award for gallantry in the face of the enemy that can be awarded to British and Commonwealth forces .

Triples:
http://dbpedia.org/resource/Frederick_Hugh_Sherston_Roberts http://dbpedia.org/ontology/award http://dbpedia.org/resource/Mention_in_Despatches
http://dbpedia.org/reso