<h2>Loading and Inspecting the Full Dataset</h2>

In [1]:
import os
import cPickle as pickle

We load all the pickle files (`*.p`) that reside in the `Full/data/` directory.

In [2]:
dataset_dir = './Full/data/'
dataset = {'item': [], 'original_summary': [], 'summary_with_surf_forms': [], 'triples': []}

for filename in os.listdir(dataset_dir):
    if filename.endswith(".p"):
        tempDatasetFileLocation = os.path.join(dataset_dir, filename)
        with open(tempDatasetFileLocation, 'rb') as tempDatasetFile:
            tempDataset = pickle.load(tempDatasetFile)
            dataset['item'].extend(tempDataset['item'])
            dataset['original_summary'].extend(tempDataset['original_summary'])
            dataset['summary_with_surf_forms'].extend(tempDataset['summary_with_surf_forms'])
            dataset['triples'].extend(tempDataset['triples'])
            print('Successfully loaded dataset file: %s' % (tempDatasetFileLocation))
assert(len(dataset['item']) == len(dataset['original_summary']))
assert(len(dataset['item']) == len(dataset['triples']))
assert(len(dataset['item']) == len(dataset['summary_with_surf_forms']))
print('Total items that have been loaded in the dataset: %d' % (len(dataset['item'])))

Successfully loaded dataset file: ./Full/data/25.p
Successfully loaded dataset file: ./Full/data/9.p
Successfully loaded dataset file: ./Full/data/39.p
Successfully loaded dataset file: ./Full/data/29.p
Successfully loaded dataset file: ./Full/data/28.p
Successfully loaded dataset file: ./Full/data/38.p
Successfully loaded dataset file: ./Full/data/34.p
Successfully loaded dataset file: ./Full/data/8.p
Successfully loaded dataset file: ./Full/data/20.p
Successfully loaded dataset file: ./Full/data/44.p
Successfully loaded dataset file: ./Full/data/37.p
Successfully loaded dataset file: ./Full/data/23.p
Successfully loaded dataset file: ./Full/data/40.p
Successfully loaded dataset file: ./Full/data/42.p
Successfully loaded dataset file: ./Full/data/32.p
Successfully loaded dataset file: ./Full/data/4.p
Successfully loaded dataset file: ./Full/data/30.p
Successfully loaded dataset file: ./Full/data/11.p
Successfully loaded dataset file: ./Full/data/35.p
Successfully loaded dataset file: 

The dataset is loaded as a dictionary of lists that are aligned with each other: the entries at a given index in every list describe the same item. For example, to print all entries for the item in the $254$th position (index 253, since the lists are zero-indexed), we run the following:

In [3]:
index = 253
for key in dataset:
    print(key)
    # Print the aligned triples properly.
    if key == 'triples':
        for triple in dataset[key][index]:
            print(triple)
    else:
        print(dataset[key][index])

original_summary
The cisterna magna (or cerebellomedullaris cistern) is one of three principal openings in the subarachnoid space between the arachnoid and pia mater layers of the meninges surrounding the brain. The openings are collectively referred to as cisterns.
item
http://dbpedia.org/resource/Cisterna_magna
triples
<item> http://dbpedia.org/ontology/brainInfoNumber 0
<item> http://dbpedia.org/ontology/grayPage 0
<item> http://dbpedia.org/ontology/graySubject 0
summary_with_surf_forms
The <item> ( or cerebellomedullaris #surFormToken823661 ) is one of three principal openings in the #surFormToken490331 between the #surFormToken953899 and #surFormToken416940 layers of the #surFormToken177334 surrounding the #surFormToken35668 . The openings are collectively referred to as #surFormToken1359894 .
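
Because the four lists are aligned, the entries for one item can also be gathered into a single dictionary. The following is a minimal sketch on a small hand-made stand-in for `dataset`; the helper name `get_record` and the toy values are assumptions made for illustration and are not part of the released code.

```python
# Toy stand-in with the same keys as the loaded `dataset` (values are made up).
toy_dataset = {
    'item': ['http://dbpedia.org/resource/A', 'http://dbpedia.org/resource/B'],
    'original_summary': ['A is a thing.', 'B is another thing.'],
    'summary_with_surf_forms': ['<item> is a thing .', '<item> is another thing .'],
    'triples': [['<item> http://dbpedia.org/ontology/p 0'], []],
}


def get_record(data, index):
    """Collect the aligned entries at `index` into a single dictionary."""
    return {key: values[index] for key, values in data.items()}


record = get_record(toy_dataset, 1)
print(record['item'])
```

The same helper should work unchanged on the real `dataset`, e.g. `get_record(dataset, 253)`, since it only relies on the lists having equal length.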


The keys of the dictionary are described below:
* `item`: refers to the main entity of each Wikipedia summary.
* `original_summary`: refers to the original Wikipedia summary, prior to any pre-processing.
* `triples`: refers to the list of triples that are associated with the Wikipedia summary.
* `summary_with_surf_forms`: refers to the Wikipedia summary after the realisations of the identified entities have been replaced by their corresponding *surface form tuple* tokens. The realisations of the entities are identified with [DBpedia Spotlight](https://github.com/dbpedia-spotlight/dbpedia-spotlight-model).

Any reference to the main entity in both the `triples` and the `summary_with_surf_forms` is replaced by the `<item>` token.
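
If the original subject is needed, the `<item>` placeholder can be substituted back with the URI stored under `item`. This is a minimal sketch, assuming each triple is a plain string as in the printed output above; the function name `restore_item` is hypothetical.

```python
def restore_item(text, item_uri):
    """Replace every `<item>` placeholder in `text` with the main entity's URI."""
    return text.replace('<item>', item_uri)


# Values taken from the example output above.
item_uri = 'http://dbpedia.org/resource/Cisterna_magna'
triple = '<item> http://dbpedia.org/ontology/grayPage 0'

restored_triple = restore_item(triple, item_uri)
print(restored_triple)
```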

Tokens such as `#surFormToken101` and `#surFormToken103` are used as placeholders for the entities' surface form tuples.

<h2>Loading the Supporting Dictionaries</h2>

Below, we load the supporting dictionaries that reside in `Full/utils/`.

In [4]:
surf_forms_tokens_location = './Full/utils/Surface-Form-Tokens.p'
with open(surf_forms_tokens_location, 'rb') as f:
    surf_forms_tokens = pickle.load(f)
    print('Successfully loaded the surface form tokens file at: %s' % surf_forms_tokens_location)

Successfully loaded the surface form tokens file at: ./Full/utils/Surface-Form-Tokens.p


An example of the structure of the dictionary that maps surface form tuples to their corresponding tokens in `summary_with_surf_forms` (e.g. `#surFormToken103`) is presented below.
```python
surf_forms_tokens = {(u'http://dbpedia.org/resource/Science_fiction', u'science fiction'): '#surFormToken5740',
                     (u'http://dbpedia.org/resource/Science_fiction', u'sci-fi'): '#surFormToken22979',
                     (u'http://dbpedia.org/resource/Science_fiction', u'science-fiction'): '#surFormToken109715',
                     (u'http://dbpedia.org/resource/United_States', u'American'): '#surFormToken212',
                     ...,
                     (u'http://dbpedia.org/resource/United_States', u'U.S.'): '#surFormToken1416'}
```
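
One use of this dictionary is to turn a tokenised summary back into readable text. The sketch below inverts a toy excerpt of the mapping and substitutes each token with its surface form. It assumes that every token belongs to exactly one (URI, surface form) tuple; the names `token2surf_form` and `realise` and the example sentence are made up for illustration.

```python
import re

# Toy excerpt with the same structure as `surf_forms_tokens`.
toy_surf_forms_tokens = {
    (u'http://dbpedia.org/resource/Science_fiction', u'sci-fi'): '#surFormToken22979',
    (u'http://dbpedia.org/resource/United_States', u'American'): '#surFormToken212',
}

# Invert the mapping: token -> (entity URI, surface form).
token2surf_form = {token: pair for pair, token in toy_surf_forms_tokens.items()}


def realise(summary, mapping):
    """Replace each known token with its surface form; leave unknown tokens untouched."""
    def substitute(match):
        token = match.group(0)
        return mapping[token][1] if token in mapping else token
    return re.sub(r'#surFormToken\d+', substitute, summary)


sentence = '<item> is an #surFormToken212 #surFormToken22979 novel .'  # made-up sentence
realised = realise(sentence, token2surf_form)
print(realised)
```

Matching whole tokens with a regular expression, rather than calling `str.replace` per token, avoids `#surFormToken21` accidentally matching the prefix of `#surFormToken212`.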

In [5]:
surf_form_counts_location = './Full/utils/Surface-Forms-Counts.p'
with open(surf_form_counts_location, 'rb') as f:
    surf_form_counts = pickle.load(f)
    print('Successfully loaded surface form counts file at: %s' % surf_form_counts_location)

Successfully loaded surface form counts file at: ./Full/utils/Surface-Forms-Counts.p


An example of the structure of the dictionary that logs the frequency with which realisations have been associated with entity URIs in the Full dataset is displayed below:
```python
surf_form_counts = {'http://dbpedia.org/resource/Albert_Einstein': {'Albert Einstein': 142, 'Einstein': 108},
                    'http://dbpedia.org/resource/Actor': {'actor': 21638, 'artists': 16688}, 
                    ...}
```
According to the above example, for each entity $k_d \in K$ (e.g. `dbr:Albert_Einstein`), we get a dictionary that maps all its relevant realisations in the text $g_{1 \ldots R}^{k_d}$ (e.g. "Albert Einstein") to their corresponding frequency of occurrence $z_{1 \ldots R}^{k_d}$ (e.g. the number 142).
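
As an illustration of how these counts can be used, the sketch below picks the most frequent realisation of an entity and its relative frequency $z_r^{k_d} / \sum_{j=1}^{R} z_j^{k_d}$. The helper name `most_frequent_realisation` is an assumption, and this is only one possible way to consume the dictionary, not necessarily how it is used elsewhere in the project.

```python
# Toy excerpt with the same structure as `surf_form_counts`.
toy_surf_form_counts = {
    'http://dbpedia.org/resource/Albert_Einstein': {'Albert Einstein': 142, 'Einstein': 108},
}


def most_frequent_realisation(counts, entity):
    """Return the realisation with the highest count and its relative frequency."""
    realisations = counts[entity]
    total = float(sum(realisations.values()))
    best = max(realisations, key=realisations.get)
    return best, realisations[best] / total


best, share = most_frequent_realisation(
    toy_surf_form_counts, 'http://dbpedia.org/resource/Albert_Einstein')
print(best)
```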

<h3>Loading the English Labels and Instance Types Dictionaries</h3>

Note that a system with at least 24 GB of memory is required to load the dictionaries below together with the dataset files above.

In [6]:
labels_location = './Full/utils/Labels.p'
with open(labels_location, 'rb') as f:
    entity2label = pickle.load(f)
    print('Successfully loaded the English labels file at: %s' % labels_location)

Successfully loaded the English labels file at: ./Full/utils/Labels.p


An example of the structure of the dictionary that maps each entity to its English label (realisation), stored as a single-element list, is presented below:
```python
entity2label = {'http://dbpedia.org/resource/Albert_Einstein': [u'Albert Einstein'],
                'http://dbpedia.org/resource/John_Galt': [u'John Galt'],
                ...}
```

The entities' labels are provided by DBpedia at: [http://downloads.dbpedia.org/2016-10/core-i18n/en/labels_en.ttl.bz2](http://downloads.dbpedia.org/2016-10/core-i18n/en/labels_en.ttl.bz2)

In [7]:
instance_types_location = './Full/utils/Instance-Types.p'
with open(instance_types_location, 'rb') as f:
    entity2type = pickle.load(f)
    print('Successfully loaded the instance types file at: %s' % instance_types_location)

Successfully loaded the instance types file at: ./Full/utils/Instance-Types.p


An example of the structure of the dictionary that maps entities to their corresponding instance type is presented below:
```python
entity2type = {'http://dbpedia.org/resource/Wetter_(Ruhr)': u'http://dbpedia.org/ontology/Town',
               'http://dbpedia.org/resource/London': u'http://dbpedia.org/ontology/Settlement',
               'http://dbpedia.org/resource/Lionel_Messi': u'http://dbpedia.org/ontology/SoccerPlayer',
               ...}
```

The instance types of the entities are provided by DBpedia at: [http://downloads.dbpedia.org/2016-10/core-i18n/en/instance_types_en.ttl.bz2](http://downloads.dbpedia.org/2016-10/core-i18n/en/instance_types_en.ttl.bz2)
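
The two dictionaries can be queried together to describe an entity. The sketch below uses toy excerpts and falls back to `None` when an entity is missing from either dictionary. The helper name `describe` is hypothetical, and whether every entity appears in both dictionaries is not stated here, so the fallback is a precaution.

```python
# Toy excerpts with the same structure as `entity2label` and `entity2type`.
toy_entity2label = {
    'http://dbpedia.org/resource/London': [u'London'],
    'http://dbpedia.org/resource/John_Galt': [u'John Galt'],
}
toy_entity2type = {
    'http://dbpedia.org/resource/London': u'http://dbpedia.org/ontology/Settlement',
}


def describe(entity, labels, types):
    """Return (label, instance type) for `entity`, with None for missing entries."""
    label_list = labels.get(entity)
    label = label_list[0] if label_list else None
    return label, types.get(entity)


london = describe('http://dbpedia.org/resource/London', toy_entity2label, toy_entity2type)
galt = describe('http://dbpedia.org/resource/John_Galt', toy_entity2label, toy_entity2type)
print(london)
```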