## How Much is a Billion Dollars?

This notebook accompanies Jacopo Tagliabue's post "How Much is a Billion Dollars?". It is shared as hacky code that gets the job done (e.g. getting facts out of DBpedia) ;-)

This notebook should run smoothly on a standard Python 3 Anaconda kernel + networkx: please refer to the full blog post (and the README of this repo) for more details.

### Globals, imports and utilities

In [None]:
import os
from math import log
from collections import defaultdict
import re
from datetime import datetime
import json
import networkx as nx   # https://networkx.github.io/documentation/networkx-1.10/tutorial/tutorial.html

In [None]:
# get data folder from relative position of this notebook and setup the path to dbpedia files
DATA_FOLDER = '{}/data'.format(os.path.abspath('..')) 
print("Data is in {}".format(DATA_FOLDER))
DB_LITERALS_FILE = '{}/mappingbased_literals_en.tql'.format(DATA_FOLDER)
DB_TRIPLE_FILE = '{}/mappingbased_objects_en.tql'.format(DATA_FOLDER)
JSON_OUTPUT_FILE = '{}/knowledge_base.json'.format(DATA_FOLDER)

### Parse DBpedia properties
We first parse DBpedia properties from the ontology file. Since "range" and "domain" in the RDF collection may be misleading, we rely on string matching against a list of units of measure to extract a starting subset of _numerical_ properties (needless to say, the logic here could be massively improved).

In [None]:
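def parse_dbpedia_triple(line):
    # literal triples look like: <subject> <property> "value"^^<datatype> <context> .
    # group 3 starts right after the opening quote and runs up to the first '>', so for
    # typed literals it still carries the closing quote and the datatype prefix
    match_obj = re.match( r'<(.*?)> <(.*?)> \"(.*?)>.', line)
    if match_obj:
        return [match_obj.group(_) for _ in range(1, 4)]

    return None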

In [None]:
test_line = '<http://dbpedia.org/resource/Audi> <http://dbpedia.org/ontology/assets> "1.6832E10"^^<http://dbpedia.org/datatype/euro> <http://en.wikipedia.org/wiki/Audi?oldid=744301837#absolute-line=28&template=Infobox_company&property=assets&split=1&wikiTextSize=62&plainTextSize=22&valueSize=44> .'
parse_dbpedia_triple(test_line)

In [None]:
def parse_dbpedia_ontology(line):
    # object triples look like: <subject> <property> <object> <context> .
    match_obj = re.match( r'<(.*?)> <(.*?)> <(.*?)>.', line)
    if match_obj:
        return [match_obj.group(_) for _ in range(1, 4)]
    
    return None

In [None]:
test_line_2 = '<http://dbpedia.org/resource/Anarchism> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://dbpedia.org/resource/Anarchist_terminology> <http://en.wikipedia.org/wiki/Anarchism?oldid=744318951#section=Etymology_and_terminolog&relative-line=2&absolute-line=23&template=Related_articles&property=1&split=1&wikiTextSize=21&plainTextSize=21&valueSize=21> .'
parse_dbpedia_ontology(test_line_2)

In [None]:
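def parse_dbpedia_value(val):
    # keep only the lexical form: drop the closing quote and whatever follows it
    val = val.split('"')[0].strip()
    try:
        # plain numbers, including negatives and scientific notation such as 1.0E-5
        return float(val)
    except ValueError:
        # dates such as 1909-07-16: keep the year only
        return float(val.split('-')[0])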

In [None]:
parse_dbpedia_value(parse_dbpedia_triple(test_line)[2])

By inspecting frequent properties with units of measure that interest us (e.g. seconds, km2, $, etc.), we compiled a list of DBpedia properties to guide our fact extraction.
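
One way to carry out this kind of inspection is to count how often each property appears with each datatype. The sketch below is an assumption about how that could be done, not the code used for the post: `TYPED_LITERAL`, `count_property_units` and the `example.org` lines are made up for illustration, and the Audi line is the test line from above with its context URL shortened.

```python
import re
from collections import Counter

# Hypothetical pattern: captures the property URI and the datatype URI of a typed literal.
TYPED_LITERAL = re.compile(r'<[^>]*> <([^>]*)> "[^"]*"\^\^<([^>]*)>')


def count_property_units(lines):
    """Count (property, datatype) pairs over an iterable of .tql lines."""
    counts = Counter()
    for line in lines:
        match = TYPED_LITERAL.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample_lines = [
    # test line from this notebook, context URL shortened
    '<http://dbpedia.org/resource/Audi> <http://dbpedia.org/ontology/assets> '
    '"1.6832E10"^^<http://dbpedia.org/datatype/euro> <http://en.wikipedia.org/wiki/Audi> .',
    # made-up lines, only here to exercise the counter
    '<http://example.org/a> <http://example.org/length> "12.0"^^<http://example.org/metre> <http://example.org/ctx> .',
    '<http://example.org/b> <http://example.org/length> "7.5"^^<http://example.org/metre> <http://example.org/ctx> .',
]

unit_counts = count_property_units(sample_lines)
for (prop, unit), n in unit_counts.most_common():
    print(n, prop, unit)
```

On the real dump, the same function could be fed the open file handle for `DB_LITERALS_FILE`; lines whose literal has no `^^<datatype>` suffix are simply skipped.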

In [None]:
TARGET_PROPERTIES = { 
    'http://dbpedia.org/ontology/runtime': {
        'p': 'http://dbpedia.org/ontology/runtime',
        'd': 'runtime',
        'u': 'y', # originally these are seconds, so conversion is needed
        't': lambda x: x / (60.0 * 60.0 * 24.0 * 365.0),
        'm': lambda x: x < 100
    },
    'http://dbpedia.org/ontology/populationDensity': {
        'p': 'http://dbpedia.org/ontology/populationDensity',
        'd': 'population density',
        'u': 'person/km2',
        't': None,
        'm': None
    },
    'http://dbpedia.org/ontology/populationTotal': {
        'p': 'http://dbpedia.org/ontology/populationTotal',
        'd': 'people',
        'u': 'person',
        't': None,
        'm': lambda x: x < 10000000.0
    },
    'http://dbpedia.org/ontology/areaTotal': {
        'p': 'http://dbpedia.org/ontology/areaTotal',
        'd': 'total area',
        'u': 'km2', # originally these are m2, so conversion is needed
        't': lambda x: x / 1000000.0,
        'm': None
    },
    'http://dbpedia.org/ontology/careerPrizeMoney': {
        'p': 'http://dbpedia.org/ontology/careerPrizeMoney',
        'd': 'career money',
        'u': '$',
        't': None,
        'm': lambda x: x < 10000000
    },
    'http://dbpedia.org/ontology/tuition': {
        'p': 'http://dbpedia.org/ontology/tuition',
        'd': 'tuition',
        'u': '$/y',
        't': None,
        'm': lambda x: x < 1000000 and x > 100
    },
    'http://dbpedia.org/ontology/networth': {
        'p': 'http://dbpedia.org/ontology/networth',
        'd': 'net worth',
        'u': '$',
        't': None,
        'm': lambda x: x < 10000000 and x > 1000
    },
    'http://dbpedia.org/ontology/salary': {
        'p': 'http://dbpedia.org/ontology/salary',
        'd': 'salary',
        'u': '$/y',
        't': None,
        'm': lambda x: x < 500000 and x > 100
    },
    'http://dbpedia.org/ontology/foundingDate': {
        'p': 'http://dbpedia.org/ontology/foundingDate',
        'd': 'years of history',
        'u': 'y', # originally this is a date, so conversion is needed
        't': lambda x: datetime.today().year - x,
        'm': lambda x: x < 100
    }   
}
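
To make the role of the `t` (transform) and `m` (filter) entries concrete, here is a minimal sketch that applies them to a single raw value. `runtime_meta` copies the relevant keys of the runtime entry above, while `apply_meta` is a hypothetical helper introduced only for this illustration.

```python
# Copy of the 'u', 't' and 'm' keys of the runtime entry, so the block runs on its own.
runtime_meta = {
    'u': 'y',
    't': lambda x: x / (60.0 * 60.0 * 24.0 * 365.0),  # seconds -> years
    'm': lambda x: x < 100,                           # keep only values below 100 years
}


def apply_meta(raw_value, meta):
    """Return the converted value, or None when the filter rejects it (hypothetical helper)."""
    value = meta['t'](raw_value) if meta['t'] else raw_value
    if meta['m'] and not meta['m'](value):
        return None
    return value


two_hours = apply_meta(7200.0, runtime_meta)               # a two-hour runtime, in years
too_long = apply_meta(200 * 31536000.0, runtime_meta)      # 200 years in seconds, rejected
untouched = apply_meta(5.0, {'t': None, 'm': None})        # no transform, no filter
print(two_hours, too_long, untouched)
```

Note that the filter is evaluated on the converted value, not on the raw one, so thresholds are expressed in the target unit `u`.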

Collect a mapping from resource URI to name (the `foaf:name` label) for convenience.

In [None]:
uri2name = {}
with open(DB_LITERALS_FILE) as my_triples:
    for line in my_triples:
        elements = parse_dbpedia_triple(line)
        if elements and elements[1] == 'http://xmlns.com/foaf/0.1/name':
            uri2name[elements[0]] = elements[2].split('"')[0].strip()
            
len(uri2name)

Next, parse the SUB-PROP-OBJ triple store to induce a graph over DBpedia and compute PageRank over entities.

In [None]:
db_graph = nx.DiGraph()
with open(DB_TRIPLE_FILE) as my_triples:
    for line in my_triples:
        elements = parse_dbpedia_ontology(line)
        if not elements:
            continue
        # add SUB-OBJ as nodes
        db_graph.add_node(elements[0])
        db_graph.add_node(elements[2])
        # add edge
        db_graph.add_edge(elements[0], elements[2])

print(len(db_graph))
db_page_rank = nx.pagerank(db_graph, max_iter=20000)
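
For intuition about what the `nx.pagerank` call above computes, the following is a simplified power-iteration sketch in plain Python. It is not the networkx implementation: `toy_pagerank`, the fixed iteration count and the four made-up nodes are assumptions for illustration, and the damping factor 0.85 is the conventional default.

```python
def toy_pagerank(edges, damping=0.85, iterations=100):
    """Simplified power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # start from a uniform distribution
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # rank held by nodes without outgoing edges is spread uniformly
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping + damping * dangling) / len(nodes) for n in nodes}
        # every node passes a damped, equal share of its rank to each node it links to
        for n in nodes:
            for target in out_links[n]:
                new_rank[target] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank


# Made-up graph: three resources link to 'hub', which links back to 'a'.
edges = [('a', 'hub'), ('b', 'hub'), ('c', 'hub'), ('hub', 'a')]
ranks = toy_pagerank(edges)
print(max(ranks, key=ranks.get))
```

Entities that many other entities point to end up with a higher score, which is why the score is later used as a rough proxy for how well known an entity is.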

Parse the triple store to collect, for each target property, a collection of facts.

In [None]:
properties_collection = defaultdict(list)
properties_running_count = defaultdict(set)
with open(DB_LITERALS_FILE) as my_triples:
    for line in my_triples:
        elements = parse_dbpedia_triple(line)
        if not elements or elements[0] not in uri2name or elements[1] not in TARGET_PROPERTIES:
            continue
        prop = elements[1]
        property_owner = elements[0]
        property_owner_label = uri2name[elements[0]]
        property_value = parse_dbpedia_value(elements[2])
        # reject if property value == 0 or owner already added
        if property_value == 0.0 or property_owner in properties_running_count[prop]:
            continue
        # add to running count to avoid duplicates
        properties_running_count[prop].add(property_owner)
        property_meta = TARGET_PROPERTIES[prop]
        property_final_value = property_value if not property_meta['t'] else property_meta['t'](property_value)
        if property_meta['m'] and not property_meta['m'](property_final_value):
            continue
            
        new_fact = {
           'q': property_final_value,
           'd': '{}({})'.format(property_meta['d'], property_owner_label),
           'u': property_meta['u'],
           'p': prop,
           'r': db_page_rank.get(property_owner, 0.0)
        }
        # add to list
        properties_collection[prop].append(new_fact)

print(sum([len(p) for k, p in properties_collection.items()]))

In [None]:
print(sorted(properties_collection['http://dbpedia.org/ontology/networth'], 
             key=lambda i: i['r'], 
             reverse=True)[:10])

### Select and Export
Loop over the facts (as grouped by target property) and export subsets to be re-used in different WebPPL models.

In [None]:
# first dump all JSON
with open(JSON_OUTPUT_FILE, 'w') as outfile:  
    json.dump(properties_collection, outfile)

In [None]:
def dump_mini_json(prop_collection, top_n):
    # for each property, keep only the top_n facts by PageRank, to be used with simplified models
    mini_properties_collection = {}
    for p, p_list in prop_collection.items():
        sorted_list = sorted(p_list, key=lambda i: i['r'], reverse=True)[:top_n]
        mini_properties_collection[p] = sorted_list
    all_facts = []
    all_priors = []
    for p, p_list in mini_properties_collection.items():
        all_facts.extend(p_list)
        all_priors.extend([log(len(p_list) - _) for _ in range(len(p_list))])
        
    categorical_distribution = {
          'ps': all_priors,
          'vs': [_ for _ in range(len(all_facts))]
        }
    with open('{}/distribution_top_{}.json'.format(DATA_FOLDER, top_n), 'w') as outfile:  
        json.dump(categorical_distribution, outfile)
    with open('{}/fact_metadata_top_{}.json'.format(DATA_FOLDER, top_n), 'w') as outfile:  
        json.dump(all_facts, outfile)
    

In [None]:
dump_mini_json(properties_collection, 10)
dump_mini_json(properties_collection, 15)
dump_mini_json(properties_collection, 100)
dump_mini_json(properties_collection, 200)
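
The `ps` weights written by `dump_mini_json` depend only on a fact's position in its PageRank-sorted list. A small sketch, using the same `log(len(p_list) - _)` expression on a list of three facts, shows their shape; `rank_weights` is a hypothetical name used only here.

```python
from math import log


def rank_weights(n_facts):
    """Weights for n_facts facts already sorted by descending PageRank."""
    return [log(n_facts - i) for i in range(n_facts)]


weights = rank_weights(3)
print(weights)
```

The weights decrease with rank, and the last fact of each list always gets `log(1)`, i.e. 0. How that zero is treated depends on how the downstream WebPPL model interprets `ps`, so it is worth checking if the lowest-ranked fact is expected to be sampled.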