# Experiments with DBpedia
This notebook was originally created for the [Data and Knowledge Engineering](https://mitloehner.com/lehre/dke/) course by Prof. Mitlöhner at the Vienna University of Economics and Business in May 2020.
The present, published version has been edited to focus exclusively on fetching data from the DBpedia SPARQL endpoint ([http://dbpedia.org/sparql](http://dbpedia.org/sparql)) and generating an HTML page from the fetched data. The original version used SQLite for storing and querying the data, which was presented using ipywidgets.

The generated HTML page can be found in `rockets.html` and is also hosted on my [GitHub page](https://pastra98.github.io/rockets/rockets.html).

## Setting up the SPARQL endpoint and finding shared identifiers

In [2]:
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

The code below was used to analyze the intersection of properties between various DBpedia entities, in order to identify properties suitable for selecting the rockets to include. In scope are physical rockets built for launching payloads on orbital or suborbital flights, i.e. sounding rockets as well as human-rated launch vehicles. Fictional space vehicles are excluded, while concepts and proposals are considered valid.

In [3]:
def get_properties(resource_name):
    resource_uri = f"<http://dbpedia.org/resource/{resource_name}>"
    sparql.setQuery(f"""
        SELECT ?property
        WHERE {{ {resource_uri} ?property ?value .  }}
        ORDER BY ?property
        """)
    results = sparql.query().convert()
    return {r["property"]["value"] for r in results["results"]["bindings"]}


def compare_overlaps(r1, r2):
    p1, p2 = get_properties(r1), get_properties(r2)
    shared = p1.intersection(p2)
    only_p1 = p1.difference(p2)
    only_p2 = p2.difference(p1)
    print(80*"/", f"\nComparing {r1} and {r2}")
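    # Join outside the f-strings: reusing the outer quote type or a backslash
    # inside a replacement field is a syntax error before Python 3.12.
    rule = 80 * "-"
    print(f"Shared: {len(shared)}\n" + "\n".join(shared) + f"\n{rule}")
    print(f"Only {r1}: {len(only_p1)}\n" + "\n".join(only_p1) + f"\n{rule}")
    print(f"Only {r2}: {len(only_p2)}\n" + "\n".join(only_p2) + f"\n{rule}")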
    

compare_overlaps("Saturn_V", "Falcon_9")
compare_overlaps("Soyuz_(rocket)", "Proton-M")

//////////////////////////////////////////////////////////////////////////////// 
Comparing Saturn_V and Falcon_9
Shared: 48
http://dbpedia.org/property/partial
http://dbpedia.org/ontology/successfulLaunches
http://dbpedia.org/ontology/wikiPageWikiLink
http://dbpedia.org/property/cpl
http://dbpedia.org/ontology/maidenFlight
http://dbpedia.org/ontology/thumbnail
http://dbpedia.org/property/sites
http://dbpedia.org/ontology/wikiPageRevisionID
http://purl.org/linguistics/gold/hypernym
http://dbpedia.org/property/fail
http://dbpedia.org/property/manufacturer
http://dbpedia.org/property/function
http://dbpedia.org/ontology/finalFlight
http://dbpedia.org/ontology/abstract
http://dbpedia.org/ontology/manufacturer
http://dbpedia.org/property/caption
http://www.w3.org/2002/07/owl#sameAs
http://xmlns.com/foaf/0.1/isPrimaryTopicOf
http://dbpedia.org/ontology/totalLaunches
http://dbpedia.org/property/last
http://dbpedia.org/property/success
http://dbpedia.org/property/first
http://dbpedia.org/prop
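
The same set logic extends beyond pairs of resources. The following is a minimal sketch, assuming the property sets have already been fetched (for example with `get_properties`). The helper name `shared_properties` and the hand-written sample sets are hypothetical and are not part of the original notebook or its query results.

```python
from functools import reduce


def shared_properties(property_sets):
    """Return the properties common to every set, sorted for stable output."""
    sets = [set(s) for s in property_sets]
    if not sets:
        return []
    # Intersect all sets pairwise from left to right.
    return sorted(reduce(set.intersection, sets))


# Hand-written sample data for illustration only, not real query results.
sample = [
    {"http://dbpedia.org/ontology/abstract", "http://dbpedia.org/ontology/maidenFlight"},
    {"http://dbpedia.org/ontology/abstract", "http://dbpedia.org/ontology/thumbnail"},
    {"http://dbpedia.org/ontology/abstract"},
]
common = shared_properties(sample)
print(common)
```

Sorting the result is a choice of this sketch; Python sets have no defined order, so the order of printed properties can otherwise differ between runs.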

Based on these results and further heuristic testing, requiring the dbo:Rocket and wikidata:Q41291 types effectively limits the results to space launch vehicles, while the owl:Thing type ensures that only actual physical rockets are included.
Results must further include an abstract and a thumbnail, and be in the English Wikipedia to prevent duplicate entries.

It proved difficult to filter out compilation-type articles such as lists or families of rockets using exclusively the properties included in Wikipedia.
Filtering on article labels inside the query (e.g. checking whether the label contains "list") significantly slowed down the query and proved unreliable.
Therefore, these articles are filtered out later, in the code that generates the HTML.
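
A label-based check of this kind can be sketched on the Python side as follows. This is a hedged illustration rather than the check used in the notebook (which tests for the substrings "List" and "family"). The function name `is_compilation_article` and the regular expression are assumptions. Matching whole words avoids false positives such as "ballistic", which contains "list".

```python
import re

# Hypothetical pattern: "list(s)" or "family/families" as whole words, any case.
_COMPILATION_PATTERN = re.compile(r"\b(?:lists?|famil(?:y|ies))\b", re.IGNORECASE)


def is_compilation_article(label):
    """Return True if the label looks like a list or rocket-family article."""
    return _COMPILATION_PATTERN.search(label) is not None


for label in ["List of orbital launch systems", "Delta (rocket family)",
              "Ballistic missile", "Saturn V"]:
    print(label, is_compilation_article(label))
```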

## The query
The SPARQL query below is used to obtain a JSON representation of all DBpedia entities that match the rocket criteria described above.
Further relevant properties such as the launch cost, dry mass (converted from grams to tons), height, service status etc. are included as optional properties and are used to extend the profile of each rocket.
Launch sites are also queried and concatenated into a single string.
Results are grouped to prevent double entries and sorted alphabetically.
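
To make the grouping step concrete, here is a small Python analogue of what the aggregation does to duplicate rows. It is only a sketch under stated assumptions: the row layout, the field names (`name`, `height`, `site`) and the placeholder values are invented for illustration, and the real work is done by the SPARQL engine, not by this function.

```python
from collections import defaultdict


def group_rows(rows):
    """Collapse duplicate rows per rocket name.

    Scalar fields keep their smallest value (like MIN), launch sites are
    joined into one string (like GROUP_CONCAT with DISTINCT).
    """
    grouped = defaultdict(lambda: {"heights": [], "sites": set()})
    for row in rows:
        entry = grouped[row["name"]]
        if row.get("height") is not None:
            entry["heights"].append(row["height"])
        if row.get("site"):
            entry["sites"].add(row["site"])
    return [
        {
            "name": name,
            "height": min(entry["heights"]) if entry["heights"] else None,
            # Sorted for reproducibility; this ordering is an assumption of the sketch.
            "launch_sites": ", ".join(sorted(entry["sites"])),
        }
        for name, entry in sorted(grouped.items())
    ]


# Placeholder rows, not real DBpedia data.
rows = [
    {"name": "Rocket A", "height": 70.0, "site": "Site 2"},
    {"name": "Rocket A", "height": 70.0, "site": "Site 1"},
    {"name": "Rocket B", "height": None, "site": None},
]
result = group_rows(rows)
print(result)
```

Without the grouping, a rocket with two launch sites would appear as two rows; after grouping it appears once, with both sites in a single string.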

In [4]:
sparql.setQuery("""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX wikidata: <http://www.wikidata.org/entity/>


SELECT ?rocket_name 
        # using min for deterministic behavior
        (min(?rocket_info_text) as ?rocket_info_text)
        (min(?rocket_image) as ?rocket_image)
        (min(?rocket_link) as ?rocket_link)
        (min(?launch_cost) as ?launch_cost)
        (min(?mass_tons) as ?mass_tons)
        (min(?diameter_meter) as ?diameter_meter)
        (min(?height_meter) as ?height_meter)
        (min(?stages) as ?stages)
        (min(?launch_fails) as ?launch_fails)
        (min(?maiden_flight) as ?maiden_flight)
        (min(?final_flight) as ?final_flight)
        (min(?service_status) as ?service_status)
        (GROUP_CONCAT(DISTINCT ?launch_site_name; separator=", ") as ?launch_sites)
WHERE { 
    ?rocket rdf:type dbo:Rocket ; 
        rdf:type wikidata:Q41291 ; # wikidata for Rocket
        rdf:type owl:Thing ;
        rdfs:label ?rocket_name ;
        dbo:abstract ?rocket_info_text ;
        dbo:thumbnail ?rocket_image ;
        foaf:isPrimaryTopicOf ?rocket_link .

    OPTIONAL {?rocket dbo:rocketStages ?stages}
    OPTIONAL {?rocket dbp:cpl ?launch_cost}
    OPTIONAL {?rocket dbo:failedLaunches ?launch_fails}
    OPTIONAL {?rocket dbo:maidenFlight ?maiden_flight}
    OPTIONAL {?rocket dbo:finalFlight ?final_flight}
    OPTIONAL {?rocket dbo:status ?service_status}
    OPTIONAL {?rocket dbo:mass ?mass}
    OPTIONAL {?rocket dbo:diameter ?diameter_meter}
    OPTIONAL {?rocket dbo:height ?height_meter}

    # Get launch sites
    OPTIONAL {
        ?rocket dbo:launchSite ?launch_site .
        ?launch_site rdfs:label ?launch_site_name .
        FILTER(lang(?launch_site_name) = "en")
    }
    
    # convert mass from grams to tons
    BIND(IF(BOUND(?mass), ?mass / 1000000, ?mass) AS ?mass_tons)
    
    # filtering out non english texts
    FILTER(lang(?rocket_info_text) = "en") .
    FILTER(lang(?rocket_name) = "en")


}

GROUP BY ?rocket_name
ORDER BY ?rocket_name
""")

result_json = sparql.query().convert()
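
The converted result follows the SPARQL JSON results layout: a `head` with the variable names and a list of `bindings`, where variables without a value (unmatched `OPTIONAL` patterns) are simply absent. The sketch below shows one possible way to flatten that structure into plain dictionaries. The helper `flatten_bindings` and the sample data are hypothetical and only mimic the shape of a real response.

```python
def flatten_bindings(result_json):
    """Turn SPARQL JSON bindings into dicts of variable -> value (or None)."""
    variables = result_json["head"]["vars"]
    rows = []
    for binding in result_json["results"]["bindings"]:
        # Variables missing from a binding are reported as None.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


# Hand-written sample in the shape of a SPARQL JSON result, not real data.
sample_result = {
    "head": {"vars": ["rocket_name", "stages"]},
    "results": {
        "bindings": [
            {
                "rocket_name": {"type": "literal", "xml:lang": "en", "value": "Example Rocket"},
                "stages": {
                    "type": "literal",
                    "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                    "value": "2",
                },
            },
            {"rocket_name": {"type": "literal", "xml:lang": "en", "value": "Another Rocket"}},
        ]
    },
}
rows = flatten_bindings(sample_result)
print(rows)
```

Note that values arrive as strings regardless of their datatype, so numeric fields would still need to be converted before any arithmetic.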

## Generating the HTML

Finally, this code generates a full HTML page in which each rocket is presented in a `rocket-box` flexbox of fixed height, with properties listed on the left, the description in the middle and the image (from Wikipedia) on the right.
The JSON data is iterated to generate a rocket-box for each valid result, and the output is written to `rockets.html`.

Again, this page is hosted on my [GitHub page](https://pastra98.github.io/rockets/rockets.html) (the page may take a while to load, as all pictures are loaded at once with no lazy loading or pagination).

In [5]:
# basic css for the rocket boxes
html = """
<!DOCTYPE html>
<html>
<head>
    <title>Rocket Information</title>
    <style>
        body {
            margin: 20px;
        }
        .rocket-box {
            border: 2px solid black;
            margin-bottom: 20px;
            display: flex;
            height: 400px;
        }
        .left-section {
            width: 25%;
            padding: 10px;
            border-right: 1px solid black;
        }
        .middle-section {
            width: 40%;
            padding: 10px;
            border-right: 1px solid black;
        }
        .right-section {
            width: 35%;
            padding: 10px;
            display: flex;
            justify-content: center;
            align-items: center;
        }
        .description-text {
            height: 320px;
            overflow-y: auto;
        }
        img {
            max-width: 100%;
            max-height: 350px;
            object-fit: contain;
        }
    </style>
</head>
<body>
    <h2>Disclaimer</h2>

    <p>All contents contained on this page have been directly extracted from wikipedia,
    subject to the Creative Commons Attribution-ShareAlike License. This page has been
    generated from DBpedia SPARQL queries using code found in
    <a href="https://github.com/pastra98/Rocket_SparQl">this repository</a> for a personal project.
    A link to each individual Wikipedia page from which the information was extracted
    is available with each rocket. Images displayed are also from Wikipedia and are
    subject to their individual Creative Commons licenses as specified on their
    respective Wikipedia pages.</p>

    <p>The information presented here is by no means an attempt to present a comprehensive,
    nor necessarily accurate, list of all Rockets that have ever been proposed, designed or launched.
    Instead, presented are the results of executing a SPARQL query for entities that have the "dbo:Rocket"
    and "wikidata:Q41291" types in the DBpedia ontology.</p>

    <h1>Rocket Information</h1>
"""
# Iterate through rockets and generate HTML, skipping list- and family-type articles
for rocket in result_json['results']['bindings']:
    rocket_name_value = rocket['rocket_name']['value']
    if "List" in rocket_name_value or "family" in rocket_name_value.lower():
        continue
    
    rocket_info = rocket['rocket_info_text']['value']
    rocket_image = rocket['rocket_image']['value']
    rocket_link = rocket['rocket_link']['value']
    
    html += '<div class="rocket-box">'

    # Left section containing title and basic stats
    html += '<div class="left-section">'
    html += f'<h2><a href="{rocket_link}">{rocket_name_value}</a></h2>'
    html += '<div>'
    stats = [
        ('Launch Cost', 'launch_cost'),
        ('Mass (tons)', 'mass_tons'),
        ('Diameter (m)', 'diameter_meter'),
        ('Height (m)', 'height_meter'),
        ('Stages', 'stages'),
        ('Launch Failures', 'launch_fails'),
        ('Maiden Flight', 'maiden_flight'),
        ('Final Flight', 'final_flight'),
        ('Service Status', 'service_status'),
        ('Launch Sites', 'launch_sites')
    ]
    for stat_name, stat_key in stats:
        if stat_key in rocket and 'value' in rocket[stat_key]:
            stat_value = rocket[stat_key]['value']
            html += f'<div><strong>{stat_name}:</strong> {stat_value}</div>'
        else:
            html += f'<div><strong>{stat_name}:</strong> No information</div>'
    html += '</div></div>'
    
    # Middle section containing description
    html += '<div class="middle-section">'
    html += '<h3>Description</h3>'
    html += f'<div class="description-text">{rocket_info}</div></div>'
    
    # Image on the right
    html += '<div class="right-section">'
    if rocket_image:
        html += f'<img src="{rocket_image}" alt="{rocket_name_value}">'
    else:
        html += '<div>No image available</div>'
    html += '</div></div>'
html += """
</body>
</html>
"""

with open("rockets.html", "w", encoding="utf-8") as f:
    f.write(html)
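
The generation code above interpolates the fetched values directly into the markup. As a hedged sketch of one possible refinement, the helpers below escape values with the standard library's `html.escape` and add the standard `loading="lazy"` attribute to images, which may ease the slow loading mentioned earlier. The function names `render_stat` and `render_image` and the example URL are hypothetical and do not appear in the notebook.

```python
from html import escape


def render_stat(label, value=None):
    """Render one stat row, escaping characters that are special in HTML."""
    shown = "No information" if value is None else escape(str(value))
    return f"<div><strong>{escape(label)}:</strong> {shown}</div>"


def render_image(url, alt_text):
    """Render an image tag with escaped attributes and native lazy loading."""
    return (
        f'<img src="{escape(url, quote=True)}" '
        f'alt="{escape(alt_text, quote=True)}" loading="lazy">'
    )


print(render_stat("Stages", "2"))
print(render_stat("Launch Cost"))
print(render_image("http://example.org/rocket.png", 'A "test" rocket'))
```

Escaping matters here because abstracts and labels are free text and may contain characters such as `<`, `&` or quotes that would otherwise break the generated page.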

This project was a lot of fun for me!
While some properties are not precise enough to identify all relevant articles on Wikipedia, I imagine LLM-based approaches for extracting and structuring schema data from Wikipedia texts will be very effective for extending knowledge graphs even further.