In [1]:
import SPARQLWrapper
import rdflib
import pandas as pd
import tqdm
from mlutil.semantic_web_utils import *
from mlutil import semantic_web_utils

# Querying DBPedia with SPARQL

The code for querying DBPedia can be found in the [lambdaofgod/mlutil semantic_web_utils file](https://github.com/lambdaofgod/mlutil/blob/master/mlutil/semantic_web_utils.py).

We find DBPedia entities related to machine learning.

To do this, we first find concepts related to ML and then query for triples of the form
```
?entity ?property ?ml_entity_name 
```
Here `?ml_entity_name` is drawn from a prespecified list, and `?property` captures the relationship: the entity either has the ML concept as its subject or is connected to it by Wikipedia links.
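
The triple pattern above can be bound to a fixed list of concepts with a SPARQL `VALUES` clause. The following is a minimal sketch, not the code from mlutil: `build_related_entities_query` is a hypothetical helper, and the prefixed names assume an endpoint that predefines `dbc:`, `dbo:` and `dct:`.

```python
def build_related_entities_query(entity_names, properties):
    """Build a SPARQL query matching ?entity ?property ?ml_entity_name.

    Hypothetical helper for illustration; not taken from mlutil.
    """
    if not entity_names or not properties:
        raise ValueError("entity_names and properties must be non-empty")
    # VALUES restricts a variable to an explicit list of terms.
    entity_values = " ".join(entity_names)
    property_values = " ".join(properties)
    return (
        "SELECT DISTINCT ?entity ?property ?ml_entity_name\n"
        "WHERE {\n"
        f"    VALUES ?ml_entity_name {{ {entity_values} }}\n"
        f"    VALUES ?property {{ {property_values} }}\n"
        "    ?entity ?property ?ml_entity_name .\n"
        "}"
    )


example_query = build_related_entities_query(
    ["dbc:Machine_learning", "dbc:Deep_learning"],
    ["dct:subject", "dbo:wikiPageWikiLink"],
)
print(example_query)
```

Keeping the concept list in Python and rendering it into `VALUES` makes it easy to extend the list without touching the query text.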


# Notes

The `rdf:type` values are unreliable, for example:
- Similarity learning - Place
- Machine learning - Thing, Disease, MusicGenre

In [2]:
query = '''
SELECT ?child ?label ?abstract
WHERE {
    {
        ?child skos:broader dbc:Machine_learning . 
        ?child rdfs:label ?label .
    }
UNION {
        ?child dct:subject dbc:Machine_learning .
        ?child rdfs:label ?label .
        ?child dbo:abstract ?abstract .
        FILTER (lang(?abstract) = 'en')
    }
UNION {
        ?child dbo:wikiPageWikiLink dbc:Machine_learning . 
        ?child rdfs:label ?label .
        ?child dbo:abstract ?abstract .
}
}'''

In [3]:
%%time
res = get_sparql_results(query)

CPU times: user 91.2 ms, sys: 28.2 ms, total: 119 ms
Wall time: 2.63 s
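
`get_sparql_results` comes from mlutil and its implementation is not shown here. As a rough stand-in, a query can be sent with the standard library alone by following the SPARQL 1.1 Protocol (query text in the `query` URL parameter, JSON requested through the `Accept` header). The endpoint URL and the names `build_sparql_request` and `run_sparql_query` are assumptions made for this sketch.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia endpoint; the notebook does not say which one mlutil uses.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_sparql_request(query, endpoint=DBPEDIA_ENDPOINT):
    """Build a GET request that carries the query in the URL."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_sparql_query(query, endpoint=DBPEDIA_ENDPOINT, timeout=30):
    """Send the query and return the decoded JSON results (needs network access)."""
    request = build_sparql_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the request is built here, so this example runs offline.
request = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
print(request.full_url)
```

Very long queries may exceed URL length limits with GET; the protocol also allows sending the query in a POST body.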


In [4]:
len(res['results']['bindings'])

6679
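
Each element of `bindings` maps a variable name to an object with a `type` and a `value`, as defined by the SPARQL JSON results format. A small sketch of flattening such results into plain records follows; `flatten_bindings` is a hypothetical name, and the sample input is hand-written for illustration rather than taken from the endpoint.

```python
def flatten_bindings(sparql_json):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    variables = sparql_json["head"]["vars"]
    return [
        # Unbound variables (e.g. a missing ?abstract) become None.
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in sparql_json["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results shape.
sample = {
    "head": {"vars": ["child", "label", "abstract"]},
    "results": {
        "bindings": [
            {
                "child": {"type": "uri", "value": "http://dbpedia.org/resource/Word2vec"},
                "label": {"type": "literal", "xml:lang": "en", "value": "Word2vec"},
            }
        ]
    },
}
rows = flatten_bindings(sample)
print(rows)
```

Handling unbound variables matters for this query, because the first `UNION` branch never binds `?abstract`.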

Query for selecting entities that link to the Machine learning page and have abstracts:

In [5]:
print(semantic_web_utils._make_relation_to_selected_subject_where_phrase("dbo:wikiPageWikiLink", "dbc:Machine_learning"))

{
        ?child dbo:wikiPageWikiLink dbc:Machine_learning .
        FILTER (lang(?label) = 'en') .
        ?child rdfs:label ?label .
        ?child dbo:abstract ?abstract  .
        FILTER (lang(?abstract) = 'en') .
        ?child rdf:type ?child_type .
    }
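
A helper producing a group pattern like the one printed above could look roughly like this. It is a hypothetical re-creation for illustration (the real `_make_relation_to_selected_subject_where_phrase` in mlutil may differ), and `make_union_query` is an added assumption showing how several such patterns might be combined.

```python
def make_relation_where_phrase(relation, subject):
    """Return a SPARQL group pattern for children related to `subject`.

    Hypothetical re-creation of the printed pattern; the mlutil helper may differ.
    """
    return f"""{{
        ?child {relation} {subject} .
        ?child rdfs:label ?label .
        FILTER (lang(?label) = 'en') .
        ?child dbo:abstract ?abstract .
        FILTER (lang(?abstract) = 'en') .
        ?child rdf:type ?child_type .
    }}"""


def make_union_query(relation_subject_pairs):
    """Join one group pattern per (relation, subject) pair with UNION."""
    phrases = [make_relation_where_phrase(r, s) for r, s in relation_subject_pairs]
    body = "\nUNION\n".join(phrases)
    return f"SELECT ?child ?label ?abstract ?child_type\nWHERE {{\n{body}\n}}"


union_query = make_union_query(
    [
        ("dct:subject", "dbc:Machine_learning"),
        ("dbo:wikiPageWikiLink", "dbc:Machine_learning"),
    ]
)
print(union_query)
```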


## Machine-learning related entries

In this query we retrieve entities that either:
1) are narrower than Machine learning (`skos:broader`)
2) have Machine learning as their subject (`dct:subject`)
3) link to Machine learning on Wikipedia (`dbo:wikiPageWikiLink`)

For entries other than 1), we also filter out those that do not have an abstract.

In [6]:
ml_properties_query = f"""
SELECT DISTINCT ?property ?property_label
WHERE {{
    ?object ?property ?ml  .
    ?ml rdfs:label 'Machine learning'@en .
    ?property rdfs:label ?property_label .
    FILTER (lang(?property_label) = 'en')
}}
"""

In [7]:
len(get_sparql_results(ml_properties_query)['results']['bindings'])

24

In [8]:
ml_related_query = f"""
SELECT DISTINCT ?object ?object_label
WHERE {{
    ?object dbo:wikiPageRedirects  ?ml  .
    ?ml rdfs:label 'Machine learning'@en .
    ?object rdfs:label ?object_label .
    FILTER (lang(?object_label) = 'en')
}}
"""

In [9]:
ml_related_results = get_sparql_results(ml_related_query)['results']['bindings']

In [10]:
ml_related_records = [res['object_label']['value'] for res in ml_related_results]

In [11]:
ml_related_records

['Adaptive machine learning',
 'Machine-learned',
 'Machine-learning',
 'Machine learning algorithm',
 'Machine learning applied',
 'Embedded Machine Learning',
 'Ethics in machine learning',
 'Feature discovery',
 'Self-learning (machine learning)',
 'Self-teaching computer',
 'Automatic learning algorithms',
 'Applications of machine learning',
 'Applying machine learning',
 'Applied machine learning',
 'Ethics of machine learning',
 'Machine Learning',
 'History of machine learning',
 'List of open-source machine learning software',
 'List of machine learning software',
 'Genetic algorithms for machine learning',
 'Computer machine learning',
 'Statistical learning',
 'Ethical issues in machine learning',
 'Strengthening learning algorithms',
 'Learning algorithm',
 'Learning algorithms',
 'Learning machine']

In [12]:
q = f"""
SELECT ?label
WHERE
{{dbc:Machine_learning rdfs:label ?label}}
"""

In [13]:
ml_related_records_names = [n.replace(" ", "_") for n in ml_related_records if "(" not in n]
entity_names = [
    "Machine_learning", "Deep_learning", "Statistical_learning_theory",
    "Natural_language_processing", "Computer_vision", "Time_series",
    "Reinforcement_learning", "Neural_networks", "Feature_engineering",
    "Supervised_learning", "Unsupervised_learning", "Pattern_recognition",
    "Learning_algorithm"
]

entities = [
    pref + name
    for pref in ["dbc:", "dbr:"]
    for name in entity_names
]

results = get_related_concepts_results(entities)

  0%|          | 0/26 [00:00<?, ?it/s]

In [14]:
raw_df = make_dataframe_from_results(results)

In [15]:
len(raw_df)

5717

In [16]:
len(filter_out_people(raw_df))

3560
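
How `filter_out_people` decides what counts as a person is not shown in this notebook. One plausible approach, sketched below on plain dictionaries, is to drop records whose types include the DBpedia ontology class `Person`. The function names, the assumption that `child_type` is a space-separated string of type URIs, and the sample records are all made up for this illustration.

```python
PERSON_TYPE = "http://dbpedia.org/ontology/Person"


def is_person(record):
    """True if the record's space-separated child_type lists dbo:Person."""
    return PERSON_TYPE in record["child_type"].split()


def filter_out_people_sketch(records):
    """Keep only records that are not typed as a person (illustrative only)."""
    return [record for record in records if not is_person(record)]


# Made-up records, not actual query results.
records = [
    {"label": "Example concept", "child_type": "http://dbpedia.org/ontology/Software"},
    {
        "label": "Example person",
        "child_type": "http://dbpedia.org/ontology/Person http://xmlns.com/foaf/0.1/Person",
    },
]
kept = filter_out_people_sketch(records)
print([record["label"] for record in kept])
```

Since the `rdf:type` values are noisy, a filter like this can both miss people and remove non-people.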

In [17]:
raw_df.drop_duplicates(subset="label")

| | child | label | abstract | child_type |
|---|---|---|---|---|
| 0 | http://dbpedia.org/resource/Word2vec | Word2vec | Word2vec is a technique for natural language p... | http://dbpedia.org/ontology/Band http://dbpedi... |
| 1 | http://dbpedia.org/resource/Rademacher_complexity | Rademacher complexity | In computational learning theory (machine lear... | http://dbpedia.org/class/yago/YagoPermanentlyL... |
| 2 | http://dbpedia.org/resource/Random_indexing | Random indexing | Random indexing is a dimensionality reduction ... | http://dbpedia.org/ontology/Software http://db... |
| 3 | http://dbpedia.org/resource/Similarity_learning | Similarity learning | Similarity learning is an area of supervised m... | http://dbpedia.org/ontology/Place http://dbped... |
| 4 | http://dbpedia.org/resource/Cross-entropy_method | Cross-entropy method | The cross-entropy (CE) method is a Monte Carlo... | http://dbpedia.org/class/yago/WikicatHeuristic... |
| ... | ... | ... | ... | ... |
| 5709 | http://dbpedia.org/resource/Revoicer | Revoicer | A revoicer provides communication assistance b... | http://dbpedia.org/class/yago/Abstraction10000... |
| 5710 | http://dbpedia.org/resource/Currency-counting_... | Currency-counting machine | A currency-counting machine is a machine that ... | http://dbpedia.org/ontology/Software |
| 5712 | http://dbpedia.org/resource/Multivariate_optic... | Multivariate optical computing | Multivariate optical computing, also known as ... | http://dbpedia.org/ontology/ProgrammingLanguage |
| 5713 | http://dbpedia.org/resource/Transcomputational... | Transcomputational problem | In computational complexity theory, a transcom... | http://dbpedia.org/ontology/Disease |


In [18]:
%%time
raw_ml_subrecords_results = get_sparql_results(query)

CPU times: user 95.5 ms, sys: 31.8 ms, total: 127 ms
Wall time: 3.32 s


In [19]:
results = raw_ml_subrecords_results['results']['bindings']
len(results)

6679

## DBPedia - two types

DBPedia contains two entities for a broad concept:
- a resource for the Wikipedia article itself
- a category (URI with the `Category:` prefix) for the Wikipedia category page, which links to more specific concepts

In [20]:
q = f"""SELECT ?child ?label
WHERE {{
    ?child rdfs:label ?label .
    FILTER (?label = 'Semisupervised learning'@en)
}}
"""

In [21]:
print(q)

SELECT ?child ?label
WHERE {
    ?child rdfs:label ?label .
    FILTER (?label = 'Semisupervised learning'@en)
}



In [22]:
get_sparql_results(q)

{'head': {'link': [], 'vars': ['child', 'label']},
 'results': {'distinct': False,
  'ordered': True,
  'bindings': [{'child': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/Category:Semisupervised_learning'},
    'label': {'type': 'literal',
     'xml:lang': 'en',
     'value': 'Semisupervised learning'}},
   {'child': {'type': 'uri',
     'value': 'http://dbpedia.org/resource/Semisupervised_learning'},
    'label': {'type': 'literal',
     'xml:lang': 'en',
     'value': 'Semisupervised learning'}}]}}
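
The two bindings above differ only in the `Category:` part of the URI, so the two kinds of entities can be told apart by prefix. A minimal sketch, with `dbpedia_uri_kind` as a hypothetical helper name:

```python
RESOURCE_PREFIX = "http://dbpedia.org/resource/"
CATEGORY_PREFIX = RESOURCE_PREFIX + "Category:"


def dbpedia_uri_kind(uri):
    """Classify a DBPedia URI as 'category', 'resource' or 'other'."""
    # Check the category prefix first, because it extends the resource prefix.
    if uri.startswith(CATEGORY_PREFIX):
        return "category"
    if uri.startswith(RESOURCE_PREFIX):
        return "resource"
    return "other"


uris = [
    "http://dbpedia.org/resource/Category:Semisupervised_learning",
    "http://dbpedia.org/resource/Semisupervised_learning",
]
kinds = [dbpedia_uri_kind(uri) for uri in uris]
print(kinds)
```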