# Turning the OpenAlex topics hierarchy into SKOS

See https://www.w3.org/2004/02/skos/intro

Example SKOS fragment in Turtle syntax:

```
@prefix :            <http://ns.nature.com/terms/> .
@prefix bibo:        <http://purl.org/ontology/bibo/> .
@prefix dc:          <http://purl.org/dc/elements/1.1/> .
@prefix dcterms:     <http://purl.org/dc/terms/> .
@prefix foaf:        <http://xmlns.com/foaf/0.1/> .
@prefix npg:         <http://ns.nature.com/terms/> .
@prefix npgd:        <http://ns.nature.com/datasets/> .
@prefix npgg:        <http://ns.nature.com/graphs/> .
@prefix owl:         <http://www.w3.org/2002/07/owl#> .
@prefix prism:       <http://prismstandard.org/namespaces/basic/2.1/> .
@prefix rdf:         <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:        <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:        <http://www.w3.org/2004/02/skos/core#> .
@prefix vann:        <http://purl.org/vocab/vann/> .
@prefix xsd:         <http://www.w3.org/2001/XMLSchema#> .

@prefix article-types: <http://ns.nature.com/article-types/> .

article-types:
    npg:webpage <http://www.nature.com/ontologies/models/article-types/> ;
    dc:date "2015-08-17T06:12:02.390-04:00" ;
    dc:publisher "Macmillan Publishers Limited" ;
    dc:rights "This work is distributed under a Creative Commons Zero 1.0 (CC0 1.0) Public Domain Dedication <http://creativecommons.org/publicdomain/zero/1.0/>."@en ;
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    vann:preferredNamespacePrefix "article-types" ;
    vann:preferredNamespaceUri "http://ns.nature.com/article-types/" ;
    a owl:Ontology, skos:ConceptScheme ;
    owl:imports <http://ns.nature.com/terms/> ;
    owl:versionInfo "1.9.45" ;
    skos:definition "The ArticleTypes Ontology is a categorization of kinds of publication which are used to index and group content published by Macmillan Science and Education."@en ;
    skos:prefLabel "ArticleTypes Ontology"@en .

article-types:abstracts-collection
    npg:hasRoot article-types:news-and-comment ;
    npg:id "abstracts-collection" ;
    npg:isLeaf true ;
    npg:isRoot false ;
    npg:treeDepth 2 ;
    a npg:ArticleType , skos:Concept ;
    rdfs:isDefinedBy <http://ns.nature.com/article-types/> ;
    skos:broader article-types:research-highlights ;
    skos:definition "The :abstracts-collection article-type denotes a collection of abstracts from a recent conference."@en ;
    skos:inScheme <http://ns.nature.com/article-types/> ;
    skos:prefLabel "Abstracts Collection"@en .

```

## Getting started and loading the data

In [2]:
import pandas as pd
# !pip install ontospy
import ontospy

In [3]:
# load previously saved files
df_merged = pd.read_json("data/openalex_merged_cats_2024_09.json")
df_merged.head()

Unnamed: 0,id,display_name,description,ids,display_name_alternatives,fields,siblings,works_count,cited_by_count,works_api_url,updated_date,created_date,entity_type,domain,subfields,field,topics,keywords,subfield
0,https://openalex.org/domains/3,Physical Sciences,branch of natural science that studies non-liv...,{'wikidata': 'https://www.wikidata.org/wiki/Q1...,[],"[{'id': 'https://openalex.org/fields/15', 'dis...","[{'id': 'https://openalex.org/domains/4', 'dis...",73237366,770120022,https://api.openalex.org/works?filter=topics.d...,2024-04-01T04:50:55.583031,2024-01-23,domains,,,,,,
1,https://openalex.org/domains/2,Social Sciences,branch of science focused on societies and the...,{'wikidata': 'https://www.wikidata.org/wiki/Q3...,[political sciences],"[{'id': 'https://openalex.org/fields/12', 'dis...","[{'id': 'https://openalex.org/domains/4', 'dis...",64995390,273064988,https://api.openalex.org/works?filter=topics.d...,2024-04-01T04:50:50.146788,2024-01-23,domains,,,,,,
2,https://openalex.org/domains/4,Health Sciences,branch of science focused on human health and ...,{'wikidata': 'https://www.wikidata.org/wiki/Q1...,"[medical sciences, biomedical sciences, health...","[{'id': 'https://openalex.org/fields/35', 'dis...","[{'id': 'https://openalex.org/domains/1', 'dis...",43957129,480875956,https://api.openalex.org/works?filter=topics.d...,2024-04-01T04:50:50.416213,2024-01-23,domains,,,,,,
3,https://openalex.org/domains/1,Life Sciences,branch of science that involve the scientific ...,{'wikidata': 'https://www.wikidata.org/wiki/Q8...,"[bioscience, biological science disciplines]","[{'id': 'https://openalex.org/fields/11', 'dis...","[{'id': 'https://openalex.org/domains/4', 'dis...",27536918,445784583,https://api.openalex.org/works?filter=topics.d...,2024-04-01T04:50:55.720915,2024-01-23,domains,,,,,,
4,https://openalex.org/fields/27,Medicine,"field of study for diagnosing, treating and pr...",{'wikidata': 'https://www.wikidata.org/wiki/Q1...,[healthcare sciences],,"[{'id': 'https://openalex.org/fields/35', 'dis...",36272606,433646577,https://api.openalex.org/works?filter=topics.f...,2024-04-01T04:37:08.847205,2024-01-23,fields,"{'id': 'https://openalex.org/domains/4', 'disp...","[{'id': 'https://openalex.org/subfields/2702',...",,,,
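
The cells that follow read the parent of each record from a column named after the parent level: `domain` for fields, `field` for subfields, and `subfield` for topics. The sketch below shows that mapping in isolation, using plain dictionaries instead of the DataFrame. The `PARENT_COLUMN` table and the `broader_uri` helper are hypothetical names introduced here for illustration, and the sample rows are abridged.

```python
# Hypothetical lookup: which column holds the parent record for each level.
PARENT_COLUMN = {
    "domains": None,       # top of the hierarchy, so no skos:broader
    "fields": "domain",
    "subfields": "field",
    "topics": "subfield",
}


def broader_uri(row):
    """Return the URI of the parent concept, or None for top-level rows."""
    column = PARENT_COLUMN[row["entity_type"]]
    if column is None:
        return None
    return row[column]["id"]


# Abridged rows shaped like the records in df_merged.
rows = [
    {"id": "https://openalex.org/domains/4", "entity_type": "domains"},
    {
        "id": "https://openalex.org/fields/27",
        "entity_type": "fields",
        "domain": {"id": "https://openalex.org/domains/4"},
    },
]

for row in rows:
    parent = broader_uri(row)
    if parent is None:
        print(f"# <{row['id']}> is a top-level concept")
    else:
        print(f"<{row['id']}> skos:broader <{parent}> .")
```

With a lookup like this, the four near-identical templates used below could in principle share one loop, although the notebook keeps them separate.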


## Generating a valid TTL file with topics, fields, etc. from OpenAlex

In [7]:
header = """
@prefix dc:          <http://purl.org/dc/elements/1.1/> .
@prefix dcterms:     <http://purl.org/dc/terms/> .
@prefix owl:         <http://www.w3.org/2002/07/owl#> .
@prefix rdf:         <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:        <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:        <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:         <http://www.w3.org/2001/XMLSchema#> .
@prefix openalex: <https://lambdamusic.github.io/openalex-hacks/ontology/> .
@prefix oatopics: <https://openalex.org/topics/> .
@prefix oadomains: <https://openalex.org/domains/> .
@prefix oafields: <https://openalex.org/fields/> .
@prefix oasubfields: <https://openalex.org/subfields/> .
@prefix oakeywords: <https://openalex.org/keywords/> .

openalex:
    dc:date "2024-09-24T06:12:02.390-04:00" ;
    dc:publisher "http://www.michelepasin.org" ;
    dc:rights "This work is distributed under a Creative Commons Zero 1.0 (CC0 1.0) Public Domain Dedication <http://creativecommons.org/publicdomain/zero/1.0/>."@en ;
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    a owl:Ontology, skos:ConceptScheme ;
    owl:versionInfo "0.1" ;
    skos:definition "A non-official SKOS-ification of the OpenAlex scientific topics hierarchy (see https://help.openalex.org/hc/en-us/articles/24736129405719-Topics). Purpose of this exercise is to make it easier to experiment with the model and process it using standard SKOS tools. Export made on 2024-09-24 (for more information, see https://github.com/lambdamusic/openalex-hacks)."@en ;
    skos:prefLabel "OpenAlex Topics Ontology"@en ;
    rdfs:seeAlso <https://help.openalex.org/hc/en-us/articles/24736129405719-Topics> ;
    rdfs:seeAlso <https://github.com/lambdamusic/openalex-hacks> .

"""



# DOMAINS
#
df = df_merged[df_merged.entity_type == 'domains']

template = """
<{entity_id}>
    a skos:Concept ;
    rdfs:isDefinedBy openalex: ;
    skos:prefLabel "{label}"@en ;
    rdfs:label "{label}"@en ;
    skos:definition "{desc}"@en ;
    skos:inScheme openalex: ;
    openalex:works_count {works_count} ;
    openalex:cited_by_count {cited_by_count} ;
    owl:sameAs <{wikidata_id}> ;
    owl:sameAs <{wikipedia_id}> .
"""
# out = ""
for index, row in df.head(5).iterrows():
    x = template.format(
        entity_id = row['id'],
        desc = row['description'],
        label = row['display_name'],
        works_count = row['works_count'],
        cited_by_count = row['cited_by_count'],
        wikidata_id = row['ids']['wikidata'],
        wikipedia_id = row['ids']['wikipedia'],
    )
    header += x

# FIELDS
#  
df = df_merged[df_merged.entity_type == 'fields']

template = """
<{topic_id}>
    a skos:Concept ;
    rdfs:isDefinedBy openalex: ;
    skos:prefLabel "{label}"@en ;
    rdfs:label "{label}"@en ;
    skos:definition "{desc}"@en ;
    skos:inScheme openalex: ;
    skos:broader <{parent_id}> ;
    openalex:works_count {works_count} ;
    openalex:cited_by_count {cited_by_count} ;
    owl:sameAs <{wikidata_id}> ;
    owl:sameAs <{wikipedia_id}> .
"""
# out = ""
for index, row in df.head(500).iterrows():
    x = template.format(
        topic_id = row['id'],
        parent_id = row['domain']['id'],
        desc = row['description'],
        label = row['display_name'],
        works_count = row['works_count'],
        cited_by_count = row['cited_by_count'],
        wikidata_id = row['ids']['wikidata'],
        wikipedia_id = row['ids']['wikipedia'],
    )
    header += x




# SUBFIELDS
#  
df = df_merged[df_merged.entity_type == 'subfields']

template = """
<{topic_id}>
    a skos:Concept ;
    rdfs:isDefinedBy openalex: ;
    skos:prefLabel "{label}"@en ;
    rdfs:label "{label}"@en ;
    skos:definition "{desc}"@en ;
    skos:inScheme openalex: ;
    skos:broader <{parent_id}> ;
    openalex:works_count {works_count} ;
    openalex:cited_by_count {cited_by_count} ;
    owl:sameAs <{wikidata_id}> ;
    owl:sameAs <{wikipedia_id}> .
"""
# out = ""
for index, row in df.head(5000).iterrows():
    x = template.format(
        topic_id = row['id'],
        parent_id = row['field']['id'],
        desc = row['description'],
        label = row['display_name'],
        works_count = row['works_count'],
        cited_by_count = row['cited_by_count'],
        wikidata_id = row['ids']['wikidata'],
        wikipedia_id = row['ids']['wikipedia'],
    )
    header += x



# TOPICS
#  
df = df_merged[df_merged.entity_type == 'topics']

template = """
<{topic_id}>
    a skos:Concept ;
    rdfs:isDefinedBy openalex: ;
    skos:prefLabel "{label}"@en ;
    rdfs:label "{label}"@en ;
    skos:definition "{desc}"@en ;
    skos:inScheme openalex: ;
    skos:broader <{parent_id}> ;
    openalex:works_count {works_count} ;
    openalex:cited_by_count {cited_by_count} ;
    owl:sameAs <{wikidata_id}> ;
    owl:sameAs <{wikipedia_id}> .
"""
# out = ""
for index, row in df.head(100000).iterrows():

    # some topics lack a wikidata or wikipedia entry in the 'ids' column
    # cheating TODO fix me: fall back to the topic's own URI so that the
    # owl:sameAs triples stay syntactically valid
    wikidata_id = row['ids'].get('wikidata', row['id'])
    wikipedia_id = row['ids'].get('wikipedia', row['id'])

    x = template.format(
        topic_id = row['id'],
        parent_id = row['subfield']['id'],
        desc = row['description'],
        label = row['display_name'],
        works_count = row['works_count'],
        cited_by_count = row['cited_by_count'],
        wikidata_id = wikidata_id,
        wikipedia_id = wikipedia_id,
    )
    header += x





## finally

with open('data/openalex-topics-rdf.ttl', 'w') as f:
    f.write(header)
print("Turtle generated...")


## test with ontospy
if True:
    o = ontospy.Ontospy()
    o.load_rdf("data/openalex-topics-rdf.ttl")
    o.build_all()

    for x in o.stats():
        print(x)

Reading: <data/openalex-topics-rdf.ttl>
.. trying rdf serialization: <turtle>


Turtle generated...


..... success!
----------
Loaded 52785 triples.
----------
RDF sources loaded successfully: 1 of 1.
..... 'data/openalex-topics-rdf.ttl'
----------


('Ontologies', 1)
('Triples', 52785)
('Classes', 0)
('Properties', 0)
('Annotation Properties', 0)
('Object Properties', 0)
('Datatype Properties', 0)
('Skos Concepts', 4798)
('Data Shapes', 0)
('Data Sources', 1)


### Extra

The implementation below also renders 'keywords' as leaf nodes under each topic.

This generates a much larger graph, as each topic has roughly 10 keywords.

In [None]:


# TOPICS+KEYWORDS
#  
df = df_merged[df_merged.entity_type == 'topics']

template = """
<{topic_id}>
    a skos:Concept ;
    rdfs:isDefinedBy openalex: ;
    skos:prefLabel "{label}"@en ;
    rdfs:label "{label}"@en ;
    skos:definition "{desc}"@en ;
    skos:inScheme openalex: ;
    skos:broader <{parent_id}> ;
    openalex:works_count {works_count} ;
    openalex:cited_by_count {cited_by_count} ;
    owl:sameAs <{wikidata_id}> ;
    owl:sameAs <{wikipedia_id}> .
"""

template_keyword = """
<{keyword_id}>
    a skos:Concept ;
    rdfs:isDefinedBy openalex: ;
    skos:prefLabel "{label}"@en ;
    rdfs:label "{label}"@en ;
    skos:inScheme openalex: ;
    skos:broader <{parent_id}> .

"""

keyword_index = 0
for index, row in df.head(100000).iterrows():

    # some topics lack a wikidata or wikipedia entry in the 'ids' column
    # cheating TODO fix me: fall back to the topic's own URI so that the
    # owl:sameAs triples stay syntactically valid
    wikidata_id = row['ids'].get('wikidata', row['id'])
    wikipedia_id = row['ids'].get('wikipedia', row['id'])

    for k in row['keywords']:
        keyword_index += 1
        # full URI in the openalex: namespace; the template wraps the ID in <...>,
        # so a prefixed name such as openalex:keyword-1 would not be expanded
        k_id = f"https://lambdamusic.github.io/openalex-hacks/ontology/keyword-{keyword_index}"

        x = template_keyword.format(
            keyword_id = k_id,
            parent_id = row['id'],
            label = k,
        )
        header += x

    x = template.format(
        topic_id = row['id'],
        parent_id = row['subfield']['id'],
        desc = row['description'],
        label = row['display_name'],
        works_count = row['works_count'],
        cited_by_count = row['cited_by_count'],
        wikidata_id = wikidata_id,
        wikipedia_id = wikipedia_id,
    )
    header += x
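
The running counter above mints a fresh keyword concept for every (topic, keyword) pair, so a keyword string that occurs under several topics ends up as several unrelated concepts. As a rough alternative, the sketch below derives the URI from the keyword text and collects all parent topics on a single concept. It assumes that keywords are plain strings and that the same string can recur across topics. `slugify` and `keyword_concepts` are hypothetical helpers, and the sample topic IDs and keywords are invented.

```python
import re

# Same namespace as the openalex: prefix, with the keyword- stem used above.
KEYWORD_BASE = "https://lambdamusic.github.io/openalex-hacks/ontology/keyword-"


def slugify(keyword):
    """Lowercase the keyword and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")


def keyword_concepts(topics, base=KEYWORD_BASE):
    """Map each keyword URI to its label and the set of topics it sits under."""
    concepts = {}
    for topic in topics:
        for keyword in topic["keywords"]:
            uri = base + slugify(keyword)
            entry = concepts.setdefault(uri, {"label": keyword, "broader": set()})
            entry["broader"].add(topic["id"])
    return concepts


# Invented topics for illustration only.
topics = [
    {"id": "https://openalex.org/topics/T1", "keywords": ["Machine Learning", "Genomics"]},
    {"id": "https://openalex.org/topics/T2", "keywords": ["machine learning"]},
]

for uri, entry in sorted(keyword_concepts(topics).items()):
    broader = " , ".join(f"<{t}>" for t in sorted(entry["broader"]))
    print(f"<{uri}>")
    print("    a skos:Concept ;")
    print(f'    skos:prefLabel "{entry["label"]}"@en ;')
    print(f"    skos:broader {broader} .")
```

SKOS allows a concept to have more than one `skos:broader` parent, so the shared keyword simply becomes a node with several parents. Labels containing non-ASCII characters would need a more careful slug function than this one.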

