# Knowledge Graph construction and query with extracted software metadata

This notebook first builds a knowledge graph (KG) from metadata extracted from software repositories. The KG is then queried to assess which good practices those repositories follow.

In [1]:
import morph_kgc
import pyoxigraph

## KG Construction
The knowledge graph is generated with Morph-KGC, which uses RML mappings to transform the JSON file into RDF. The tool requires a few configuration parameters: the desired output serialisation and the path to the RML mapping file. The KG is then generated and stored as an Oxigraph store in the variable `graph`, which is also saved as a `.nq` file.

In [62]:
config = """
             [CONFIGURATION]
             output_format=N-QUADS
             
             [SOMEF-json]
             mappings=../mappings/mapping-somef-star.ttl
         """
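
The configuration above uses INI syntax, so it can be sanity-checked before materialising the graph. The following is a minimal sketch using only the standard library; it assumes that `[CONFIGURATION]` holds the global options and that every other section describes a data source. The variable names are illustrative and not part of Morph-KGC.

```python
import configparser
import textwrap

# Same configuration as in the cell above, dedented for readability.
config_text = textwrap.dedent("""
    [CONFIGURATION]
    output_format=N-QUADS

    [SOMEF-json]
    mappings=../mappings/mapping-somef-star.ttl
""")

parser = configparser.ConfigParser()
parser.read_string(config_text)

# Assumption: global options live in [CONFIGURATION]; any other section is a data source.
output_format = parser["CONFIGURATION"]["output_format"]
data_sources = [name for name in parser.sections() if name != "CONFIGURATION"]
mapping_paths = {name: parser[name]["mappings"] for name in data_sources}

print(output_format)
print(mapping_paths)
```

A check like this catches a misspelled section or a missing `mappings` entry before the longer materialisation step runs.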

In [101]:
graph = morph_kgc.materialize_oxigraph(config)

INFO | 2023-07-03 18:14:26,790 | 145 mapping rules retrieved.
INFO | 2023-07-03 18:14:26,796 | Mappings processed in 0.984 seconds.
INFO | 2023-07-03 18:14:36,412 | Number of triples generated in total: 16524.


In [102]:
graph.add(pyoxigraph.Quad(
    pyoxigraph.NamedNode('https://w3id.org/okn/i/graph/20230628'),
    pyoxigraph.NamedNode('http://purl.org/dc/terms/created'),
    pyoxigraph.Literal('2023-06-28 00:00:00', datatype=pyoxigraph.NamedNode('http://www.w3.org/2001/XMLSchema#dateTime')),
    pyoxigraph.NamedNode('https://w3id.org/okn/i/graph/default')))
graph.add(pyoxigraph.Quad(
    pyoxigraph.NamedNode('https://w3id.org/okn/i/graph/20230628'),
    pyoxigraph.NamedNode('http://www.w3.org/ns/prov#wasAttributedTo'),
    pyoxigraph.Literal('SOftware Metadata Extraction Framework (SOMEF)', datatype=pyoxigraph.NamedNode('http://www.w3.org/2001/XMLSchema#string')),
    pyoxigraph.NamedNode('https://w3id.org/okn/i/graph/default')))

In [103]:
with open('/Users/aiglesias/GitHub/oeg-software-graph/data/somef-kg.nq', 'w') as result:
    result.write(str(graph))

## KG querying - FAIRness assessment

In [73]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            
            SELECT (COUNT (DISTINCT ?s) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?s a sd:Software
            }
""")

for solution in q_res:
    print(solution['count_software'])

"270"^^<http://www.w3.org/2001/XMLSchema#integer>
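
The counts are printed as typed RDF literals rather than plain numbers. Inside the notebook, the lexical form should be available through the literal's `value` attribute (for example `int(solution['count_software'].value)`). When only the printed text is at hand, a small parser along these lines can recover the integer; `parse_integer_literal` is a hypothetical helper, not part of pyoxigraph.

```python
import re

# Matches the printed form of a typed literal: "lexical"^^<datatype IRI>
TYPED_LITERAL = re.compile(r'^"(?P<lexical>.*)"\^\^<(?P<datatype>[^>]+)>$')
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def parse_integer_literal(text: str) -> int:
    """Convert a printed xsd:integer literal into a Python int."""
    match = TYPED_LITERAL.match(text.strip())
    if match is None or match.group("datatype") != XSD_INTEGER:
        raise ValueError(f"not an xsd:integer literal: {text!r}")
    return int(match.group("lexical"))


total_software = parse_integer_literal(
    '"270"^^<http://www.w3.org/2001/XMLSchema#integer>'
)
print(total_software)
```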


### BP 1: Description is available

In [21]:
bp1_1_debug = """
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            
            SELECT DISTINCT ?software ?desc
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:description ?desc 
            }
"""
q_res = graph.query(bp1_1_debug)

for solution in q_res:
    print(solution['software'], solution['desc'])


Number of repositories with a description (long or short)

In [24]:
bp1_1 = """
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?software_count) 
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:description ?desc 
            }
"""
q_res = graph.query(bp1_1)

for solution in q_res:
    print(solution['software_count'])


"229"^^<http://www.w3.org/2001/XMLSchema#integer>
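
Most of the assessment queries in this notebook share the same shape: count the distinct software resources in the named graph that match a graph pattern. As a sketch, that template could be generated by a helper such as the hypothetical `build_count_query` below, which only builds the query string and assumes the `sd:` prefix is the only one needed.

```python
import textwrap

GRAPH_IRI = "https://w3id.org/okn/i/graph/20230628"
SD_PREFIX = "PREFIX sd: <https://w3id.org/okn/o/sd#>"


def build_count_query(where_body: str, variable: str = "software") -> str:
    """Return a SPARQL query counting the distinct bindings of ?variable."""
    body = textwrap.indent(textwrap.dedent(where_body).strip(), "    ")
    return (
        f"{SD_PREFIX}\n\n"
        f"SELECT (COUNT(DISTINCT ?{variable}) AS ?count)\n"
        f"FROM <{GRAPH_IRI}>\n"
        f"WHERE {{\n{body}\n}}\n"
    )


bp1_query = build_count_query("?software sd:description ?desc .")
print(bp1_query)
```

In the notebook, the resulting string would be passed to `graph.query(bp1_query)` and the result read from `solution['count']`.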


Number of software with descriptions by type: long (README) or short (GitHub API)

In [44]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            
            SELECT (COUNT (DISTINCT ?software_short) AS ?short_desc_count)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                << ?software_short sd:description ?desc_short >> sd:technique "GitHub_API".
            }
""")

for solution in q_res:
    print(solution['short_desc_count'])
    

"200"^^<http://www.w3.org/2001/XMLSchema#integer>


In [59]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
            
            SELECT (COUNT (DISTINCT ?software_long) AS ?long_desc_count)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                << ?software_long sd:description ?desc_long >> sd:technique ?long_technique ;
                                                            sd:confidence ?long_conf .
                VALUES ?long_technique {"supervised_classification" "header_analysis"}
                FILTER(xsd:float(?long_conf) > 0.98)
            }
""")

for solution in q_res:
    print(solution['long_desc_count'])
    

"88"^^<http://www.w3.org/2001/XMLSchema#integer>


### BP2: Persistent identifier
Repositories that provide a DOI (not from a publication, but from e.g. Zenodo)

In [60]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:identifier ?id 
            }
""")

for solution in q_res:
    print(solution['count_software'])


"21"^^<http://www.w3.org/2001/XMLSchema#integer>


### BP3: Download URL
Repositories that provide a URL for download from releases

In [66]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:hasVersion ?version 
            }
""")

for solution in q_res:
    print(solution['count_software'])


"81"^^<http://www.w3.org/2001/XMLSchema#integer>


### BP4: A software versioning scheme is followed
Whether tags follow a semantic versioning scheme

In [66]:
## UNRESOLVED: placeholder query, identical to BP3; it does not yet check the versioning scheme
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:hasVersion ?version 
            }
""")

for solution in q_res:
    print(solution['count_software'])


"81"^^<http://www.w3.org/2001/XMLSchema#integer>
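
Since the query above does not yet test the versioning scheme, one possible way to resolve this practice is to fetch the tag names and check them in Python. The sketch below uses a regular expression based on the one suggested by the Semantic Versioning 2.0.0 specification. Accepting a leading `v` and requiring every tag of a repository to match are assumptions made here, and the tag names in the example are hypothetical.

```python
import re

SEMVER = re.compile(
    r"v?"  # assumption: tolerate a leading "v", common in Git tags
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"  # MAJOR.MINOR.PATCH
    r"(?:-(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"  # optional pre-release
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*)?"
    r"(?:\+[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*)?"  # optional build metadata
)


def is_semver(tag: str) -> bool:
    """True if the whole tag is a semantic version (optionally prefixed by 'v')."""
    return SEMVER.fullmatch(tag) is not None


def follows_semver(tags) -> bool:
    """True if a repository has at least one tag and all tags are semantic versions."""
    tags = list(tags)
    return bool(tags) and all(is_semver(tag) for tag in tags)


# Hypothetical tag names, for illustration only.
for tag in ["v1.0.0", "1.2.3-rc.1", "1.2", "release-2023"]:
    print(tag, is_semver(tag))
```

Requiring all tags to match is strict; a threshold on the share of matching tags would be a reasonable alternative.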


### BP5: Documentation is available

In [67]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:hasDocumentation ?doc 
            }
""")

for solution in q_res:
    print(solution['count_software'])


"42"^^<http://www.w3.org/2001/XMLSchema#integer>


### BP6: License available

In [82]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            PREFIX schema: <http://schema.org/>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software a sd:Software ;
                          schema:license ?license .
                ?license a schema:CreativeWork ;
                         sd:name ?license_name .
            }
""")

for solution in q_res:
    print(solution['count_software'])


"164"^^<http://www.w3.org/2001/XMLSchema#integer>


### BP7: Explicit citation

In [90]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            PREFIX schema: <http://schema.org/>
            PREFIX prov: <http://www.w3.org/ns/prov#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                << ?software sd:citation ?cite >> prov:hadPrimarySource ?source
                FILTER(CONTAINS(str(?source),'README'))
            }
""")

for solution in q_res:
    print(solution['count_software'])


"20"^^<http://www.w3.org/2001/XMLSchema#integer>


In [95]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            PREFIX schema: <http://schema.org/>
            PREFIX prov: <http://www.w3.org/ns/prov#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                << ?software sd:citation ?cite >> prov:hadPrimarySource ?source
                FILTER(CONTAINS(str(?source),'.cff'))
            }
""")

for solution in q_res:
    print(solution['count_software'])


"5"^^<http://www.w3.org/2001/XMLSchema#integer>


### BP8: Available software metadata
Programming language, creation date, description, at least one release, and keywords

In [117]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            PREFIX schema: <http://schema.org/>
            PREFIX prov: <http://www.w3.org/ns/prov#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:hasSourceCode/sd:programmingLanguage ?language .
                ?software sd:dateCreated ?date .
                ?software sd:description ?desc .
                ?software sd:hasVersion ?rel .
                ?software sd:keywords ?keys .
                
            }
""")

for solution in q_res:
    print(solution['count_software'])


"22"^^<http://www.w3.org/2001/XMLSchema#integer>


### BP9: Installation instructions

In [118]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            PREFIX schema: <http://schema.org/>
            PREFIX prov: <http://www.w3.org/ns/prov#>
            
            SELECT (COUNT (DISTINCT ?software) AS ?count_software)
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:hasInstallationInstructions ?inst .
                
            }
""")

for solution in q_res:
    print(solution['count_software'])


"60"^^<http://www.w3.org/2001/XMLSchema#integer>
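
The raw counts are easier to compare as a share of the 270 software resources in the graph. The sketch below simply restates the counts printed in the cells above and computes percentages. BP4 is left out because its query is unresolved, and BP7 is left out because it is reported per citation source rather than as a single count.

```python
TOTAL_REPOSITORIES = 270  # result of the sd:Software count query

# Counts copied from the outputs of the cells above.
counts = {
    "BP1 description": 229,
    "BP2 persistent identifier": 21,
    "BP3 download URL": 81,
    "BP5 documentation": 42,
    "BP6 license": 164,
    "BP8 software metadata": 22,
    "BP9 installation instructions": 60,
}


def coverage(count: int, total: int = TOTAL_REPOSITORIES) -> float:
    """Percentage of repositories following a practice, rounded to one decimal."""
    return round(100 * count / total, 1)


for practice, count in counts.items():
    print(f"{practice}: {count}/{TOTAL_REPOSITORIES} ({coverage(count)}%)")
```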


### BP10: Software requirements

In [121]:
q_res = graph.query("""
            PREFIX sd: <https://w3id.org/okn/o/sd#>
            PREFIX schema: <http://schema.org/>
            PREFIX prov: <http://www.w3.org/ns/prov#>
            
            SELECT ?software
            FROM <https://w3id.org/okn/i/graph/20230628>
            WHERE {
                ?software sd:softwareRequirements ?requirements .
                
            }
""")

for solution in q_res:
    print(solution['software'])


<https://www.w3id.org/okn/i/Software/oeg-upm/easytv-annotator>
<https://www.w3id.org/okn/i/Software/oeg-upm/pytada-hdt-entity>
<https://www.w3id.org/okn/i/Software/oeg-upm/devops-infra>
<https://www.w3id.org/okn/i/Software/oeg-upm/pytada-hdt-entity>
<https://www.w3id.org/okn/i/Software/oeg-upm/gwt-blocks>
<https://www.w3id.org/okn/i/Software/oeg-upm/valkyr-ie-gate>
<https://www.w3id.org/okn/i/Software/oeg-upm/FAIR-Research-Object>
<https://www.w3id.org/okn/i/Software/oeg-upm/LOT-resources>
<https://www.w3id.org/okn/i/Software/oeg-upm/soca>
<https://www.w3id.org/okn/i/Software/oeg-upm/FarolAppsWeb>
<https://www.w3id.org/okn/i/Software/oeg-upm/SendEmailWebApp>
<https://www.w3id.org/okn/i/Software/oeg-upm/pytada-hdt-entity>
<https://www.w3id.org/okn/i/Software/oeg-upm/Massive-ROs-Creator>
<https://www.w3id.org/okn/i/Software/oeg-upm/FAIR-Research-Object>
<https://www.w3id.org/okn/i/Software/oeg-upm/BRATtoBIO>
<https://www.w3id.org/okn/i/Software/oeg-upm/pytada-hdt-entity>
<https://www.w3i