# Sitemap Assay

The start of a simple notebook that could be hosted for people to test their sitemap (and robots.txt) files.

References:
* [AdvTools](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html)
* [Sitemap viz](https://www.ayima.com/us/insights/analytics-and-cro/how-to-visualize-an-xml-sitemap-using-python.html)



In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import advertools as adv
import json
import requests
from pyld import jsonld
from bs4 import BeautifulSoup
import urllib.request
import logging
import traceback
import kglab

In [2]:
%%time 

# smurl = 'https://opencoredata.org/sitemap_0.xml'
# smurl = 'https://www.bco-dmo.org/sitemap.xml'
# smurl = 'https://opentopography.org/sitemap.xml'
# smurl = 'https://www2.earthref.org/MagIC/contributions.sitemap.xml'
# smurl = 'https://oceanscape.org/organisation-sitemap.xml'
smurl = 'https://catalogue.cioosatlantic.ca/sitemap/sitemap-1.xml'
# smurl = 'https://obis.org/sitemap/sitemap_datasets.xml'
# smurl = 'https://infohub.eurocean.net/sitemap/vessels'

iow_sitemap = adv.sitemap_to_df(smurl) # load sitemap to dataframe via advertools
# iow_sitemap.info()
# iow_sitemap.head()

2022-07-24 09:47:28,216 | INFO | sitemaps.py:419 | sitemap_to_df | Getting https://catalogue.cioosatlantic.ca/sitemap/sitemap-1.xml


CPU times: user 12.2 ms, sys: 4.07 ms, total: 16.3 ms
Wall time: 283 ms


## Analyzing the URLs

We can quickly grab the unique URLs from the sitemap column and see how many unique sitemap.xml files we are working with.

We can also dive into the URL structure for the resources a bit.

In [3]:
usm = iow_sitemap.sitemap.unique()
uloc = iow_sitemap["loc"].unique()
print("{} unique sitemap XML file(s) pointing to {} unique resource(s).".format(len(usm), len(uloc)))

# Break down all the URLs into their path parts
urldf = adv.url_to_df(list(iow_sitemap['loc']))
urldf.head()

1 unique sitemap XML file(s) pointing to 69 unique resource(s).


Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2
0,https://catalogue.cioosatlantic.ca/dataset/ceo...,https,catalogue.cioosatlantic.ca,/dataset/ceotr-datasets,,,dataset,ceotr-datasets
1,https://catalogue.cioosatlantic.ca/dataset/cma...,https,catalogue.cioosatlantic.ca,/dataset/cmar-datasets,,,dataset,cmar-datasets
2,https://catalogue.cioosatlantic.ca/dataset/coa...,https,catalogue.cioosatlantic.ca,/dataset/coastal-action,,,dataset,coastal-action
3,https://catalogue.cioosatlantic.ca/dataset/col...,https,catalogue.cioosatlantic.ca,/dataset/college-of-the-north-atlantic-cna,,,dataset,college-of-the-north-atlantic-cna
4,https://catalogue.cioosatlantic.ca/dataset/dfo...,https,catalogue.cioosatlantic.ca,/dataset/dfo-datasets,,,dataset,dfo-datasets
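
The `dir_1` and `dir_2` columns above correspond to the path segments of each URL. As a rough illustration of that breakdown, here is a minimal standard-library sketch. It is not how advertools implements `url_to_df`; the helper name `split_url` is hypothetical and only mirrors the column names shown in the table.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into its components plus numbered path directories."""
    parts = urlsplit(url)
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    # Non-empty path segments become dir_1, dir_2, ...
    segments = [seg for seg in parts.path.split("/") if seg]
    for i, seg in enumerate(segments, start=1):
        row[f"dir_{i}"] = seg
    return row


row = split_url("https://catalogue.cioosatlantic.ca/dataset/ceotr-datasets")
print(row["dir_1"], row["dir_2"])
```

Grouping on `dir_1`, as the next section does, then amounts to grouping on the first path segment.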


## Sample and test sitemap entries

In [4]:
# sample the previously generated url data frame
sample_size = 5
sample_df = urldf.groupby("dir_1").sample(n=sample_size, random_state=1, replace=True)

In [5]:
sample_df.head()

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2
37,https://catalogue.cioosatlantic.ca/dataset/659...,https,catalogue.cioosatlantic.ca,/dataset/65931687-684a-41b5-9aa5-086a9e72b5c5,,,dataset,65931687-684a-41b5-9aa5-086a9e72b5c5
12,https://catalogue.cioosatlantic.ca/dataset/ca-...,https,catalogue.cioosatlantic.ca,/dataset/ca-cioos_72c91e45-2578-304b-9794-98a6...,,,dataset,ca-cioos_72c91e45-2578-304b-9794-98a65642e24d
9,https://catalogue.cioosatlantic.ca/dataset/ca-...,https,catalogue.cioosatlantic.ca,/dataset/ca-cioos_ee238a31-85f9-3cef-badc-c902...,,,dataset,ca-cioos_ee238a31-85f9-3cef-badc-c90201592a39
5,https://catalogue.cioosatlantic.ca/dataset/for...,https,catalogue.cioosatlantic.ca,/dataset/force-datasets,,,dataset,force-datasets
64,https://catalogue.cioosatlantic.ca/dataset/ca-...,https,catalogue.cioosatlantic.ca,/dataset/ca-cioos_d5faadca-f0b6-46f2-9277-d1e7...,,,dataset,ca-cioos_d5faadca-f0b6-46f2-9277-d1e76a8e313c
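
Because the sample is drawn with `replace=True`, the same URL can be picked more than once, and a group with fewer than `sample_size` rows still yields `sample_size` rows. The sketch below reproduces that behaviour with the standard library only, to make the sampling rule explicit. The function name `sample_per_group` and the example rows are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_group(rows, key, n, seed=1):
    """Draw n rows with replacement from each group of rows sharing `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # rng.choices samples with replacement, so small groups still yield n rows
    return {name: rng.choices(members, k=n) for name, members in groups.items()}


# Hypothetical rows standing in for the url data frame
rows = [
    {"url": "https://example.org/dataset/a", "dir_1": "dataset"},
    {"url": "https://example.org/dataset/b", "dir_1": "dataset"},
    {"url": "https://example.org/organization/c", "dir_1": "organization"},
]
samples = sample_per_group(rows, key="dir_1", n=5)
print({name: len(picked) for name, picked in samples.items()})
```

If repeated URLs are not wanted in the checks that follow, dropping duplicates from the sample first would avoid fetching the same page twice.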


### See if the URLs resolve

In [6]:
import urllib.request
import requests

ul = sample_df["url"]

for item in ul:
    # user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    # headers={'User-Agent':user_agent,}

    headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                           'AppleWebKit/537.11 (KHTML, like Gecko) '
                           'Chrome/23.0.1271.64 Safari/537.11',
             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
             'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
             'Accept-Encoding': 'none',
             'Accept-Language': 'en-US,en;q=0.8',
             'Connection': 'keep-alive'}

    try:
        # x = requests.get(item)
        # code = x.status_code
        request = urllib.request.Request(url=item, headers=headers)  # the assembled request
        with urllib.request.urlopen(request) as response:
            info = response.info()
            dtype = info.get_content_type()    # -> text/html
        # headers = x.headers()
        # print("URL: {} \ninfo : {}\n --".format(item, info))
        print("URL: {} ".format(item))
    except Exception as e:
        # code = x.status_code
        # dtype = info.get_content_type()

        print("Exception on: {} \nerrors : {}\n --".format(item, str(e)))


URL: https://catalogue.cioosatlantic.ca/dataset/65931687-684a-41b5-9aa5-086a9e72b5c5 
URL: https://catalogue.cioosatlantic.ca/dataset/ca-cioos_72c91e45-2578-304b-9794-98a65642e24d 
URL: https://catalogue.cioosatlantic.ca/dataset/ca-cioos_ee238a31-85f9-3cef-badc-c90201592a39 
URL: https://catalogue.cioosatlantic.ca/dataset/force-datasets 
URL: https://catalogue.cioosatlantic.ca/dataset/ca-cioos_d5faadca-f0b6-46f2-9277-d1e76a8e313c 
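
The loop above prints only the URL on success, so the status code and content type are not visible in the output. One possible refinement is a small helper that returns them instead of printing. This is a sketch, not part of the original notebook: the name `check_url`, the `opener` parameter (there so the function can be exercised without network access), and the 10-second timeout are all assumptions.

```python
import urllib.error
import urllib.request


def check_url(url, headers=None, opener=urllib.request.urlopen, timeout=10):
    """Return (status, content_type, error) for one URL without raising."""
    request = urllib.request.Request(url=url, headers=headers or {})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, response.info().get_content_type(), None
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 403
        return err.code, None, str(err)
    except OSError as err:
        # URLError, timeouts, refused connections and similar failures
        return None, None, str(err)
```

Inside the loop it could be called as `status, dtype, error = check_url(item, headers)`, keeping the same `headers` dictionary used above.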


### See if they have JSON-LD (static check only, no dynamically loaded JSON-LD yet)

In [7]:
ul = sample_df["url"]

headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

for item in ul:
    request=urllib.request.Request(url=item, headers=headers)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        print("JSON byte size: {} ".format(len(p.contents[0])))
    except Exception as e:
        logging.error(traceback.format_exc())

JSON byte size: 8955 
JSON byte size: 11288 
JSON byte size: 11324 


2022-07-24 09:47:40,206 | ERROR | 916626917.py:20 | <cell line: 12> | Traceback (most recent call last):
  File "/tmp/ipykernel_3096099/916626917.py", line 18, in <cell line: 12>
    print("JSON byte size: {} ".format(len(p.contents[0])))
AttributeError: 'NoneType' object has no attribute 'contents'



JSON byte size: 10777 
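
The `AttributeError` above occurs because `soup.find` returns `None` when a page has no static `application/ld+json` script element. A minimal sketch of an extraction step that returns an empty list in that case follows. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra dependencies; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # None while outside a JSON-LD script element

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer).strip())
            self._buffer = None


def extract_jsonld(html):
    """Return a list of JSON-LD strings; empty if the page has none."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


page = '<html><head><script type="application/ld+json">{"@type": "Dataset"}</script></head></html>'
print(extract_jsonld(page))             # ['{"@type": "Dataset"}']
print(extract_jsonld("<html></html>"))  # []
```

With BeautifulSoup, the equivalent guard would be to test `p is not None` before reading `p.contents[0]`, or to use `soup.find_all` and loop over the results.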


### Check JSON-LD structure (static check only, no dynamically loaded JSON-LD yet)

In [8]:
ul = sample_df["url"]

myframe =  {
    "@context":{"@vocab": "https://schema.org/"},
    "@type": "Dataset",
}

context =  { "@vocab": "https://schema.org/" }

headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

for item in ul:
    request=urllib.request.Request(url=item, headers=headers)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        jld = json.loads(p.contents[0])
        # print(str(jld))
        # compacted = jsonld.compact(str(jld), context)
        # print(len(json.dumps(compacted, indent=2)))
    except Exception as e:
        print("Exception")
        logging.error(traceback.format_exc())

{'@context': {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#', 'schema': 'http://schema.org/', 'xsd': 'http://www.w3.org/2001/XMLSchema#'}, '@graph': [{'@id': 'https://catalogue.cioosatlantic.ca/dataset/65931687-684a-41b5-9aa5-086a9e72b5c5', '@type': 'schema:Dataset', 'http://www.w3.org/ns/dcat#contactPoint': {'@id': '_:N99c0fd643f294cd78f0c060b25c18583'}, 'schema:creator': [{'@id': '_:Nebedefd0ce794984b42695d3b9923a44'}, {'@id': '_:Nd935baa569734380ae3396fc938482ba'}], 'schema:dateModified': '2022-07-05T20:04:34.619446', 'schema:datePublished': '2022-07-05T20:04:34.619436', 'schema:description': "NEGL - Postville, Nunatsiavut (NLQU0001). Tri-leg tower masts are 25 feet tall (7.62 m) and stations are all located at mid-slopes in cleared terrain.  Sites at Cartwright Junction, near Red Bay and at North West River all include fence enclosures which may impact snow accumulation at these sites.  All stations are owned and operated by th

2022-07-24 09:48:00,247 | ERROR | 2754713587.py:31 | <cell line: 19> | Traceback (most recent call last):
  File "/tmp/ipykernel_3096099/2754713587.py", line 25, in <cell line: 19>
    jld = json.loads(p.contents[0])
AttributeError: 'NoneType' object has no attribute 'contents'



Exception
{'@context': {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#', 'schema': 'http://schema.org/', 'xsd': 'http://www.w3.org/2001/XMLSchema#'}, '@graph': [{'@id': 'https://catalogue.cioosatlantic.ca/dataset/ca-cioos_d5faadca-f0b6-46f2-9277-d1e76a8e313c/resource/ca-cioos-resource_25dddad42e209412d443a44a94fe8cdd', '@type': 'schema:DataDownload', 'schema:contentUrl': 'https://cioosatlantic.ca/erddap/tabledap/NL_Climate_Index_all_fields.html', 'schema:description': "ERDDAP's version of the OPeNDAP .html web page for this dataset. Specify a subset of the dataset and download the data via OPeNDAP or in many different file types.", 'schema:encodingFormat': 'ERDDAP', 'schema:name': 'ERDDAP Data Subset Form', 'schema:url': 'https://cioosatlantic.ca/erddap/tabledap/NL_Climate_Index_all_fields.html'}, {'@id': 'https://catalogue.cioosatlantic.ca/organization/ab872f33-5375-42cb-a5df-252570ebf5de', '@type': 'schema:Organization', 'schema:c
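
The output above shows that these pages use a prefixed `@type` such as `schema:Dataset` under the `http://schema.org/` namespace, inside an `@graph` array. A simple structural check is to look for at least one node typed as a schema.org Dataset under either the `http` or `https` form. The sketch below does this with plain `json` and handles only inline prefix maps and `@vocab`. It is not a JSON-LD processor, and the function names are hypothetical; expanding or framing with `pyld` would be the more complete route.

```python
import json

SCHEMA_IRIS = ("http://schema.org/", "https://schema.org/")


def expand_type(value, context):
    """Expand a compact type such as 'schema:Dataset' using a prefix map."""
    prefix, sep, local = value.partition(":")
    if sep and isinstance(context, dict) and isinstance(context.get(prefix), str):
        return context[prefix] + local
    vocab = context.get("@vocab") if isinstance(context, dict) else None
    if not sep and vocab:
        return vocab + value
    return value  # already a full IRI, or not expandable with this context


def dataset_nodes(doc):
    """Return nodes typed schema.org Dataset, under either http or https."""
    context = doc.get("@context", {})
    nodes = doc.get("@graph", [doc])
    wanted = {iri + "Dataset" for iri in SCHEMA_IRIS}
    found = []
    for node in nodes:
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if any(expand_type(t, context) in wanted for t in types):
            found.append(node)
    return found


# Hypothetical document shaped like the output above
doc = json.loads("""
{"@context": {"schema": "http://schema.org/"},
 "@graph": [
   {"@id": "https://example.org/dataset/1", "@type": "schema:Dataset"},
   {"@id": "https://example.org/org/1", "@type": "schema:Organization"}
 ]}
""")
print([node["@id"] for node in dataset_nodes(doc)])
```

An empty result for a page that is listed in a dataset sitemap would be worth flagging for follow-up.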

## Load to Graph

Load a sample set of triples into RDFLib (via kglab) and run a sample SPARQL query on them.

### Note
This is the same loop as above, but now we load the JSON-LD into a kglab knowledge graph.

In [9]:
ul = sample_df["url"]

# Test loading into a graph
namespaces = {
    "schema":  "https://schema.org/",
    "schemaold":  "http://schema.org/",
    "shacl":   "http://www.w3.org/ns/shacl#" ,
}

kg = kglab.KnowledgeGraph(
    name = "Schema.org shacl eval datagraph",
    base_uri = "https://example.org/id/",
    namespaces = namespaces,
)

for item in ul:
    html = urllib.request.urlopen(item).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        print("JSON byte size: {} ".format(len(p.contents[0])))
        kg.load_rdf_text(data=p.contents[0], format="json-ld")
        print(p.contents[0])
    except Exception as e:
        logging.error(traceback.format_exc())

JSON byte size: 8955 

           {
    "@context": {
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "schema": "http://schema.org/",
        "xsd": "http://www.w3.org/2001/XMLSchema#"
    },
    "@graph": [
        {
            "@id": "https://catalogue.cioosatlantic.ca/dataset/65931687-684a-41b5-9aa5-086a9e72b5c5/resource/ca-cioos-resource_5060f3c24e9a3a739a0e57682d7e8f40",
            "@type": "schema:DataDownload",
            "schema:contentUrl": "https://www.smartatlantic.ca/erddap/tabledap/sma_negl_postville_nlqu0001",
            "schema:description": "ERDDAP's version of the OPeNDAP .html web page for this dataset. Specify a subset of the dataset and download the data via OPeNDAP or in many different file types.",
            "schema:encodingFormat": "ERDDAP",
            "schema:name": "ERDDAP Data Subset Form",
            "schema:url": "https://www.smartatlantic.ca/erddap/tabledap/sma_negl_postv

2022-07-24 09:48:26,135 | ERROR | 2435256237.py:25 | <cell line: 16> | Traceback (most recent call last):
  File "/tmp/ipykernel_3096099/2435256237.py", line 21, in <cell line: 16>
    print("JSON byte size: {} ".format(len(p.contents[0])))
AttributeError: 'NoneType' object has no attribute 'contents'



JSON byte size: 10777 

           {
    "@context": {
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "schema": "http://schema.org/",
        "xsd": "http://www.w3.org/2001/XMLSchema#"
    },
    "@graph": [
        {
            "@id": "https://catalogue.cioosatlantic.ca/dataset/ca-cioos_d5faadca-f0b6-46f2-9277-d1e76a8e313c/resource/ca-cioos-resource_25dddad42e209412d443a44a94fe8cdd",
            "@type": "schema:DataDownload",
            "schema:contentUrl": "https://cioosatlantic.ca/erddap/tabledap/NL_Climate_Index_all_fields.html",
            "schema:description": "ERDDAP's version of the OPeNDAP .html web page for this dataset. Specify a subset of the dataset and download the data via OPeNDAP or in many different file types.",
            "schema:encodingFormat": "ERDDAP",
            "schema:name": "ERDDAP Data Subset Form",
            "schema:url": "https://cioosatlantic.ca/erddap/tabledap/NL_Clim

In [10]:
sparql = """
PREFIX schema: <https://schema.org/>
SELECT ?s ?name ?description
  WHERE {
    ?s a ?type .    
    
  }
"""

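# Patterns that could be added to the WHERE clause to restrict the
# results to datasets and bind the other selected variables:
#     ?s a schema:Dataset .
#     ?s schema:name ?name .
#     ?s schema:description ?description .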

df = kg.query_as_df(sparql).to_pandas()

df.head()

Unnamed: 0,s
0,<https://catalogue.cioosatlantic.ca/dataset/65...
1,<https://catalogue.cioosatlantic.ca/dataset/ca...
2,<https://catalogue.cioosatlantic.ca/dataset/ca...
3,<https://catalogue.cioosatlantic.ca/dataset/ca...
4,_:N0e9182e2d6db47fe942bf07753b8589e
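
Only `?s` is bound by the query above, so the result has a single column. Note also that the sampled pages declare `"schema": "http://schema.org/"` while the query prefix is `https://schema.org/`, so patterns such as `?s schema:name ?name` would probably match nothing against this data as loaded. One option is to change the query prefix; another is to normalize the JSON-LD before loading it. The sketch below shows the second option for inline prefix maps only. The function name is hypothetical, and treating `https` as the target form is an assumption.

```python
import json


def normalize_schema_prefix(jsonld_text, old="http://schema.org/", new="https://schema.org/"):
    """Rewrite schema.org prefix declarations in an inline @context."""
    doc = json.loads(jsonld_text)
    if isinstance(doc, dict) and isinstance(doc.get("@context"), dict):
        # Only values that exactly equal the old namespace are replaced
        doc["@context"] = {
            key: (new if value == old else value)
            for key, value in doc["@context"].items()
        }
    return json.dumps(doc)


raw = '{"@context": {"schema": "http://schema.org/"}, "@type": "schema:Dataset"}'
normalized = normalize_schema_prefix(raw)
print(normalized)
```

The normalized text could then be passed to `kg.load_rdf_text(data=normalized, format="json-ld")` in place of `p.contents[0]`. Full IRIs written out inside the document body are not touched by this sketch.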
