# Sitemap Assay

The start of a simple notebook that could be hosted for people to test their sitemap (and robots.txt) files.

References:
* [AdvTools](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html)
* [Sitemap viz](https://www.ayima.com/us/insights/analytics-and-cro/how-to-visualize-an-xml-sitemap-using-python.html)



In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import advertools as adv
import json
import requests
from pyld import jsonld
from bs4 import BeautifulSoup
import urllib.request
import logging
import traceback
import kglab

In [2]:
%%time 

# smurl = 'https://opencoredata.org/sitemap_0.xml'
# smurl = 'https://www.bco-dmo.org/sitemap.xml'
# smurl = 'https://opentopography.org/sitemap.xml'
# smurl = 'https://www2.earthref.org/MagIC/contributions.sitemap.xml'
# smurl = 'https://oceanscape.org/organisation-sitemap.xml'
# smurl = 'https://catalogue.cioosatlantic.ca/sitemap/sitemap-1.xml'
# smurl = 'https://obis.org/sitemap/sitemap_datasets.xml'
# smurl = 'https://infohub.eurocean.net/sitemap/vessels'
smurl = 'https://raw.githubusercontent.com/iodepo/odis-arch/schema-dev-jm/code/notebooks/Exploration/data-pacificdatahub/sitemap.xml'

iow_sitemap = adv.sitemap_to_df(smurl) # load sitemap to dataframe via advertools
# iow_sitemap.info()
# iow_sitemap.head()

2022-07-25 16:49:16,130 | INFO | sitemaps.py:419 | sitemap_to_df | Getting https://raw.githubusercontent.com/iodepo/odis-arch/schema-dev-jm/code/notebooks/Exploration/data-pacificdatahub/sitemap.xml


CPU times: user 195 ms, sys: 11.2 ms, total: 206 ms
Wall time: 635 ms


## Analyzing the URLs

We can quickly grab the unique values from the `sitemap` column to see how many unique sitemap.xml files we are working with.

We can also dive into the URL structure for the resources a bit.

In [3]:
usm = iow_sitemap.sitemap.unique()
uloc = iow_sitemap["loc"].unique()
print("{} unique sitemap XML file(s) pointing to {} unique resource(s).".format(len(usm), len(uloc)))

# Break down all the URLs into their path parts
urldf = adv.url_to_df(list(iow_sitemap['loc']))
urldf.head()

1 unique sitemap XML file(s) pointing to 11567 unique resource(s).


Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,dir_3,query_profiles
0,https://pacificdata.org/data/dataset/06f61e4a-...,https,pacificdata.org,/data/dataset/06f61e4a-9ca0-47e3-8e9b-8f953274...,profiles=schemaorg,,data,dataset,06f61e4a-9ca0-47e3-8e9b-8f953274d26f.jsonld,schemaorg
1,https://pacificdata.org/data/dataset/07a576ce-...,https,pacificdata.org,/data/dataset/07a576ce-f6b0-4f1f-bf66-670e4f63...,profiles=schemaorg,,data,dataset,07a576ce-f6b0-4f1f-bf66-670e4f63f9b5.jsonld,schemaorg
2,https://pacificdata.org/data/dataset/0c77bfef-...,https,pacificdata.org,/data/dataset/0c77bfef-bbbc-43f9-9583-98bd4b9d...,profiles=schemaorg,,data,dataset,0c77bfef-bbbc-43f9-9583-98bd4b9d5b77.jsonld,schemaorg
3,https://pacificdata.org/data/dataset/100e2658-...,https,pacificdata.org,/data/dataset/100e2658-58f3-49d7-b15e-0da98d2b...,profiles=schemaorg,,data,dataset,100e2658-58f3-49d7-b15e-0da98d2b6bd9.jsonld,schemaorg
4,https://pacificdata.org/data/dataset/10th-aust...,https,pacificdata.org,/data/dataset/10th-australian-conference-on-co...,profiles=schemaorg,,data,dataset,10th-australian-conference-on-coastal-and-ocea...,schemaorg
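
For a single URL, the columns in the table above can be approximated with the standard library. This is only a sketch to show what each column holds; it is not how advertools implements `url_to_df`, and the helper name `split_url` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def split_url(url):
    """Split one URL into parts named like the url_to_df columns shown above."""
    parts = urlsplit(url)
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    # Each non-empty path segment becomes dir_1, dir_2, ...
    segments = [s for s in parts.path.split("/") if s]
    for i, segment in enumerate(segments, start=1):
        row[f"dir_{i}"] = segment
    # Each query parameter becomes query_<name>; only the first value is kept here
    for name, values in parse_qs(parts.query).items():
        row[f"query_{name}"] = values[0]
    return row


row = split_url(
    "https://pacificdata.org/data/dataset/"
    "coastal-population-dataset-wsm.jsonld?profiles=schemaorg"
)
print(row["dir_1"], row["dir_2"], row["dir_3"], row["query_profiles"])
```

Grouping by `dir_1`, as the sampling step does, therefore groups resources by their first path segment.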


## Sample and test sitemap entries

In [4]:
# sample the previously generated url data frame
sample_size = 5
sample_df = urldf.groupby("dir_1").sample(n=sample_size, random_state=1, replace=True)

In [5]:
sample_df.head()

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,dir_3,query_profiles
235,https://pacificdata.org/data/dataset/adb-borde...,https,pacificdata.org,/data/dataset/adb-border-project677d037e-7a2f-...,profiles=schemaorg,,data,dataset,adb-border-project677d037e-7a2f-49f4-862d-1796...,schemaorg
5192,https://pacificdata.org/data/dataset/oai-www-s...,https,pacificdata.org,/data/dataset/oai-www-spc-int-6dd83cee-fd2c-41...,profiles=schemaorg,,data,dataset,oai-www-spc-int-6dd83cee-fd2c-4183-a8e6-00b7a9...,schemaorg
905,https://pacificdata.org/data/dataset/coastal-p...,https,pacificdata.org,/data/dataset/coastal-population-dataset-wsm.j...,profiles=schemaorg,,data,dataset,coastal-population-dataset-wsm.jsonld,schemaorg
10955,https://pacificdata.org/data/dataset/united-na...,https,pacificdata.org,/data/dataset/united-nations-development-progr...,profiles=schemaorg,,data,dataset,united-nations-development-programme-terminal-...,schemaorg
7813,https://pacificdata.org/data/dataset/proceedin...,https,pacificdata.org,/data/dataset/proceedings-of-the-ninth-session...,profiles=schemaorg,,data,dataset,proceedings-of-the-ninth-session-tarawa-kiriba...,schemaorg


### See if the URLs resolve

In [6]:
import urllib.request
import requests

ul = sample_df["url"]

for item in ul:
    # user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    # headers={'User-Agent':user_agent,}

    headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                           'AppleWebKit/537.11 (KHTML, like Gecko) '
                           'Chrome/23.0.1271.64 Safari/537.11',
             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
             'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
             'Accept-Encoding': 'none',
             'Accept-Language': 'en-US,en;q=0.8',
             'Connection': 'keep-alive'}

    try:
        # x = requests.get(item)
        # code = x.status_code
        request = urllib.request.Request(url=item, headers=headers)  # the assembled request
        with urllib.request.urlopen(request) as response:
            info = response.info()
            dtype = info.get_content_type()    # e.g. text/html
        # headers = x.headers()
        # print("URL: {} \ninfo : {}\n --".format(item, info))
        print("URL: {} ".format(item))
    except Exception as e:
        # code = x.status_code
        # dtype = info.get_content_type()

        print("Exception on: {} \nerrors : {}\n --".format(item, str(e)))


URL: https://pacificdata.org/data/dataset/adb-border-project677d037e-7a2f-49f4-862d-17962bd1d469.jsonld?profiles=schemaorg 
URL: https://pacificdata.org/data/dataset/oai-www-spc-int-6dd83cee-fd2c-4183-a8e6-00b7a974ee8c.jsonld?profiles=schemaorg 
URL: https://pacificdata.org/data/dataset/coastal-population-dataset-wsm.jsonld?profiles=schemaorg 
URL: https://pacificdata.org/data/dataset/united-nations-development-programme-terminal-report-south-pacific-disaster-reduction-pr-1621.jsonld?profiles=schemaorg 
URL: https://pacificdata.org/data/dataset/proceedings-of-the-ninth-session-tarawa-kiribati-20-28-october-1980-2769.jsonld?profiles=schemaorg 


### See if they have JSON-LD (static check only, no dynamically loaded JSON-LD yet)

In [13]:
ul = sample_df["url"]

headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'application/ld+json,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

for item in ul:
    try:
        # Fetch and parse inside the try block so one failing URL does not stop the loop
        request = urllib.request.Request(url=item, headers=headers)
        html = urllib.request.urlopen(request).read()
        soup = BeautifulSoup(html, "html.parser")
        # find() returns the first matching script element, or None if there is none
        p = soup.find('script', {'type':'application/ld+json'})
        print("JSON byte size: {} ".format(len(p.contents[0])))
    except Exception as e:
        logging.error(traceback.format_exc())

JSON byte size: 4689 
JSON byte size: 3931 
JSON byte size: 7732 
JSON byte size: 3869 
JSON byte size: 3576 


### Check JSON-LD structure (static check only, no dynamically loaded JSON-LD yet)

In [19]:
ul = sample_df["url"]

myframe =  {
    "@context":{"@vocab": "https://schema.org/"},
    "@type": "Dataset",
}

context =  { "@vocab": "https://schema.org/" }

headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'application/ld+json,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

for item in ul:
    request=urllib.request.Request(url=item, headers=headers)
    # p = urllib.request.urlopen(request).read()
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        jld = json.loads(p.contents[0])
        # jld = json.loads(p)

        # print(str(jld))
        # compacted = jsonld.compact(str(jld), context)
        # print(len(json.dumps(compacted, indent=2)))
    except Exception as e:
        print("Exception")
        logging.error(traceback.format_exc())

{'@context': {'brick': 'https://brickschema.org/schema/Brick#', 'csvw': 'http://www.w3.org/ns/csvw#', 'dc': 'http://purl.org/dc/elements/1.1/', 'dcam': 'http://purl.org/dc/dcam/', 'dcat': 'http://www.w3.org/ns/dcat#', 'dcmitype': 'http://purl.org/dc/dcmitype/', 'dcterms': 'http://purl.org/dc/terms/', 'doap': 'http://usefulinc.com/ns/doap#', 'foaf': 'http://xmlns.com/foaf/0.1/', 'odrl': 'http://www.w3.org/ns/odrl/2/', 'org': 'http://www.w3.org/ns/org#', 'owl': 'http://www.w3.org/2002/07/owl#', 'prof': 'http://www.w3.org/ns/dx/prof/', 'prov': 'http://www.w3.org/ns/prov#', 'qb': 'http://purl.org/linked-data/cube#', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#', 'schema': 'http://schema.org/', 'sh': 'http://www.w3.org/ns/shacl#', 'skos': 'http://www.w3.org/2004/02/skos/core#', 'sosa': 'http://www.w3.org/ns/sosa/', 'ssn': 'http://www.w3.org/ns/ssn/', 'time': 'http://www.w3.org/2006/time#', 'vann': 'http://purl.org/vocab/vann/', 'void':
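
The cell above parses the JSON-LD but does not yet inspect it. As a lightweight first check, before any framing or compaction with pyld, one could look for a `Dataset` type directly in the parsed document. The sketch below is an assumption about what such a check might look like: the helper names are hypothetical, and the list of accepted spellings is a guess that does not replace proper JSON-LD expansion.

```python
def find_types(doc):
    """Return the set of @type values at the top level or inside any @graph."""
    nodes = doc if isinstance(doc, list) else [doc]
    types = set()
    for node in nodes:
        if not isinstance(node, dict):
            continue
        declared = node.get("@type", [])
        # @type may be a single string or a list of strings
        types.update([declared] if isinstance(declared, str) else declared)
        graph = node.get("@graph")
        if graph is not None:
            types |= find_types(graph)
    return types


def has_dataset(doc):
    """True if any node declares a schema.org Dataset type (assumed spellings)."""
    accepted = {
        "Dataset",
        "schema:Dataset",
        "http://schema.org/Dataset",
        "https://schema.org/Dataset",
    }
    return bool(find_types(doc) & accepted)


sample = {
    "@context": {"schema": "http://schema.org/"},
    "@graph": [{"@type": "schema:Dataset", "schema:name": "example"}],
}
print(has_dataset(sample))
```

A string comparison like this cannot resolve arbitrary prefixes from `@context`; expanding the document first would be the more robust route.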

## Load to Graph

Load a sample set of triples into RDFLib (via kglab) and run a sample SPARQL query on them.

### Note
This is the same loop as above, but now the JSON-LD is loaded into a kglab knowledge graph.

In [15]:
ul = sample_df["url"]

# Test loading into a graph
namespaces = {
    "schema":  "https://schema.org/",
    "schemaold":  "http://schema.org/",
    "shacl":   "http://www.w3.org/ns/shacl#" ,
}

kg = kglab.KnowledgeGraph(
    name = "Schema.org shacl eval datagraph",
    base_uri = "https://example.org/id/",
    namespaces = namespaces,
)

for item in ul:
    html = urllib.request.urlopen(item).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        print("JSON byte size: {} ".format(len(p.contents[0])))
        kg.load_rdf_text(data=p.contents[0], format="json-ld")
        print(p.contents[0])
    except Exception as e:
        logging.error(traceback.format_exc())

HTTPError: HTTP Error 403: Forbidden
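
One possible reason for the 403 is that this cell calls `urlopen` with the bare URL, whereas the earlier cells wrapped each URL in a `Request` carrying browser-like headers. The sketch below assumes that is the cause and shows one way to reuse those headers; `DEFAULT_HEADERS`, `build_request`, and `fetch` are hypothetical names, and the header values are copied from the earlier cells.

```python
import urllib.request

# Header values taken from the earlier cells in this notebook
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.11 (KHTML, like Gecko) "
                  "Chrome/23.0.1271.64 Safari/537.11",
    "Accept": "application/ld+json,text/html,application/xhtml+xml,"
              "application/xml;q=0.9,*/*;q=0.8",
}


def build_request(url, headers=None):
    """Wrap a URL in a Request that carries the shared headers."""
    return urllib.request.Request(url=url, headers=headers or DEFAULT_HEADERS)


def fetch(url, headers=None, timeout=30):
    """Return the response body as bytes, or None if the request fails."""
    try:
        request = build_request(url, headers)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except OSError as err:  # HTTPError and URLError are subclasses of OSError
        print("Exception on: {} \nerrors : {}\n --".format(url, err))
        return None
```

In the loop above, `html = fetch(item)` followed by a `None` check would then replace the direct `urlopen(item)` call, so a single refused URL no longer stops the whole cell.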

In [10]:
sparql = """
PREFIX schema: <https://schema.org/>
SELECT ?s ?name ?description
  WHERE {
    ?s a ?type .    
    
  }
"""

#  schema:Dataset 
# ?s schema:name ?name .
#     ?s schema:description ?description.

df = kg.query_as_df(sparql).to_pandas()

df.head()

Unnamed: 0,s
0,<https://catalogue.cioosatlantic.ca/dataset/65...
1,<https://catalogue.cioosatlantic.ca/dataset/ca...
2,<https://catalogue.cioosatlantic.ca/dataset/ca...
3,<https://catalogue.cioosatlantic.ca/dataset/ca...
4,_:N0e9182e2d6db47fe942bf07753b8589e
