# Sitemap Assay

The start of a simple notebook that could be hosted for people to test their sitemap (and robots.txt) files.

References:
* [AdvTools](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html)
* [Sitemap viz](https://www.ayima.com/us/insights/analytics-and-cro/how-to-visualize-an-xml-sitemap-using-python.html)



In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import advertools as adv
import json
import requests
from pyld import jsonld
from bs4 import BeautifulSoup
import urllib.request
import logging
import traceback
import kglab

In [53]:
%%time 

# smurl = 'https://opencoredata.org/sitemap_0.xml'
# smurl = 'https://www.bco-dmo.org/sitemap.xml'
smurl = 'https://opentopography.org/sitemap.xml'
# smurl = 'https://www2.earthref.org/MagIC/contributions.sitemap.xml'
# smurl = 'https://oceanscape.org/organisation-sitemap.xml'
# smurl = 'https://catalogue.cioosatlantic.ca/sitemap/sitemap-1.xml'
# smurl = 'https://obis.org/sitemap/sitemap_datasets.xml'
# smurl = 'https://infohub.eurocean.net/sitemap/vessels'

iow_sitemap = adv.sitemap_to_df(smurl) # load sitemap to dataframe via advertools
# iow_sitemap.info()
# iow_sitemap.head()

2022-07-11 15:25:49,504 | INFO | sitemaps.py:419 | sitemap_to_df | Getting https://opentopography.org/sitemap.xml


CPU times: user 34 ms, sys: 4.67 ms, total: 38.7 ms
Wall time: 775 ms
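
For a sense of what the sitemap loader is reading, here is a minimal sketch that parses a sitemap using only the Python standard library. It is not how advertools does it. The sample XML, the example.org URLs, and the `parse_sitemap` helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used only for this sketch
SAMPLE_XML = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/dataset?id=1</loc>
    <lastmod>2022-07-01</lastmod>
  </url>
  <url>
    <loc>https://example.org/dataset?id=2</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with its loc and optional lastmod."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        rows.append({"loc": loc, "lastmod": lastmod})
    return rows


for row in parse_sitemap(SAMPLE_XML):
    print(row["loc"], row["lastmod"])
```

This sketch only handles a plain `urlset`. Sitemap index files, which point to other sitemaps, would need an extra pass.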


## Analyzing the URLs

The sitemap column records which sitemap file each entry came from, so its unique values tell us how many sitemap.xml files we are working with. The loc column holds the resource URLs themselves.

We can also dive into the URL structure for the resources a bit.

In [57]:
usm = iow_sitemap.sitemap.unique()
uloc = iow_sitemap["loc"].unique()
print("{} unique sitemap XML file(s) pointing to {} unique resource(s).".format(len(usm), len(uloc)))

# Break down all the URLs into their path parts
urldf = adv.url_to_df(list(iow_sitemap['loc']))
urldf.head()

1 unique sitemap XML file(s) pointing to 709 unique resource(s).


| | url | scheme | netloc | path | query | fragment | dir_1 | dir_2 | query_opentopoID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://portal.opentopography.org/lidarDataset... | https | portal.opentopography.org | /lidarDataset | opentopoID=OTLAS.102021.6345.1 | | lidarDataset | | OTLAS.102021.6345.1 |
| 1 | https://portal.opentopography.org/lidarDataset... | https | portal.opentopography.org | /lidarDataset | opentopoID=OTLAS.092021.6339.1 | | lidarDataset | | OTLAS.092021.6339.1 |
| 2 | https://portal.opentopography.org/lidarDataset... | https | portal.opentopography.org | /lidarDataset | opentopoID=OTLAS.092021.2193.1 | | lidarDataset | | OTLAS.092021.2193.1 |
| 3 | https://portal.opentopography.org/lidarDataset... | https | portal.opentopography.org | /lidarDataset | opentopoID=OTLAS.092021.32611.1 | | lidarDataset | | OTLAS.092021.32611.1 |
| 4 | https://portal.opentopography.org/lidarDataset... | https | portal.opentopography.org | /lidarDataset | opentopoID=OTLAS.082021.6339.2 | | lidarDataset | | OTLAS.082021.6339.2 |
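
The same kind of breakdown can be approximated without advertools. The sketch below uses `urllib.parse` from the standard library to split two URLs taken from this sitemap into parts and count the first-level directories. The `url_parts` helper and its output keys are illustrative names, not the advertools API.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def url_parts(url):
    """Split a URL into the pieces that matter for a sitemap review."""
    split = urlsplit(url)
    # Path segments, e.g. "/dataspace/dataset" -> ["dataspace", "dataset"]
    dirs = [segment for segment in split.path.split("/") if segment]
    # Keep the first value of each query parameter
    params = {key: values[0] for key, values in parse_qs(split.query).items()}
    return {
        "scheme": split.scheme,
        "netloc": split.netloc,
        "path": split.path,
        "query": split.query,
        "fragment": split.fragment,
        "dirs": dirs,
        "params": params,
    }


urls = [
    "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.052013.26913.1",
    "https://portal.opentopography.org/dataspace/dataset?opentopoID=OTDS.102020.4326.1",
]

parts = [url_parts(u) for u in urls]
# Count how many URLs fall under each first-level directory
first_level = Counter(p["dirs"][0] if p["dirs"] else "" for p in parts)
print(first_level)
```

Counting by first-level directory is a quick way to see which kinds of resources a sitemap exposes before sampling from each kind.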


## Sample and test sitemap entries

In [46]:
# sample the previously generated url data frame
sample_size = 5
sample_df = urldf.groupby("dir_1").sample(n=sample_size, random_state=1, replace=True)

In [47]:
sample_df.head()

| | url | scheme | netloc | path | query | fragment | dir_1 | dir_2 | query_opentopoID |
|---|---|---|---|---|---|---|---|---|---|
| 621 | https://portal.opentopography.org/dataspace/da... | https | portal.opentopography.org | /dataspace/dataset | opentopoID=OTDS.102020.4326.1 | | dataspace | dataset | OTDS.102020.4326.1 |
| 691 | https://portal.opentopography.org/dataspace/da... | https | portal.opentopography.org | /dataspace/dataset | opentopoID=OTDS.012019.32633.1 | | dataspace | dataset | OTDS.012019.32633.1 |
| 596 | https://portal.opentopography.org/dataspace/da... | https | portal.opentopography.org | /dataspace/dataset | opentopoID=OTDS.052021.26911.1 | | dataspace | dataset | OTDS.052021.26911.1 |
| 656 | https://portal.opentopography.org/dataspace/da... | https | portal.opentopography.org | /dataspace/dataset | opentopoID=OTDS.122019.4326.4 | | dataspace | dataset | OTDS.122019.4326.4 |
| 593 | https://portal.opentopography.org/dataspace/da... | https | portal.opentopography.org | /dataspace/dataset | opentopoID=OTDS.062021.32616.1 | | dataspace | dataset | OTDS.062021.32616.1 |
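
The sampling step draws `sample_size` rows per first-level directory, and `replace=True` means a directory with fewer rows than that still returns a full sample, possibly with repeats. Here is a rough standard-library sketch of the same idea. It will not reproduce the exact rows pandas picks, and the `first_segment` and `sample_per_group` helpers and the example.org URLs are hypothetical.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def first_segment(url):
    """Return the first path segment of a URL, or '' if the path is empty."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[0] if segments else ""


def sample_per_group(urls, n, seed=1):
    """Draw n URLs (with replacement) from each first-level directory."""
    groups = defaultdict(list)
    for url in urls:
        groups[first_segment(url)].append(url)
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    return {key: rng.choices(members, k=n) for key, members in sorted(groups.items())}


example_urls = [
    "https://example.org/raster?id=1",
    "https://example.org/lidarDataset?id=1",
    "https://example.org/lidarDataset?id=2",
]

for group, picked in sample_per_group(example_urls, n=3).items():
    print(group, picked)
```

Sampling with replacement keeps the per-group count constant, at the cost of testing some URLs more than once in small groups.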


### See if the URLs resolve

In [59]:
import urllib.request

ul = sample_df["url"]

# Browser-like headers, defined once; some servers may reject the default urllib user agent
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

for item in ul:
    try:
        request = urllib.request.Request(url=item, headers=headers)  # the assembled request
        with urllib.request.urlopen(request) as response:
            info = response.info()
            dtype = info.get_content_type()  # e.g. text/html
        # Reaching this line means the URL resolved without raising
        print("URL: {} ".format(item))
    except Exception as e:
        print("Exception on: {} \nerrors : {}\n --".format(item, str(e)))


URL: https://portal.opentopography.org/dataspace/dataset?opentopoID=OTDS.102020.4326.1 
URL: https://portal.opentopography.org/dataspace/dataset?opentopoID=OTDS.012019.32633.1 
URL: https://portal.opentopography.org/dataspace/dataset?opentopoID=OTDS.052021.26911.1 
URL: https://portal.opentopography.org/dataspace/dataset?opentopoID=OTDS.122019.4326.4 
URL: https://portal.opentopography.org/dataspace/dataset?opentopoID=OTDS.062021.32616.1 
URL: https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.052013.26913.1 
URL: https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.122016.2193.2 
URL: https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.082011.26910.1 
URL: https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.102016.26916.1 
URL: https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.122016.26915.1 
URL: https://portal.opentopography.org/raster?opentopoID=OTSDEM.012012.26911.1 
URL: https://portal.opentopography.org/raster?opentop
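
The loop above only prints the URLs that resolved. A small variation, sketched here under some assumptions, is to collect the status code and content type for each URL so the results can be tabulated later. The `check_url` helper, its `open_fn` parameter (there so the function can be exercised without a network call), and the user agent string are all hypothetical.

```python
import urllib.request

# Hypothetical user agent for this sketch; swap in whatever your crawler should send
HEADERS = {"User-Agent": "sitemap-assay-sketch/0.1"}


def check_url(url, open_fn=urllib.request.urlopen, timeout=10):
    """Try to open url and return a dict describing the outcome."""
    request = urllib.request.Request(url=url, headers=HEADERS)
    try:
        with open_fn(request, timeout=timeout) as response:
            return {
                "url": url,
                "ok": True,
                "status": response.status,
                "content_type": response.info().get_content_type(),
                "error": None,
            }
    except Exception as e:
        # HTTPError carries the status code in .code; other errors have none
        return {
            "url": url,
            "ok": False,
            "status": getattr(e, "code", None),
            "content_type": None,
            "error": str(e),
        }
```

Calling `[check_url(u) for u in sample_df["url"]]` would give a list of dicts that `pandas.DataFrame` accepts directly.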

### See if they have JSON-LD (static check only, no dynamically loaded JSON-LD yet)

In [61]:
ul = sample_df["url"]

headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

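for item in ul:
    request = urllib.request.Request(url=item, headers=headers)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, "html.parser")
    # find() returns only the first JSON-LD script block, or None if the page has none
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        # len() of the script text counts characters, which approximates the byte size
        print("JSON byte size: {} ".format(len(p.contents[0])))
    except Exception as e:
        # p is None (no JSON-LD found) or the script tag is empty
        logging.error(traceback.format_exc())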

JSON byte size: 3385 
JSON byte size: 4172 
JSON byte size: 4478 
JSON byte size: 3258 
JSON byte size: 4215 
JSON byte size: 5137 
JSON byte size: 4915 
JSON byte size: 4804 
JSON byte size: 4052 
JSON byte size: 5557 
JSON byte size: 4169 
JSON byte size: 4915 
JSON byte size: 3913 
JSON byte size: 4100 
JSON byte size: 4116 
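
A page can carry more than one JSON-LD script block, while the check above looks only at the first. The sketch below collects every block and reports its size in actual bytes. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The `JsonLdExtractor` class, the `extract_jsonld` function, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # a list while inside a JSON-LD script, else None

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_jsonld(html):
    """Return a list with the raw text of each JSON-LD block in html."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample_html = (
    '<html><head><script type="text/javascript">var x = 1;</script>'
    '<script type="application/ld+json">{"@type": "Dataset", "name": "a"}</script>'
    '</head><body></body></html>'
)

for block in extract_jsonld(sample_html):
    print("JSON byte size: {}".format(len(block.encode("utf-8"))))
```

Like the notebook cell, this is a static check: JSON-LD injected by JavaScript after page load will not be seen.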


### Check JSON-LD structure (static check only, no dynamically loaded JSON-LD yet)

In [64]:
ul = sample_df["url"]

myframe =  {
    "@context":{"@vocab": "https://schema.org/"},
    "@type": "Dataset",
}

context =  { "@vocab": "https://schema.org/" }

headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

for item in ul:
    request=urllib.request.Request(url=item, headers=headers)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        jld = json.loads(p.contents[0])
        # print(str(jld))
        # compacted = jsonld.compact(str(jld), context)
        # print(len(json.dumps(compacted, indent=2)))
    except Exception as e:
        print("Exception")
        logging.error(traceback.format_exc())
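
Beyond confirming that the JSON parses, a light structural check can look for a Dataset node and the properties queried later in this notebook (name and description). The following is a minimal sketch, not a JSON-LD processor: it assumes the document uses plain schema.org terms as keys (no prefixes or full IRIs), and the `REQUIRED` tuple and helper names are assumptions made for this example.

```python
import json

# Assumption: these are the properties we care about for a Dataset
REQUIRED = ("name", "description")


def iter_nodes(doc):
    """Yield each top-level node, looking inside lists and @graph containers."""
    if isinstance(doc, list):
        for item in doc:
            yield from iter_nodes(item)
    elif isinstance(doc, dict):
        if "@graph" in doc:
            yield from iter_nodes(doc["@graph"])
        else:
            yield doc


def has_type(node, wanted):
    """True if the node's @type (a string or a list) includes wanted."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return wanted in types


def check_dataset(text):
    """Parse JSON-LD text and report Dataset nodes and missing properties."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as e:
        return {"valid_json": False, "datasets": 0, "missing": [], "error": str(e)}
    datasets = [node for node in iter_nodes(doc) if has_type(node, "Dataset")]
    missing = sorted({prop for node in datasets for prop in REQUIRED if prop not in node})
    return {"valid_json": True, "datasets": len(datasets), "missing": missing, "error": None}
```

In the loop above, `check_dataset(p.contents[0])` could replace the bare `json.loads` call to give a per-URL summary. A fuller check would frame or compact the document with pyld first.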

## Load to Graph

Load a sample set of triples into RDFLib (via kglab) and run a sample SPARQL query on them.

### Note
This is the same loop as above, but now each JSON-LD block is loaded into a kglab knowledge graph.

In [71]:
ul = sample_df["url"]

# Test loading into a graph
namespaces = {
    "schema":  "https://schema.org/",
    "shacl":   "http://www.w3.org/ns/shacl#" ,
}

kg = kglab.KnowledgeGraph(
    name = "Schema.org shacl eval datagraph",
    base_uri = "https://example.org/id/",
    namespaces = namespaces,
)

for item in ul:
    html = urllib.request.urlopen(item).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        print("JSON byte size: {} ".format(len(p.contents[0])))
        kg.load_rdf_text(data=p.contents[0], format="json-ld")
        # print(p.contents[0])
    except Exception as e:
        logging.error(traceback.format_exc())

JSON byte size: 3385 
JSON byte size: 4172 
JSON byte size: 4478 
JSON byte size: 3258 
JSON byte size: 4215 
JSON byte size: 5137 
JSON byte size: 4915 
JSON byte size: 4804 
JSON byte size: 4052 
JSON byte size: 5557 
JSON byte size: 4169 
JSON byte size: 4915 
JSON byte size: 3913 
JSON byte size: 4100 
JSON byte size: 4116 


In [72]:
sparql = """
PREFIX schema: <https://schema.org/>
SELECT ?s ?name ?description
  WHERE {
    ?s a schema:Dataset .
    ?s schema:name ?name .
    ?s schema:description ?description
  }
"""

df = kg.query_as_df(sparql).to_pandas()

df.head()

| | s | name | description |
|---|---|---|---|
| 0 | `<https://portal.opentopography.org/dataspace/d...` | High Resolution Topography of the Kashihe Faul... | This dataset comprises six high-resolution str... |
| 1 | `<https://portal.opentopography.org/dataspace/d...` | 2016 Norcia Earthquake (Italy), Mt Bove Fault ... | `<p>Pre-earthquake terrestrial laser scanning d...` |
| 2 | `<https://portal.opentopography.org/dataspace/d...` | Steep Headwater-Colluvial Channels, Day Creek,... | Dataset includes UAV structure-from-motion der... |
| 3 | `<https://portal.opentopography.org/dataspace/d...` | Fault scarp near Ili Basin, Shonzhy, Kazakhstan | Target: Fault scarp of a E-W striking, S-dippi... |
| 4 | `<https://portal.opentopography.org/dataspace/d...` | Survey of Point Beach, Wisconsin, September 2020 | Lake Michigan experiences quasi-decadal fluctu... |
