# Sitemap Assay

The start of a simple notebook that could be hosted for people to test their sitemap (and robots.txt) files with.

References:
* [AdvTools](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html)
* [Sitemap viz](https://www.ayima.com/us/insights/analytics-and-cro/how-to-visualize-an-xml-sitemap-using-python.html)


<a href="https://githubtocolab.com/gleanerio/archetype/blob/master/networks/commons/sources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.png" alt="Open in Colab"/></a>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/gleanerio/archetype/HEAD?labpath=networks/commons/sources.ipynb)



## Notes

* Has JSON-LD: https://researchdata.edu.au/imos-soop-underway-sep-2017/970828
  * Validation: https://validator.schema.org/#url=https%3A%2F%2Fresearchdata.edu.au%2Fimos-soop-underway-sep-2017%2F970828
* Doesn't have JSON-LD: https://researchdata.edu.au/heupel-michelle/1709766
  * Validation: https://validator.schema.org/#url=https%3A%2F%2Fresearchdata.edu.au%2Fheupel-michelle%2F1709766


Context as map:  https://www.w3.org/TR/json-ld/#context-definitions
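
The linked section of the JSON-LD specification allows `@context` to be a string (a remote context IRI), a map, or an array mixing both. As a rough sketch of what "context as map" means in practice, the hypothetical helper `get_vocab` below (not part of this notebook or of any library) reads the `@vocab` value from whichever form is present. It does not fetch remote contexts, so a string-only context yields `None`.

```python
def get_vocab(doc):
    """Return the @vocab IRI declared in a JSON-LD document's @context, or None."""
    context = doc.get("@context")
    # Normalise to a list so string, map, and array contexts share one code path.
    entries = context if isinstance(context, list) else [context]
    vocab = None
    for entry in entries:
        # Only map entries can carry @vocab locally; strings are remote contexts.
        if isinstance(entry, dict) and "@vocab" in entry:
            vocab = entry["@vocab"]  # later entries override earlier ones
    return vocab


print(get_vocab({"@context": {"@vocab": "https://schema.org/"}}))
```

This is useful for spotting whether a page declares `http://schema.org/` or `https://schema.org/`, which matters later when writing SPARQL prefixes.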





In [None]:
!pip -q install advertools
!pip -q install pyld
!pip -q install kglab
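!pip -q install requests
!pip -q install beautifulsoup4  # provides the bs4 module imported below
# json is part of the Python standard library and needs no install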

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import advertools as adv
import json
from bs4 import BeautifulSoup
import urllib.request
import logging
import traceback
import kglab
import pandas as pd
# import requests
# from pyld import jsonld


In [2]:
%%time

# smurl = "https://oceanexpert.org/assets/sitemaps/sitemapIndex.xml"
# smurl = "https://www.bco-dmo.org/sitemap.xml"
# smurl = "https://obis.org/sitemap_datasets.xml"
smurl = "https://edmo.seadatanet.org/sitemap.xml"

iow_sitemap = adv.sitemap_to_df(smurl) # load sitemap to dataframe via advertools
iow_sitemap.info()
# iow_sitemap.head()

2023-12-12 07:25:30,491 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://edmo.seadatanet.org/sitemap.xml


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4878 entries, 0 to 4877
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   loc              4878 non-null   object             
 1   lastmod          4875 non-null   datetime64[ns, UTC]
 2   sitemap          4878 non-null   object             
 3   sitemap_size_mb  4878 non-null   float64            
 4   download_date    4878 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](2), float64(1), object(2)
memory usage: 190.7+ KB
CPU times: user 115 ms, sys: 2.99 ms, total: 118 ms
Wall time: 1.78 s


## Analyzing the URLs

The `sitemap` column records which sitemap file each entry came from, so its unique values tell us how many sitemap XML files we are working with. The unique `loc` values give the number of distinct resources.

We can also dive into the URL structure for the resources a bit.

In [3]:
usm = iow_sitemap.sitemap.unique()
uloc = iow_sitemap["loc"].unique()
print("{} unique sitemap XML file(s) pointing to {} unique resource(s).".format(len(usm), len(uloc)))


1 unique sitemap XML file(s) pointing to 4878 unique resource(s).


In [4]:

from urllib.parse import urlparse

invalid_urls = []

for url in uloc:
    try:
        result = urlparse(url)
        if all([result.scheme, result.netloc]):
            pass
        else:
            invalid_urls.append(url)
    except ValueError:
        print(f"{url} is an exception URL")
        invalid_urls.append(url)


df = pd.DataFrame({'Invalid URLs': invalid_urls})


In [5]:
df.head()

Unnamed: 0,Invalid URLs


In [6]:
from urllib.parse import urlparse

valid_urls = []


for url in uloc:
    try:
        result = urlparse(url)
        if all([result.scheme, result.netloc]):
            valid_urls.append(url)
        else:
            print(f"Invalid URL: {url}")
    except ValueError:
        print(f"Invalid URL: {url}")
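
The two cells above apply the same test (a URL needs both a scheme and a network location) in separate loops. One possible consolidation, sketched here with the hypothetical helper names `is_valid_url` and `partition_urls`, splits a list into valid and invalid URLs in a single pass. It uses only `urllib.parse` and makes no network calls, so it checks syntax, not reachability.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """True when the URL parses and has both a scheme and a netloc."""
    try:
        result = urlparse(url)
    except ValueError:
        # urlparse raises ValueError for malformed input such as bad IPv6 brackets.
        return False
    return bool(result.scheme and result.netloc)


def partition_urls(urls):
    """Split an iterable of URLs into (valid, invalid) lists, preserving order."""
    valid, invalid = [], []
    for url in urls:
        (valid if is_valid_url(url) else invalid).append(url)
    return valid, invalid


valid, invalid = partition_urls(
    ["https://edmo.seadatanet.org/report/5679", "/report/5679"]
)
print(len(valid), len(invalid))
```

In the notebook this could be called as `partition_urls(uloc)` to fill `valid_urls` and `invalid_urls` together.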


In [7]:

# Break down all the URL into their path parts
urldf = adv.url_to_df(list(iow_sitemap['loc']))
# urldf = adv.url_to_df(list(uloc))

urldf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4878 entries, 0 to 4877
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   url       4878 non-null   object 
 1   scheme    4878 non-null   object 
 2   netloc    4878 non-null   object 
 3   path      4878 non-null   object 
 4   query     0 non-null      float64
 5   fragment  0 non-null      float64
 6   dir_1     4877 non-null   object 
 7   dir_2     4875 non-null   object 
 8   last_dir  4877 non-null   object 
dtypes: float64(2), object(7)
memory usage: 343.1+ KB


In [8]:
urldf.head()


Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,last_dir
0,https://edmo.seadatanet.org/,https,edmo.seadatanet.org,/,,,,,
1,https://edmo.seadatanet.org/search,https,edmo.seadatanet.org,/search,,,search,,search
2,https://edmo.seadatanet.org/sparql,https,edmo.seadatanet.org,/sparql,,,sparql,,sparql
3,https://edmo.seadatanet.org/report/5883,https,edmo.seadatanet.org,/report/5883,,,report,5883.0,5883
4,https://edmo.seadatanet.org/report/5882,https,edmo.seadatanet.org,/report/5882,,,report,5882.0,5882
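
To get a quick picture of how the resources are distributed, the first path segment (the `dir_1` column above) can be counted. The sketch below does this with the standard library only, using a hypothetical helper `first_segment_counts`; the `"(root)"` label for URLs with an empty path is an assumption made for this example.

```python
from collections import Counter
from urllib.parse import urlparse


def first_segment_counts(urls):
    """Count URLs by their first non-empty path segment."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # URLs such as https://host/ have no segments; group them under "(root)".
        counts[segments[0] if segments else "(root)"] += 1
    return counts


sample = [
    "https://edmo.seadatanet.org/",
    "https://edmo.seadatanet.org/search",
    "https://edmo.seadatanet.org/report/5883",
    "https://edmo.seadatanet.org/report/5882",
]
print(first_segment_counts(sample).most_common())
```

With the dataframe already in memory, `urldf["dir_1"].value_counts(dropna=False)` should give a comparable summary.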


## Sample and test sitemap entries

In [9]:
# sample the previously generated url data frame
sample_size = 5
# sample_df = urldf.groupby("dir_1").sample(n=sample_size, random_state=1, replace=True)
sample_df = urldf.sample(n=sample_size, random_state=1, replace=True)

In [10]:
sample_df.head(5)

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,last_dir
235,https://edmo.seadatanet.org/report/5679,https,edmo.seadatanet.org,/report/5679,,,report,5679,5679
3980,https://edmo.seadatanet.org/report/2688,https,edmo.seadatanet.org,/report/2688,,,report,2688,2688
905,https://edmo.seadatanet.org/report/5274,https,edmo.seadatanet.org,/report/5274,,,report,5274,5274
2763,https://edmo.seadatanet.org/report/1439,https,edmo.seadatanet.org,/report/1439,,,report,1439,1439
2895,https://edmo.seadatanet.org/report/3837,https,edmo.seadatanet.org,/report/3837,,,report,3837,3837


### See if the URLs resolve

In [11]:
import urllib.request
import requests

ul = sample_df["url"]

for item in ul:
    # user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    # headers={'User-Agent':user_agent,}

    headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                           'AppleWebKit/537.11 (KHTML, like Gecko) '
                           'Chrome/23.0.1271.64 Safari/537.11',
             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
             'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
             'Accept-Encoding': 'none',
             'Accept-Language': 'en-US,en;q=0.8',
             'Connection': 'keep-alive'}

    try:
        # x = requests.get(item)
        # code = x.status_code
        request=urllib.request.Request(url=item, headers=headers) #The assembled request
        with urllib.request.urlopen(request) as response:
            info = response.info()
            dtype = info.get_content_type()    # -> text/html
        # headers = x.headers()
        # print("URL: {} \ninfo : {}\n --".format(item, info))
        print("URL: {} ".format(item))
    except Exception as e:
        # code = x.status_code
        # dtype = info.get_content_type()

        print("Exception on: {} \nerrors : {}\n --".format(item, str(e)))


URL: https://edmo.seadatanet.org/report/5679 
URL: https://edmo.seadatanet.org/report/2688 
URL: https://edmo.seadatanet.org/report/5274 
URL: https://edmo.seadatanet.org/report/1439 
URL: https://edmo.seadatanet.org/report/3837 
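
The loop above prints the URL on success but discards the status code and content type. A lighter variant, sketched below, issues a `HEAD` request with a timeout and returns both. The names `build_head_request` and `check_url` and the `USER_AGENT` string are hypothetical, and some servers reject or mishandle `HEAD`, so a fallback to `GET` may be needed in practice.

```python
import urllib.error
import urllib.request

# Hypothetical identifier for this example; not taken from the notebook.
USER_AGENT = "sitemap-assay-notebook/0.1"


def build_head_request(url):
    """Assemble a HEAD request so only headers are transferred."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT}, method="HEAD"
    )


def check_url(url, timeout=10):
    """Return (status, content_type) on a response, or (None, error text) on failure."""
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
            return response.status, response.info().get_content_type()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code, None
    except OSError as err:
        # Covers URLError (DNS, refused connection) and socket timeouts.
        return None, str(err)
```

Calling `check_url(item)` for each entry in `sample_df["url"]` would then give a (status, content type) pair per URL that can be collected into a dataframe.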


### See if they have JSON-LD (static check only, no dynamically loaded JSON-LD yet)

In [12]:
ul = sample_df["url"]

headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'application/ld+json,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

for item in ul:
    print(item)
    request=urllib.request.Request(url=item, headers=headers)
    # p = urllib.request.urlopen(request).read()

    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        # print("JSON byte size: {} ".format(len(p)))
        print(p)
        print("JSON byte size: {} ".format(len(p.contents[0])))
    except Exception as e:
        logging.error(traceback.format_exc())

https://edmo.seadatanet.org/report/5679
<script type="application/ld+json">{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Organization",
    "@id": "https://edmo.seadatanet.org/report/5679",
    "name": "Qingdao Hisun Ocean Equipment Corporation Limited",
    "identifier": "5679",
    "alternateName": "QDHISUN",
    "url": "https://edmo.seadatanet.org/report/5679",
    "location": {
        "@type": "Place",
        "latitude": 36.5233271,
        "longitude": 120.45198
    },
    "contactPoint": {
        "@type": "ContactPoint",
        "email": "qdhisun@qdhisun.com"
    },
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Qingdao City, China",
        "addressRegion": "Shandong Province",
        "postalCode": "266000",
        "streetAddress": "No. 1 Wenhai Road, Aoshanwei Town, Jimo"
    },
    "sameAs": [
        "http://www.qdhisun.com/"
    ],
    "memberOf": {
        "@type": "Program

### Check JSON-LD structure (static check only, no dynamically loaded JSON-LD yet)

In [13]:
ul = sample_df["url"]

myframe =  {
    "@context":{"@vocab": "http://schema.org/"},
    "@type": "Dataset",
}

context =  { "@vocab": "http://schema.org/" }

headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'application/ld+json,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

for item in ul:
    request=urllib.request.Request(url=item, headers=headers)
    # p = urllib.request.urlopen(request).read()
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        jld = json.loads(p.contents[0])
        # jld = json.loads(p)

        # print(str(jld))
        # compacted = jsonld.compact(str(jld), context)
        # print(len(json.dumps(compacted, indent=2)))
    except Exception as e:
        print("Exception")
        logging.error(traceback.format_exc())
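
The loops above use `soup.find`, which returns only the first JSON-LD `script` element, and they never compare the parsed document with the expected `@type`. The sketch below collects every JSON-LD block on a page and tests its type. To keep the example standalone it swaps BeautifulSoup for the standard-library `html.parser`; the names `JSONLDExtractor`, `extract_jsonld`, and `has_type` are hypothetical. With BeautifulSoup, `soup.find_all('script', {'type': 'application/ld+json'})` is the closest equivalent.

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return all JSON-LD blocks in the HTML that parse as JSON; skip the rest."""
    parser = JSONLDExtractor()
    parser.feed(html)
    parser.close()
    docs = []
    for raw in parser.blocks:
        try:
            docs.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # malformed block: ignored in this sketch
    return docs


def has_type(doc, expected):
    """True when the document's @type (string or list) includes the expected type."""
    if not isinstance(doc, dict):
        return False
    declared = doc.get("@type", [])
    if isinstance(declared, str):
        declared = [declared]
    return expected in declared


page = '<script type="application/ld+json">{"@type": "Organization"}</script>'
for doc in extract_jsonld(page):
    print(has_type(doc, "Organization"), has_type(doc, "Dataset"))
```

Because this is still a static check, JSON-LD injected by JavaScript after page load will not be found.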

## Load to Graph

Load a sample set of triples into an RDFLib graph (via kglab) and run a sample SPARQL query on them.

### Note
This is the same loop as above, but now we load the JSON-LD into a knowledge graph.

In [14]:
ul = sample_df["url"]

# Test loading into a graph
namespaces = {
    "schema":  "https://schema.org/",   # matches the @vocab used by the sampled pages
    "schemaold":  "http://schema.org/",
    "shacl":   "http://www.w3.org/ns/shacl#" ,
}

kg = kglab.KnowledgeGraph(
    name = "Schema.org shacl eval datagraph",
    base_uri = "https://example.org/id/",
    namespaces = namespaces,
)

for item in ul:
    html = urllib.request.urlopen(item).read()
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find('script', {'type':'application/ld+json'})
    try:
        print("JSON byte size: {} ".format(len(p.contents[0])))
        kg.load_rdf_text(data=p.contents[0], format="json-ld")
        print(p.contents[0])
    except Exception as e:
        logging.error(traceback.format_exc())

JSON byte size: 1023 
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Organization",
    "@id": "https://edmo.seadatanet.org/report/5679",
    "name": "Qingdao Hisun Ocean Equipment Corporation Limited",
    "identifier": "5679",
    "alternateName": "QDHISUN",
    "url": "https://edmo.seadatanet.org/report/5679",
    "location": {
        "@type": "Place",
        "latitude": 36.5233271,
        "longitude": 120.45198
    },
    "contactPoint": {
        "@type": "ContactPoint",
        "email": "qdhisun@qdhisun.com"
    },
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Qingdao City, China",
        "addressRegion": "Shandong Province",
        "postalCode": "266000",
        "streetAddress": "No. 1 Wenhai Road, Aoshanwei Town, Jimo"
    },
    "sameAs": [
        "http://www.qdhisun.com/"
    ],
    "memberOf": {
        "@type": "ProgramMembership",
        "programName": "European Direct

In [15]:
sparql = """
PREFIX schema: <https://schema.org/>
SELECT ?s ?name ?description ?type
  WHERE {
    ?s a ?type .
    ?s schema:name ?name .
    ?s schema:description ?description.
  }
"""

#  schema:Dataset
# ?s schema:name ?name .
#     ?s schema:description ?description.

df = kg.query_as_df(sparql)

df.head()

Unnamed: 0,s,name,description,type
0,<https://edmo.seadatanet.org/report/2688>,"JRC, Institute for Environment and Sustainability",A healthy environment is one of the cornerston...,:Organization
1,<https://edmo.seadatanet.org/report/5274>,Marine Solutions Tasmania Pty Ltd,"Based in Hobart, Marine Solutions conducts pro...",:Organization
2,<https://edmo.seadatanet.org/report/1439>,"University of California, Santa Cruz, Ocean Sc...",The Ocean Sciences Department includes faculty...,:Organization
3,<https://edmo.seadatanet.org/report/3837>,University of Southern Mississippi,"The University of Southern Mississippi, known ...",:Organization
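
The result above has four rows for five sampled URLs: `report/5679` is absent. One possible explanation, not verified here, is that this record has no `schema:description`, in which case the required triple pattern drops it. A variant of the query that makes the description optional is sketched below; the variable name `sparql_optional` is introduced only for this example.

```python
# Same query as above, but resources without a description are still returned
# (their ?description column is simply unbound).
sparql_optional = """
PREFIX schema: <https://schema.org/>
SELECT ?s ?name ?description ?type
  WHERE {
    ?s a ?type .
    ?s schema:name ?name .
    OPTIONAL { ?s schema:description ?description . }
  }
"""

print(sparql_optional)
```

It can be passed to `kg.query_as_df(sparql_optional)` in the same way as the original query.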
