Change to https #873

Merged: 4 commits, Nov 10, 2022
Conversation

@MarcSkovMadsen (Collaborator) commented Aug 28, 2022

Fixes #862

Notes

@MarcSkovMadsen (Collaborator, Author)

The tests fail due to the obsolete URL mentioned in #870, and I can't find a replacement for it.

Let me know if there is more I should do. Thanks.

@hoxbro (Member) commented Aug 29, 2022

I would just add the invalid URL to SKIP_URLS. It is "only" metadata, and I feel it is more correct to keep pointing to the site the data was originally downloaded from, even though it does not exist anymore.

Alternatively, you could update the URL to the new site I mentioned in issue #870, but I would at least verify that the data is actually available on the new site first.

So there are two ways forward: one easy and one hard. I will leave it up to you which one to choose; a sketch of the easy option follows below.
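
For reference, the easy option is roughly a one-line addition to the SKIP_URLS list that test_links.py already uses (a sketch; the comment text is illustrative):

# Sketch of the "easy" option: tell the link checker to skip the obsolete URL.
SKIP_URLS = [
    "https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm",  # Not available anymore
]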

@MarcSkovMadsen (Collaborator, Author)

I would prefer to have a valid URL from the new site.

I tried to find it but had no luck. Since I don't have the original context, I wouldn't be able to tell whether I had found the right URL. I don't even know how Intake works or what the purpose of the metadata is, and for now I don't need to learn.

Can someone else help find the correct URL?

@philippjfr (Member)

I think this is probably the best link: https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2019

@philippjfr (Member)

@hoxbro (Member) commented Sep 12, 2022

I would change test_links.py to the following. Even though the fixtures are nice, I think we should check all the URLs in one go: that helps when the same URL appears in several files, and it avoids the overhead of starting up a ThreadPoolExecutor for every file. If a URL fails, it should be easy to find the file by searching for the URL.

I also changed it to use a simpler regex and requests.head, so that only the headers of each site are fetched. Note, though, that anaconda.org gives status code 400 for a HEAD request and 200 for a GET request.

"""Urls in docstrings etc. should be valid and secure, i.e.

- exist, i.e. provide a 200 response
- use https:// instead of http:// unless
    - https:// is not supported by the web site
    - https:// cannot be used. For example in SVGs.
"""

import pathlib
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import requests

URL_REGEX = re.compile(r"(https?:\/\/?[\w/\-?=%.]+\.[\w/\-&?=%.]+)")
ROOT = pathlib.Path(__file__).parents[2]
POST_FIXES = ["*.py", "*.ipynb", "*.md", "*.yaml", "*.yml"]
SKIP_URLS = [
    "https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm",  # Not available anymore
    "http://weegee.vision.ucmerced.edu/datasets/landuse.html",  # Not available anymore
    "https://github.com/holoviz/hvplot",  # Alot of issues/PRs
    "http://octopart.com",  # Located in Notebook output, not relevant
    "https://jsperf.com",  # Located in Notebook output, not relevant
    "https://web.archive.org",  # Can be slow and does not work with the regex.
    "https://anaconda.org/anaconda/hvplot",  # Give a status code of 400?
    "https://anaconda.org/conda-forge/hvplot",  # Give a status code of 400?
    "https://anaconda.org/pyviz/hvplot",  # Give a status code of 400?
    "https://bugs.chromium.org/p/chromium/issues",  # Give status code 405 (needs login)
    "https://mybinder.org",  # Give status code 405
]
SKIP_FILES = [
    ".ipynb_checkpoints",  # Can give problems when running test local
]


def _get_files_to_check():
    for post_fix in POST_FIXES:
        for file in ROOT.rglob(post_fix):
            if any(skip in str(file) for skip in SKIP_FILES):
                continue
            yield file


def _find_urls(text):
    urls = set()
    for url in set(re.findall(URL_REGEX, text)):
        # The regex keeps a trailing dot when a URL ends a sentence; strip it
        # before the SKIP_URLS check so that skipped URLs are never re-added.
        url = url.rstrip(".")
        if any(url.startswith(skip) for skip in SKIP_URLS):
            continue
        urls.add(url)

    return urls


def _verify_urls(urls):
    func = partial(requests.head, timeout=5)

    with ThreadPoolExecutor() as executor:
        # executor.map preserves the input order, so the responses below
        # line up with the urls they were requested for.
        responses = executor.map(func, urls)

    for url, response in zip(urls, responses):
        if not response.ok:
            raise ValueError(
                f"The url {url} responded with status {response.status_code}."
            )


def test_urls():
    urls = set()
    for file in _get_files_to_check():
        text = file.read_text()
        urls |= _find_urls(text)

    urls = sorted(urls)
    _verify_urls(urls)
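
A possible follow-up (not part of this PR): several SKIP_URLS entries exist only because those hosts reject HEAD requests. A small fallback to GET, sketched below with a hypothetical _check_url helper, could replace requests.head in _verify_urls and would probably allow entries such as the anaconda.org ones to be dropped:

def _check_url(url):
    """Sketch: try HEAD first, fall back to GET for hosts that reject HEAD."""
    response = requests.head(url, timeout=5, allow_redirects=True)
    if response.status_code in (400, 405):
        # Some hosts (e.g. anaconda.org) answer 400 or 405 to HEAD but 200 to GET.
        # stream=True avoids downloading the full response body.
        response = requests.get(url, timeout=5, stream=True)
        response.close()
    return response

In _verify_urls, func = partial(requests.head, timeout=5) would then simply become func = _check_url.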

@maximlt (Member) commented Nov 10, 2022

I replaced the broken link with an Internet Archive link. I'd be happy if one day we could replace this crime dataset with another one; I find it a little creepy. Thanks @MarcSkovMadsen!!!
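
Note that the archive link itself will not actually be verified by the link-checking test: https://web.archive.org is in SKIP_URLS, and the regex would truncate an archive URL at the nested https:// anyway. A quick illustration (the archive URL below is made up):

import re

URL_REGEX = re.compile(r"(https?:\/\/?[\w/\-?=%.]+\.[\w/\-&?=%.]+)")
text = "https://web.archive.org/web/20220101000000/https://www.example.com"
print(re.findall(URL_REGEX, text))
# ['https://web.archive.org/web/20220101000000/https']  (nested URL is cut off)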

@maximlt merged commit e98dfb1 into holoviz:master on Nov 10, 2022
Successfully merging this pull request may close the following issue:

Use https instead of http in links