Change to https #873

Merged: 4 commits, Nov 10, 2022
Conversation

@MarcSkovMadsen (Collaborator) commented Aug 28, 2022

Fixes #862

Notes

@MarcSkovMadsen (Collaborator, Author)

The tests fail due to the obsolete URL mentioned in #870, and I can't find a replacement for it.

Let me know if there is more I should do. Thanks.

@hoxbro (Member) commented Aug 29, 2022

I would just add the invalid URL to SKIP_URLS. It is "only" metadata, and I feel it is more correct to keep pointing to the site the data was originally downloaded from, even though it does not exist anymore.

Alternatively, you could update the URL to the new site I mentioned in issue #870, but I would at least verify that the data is actually available on the new site first.

So there are two ways forward: one easy and one hard. I will leave it up to you which one to choose; a sketch of the easy option follows below.
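
For reference, the easy option is roughly a one-line addition to the SKIP_URLS list that test_links.py already uses (a sketch; the comment text is illustrative):

# Sketch of the "easy" option: tell the link checker to skip the obsolete URL.
SKIP_URLS = [
    "https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm",  # Not available anymore
]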

@MarcSkovMadsen (Collaborator, Author)

I would prefer to have a valid URL from the new site.

I tried to find it but had no luck. Since I don't have the original context, I wouldn't be able to tell whether I had found the right URL. I don't even know how Intake works or what the purpose of the metadata is, and for now I don't need to learn.

Can someone else help find the correct URL?

@philippjfr (Member)

I think this is probably the best link: https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2019

@philippjfr (Member)

@hoxbro (Member) commented Sep 12, 2022

I would change test_links.py to the following. Even though the fixtures are nice, I think we should check all the URLs in one go: that helps when the same URL appears in several files, and it avoids the overhead of starting up a ThreadPoolExecutor for every file. If a URL fails, it should be easy to find the file by searching for the URL.

I also changed it to use a simpler regex and requests.head, so that only the headers of each site are fetched. Note, though, that anaconda.org gives status code 400 for a HEAD request and 200 for a GET request.

"""Urls in docstrings etc. should be valid and secure, i.e.

- exist, i.e. provide a 200 response
- use https:// instead of http:// unless
    - https:// is not supported by the web site
    - https:// cannot be used. For example in SVGs.
"""

import pathlib
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import requests

URL_REGEX = re.compile(r"(https?:\/\/?[\w/\-?=%.]+\.[\w/\-&?=%.]+)")
ROOT = pathlib.Path(__file__).parents[2]
POST_FIXES = ["*.py", "*.ipynb", "*.md", "*.yaml", "*.yml"]
SKIP_URLS = [
    "https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm",  # Not available anymore
    "http://weegee.vision.ucmerced.edu/datasets/landuse.html",  # Not available anymore
    "https://github.com/holoviz/hvplot",  # Alot of issues/PRs
    "http://octopart.com",  # Located in Notebook output, not relevant
    "https://jsperf.com",  # Located in Notebook output, not relevant
    "https://web.archive.org",  # Can be slow and does not work with the regex.
    "https://anaconda.org/anaconda/hvplot",  # Give a status code of 400?
    "https://anaconda.org/conda-forge/hvplot",  # Give a status code of 400?
    "https://anaconda.org/pyviz/hvplot",  # Give a status code of 400?
    "https://bugs.chromium.org/p/chromium/issues",  # Give status code 405 (needs login)
    "https://mybinder.org",  # Give status code 405
]
SKIP_FILES = [
    ".ipynb_checkpoints",  # Can give problems when running test local
]


def _get_files_to_check():
    for post_fix in POST_FIXES:
        for file in ROOT.rglob(post_fix):
            if any(skip in str(file) for skip in SKIP_FILES):
                continue
            yield file


def _find_urls(text):
    urls = set()
    for url in set(re.findall(URL_REGEX, text)):
        # The regex keeps a trailing dot when a URL ends a sentence; strip it
        # before the SKIP_URLS check so that skipped URLs are never re-added.
        url = url.rstrip(".")
        if any(url.startswith(skip) for skip in SKIP_URLS):
            continue
        urls.add(url)

    return urls


def _verify_urls(urls):
    func = partial(requests.head, timeout=5)

    with ThreadPoolExecutor() as executor:
        # executor.map preserves the input order, so the responses below
        # line up with the urls they were requested for.
        responses = executor.map(func, urls)

    for url, response in zip(urls, responses):
        if not response.ok:
            raise ValueError(
                f"The url {url} responded with status {response.status_code}."
            )


def test_urls():
    urls = set()
    for file in _get_files_to_check():
        text = file.read_text()
        urls |= _find_urls(text)

    urls = sorted(urls)
    _verify_urls(urls)
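
A possible follow-up (not part of this PR): several SKIP_URLS entries exist only because those hosts reject HEAD requests. A small fallback to GET, sketched below with a hypothetical _check_url helper, could replace requests.head in _verify_urls and would probably allow entries such as the anaconda.org ones to be dropped:

def _check_url(url):
    """Sketch: try HEAD first, fall back to GET for hosts that reject HEAD."""
    response = requests.head(url, timeout=5, allow_redirects=True)
    if response.status_code in (400, 405):
        # Some hosts (e.g. anaconda.org) answer 400 or 405 to HEAD but 200 to GET.
        # stream=True avoids downloading the full response body.
        response = requests.get(url, timeout=5, stream=True)
        response.close()
    return response

In _verify_urls, func = partial(requests.head, timeout=5) would then simply become func = _check_url.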

@maximlt (Member) commented Nov 10, 2022

I replaced the broken link with an Internet Archive link. I'd be happy if one day we could replace this crime dataset with another one; I find it a little creepy. Thanks @MarcSkovMadsen!!!
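
Note that the archive link itself will not actually be verified by the link-checking test: https://web.archive.org is in SKIP_URLS, and the regex would truncate an archive URL at the nested https:// anyway. A quick illustration (the archive URL below is made up):

import re

URL_REGEX = re.compile(r"(https?:\/\/?[\w/\-?=%.]+\.[\w/\-&?=%.]+)")
text = "https://web.archive.org/web/20220101000000/https://www.example.com"
print(re.findall(URL_REGEX, text))
# ['https://web.archive.org/web/20220101000000/https']  (nested URL is cut off)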

@maximlt merged commit e98dfb1 into holoviz:master on Nov 10, 2022
Successfully merging this pull request may close the following issue:

Use https instead of http in links