New access patterns for NOAA@NSIDC data #63

Open · andypbarrett opened this issue Nov 21, 2023 · 1 comment
Comments

@andypbarrett (Collaborator)

I've been exploring requests and BeautifulSoup to get a list of files on HTTPS. I have code that recursively lists the files in a directory tree. I'm in two minds about whether this should be a tutorial or a how-to. The code "walks" the server directory tree and returns a generator that yields the url of each file. Recursion and generators are hard for many people to get their heads around (they are for me, at least), but the code fills a need.

Ideally, we would have a STAC catalog for these datasets so that we do not need these kinds of access patterns. This might be for my next playtime.
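For reference, a STAC search would replace the whole directory crawl. Here's a minimal sketch using pystac-client with a hypothetical catalog endpoint, collection id, and asset key, since no such catalog exists for these datasets yet:

from pystac_client import Client

# Hypothetical endpoint and collection id -- nothing like this is published yet
catalog = Client.open("https://example.org/noaa-nsidc/stac")
search = catalog.search(
    collections=["g02135-sea-ice-index"],
    datetime="2022-01-01/2022-12-31",
)
# "data" is an assumed asset key for the netCDF file on each item
urls = [item.assets["data"].href for item in search.items()]

In the meantime, here is the requests + BeautifulSoup version: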

import time
from http import HTTPStatus
from typing import Iterator

import requests
from requests.exceptions import HTTPError

from bs4 import BeautifulSoup


retry_codes = [
    HTTPStatus.TOO_MANY_REQUESTS,
    HTTPStatus.INTERNAL_SERVER_ERROR,
    HTTPStatus.BAD_GATEWAY,
    HTTPStatus.SERVICE_UNAVAILABLE,
    HTTPStatus.GATEWAY_TIMEOUT,
]


def get_page(url: str, retries: int = 3) -> requests.Response:
    """Get a response with requests, retrying on transient HTTP errors.

    Parameters
    ----------
    url : url of the resource
    retries : number of retries before failing

    Returns
    -------
    requests.Response object
    """
    for n in range(retries):
        try:
            response = requests.get(url)
            response.raise_for_status()

            return response

        except HTTPError as exc:
            code = exc.response.status_code

            if code in retry_codes and n < retries - 1:
                # wait n seconds before retrying (immediate on the first retry);
                # on the last attempt, fall through and re-raise instead
                time.sleep(n)
                continue

            raise


def get_filelist(url: str, ext: str = ".nc") -> Iterator[str]:
    """Walk the directory tree below url, yielding the url of each
    file that matches ext.

    Parameters
    ----------
    url : url of the directory to walk
    ext : file extension of files to search for

    Yields
    ------
    url of each matching file
    """
    
    def is_subdirectory(href):
        # Subdirectory links end with "/"; skip parent-directory and
        # hidden links so the walk only descends, never climbs back up
        return (href.endswith("/") and
                href not in url and
                not href.startswith("."))

    def is_file(href, ext):
        return href.endswith(ext)

    response = get_page(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        if is_subdirectory(a["href"]):
            # Recurse into the subdirectory, passing ext along
            yield from get_filelist(url + a["href"], ext)
        elif is_file(a["href"], ext):
            yield url + a["href"]
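Calling it is straightforward. For example, to list every netCDF file under one of the NOAA@NSIDC HTTPS directories (the url below is illustrative, not a confirmed path):

for file_url in get_filelist("https://noaadata.apps.nsidc.org/NOAA/G02135/north/daily/data/"):
    print(file_url)

Because it is a generator, urls stream back as the walk proceeds rather than waiting for the whole tree to be traversed.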

@asteiker (Member) commented Jan 23, 2024

CRYO-195
