New access patterns for NOAA@NSIDC data #63

Open · andypbarrett opened this issue Nov 21, 2023 · 1 comment
Comments

@andypbarrett (Collaborator)

I've been exploring requests and BeautifulSoup to get a list of files on HTTPS. I have code that recursively lists the files in a directory tree. I'm in two minds about whether this should be a tutorial or a how-to. The code "walks" the server directory tree and returns a generator that yields the url of each file. Recursion and generators are hard for many people to get their heads around (they are for me, at least), but the code fills a need.

Ideally, we would have a STAC catalog for these datasets so that we do not need these kinds of access patterns. This might be for my next playtime.
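For reference, a STAC search would replace the whole directory crawl. Here's a minimal sketch using pystac-client with a hypothetical catalog endpoint, collection id, and asset key, since no such catalog exists for these datasets yet:

from pystac_client import Client

# Hypothetical endpoint and collection id -- nothing like this is published yet
catalog = Client.open("https://example.org/noaa-nsidc/stac")
search = catalog.search(
    collections=["g02135-sea-ice-index"],
    datetime="2022-01-01/2022-12-31",
)
# "data" is an assumed asset key for the netCDF file on each item
urls = [item.assets["data"].href for item in search.items()]

In the meantime, here is the requests + BeautifulSoup version: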

import time
from http import HTTPStatus
from typing import Iterator

import requests
from requests.exceptions import HTTPError

from bs4 import BeautifulSoup


retry_codes = [
    HTTPStatus.TOO_MANY_REQUESTS,
    HTTPStatus.INTERNAL_SERVER_ERROR,
    HTTPStatus.BAD_GATEWAY,
    HTTPStatus.SERVICE_UNAVAILABLE,
    HTTPStatus.GATEWAY_TIMEOUT,
]


def get_page(url: str, retries: int = 3) -> requests.Response:
    """Get a response with requests, retrying on transient HTTP errors.

    Parameters
    ----------
    url : url of the resource
    retries : number of retries before failing

    Returns
    -------
    requests.Response object
    """
    for n in range(retries):
        try:
            response = requests.get(url)
            response.raise_for_status()

            return response

        except HTTPError as exc:
            code = exc.response.status_code

            if code in retry_codes and n < retries - 1:
                # wait n seconds before retrying (immediate on the first retry);
                # on the last attempt, fall through and re-raise instead
                time.sleep(n)
                continue

            raise


def get_filelist(url: str, ext: str = ".nc") -> Iterator[str]:
    """Walk the directory tree below url, yielding the url of each
    file that matches ext.

    Parameters
    ----------
    url : url of the directory to walk
    ext : file extension of files to search for

    Yields
    ------
    url of each matching file
    """
    
    def is_subdirectory(href):
        # Subdirectory links end with "/"; skip parent-directory and
        # hidden links so the walk only descends, never climbs back up
        return (href.endswith("/") and
                href not in url and
                not href.startswith("."))

    def is_file(href, ext):
        return href.endswith(ext)

    response = get_page(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        if is_subdirectory(a["href"]):
            # Recurse into the subdirectory, passing ext along
            yield from get_filelist(url + a["href"], ext)
        elif is_file(a["href"], ext):
            yield url + a["href"]
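Calling it is straightforward. For example, to list every netCDF file under one of the NOAA@NSIDC HTTPS directories (the url below is illustrative, not a confirmed path):

for file_url in get_filelist("https://noaadata.apps.nsidc.org/NOAA/G02135/north/daily/data/"):
    print(file_url)

Because it is a generator, urls stream back as the walk proceeds rather than waiting for the whole tree to be traversed.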

@asteiker (Member) commented Jan 23, 2024

CRYO-195
