Merge pull request #139 from podaac/cmr_s3_links

add notebook to query cmr and get s3 links for a collections
podaac · Aug 22, 2023 · 686df23 · 686df23
2 parents 70fbdf0 + fa6864d
commit 686df23
Showing 1 changed file with 249 additions and 0 deletions.
diff --git a/notebooks/podaac_cmr_s3_links.ipynb b/notebooks/podaac_cmr_s3_links.ipynb
@@ -0,0 +1,249 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "c1d2baee",
+   "metadata": {},
+   "source": [
+    "# CMR search getting S3 Links\n",
+    "\n",
+    "This is a tutorial to show how to retrive a list of s3 links from a CMR granules search. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5cff1d9b",
+   "metadata": {},
+   "source": [
+    "# Prerequisites\n",
+    "\n",
+    "Before proceeding, ensure you have the following:\n",
+    "\n",
+    "Python installed on your system.\n",
+    "The requests library installed. If not, you can install it using \"pip install requests\"\n",
+    "\n",
+    "Define function below to get s3 links for granules based on CMR search criterias."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "52c99c5b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import netrc\n",
+    "\n",
+    "def get_s3_links(collection_concept_id, provider, bounding_box=None,\n",
+    "                 time_range=None, cycle=None, wildcard=None, edl_token=None):\n",
+    "    \"\"\"\n",
+    "    Fetch S3 links from CMR API based on search criteria.\n",
+    "\n",
+    "    :param collection_concept_id: The concept ID of the collection to search within.\n",
+    "    :param provider: The data provider for the collection.\n",
+    "    :param bounding_box: A list of coordinates [min_lon, min_lat, max_lon, max_lat] to filter by bounding box.\n",
+    "    :param time_range: A list of two datetime strings [start_time, end_time] to filter by temporal range.\n",
+    "    :param cycle: The cycle value to filter by.\n",
+    "    :param wildcard: A native_id wildcard pattern to filter granules.\n",
+    "    :param edl_token: The EDL token for authentication (optional).\n",
+    "    :return: A list of S3 links from the CMR API.\n",
+    "    \"\"\"\n",
+    "    base_url = 'https://cmr.earthdata.nasa.gov'\n",
+    "    search_endpoint = '/search/granules.umm_json'\n",
+    "\n",
+    "    # Set up query parameters\n",
+    "    params = {\n",
+    "        'collection_concept_id': collection_concept_id,\n",
+    "        'provider': provider,\n",
+    "        'page_size': 2000\n",
+    "    }\n",
+    "\n",
+    "    if bounding_box:\n",
+    "        params['bounding_box'] = ','.join(map(str, bounding_box))\n",
+    "\n",
+    "    if time_range:\n",
+    "        params['temporal'] = ','.join(map(str, time_range))\n",
+    "\n",
+    "    if cycle:\n",
+    "        params['cycle'] = cycle\n",
+    "\n",
+    "    if wildcard:\n",
+    "        params['options[native_id][pattern]'] = 'true'\n",
+    "        params['native_id'] = wildcard\n",
+    "\n",
+    "    s3_links = []\n",
+    "\n",
+    "    headers = {'cmr-search-after': None}\n",
+    "    if edl_token:\n",
+    "        headers[\"Authorization\"] = f\"Bearer {edl_token}\"\n",
+    "\n",
+    "    try:\n",
+    "        while True:\n",
+    "            response = requests.get(base_url + search_endpoint, params=params, headers=headers)\n",
+    "            response.raise_for_status()  # Check for request errors\n",
+    "            response_data = response.json()\n",
+    "            cmr_search_after = response.headers.get(\"cmr-search-after\")\n",
+    "\n",
+    "            if 'items' not in response_data:\n",
+    "                break\n",
+    "\n",
+    "            for item in response_data['items']:\n",
+    "                if 'umm' in item and 'RelatedUrls' in item['umm']:\n",
+    "                    for url_info in item['umm']['RelatedUrls']:\n",
+    "                        if url_info['Type'] == 'GET DATA VIA DIRECT ACCESS' and url_info['URL'] and 's3://' in url_info['URL']:\n",
+    "                            s3_links.append(url_info['URL'])\n",
+    "\n",
+    "            headers['cmr-search-after'] = cmr_search_after\n",
+    "\n",
+    "            if cmr_search_after is None:\n",
+    "                break\n",
+    "\n",
+    "    except requests.exceptions.RequestException as e:\n",
+    "        print(f\"Error: {e}\")\n",
+    "\n",
+    "    return s3_links"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd4bcef9",
+   "metadata": {},
+   "source": [
+    "# Provide CMR Credentials\n",
+    "\n",
+    "You may need authentication for the CMR API (e.g., Earthdata Login), the following function will create an EDL token for you, also you will need to create a .netrc file with your EDL credentials. More information on .netrc files https://everything.curl.dev/usingcurl/netrc"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e0de0474",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# GET TOKEN FROM CMR \n",
+    "def get_token():\n",
+    "    urs_root = 'urs.earthdata.nasa.gov'\n",
+    "    username, _, password = netrc.netrc().authenticators(urs_root)\n",
+    "    token_api = \"https://{}/api/users/tokens\".format(urs_root)\n",
+    "    response = requests.get(token_api, auth=(username, password))\n",
+    "    content = response.json()\n",
+    "    if len(content) > 0:\n",
+    "        return content[0].get('access_token')\n",
+    "    else:\n",
+    "        create_token_api = \"https://{}/api/users/token\".format(urs_root)\n",
+    "        response = requests.post(create_token_api, auth=(username, password))\n",
+    "        content = response.json()\n",
+    "        return content.get('access_token')\n",
+    "\n",
+    "edl_token = get_token()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "44135399",
+   "metadata": {},
+   "source": [
+    "# Search Collection Concept id\n",
+    "\n",
+    "Search for a collection concept id using CMR API, we will use a collections shortname to find the collection concept id."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9d880c46",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "collection_short_name = \"SWOT_L2_LR_SSH_BASIC_1.0\"\n",
+    "headers = {\"Authorization\": f\"Bearer {edl_token}\"}\n",
+    "cmr_collection_url = f\"https://cmr.earthdata.nasa.gov/search/collections.json?short_name={collection_short_name}\"\n",
+    "response = requests.get(headers=headers, url=cmr_collection_url)\n",
+    "collection_concept_id = response.json().get('feed').get('entry')[0].get('id')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "973aa41f",
+   "metadata": {},
+   "source": [
+    "# Define Search Criteria\n",
+    "\n",
+    "Define the search criteria to filter the granules. This can include the collection concept ID, provider, bounding box, time range, cycle, and wildcard (if needed). Wildcard are used to search for granules name that falls within the wildcard regex."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "81fcd473",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Example usage:\n",
+    "provider = 'POCLOUD'\n",
+    "bounding_box = [-90, -90, 90, 90]\n",
+    "time_range = [\"2023-01-01T00:00:00Z\", \"2023-12-30T23:59:59Z\"]\n",
+    "cycle = \"560\"\n",
+    "wildcard = \"*2023*\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f9b0972c",
+   "metadata": {},
+   "source": [
+    "# Fetch S3 Links\n",
+    "\n",
+    "Now, let's call the get_s3_links function with the provided search criteria to fetch the S3 links from the CMR API."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "db8be3ff",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "s3_links = get_s3_links(collection_concept_id, provider,\n",
+    "                        bounding_box=bounding_box, time_range=time_range, \n",
+    "                        wildcard=wildcard, edl_token=edl_token, cycle=cycle)\n",
+    "\n",
+    "print(len(s3_links))\n",
+    "display(s3_links)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ceae2f42",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}