# Using EsgpullAPI for File Search and Download
This notebook demonstrates how to use the `EsgpullAPI` to search for climate data files and download them programmatically using `esgpull`.

## Setup
First, we import `EsgpullAPI`. When called without arguments, `EsgpullAPI()` tries to locate your `esgpull` configuration file (e.g., `config.toml`, typically found in `~/.config/esgpull/`). This file contains settings for data download locations, authentication, search preferences, and other `esgpull` behaviors.
If your configuration file is in a non-standard location, or if you want to use a specific configuration for your script, you can optionally provide the `config_path` argument during initialization (e.g., `EsgpullAPI(config_path='/path/to/your/config.toml')`).
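
The lookup order described above can be sketched without touching `esgpull` at all. The helper below is hypothetical (`resolve_config_path` and `DEFAULT_CONFIG` are not part of `esgpull`), and the default location is simply the one quoted in this notebook, so treat it as an assumption about your installation.

```python
from pathlib import Path
from typing import Optional, Union

# Default location quoted in this notebook; an assumption, not read from esgpull itself.
DEFAULT_CONFIG = Path("~/.config/esgpull/config.toml")


def resolve_config_path(
    override: Optional[Union[str, Path]] = None,
    default: Union[str, Path] = DEFAULT_CONFIG,
) -> Optional[Path]:
    """Return the first existing file among the override and the default, else None."""
    for candidate in (override, default):
        if candidate is None:
            continue
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    return None


config_path = resolve_config_path()
if config_path is None:
    print("No configuration file found; pass an explicit path instead.")
else:
    print(f"Using configuration file: {config_path}")
```

If a path is found, it can be passed on as `EsgpullAPI(config_path=str(config_path))`, the argument described above.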

In [1]:
%reload_ext autoreload
%autoreload 2
from esgpull.api import EsgpullAPI
import json # For pretty printing results
import os

# Initialize EsgpullAPI.
# By default, EsgpullAPI() will attempt to find the esgpull configuration file
# in its standard locations (e.g., ~/.config/esgpull/config.toml).
api = None
try:
    print("Attempting to initialize EsgpullAPI using default configuration search...")
    api = EsgpullAPI() # Initialize with default config search
    print("EsgpullAPI initialized successfully using default configuration.")
except Exception as e_default:
    print(f"Error initializing EsgpullAPI with default configuration: {e_default}")
    print("This might mean esgpull is not installed, no configuration file was found in standard locations, or the found config is invalid.")
    print("Please ensure esgpull is correctly set up (e.g., run `esgpull self install` or create `~/.config/esgpull/config.toml`).")
    print("\nAlternatively, if you have a configuration file in a custom location, you can specify its path below:")

    # OPTIONAL: Specify a path to your esgpull configuration file if the default search fails or you want to override.
    # Example: config_path_override = os.path.expanduser("~/custom_esgpull_configs/my_project_config.toml")
    config_path_override = None  # Set to a path string if you want to use a specific config file

    if isinstance(config_path_override, str) and os.path.exists(config_path_override):
        print(f"\nAttempting to initialize EsgpullAPI with specified config: {config_path_override}")
        try:
            api = EsgpullAPI(config_path=config_path_override)
            print(f"EsgpullAPI initialized successfully with {config_path_override}.")
        except Exception as e_override:
            print(f"Error initializing EsgpullAPI with specified config '{config_path_override}': {e_override}")
            print("Please ensure the path is correct and the file is a valid esgpull configuration.")
    elif isinstance(config_path_override, str): # Path was specified but not found
        print(f"\nSpecified configuration file not found: {config_path_override}. Please check the path.")

if not api:
    print("\nEsgpullAPI could not be initialized. Subsequent cells may not work as expected.")
    print("Ensure esgpull is configured correctly or provide a valid config_path if needed.")

# If initialization failed, subsequent cells using `api` will likely raise errors or be skipped by the 'if api:' checks.

Attempting to initialize EsgpullAPI using default configuration search...
EsgpullAPI initialized successfully using default configuration.


## 1. Searching for Files
We can search for files using a dictionary of search criteria. The criteria correspond to facets used in ESGF search (e.g., `project`, `experiment_id`, `variable`, `frequency`). The `api.search()` method directly queries ESGF nodes based on these criteria.
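
When the facets come from user input or a loop, it can help to assemble the criteria dictionary in one place and drop anything left unset. The sketch below is a hypothetical helper (`build_criteria` is not an `esgpull` function). It assumes, as this notebook does, that `limit` lives in the same dictionary as the facets.

```python
from typing import Any, Dict, Optional


def build_criteria(limit: Optional[int] = None, **facets: Any) -> Dict[str, Any]:
    """Assemble a criteria dict, skipping facets whose value is None."""
    criteria = {key: value for key, value in facets.items() if value is not None}
    if limit is not None:
        if limit <= 0:
            raise ValueError("limit must be a positive integer")
        criteria["limit"] = limit
    return criteria


criteria = build_criteria(
    project="CMIP6",
    experiment_id="historical",
    variable="tas",
    frequency="mon",
    data_node=None,  # left unset, so it is omitted from the result
    limit=100,
)
print(criteria)
```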

In [2]:
max_files = 100
# Define search criteria
# These are examples, adjust them to your needs and available data on ESGF nodes.
search_criteria = {
    "project": "CMIP6",
    "experiment_id": "historical",
    "variable": "tas",  # Near-surface air temperature
    "frequency": "mon", # Monthly
    # "data_node": "esgf-data.dkrz.de", # Optional: specify a data node to query
    # Add other facets as needed, e.g., "source_id", "variant_label"
    "limit": max_files # Optional: limit the number of results returned by the search query to the node
}

if api:
    print(f"Searching ESGF with criteria: {json.dumps(search_criteria, indent=2)}")
    try:
        search_results = api.search(criteria=search_criteria)
        if search_results:
            print(f"Found {len(search_results)} (of {max_files} allowed) files/datasets matching the criteria from the ESGF node.")
            print(f"Showing details for the first {min(3, len(search_results))} results:")
            for i, result in enumerate(search_results[:3]): # Print first 3 results
                print(f"\nResult {i+1}:")
                print(json.dumps(result, indent=2)) # `result` is a dictionary of file/dataset attributes
        else:
            print("No results found from ESGF for the given criteria.")
    except Exception as e:
        print(f"Error during ESGF search: {e}")
else:
    print("API not initialized. Skipping ESGF search.")

Searching ESGF with criteria: {
  "project": "CMIP6",
  "experiment_id": "historical",
  "variable": "tas",
  "frequency": "mon",
  "limit": 100
}
Found 100 (of 100 allowed) files/datasets matching the criteria from the ESGF node.
Showing details for the first 3 results:

Result 1:
{
  "file_id": "CMIP6.CMIP.AS-RCEC.TaiESM1.historical.r1i1p1f1.Amon.tas.gn.v20200623.tas_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc",
  "dataset_id": "CMIP6.CMIP.AS-RCEC.TaiESM1.historical.r1i1p1f1.Amon.tas.gn.v20200623",
  "master_id": "CMIP6.CMIP.AS-RCEC.TaiESM1.historical.r1i1p1f1.Amon.tas.gn.tas_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc",
  "url": "https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/tas/gn/v20200623/tas_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc",
  "version": "v20200623",
  "filename": "tas_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc",
  "local_path": "CMIP6/CMIP/AS-RCEC/TaiESM1/historical/

## 2. Adding a Query to Track and Download
To manage and download files with `esgpull`, we first add a query to its internal database. This query defines the set of files we are interested in based on specified criteria. 
The `api.add()` method is used for this. It requires a dictionary of criteria, which **must include a unique `name` field**. This `name` serves as the `query_id` for all subsequent operations like downloading, updating, or tracking the query.
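If a query with the given `name` already exists, `api.add()` is expected to update its definition. In the recorded run below, however, the call failed with `sqlite3.IntegrityError` (`UNIQUE constraint failed: tag.sha`), which suggests that a tag for this name was already stored. If you see the same error, pick a different `name` or remove the existing query first.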
When a query is added or updated, `esgpull` may perform a search on ESGF nodes to find matching files and record them in its database, associating them with this query name.
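
Because the `name` field is mandatory and must be unique, it can be worth checking it before calling `api.add()`. The helpers below are a minimal sketch with hypothetical names (`require_query_name`, `unique_query_name`). The set of existing names has to be supplied by the caller, since this notebook does not show how to list stored queries.

```python
from typing import Any, Dict, Iterable


def require_query_name(criteria: Dict[str, Any]) -> str:
    """Return the query name, or raise if 'name' is missing or blank."""
    name = criteria.get("name")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("criteria must include a non-empty 'name'")
    return name


def unique_query_name(base: str, existing: Iterable[str]) -> str:
    """Return base, or base with a numeric suffix, so it is not in existing."""
    taken = set(existing)
    if base not in taken:
        return base
    suffix = 2
    while f"{base}_{suffix}" in taken:
        suffix += 1
    return f"{base}_{suffix}"


# Names already in use are made up here for illustration.
known_names = ["tas_historical", "tas_historical_2"]
new_name = unique_query_name("tas_historical", known_names)
print(new_name)  # tas_historical_3
```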

In [3]:
# Define criteria for the query we want to add to esgpull's database.
# This must include a unique 'name' for the query, which will serve as its ID.
query_name = "my_cmip6_tas_historical_demo_query" # Ensure this is unique for your esgpull DB
add_criteria = {
    "name": query_name,
    "project": "CMIP6",
    "experiment_id": "historical",
    "variable": "tas",
    "frequency": "mon",
    "data_node": "esgf-data.dkrz.de", # Optional, but good for consistency if used in search
    # To be more specific and control the files associated with this query,
    # you might add 'source_id', 'variant_label', etc., based on prior search results.
    # For example:
    # "source_id": "ACCESS-CM2",
    # "variant_label": "r1i1p1f1",
    "limit": 10 # `esgpull` uses this to limit files initially fetched for this query from ESGF
}

if api:
    print(f"Adding or updating query in esgpull database with criteria: {json.dumps(add_criteria, indent=2)}")
    try:
        # Add the query. track=False means it won't be auto-updated by a default scheduler.
        api.add(criteria=add_criteria, track=False)
        print(f"Query '{query_name}' added/updated successfully in esgpull database.")
        print(f"This query can now be referred to by its name (query_id): {query_name}")
        print("Note: After adding/updating, esgpull should have identified matching files from ESGF.")
        print("You can run `api.update(query_id=query_name)` to explicitly refresh the file list for this query from ESGF nodes.")
    except Exception as e:
        print(f"Error adding/updating query '{query_name}': {e}")
else:
    print("API not initialized. Skipping add query.")

Adding or updating query in esgpull database with criteria: {
  "name": "my_cmip6_tas_historical_demo_query",
  "project": "CMIP6",
  "experiment_id": "historical",
  "variable": "tas",
  "frequency": "mon",
  "data_node": "esgf-data.dkrz.de",
  "limit": 10
}
Error adding/updating query 'my_cmip6_tas_historical_demo_query': (sqlite3.IntegrityError) UNIQUE constraint failed: tag.sha
[SQL: INSERT INTO tag (name, description, sha) VALUES (?, ?, ?)]
[parameters: ('name:my_cmip6_tas_historical_demo_query', None, '800fc337cbb10f0a14f17696b757100d2f17e678')]
(Background on this error at: https://sqlalche.me/e/20/gkpj)


## 3. Downloading Files for the Query
Once a query is added to `esgpull` (and potentially updated to discover all relevant files from ESGF), we can use its `name` (which serves as `query_id`) to download the associated files.
The `api.download()` method will attempt to download files linked to this query that are in a 'queued' state (i.e., identified by `esgpull` as ready for download and not yet downloaded or failed too many times).
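
For larger queries, a per-status count is often easier to read than a full dump of every file. The sketch below assumes each entry is a dictionary with a `status` key, as the comments in the download cell suggest. The helper name and the sample entries are made up for illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize_statuses(file_infos: Iterable[Dict[str, Any]]) -> Counter:
    """Count entries per 'status' value; entries without one count as 'unknown'."""
    return Counter(info.get("status", "unknown") for info in file_infos)


# Made-up entries for illustration only, not real download results.
sample_results = [
    {"filename": "a.nc", "status": "done"},
    {"filename": "b.nc", "status": "done"},
    {"filename": "c.nc", "status": "error"},
    {"filename": "d.nc"},
]
summary = summarize_statuses(sample_results)
for status, count in sorted(summary.items()):
    print(f"{status}: {count}")
```

The same function can be applied to the list returned by `api.download()`, provided its entries have the shape assumed here.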

In [4]:
# The query_id is the name we assigned when adding the query.
# Ensure query_name is defined from the previous cell's execution.
query_to_download = globals().get("query_name") or None

if api and query_to_download:
    print(f"Attempting to download files for query_id: '{query_to_download}'")
    try:
        # Optional: It's often a good idea to update the query before downloading.
        # This ensures esgpull has the latest list of files from ESGF for this query.
        # print(f"Updating query '{query_to_download}' to refresh file list from ESGF nodes...")
        # updated_file_infos = api.update(query_id=query_to_download) # Returns list of all files for query
        # print(f"Query update complete. {len(updated_file_infos)} files are now associated with '{query_to_download}'.")
        # print("Files that are new and match the criteria will be in 'queued' status.")

        # Now, attempt to download the queued files for this query.
        download_results = api.download(query_id=query_to_download)
        
        if download_results:
            print(f"Download command processed. {len(download_results)} files were handled (e.g., downloaded, already exist, or failed).")
            print("Details of files processed by the download command:")
            for i, file_info in enumerate(download_results):
                print(f"\nFile {i+1}:")
                print(json.dumps(file_info, indent=2))
                # Common fields: 'filename', 'status' (e.g., 'done', 'error', 'queued', 'skipped'), 'local_path', 'size', 'checksum'
        else:
            print(f"No files were processed by the download command for query '{query_to_download}'.")
            print("This could mean:")
            print("  - No files are currently in 'queued' state for this query (e.g., all downloaded, or none matched).")
            print("  - The query criteria might not match any downloadable files on the ESGF nodes.")
            print("  - Consider running `api.update(query_id=query_name)` if you suspect the file list is stale or files are missing.")
            print("  - Check `esgpull` logs (if configured) for more detailed information if downloads are expected but not happening.")

    except Exception as e:
        print(f"Error during download for query '{query_to_download}': {e}")
elif not api:
    print("API not initialized. Skipping download.")
else: # query_to_download is None
    print("Query name ('query_name') is not defined. Ensure the 'Add Query' step was successful before running download.")

# Downloaded files will be located in the directory specified in your esgpull configuration's `data_path` (or similar setting).

Attempting to download files for query_id: 'my_cmip6_tas_historical_demo_query'
Error during download for query 'my_cmip6_tas_historical_demo_query': 'my_cmip6_tas_historical_demo_query'
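
The message above consists only of the query name in quotes. That is how Python renders `str()` of a `KeyError`, so the output is consistent with a lookup by name failing, although the notebook does not confirm the exception type. A small helper (hypothetical, independent of `esgpull`) that also prints the exception class makes such messages easier to interpret:

```python
def describe_error(exc: BaseException) -> str:
    """Format an exception as 'ClassName: message'."""
    return f"{type(exc).__name__}: {exc}"


try:
    {}["my_query"]  # deliberately look up a missing key
except KeyError as exc:
    message = describe_error(exc)
    print(message)  # KeyError: 'my_query'
```

Using `describe_error(e)` in the `except` blocks of this notebook would show the exception type alongside the message.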


## Further Actions
- **Update Query**: Use `api.update(query_id)` to refresh a query. This contacts ESGF nodes, finds all matching files according to the query's criteria, and updates `esgpull`'s database. Files new to `esgpull` will typically be set to 'queued' status.
- **Track Query**: Use `api.track(query_id)` to mark an existing query for automatic tracking by `esgpull`'s scheduler (if a scheduler is configured and running). Tracked queries are periodically updated.
- **List/Get Queries**: This notebook does not show how to list stored queries or look up a single query. If you forget a `name`/`query_id`, check `esgpull/api.py` for a method that lists the queries in the database.
- **Configuration**: Ensure your `esgpull` `config.toml` is correctly set up, especially paths for data storage (`data_path`), authentication details if required for certain nodes, and any specific search or download preferences.
- **Explore**: Check the `esgpull` documentation and the `EsgpullAPI` source code (`esgpull/api.py`) for more details on available methods, parameters, and the structure of returned data.
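
The update-then-download sequence mentioned above can be wrapped in one function. This is a sketch only: the method names and the `query_id` keyword are taken from this notebook, not verified against the library, and `FakeAPI` is a stand-in so the example runs without contacting ESGF.

```python
from typing import Any, Dict, List


def refresh_and_download(api: Any, query_id: str) -> List[Dict[str, Any]]:
    """Refresh a stored query, then download whatever is queued for it."""
    api.update(query_id=query_id)
    results = api.download(query_id=query_id)
    return list(results or [])


class FakeAPI:
    """Stand-in used only to exercise the helper; not part of esgpull."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def update(self, query_id: str) -> List[Dict[str, Any]]:
        self.calls.append(f"update:{query_id}")
        return []

    def download(self, query_id: str) -> List[Dict[str, Any]]:
        self.calls.append(f"download:{query_id}")
        return [{"filename": "example.nc", "status": "done"}]


fake = FakeAPI()
results = refresh_and_download(fake, "demo_query")
print(fake.calls)
print(results)
```

With a real `api` object from the setup cell, the call would be `refresh_and_download(api, query_name)`.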