# Caching

This notebook illustrates the use of the CliMetLab cache and highlights some cache configuration settings.

The relevant CliMetLab documentation is located at https://climetlab.readthedocs.io/en/latest/guide/caching.html

Relevant CliMetLab settings are:
- cache-directory 
- maximum-cache-disk-usage 
- maximum-cache-size

## How to run this exercise

This exercise is in the form of a [Jupyter notebook](https://jupyter.org/). It can be run in a number of free cloud-based environments (two options are given below), which require no installation. When you click on one of the links below ([`Open in Colab`](https://colab.research.google.com/github/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb) or [`Launch in Deepnote`](https://deepnote.com/launch?url=https://github.com/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb)), you will be prompted to create a free account, after which you will see the same page you see here. You can run each block of code by pressing Shift+Enter repeatedly, or by selecting the "play" icon.

Advanced users may wish to run this exercise on their own computers by first installing [Python](https://www.python.org/downloads/), [Jupyter](https://jupyter.org/install) and [CliMetLab](https://climetlab.readthedocs.io/en/latest/installing.html).

<style>
td, th {
   border: 1px solid white;
   border-collapse: collapse;
}
</style>
<table align="left">
  <tr>
    <th>Run the tutorial via free cloud platforms: </th>
    <th><a href="https://colab.research.google.com/github/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb">
        <img src = "https://colab.research.google.com/assets/colab-badge.svg" alt = "Colab"></a></th>
    <th><a href="https://deepnote.com/launch?url=https://github.com/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/03-caching-configuration.ipynb">
        <img src = "https://deepnote.com/buttons/launch-in-deepnote-small.svg" alt = "Deepnote"></a></th>
  </tr>
</table>

## Let's begin the exercise...

In [1]:
# pip install climetlab --quiet 
# --> already done

Note: you may need to restart the kernel to use updated packages.


In [1]:
import climetlab as cml
URL1 = "https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.SP.list.v04r00.csv"
URL2 = "https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.NI.list.v04r00.csv"

Using ``cml.load_source("url",...)`` stores the data in the CliMetLab cache.

In [2]:
data = cml.load_source("url", URL1)
data.to_pandas()

NotImplementedError: climetlab.readers.text.TextReader.to_pandas() on C:\Users\frei_p\AppData\Local\Temp\climetlab-FREI_P\url-e29d836ce6a5ea6e24cbb4398dd11140a280205e0248a9ad468bde15e1727667.SP.list.v04r00.csv

The next call to the same code does not download the data again.

In [4]:
data = cml.load_source("url", URL1)
data.to_pandas()



Unnamed: 0,SID,SEASON,NUMBER,BASIN,SUBBASIN,NAME,ISO_TIME,NATURE,LAT,LON,...,BOM_GUST_PER,REUNION_GUST,REUNION_GUST_PER,USA_SEAHGT,USA_SEARAD_NE,USA_SEARAD_SE,USA_SEARAD_SW,USA_SEARAD_NW,STORM_SPEED,STORM_DIR
0,,Year,,,,,,,degrees_north,degrees_east,...,second,kts,second,ft,nmile,nmile,nmile,nmile,kts,degrees
1,1897005S10135,1897,1,SP,EA,NOT_NAMED,1897-01-04 12:00:00,NR,-10.1000,135.300,...,,,,,,,,,9,246
2,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 15:00:00,NR,-10.2755,134.902,...,,,,,,,,,8,246
3,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 18:00:00,NR,-10.4406,134.523,...,,,,,,,,,8,246
4,1897005S10135,1897,1,SI,WA,NOT_NAMED,1897-01-04 21:00:00,NR,-10.5853,134.182,...,,,,,,,,,7,247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75841,2023039S15153,2023,7,SP,MM,GABRIELLE,2023-02-11 12:00:00,NR,-29.3,168.2,...,,,,,,,,,13,122
75842,2023039S15153,2023,7,SP,MM,GABRIELLE,2023-02-11 15:00:00,NR,-29.4995,168.92,...,,,,,,,,,14,106
75843,2023039S15153,2023,7,SP,MM,GABRIELLE,2023-02-11 18:00:00,NR,-29.7,169.8,...,,,,,,,,,18,111
75844,2023039S15153,2023,7,SP,MM,GABRIELLE,2023-02-11 21:00:00,NR,-30.127,170.811,...,,,,,,,,,21,119


The downloaded data is actually stored in a cache directory, managed by CliMetLab, using a small database. Data is also unzipped if needed within the cache directory.
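
As a rough illustration of this idea (a toy sketch, not CliMetLab's actual implementation), the code below keeps downloaded files in a directory and records them in a small SQLite index, so that a second request for the same URL is served from disk. The names `ToyCache` and `fake_download`, the SHA-256 file naming and the table layout are all assumptions made for this example.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


class ToyCache:
    """Toy download-once cache; an illustration only, not CliMetLab's code."""

    def __init__(self, directory, fetch):
        self.directory = Path(directory)
        self.fetch = fetch  # callable(url) -> bytes, a hypothetical downloader
        self.db = sqlite3.connect(str(self.directory / "index.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries "
            "(url TEXT PRIMARY KEY, path TEXT, accesses INTEGER)"
        )

    def load(self, url):
        row = self.db.execute(
            "SELECT path FROM entries WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            # Cache miss: fetch once and register the file in the index.
            name = "url-" + hashlib.sha256(url.encode()).hexdigest()
            path = self.directory / name
            path.write_bytes(self.fetch(url))
            self.db.execute(
                "INSERT INTO entries VALUES (?, ?, 0)", (url, str(path))
            )
        else:
            # Cache hit: reuse the file already on disk.
            path = Path(row[0])
        # Count every access, hit or miss.
        self.db.execute(
            "UPDATE entries SET accesses = accesses + 1 WHERE url = ?", (url,)
        )
        self.db.commit()
        return path


downloads = []


def fake_download(url):
    # Stand-in for a real HTTP request so that the sketch runs offline.
    downloads.append(url)
    return b"SID,SEASON\n"


tmpdir = tempfile.mkdtemp()
cache = ToyCache(tmpdir, fake_download)
first = cache.load("https://example.com/data.csv")
second = cache.load("https://example.com/data.csv")
```

Both calls return the same path, and `downloads` records a single fetch. Keeping the index in a database rather than inferring it from the files is also one reason why editing cache files by hand is a bad idea: the index and the directory would no longer agree.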

The cache can be observed and manipulated:
- Within Python using ``cml.cache``
- With the command-line interface, using ``climetlab cache`` and ``climetlab decache``
- Using the web interface GUI (in progress: summer of code project https://github.com/ecmwf-lab/climetlab-script-web)
- NOT by playing directly with the cache files (same logic as a web browser cache).

In [3]:
cml.cache

In [4]:
!climetlab cache

Cache directory:            C:\Users\frei_p\AppData\Local\Temp\climetlab-FREI_P
Cache size:                 189.5 MiB
Number of entries in cache: 24
Most recently accessed:     19 minutes ago
Least recently accessed:    3 hours ago
Youngest entry:             19 minutes ago
Oldest entry:               3 hours ago


In [5]:
!climetlab cache --all

C:\Users\frei_p\AppData\Local\Temp\climetlab-FREI_P\grib-index-c348b955561efedae60b79586291f954936a553e160139642f2fa0da3225fe1c.json
  creation_date: 2023-02-22 14:19:30.349898
  last_access: 2023-02-22 14:19:30.349898
  accesses: 1
  type: file
  size: 4
  owner: grib-index
  args: ['test.grib', 1677071098.83462, 1677071249.6379, 1052, 0]
  expires: None
  extra: None
  flags: 0
  owner_data: None
  parent: None
  replaced: None

C:\Users\frei_p\AppData\Local\Temp\climetlab-FREI_P\url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib
  creation_date: 2023-02-22 14:20:26.211432
  last_access: 2023-02-22 16:50:10.024688
  accesses: 3
  type: file
  size: 1052
  owner: url
  args: {'url': 'https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib', 'parts': None}


In [6]:
!climetlab cache --newer 1d

Entries newer than '2023-02-21 17:17:45'.
Cache directory:            C:\Users\frei_p\AppData\Local\Temp\climetlab-FREI_P
Cache size:                 189.5 MiB
Number of entries in cache: 24
Most recently accessed:     20 minutes ago
Least recently accessed:    3 hours ago
Youngest entry:             20 minutes ago
Oldest entry:               3 hours ago


In [7]:
!climetlab cache --help

usage: cache [-h] [--json] [--all] [--path] [--sort KEY] [--reverse]
             [--match STRING] [--newer DATE] [--older DATE] [--accessed]
             [--larger SIZE] [--smaller SIZE]

Cache command to inspect the CliMetLab cache. The selection arguments are the
same as for the ``climetlab decache`` deletion command. Examples: climetlab
cache --all

optional arguments:
  -h, --help      show this help message and exit
  --json          produce a JSON output
  --all
  --path          print the path of cache directory and exit
  --sort KEY      sort output according to increasing values of KEY.
  --reverse       reverse the order of the sort, from larger to smaller
  --match STRING  TODO
  --newer DATE    TODO
  --older DATE    TODO
  --accessed      use the date of last access instead of the creation date
  --larger SIZE   consider only cache entries that are larger than SIZE bytes
  --smaller SIZE  consider only cache entries that are smaller than SIZE bytes

SIZE can be expressed 

In [10]:
# Delete cached data newer than one day
# !climetlab decache --newer 1d # This is commented out to avoid running this cell by mistake
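
Before deleting anything, it helps to be clear about which entries a date filter selects. The sketch below mimics that selection on the two entries listed by `climetlab cache --all` above. The `select_newer` helper is hypothetical and only imitates the behaviour suggested by the help text (`--accessed` switches from the creation date to the date of last access); the "current time" is an assumption inferred from the cutoff printed by `--newer 1d`.

```python
from datetime import datetime, timedelta

# Two entries copied from the `climetlab cache --all` listing above
# (only the fields needed for date filtering are kept).
entries = [
    {
        "owner": "grib-index",
        "creation_date": "2023-02-22 14:19:30.349898",
        "last_access": "2023-02-22 14:19:30.349898",
        "size": 4,
    },
    {
        "owner": "url",
        "creation_date": "2023-02-22 14:20:26.211432",
        "last_access": "2023-02-22 16:50:10.024688",
        "size": 1052,
    },
]


def select_newer(entries, cutoff, accessed=False):
    """Hypothetical helper: keep entries dated strictly after `cutoff`."""
    key = "last_access" if accessed else "creation_date"
    return [
        e for e in entries
        if datetime.strptime(e[key], "%Y-%m-%d %H:%M:%S.%f") > cutoff
    ]


# Assumed time of the run; one day earlier gives the cutoff shown in the
# output above: 2023-02-21 17:17:45.
now = datetime(2023, 2, 22, 17, 17, 45)
cutoff = now - timedelta(days=1)

by_creation = select_newer(entries, cutoff)
by_access = select_newer(entries, datetime(2023, 2, 22, 15, 0, 0), accessed=True)
```

With the one-day cutoff both entries are selected, while filtering on last access after 15:00 keeps only the `url` entry. Running the same filter with `climetlab cache` first and `climetlab decache` afterwards is a cautious way to check a selection before deleting it.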

# Configuring CliMetLab cache settings

In [8]:
!climetlab settings cache-directory 
!climetlab settings maximum-cache-disk-usage 
!climetlab settings maximum-cache-size  

C:\Users\frei_p\AppData\Local\Temp\climetlab-FREI_P
90
None


# Concurrent cache use

If the cache is full, older data is automatically deleted (with a log message).
When multiple scripts use the same cache, a file may therefore be deleted because the cache is full, even while another script is still using it.
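
To make the hazard concrete, here is a toy eviction loop (not CliMetLab's implementation). It deletes the least recently accessed entries until the total size fits under the limit, without knowing whether another script is still reading them. The `evict` function, the entry layout and the choice of "least recently accessed first" are assumptions made for illustration.

```python
def evict(entries, max_size):
    """Toy eviction: drop least recently accessed entries until the total fits.

    `entries` maps a name to a (last_access, size) pair. Nothing here checks
    whether another script is still using an entry.
    """
    total = sum(size for _, size in entries.values())
    deleted = []
    # Visit entries from the oldest access to the most recent one.
    for name, (_, size) in sorted(entries.items(), key=lambda kv: kv[1][0]):
        if total <= max_size:
            break
        deleted.append(name)
        total -= size
    for name in deleted:
        del entries[name]
    return deleted


toy_entries = {
    "a.grib": (1, 40),  # oldest access, but imagine another script reads it
    "b.csv": (2, 30),
    "c.nc": (3, 50),
}
in_use_by_other_script = {"a.grib"}

deleted = evict(toy_entries, max_size=90)
conflict = in_use_by_other_script & set(deleted)
```

Here the total of 120 exceeds the limit of 90, so `a.grib` is removed even though the second script still depends on it. The eviction logic has no view of what other processes are doing.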
 




In [9]:
import climetlab as cml
cml.settings.set("maximum-cache-size", "50M")

CliMetLab cache: trying to free 139.5 MiB
Deleting entry {
    "path": "C:\\Users\\frei_p\\AppData\\Local\\Temp\\climetlab-FREI_P\\grib-index-c348b955561efedae60b79586291f954936a553e160139642f2fa0da3225fe1c.json",
    "owner": "grib-index",
    "args": [
        "test.grib",
        1677071098.83462,
        1677071249.6379,
        1052,
        0
    ],
    "creation_date": "2023-02-22 14:19:30.349898",
    "flags": 0,
    "owner_data": null,
    "last_access": "2023-02-22 14:19:30.349898",
    "type": "file",
    "parent": null,
    "replaced": null,
    "extra": null,
    "expires": null,
    "accesses": 1,
    "size": 4
}
CliMetLab cache: deleting C:\Users\frei_p\AppData\Local\Temp\climetlab-FREI_P\grib-index-c348b955561efedae60b79586291f954936a553e160139642f2fa0da3225fe1c.json (4)
CliMetLab cache: grib-index ["test.grib", 1677071098.83462, 1677071249.6379, 1052, 0]
Deleting entry {
    "path": "C:\\Users\\frei_p\\AppData\\Local\\Temp\\climetlab-FREI_P\\url-34197a872c74dcb6babfdfb

[WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\frei_p\\AppData\\Local\\Temp\\climetlab-FREI_P\\cds-retriever-a057f8b7f2ade3e68be46ddf132c68145e561e13afc27b9cbb67b7f4d09dd5ea.cache'
[WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\frei_p\\AppData\\Local\\Temp\\climetlab-FREI_P\\url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib'
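
The `trying to free 139.5 MiB` message in the log above can be checked by hand. Assuming `"50M"` is read as 50 MiB (an assumption inferred from the numbers, not taken from the documentation), lowering the limit on a 189.5 MiB cache means that about 139.5 MiB has to be removed. The `parse_size` helper below is hypothetical.

```python
MIB = 1024 ** 2


def parse_size(text):
    """Hypothetical parser for sizes such as "50M" (binary units assumed)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


cache_size = int(189.5 * MIB)  # "Cache size" reported by `climetlab cache`
limit = parse_size("50M")      # value passed to maximum-cache-size
to_free = max(0, cache_size - limit)

to_free_mib = round(to_free / MIB, 1)
```

The `[WinError 32]` lines show the other side of the same operation: some files could not be deleted because another process had them open.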


# Take home message

- End users do not need to manage the data. Data is downloaded on demand, with minimal duplication.
- The CliMetLab cache is a **cache**: it is managed by CliMetLab and automatically cleaned up.
- Multiple users should not share the same cache directory.

Let us reset the default climetlab cache configuration, just in case.

In [10]:
cml.settings.reset()