# Pipeline Cache

Now that we have seen the pipeline, let us see how we can use the remote
cache implemented by IQB to avoid re-running queries each time.

The source code for the remote cache lives inside `./library/src/iqb/ghcache`.

As a starting point, let's instantiate the pipeline with:

1. a directory in which to cache the query results;

2. a billing project on BigQuery;

3. the optional remote cache instance.

We use the same dataset naming conventions described in the queries chapter, so
the cache paths line up with the query templates and granularities.

The remote cache assumes the existence of a manifest file describing
what remote files are available.

In [25]:
from iqb import IQBPipeline, IQBGitHubRemoteCache

pipeline = IQBPipeline(
    project="measurement-lab",
    data_dir="02-pipeline-cache.dir",
    remote_cache=IQBGitHubRemoteCache(
        data_dir="02-pipeline-cache.dir",
    ),
)

Before moving forward, let us inspect the content of the manifest:

In [26]:
import json

with open("02-pipeline-cache.dir/state/ghremote/manifest.json") as fp:
    print(json.dumps(json.load(fp), indent=4))

{
    "files": {
        "cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/data.parquet": {
            "sha256": "82226cc007001bd5545d5b1f036eefe1707c43608581cc5c06e5f055867be376",
            "url": "https://github.com/m-lab/iqb/releases/download/v0.2.0/82226cc00700__cache__v1__20251001T000000Z__20251101T000000Z__downloads_by_country__data.parquet"
        },
        "cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/stats.json": {
            "sha256": "975ce9997ec33aad693b4367289b130a0ff0258f94d8c904bd8942debc190c3f",
            "url": "https://github.com/m-lab/iqb/releases/download/v0.2.0/975ce9997ec3__cache__v1__20251001T000000Z__20251101T000000Z__downloads_by_country__stats.json"
        },
        "cache/v1/20251001T000000Z/20251101T000000Z/uploads_by_country/data.parquet": {
            "sha256": "c1f384988a07859d42d34d332806de6d8ce576a26d9d42fce6b4c90628b8be90",
            "url": "https://github.com/m-lab/iqb/releases/download/v0.2.0/c1f384988a0

As you can see, the manifest maps each path that must exist in the cache to
its SHA256 digest and its remote location (currently a GitHub release asset).
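To make the role of the digest concrete, here is a minimal sketch of how a cached file could be checked against its manifest entry. It is not the IQB implementation: `sha256_of_file`, `matches_manifest`, and `asset_name` are hypothetical helpers, the demo path is made up, and the asset naming rule (first 12 hex digits of the digest, then the path with `/` replaced by `__`) is only inferred from the URLs listed above.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(manifest: dict, data_dir: Path, rel_path: str) -> bool:
    """Hypothetical check: is rel_path listed and does the local copy match?"""
    entry = manifest["files"].get(rel_path)
    local = data_dir / rel_path
    if entry is None or not local.is_file():
        return False
    return sha256_of_file(local) == entry["sha256"]


def asset_name(sha256: str, rel_path: str) -> str:
    """Rebuild the asset name pattern inferred from the manifest URLs."""
    return f"{sha256[:12]}__{rel_path.replace('/', '__')}"


# Demo with a throwaway directory and a made-up cache path.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    rel_path = "cache/v1/example/data.bin"
    target = data_dir / rel_path
    target.parent.mkdir(parents=True)
    target.write_bytes(b"hello")
    manifest = {
        "files": {rel_path: {"sha256": hashlib.sha256(b"hello").hexdigest()}}
    }
    ok_before = matches_manifest(manifest, data_dir, rel_path)
    target.write_bytes(b"tampered")  # simulate a corrupted download
    ok_after = matches_manifest(manifest, data_dir, rel_path)

print(ok_before, ok_after)  # True False
```

Comparing digests after a download is what lets a cache reject a truncated or modified file instead of silently serving it.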

Now that we understand the content of the manifest, let us request an
entry that is listed in it, so we can exercise the remote cache.

As before, we must create a *lazy* entry first:

In [27]:
from iqb import IQBDatasetGranularity, IQBDatasetMLabTable, iqb_dataset_name_for_mlab

entry = pipeline.get_cache_entry(
    dataset_name=iqb_dataset_name_for_mlab(
        granularity=IQBDatasetGranularity.COUNTRY,
        table=IQBDatasetMLabTable.DOWNLOAD,
    ),
    enable_bigquery=False,  # disable so we know we use the remote cache
    start_date="2025-10-01",  # start date is *inclusive*
    end_date="2025-11-01",  # end date is *exclusive*
)

Now that we have an entry, we must sync it to fetch the
data we need from the remote cache.

Here we use the most defensive code pattern: we hold the entry's lock
while checking whether the entry exists and syncing it, which protects
against concurrent writes.

In [28]:
with entry.lock():
    if not entry.exists():
        entry.sync()

DEBUG:filelock:Attempting to acquire lock 128876858280096 on 02-pipeline-cache.dir/cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/.lock
DEBUG:filelock:Lock 128876858280096 acquired on 02-pipeline-cache.dir/cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/.lock
DEBUG:filelock:Attempting to release lock 128876858280096 on 02-pipeline-cache.dir/cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/.lock
DEBUG:filelock:Lock 128876858280096 released on 02-pipeline-cache.dir/cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/.lock


Once this is completed, we have synced data inside the cache.
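The reason for checking existence while holding the lock can be illustrated with a small, self-contained sketch. It uses a `threading.Lock` and an in-memory dictionary as stand-ins for the entry's file lock and the on-disk cache; `fake_sync` and `ensure_synced` are hypothetical names, not part of the IQB API.

```python
import threading

lock = threading.Lock()  # stand-in for the entry's file lock
cache = {}               # stand-in for the files on disk
sync_calls = 0           # counts how many times we "download"


def fake_sync() -> None:
    """Pretend to fetch the entry from the remote cache."""
    global sync_calls
    sync_calls += 1
    cache["data.parquet"] = b"placeholder bytes"


def ensure_synced() -> None:
    """Check and sync under the same lock, so only one caller syncs."""
    with lock:
        if "data.parquet" not in cache:
            fake_sync()


threads = [threading.Thread(target=ensure_synced) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(sync_calls)  # 1
```

Because the existence check happens inside the critical section, callers that arrive later see the data already in place and skip the sync.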

Let us print information about the entry and the files it contains:

In [29]:
print(entry.dir_path())
print(entry.data_parquet_file_path())
print(entry.stats_json_file_path())

02-pipeline-cache.dir/cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country
02-pipeline-cache.dir/cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/data.parquet
02-pipeline-cache.dir/cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/stats.json


The cache layout matches the local pipeline cache from the previous notebook:
`cache/v1/<start>/<end>/<dataset_name>/...`. Here the dataset name encodes
the granularity (in this example: country).
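As a sketch of how this layout relates to the dates passed earlier, the snippet below rebuilds the directory printed above from the start date, end date, and dataset name. The helpers `cache_timestamp` and `cache_entry_dir` are hypothetical, and the assumption that dates are rendered as midnight UTC is inferred from the paths shown in this notebook.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def cache_timestamp(date: str) -> str:
    """Format a YYYY-MM-DD date as midnight UTC, e.g. 20251001T000000Z."""
    parsed = datetime.strptime(date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y%m%dT%H%M%SZ")


def cache_entry_dir(start_date: str, end_date: str, dataset_name: str) -> PurePosixPath:
    """Build cache/v1/<start>/<end>/<dataset_name> relative to the data dir."""
    return PurePosixPath(
        "cache",
        "v1",
        cache_timestamp(start_date),
        cache_timestamp(end_date),
        dataset_name,
    )


path = cache_entry_dir("2025-10-01", "2025-11-01", "downloads_by_country")
print(path)  # cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country
```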

There is also a stats file. Let's inspect it first:

In [30]:
with open(entry.stats_json_file_path()) as fp:
    print(json.dumps(json.load(fp), indent=4))

{
    "query_start_time": "2025-11-27T10:20:54.036274Z",
    "query_duration_seconds": 1.311,
    "template_hash": "4749dd257891857b70cdab0ea1013e7fbc6e4a06f4e00543a4894c1ba09a5e52",
    "total_bytes_processed": 0,
    "total_bytes_billed": 0
}


As you can see, the stats file records the query start time and duration, plus
the template hash and the byte counts. For more detail on parquet reading and
column selection, see the pipeline notebook; here we just confirm the data shape.

In [31]:
from iqb.pipeline import iqb_parquet_read

table = iqb_parquet_read(
    entry.data_parquet_file_path(),
    country_code="US",
)

The return value is a `pandas.DataFrame`.

Let's see what it contains:

In [32]:
table.columns.values

array(['country_code', 'sample_count', 'download_p1', 'download_p5',
       'download_p10', 'download_p25', 'download_p50', 'download_p75',
       'download_p90', 'download_p95', 'download_p99', 'latency_p1',
       'latency_p5', 'latency_p10', 'latency_p25', 'latency_p50',
       'latency_p75', 'latency_p90', 'latency_p95', 'latency_p99',
       'loss_p1', 'loss_p5', 'loss_p10', 'loss_p25', 'loss_p50',
       'loss_p75', 'loss_p90', 'loss_p95', 'loss_p99'], dtype=object)

Let's conclude our overview by selecting a few columns for illustration:

In [33]:
table[
    [
        "country_code",
        "sample_count",
        "download_p50",
        "latency_p50",
        "loss_p50",
    ]
]

|   | country_code | sample_count | download_p50 | latency_p50 | loss_p50 |
|---|--------------|--------------|--------------|-------------|----------|
| 0 | US           | 25430008     | 121.076161   | 14.967      | 0.000314 |
