# Research

I will now show you how a researcher can interact with
the local or remote cache without using BigQuery.

Here we use the higher-level `IQBCache` API, which intentionally has no
BigQuery toggle and focuses on reading cached data.

We are pointing to the same cache directory that the pipeline writes to,
but the cache component itself is read-only.

The source code for `IQBCache` lives inside `./library/src/iqb/cache`.

As a starting point, let's instantiate the cache with:

1. a directory in which to cache the query results

2. an optional remote cache instance.

We use the same dataset naming conventions described in the queries chapter, so
the cache paths line up with the query templates and granularities.

The remote cache assumes the existence of a manifest file describing
what remote files are available.

In [1]:
from iqb import IQBCache, IQBGitHubRemoteCache

cache = IQBCache(
    data_dir="03-research.dir",
    remote_cache=IQBGitHubRemoteCache(
        data_dir="03-research.dir",
    ),
)

As before, we are using a manifest. Let's print it:

In [2]:
import json

with open("03-research.dir/state/ghremote/manifest.json") as fp:
    print(json.dumps(json.load(fp), indent=4))

{
    "files": {
        "cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/data.parquet": {
            "sha256": "82226cc007001bd5545d5b1f036eefe1707c43608581cc5c06e5f055867be376",
            "url": "https://github.com/m-lab/iqb/releases/download/v0.2.0/82226cc00700__cache__v1__20251001T000000Z__20251101T000000Z__downloads_by_country__data.parquet"
        },
        "cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/stats.json": {
            "sha256": "975ce9997ec33aad693b4367289b130a0ff0258f94d8c904bd8942debc190c3f",
            "url": "https://github.com/m-lab/iqb/releases/download/v0.2.0/975ce9997ec3__cache__v1__20251001T000000Z__20251101T000000Z__downloads_by_country__stats.json"
        },
        "cache/v1/20251001T000000Z/20251101T000000Z/uploads_by_country/data.parquet": {
            "sha256": "c1f384988a07859d42d34d332806de6d8ce576a26d9d42fce6b4c90628b8be90",
            "url": "https://github.com/m-lab/iqb/releases/download/v0.2.0/c1f384988a0

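Judging from the entries printed above, each release asset name appears to be built from the first 12 hex characters of the file's SHA-256 digest, followed by the cache path with `/` replaced by `__`. The sketch below checks that observation against one entry copied from the output. The `expected_asset_name` helper is hypothetical and not part of `iqb`, and the naming rule is inferred from the manifest rather than from the library source.

```python
# One entry copied verbatim from the manifest output above.
manifest = {
    "files": {
        "cache/v1/20251001T000000Z/20251101T000000Z/downloads_by_country/data.parquet": {
            "sha256": "82226cc007001bd5545d5b1f036eefe1707c43608581cc5c06e5f055867be376",
            "url": "https://github.com/m-lab/iqb/releases/download/v0.2.0/82226cc00700__cache__v1__20251001T000000Z__20251101T000000Z__downloads_by_country__data.parquet",
        },
    }
}


def expected_asset_name(cache_path: str, sha256: str) -> str:
    """Hypothetical helper: rebuild the asset name from path and digest."""
    # Assumption: 12-character digest prefix, then the path with "/" -> "__".
    return f"{sha256[:12]}__{cache_path.replace('/', '__')}"


# Compare the last URL component with the name rebuilt from the entry.
matches = {}
for cache_path, info in manifest["files"].items():
    asset_name = info["url"].rsplit("/", 1)[-1]
    matches[cache_path] = asset_name == expected_asset_name(cache_path, info["sha256"])
    print(cache_path, "->", matches[cache_path])
```

If the rule holds, the digest prefix in the asset name gives a quick visual check that a URL and its `sha256` field belong together.
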
Now let us create an entry for data that exists in the manifest, so we can exercise the remote cache.

As before, we must create a *lazy* entry first. We reuse the same
time window as the previous notebooks to keep the example consistent.

The main difference from before is that an entry conceptually bundles
multiple parquet files. However, just as before, the lazy design
lets the researcher download only what they actually need.

In [3]:
from iqb import IQBDatasetGranularity

entry = cache.get_cache_entry(
    granularity=IQBDatasetGranularity.COUNTRY,
    start_date="2025-10-01",  # start date is *inclusive*
    end_date="2025-11-01",  # end date is *exclusive*
)
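
The time window above seems to correspond to the `20251001T000000Z/20251101T000000Z` path components seen in the manifest. The sketch below illustrates that mapping, together with the half-open `[start, end)` semantics noted in the comments. Both `to_path_component` and `in_window` are hypothetical helpers, not part of `iqb`, and the path layout is inferred from the manifest output.

```python
from datetime import datetime, timezone


def parse_day(date: str) -> datetime:
    """Parse YYYY-MM-DD as midnight UTC."""
    return datetime.strptime(date, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def to_path_component(date: str) -> str:
    """Hypothetical helper: format a date like the manifest path components."""
    return parse_day(date).strftime("%Y%m%dT%H%M%SZ")


def in_window(moment: datetime, start_date: str, end_date: str) -> bool:
    """Half-open interval: start is inclusive, end is exclusive."""
    return parse_day(start_date) <= moment < parse_day(end_date)


start_date, end_date = "2025-10-01", "2025-11-01"

# Assumption: layout copied from the manifest keys, not from the library.
relative_path = (
    f"cache/v1/{to_path_component(start_date)}/{to_path_component(end_date)}"
    "/downloads_by_country/data.parquet"
)
print(relative_path)

first_instant = datetime(2025, 10, 1, tzinfo=timezone.utc)
end_instant = datetime(2025, 11, 1, tzinfo=timezone.utc)
print(in_window(first_instant, start_date, end_date))  # start is included
print(in_window(end_instant, start_date, end_date))  # end is excluded
```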

OK, now that we have an entry, we can select specific
subdata (e.g., for `mlab`) and sync it.

The notebook example (in `./analysis`) shows how you can
obtain data to compute the IQB score.

Here, instead, my intent is to show you how the low-level
caching mechanism ties in with the high-level read-only view,
so I am going to sync just a single parquet file.
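
To make the idea of syncing a single file concrete, here is a minimal sketch of a "sync on read" step: reuse the local copy when its SHA-256 matches the manifest, otherwise fetch it and verify it before writing. This is not the `iqb` implementation. `ensure_local` and `fake_fetch` are hypothetical names, and the download is replaced by an in-memory stub so the example runs offline.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fp:
        for chunk in iter(lambda: fp.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_local(path: Path, expected_sha256: str, fetch: Callable[[], bytes]) -> bool:
    """Return True if a download happened, False if the local copy was reused."""
    if path.exists() and sha256_of(path) == expected_sha256:
        return False
    payload = fetch()
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError(f"digest mismatch for {path}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return True


# Stand-in for a remote file and its manifest digest (made-up bytes).
payload = b"pretend these are parquet bytes"
expected = hashlib.sha256(payload).hexdigest()
fetch_calls = []


def fake_fetch() -> bytes:
    """Hypothetical stub that records each simulated download."""
    fetch_calls.append(1)
    return payload


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cache" / "v1" / "data.parquet"
    first = ensure_local(target, expected, fake_fetch)   # downloads
    second = ensure_local(target, expected, fake_fetch)  # reuses local copy

print(first, second, len(fetch_calls))
```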

As in the previous example, we are filtering for the US, and
the same filtering knobs are available (it does not make
sense here to filter by city, but we *could* do it). This
call also triggers a sync under the hood, so this is the same
pattern as before, wrapped in a higher-level interface:

In [4]:
table = entry.mlab.read_download_data_frame(country_code="US")

This is the same `pd.DataFrame` we have seen in the previous example:

In [5]:
table.columns.values

array(['country_code', 'sample_count', 'download_p1', 'download_p5',
       'download_p10', 'download_p25', 'download_p50', 'download_p75',
       'download_p90', 'download_p95', 'download_p99', 'latency_p1',
       'latency_p5', 'latency_p10', 'latency_p25', 'latency_p50',
       'latency_p75', 'latency_p90', 'latency_p95', 'latency_p99',
       'loss_p1', 'loss_p5', 'loss_p10', 'loss_p25', 'loss_p50',
       'loss_p75', 'loss_p90', 'loss_p95', 'loss_p99'], dtype=object)
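
The column names follow a regular `<metric>_p<percentile>` scheme, which makes it easy to select groups of columns programmatically. The sketch below rebuilds the list above from that scheme and picks out the medians. The names `METRICS`, `PERCENTILES`, and `percentile_columns` are illustrative and not part of `iqb`.

```python
METRICS = ["download", "latency", "loss"]
PERCENTILES = [1, 5, 10, 25, 50, 75, 90, 95, 99]


def percentile_columns(metric: str) -> list[str]:
    """Illustrative helper: all percentile columns for one metric."""
    return [f"{metric}_p{p}" for p in PERCENTILES]


# Two identifying columns, then one percentile group per metric.
all_columns = ["country_code", "sample_count"]
for metric in METRICS:
    all_columns.extend(percentile_columns(metric))

# Select just the medians across all metrics.
median_columns = [name for name in all_columns if name.endswith("_p50")]

print(len(all_columns))
print(median_columns)
```

With the data frame read above, an expression such as `table[median_columns]` should then select only the three median columns.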