🎉 Add a new charts-based data API #2629

Merged
merged 3 commits on May 17, 2024
5 changes: 5 additions & 0 deletions lib/catalog/Makefile
@@ -10,6 +10,11 @@ SRC = owid tests
# watch:
# poetry run watchmedo shell-command -c 'clear; make unittest' --recursive --drop .

.venv: poetry.toml pyproject.toml poetry.lock
	@echo '==> Installing packages'
	poetry install
	touch .venv

check-typing: .venv
	@echo '==> Checking types'
	poetry run pyright $(SRC)
187 changes: 36 additions & 151 deletions lib/catalog/README.md
@@ -26,195 +26,80 @@ We would love feedback on how we can make this library and overall data catalog

## Quickstart

Install with `pip install owid-catalog`. Then you can get data in two different ways.

### Charts catalog

This API attempts to give you exactly the data you see in a chart on our site.

```python
from owid.catalog import charts

# get the data for one chart by URL
df = charts.get_data('https://ourworldindata.org/grapher/life-expectancy')
```

Notice that the last part of the URL is the chart's slug (its identifier), in this case `life-expectancy`. Using the slug alone also works.

```python
df = charts.get_data('life-expectancy')
```

To see what charts are available, you can list them all.

```python
>>> slugs = charts.list_charts()
>>> slugs[:5]
['above-ground-biomass-in-forest-per-hectare',
 'above-or-below-extreme-poverty-line-world-bank',
 'abs-change-energy-consumption',
 'absolute-change-co2',
 'absolute-gains-in-mean-female-height']
```
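
The frame you get back also carries the chart's metadata in the DataFrame's `attrs` attribute. A minimal sketch of inspecting it; the exact keys depend on the chart, so this just prints whatever is attached:

```python
from owid.catalog import charts

df = charts.get_data('life-expectancy')

# chart metadata travels with the frame in pandas' standard `attrs` dict
for key, value in df.attrs.items():
    print(key, '->', value)
```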

### Data science API

We also curate much more data than is available on our site. To access it in an efficient binary (Feather) format, use our data science API.

This API is designed for use in Jupyter notebooks.

```python
from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World in Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

# load data from a channel other than the default `garden` channel
lung_cancer_tables = catalog.find('lung_cancer_deaths_per_100000_men', channels=['open_numbers'])
df = lung_cancer_tables.iloc[0].load()
```

There may be multiple versions of the same dataset in a catalog, each with a unique path. To easily load the same dataset again, you should record its path and load it this way:

```python
from owid import catalog

path = 'garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate'

rc = catalog.RemoteCatalog()
df = rc[path]
```
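
If you don't know a dataset's path yet, start from `catalog.find()`: the matches come back as a plain pandas DataFrame, so you can inspect and filter them before loading. A small sketch; the exact columns depend on the catalog index, so we just print them:

```python
from owid import catalog

matches = catalog.find('gbd_mental_health_prevalence_rate')

# the match list is an ordinary pandas DataFrame, so inspect it as usual
print(matches.columns.tolist())
print(len(matches))

# load the first match as a data frame
df = matches.iloc[0].load()
```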

## Development

You need Python 3.8+, `poetry` and `make` installed. Clone the repo, then you can simply run:

```
# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch
```
## Data types

### Catalog

A catalog is an arbitrarily deep folder structure containing datasets. It can be local on disk or remote.

#### Load the remote catalog

```python
# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()

# get a list of matching tables in different datasets
matches = cat.find('population')

# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')

# load channels other than `garden`
cat = RemoteCatalog(channels=('garden', 'meadow', 'open_numbers'))
```

### Datasets

A dataset is a folder of tables containing metadata about the overall collection.

- Metadata about the dataset lives in `index.json`
- All tables in the folder must share a common format (CSV or Feather)

#### Create a new dataset

```python
# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')
```

```python
# choose CSV instead of Feather for files
ds = Dataset.create('/tmp/my_data', format='csv')
```

#### Add a table to a dataset

```python
# serialize a table using the table's name and the dataset's default format (Feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(table)
```

#### Remove a table from a dataset

```python
ds.remove('table_name')
```

#### Access a table

```python
# load a table including metadata into memory
t = ds['my_table']
```

#### List tables

```python
# the length is the number of tables discovered on disk
assert len(ds) > 0
```

```python
# iterate over the tables discovered on disk
for table in ds:
    do_something(table)
```

#### Add metadata

```python
# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()
```

#### Copy a dataset

```python
# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')

# copying a dataset is identical to copying its folder, so this works too
shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')
```
### Tables

Tables are essentially pandas DataFrames, but with metadata. All operations on them occur in-memory, except for loading from and saving to disk. On disk, they are represented by a tabular file (Feather or CSV) and a JSON metadata file.

Columns of a `Table` carry a `VariableMeta` attribute, including their type, description, and unit. Be careful when manipulating them, since not all operations are currently supported. Supported: adding a column and renaming columns. Not supported: direct assignment to `t.columns = ...` or to index names `t.columns.index = ...`.

#### Make a new table

```python
# same API as DataFrames
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
}).set_index('country')
```
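
Continuing with the table `t` above, a short sketch of the supported column operations; exactly how metadata propagates through these calls is version-dependent, so treat it as illustrative rather than a guarantee:

```python
# adding a column works just as it does in pandas
t['population'] = [26, 10, 9]

# renaming columns is supported; the intent is that metadata stays with the column
t = t.rename(columns={'gdp': 'gdp_2011'})
```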
#### Add metadata about the whole table

```python
t.title = 'Very important data'
```

#### Add metadata about a field

```python
t.gdp.description = 'GDP measured in 2011 international $'
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
```

#### Add metadata about all fields at once

```python
# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='https://creativecommons.org/licenses/by-nc/4.0/')
]
```

#### Save a table to disk

```python
# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')

# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')
```

#### Load a table from disk

These work like normal pandas DataFrames, but if there is also a `my_table.meta.json` file, then metadata will also get read. Otherwise it will be assumed that the data has no metadata:

```python
t = Table.read_feather('/tmp/my_table.feather')

t = Table.read_csv('/tmp/my_table.csv')
```

## Changelog

- `dev`
  - Add experimental chart data API in `owid.catalog.charts`
- `v0.3.8`
  - Switch from isort & black & flake8 to ruff
- `v0.3.8`
79 changes: 79 additions & 0 deletions lib/catalog/owid/catalog/charts.py
@@ -0,0 +1,79 @@
#
#  owid.catalog.charts
#
#
#  Access to data in OWID charts.
#

from dataclasses import dataclass
from typing import List, Optional

import pandas as pd

from .internal import (
    ChartNotFoundError,  # noqa
    LicenseError,  # noqa
    _fetch_bundle,
    _GrapherBundle,
    _list_charts,
)


@dataclass
class Chart:
    """
    A chart published on Our World in Data, for example:

    https://ourworldindata.org/grapher/life-expectancy
    """

    slug: str

    _bundle: Optional[_GrapherBundle] = None

    @property
    def bundle(self) -> _GrapherBundle:
        # LARS: give a nice error if the chart does not exist
        if self._bundle is None:
            self._bundle = _fetch_bundle(self.slug)

        return self._bundle

    @property
    def config(self) -> dict:
        return self.bundle.config  # type: ignore

    def get_data(self) -> pd.DataFrame:
        return self.bundle.to_frame()

    def __lt__(self, other):
        return self.slug < other.slug

    def __eq__(self, value: object) -> bool:
        return isinstance(value, Chart) and value.slug == self.slug


def list_charts() -> List[str]:
    """
    List all available charts published on Our World in Data, representing each via
    a short slug that you can use with `get_data()`.
    """
    return sorted(_list_charts())


def get_data(slug_or_url: str) -> pd.DataFrame:
    """
    Fetch the data for a chart by its slug or by the URL of the chart.

    Additional metadata about the chart is available in the DataFrame's `attrs` attribute.
    """
    if slug_or_url.startswith("https://ourworldindata.org/grapher/"):
        slug = slug_or_url.split("/")[-1]

    elif slug_or_url.startswith("https://"):
        raise ValueError("URL must be a Grapher URL, e.g. https://ourworldindata.org/grapher/life-expectancy")

    else:
        slug = slug_or_url

    return Chart(slug).get_data()
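
A quick usage sketch of this module; it needs network access to ourworldindata.org, so treat it as illustrative:

```python
from owid.catalog import charts

# slugs come back sorted, so the head of the list is alphabetically early charts
print(charts.list_charts()[:3])

# fetching by slug or by full Grapher URL both work
df = charts.get_data('life-expectancy')
print(df.head())
print(df.attrs)  # chart metadata attached by get_data()
```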