This repository has been archived by the owner on Nov 1, 2023. It is now read-only.


larsyencken committed Nov 1, 2023
1 parent 659b82c commit 28e466b
Showing 1 changed file with 1 addition and 275 deletions.
276 changes: 1 addition & 275 deletions
@@ -1,275 +1 @@

# owid-catalog

_A Pythonic API for working with OWID's data catalog._

Status: experimental, APIs likely to change

## Overview

Our World In Data is building a new data catalog, with the goal of making our datasets reproducible and transparent to the general public. That project is `etl`, which going forward will contain the recipes for all the datasets we republish.

This library allows you to query our data catalog programmatically, and get back data in the form of Pandas data frames, perfect for data pipelines or Jupyter notebook explorations.

graph TB
etl -->|reads| walden[upstream datasets]
etl -->|generates| s3[data catalog]
catalog[owid-catalog-py] -->|queries| s3

We would love feedback on how we can make this library and the overall data catalog better. Feel free to send us an email, or start a discussion on GitHub.

## Quickstart

Install with `pip install owid-catalog`. Then you can begin exploring the experimental data catalog:

from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World In Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

# load data from a channel other than the default `garden`
lung_cancer_tables = catalog.find('lung_cancer_deaths_per_100000_men', channels=['open_numbers'])
df = lung_cancer_tables.iloc[0].load()

## Development

You need Python 3.8+, `poetry` and `make` installed. Clone the repo, then you can simply run:

# run all unit tests and CI checks
make test
# watch for changes, then run all checks
make watch

## Data types

### Catalog

A catalog is an arbitrarily deep folder structure containing datasets inside. It can be local on disk, or remote.
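
To make the folder-structure idea concrete, here is a minimal sketch that uses only the Python standard library, not the `owid-catalog` API. It assumes a dataset folder can be recognised by the `index.json` file described in the Datasets section; the helper name `find_dataset_dirs` and the example paths are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def find_dataset_dirs(catalog_root: Path) -> List[Path]:
    """Return every folder under catalog_root that contains an index.json.

    Hypothetical helper: assumes index.json marks a dataset folder.
    """
    return sorted(p.parent for p in catalog_root.rglob("index.json"))


# Build a tiny throwaway catalog with two datasets at different depths.
root = Path(tempfile.mkdtemp())
for rel in ("garden/owid/latest/covid", "meadow/gapminder/population"):
    ds_dir = root / rel
    ds_dir.mkdir(parents=True)
    (ds_dir / "index.json").write_text(json.dumps({"title": rel}))

# Report dataset folders relative to the catalog root.
found = [p.relative_to(root).as_posix() for p in find_dataset_dirs(root)]
print(found)
```

Because discovery is a recursive walk, datasets can sit at any depth; a remote catalog would need a prebuilt index instead, since a directory tree cannot be walked over HTTPS.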

#### Load the remote catalog

from owid.catalog import RemoteCatalog

# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()

# get a list of matching tables in different datasets
matches = cat.find('population')

# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')

# load channels in addition to the default `garden`
cat = RemoteCatalog(channels=('garden', 'meadow', 'open_numbers'))

### Datasets

A dataset is a folder of tables containing metadata about the overall collection.

- Metadata about the dataset lives in `index.json`
- All tables in the folder must share a common format (CSV or Feather)
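
The two rules above can be illustrated with a small standard-library sketch of the on-disk layout. This is not the library's own validation code: the helper `table_formats` is hypothetical, and it assumes table files are identified by a `.csv` or `.feather` extension.

```python
import json
import tempfile
from pathlib import Path

# Assumption: table files are recognised by these extensions.
TABLE_SUFFIXES = {".csv", ".feather"}


def table_formats(dataset_dir: Path) -> set:
    """Collect the file formats of the table files in a dataset folder."""
    return {p.suffix for p in dataset_dir.iterdir() if p.suffix in TABLE_SUFFIXES}


# Lay out a dataset folder by hand: index.json plus two CSV tables.
ds_dir = Path(tempfile.mkdtemp()) / "my_data"
ds_dir.mkdir()
(ds_dir / "index.json").write_text(json.dumps({"title": "Very Important Dataset"}))
(ds_dir / "gdp.csv").write_text("country,gdp\nAU,1\n")
(ds_dir / "population.csv").write_text("country,population\nAU,2\n")

# A dataset satisfies the common-format rule when exactly one format is present.
formats = table_formats(ds_dir)
shares_one_format = len(formats) == 1
print(formats, shares_one_format)
```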

#### Create a new dataset

from owid.catalog import Dataset

# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')

# choose CSV instead of feather for files
ds = Dataset.create('/tmp/my_data', format='csv')

#### Add a table to a dataset

# serialize a table using the table's name and the dataset's default format (feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(t)

#### Remove a table from a dataset


#### Access a table

# load a table including metadata into memory
t = ds['my_table']

#### List tables

# the length is the number of tables discovered on disk
assert len(ds) > 0

# iterate over the tables discovered on disk
for table in ds:
    print(table.head())

#### Add metadata

# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()

#### Copy a dataset

# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')

# copying a dataset is identical to copying its folder, so this works too
import shutil

shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')

### Tables

Tables are essentially pandas DataFrames with metadata attached. All operations on them occur in memory, except for loading from and saving to disk. On disk, they are represented by a tabular file (Feather or CSV) and a JSON metadata file.

Each column of a `Table` carries a `VariableMeta` attribute, which includes its type, description, and unit. Be careful when manipulating columns, as not all operations are currently supported. Supported: adding a column, renaming columns. Not supported: direct assignment to `t.columns = ...` or to index names `t.columns.index = ...`.
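
One plausible way to see why only some column operations are supported is a toy model in which metadata is keyed by column name. The sketch below is an illustration under that assumption, not how `Table` is implemented; `TinyTable` and `ColumnMeta` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for per-column metadata."""
    description: str = ""
    unit: str = ""


@dataclass
class TinyTable:
    """Toy table: column values plus per-column metadata, keyed by name."""
    data: Dict[str, List[float]] = field(default_factory=dict)
    meta: Dict[str, ColumnMeta] = field(default_factory=dict)

    def add_column(self, name: str, values: List[float], meta: ColumnMeta) -> None:
        self.data[name] = values
        self.meta[name] = meta

    def rename(self, old: str, new: str) -> None:
        # Move values and metadata together so they stay in sync.
        self.data[new] = self.data.pop(old)
        self.meta[new] = self.meta.pop(old)


t = TinyTable()
t.add_column("gdp", [1, 2, 3], ColumnMeta(description="GDP measured in 2011 international $"))
t.rename("gdp", "gdp_ppp")
print(t.meta["gdp_ppp"].description)
```

In this model, adding and renaming go through methods that update both mappings, whereas overwriting the column names directly would leave the metadata keyed by names that no longer exist.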

#### Make a new table

from owid.catalog import Table

# same API as DataFrames
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
})

#### Add metadata about the whole table

t.title = 'Very important data'

#### Add metadata about a field

from owid.catalog.meta import Source

t.gdp.description = 'GDP measured in 2011 international $'
t.sources = [
    Source(title='World Bank', url='')
]

#### Add metadata about all fields at once

from owid.catalog.meta import License, Source

# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='')
]

#### Save a table to disk

# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')

# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')

#### Load a table from disk

These work like normal pandas DataFrames, but if there is also a `my_table.meta.json` file, then metadata will also get read. Otherwise it will be assumed that the data has no metadata:

t = Table.read_feather('/tmp/my_table.feather')

t = Table.read_csv('/tmp/my_table.csv')
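
The optional-sidecar behaviour described above can be sketched with the standard library alone. This is a simplified illustration, not the library's reader: the helper `read_csv_with_meta` is hypothetical, and it returns plain rows and a dict rather than a `Table`.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def read_csv_with_meta(csv_path: Path) -> Tuple[List[Dict[str, str]], dict]:
    """Read rows from a CSV, plus metadata from `<name>.meta.json` if present."""
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    # my_table.csv -> my_table.meta.json
    meta_path = csv_path.with_suffix(".meta.json")
    # No sidecar file means the data is treated as having no metadata.
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    return rows, meta


tmp = Path(tempfile.mkdtemp())

# A table with a metadata sidecar next to it.
with_meta = tmp / "my_table.csv"
with_meta.write_text("country,gdp\nAU,1\nSE,2\n")
(tmp / "my_table.meta.json").write_text(json.dumps({"title": "Very important data"}))

# A table with no sidecar at all.
bare = tmp / "other_table.csv"
bare.write_text("country,gdp\nCH,3\n")

rows, meta = read_csv_with_meta(with_meta)
bare_rows, bare_meta = read_csv_with_meta(bare)
print(meta, bare_meta)
```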

## Changelog

- `dev`
- Remove `catalog.frames`; use `owid-repack` package instead
- Relax dependency constraints
- Add optional `channel` argument to `DatasetMeta`
- Stop supporting metadata in Parquet format, load JSON sidecar instead
- Fix errors when creating new Table columns
- `v0.3.4`
- Bump `pyarrow` dependency to enable Python 3.11 support
- `v0.3.3`
- Add more arguments to `Table.__init__` that are often used in ETL
- Add `Dataset.update_metadata` function for updating metadata from YAML file
- Python 3.11 support via update of `pyarrow` dependency
- `v0.3.2`
- Fix a bug in `Catalog.__getitem__()`
- Replace `mypy` type checker by `pyright`
- `v0.3.1`
- Sort imports with `isort`
- Change black line length to 120
- Add `grapher` channel
- Support path-based indexing into catalogs
- `v0.3.0`
- Support multiple formats per table
- Support reading and writing `parquet` files with embedded metadata
- Optional `repack` argument when adding tables to dataset
- Underscore `|`
- Get `version` field from `DatasetMeta` init
- Resolve collisions of `underscore_table` function
- Convert `version` to `str` and load json `dimensions`
- `v0.2.9`
- Allow multiple channels in `catalog.find` function
- `v0.2.8`
- `v0.2.7`
- Split datasets into channels (`garden`, `meadow`, `open_numbers`, ...) and make garden default one
- Add `.find_latest` method to Catalog
- `v0.2.6`
- Add flag `is_public` for public/private datasets
- Enforce snake_case for table, dataset and variable short names
- Add fields `published_by` and `published_at` to Source
- Added a list of supported and unsupported operations on columns
- Updated `pyarrow`
- `v0.2.5`
- Fix ability to load remote CSV tables
- `v0.2.4`
- Update the default catalog URL to use a CDN
- `v0.2.3`
- Fix methods for finding and loading data from a `LocalCatalog`
- `v0.2.2`
- Repack frames to compact dtypes on `Table.to_feather()`
- `v0.2.1`
- Fix key typo used in version check
- `v0.2.0`
- Copy dataset metadata into tables, to make tables more traceable
- Add API versioning, and a requirement to update if your version of this library is too old
- `v0.1.1`
- Add support for Python 3.8
- `v0.1.0`
- Initial release, including searching and fetching data from a remote catalog
This library has now been folded into OWID's ETL monorepo. Find it here:
