🎉 Add a new charts-based data API #2629

Merged
merged 3 commits on May 17, 2024
5 changes: 5 additions & 0 deletions lib/catalog/Makefile
@@ -10,6 +10,11 @@ SRC = owid tests
# watch:
# poetry run watchmedo shell-command -c 'clear; make unittest' --recursive --drop .

.venv: poetry.toml pyproject.toml poetry.lock
	@echo '==> Installing packages'
	poetry install
	touch .venv

check-typing: .venv
	@echo '==> Checking types'
	poetry run pyright $(SRC)
187 changes: 36 additions & 151 deletions lib/catalog/README.md
@@ -26,195 +26,80 @@ We would love feedback on how we can make this library and overall data catalog

## Quickstart

Install with `pip install owid-catalog`. Then you can get data in two different ways.

### Charts catalog

This API attempts to give you exactly the data you see in a chart on our site.

```python
from owid.catalog import charts

# get the data for one chart by URL
df = charts.get_data('https://ourworldindata.org/grapher/life-expectancy')
```

Notice that the last part of the URL is the chart's slug (its identifier), in this case `life-expectancy`. Using the slug alone also works.

```python
df = charts.get_data('life-expectancy')
```

To see what charts are available, you can list them all.

```python
>>> slugs = charts.list_charts()
>>> slugs[:5]
['above-ground-biomass-in-forest-per-hectare',
 'above-or-below-extreme-poverty-line-world-bank',
 'abs-change-energy-consumption',
 'absolute-change-co2',
 'absolute-gains-in-mean-female-height']
```
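
The frame you get back also carries the chart's metadata in the DataFrame's `attrs` attribute. A minimal sketch of inspecting it; the exact keys depend on the chart, so this just prints whatever is attached:

```python
from owid.catalog import charts

df = charts.get_data('life-expectancy')

# chart metadata travels with the frame in pandas' standard `attrs` dict
for key, value in df.attrs.items():
    print(key, '->', value)
```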

### Data science API

We also curate much more data than is available on our site. To access it in an efficient binary (Feather) format, use our data science API.

This API is designed for use in Jupyter notebooks.

```python
from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World in Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

# load data from a channel other than the default `garden` channel
lung_cancer_tables = catalog.find('lung_cancer_deaths_per_100000_men', channels=['open_numbers'])
df = lung_cancer_tables.iloc[0].load()
```

There may be multiple versions of the same dataset in a catalog, each with a unique path. To easily load the same dataset again, you should record its path and load it this way:

```python
from owid import catalog

path = 'garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate'

rc = catalog.RemoteCatalog()
df = rc[path]
```
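
If you don't know a dataset's path yet, start from `catalog.find()`: the matches come back as a plain pandas DataFrame, so you can inspect and filter them before loading. A small sketch; the exact columns depend on the catalog index, so we just print them:

```python
from owid import catalog

matches = catalog.find('gbd_mental_health_prevalence_rate')

# the match list is an ordinary pandas DataFrame, so inspect it as usual
print(matches.columns.tolist())
print(len(matches))

# load the first match as a data frame
df = matches.iloc[0].load()
```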

## Development

You need Python 3.8+, `poetry` and `make` installed. Clone the repo, then you can simply run:

```
# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch
```
## Data types

### Catalog

A catalog is an arbitrarily deep folder structure containing datasets. It can be local on disk or remote.

#### Load the remote catalog

```python
# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()

# get a list of matching tables in different datasets
matches = cat.find('population')

# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')

# load channels other than `garden`
cat = RemoteCatalog(channels=('garden', 'meadow', 'open_numbers'))
```

### Datasets

A dataset is a folder of tables containing metadata about the overall collection.

- Metadata about the dataset lives in `index.json`
- All tables in the folder must share a common format (CSV or Feather)

#### Create a new dataset

```python
# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')
```

```python
# choose CSV instead of Feather for files
ds = Dataset.create('/tmp/my_data', format='csv')
```

#### Add a table to a dataset

```python
# serialize a table using the table's name and the dataset's default format (Feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(table)
```

#### Remove a table from a dataset

```python
ds.remove('table_name')
```

#### Access a table

```python
# load a table including metadata into memory
t = ds['my_table']
```

#### List tables

```python
# the length is the number of tables discovered on disk
assert len(ds) > 0
```

```python
# iterate over the tables discovered on disk
for table in ds:
    do_something(table)
```

#### Add metadata

```python
# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()
```

#### Copy a dataset

```python
# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')

# copying a dataset is identical to copying its folder, so this works too
shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')
```
### Tables

Tables are essentially pandas DataFrames, but with metadata. All operations on them occur in-memory, except for loading from and saving to disk. On disk, they are represented by a tabular file (Feather or CSV) and a JSON metadata file.

Columns of a `Table` carry a `VariableMeta` attribute, including their type, description, and unit. Be careful when manipulating them, since not all operations are currently supported. Supported: adding a column and renaming columns. Not supported: direct assignment to `t.columns = ...` or to index names `t.columns.index = ...`.

#### Make a new table

```python
# same API as DataFrames
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
}).set_index('country')
```
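
Continuing with the table `t` above, a short sketch of the supported column operations; exactly how metadata propagates through these calls is version-dependent, so treat it as illustrative rather than a guarantee:

```python
# adding a column works just as it does in pandas
t['population'] = [26, 10, 9]

# renaming columns is supported; the intent is that metadata stays with the column
t = t.rename(columns={'gdp': 'gdp_2011'})
```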
#### Add metadata about the whole table

```python
t.title = 'Very important data'
```

#### Add metadata about a field

```python
t.gdp.description = 'GDP measured in 2011 international $'
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
```

#### Add metadata about all fields at once

```python
# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='https://creativecommons.org/licenses/by-nc/4.0/')
]
```

#### Save a table to disk

```python
# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')

# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')
```

#### Load a table from disk

These work like normal pandas DataFrames, but if there is also a `my_table.meta.json` file, then metadata will also get read. Otherwise it will be assumed that the data has no metadata:

```python
t = Table.read_feather('/tmp/my_table.feather')

t = Table.read_csv('/tmp/my_table.csv')
```

## Changelog

- `dev`
  - Add experimental chart data API in `owid.catalog.charts`
- `v0.3.8`
  - Switch from isort & black & flake8 to ruff
- `v0.3.8`
79 changes: 79 additions & 0 deletions lib/catalog/owid/catalog/charts.py
@@ -0,0 +1,79 @@
#
#  owid.catalog.charts
#
#
#  Access to data in OWID charts.
#

from dataclasses import dataclass
from typing import List, Optional

import pandas as pd

from .internal import (
    ChartNotFoundError,  # noqa
    LicenseError,  # noqa
    _fetch_bundle,
    _GrapherBundle,
    _list_charts,
)


@dataclass
class Chart:
    """
    A chart published on Our World in Data, for example:

    https://ourworldindata.org/grapher/life-expectancy
    """

    slug: str

    _bundle: Optional[_GrapherBundle] = None

    @property
    def bundle(self) -> _GrapherBundle:
        # LARS: give a nice error if the chart does not exist
        if self._bundle is None:
            self._bundle = _fetch_bundle(self.slug)

        return self._bundle

    @property
    def config(self) -> dict:
        return self.bundle.config  # type: ignore

    def get_data(self) -> pd.DataFrame:
        return self.bundle.to_frame()

    def __lt__(self, other):
        return self.slug < other.slug

    def __eq__(self, value: object) -> bool:
        return isinstance(value, Chart) and value.slug == self.slug


def list_charts() -> List[str]:
    """
    List all available charts published on Our World in Data, representing each via
    a short slug that you can use with `get_data()`.
    """
    return sorted(_list_charts())


def get_data(slug_or_url: str) -> pd.DataFrame:
    """
    Fetch the data for a chart by its slug or by the URL of the chart.

    Additional metadata about the chart is available in the DataFrame's `attrs` attribute.
    """
    if slug_or_url.startswith("https://ourworldindata.org/grapher/"):
        slug = slug_or_url.split("/")[-1]

    elif slug_or_url.startswith("https://"):
        raise ValueError("URL must be a Grapher URL, e.g. https://ourworldindata.org/grapher/life-expectancy")

    else:
        slug = slug_or_url

    return Chart(slug).get_data()
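
A quick usage sketch of this module; it needs network access to ourworldindata.org, so treat it as illustrative:

```python
from owid.catalog import charts

# slugs come back sorted, so the head of the list is alphabetically early charts
print(charts.list_charts()[:3])

# fetching by slug or by full Grapher URL both work
df = charts.get_data('life-expectancy')
print(df.head())
print(df.attrs)  # chart metadata attached by get_data()
```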