Intake 2.0 #737

Merged — 120 commits merged into master from reader on Jan 31, 2024

Commits (120)
81c5136
start refactor
martindurant May 17, 2023
b32ce36
Add minimal tests
martindurant May 18, 2023
03f855a
Prototype for magic bytes and glimpse
martindurant May 18, 2023
66e9eb8
Add more things
martindurant May 18, 2023
ff09a71
Add duck
martindurant May 18, 2023
c3a5fd4
More stuff
martindurant May 21, 2023
54b9073
fix test
martindurant May 21, 2023
1e54f29
Merge branch 'master' into reader
martindurant May 23, 2023
d2330ee
Add a little Ray
martindurant May 23, 2023
bba5c81
converters POC
martindurant May 23, 2023
c653af3
A little text
martindurant May 23, 2023
53f8b49
Add generalised data description
martindurant May 26, 2023
d05901a
doc
martindurant May 26, 2023
68d5b98
Remove arg for py38/39
martindurant May 26, 2023
f83138e
stop
martindurant May 29, 2023
a67f530
gut
martindurant May 31, 2023
98cc1ef
Make transformations
martindurant May 31, 2023
2ab1e29
converter types
martindurant May 31, 2023
1a1d0e9
Example backward compatibility for CSV
martindurant Jun 2, 2023
994aef3
guess files
martindurant Jun 5, 2023
d6ade03
Merge branch 'master' into reader
martindurant Jun 6, 2023
ac596e6
Prototypes
martindurant Jun 7, 2023
fbba222
Separate out readers by type
martindurant Jun 8, 2023
c3a6f47
corrections
martindurant Jun 8, 2023
5722992
it's all a pipeline
martindurant Jun 13, 2023
95c223d
stop point
martindurant Jun 15, 2023
33d6f2a
Fix linkages
martindurant Jun 15, 2023
e4c6ae2
Add attribute and getitem shortcuts
martindurant Jun 16, 2023
03e8341
Add rendering
martindurant Jun 16, 2023
61e628a
Nix dataclass
martindurant Jun 17, 2023
22e9644
Towards catalogs
martindurant Jun 26, 2023
c5ebe05
First up test
martindurant Jun 27, 2023
c4adfa0
Re-enable env expansion in user_parameters
martindurant Jun 29, 2023
28d973c
Similar patterns and start to fill out Catalog
martindurant Jun 30, 2023
83fa822
References to other data
martindurant Jul 4, 2023
e4f69a8
Extract parameter API POC
martindurant Jul 10, 2023
c52b789
Make user-parameters serialisable; pass UPs around; make started name…
martindurant Jul 13, 2023
518300d
remove breakpoint
martindurant Jul 13, 2023
7e3e73a
kwarg manipulation helpers
martindurant Jul 14, 2023
2b47ffc
stop point
martindurant Jul 15, 2023
4d78f7a
YAML for Catalog
martindurant Jul 17, 2023
b3ae557
ser to/from file
martindurant Jul 20, 2023
5eba282
README
martindurant Jul 20, 2023
d984268
Merge branch 'master' into reader
martindurant Jul 20, 2023
432e462
ups and ser redux
martindurant Aug 8, 2023
32af3a8
ready
martindurant Aug 10, 2023
00ad8a1
lazy
martindurant Aug 15, 2023
3041a59
easy lazy
martindurant Aug 18, 2023
478e54a
A little SQL love
martindurant Aug 23, 2023
35d935f
maybe a little green
martindurant Aug 24, 2023
b44aad9
again
martindurant Aug 24, 2023
3e10e55
xr hv plot
martindurant Aug 24, 2023
aae92fd
text
martindurant Aug 24, 2023
621358e
Add some data classes
martindurant Aug 29, 2023
1e6f9c2
more types
martindurant Aug 30, 2023
9790012
STAC catalog
martindurant Aug 31, 2023
ed7339d
merge stac bands
martindurant Aug 31, 2023
500eca8
tidy
martindurant Sep 1, 2023
0969647
thredds merge source
martindurant Sep 2, 2023
6efac99
Shonky retry
martindurant Sep 7, 2023
b64491a
Make workflow tests
martindurant Sep 8, 2023
0904016
Another workflow test
martindurant Sep 8, 2023
bbf5b77
smalls
martindurant Sep 10, 2023
c5642b2
Tighten up data types, add geopandas IO
martindurant Sep 11, 2023
e2cc3f5
Refactor converters
martindurant Sep 13, 2023
6753f55
First search and graph play
martindurant Sep 15, 2023
4e771ad
Add search test
martindurant Sep 18, 2023
47cca48
more args
martindurant Sep 19, 2023
ec905d8
Add some IO types
martindurant Sep 19, 2023
78259e7
improve guessing
martindurant Sep 20, 2023
6ad4576
Remove .lower() for now
martindurant Sep 25, 2023
a1b32fd
small changes for demo
martindurant Sep 27, 2023
8e34476
remove remote cat
martindurant Sep 27, 2023
1588f97
Merge branch 'master' into reader
martindurant Sep 27, 2023
6247256
Merge branch 'master' into reader
martindurant Sep 27, 2023
9d7d2d2
up supported python version
martindurant Sep 27, 2023
001334c
type annotations
martindurant Sep 27, 2023
9529e77
version-indep type annot
martindurant Sep 27, 2023
ea17ae6
Merge Readers and Converters
martindurant Sep 29, 2023
66954d1
caching example
martindurant Sep 29, 2023
c26a165
small changes
martindurant Oct 3, 2023
7ce9c43
more types
martindurant Oct 3, 2023
2174fe7
Add hugging
martindurant Oct 3, 2023
20e5c00
Add hug, sklearn and torch sets
martindurant Oct 5, 2023
5690e58
Some backport and TF example cat
martindurant Oct 7, 2023
f457ca1
Readd config and entrypoints; add MS buildings example
martindurant Oct 11, 2023
66d06b1
Compat and tab helper; some more formats
martindurant Oct 24, 2023
e9f95b5
Add ML data types
martindurant Oct 27, 2023
797859d
Search framework
martindurant Nov 1, 2023
72198c9
Add environment consistency check and metadata field descriptions
martindurant Nov 2, 2023
3a5f870
Format and move to pyproject
martindurant Nov 6, 2023
eba7f2b
remove some old stuff
martindurant Nov 9, 2023
252ad54
add earthaccess catalog/reader
martindurant Nov 10, 2023
2b6d894
DOI to concept-id and doc
martindurant Nov 10, 2023
513169d
Catalog doc strings
martindurant Nov 13, 2023
2c9b7a3
Lots of changes, compat and docs
martindurant Nov 20, 2023
93fdfb9
More doc, docstrings, convenience and fixes
martindurant Nov 23, 2023
7bd9a6a
GUI compat and attribute lookup safety
martindurant Nov 28, 2023
3781b0c
revert pickle to fix test
martindurant Nov 28, 2023
91a5730
Make GUI work, add V2 docs and allow real names in category entries.
martindurant Dec 1, 2023
d4dd0cf
More docs
martindurant Dec 4, 2023
745ebd4
a2
martindurant Dec 4, 2023
42b9c75
Remove plugin build
martindurant Dec 4, 2023
82eb680
later py
martindurant Dec 4, 2023
4c09108
Small corrections + example notebook
martindurant Dec 6, 2023
49c9d3b
Add HDL reader and walkthrough from PyData
martindurant Dec 8, 2023
ec949fd
Add polars and delta
martindurant Dec 21, 2023
1b2d09b
fits
martindurant Jan 2, 2024
74a934b
Add asdf
martindurant Jan 2, 2024
15db444
Add tiledb
martindurant Jan 4, 2024
84fba28
Consistency fixes
martindurant Jan 4, 2024
14e56b7
Add prometheus reader
martindurant Jan 15, 2024
1d2c435
Add much documentation
martindurant Jan 25, 2024
86cd70f
more docs
martindurant Jan 27, 2024
ba1cfec
Reduce deps
martindurant Jan 28, 2024
f35ad9a
update xarray reader args
martindurant Jan 30, 2024
fd76056
Merge branch 'master' into reader
martindurant Jan 30, 2024
68aeeec
lint
martindurant Jan 30, 2024
7c87ed5
allow thredds test to fail
martindurant Jan 31, 2024
7b2131a
arg name
martindurant Jan 31, 2024
4 changes: 2 additions & 2 deletions .github/workflows/main.yaml
@@ -13,7 +13,7 @@ jobs:
strategy:
fail-fast: false
matrix:
CONDA_ENV: [py38, py39, py310, py311, pip]
CONDA_ENV: [py39, py310, py311, pip]
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -31,4 +31,4 @@ jobs:
- name: Run Tests
shell: bash -l {0}
run: |
pytest -v
pytest -v intake/readers
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
4 changes: 0 additions & 4 deletions .pre-commit-config.yaml
@@ -24,10 +24,6 @@ repos:
- id: ruff # See 'setup.cfg' for args
args: [intake]
files: intake/
- repo: https://github.com/pycqa/isort
rev: 5.12.0
hooks:
- id: isort
- repo: https://github.com/hoxbro/clean_notebook
rev: 0.1.5
hooks:
20 changes: 0 additions & 20 deletions MANIFEST.in

This file was deleted.

32 changes: 14 additions & 18 deletions README.md
@@ -1,24 +1,26 @@
# Intake: A general interface for loading data
# Intake: Take 2

**A general python package for describing, loading and processing data**

![Logo](https://github.com/intake/intake/raw/master/logo-small.png)

[![Build Status](https://github.com/intake/intake/workflows/CI/badge.svg)](https://github.com/intake/intake/actions)
[![Documentation Status](https://readthedocs.org/projects/intake/badge/?version=latest)](http://intake.readthedocs.io/en/latest/?badge=latest)
[![Join the chat at https://gitter.im/ContinuumIO/intake](https://badges.gitter.im/ContinuumIO/intake.svg)](https://gitter.im/ContinuumIO/intake?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)


Intake is a lightweight set of tools for loading and sharing data in data science projects.
Intake helps you:
*Taking the pain out of data access and distribution*

Intake is an open-source package to:

* Load data from a variety of formats (see the [current list of known plugins](http://intake.readthedocs.io/en/latest/plugin-directory.html)) into containers you already know, like Pandas dataframes, Python lists, NumPy arrays, and more.
* Convert boilerplate data loading code into reusable Intake plugins
* Describe data sets in catalog files for easy reuse and sharing between projects and with others.
* Share catalog information (and data sets) over the network with the Intake server
- describe your data declaratively
- gather data sets into catalogs
- search catalogs and services to find the right data you need
- load, transform and output data in many formats
- work with third party remote storage and compute platforms

Documentation is available at [Read the Docs](http://intake.readthedocs.io/en/latest).

Weekly news about this repo and other related projects can be found on the
[wiki](https://github.com/intake/intake/wiki/Community-News)
Please report issues at https://github.com/intake/intake/issues

Install
-------
@@ -35,21 +37,15 @@ dependencies you install, with the simplest having least requirements
pip install intake
```

and additional sections `[server]`, `[plot]` and `[dataframe]`, or to include everything:

```bash
pip install intake[complete]
```

Note that you may well need specific drivers and other plugins, which usually have additional
dependencies of their own.

Development
-----------
* Create development Python environment with the required dependencies, ideally with `conda`.
The requirements can be found in the yml files in the `scripts/ci/` directory of this repo.
* e.g. `conda env create -f scripts/ci/environment-py38.yml` and then `conda activate test_env`
* Install intake using `pip install -e .[complete]`
* e.g. `conda env create -f scripts/ci/environment-py311.yml` and then `conda activate test_env`
* Install intake using `pip install -e .`
* Use `pytest` to run tests.
* Create a fork on github to be able to submit PRs.
* We respect, but do not enforce, pep8 standards; all new code should be covered by tests.
134 changes: 134 additions & 0 deletions README_refactor.md
@@ -0,0 +1,134 @@
## Intake Take 2

Intake has been extensively rewritten to produce Intake Take 2
(https://github.com/intake/intake/pull/737).
This will now become the version on the ``main`` branch and be released as v2.0.0. The
main documentation will move to describing v2, and v1 will not be developed further.
Existing users of the legacy version ("v1") may find that their code breaks and will need
a version pin, although we aim to support most legacy workflows via backward compatibility.

To install the new version:

```shell
> pip install intake
or
> conda install intake
```

To get v1:

```shell
> pip install "intake<2"
or
> conda install "intake<2"
```

This README is being kept to describe why the rewrite was done and considerations that
went into it.

### Motivation for the rewrite

The main way to get the most out of Intake v1 has been to edit YAML files, and this is
how the documentation is structured. Yes, you could use intake.open_* to seed them, but then
you would find a strong discontinuity between the documentation of the driver and that of the
third-party library that actually does the reading.

This made it very unlikely that a novice data-oriented Python user would become someone
who can create even the simplest catalogs, and they would certainly never use more advanced
features like parametrisation or derived datasets. The new model eases users in and lends
itself to being overlaid with graphical/wizard interfaces (e.g., in JupyterLab or in
preparation for use with
[anaconda.cloud](https://docs.anaconda.com/free/anaconda-notebooks/notebook-data-catalog/)).

### Main changes

This is a total rewrite. Backward compatibility is desired and some v1 sources have been
rewritten to use the v2 readers.

#### Simplification

We are dropping features that added complexity but were only rarely used.

- the server; the Intake server was never production-ready, and most
use-cases can be provided by [tiled](https://blueskyproject.io/tiled/)
- the caching/persist stuff; files can be persisted by fsspec, and we maintain the ability to
write to various formats
- explicit dependence on dask; dask is just one of many possible compute engines, and
we should not be tied to one
- less added functionality in the readers (like file pattern stuff)
- explicit dependence on hvplot (but you can still choose to use it)
- the CLI


#### New structure

Many new classes have appeared. From an intake-savvy point of view, the biggest change is
the splitting of "drivers" into "data" and "reader". I view them as the objective description
of what the dataset is (e.g., "this is CSV at this URL") versus how you might load it
("call pandas with these arguments"). This strongly implies that you might want to read the
same data in different ways. Crucially, it makes the readers much easier to write.

Here is the Awkward reader for parquet files. Particularly for files, often all you need to do
is specify which function will do the read and which keyword accepts the URL.
```python
class AwkwardParquet(Awkward):
implements = {datatypes.Parquet}
imports = {"awkward", "pyarrow"}
func = "awkward:from_parquet"
url_arg = "path"
```

The imports are declared and deferred until needed, so there is no need to make all those intake-*
repos with their own dependencies. (Of course, you might still want to declare packages
and requirements; we are considering whether catalogs should have requirements, but this is better
suited to something like conda-project.) The arguments accepted are the same as those of the
target function, and the method `.doc()` will show this.
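
As a rough sketch of how such a reader might be used — the URL and the exact call pattern
here are illustrative assumptions, not taken verbatim from this PR:

```python
import intake

# An objective description of the data: what it is and where it lives
# (the URL is a placeholder, not a real dataset)
data = intake.readers.datatypes.Parquet(url="s3://mybucket/records.parquet")

# One of possibly several readers for this data type; awkward and pyarrow
# are only imported when the read actually happens
reader = AwkwardParquet(data)
arr = reader.read()
```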


### New features

- a recommendation system that tries to guess the right data type from a URL or an existing
function call, and the readers that can use that type (for each, it tells you the instance it
makes and provides docs). This can be extended to "I have this type but I want this other type;
what set of steps gets me there?"
- embracing any compute engine as first-class (e.g., duckdb, dask, ray, spark) or none
- no constraints on the types of data that can/should be returned
- pipeline-building tools, including explicit conversion, type operations, generalised getattr and
getitem (like dask delayed) and apply. Most of these are available as "transform" attributes,
including new namespaces: `reader.np.max(...)` will call numpy on whatever the reader makes, but
lazily (see the sketch after this list).
- output functions, as a special type of "conversion", returning a new data description for further
manipulation. This is effectively caching (we would like to add conditions to the pipeline, so that
data is only loaded and converted if the converted version doesn't already exist).
- generalised derived datasets, including functions of multiple intake inputs. A data or any reader
output might be the input of any other reader, forming a graph. Picking a specific output from those
possible gives you the pipeline, ready for execution. Any such pipeline could be encoded in a catalog.
- user parameters are similar to before, but are also pluggable; a few types are provided.
Some helper methods have been made to walk data/reader kwargs and extract default values as
parameters, replacing their original value with a reference to the parameter. The parameters
are hierarchical: catalog -> data -> reader.
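
A minimal sketch of the recommendation and pipeline features, using function names that appear
in this PR's API listing (`recommend`, `auto_pipeline`); the URL and the exact output-type
string are assumptions:

```python
import intake

# Ask which data types a URL might correspond to
candidates = intake.readers.datatypes.recommend("s3://mybucket/table.csv")

# Build a pipeline straight from a URL to a desired output type
reader = intake.readers.convert.auto_pipeline(
    "s3://mybucket/table.csv", outtype="pandas:DataFrame"
)

# Compose further steps lazily via namespaces; nothing executes until read()
maxes = reader.np.max()
result = maxes.read()
```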

Some examples of each of these exist in the current state of the code. There are many, many more to
write, but the functions themselves are really simple. This is aiming for composition and easy
crowd-sourcing: a high bus factor.

### Work to follow

- thorough search capability, which will need some thought in this context
- compatibility with remaining existing intake plugins
- the catalog serialisation currently uses custom YAML tags, but this should not be necessary
- add those magic methods that make pipelines work on descriptions in catalogs, not just
materialised readers.
- metadata conventions, to persist basic dataset properties (e.g., based on the frictionlessdata
spec), and validation as a pipeline operation you can apply to any data entry using any available
reader that can produce the info
- probably much more - I will need help!

### Unanswered questions

- actual functions and classes are now embedded into any YAML-serialised catalog as strings. These
are imported/instantiated when the reader is instantiated from its description. So arbitrary
code execution is possible, but not at catalog parse time. We only have a loose permissions-config
story around this
- this implementation maintains the distinction between "descriptions" (which have templated values
and user parameters) and readers (which only have concrete values and real instances). Is this a
major confusion we somehow want to eliminate in v2?
2 changes: 1 addition & 1 deletion docs/environment.yml
@@ -4,7 +4,7 @@ channels:

dependencies:
- appdirs
- python=3.8
- python=3.10
- dask
- numpy
- pandas
104 changes: 104 additions & 0 deletions docs/make_api.py
@@ -0,0 +1,104 @@
import os
import sys
import intake


def run(path):
fn = os.path.join(path, "source", "api2.rst")
with open(fn, "w") as f:
print(
f"""
API Reference
=============

User Functions
--------------

.. autosummary::
intake.config.Config
intake.readers.datatypes.recommend
intake.readers.convert.auto_pipeline
intake.readers.entry.Catalog
intake.readers.entry.DataDescription
intake.readers.entry.ReaderDescription
intake.readers.readers.recommend
intake.readers.readers.reader_from_call

.. autoclass:: intake.config.Config
:members:

.. autofunction:: intake.readers.datatypes.recommend

.. autofunction:: intake.readers.convert.auto_pipeline

.. autoclass:: intake.readers.entry.Catalog
:members:

.. autoclass:: intake.readers.entry.DataDescription
:members:

.. autoclass:: intake.readers.entry.ReaderDescription
:members:

.. autofunction:: intake.readers.readers.recommend

.. autofunction:: intake.readers.readers.reader_from_call

Base Classes
------------

These may be subclassed by developers

.. autosummary::""",
file=f,
)
bases = (
"intake.readers.datatypes.BaseData",
"intake.readers.readers.BaseReader",
"intake.readers.convert.BaseConverter",
"intake.readers.namespaces.Namespace",
"intake.readers.search.SearchBase",
"intake.readers.user_parameters.BaseUserParameter",
)
for base in bases:
print(" ", base, file=f)
print(file=f)
for base in bases:
print(
f""".. autoclass:: {base}
:members:
""",
file=f,
)

print(
"""

Data Classes
------------

.. autosummary::""",
file=f,
)
for cls in sorted(intake.readers.subclasses(intake.BaseData), key=lambda c: c.qname()):
print(" ", cls.qname().replace(":", "."), file=f)
print(
"""

Reader Classes
--------------

Includes readers, transformers, converters and output classes.

.. autosummary::""",
file=f,
)
for cls in sorted(intake.readers.subclasses(intake.BaseReader), key=lambda c: c.qname()):
print(" ", cls.qname().replace(":", "."), file=f)


if __name__ == "__main__":
    # run as a script: write source/api2.rst relative to this file's directory
    here = os.path.abspath(os.path.dirname(sys.argv[0]))
    run(here)
else:
    # when imported (e.g., by Sphinx's conf.py), only compute the directory;
    # the importer can call run(here) itself
    here = os.path.abspath(os.path.dirname(__file__))