Intake 2.0 #737

Merged — 120 commits merged into master from reader on Jan 31, 2024

Commits (120)
81c5136
start refactor
martindurant May 17, 2023
b32ce36
Add minimal tests
martindurant May 18, 2023
03f855a
Prototype for magic bytes and glimpse
martindurant May 18, 2023
66e9eb8
Add more things
martindurant May 18, 2023
ff09a71
Add duck
martindurant May 18, 2023
c3a5fd4
More stuff
martindurant May 21, 2023
54b9073
fix test
martindurant May 21, 2023
1e54f29
Merge branch 'master' into reader
martindurant May 23, 2023
d2330ee
Add a little Ray
martindurant May 23, 2023
bba5c81
converters POC
martindurant May 23, 2023
c653af3
A little text
martindurant May 23, 2023
53f8b49
Add generalised data description
martindurant May 26, 2023
d05901a
doc
martindurant May 26, 2023
68d5b98
Remove arg for py38/39
martindurant May 26, 2023
f83138e
stop
martindurant May 29, 2023
a67f530
gut
martindurant May 31, 2023
98cc1ef
Make transformations
martindurant May 31, 2023
2ab1e29
converter types
martindurant May 31, 2023
1a1d0e9
Example backward compatibility for CSV
martindurant Jun 2, 2023
994aef3
guess files
martindurant Jun 5, 2023
d6ade03
Merge branch 'master' into reader
martindurant Jun 6, 2023
ac596e6
Prototypes
martindurant Jun 7, 2023
fbba222
Separate out readers by type
martindurant Jun 8, 2023
c3a6f47
corrections
martindurant Jun 8, 2023
5722992
it's all a pipeline
martindurant Jun 13, 2023
95c223d
stop point
martindurant Jun 15, 2023
33d6f2a
Fix linkages
martindurant Jun 15, 2023
e4c6ae2
Add attribute and getitem shortcuts
martindurant Jun 16, 2023
03e8341
Add rendering
martindurant Jun 16, 2023
61e628a
Nix dataclass
martindurant Jun 17, 2023
22e9644
Towards catalogs
martindurant Jun 26, 2023
c5ebe05
First up test
martindurant Jun 27, 2023
c4adfa0
Re-enable env expansion in user_parameters
martindurant Jun 29, 2023
28d973c
Similar patterns and start to fill out Catalog
martindurant Jun 30, 2023
83fa822
References to other data
martindurant Jul 4, 2023
e4f69a8
Extract parameter API POC
martindurant Jul 10, 2023
c52b789
Make user-parameters serialisable; pass UPs around; make started name…
martindurant Jul 13, 2023
518300d
remove breakpoint
martindurant Jul 13, 2023
7e3e73a
kwarg manipulation helpers
martindurant Jul 14, 2023
2b47ffc
stop point
martindurant Jul 15, 2023
4d78f7a
YAML for Catalog
martindurant Jul 17, 2023
b3ae557
ser to/from file
martindurant Jul 20, 2023
5eba282
README
martindurant Jul 20, 2023
d984268
Merge branch 'master' into reader
martindurant Jul 20, 2023
432e462
ups and ser redux
martindurant Aug 8, 2023
32af3a8
ready
martindurant Aug 10, 2023
00ad8a1
lazy
martindurant Aug 15, 2023
3041a59
easy lazy
martindurant Aug 18, 2023
478e54a
A little SQL love
martindurant Aug 23, 2023
35d935f
maybe a little green
martindurant Aug 24, 2023
b44aad9
again
martindurant Aug 24, 2023
3e10e55
xr hv plot
martindurant Aug 24, 2023
aae92fd
text
martindurant Aug 24, 2023
621358e
Add some data classes
martindurant Aug 29, 2023
1e6f9c2
more types
martindurant Aug 30, 2023
9790012
STAC catalog
martindurant Aug 31, 2023
ed7339d
merge stac bands
martindurant Aug 31, 2023
500eca8
tidy
martindurant Sep 1, 2023
0969647
thredds merge source
martindurant Sep 2, 2023
6efac99
Shonky retry
martindurant Sep 7, 2023
b64491a
Make workflow tests
martindurant Sep 8, 2023
0904016
Another workflow test
martindurant Sep 8, 2023
bbf5b77
smalls
martindurant Sep 10, 2023
c5642b2
Tighten up data types, add geopandas IO
martindurant Sep 11, 2023
e2cc3f5
Refactor converters
martindurant Sep 13, 2023
6753f55
First search and graph play
martindurant Sep 15, 2023
4e771ad
Add search test
martindurant Sep 18, 2023
47cca48
more args
martindurant Sep 19, 2023
ec905d8
Add some IO types
martindurant Sep 19, 2023
78259e7
improve guessing
martindurant Sep 20, 2023
6ad4576
Remove .lower() for now
martindurant Sep 25, 2023
a1b32fd
small changes for demo
martindurant Sep 27, 2023
8e34476
remove remote cat
martindurant Sep 27, 2023
1588f97
Merge branch 'master' into reader
martindurant Sep 27, 2023
6247256
Merge branch 'master' into reader
martindurant Sep 27, 2023
9d7d2d2
up supported python version
martindurant Sep 27, 2023
001334c
type annotations
martindurant Sep 27, 2023
9529e77
version-indep type annot
martindurant Sep 27, 2023
ea17ae6
Merge Readers and Converters
martindurant Sep 29, 2023
66954d1
caching example
martindurant Sep 29, 2023
c26a165
small changes
martindurant Oct 3, 2023
7ce9c43
more types
martindurant Oct 3, 2023
2174fe7
Add hugging
martindurant Oct 3, 2023
20e5c00
Add hug, sklearn and torch sets
martindurant Oct 5, 2023
5690e58
Some backport and TF example cat
martindurant Oct 7, 2023
f457ca1
Readd config and entrypoints; add MS buildings example
martindurant Oct 11, 2023
66d06b1
Compat and tab helper; some more formats
martindurant Oct 24, 2023
e9f95b5
Add ML data types
martindurant Oct 27, 2023
797859d
Search framework
martindurant Nov 1, 2023
72198c9
Add environment consistency check and metadata field descriptions
martindurant Nov 2, 2023
3a5f870
Format and move to pyproject
martindurant Nov 6, 2023
eba7f2b
remove some old stuff
martindurant Nov 9, 2023
252ad54
add earthaccess catalog/reader
martindurant Nov 10, 2023
2b6d894
DOI to concept-id and doc
martindurant Nov 10, 2023
513169d
Catalog doc strings
martindurant Nov 13, 2023
2c9b7a3
Lots of changes, compat and docs
martindurant Nov 20, 2023
93fdfb9
More doc, docstrings, convenience and fixes
martindurant Nov 23, 2023
7bd9a6a
GUI compat and attribute lookup safety
martindurant Nov 28, 2023
3781b0c
revert pickle to fix test
martindurant Nov 28, 2023
91a5730
Make GUI work, add V2 docs and allow real names in category entries.
martindurant Dec 1, 2023
d4dd0cf
More docs
martindurant Dec 4, 2023
745ebd4
a2
martindurant Dec 4, 2023
42b9c75
Remove plugin build
martindurant Dec 4, 2023
82eb680
later py
martindurant Dec 4, 2023
4c09108
Small corrections + example notebook
martindurant Dec 6, 2023
49c9d3b
Add HDL reader and walkthrough from PyData
martindurant Dec 8, 2023
ec949fd
Add polars and delta
martindurant Dec 21, 2023
1b2d09b
fits
martindurant Jan 2, 2024
74a934b
Add asdf
martindurant Jan 2, 2024
15db444
Add tiledb
martindurant Jan 4, 2024
84fba28
Consistency fixes
martindurant Jan 4, 2024
14e56b7
Add prometheus reader
martindurant Jan 15, 2024
1d2c435
Add much documentation
martindurant Jan 25, 2024
86cd70f
more docs
martindurant Jan 27, 2024
ba1cfec
Reduce deps
martindurant Jan 28, 2024
f35ad9a
update xarray reader args
martindurant Jan 30, 2024
fd76056
Merge branch 'master' into reader
martindurant Jan 30, 2024
68aeeec
lint
martindurant Jan 30, 2024
7c87ed5
allow thredds test to fail
martindurant Jan 31, 2024
7b2131a
arg name
martindurant Jan 31, 2024
4 changes: 2 additions & 2 deletions .github/workflows/main.yaml
@@ -13,7 +13,7 @@ jobs:
strategy:
fail-fast: false
matrix:
CONDA_ENV: [py38, py39, py310, py311, pip]
CONDA_ENV: [py39, py310, py311, pip]
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -31,4 +31,4 @@ jobs:
- name: Run Tests
shell: bash -l {0}
run: |
pytest -v
pytest -v intake/readers
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
4 changes: 0 additions & 4 deletions .pre-commit-config.yaml
@@ -24,10 +24,6 @@ repos:
- id: ruff # See 'setup.cfg' for args
args: [intake]
files: intake/
- repo: https://github.com/pycqa/isort
rev: 5.12.0
hooks:
- id: isort
- repo: https://github.com/hoxbro/clean_notebook
rev: 0.1.5
hooks:
20 changes: 0 additions & 20 deletions MANIFEST.in

This file was deleted.

32 changes: 14 additions & 18 deletions README.md
@@ -1,24 +1,26 @@
# Intake: A general interface for loading data
# Intake: Take 2

**A general python package for describing, loading and processing data**

![Logo](https://github.com/intake/intake/raw/master/logo-small.png)

[![Build Status](https://github.com/intake/intake/workflows/CI/badge.svg)](https://github.com/intake/intake/actions)
[![Documentation Status](https://readthedocs.org/projects/intake/badge/?version=latest)](http://intake.readthedocs.io/en/latest/?badge=latest)
[![Join the chat at https://gitter.im/ContinuumIO/intake](https://badges.gitter.im/ContinuumIO/intake.svg)](https://gitter.im/ContinuumIO/intake?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)


Intake is a lightweight set of tools for loading and sharing data in data science projects.
Intake helps you:
*Taking the pain out of data access and distribution*

Intake is an open-source package to:

* Load data from a variety of formats (see the [current list of known plugins](http://intake.readthedocs.io/en/latest/plugin-directory.html)) into containers you already know, like Pandas dataframes, Python lists, NumPy arrays, and more.
* Convert boilerplate data loading code into reusable Intake plugins
* Describe data sets in catalog files for easy reuse and sharing between projects and with others.
* Share catalog information (and data sets) over the network with the Intake server
- describe your data declaratively
- gather data sets into catalogs
- search catalogs and services to find the right data you need
- load, transform and output data in many formats
- work with third party remote storage and compute platforms

Documentation is available at [Read the Docs](http://intake.readthedocs.io/en/latest).

Weekly news about this repo and other related projects can be found on the
[wiki](https://github.com/intake/intake/wiki/Community-News)
Please report issues at https://github.com/intake/intake/issues

Install
-------
@@ -35,21 +37,15 @@ dependencies you install, with the simplest having least requirements
pip install intake
```

and additional sections `[server]`, `[plot]` and `[dataframe]`, or to include everything:

```bash
pip install intake[complete]
```

Note that you may well need specific drivers and other plugins, which usually have additional
dependencies of their own.

Development
-----------
* Create development Python environment with the required dependencies, ideally with `conda`.
The requirements can be found in the yml files in the `scripts/ci/` directory of this repo.
* e.g. `conda env create -f scripts/ci/environment-py38.yml` and then `conda activate test_env`
* Install intake using `pip install -e .[complete]`
* e.g. `conda env create -f scripts/ci/environment-py311.yml` and then `conda activate test_env`
* Install intake using `pip install -e .`
* Use `pytest` to run tests.
* Create a fork on github to be able to submit PRs.
* We respect, but do not enforce, pep8 standards; all new code should be covered by tests.
134 changes: 134 additions & 0 deletions README_refactor.md
@@ -0,0 +1,134 @@
## Intake Take 2

Intake has been extensively rewritten to produce Intake Take 2
(https://github.com/intake/intake/pull/737).
This will now become the version on the ``main`` branch and be released as v2.0.0. The
main documentation will move to describing v2, and v1 will not be developed further.
Existing users of the legacy version ("v1") may find that their code breaks and will need
a version pin, although we aim to support most legacy workflows via backward compatibility.

To install the new version:

```shell
> pip install intake
or
> conda install intake
```

To get v1:

```shell
> pip install "intake<2"
or
> conda install "intake<2"
```

This README is being kept to describe why the rewrite was done and considerations that
went into it.

### Motivation for the rewrite

The main way to get the most out of Intake v1 has been to edit YAML files, and this is
how the documentation is structured. Yes, you could use intake.open_* to seed them, but then
you would find a strong discontinuity between the documentation of the driver and that of the
third-party library that actually does the reading.

This made it very unlikely that a novice data-oriented Python user would become someone
who can create even the simplest catalogs, and they would certainly never use more advanced
features like parametrisation or derived datasets. The new model eases users in and lends
itself to being overlaid with graphical/wizard interfaces (e.g., in JupyterLab or in
preparation for use with
[anaconda.cloud](https://docs.anaconda.com/free/anaconda-notebooks/notebook-data-catalog/)).

### Main changes

This is a total rewrite. Backward compatibility is desired and some v1 sources have been
rewritten to use the v2 readers.

#### Simplification

We are dropping features that added complexity but were only rarely used.

- the server; the Intake server was never production-ready, and most
use-cases can be provided by [tiled](https://blueskyproject.io/tiled/)
- the caching/persist stuff; files can be persisted by fsspec, and we maintain the ability to
write to various formats
- explicit dependence on dask; dask is just one of many possible compute engines, and
we should not be tied to one
- less added functionality in the readers (like file pattern stuff)
- explicit dependence on hvplot (but you can still choose to use it)
- the CLI


#### New structure

Many new classes have appeared. From an intake-savvy point of view, the biggest change is
the splitting of "drivers" into "data" and "reader". I view them as the objective description
of what the dataset is (e.g., "this is CSV at this URL") versus how you might load it
("call pandas with these arguments"). This strongly implies that you might want to read the
same data in different ways. Crucially, it makes the readers much easier to write.

Here is the Awkward reader for parquet files. Particularly for files, often all you need to do
is specify which function will do the read and which keyword accepts the URL.
```python
class AwkwardParquet(Awkward):
implements = {datatypes.Parquet}
imports = {"awkward", "pyarrow"}
func = "awkward:from_parquet"
url_arg = "path"
```

The imports are declared and deferred until needed, so there is no need to make all those intake-*
repos with their own dependencies. (Of course, you might still want to declare packages
and requirements; we are considering whether catalogs should have requirements, but this is better
suited to something like conda-project.) The arguments accepted are the same as those of the
target function, and the method `.doc()` will show this.
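
As a rough sketch of how such a reader might be used — the URL and the exact call pattern
here are illustrative assumptions, not taken verbatim from this PR:

```python
import intake

# An objective description of the data: what it is and where it lives
# (the URL is a placeholder, not a real dataset)
data = intake.readers.datatypes.Parquet(url="s3://mybucket/records.parquet")

# One of possibly several readers for this data type; awkward and pyarrow
# are only imported when the read actually happens
reader = AwkwardParquet(data)
arr = reader.read()
```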


### New features

- a recommendation system that tries to guess the right data type from a URL or an existing
function call, and the readers that can use that type (for each, it tells you the instance it
makes and provides docs). This can be extended to "I have this type but I want this other type;
what set of steps gets me there?"
- embracing any compute engine as first-class (e.g., duckdb, dask, ray, spark) or none
- no constraints on the types of data that can/should be returned
- pipeline-building tools, including explicit conversion, type operations, generalised getattr and
getitem (like dask delayed) and apply. Most of these are available as "transform" attributes,
including new namespaces: `reader.np.max(...)` will call numpy on whatever the reader makes, but
lazily (see the sketch after this list).
- output functions, as a special type of "conversion", returning a new data description for further
manipulation. This is effectively caching (we would like to add conditions to the pipeline, so that
data is only loaded and converted if the converted version doesn't already exist).
- generalised derived datasets, including functions of multiple intake inputs. A data or any reader
output might be the input of any other reader, forming a graph. Picking a specific output from those
possible gives you the pipeline, ready for execution. Any such pipeline could be encoded in a catalog.
- user parameters are similar to before, but are also pluggable; a few types are provided.
Some helper methods have been made to walk data/reader kwargs and extract default values as
parameters, replacing their original value with a reference to the parameter. The parameters
are hierarchical: catalog -> data -> reader.
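
A minimal sketch of the recommendation and pipeline features, using function names that appear
in this PR's API listing (`recommend`, `auto_pipeline`); the URL and the exact output-type
string are assumptions:

```python
import intake

# Ask which data types a URL might correspond to
candidates = intake.readers.datatypes.recommend("s3://mybucket/table.csv")

# Build a pipeline straight from a URL to a desired output type
reader = intake.readers.convert.auto_pipeline(
    "s3://mybucket/table.csv", outtype="pandas:DataFrame"
)

# Compose further steps lazily via namespaces; nothing executes until read()
maxes = reader.np.max()
result = maxes.read()
```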

Some examples of each of these exist in the current state of the code. There are many, many more to
write, but the functions themselves are really simple. This is aiming for composition and easy
crowd-sourcing: a high bus factor.

### Work to follow

- thorough search capability, which will need some thought in this context
- compatibility with remaining existing intake plugins
- the catalog serialisation currently uses custom YAML tags, but this should not be necessary
- add those magic methods that make pipelines work on descriptions in catalogs, not just
materialised readers.
- metadata conventions, to persist basic dataset properties (e.g., based on the frictionlessdata
spec), and validation as a pipeline operation you can apply to any data entry using any available
reader that can produce the info
- probably much more - I will need help!

### Unanswered questions

- actual functions and classes are now embedded into any YAML-serialised catalog as strings. These
are imported/instantiated when the reader is instantiated from its description. So arbitrary
code execution is possible, but not at catalog parse time. We only have a loose permissions-config
story around this
- this implementation maintains the distinction between "descriptions" (which have templated values
and user parameters) and readers (which only have concrete values and real instances). Is this a
major confusion we somehow want to eliminate in v2?
2 changes: 1 addition & 1 deletion docs/environment.yml
@@ -4,7 +4,7 @@ channels:

dependencies:
- appdirs
- python=3.8
- python=3.10
- dask
- numpy
- pandas
104 changes: 104 additions & 0 deletions docs/make_api.py
@@ -0,0 +1,104 @@
import os
import sys
import intake


def run(path):
fn = os.path.join(path, "source", "api2.rst")
with open(fn, "w") as f:
print(
f"""
API Reference
=============

User Functions
--------------

.. autosummary::
intake.config.Config
intake.readers.datatypes.recommend
intake.readers.convert.auto_pipeline
intake.readers.entry.Catalog
intake.readers.entry.DataDescription
intake.readers.entry.ReaderDescription
intake.readers.readers.recommend
intake.readers.readers.reader_from_call

.. autoclass:: intake.config.Config
:members:

.. autofunction:: intake.readers.datatypes.recommend

.. autofunction:: intake.readers.convert.auto_pipeline

.. autoclass:: intake.readers.entry.Catalog
:members:

.. autoclass:: intake.readers.entry.DataDescription
:members:

.. autoclass:: intake.readers.entry.ReaderDescription
:members:

.. autofunction:: intake.readers.readers.recommend

.. autofunction:: intake.readers.readers.reader_from_call

Base Classes
------------

These may be subclassed by developers

.. autosummary::""",
file=f,
)
bases = (
"intake.readers.datatypes.BaseData",
"intake.readers.readers.BaseReader",
"intake.readers.convert.BaseConverter",
"intake.readers.namespaces.Namespace",
"intake.readers.search.SearchBase",
"intake.readers.user_parameters.BaseUserParameter",
)
for base in bases:
print(" ", base, file=f)
print(file=f)
for base in bases:
print(
f""".. autoclass:: {base}
:members:
""",
file=f,
)

print(
"""

Data Classes
------------

.. autosummary::""",
file=f,
)
for cls in sorted(intake.readers.subclasses(intake.BaseData), key=lambda c: c.qname()):
print(" ", cls.qname().replace(":", "."), file=f)
print(
"""

Reader Classes
--------------

Includes readers, transformers, converters and output classes.

.. autosummary::""",
file=f,
)
for cls in sorted(intake.readers.subclasses(intake.BaseReader), key=lambda c: c.qname()):
print(" ", cls.qname().replace(":", "."), file=f)


if __name__ == "__main__":
    # run as a script: write source/api2.rst relative to this file's directory
    here = os.path.abspath(os.path.dirname(sys.argv[0]))
    run(here)
else:
    # when imported (e.g., by Sphinx's conf.py), only compute the directory;
    # the importer can call run(here) itself
    here = os.path.abspath(os.path.dirname(__file__))