This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Simplified data registry and dataset interface #149

Closed
wants to merge 45 commits

Conversation


@3coins 3coins commented Mar 23, 2022

Resolves #145, #146, and #147.

Rekindling the data registry project with simplified DataRegistry and Dataset interfaces. The package has been rewritten from scratch using the latest cookiecutter template, with all the setup for unit and integration tests along with GitHub Actions to aid automated build checks, changelog, and release workflows.

Binder

Install

# Clone the repo to your local environment
git clone -b 3coins-data-registry https://github.com/3coins/jupyterlab-data-explorer

# Change directory to the jupyterlab-data-explorer directory
cd jupyterlab-data-explorer

# Install package in development mode
pip install -e .

# Link your development version of the extension with JupyterLab
jupyter labextension develop . --overwrite

# Server extension must be manually installed in develop mode
jupyter server extension enable jupyterlab_dataregistry

# Build extension
jlpm build

Data Registry provides three main components:

  1. A typed, JSON-based dataset interface to represent any kind of dataset
  2. A data registry API to register, retrieve, and manage datasets
  3. A command registry API to allow extension authors to register user actions for specific dataset types

Data Registry provides both a TypeScript and a Python API for registering new datasets. The TypeScript interface allows data providers to register new datasets via plugins; it also allows extension writers to create commands and associate them with a specific dataset type. The Python API provides an additional way to register datasets inside notebooks; it also enables registering dataset definitions stored in files with the “dataset” extension. Dataset providers can share dataset files or notebooks containing dataset definitions with JupyterLab users.

Dataset Interface

A typed, JSON-based, extensible interface that can be used to represent any kind of dataset. The Dataset interface expects two type parameters: one defining the value and one defining the metadata. Having these as typed values allows creating any kind of dataset. In addition, the abstract data type, storage type, and serialization type attributes are declared as strings, which gives data providers the flexibility to define datasets spanning a vast range of mime types, with extensibility to support any future mime types.

export interface Dataset<T extends JSONValue, U extends JSONValue> {
  /**
   * Unique identifier for the dataset; for in-memory
   * datasets, a unique uuid might be provided.
   * This id should be unique across a Jupyter server instance.
   */
  id: string;
  /**
   * Abstract data type for the dataset, e.g.,
   * tabular, image, text, tabular collection
   */
  abstractDataType: string;
  /**
   * Serialization type for the dataset e.g.,
   * csv, jpg
   */
  serializationType: string;
  /**
   * Storage type for the dataset e.g.,
   * inmemory, file, s3
   */
  storageType: string;
  /**
   * Output value for the dataset
   */
  value: T;
  /**
   * Additional properties for the dataset
   * that help serialize or query data
   */
  metadata: U;
  title: string;
  description: string;
  tags?: Set<string>;
  version?: string;
}

Id

A string that represents the unique identifier for the dataset; it is expected to be unique across a JupyterLab instance.

Abstract Data Type

A string that captures the abstract data type of the dataset, largely defined by the dataset provider. This property represents a very high-level abstraction of the data type which might not conform to a specific mime type but rather provides a more general view of the dataset. Most datasets with features might fall under “tabular” because they have a set of labels/columns with multiple rows of values. Some other examples are “image” to represent a single image, “image-collection” to represent a set of images, and “text” to represent free-form or structured text data.

Serialization Type

A string that captures information about the serialization format of the data. This property represents the specific subtype which can be used to serialize or visualize the data. For example, tabular datasets can be represented as “csv” or “tsv”; images might be “jpeg”, “png”, etc. Some other examples are “text”, “json”, “svg”, and “sql”.

Storage Type

A string that defines how the data is stored, e.g., S3, database, in memory etc.

Value

A nullable value that defines the type of the actual dataset value; e.g., for an in-memory comma-separated file, this might be the actual string content of the data.

Metadata

Defines the type for capturing any metadata associated with the dataset that might help extension writers or JupyterLab users download/serialize the data. For example, for a tabular dataset this might capture the delimiter and line delimiter; for a dataset stored in S3, it might capture the credentials or the folder and object information.

Title

A string that is used largely for display purposes to identify the dataset.

Description

A string that captures more detail about the dataset, so that extension developers and users have more context about what kind of data is stored in the dataset.

Tags

An optional set of arbitrary strings that can be attached to the dataset to aid searching and identification of similar datasets.

Version

A string that defines the version of the dataset.

Here are a few examples of real world datasets expressed using the dataset interface:

In memory dataset in CSV format

{
    id: '47e0c8a6-b49f-46ec-88e7-bbb7be2cbb52',
    abstractDataType: 'tabular',
    serializationType: 'csv',
    storageType: 'inmemory',
    value: 'lovingly photographed in the manner of a golden book sprung to life stuart little 2 manages sweetness largely without stickiness,pos\nconsistently clever and suspenseful,pos\nred dragon never cuts corners,pos',
    metadata: {
      delimiter: ',',
      lineDelimiter: '\n',
    },
    title: 'Rotten Tomatoes Dataset',
    description: 'Movie Review Dataset. This is a dataset containing positive and negative processed sentences from Rotten Tomatoes movie reviews',
    version: '1.0',
}

Directory of images stored in S3

{
    id: 's3://covid19-dataset',
    abstractDataType: 'image-collection',
    serializationType: 'png',
    storageType: 's3',
    value: null,
    metadata: {
        bucket: 'covid19-dataset',
        folders: {
            'train': ['train/Covid', 'train/Normal', 'train/Viral Pneumonia'],
            'test': ['test/Covid', 'test/Normal', 'test/Viral Pneumonia']
        }
    },
    title: 'Covid-19 Image Dataset',
    description: 'This dataset will help deep learning and AI enthusiasts contribute to improving COVID-19 detection using chest X-rays. Data was collected from a publicly released GitHub repository by University of Montreal professors. The pneumonia data has been taken from the RSNA website.',
    version: '2.1',
}

Dataset stored in a SQL database

{
    id: 'f6712a3c-c902-412b-9a0f-e5e581e21739',
    abstractDataType: 'tabular',
    serializationType: 'mysql',
    storageType: 'database',
    value: null,
    metadata: {
        database: {
            host: 'localhost',
            port: '3306',
            name: 'jobs-data'
        },
        table: 'usa-ds-jobs'
    },
    title: 'Data Scientist Job Market in the U.S.',
    description: 'This dataset provides job data for open data scientist positions in the U.S. posted on Indeed.com. The information collected includes company name, position, location, description, and number of reviews. The questions this dataset helps answer are: 1) What skills, tools, and majors are most sought after? 2) Which location has the most opportunities? 3) What is the difference between a data scientist, a data engineer, and a data analyst?',
    version: '1.0',
}

Note: The above examples describe “datasets” as opposed to “datasources”. However, the dataset interface could easily be applied to datasources as well. For example, a dataset definition representing a collection of tables might use an abstractDataType of “sql-tabular”, with the relevant connection/serialization details in the metadata property.
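
For illustration, a hypothetical datasource entry of this kind might look like the following (the id, database name, and table names are made up for the example):

{
    id: 'c7d541f2-8062-4d9c-9b93-6b4f09a7e3aa',
    abstractDataType: 'sql-tabular',
    serializationType: 'mysql',
    storageType: 'database',
    value: null,
    metadata: {
        database: {
            host: 'localhost',
            port: '3306',
            name: 'sales'
        },
        tables: ['orders', 'customers', 'products']
    },
    title: 'Sales Database',
    description: 'A datasource spanning multiple related tables in a MySQL database.',
    version: '1.0',
}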

Data Registry API

The data registry aims to catalog datasets in a JupyterLab environment and provides APIs to register, update, and retrieve datasets. These are the core APIs that will be shared with other extensions to help them manage datasets. The data registry provides the following APIs:

export interface IDataRegistry {
  /**
   * Registers a dataset. Use {@link Registry#updateDataset}
   * to update a registered dataset.
   *
   * @param dataset The dataset to register
   * @throws Throws an error if a dataset with the
   * same id and version already exists.
   */
  registerDataset<T extends JSONValue, U extends JSONValue>(
    dataset: Dataset<T, U>
  ): void;

  /**
   * Updates a registered dataset and bumps up the version.
   *
   * @param dataset The dataset to update
   * @throws Throws an exception if any of abstractDataType,
   * serializationType, or storageType differ from the
   * registered values.
   */
  updateDataset<T extends JSONValue, U extends JSONValue>(
    dataset: Dataset<T, U>
  ): void;

  /**
   * Returns the last registered version of a dataset if no version is passed.
   *
   * @param id unique id the dataset was registered with
   * @param version optional, specific dataset version
   * @throws Will throw an error if no matching dataset is found
   */
  getDataset<T extends JSONValue, U extends JSONValue>(
    id: string,
    version?: string
  ): Dataset<T, U>;

  /**
   * Returns the dataset signal for subscribing to changes in a dataset.
   * See {@link https://jupyterlab.github.io/lumino/signaling/classes/signal.html|Signal}
   * to learn more about using signals to subscribe to dataset changes.
   *
   * @param id unique id used to register the dataset
   * @returns {@link https://jupyterlab.github.io/lumino/signaling/classes/signal.html|Signal} signal instance associated with the dataset
   * @throws {Error} Will throw an error if no dataset exists with the passed id
   */
  getDatasetSignal<T extends JSONValue, U extends JSONValue>(
    id: string
  ): Signal<any, Dataset<T, U>>;

  /**
   * Returns true if the dataset exists, false otherwise.
   *
   * @param id unique id that was used to register the dataset
   * @param version version that the dataset was registered with
   * @returns {boolean} true if a matching dataset exists
   */
  hasDataset<T extends JSONValue, U extends JSONValue>(
    id: string,
    version?: string
  ): boolean;

  /**
   * Returns the list of datasets that match the passed abstract data type,
   * serialization type, and storage type.
   *
   * @param abstractDataType abstract data type to match
   * @param serializationType serialization type to match
   * @param storageType storage type to match
   * @returns {Dataset[]|[]} list of matching datasets
   */
  queryDataset(
    abstractDataType?: string,
    serializationType?: string,
    storageType?: string
  ): Dataset<any, any>[] | [];

  /**
   * Registers a command for datasets having a given abstract data type,
   * serialization type, and storage type. This is useful for extension
   * writers to associate specific dataset types with commands/actions
   * that their extensions support.
   *
   * @param commandId unique id of the command registered with the command registry
   * @param abstractDataType abstract data type
   * @param serializationType serialization type
   * @param storageType storage type
   */
  registerCommand(
    commandId: string,
    abstractDataType: string,
    serializationType: string,
    storageType: string
  ): void;

  /**
   * Gets the list of commands registered for a specific set of
   * abstract data type, serialization type, and storage type.
   * This is useful for extension writers to obtain only those commands
   * that have been previously registered with the dataset type.
   *
   * @param abstractDataType abstract data type
   * @param serializationType serialization type
   * @param storageType storage type
   * @returns {Set<string>|[]} set of registered commands
   */
  getCommands(
    abstractDataType: string,
    serializationType: string,
    storageType: string
  ): Set<string> | [];

  /**
   * This signal provides subscription to the
   * event when any dataset is registered.
   */
  readonly datasetAdded: Signal<any, Dataset<any, any>>;

  /**
   * This signal provides subscription to the
   * event when any updates are made to datasets.
   */
  readonly datasetUpdated: Signal<any, Dataset<any, any>>;

  /**
   * This signal provides subscription to the
   * event when any command is registered.
   */
  readonly commandAdded: Signal<any, string>;
}
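
As a minimal sketch of how a consumer might use this API (assuming a registry instance obtained through the IDataRegistry token, as in the plugin example in the next section), an extension could subscribe to the datasetAdded signal and look up a dataset by id:

// Minimal consumer sketch; `registry` is assumed to be an IDataRegistry
// instance injected into a plugin (see the example in the next section).
registry.datasetAdded.connect((_sender, dataset) => {
  console.log(`Dataset registered: ${dataset.id} (${dataset.abstractDataType})`);
});

// Guard with hasDataset before fetching the latest version of a dataset.
if (registry.hasDataset('s3://covid19-dataset')) {
  const ds = registry.getDataset('s3://covid19-dataset');
  console.log(ds.title);
}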

Registering new datasets

Use the TypeScript API

Plugins/extensions can access the data registry object by adding a dependency on the IDataRegistry token. Plugins can add new datasets by using the “registerDataset” API from the data registry. Here is an example of a plugin that registers a dataset.

const datasets: JupyterFrontEndPlugin<void> = {
  id: '@jupyterlab/datasets:plugin',
  autoStart: true,
  requires: [IDataRegistry],
  activate: (
    app: JupyterFrontEnd,
    registry: IDataRegistry
  ) => {
    interface ICSVMetadata extends JSONObject {
      delimiter: string;
      lineDelimiter: string;
    }
    interface IS3CSVMetadata extends ICSVMetadata {
      bucket: string;
      filename: string;
    }
    registry.registerDataset<JSONValue, IS3CSVMetadata>({
      id: 's3://bucket/filename',
      abstractDataType: 'tabular',
      serializationType: 'csv',
      storageType: 's3',
      value: null,
      metadata: {
        delimiter: ',',
        lineDelimiter: '\n',
        bucket: 'bucket',
        filename: 'filename',
      },
      title: 'CSV S3 Dataset',
      description: 'CSV in S3 dataset',
      version: '1.0',
    });
  }
};

Use a Notebook (Python API)

In addition to the TypeScript API, a Python Dataset class is provided to allow JupyterLab users to register datasets within a notebook. Creating a new instance of the Dataset class within a notebook cell and executing the cell will invoke the registration API. Here is an example of using the Python API to register a new dataset.

from jupyterlab_dataregistry import dataset
ds = dataset.Dataset(
    id="s3://airport-data",
    abstract_data_type="tabular",
    serialization_type="tsv",
    storage_type="s3",
    title="Most Crowded Airports",
    description="The dataset contains 250 different airports and each airport has the following attributes: rank, name, location, country, code, passengers, year",
    value=None,
    metadata={
        "delimiter": "\t",
        "lineDemiliter": "\n"
    }
)

Use a dataset file

Another way for users to register datasets is to create a file with the “dataset” extension containing the dataset definition in JSON. The file may hold either an object with a single dataset definition or an array of definitions to register multiple datasets. Opening the file will register all datasets defined inside it. Here is an example of datasets defined in a dataset file.

[
    {
        "id": "100",
        "abstractDataType": "tabular",
        "storageType": "inmemory",
        "value" : {
            "value": "header1,header2\nvalue1,value2"
        },
        "metadata": {
            "delimiter": ",",
            "lineDelimiter": "\n"
        },
        "title": "CSV in memory dataset 1",
        "description": "CSV in memory dataset 1",
        "version": "1.0"
    },
    {
        "id": "200",
        "abstractDataType": "tabular",
        "storageType": "inmemory",
        "value" : {
            "value": "header1,header2\nvalue3,value4"
        },
        "metadata": {
            "delimiter": ",",
            "lineDelimiter": "\n"
        },
        "title": "CSV in memory dataset 2",
        "description": "CSV in memory dataset 2",
        "version": "1.0"
    }
]

Attach actions to datasets (Command Registry API)

The data registry provides APIs to register commands for specific dataset types and to retrieve the set of commands registered for those dataset types. This is useful for populating a variety of action-based widgets that perform actions associated with loading, visualizing, or managing specific datasets.

Data providers can register commands for a specific dataset type by using the “registerCommand” API.

registry.registerCommand(
    'render-csv', 'tabular', 'csv', 'inmemory'
);

Extension writers can use the “getCommands” API to get all commands registered for a specific combination of abstract data, serialization, and storage types, and can use this list of commands to bind actions to specific datasets. Specific datasets can be queried using the “queryDataset” API. Here is an example of an extension that adds a panel to the JupyterLab launcher for each dataset registered with the “render-csv” command.

const commandId = 'render-csv';
const commands = registry.getCommands(
    'tabular', 'csv', 'inmemory'
);
// getCommands returns a Set, so check membership rather than using Array#find
if ([...commands].includes(commandId)) {
    const datasets = registry.queryDataset(
        'tabular', 'csv', 'inmemory'     
    );
    
    datasets.forEach((dataset) => {
        app.commands.addCommand(`${dataset.id}`, {
            label: `${dataset.id}`,
            execute: (args) => {
                const {value} = args.dataset;
                alert(value);
            }
        });
        
        // Adds a new panel in launcher which will 
        // present an alert with dataset value
        launcher.add({
            command: `${dataset.id}`,
            category: 'Datasets',
            args: {
                dataset
            }
        });
    });
}

My Datasets UI

The data registry extension adds a dataset explorer panel to the JupyterLab UI that allows users to view and interact with all registered datasets within a single JupyterLab instance. The extension tracks dataset additions and updates via the data registry within a JupyterLab session.

[screenshot: my-datasets-ui]

This widget also provides a context menu that allows executing the registered commands/actions associated with a dataset.

[screenshot: my-datasets-context-menu]

The “Add Data” button allows users to register a new dataset by auto-populating a dataset-creation template in a notebook cell; the user can edit and customize this template to register a new dataset. This feature uses the Python API for registering a new dataset: executing the notebook cell registers the dataset defined inside the cell.

[screenshot: my-datasets-add-data]

Open Questions

Types for abstract data, serialization, and storage

The properties Abstract Data Type, Serialization Type, and Storage Type in the Data Registry API are all “string” types at the moment, so there is no schema that can be enforced on these values. There are several options for controlling these values. Here are some proposed solutions:

Record in documentation

Document and maintain all types in a repository as part of documentation.

PROS

  • Easy to maintain; new additions just need a documentation update, no code change required.

CONS

  • It is easy for API users to misspell values in the code, which might end up creating datasets that have diverged from the documentation.
  • Since there is no enforcement, dataset providers might not conform to documented values, which could end up cluttering the lab environment with duplicated values.

Use Enums

Codify types in the data registry code as TypeScript enum values so these values are strongly typed.
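
For illustration, here is a minimal sketch of what codified types could look like; the member lists below are made up for the example, not a proposed final set:

// Illustrative only; the concrete set of values is exactly the open question here.
enum AbstractDataType {
  Tabular = 'tabular',
  Image = 'image',
  ImageCollection = 'image-collection',
  Text = 'text'
}

enum StorageType {
  InMemory = 'inmemory',
  File = 'file',
  S3 = 's3',
  Database = 'database'
}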

PROS

  • Ensures that API users don’t diverge from the allowed values.
  • Easy for API users to follow, as the API will enforce these values and code-completion hints will show which values are available.

CONS

  • Adding new values needs a code change and a release cycle. This might slow down API users and adds some friction to adoption of the API.

Build an API

Provide an API that allows data providers and extension writers to add new type values.
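
A rough sketch of what such an API could look like; the class name and normalization rule below are hypothetical, not part of the current proposal:

// Hypothetical type-registration API; illustrative only.
class TypeRegistry {
  private _abstractDataTypes = new Set<string>();

  // Normalizes values so that e.g. 'tabular' and 'TABULAR' are one type.
  registerAbstractDataType(value: string): void {
    this._abstractDataTypes.add(value.trim().toLowerCase());
  }

  getAbstractDataTypes(): Set<string> {
    return new Set(this._abstractDataTypes);
  }
}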

PROS

  • Provides flexibility and a programmatic way to add new values.
  • Will avoid duplication of values.
  • Can help with normalization of values, e.g., “tabular” and “TABULAR” should not be two different types.

CONS

  • As this provides only programmatic access, it is not easy to see the existing values.
  • As there is no human supervision, similar values could still be added. For example, “csv” and “comma-sep” might signify the same serialization type.

Types (Generics) for value and metadata in Dataset

The Dataset interface allows JSON-typed values for the “value” and “metadata” properties. This is currently open to any interface/type, which introduces a chance of duplicated and redundant types. There is some benefit in standardizing these values for common dataset types. Here are a few approaches to tackle this.

Record interfaces in documentation

Add documentation for these interfaces for different dataset types.

PROS

  • Easy to maintain; new additions just need a documentation update, no code change required.

CONS

  • It is easy for API users to misspell values in the code, which might end up creating interfaces that have diverged from the documentation.
  • Since there is no enforcement, dataset creators might not conform to documented values, which could end up cluttering the lab environment with duplicated values.

Define TypeScript interfaces

Add interfaces, and allow only these interfaces to be used in the data registry.

PROS

  • Ensures that API users don’t diverge from the defined interfaces.
  • Easy for API users to follow, as the API will enforce these values and code-completion hints will show which properties are allowed.

CONS

  • Adding a new interface, or updating an existing one, needs a code change and a release cycle. This might slow down API users and adds some friction to adoption of the API.

Hybrid approach

Define interfaces for certain dataset types, but also allow others to be documented and added within the registry API.

PROS

  • Allows flexibility of adding new dataset types and documenting them for common use.
  • Provides structure and adherence to commonly used dataset types, so that users don’t have to re-invent these interfaces.
  • As new interfaces mature, they could be moved into the code to improve adherence to a single definition of these interfaces.

CONS

  • For 100% adherence, a code change is required to add new interfaces to the registry API.
  • Needs governance around which new interfaces could be moved into the code.

@3coins
Author

3coins commented Mar 23, 2022

@vidartf @telamonian
I am looking for some initial feedback on the updated data registry project. This version has two main components: a Dataset interface, and a DataRegistry API to manage datasets in a lab environment. It also includes a "My Datasets" UI, which shows all registered datasets. The Binder link should provide these components and let you create datasets via a file, in a front-end extension, or using the Dataset class inside a cell. I wanted a quick way for users to try this, so I have bundled everything into one extension.

I am also contemplating these additional tasks:

  • Add a data registry server-side API, so the dataset metadata can be persisted
  • Enable users to register a dataset when they create a new dataframe (pandas, dask, etc.)
  • Contextual “My Datasets”, so that users only see datasets in the current notebook

I have tried to capture the original use cases along with these ideas in user stories added in the next section.

cc @ellisonbg


Data Registry User Stories

  • As an end user in JupyterLab working with datasets.
    • When, I am working with datasets in a notebook
      • I would like JupyterLab to let me add them to “My Datasets”, so I can use them later
        • When, I am working with in memory datasets in a notebook (pandas, numpy, etc.)
          • I would like JupyterLab to be aware of those datasets, and let me add them to “My Datasets” using a UI, so I can use them later and see other things I might do with them.
          • I would also like to be able to quickly add them to “My Datasets” from Python explicitly.
        • When, I am working with remote data sources/sets in a notebook
          • I would like to be able to quickly add them to “My Datasets” from Python.
        • When, I am working with in memory or remote datasets in a notebook
          • I would like to optionally see what else I could do with the data in other extensions
    • When, I am working with a dataset in another JupyterLab extension
      • I would like to see what else I could do with the dataset in other extensions and take action on that
      • I would like to be able to add the dataset to “My Datasets”
    • When, I am done using a dataset in “My Datasets”
      • I would like the dataset to be archived/deleted, so that it doesn’t appear in the list of “My Datasets”
    • When, I am working with a dataset in one JupyterLab extension and I want to use another JupyterLab extension to work with the dataset
      • I would like the extension to be able to access and work seamlessly with the dataset
      • I would like the extension to be able to carry the dataset changes from previous extension interactions
    • When I want to use an archived/deleted dataset again
      • I would like to see this dataset in the list of “My Datasets”
    • When I want to find a dataset that I have already worked with
      • I would like to be able to search the dataset by name and other attributes/tags
    • When I am working with “My Datasets” in the data explorer
      • I would like to be notified when the dataset is updated (version changes)
  • As a developer of a JupyterLab extension that inputs and/or outputs datasets (Notebooks)
    • When, I have an output dataset
      • I would like to show my user what else they could do with the output dataset in other JupyterLab extensions or notebooks and select one of those options in a simple UI, so I don’t have to build explicit integration with all of those extensions.
      • I would like to allow my user to choose to add the output dataset to “My Datasets”.
    • When I am building support for different types of input data
      • I would like to tell other extensions that I can do something with those types of data, so they can pass those types of data to me without my intervention and without me building explicit integration with all of those extensions.
    • When I am integrating with other extensions
      • I would like to interface with a central dataset API rather than all of the other extensions, so I don’t have to repeat the work of integration.
    • When a user imports a new dataset
      • I would like my extension to get notified, so that I can do something meaningful with the dataset, for example show in a list of “My Datasets”.
    • When another JupyterLab extension adds a new action for a dataset
      • I would like my extension to get notified so that I can add a new option for the dataset, so my end user can use this option to perform the new action
    • When my user is done using the dataset
      • I would like my extension to be able to remove this dataset from “My Datasets”
    • When my end user wants to reuse an archived/deleted dataset
      • I would like my extension to be able to add this dataset to “My Datasets”
    • When a dataset is updated
      • I would like my extension to be notified, so that my extension can update the dataset info and user options related to this dataset
  • As a developer of a data import experience in JupyterLab:
    • When a user picks a dataset they are interested in working with
      • I would like to show my user what else they could do with the output dataset in other JupyterLab extensions or notebooks and select one of those options in a simple UI, so I don’t have to build explicit integration with all of those extensions.
      • I would like to allow my user to choose to add the output dataset to “My Datasets”.

@aiqc

aiqc commented Mar 23, 2022

I can spend some time playing around with this next week (family vacation this week).

At a glance:

  • Record a path/ location?
  • Are there any plans to record metadata about column names and dtypes (pandas, numpy, parquet) and table/array dimensions?
  • Pickle and dill as supported persistent formats (is that what inmemory is)?
  • Persist a default query to fetch the dataset?
  • Dependencies required for interacting w data?

permanently splitting datasets into train/test ahead of time is bad practice

@3coins
Author

3coins commented Mar 24, 2022

@aiqc
Thanks for looking into this PR.

Record a path/ location?
Are there any plans to record metadata about column names and dtypes (pandas, numpy, parquet) and table/array dimensions?
Persist a default query to fetch the dataset?
Dependencies required for interacting w data?

For path/location, one option is to use the "id" property. The interface so far is flexible enough to allow any JSON-based schema in the "metadata" and "value" properties. It is expected that implementors will use these to come up with a variety of schemas, and we can formalize some of them.

For example, we could define a strict schema for a dataset stored as a tab-separated file in S3, with column names, dtypes, etc. I would define this schema like this:

interface S3 {
    bucket: string
    object: string
}

interface DataColumn {
    name: string
    dtype: string
}

interface ITabSeparated {
    lineDelimiter: string,
    colDelimiter: string,
    columns: DataColumn[]
}

interface IS3TsvMetadata {
    storage: S3,
    serialization: ITabSeparated
} 

// Use the above schema to register a dataset
registry.registerDataset<JSONValue, IS3TsvMetadata>({
    id: "s3://datasets/covid19-dataset",
    abstractDataType: "tabular",
    storageType: "s3",
    serializationType: "tsv",
    value: null,
    metadata: {
        storage: {
            bucket: "datasets",
            object: "covid19-dataset"    
        },
        serialization: {
            lineDelimiter: "\n",
            colDelimiter: "\t",
            columns: [
                {
                    "name": "country",
                    "dtype": "string" 
                },
                {
                    "name": "state",
                    "dtype": "string" 
                },
                {
                    "name": "cases",
                    "dtype": "number"
                },
                {
                    "name": "reported",
                    "dtype": "datetime"
                }   
            ]
        }
    }
});

Pickle and dill as supported persistent formats (is that what inmemory is)?

"inmemory" here just signified a non-remote dataset, dataframes created from pandas for example could be declared "inmemory". This definition here doesn't inherently do anything to support specific formats/libraries, this is merely a way to specify what the storage type is. It is totally open to the implementor to handle specific storage formats. However, I am looking into how to allow users to register datasets directly from the cell when they create a pandas dataframe, we can discuss more if pickle and dill should be supported in this context.

Persist a default query to fetch the dataset?

Can you elaborate on this?

permanently splitting datasets into train/test ahead of time is bad practice

Agreed, the example I had was just a representation of a dataset with multiple folders/files.

@romeokienzler

@3coins as discussed, I've done some edits, >>>>highlighted<<<<

Data Registry User Stories

* As an end user in JupyterLab working with datasets.
  
  * When, I am working with datasets in a notebook
    
    * I would like JupyterLab to let me add them to “My Datasets”, so I can use them later
      
      * When, I am working with in memory datasets in a notebook (pandas, numpy, etc.)
        
        * I would like JupyterLab to be aware of those datasets, and let me add them to “My Datasets” using a UI, so I can use them later and see other things I might do with them.
        * I would also like to be able to quickly add them to “My Datasets” from Python explicitly.
        * >>>>I'd like to have a view of all in-memory data sets in all my kernels (like RStudio)<<<<
      * When, I am working with remote data sources/sets in a notebook
        
        * I would like to be able to quickly add them to “My Datasets” from Python.
      * When, I am working with in memory or remote datasets in a notebook
        
        * I would like to optionally see what else I could do with the data in other extensions
             * >>>>for example, launching a SQL query editor and executing queries from there<<<<
             * >>>>triggering an ML wizard which allows selecting feature columns and target columns, and triggering ML/DL training from there (using the same wizard, pushing e.g. to a pipeline)<<<<
  * When, I am working with a dataset in another JupyterLab extension
    
    * I would like to see what else I could do with the dataset in other extensions and take action on that
    * I would like to be able to add the dataset to “My Datasets”
  * When, I am done using a dataset in “My Datasets”
    
    * I would like the dataset to be archived/deleted, so that it doesn’t appear in the list of “My Datasets”
  * When, I am working with a dataset in one JupyterLab extension and I want to use another JupyterLab extension to work with the dataset
    
    * I would like the extension to be able to access and work seamlessly with the dataset
    * I would like the extension to be able to carry the dataset changes from previous extension interactions
  * When I want to use an archived/deleted dataset again
    
    * I would like to see this dataset in list of “My Datasets”
  * When I want to find a dataset that I have already worked with
    
    * I would like to be able to search the dataset by name and other attributes/tags
  * When I am working with “My Datasets” in the data explorer
    
    * I would like to be notified when the dataset is updated (version changes)

* As a developer of a JupyterLab extension that inputs and/or outputs datasets (Notebooks)
  
  * When, I have an output dataset
    
    * I would like to show my user what else they could do with the output dataset in other JupyterLab extensions or notebooks and select one of those options in a simple UI, so I don’t have to build explicit integration with all of those extensions.
    * I would like to allow my user to choose to add the output dataset to “My Datasets”.
  * When I am building support for different types of input data
    * >>>>I'd like to be able to share my data set with others via ml-exchange.org, feast, or any other feature store<<<<
    
    * I would like to tell other extensions that I can do something with those types of data, so they can pass those types of data to me without my intervention and without me building explicit integration with all of those extensions.
  * When I am integrating with other extensions
    
    * I would like to interface with a central dataset API rather than all of the other extensions, so I don’t have to repeat the work of integration.
  * When a user imports a new dataset
    
    * I would like my extension to get notified, so that I can do something meaningful with the dataset, for example show in a list of “My Datasets”.
  * When another JupyterLab extension adds a new action for a dataset
    
    * I would like my extension to get notified so that I can add a new option for the dataset, so my end user can use this option to perform the new action
  * When my user is done using the dataset
    
    * I would like my extension to be able to remove this dataset from “My Datasets”
  * When my end user wants to reuse an archived/deleted dataset
    
    * I would like my extension to be able to add this dataset to “My Datasets”
  * When a dataset is updated
    
    * I would like my extension to be notified, so that my extension can update the dataset info and user options related to this dataset

* As a developer of a data import experience in JupyterLab:
  
  * When a user picks a dataset they are interested in working with
    
    * I would like to show my user what else they could do with the output dataset in other JupyterLab extensions or notebooks and select one of those options in a simple UI, so I don’t have to build explicit integration with all of those extensions.
    * I would like to allow my user to choose to add the output dataset to “My Datasets”.

@ckadner

ckadner commented May 19, 2022

@3coins -- are there any existing data set providers/registries that you think should implement/support this new interface?

In MLX we have datasets in the catalog. We are considering if we should implement this new dataset interface.

@3coins
Author

3coins commented May 20, 2022

@ckadner

@3coins -- are there any existing data set providers/registries that you think should implement/support this new interface?

In MLX we have datasets in the catalog. We are considering if we should implement this new dataset interface.

This project is still a work in progress. That said, I think we can explore how this applies to the datasets you mentioned here, and identify if there are any gaps. One key point to consider here is that DataRegistry is not intended to serve as a catalog, but rather act as a central artifact to manage "My Datasets" (Datasets I am working with) within a JupyterLab instance.

@fcollonval fcollonval closed this Aug 8, 2023