This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Simplified data registry and dataset interface #149

Closed
wants to merge 45 commits

Conversation


@3coins 3coins commented Mar 23, 2022

Resolves #145, #146, and #147.

Rekindling the data registry project with simplified DataRegistry and Dataset interfaces. The package has been rewritten from scratch using the latest cookiecutter template, with all the setup for unit and integration tests along with GitHub Actions to aid automated build checks, changelog, and release workflows.

Binder

Install

# Clone the repo to your local environment
git clone -b 3coins-data-registry https://github.com/3coins/jupyterlab-data-explorer

# Change directory to the jupyterlab-data-explorer directory
cd jupyterlab-data-explorer

# Install package in development mode
pip install -e .

# Link your development version of the extension with JupyterLab
jupyter labextension develop . --overwrite

# Server extension must be manually installed in develop mode
jupyter server extension enable jupyterlab_dataregistry

# Build extension
jlpm build

Data Registry provides three main components:

  1. A typed, JSON-based dataset interface to represent any kind of dataset
  2. A data registry API to register, retrieve, and manage datasets
  3. A command registry API to allow extension authors to register user actions for specific dataset types

Data Registry provides both a TypeScript and a Python API for registering new datasets. The TypeScript interface allows data providers to register new datasets via plugins; it also allows extension writers to create commands and associate them with a specific dataset type. The Python API provides an additional way to register datasets inside notebooks; it also enables registering dataset definitions stored in files with the “dataset” extension. Dataset providers can share dataset files or notebooks containing dataset definitions with JupyterLab users.

Dataset Interface

A typed, JSON-based, extensible interface that can be used to represent any kind of dataset. The Dataset interface expects two type parameters: one defining the value and one defining the metadata. Having these as typed values allows creating any kind of dataset. In addition, the abstract data type, storage type, and serialization type attributes are declared as strings, which gives data providers the flexibility to define datasets spanning a vast range of mime types, with extensibility to support any future mime types.

export interface Dataset<T extends JSONValue, U extends JSONValue> {
  /**
   * Unique identifier for the dataset; for in-memory
   * datasets, a unique uuid might be provided.
   * This id should be unique across a Jupyter server instance.
   */
  id: string;
  /**
   * Abstract data type for the dataset, e.g.,
   * tabular, image, text, tabular collection
   */
  abstractDataType: string;
  /**
   * Serialization type for the dataset e.g.,
   * csv, jpg
   */
  serializationType: string;
  /**
   * Storage type for the dataset e.g.,
   * inmemory, file, s3
   */
  storageType: string;
  /**
   * Output value for the dataset
   */
  value: T;
  /**
   * Additional properties for the dataset
   * that help serialize or query data
   */
  metadata: U;
  title: string;
  description: string;
  tags?: Set<string>;
  version?: string;
}

Id

A string that represents the unique identifier for the dataset; it is expected to be unique across a JupyterLab instance.

Abstract Data Type

A string that captures the abstract data type of the dataset, largely defined by the dataset provider. This property represents a very high-level abstraction of the data type which might not conform to a specific mime type but rather provides a more general view of the dataset. Most datasets with features might fall under “tabular” because they have a set of labels/columns with multiple rows of values. Some other examples are “image” to represent a single image, “image-collection” to represent a set of images, and “text” to represent free-form or structured text data.

Serialization Type

A string that captures information about the serialization format of the data. This property represents the specific subtype which can be used to serialize or visualize the data. For example, tabular datasets can be represented as “csv” or “tsv”; images might be “jpeg”, “png”, etc. Some other examples are “text”, “json”, “svg”, and “sql”.

Storage Type

A string that defines how the data is stored, e.g., S3, database, in memory etc.

Value

A nullable value that defines the type of the actual dataset value; e.g., for an in-memory comma-separated file, this might be the actual string content of the data.

Metadata

Defines the type for capturing any metadata associated with the dataset that might help extension writers or JupyterLab users download/serialize the data. For example, for a tabular dataset this might capture the delimiter and line delimiter; for a dataset stored in S3, it might capture the credentials or the folder and object information.

Title

A string that is used largely for display purposes to identify the dataset.

Description

A string that captures more detail about the dataset, so that extension developers and users have more context about what kind of data is stored in the dataset.

Tags

An optional set of arbitrary strings that can be attached to the dataset to aid searching and identification of similar datasets.

Version

A string that defines the version of the dataset.

Here are a few examples of real world datasets expressed using the dataset interface:

In memory dataset in CSV format

{
    id: '47e0c8a6-b49f-46ec-88e7-bbb7be2cbb52',
    abstractDataType: 'tabular',
    serializationType: 'csv',
    storageType: 'inmemory',
    value: 'lovingly photographed in the manner of a golden book sprung to life stuart little 2 manages sweetness largely without stickiness,pos\nconsistently clever and suspenseful,pos\nred dragon never cuts corners,pos',
    metadata: {
      delimiter: ',',
      lineDelimiter: '\n',
    },
    title: 'Rotten Tomatoes Dataset',
    description: 'Movie Review Dataset. This is a dataset containing positive and negative processed sentences from Rotten Tomatoes movie reviews',
    version: '1.0',
}

Directory of images stored in S3

{
    id: 's3://covid19-dataset',
    abstractDataType: 'image-collection',
    serializationType: 'png',
    storageType: 's3',
    value: null,
    metadata: {
        bucket: 'covid19-dataset',
        folders: {
            'train': ['train/Covid', 'train/Normal', 'train/Viral Pneumonia'],
            'test': ['test/Covid', 'test/Normal', 'test/Viral Pneumonia']
        }
    },
    title: 'Covid-19 Image Dataset',
    description: 'This dataset will help deep learning and AI enthusiasts contribute to improving COVID-19 detection using chest X-rays. Data was collected from a publicly released GitHub repository by University of Montreal professors. The pneumonia data has been taken from the RSNA website.',
    version: '2.1',
}

Dataset stored in a SQL database

{
    id: 'f6712a3c-c902-412b-9a0f-e5e581e21739',
    abstractDataType: 'tabular',
    serializationType: 'mysql',
    storageType: 'database',
    value: null,
    metadata: {
        database: {
            host: 'localhost',
            port: '3306',
            name: 'jobs-data'
        },
        table: 'usa-ds-jobs'
    },
    title: 'Data Scientist Job Market in the U.S.',
    description: 'This dataset provides job data for open data scientist positions in the U.S. posted on Indeed.com. The information collected includes company name, position, location, description, and number of reviews. The questions this dataset helps answer are: 1) What skills, tools, and majors are most sought after? 2) Which location has the most opportunities? 3) What is the difference between a data scientist, a data engineer, and a data analyst?',
    version: '1.0',
}

Note: The above examples describe “datasets” as opposed to “datasources”. However, the dataset interface could easily be applied to datasources as well. For example, a dataset definition representing a collection of tables might use an abstractDataType of “sql-tabular”, with the relevant connection/serialization details in the metadata property.
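
For illustration, a hypothetical datasource entry of this kind might look like the following (the id, database name, and table names are made up for the example):

{
    id: 'c7d541f2-8062-4d9c-9b93-6b4f09a7e3aa',
    abstractDataType: 'sql-tabular',
    serializationType: 'mysql',
    storageType: 'database',
    value: null,
    metadata: {
        database: {
            host: 'localhost',
            port: '3306',
            name: 'sales'
        },
        tables: ['orders', 'customers', 'products']
    },
    title: 'Sales Database',
    description: 'A datasource spanning multiple related tables in a MySQL database.',
    version: '1.0',
}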

Data Registry API

The data registry aims to catalog datasets in a JupyterLab environment and provides APIs to register, update, and retrieve datasets. These are the core APIs that will be shared with other extensions to help them manage datasets. The data registry provides the following APIs:

export interface IDataRegistry {
  /**
   * Registers a dataset. Use {@link Registry#updateDataset}
   * to update a registered dataset.
   *
   * @param dataset The dataset to register
   * @throws Throws an error if a dataset with the
   * same id and version already exists.
   */
  registerDataset<T extends JSONValue, U extends JSONValue>(
    dataset: Dataset<T, U>
  ): void;

  /**
   * Updates a registered dataset and bumps up the version.
   *
   * @param dataset The dataset to update
   * @throws Throws an exception if any of abstractDataType,
   * serializationType, or storageType differ from the
   * registered values.
   */
  updateDataset<T extends JSONValue, U extends JSONValue>(
    dataset: Dataset<T, U>
  ): void;

  /**
   * Returns the last registered version of a dataset if no version is passed.
   *
   * @param id unique id the dataset was registered with
   * @param version optional, specific dataset version
   * @throws Will throw an error if no matching dataset is found
   */
  getDataset<T extends JSONValue, U extends JSONValue>(
    id: string,
    version?: string
  ): Dataset<T, U>;

  /**
   * Returns the dataset signal for subscribing to changes in a dataset.
   * See {@link https://jupyterlab.github.io/lumino/signaling/classes/signal.html|Signal}
   * to learn more about using signals to subscribe to dataset changes.
   *
   * @param id unique id used to register the dataset
   * @returns {@link https://jupyterlab.github.io/lumino/signaling/classes/signal.html|Signal} signal instance associated with the dataset
   * @throws {Error} Will throw an error if no dataset exists with the passed id
   */
  getDatasetSignal<T extends JSONValue, U extends JSONValue>(
    id: string
  ): Signal<any, Dataset<T, U>>;

  /**
   * Returns true if the dataset exists, false otherwise.
   *
   * @param id unique id that was used to register the dataset
   * @param version version that the dataset was registered with
   * @returns {boolean} true if a matching dataset exists
   */
  hasDataset<T extends JSONValue, U extends JSONValue>(
    id: string,
    version?: string
  ): boolean;

  /**
   * Returns the list of datasets that match the passed abstract data type,
   * serialization type, and storage type.
   *
   * @param abstractDataType abstract data type to match
   * @param serializationType serialization type to match
   * @param storageType storage type to match
   * @returns {Dataset[]|[]} list of matching datasets
   */
  queryDataset(
    abstractDataType?: string,
    serializationType?: string,
    storageType?: string
  ): Dataset<any, any>[] | [];

  /**
   * Registers a command for datasets having a given abstract data type,
   * serialization type, and storage type. This is useful for extension
   * writers to associate specific dataset types with commands/actions
   * that their extensions support.
   *
   * @param commandId unique id of the command registered with the command registry
   * @param abstractDataType abstract data type
   * @param serializationType serialization type
   * @param storageType storage type
   */
  registerCommand(
    commandId: string,
    abstractDataType: string,
    serializationType: string,
    storageType: string
  ): void;

  /**
   * Gets the list of commands registered for a specific set of
   * abstract data type, serialization type, and storage type.
   * This is useful for extension writers to obtain only those commands
   * that have been previously registered with the dataset type.
   *
   * @param abstractDataType abstract data type
   * @param serializationType serialization type
   * @param storageType storage type
   * @returns {Set<string>|[]} set of registered commands
   */
  getCommands(
    abstractDataType: string,
    serializationType: string,
    storageType: string
  ): Set<string> | [];

  /**
   * This signal provides subscription to the
   * event when any dataset is registered.
   */
  readonly datasetAdded: Signal<any, Dataset<any, any>>;

  /**
   * This signal provides subscription to the
   * event when any updates are made to datasets.
   */
  readonly datasetUpdated: Signal<any, Dataset<any, any>>;

  /**
   * This signal provides subscription to the
   * event when any command is registered.
   */
  readonly commandAdded: Signal<any, string>;
}
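
As a minimal sketch of how a consumer might use this API (assuming a registry instance obtained through the IDataRegistry token, as in the plugin example in the next section), an extension could subscribe to the datasetAdded signal and look up a dataset by id:

// Minimal consumer sketch; `registry` is assumed to be an IDataRegistry
// instance injected into a plugin (see the example in the next section).
registry.datasetAdded.connect((_sender, dataset) => {
  console.log(`Dataset registered: ${dataset.id} (${dataset.abstractDataType})`);
});

// Guard with hasDataset before fetching the latest version of a dataset.
if (registry.hasDataset('s3://covid19-dataset')) {
  const ds = registry.getDataset('s3://covid19-dataset');
  console.log(ds.title);
}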

Registering new datasets

Use the TypeScript API

Plugins/extensions can access the data registry object by adding a dependency on the IDataRegistry token. Plugins can add new datasets by using the “registerDataset” API from the data registry. Here is an example of a plugin that registers a dataset.

const datasets: JupyterFrontEndPlugin<void> = {
  id: '@jupyterlab/datasets:plugin',
  autoStart: true,
  requires: [IDataRegistry],
  activate: (
    app: JupyterFrontEnd,
    registry: IDataRegistry
  ) => {
    interface ICSVMetadata extends JSONObject {
      delimiter: string;
      lineDelimiter: string;
    }
    interface IS3CSVMetadata extends ICSVMetadata {
      bucket: string;
      filename: string;
    }
    registry.registerDataset<JSONValue, IS3CSVMetadata>({
      id: 's3://bucket/filename',
      abstractDataType: 'tabular',
      serializationType: 'csv',
      storageType: 's3',
      value: null,
      metadata: {
        delimiter: ',',
        lineDelimiter: '\n',
        bucket: 'bucket',
        filename: 'filename',
      },
      title: 'CSV S3 Dataset',
      description: 'CSV in S3 dataset',
      version: '1.0',
    });
  }
};

Use a Notebook (Python API)

In addition to the TypeScript API, a Python Dataset class is provided to allow JupyterLab users to register datasets within a notebook. Creating a new instance of the Dataset class within a notebook cell and executing the cell will invoke the registration API. Here is an example of using the Python API to register a new dataset.

from jupyterlab_dataregistry import dataset
ds = dataset.Dataset(
    id="s3://airport-data",
    abstract_data_type="tabular",
    serialization_type="tsv",
    storage_type="s3",
    title="Most Crowded Airports",
    description="The dataset contains 250 different airports and each airport has the following attributes: rank, name, location, country, code, passengers, year",
    value=None,
    metadata={
        "delimiter": "\t",
        "lineDemiliter": "\n"
    }
)

Use a dataset file

Another way for users to register datasets is to create a file with the “dataset” extension containing the dataset definition in JSON. The file may hold either an object with a single dataset definition or an array of definitions to register multiple datasets. Opening the file will register all datasets defined inside it. Here is an example of datasets defined in a dataset file.

[
    {
        "id": "100",
        "abstractDataType": "tabular",
        "storageType": "inmemory",
        "value" : {
            "value": "header1,header2\nvalue1,value2"
        },
        "metadata": {
            "delimiter": ",",
            "lineDelimiter": "\n"
        },
        "title": "CSV in memory dataset 1",
        "description": "CSV in memory dataset 1",
        "version": "1.0"
    },
    {
        "id": "200",
        "abstractDataType": "tabular",
        "storageType": "inmemory",
        "value" : {
            "value": "header1,header2\nvalue3,value4"
        },
        "metadata": {
            "delimiter": ",",
            "lineDelimiter": "\n"
        },
        "title": "CSV in memory dataset 2",
        "description": "CSV in memory dataset 2",
        "version": "1.0"
    }
]

Attach actions to datasets (Command Registry API)

The data registry provides APIs to register commands for specific dataset types and to retrieve the set of commands registered for those dataset types. This is useful for populating a variety of action-based widgets that perform actions associated with loading, visualizing, or managing specific datasets.

Data providers can register commands for a specific dataset type by using the “registerCommand” API.

registry.registerCommand(
    'render-csv', 'tabular', 'csv', 'inmemory'
);

Extension writers can use the “getCommands” API to get all commands registered for a specific combination of abstract data, serialization, and storage types, and can use this list of commands to bind actions to specific datasets. Specific datasets can be queried using the “queryDataset” API. Here is an example of an extension that adds a panel to the JupyterLab launcher for each dataset registered with the “render-csv” command.

const commandId = 'render-csv';
const commands = registry.getCommands(
    'tabular', 'csv', 'inmemory'
);
// getCommands returns a Set, so check membership rather than using Array#find
if ([...commands].includes(commandId)) {
    const datasets = registry.queryDataset(
        'tabular', 'csv', 'inmemory'     
    );
    
    datasets.forEach((dataset) => {
        app.commands.addCommand(`${dataset.id}`, {
            label: `${dataset.id}`,
            execute: (args) => {
                const {value} = args.dataset;
                alert(value);
            }
        });
        
        // Adds a new panel in launcher which will 
        // present an alert with dataset value
        launcher.add({
            command: `${dataset.id}`,
            category: 'Datasets',
            args: {
                dataset
            }
        });
    });
}

My Datasets UI

The data registry extension adds a dataset explorer panel to the JupyterLab UI that allows users to view and interact with all registered datasets within a single JupyterLab instance. The extension tracks dataset additions and updates via the data registry within a JupyterLab session.

[screenshot: my-datasets-ui]

This widget also provides a context menu that allows executing the registered commands/actions associated with a dataset.

[screenshot: my-datasets-context-menu]

The “Add Data” button allows users to register a new dataset by auto-populating a dataset-creation template in a notebook cell; the user can edit and customize this template to register a new dataset. This feature uses the Python API for registering a new dataset: executing the notebook cell registers the dataset defined inside the cell.

[screenshot: my-datasets-add-data]

Open Questions

Types for abstract data, serialization, and storage

The properties Abstract Data Type, Serialization Type, and Storage Type in the Data Registry API are all “string” types at the moment, so there is no schema that can be enforced on these values. There are several options for controlling these values. Here are some proposed solutions:

Record in documentation

Document and maintain all types in a repository as part of documentation.

PROS

  • Easy to maintain; new additions just need a documentation update, no code change required.

CONS

  • It is easy for API users to misspell values in the code, which might end up creating datasets that have diverged from the documentation.
  • Since there is no enforcement, dataset providers might not conform to documented values, which could end up cluttering the lab environment with duplicated values.

Use Enums

Codify types in the data registry code as TypeScript enum values so these values are strongly typed.
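
For illustration, here is a minimal sketch of what codified types could look like; the member lists below are made up for the example, not a proposed final set:

// Illustrative only; the concrete set of values is exactly the open question here.
enum AbstractDataType {
  Tabular = 'tabular',
  Image = 'image',
  ImageCollection = 'image-collection',
  Text = 'text'
}

enum StorageType {
  InMemory = 'inmemory',
  File = 'file',
  S3 = 's3',
  Database = 'database'
}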

PROS

  • Ensures that API users don’t diverge from the allowed values.
  • Easy for API users to follow, as the API will enforce these values and code-completion hints will show which values are available.

CONS

  • Adding new values needs a code change and a release cycle. This might slow down API users and adds some friction to adoption of the API.

Build an API

Provide an API that allows data providers and extension writers to add new type values.
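
A rough sketch of what such an API could look like; the class name and normalization rule below are hypothetical, not part of the current proposal:

// Hypothetical type-registration API; illustrative only.
class TypeRegistry {
  private _abstractDataTypes = new Set<string>();

  // Normalizes values so that e.g. 'tabular' and 'TABULAR' are one type.
  registerAbstractDataType(value: string): void {
    this._abstractDataTypes.add(value.trim().toLowerCase());
  }

  getAbstractDataTypes(): Set<string> {
    return new Set(this._abstractDataTypes);
  }
}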

PROS

  • Provides flexibility and a programmatic way to add new values.
  • Will avoid duplication of values.
  • Can help with normalization of values, e.g., “tabular” and “TABULAR” should not be two different types.

CONS

  • As this provides only programmatic access, it is not easy to see the existing values.
  • As there is no human supervision, similar values could still be added. For example, “csv” and “comma-sep” might signify the same serialization type.

Types (Generics) for value and metadata in Dataset

The Dataset interface allows JSON-typed values for the “value” and “metadata” properties. This is currently open to any interface/type, which introduces a chance of duplicated and redundant types. There is some benefit in standardizing these values for common dataset types. Here are a few approaches to tackle this.

Record interfaces in documentation

Add documentation for these interfaces for different dataset types.

PROS

  • Easy to maintain; new additions just need a documentation update, no code change required.

CONS

  • It is easy for API users to misspell values in the code, which might end up creating interfaces that have diverged from the documentation.
  • Since there is no enforcement, dataset creators might not conform to documented values, which could end up cluttering the lab environment with duplicated values.

Define TypeScript interfaces

Add interfaces, and allow only these interfaces to be used in the data registry.

PROS

  • Ensures that API users don’t diverge from the defined interfaces.
  • Easy for API users to follow, as the API will enforce these values and code-completion hints will show which properties are allowed.

CONS

  • Adding a new interface, or updating an existing one, needs a code change and a release cycle. This might slow down API users and adds some friction to adoption of the API.

Hybrid approach

Define interfaces for certain dataset types, but also allow others to be documented and added within the registry API.

PROS

  • Allows flexibility of adding new dataset types and documenting them for common use.
  • Provides structure and adherence to commonly used dataset types, so that users don’t have to re-invent these interfaces.
  • As new interfaces mature, they could be moved into the code to improve adherence to a single definition of these interfaces.

CONS

  • For 100% adherence, a code change is required to add new interfaces to the registry API.
  • Needs governance around which new interfaces could be moved into the code.

@3coins
Author

3coins commented Mar 23, 2022

@vidartf @telamonian
I am looking for some initial feedback on the updated data registry project. This version has two main components: a Dataset interface, and a DataRegistry API to manage datasets in a lab environment. It also includes a "My Datasets" UI, which shows all registered datasets. The Binder link should provide these components and let you create datasets via a file, in a front-end extension, or using the Dataset class inside a cell. I wanted a quick way for users to try this, so I have bundled everything into one extension.

I am also contemplating these additional tasks:

  • Add a data registry server-side API, so the dataset metadata can be persisted
  • Enable users to register a dataset when they create a new dataframe (pandas, dask, etc.)
  • Contextual “My Datasets”, so that users only see datasets in the current notebook

I have tried to capture the original use cases along with these ideas in user stories added in the next section.

cc @ellisonbg


Data Registry User Stories

  • As an end user in JupyterLab working with datasets.
    • When, I am working with datasets in a notebook
      • I would like JupyterLab to let me add them to “My Datasets”, so I can use them later
        • When, I am working with in memory datasets in a notebook (pandas, numpy, etc.)
          • I would like JupyterLab to be aware of those datasets, and let me add them to “My Datasets” using a UI, so I can use them later and see other things I might do with them.
          • I would also like to be able to quickly add them to “My Datasets” from Python explicitly.
        • When, I am working with remote data sources/sets in a notebook
          • I would like to be able to quickly add them to “My Datasets” from Python.
        • When, I am working with in memory or remote datasets in a notebook
          • I would like to optionally see what else I could do with the data in other extensions
    • When, I am working with a dataset in another JupyterLab extension
      • I would like to see what else I could do with the dataset in other extensions and take action on that
      • I would like to be able to add the dataset to “My Datasets”
    • When, I am done using a dataset in “My Datasets”
      • I would like the dataset to be archived/deleted, so that it doesn’t appear in the list of “My Datasets”
    • When, I am working with a dataset in one JupyterLab extension and I want to use another JupyterLab extension to work with the dataset
      • I would like the extension to be able to access and work seamlessly with the dataset
      • I would like the extension to be able to carry the dataset changes from previous extension interactions
    • When I want to use an archived/deleted dataset again
      • I would like to see this dataset in the list of “My Datasets”
    • When I want to find a dataset that I have already worked with
      • I would like to be able to search the dataset by name and other attributes/tags
    • When I am working with “My Datasets” in the data explorer
      • I would like to be notified when the dataset is updated (version changes)
  • As a developer of a JupyterLab extension that inputs and/or outputs datasets (Notebooks)
    • When, I have an output dataset
      • I would like to show my user what else they could do with the output dataset in other JupyterLab extensions or notebooks and select one of those options in a simple UI, so I don’t have to build explicit integration with all of those extensions.
      • I would like to allow my user to choose to add the output dataset to “My Datasets”.
    • When I am building support for different types of input data
      • I would like to tell other extensions that I can do something with those types of data, so they can pass those types of data to me without my intervention and without me building explicit integration with all of those extensions.
    • When I am integrating with other extensions
      • I would like to interface with a central dataset API rather than all of the other extensions, so I don’t have to repeat the work of integration.
    • When a user imports a new dataset
      • I would like my extension to get notified, so that I can do something meaningful with the dataset, for example show in a list of “My Datasets”.
    • When another JupyterLab extension adds a new action for a dataset
      • I would like my extension to get notified so that I can add a new option for the dataset, so my end user can use this option to perform the new action
    • When my user is done using the dataset
      • I would like my extension to be able to remove this dataset from “My Datasets”
    • When my end user wants to reuse an archived/deleted dataset
      • I would like my extension to be able to add this dataset to “My Datasets”
    • When a dataset is updated
      • I would like my extension to be notified, so that my extension can update the dataset info and user options related to this dataset
  • As a developer of a data import experience in JupyterLab:
    • When a user picks a dataset they are interested in working with
      • I would like to show my user what else they could do with the output dataset in other JupyterLab extensions or notebooks and select one of those options in a simple UI, so I don’t have to build explicit integration with all of those extensions.
      • I would like to allow my user to choose to add the output dataset to “My Datasets”.

@aiqc

aiqc commented Mar 23, 2022

I can spend some time playing around with this next week (family vacation this week).

At a glance:

  • Record a path/ location?
  • Are there any plans to record metadata about column names and dtypes (pandas, numpy, parquet) and table/array dimensions?
  • Pickle and dill as supported persistent formats (is that what inmemory is)?
  • Persist a default query to fetch the dataset?
  • Dependencies required for interacting w data?

permanently splitting datasets into train/test ahead of time is bad practice

@3coins
Author

3coins commented Mar 24, 2022

@aiqc
Thanks for looking into this PR.

Record a path/ location?
Are there any plans to record metadata about column names and dtypes (pandas, numpy, parquet) and table/array dimensions?
Persist a default query to fetch the dataset?
Dependencies required for interacting w data?

For path/location, one option is to use the "id" property. The interface so far is flexible enough to allow any JSON-based schema in the "metadata" and "value" properties. It is expected that implementors will use these to come up with a variety of schemas, and we can formalize some of them.

For example, we could define a strict schema for a dataset stored as a tab-separated file in S3, with column names, dtypes, etc. I would define this schema like this:

interface S3 {
    bucket: string
    object: string
}

interface DataColumn {
    name: string
    dtype: string
}

interface ITabSeparated {
    lineDelimiter: string,
    colDelimiter: string,
    columns: DataColumn[]
}

interface IS3TsvMetadata {
    storage: S3,
    serialization: ITabSeparated
} 

// Use the above schema to register a dataset
registry.registerDataset<JSONValue, IS3TsvMetadata>({
    id: "s3://datasets/covid19-dataset",
    abstractDataType: "tabular",
    storageType: "s3",
    serializationType: "tsv",
    value: null,
    metadata: {
        storage: {
            bucket: "datasets",
            object: "covid19-dataset"    
        },
        serialization: {
            lineDelimiter: "\n",
            colDelimiter: "\t",
            columns: [
                {
                    "name": "country",
                    "dtype": "string" 
                },
                {
                    "name": "state",
                    "dtype": "string" 
                },
                {
                    "name": "cases",
                    "dtype": "number"
                },
                {
                    "name": "reported",
                    "dtype": "datetime"
                }   
            ]
        }
    }
});

Pickle and dill as supported persistent formats (is that what inmemory is)?

"inmemory" here just signified a non-remote dataset, dataframes created from pandas for example could be declared "inmemory". This definition here doesn't inherently do anything to support specific formats/libraries, this is merely a way to specify what the storage type is. It is totally open to the implementor to handle specific storage formats. However, I am looking into how to allow users to register datasets directly from the cell when they create a pandas dataframe, we can discuss more if pickle and dill should be supported in this context.

Persist a default query to fetch the dataset?

Can you elaborate on this?

permanently splitting datasets into train/test ahead of time is bad practice

Agreed, the example I had was just a representation of a dataset with multiple folders/files.

@romeokienzler

@3coins as discussed, I've done some edits, >>>>highlighted<<<<

Data Registry User Stories

* As an end user in JupyterLab working with datasets.
  
  * When, I am working with datasets in a notebook
    
    * I would like JupyterLab to let me add them to “My Datasets”, so I can use them later
      
      * When, I am working with in memory datasets in a notebook (pandas, numpy, etc.)
        
        * I would like JupyterLab to be aware of those datasets, and let me add them to “My Datasets” using a UI, so I can use them later and see other things I might do with them.
        * I would also like to be able to quickly add them to “My Datasets” from Python explicitly.
        * >>>>I'd like to have a view of all in-memory data sets in all my kernels (like RStudio)<<<<
      * When, I am working with remote data sources/sets in a notebook
        
        * I would like to be able to quickly add them to “My Datasets” from Python.
      * When, I am working with in memory or remote datasets in a notebook
        
        * I would like to optionally see what else I could do with the data in other extensions
             * >>>>for example, launching a SQL query editor and executing queries from there<<<<
             * >>>>triggering an ML wizard which allows selecting feature columns and target columns, and triggering ML/DL training from there (using the same wizard, pushing e.g. to a pipeline)<<<<
  * When, I am working with a dataset in another JupyterLab extension
    
    * I would like to see what else I could do with the dataset in other extensions and take action on that
    * I would like to be able to add the dataset to “My Datasets”
  * When, I am done using a dataset in “My Datasets”
    
    * I would like the dataset to be archived/deleted, so that it doesn’t appear in the list of “My Datasets”
  * When, I am working with a dataset in one JupyterLab extension and I want to use another JupyterLab extension to work with the dataset
    
    * I would like the extension to be able to access and work seamlessly with the dataset
    * I would like the extension to be able to carry the dataset changes from previous extension interactions
  * When I want to use an archived/deleted dataset again
    
    * I would like to see this dataset in list of “My Datasets”
  * When I want to find a dataset that I have already worked with
    
    * I would like to be able to search the dataset by name and other attributes/tags
  * When I am working with “My Datasets” in the data explorer
    
    * I would like to be notified when the dataset is updated (version changes)

* As a developer of a JupyterLab extension that inputs and/or outputs datasets (Notebooks)
  
  * When, I have an output dataset
    
    * I would like to show my user what else they could do with the output dataset in other JupyterLab extensions or notebooks and select one of those options in a simple UI, so I don’t have to build explicit integration with all of those extensions.
    * I would like to allow my user to choose to add the output dataset to “My Datasets”.
  * When I am building support for different types of input data
    * >>>>I'd like to be able to share my data set with others via ml-exchange.org, feast, or any other feature store<<<<
    
    * I would like to tell other extensions that I can do something with those types of data, so they can pass those types of data to me without my intervention and without me building explicit integration with all of those extensions.
  * When I am integrating with other extensions
    
    * I would like to interface with a central dataset API rather than all of the other extensions, so I don’t have to repeat the work of integration.
  * When a user imports a new dataset
    
    * I would like my extension to get notified, so that I can do something meaningful with the dataset, for example show in a list of “My Datasets”.
  * When another JupyterLab extension adds a new action for a dataset
    
    * I would like my extension to get notified so that I can add a new option for the dataset, so my end user can use this option to perform the new action
  * When my user is done using the dataset
    
    * I would like my extension to be able to remove this dataset from “My Datasets”
  * When my end user wants to reuse an archived/deleted dataset
    
    * I would like my extension to be able to add this dataset to “My Datasets”
  * When a dataset is updated
    
    * I would like my extension to be notified, so that my extension can update the dataset info and user options related to this dataset

* As a developer of a data import experience in JupyterLab:
  
  * When a user picks a dataset they are interested in working with
    
    * I would like to show my user what else they could do with the output dataset in other JupyterLab extensions or notebooks and select one of those options in a simple UI, so I don’t have to build explicit integration with all of those extensions.
    * I would like to allow my user to choose to add the output dataset to “My Datasets”.

@ckadner

ckadner commented May 19, 2022

@3coins -- are there any existing data set providers/registries that you think should implement/support this new interface?

In MLX we have datasets in the catalog. We are considering if we should implement this new dataset interface.

@3coins
Author

3coins commented May 20, 2022

@ckadner

@3coins -- are there any existing data set providers/registries that you think should implement/support this new interface?

In MLX we have datasets in the catalog. We are considering if we should implement this new dataset interface.

This project is still a work in progress. That said, I think we can explore how this applies to the datasets you mentioned here, and identify if there are any gaps. One key point to consider here is that DataRegistry is not intended to serve as a catalog, but rather act as a central artifact to manage "My Datasets" (Datasets I am working with) within a JupyterLab instance.

@fcollonval fcollonval closed this Aug 8, 2023