
Commit

[Artifacts] Visualize data/artifacts with DataItem.show() and return DataItems in run.artifact() (#1092)

(cherry picked from commit 2d01ce8)
yaronha authored and Hedingber committed Jul 12, 2021
1 parent b7f3f43 commit f279457
Showing 10 changed files with 172 additions and 41 deletions.
6 changes: 3 additions & 3 deletions docs/runtimes/functions.md
@@ -24,8 +24,8 @@ use in your projects.
**Functions** (function objects) can be created by using any of the following methods:

- **{py:func}`~mlrun.run.new_function`** - creates a function for local run or from container, from code repository/archive, from function spec.
- **{py:func}`~mlrun.code_to_function`** - creates a function from local or remote source code (single file) or from a notebook (code file will be embedded in the function object).
- **{py:func}`~mlrun.import_function`** - imports a function from a local or remote YAML function-configuration file or
- **{py:func}`~mlrun.run.code_to_function`** - creates a function from local or remote source code (single file) or from a notebook (code file will be embedded in the function object).
- **{py:func}`~mlrun.run.import_function`** - imports a function from a local or remote YAML function-configuration file or
from a function object in the MLRun database (using a DB address of the format `db://<project>/<name>[:<tag>]`)
or from the function marketplace (e.g. `hub://describe`).
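The `db://<project>/<name>[:<tag>]` address format above can be made concrete with a small parser sketch (a hypothetical helper, not part of the mlrun API; defaulting the missing tag to `"latest"` is an assumption of this sketch):

```python
# Hypothetical helper (not part of mlrun): parse a DB address of the form
# db://<project>/<name>[:<tag>] into its components.
import re


def parse_db_address(url):
    match = re.fullmatch(r"db://([^/]+)/([^:]+)(?::(.+))?", url)
    if not match:
        raise ValueError(f"not a valid db:// address: {url}")
    project, name, tag = match.groups()
    # defaulting the tag to "latest" is an assumption for this sketch
    return {"project": project, "name": name, "tag": tag or "latest"}
```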

@@ -142,7 +142,7 @@ Run object has the following methods/properties:
- `outputs` &mdash; returns a dictionary of the run results and artifact paths.
- `logs(watch=True)` &mdash; returns the latest logs.
Use `watch=False` to disable the interactive mode when running jobs.
- `artifact(key)` &mdash; returns full artifact details for the provided key.
- `artifact(key)` &mdash; returns the artifact for the provided key (as a {py:class}`~mlrun.datastore.DataItem` object).
- `output(key)` &mdash; returns a specific result or an artifact path for the provided key.
- `wait_for_completion()` &mdash; waits for an async run to complete.
- `refresh()` &mdash; refreshes the run state from the db/service.
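To illustrate how these accessors relate, here is a minimal sketch (`FakeRun` and `FakeDataItem` are illustrative stand-ins, not the real mlrun classes): `outputs` maps keys to results and artifact paths, while `artifact(key)` wraps a path in a DataItem-like object.

```python
# Illustrative stand-ins (not real mlrun classes) showing the relationship
# between run.outputs, run.output(key), and run.artifact(key).
class FakeDataItem:
    """Minimal stand-in for mlrun.datastore.DataItem."""

    def __init__(self, url):
        self.url = url


class FakeRun:
    def __init__(self, outputs):
        self._outputs = outputs

    @property
    def outputs(self):
        # dictionary of run results and artifact paths
        return dict(self._outputs)

    def output(self, key):
        # a single result value or artifact path
        return self._outputs.get(key)

    def artifact(self, key):
        # a DataItem-like wrapper around the artifact path
        return FakeDataItem(self._outputs[key])


run = FakeRun({"accuracy": 0.97, "model": "s3://bucket/model.pkl"})
```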
10 changes: 6 additions & 4 deletions docs/store/artifacts.md
@@ -90,7 +90,7 @@ they host common and object specific metadata such as:
* type specific attributes

Artifacts can be obtained via the SDK through type-specific APIs or using generic artifact APIs such as:
* {py:func}`~mlrun.run.get_data_item` - get the {py:class}`~mlrun.datastore.DataItem` object for reading/downloading the artifact content
* {py:func}`~mlrun.run.get_dataitem` - get the {py:class}`~mlrun.datastore.DataItem` object for reading/downloading the artifact content
* {py:func}`~mlrun.datastore.get_store_resource` - get the artifact object

Example artifact URLs:
@@ -162,13 +162,15 @@ get_data_run = run_local(name='get_data',
artifact_path=artifact_path)
```

The dataset location is returned in the `outputs` field, therefore you can get the location by calling `get_data_run.outputs['iris_dataset']` and use the `get_dataitem` function to get the dataset itself.
The dataset location is returned in the `outputs` field; you can therefore call `get_data_run.artifact('iris_dataset')` to get the dataset itself.


``` python
# Read your data set
from mlrun.run import get_dataitem
dataset = get_dataitem(get_data_run.outputs['iris_dataset'])
get_data_run.artifact('iris_dataset').as_df()

# Visualize an artifact in Jupyter (image, html, df, ..)
get_data_run.artifact('confusion-matrix').show()
```

Call `dataset.meta.stats` to obtain the data statistics. You can also get the data as a pandas DataFrame by calling `dataset.as_df()`.
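As a rough analogy for what `as_df()` returns for a CSV artifact, here is a self-contained pandas sketch (the inline data and column names are made up for illustration):

```python
# Illustrative only: build a small iris-like CSV in memory and load it with
# pandas, similar in spirit to DataItem.as_df() for a CSV artifact.
import io

import pandas as pd

csv_data = io.StringIO("sepal_length,sepal_width,label\n5.1,3.5,0\n6.2,2.9,1\n4.9,3.0,0\n")
df = pd.read_csv(csv_data)

# summary statistics, loosely analogous to dataset.meta.stats
stats = df.describe()
```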
25 changes: 17 additions & 8 deletions docs/store/datastore.md
@@ -56,19 +56,28 @@ Note that in order to call our function with an `input` we used the `inputs` dictionary attribute, and to pass
a simple parameter we used the `params` dictionary attribute. The input value is the specific item URI
(per data store schema), as explained above.

Reading the data results from our run:
We can easily get a run output artifact as a `DataItem` object (allowing us to view/use the artifact) using:

```python
# read the data locally as a Dataframe
prep_data_run.artifact('cleaned_data').as_df()
```

The {py:class}`~mlrun.datastore.DataItem` class supports multiple convenience methods, such as:
* **get**, **put** - to read/write data
* **download**, **upload** - to download/upload files
* **as_df** - to convert the data to a DataFrame object
* **get()**, **put()** - to read/write data
* **download()**, **upload()** - to download/upload files
* **as_df()** - to convert the data to a DataFrame object
* **local** - to get a local file link to the data (will be downloaded locally if needed)
* **listdir**, **stat** - file system like methods
* **listdir()**, **stat()** - file-system-like methods
* **meta** - access to the artifact metadata (in case of an artifact uri)
* **show()** - will visualize the data in Jupyter (as image, html, etc.)

Check the **{py:class}`~mlrun.datastore.DataItem`** class documentation for details.
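To make the suffix-based behavior of `show()` concrete, here is a minimal dispatch sketch (not the real implementation; the renderer names are illustrative):

```python
# A sketch of suffix-based dispatch, similar in spirit to DataItem.show(),
# which picks a renderer (image, html, dataframe, ...) from the data suffix.
import pathlib

RENDERERS = {
    ".jpg": "image", ".png": "image", ".gif": "image",
    ".htm": "html", ".html": "html",
    ".csv": "dataframe", ".pq": "dataframe", ".parquet": "dataframe",
    ".yaml": "text", ".txt": "text", ".py": "text",
    ".json": "json",
    ".md": "markdown",
}


def pick_renderer(url, format=None):
    # an explicit format argument overrides a missing/wrong suffix
    suffix = "." + format if format else pathlib.Path(url).suffix.lower()
    return RENDERERS.get(suffix, "unsupported")
```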

In order to get a DataItem object from a url use {py:func}`~mlrun.run.get_data_item` or
{py:func}`~mlrun.run.get_data_object` (returns the `DataItem.get()`), for example:
In order to get a DataItem object from a URL, use {py:func}`~mlrun.run.get_dataitem` or
{py:func}`~mlrun.run.get_object` (which returns the result of `DataItem.get()`), for example:

df = mlrun.get_data_item('s3://demo-data/mydata.csv').as_df()
print(mlrun.get_data_object('https://my-site/data.json'))
df = mlrun.get_dataitem('s3://demo-data/mydata.csv').as_df()
print(mlrun.get_object('https://my-site/data.json'))

17 changes: 8 additions & 9 deletions docs/tutorial/02-model-training.ipynb
@@ -672,8 +672,7 @@
],
"source": [
"# Display HTML output artifacts\n",
"from IPython.display import display, HTML\n",
"display(HTML(filename=train_run.outputs['confusion-matrix']))"
"train_run.artifact('confusion-matrix').show()"
]
},
{
@@ -696,7 +695,7 @@
}
],
"source": [
"display(HTML(filename=train_run.outputs['roc-multiclass']))"
"train_run.artifact('roc-multiclass').show()"
]
},
{
@@ -722,7 +721,7 @@
"outputs": [],
"source": [
"# Read your data set\n",
"df = mlrun.run.get_dataitem(train_run.outputs['test_set']).as_df()"
"df = train_run.artifact('test_set').as_df()"
]
},
{
@@ -1416,7 +1415,7 @@
],
"source": [
"# Display the `histograms` artifact\n",
"display(HTML(describe_run.outputs['histograms']))"
"describe_run.artifact('histograms').show()"
]
},
{
Expand All @@ -1440,7 +1439,7 @@
],
"source": [
"# Display the `imbalance` artifact\n",
"display(HTML(filename=describe_run.outputs['imbalance']))"
"describe_run.artifact('imbalance').show()"
]
},
{
Expand All @@ -1464,7 +1463,7 @@
],
"source": [
"# Display the `correlation` artifact\n",
"display(HTML(filename=describe_run.outputs['correlation']))"
"describe_run.artifact('correlation').show()"
]
},
{
Expand Down Expand Up @@ -1821,8 +1820,8 @@
"print(f'Test Accuracy: {test_run.outputs[\"accuracy\"]}')\n",
"\n",
"# Display HTML output artifacts\n",
"display(HTML(filename=test_run.outputs['confusion-matrix']))\n",
"display(HTML(filename=test_run.outputs['roc-multiclass']))"
"test_run.artifact('confusion-matrix').show()\n",
"test_run.artifact('roc-multiclass').show()"
]
},
{
89 changes: 81 additions & 8 deletions mlrun/datastore/base.py
@@ -17,13 +17,14 @@
from tempfile import mktemp

import fsspec
import orjson
import pandas as pd
import pyarrow.parquet as pq
import requests
import urllib3

import mlrun.errors
from mlrun.utils import logger
from mlrun.utils import is_ipython, logger

verify_ssl = False
if not verify_ssl:
@@ -221,7 +222,29 @@ def rm(self, path, recursive=False, maxdepth=None):


class DataItem:
"""Data input/output class abstracting access to various local/remote data sources"""
"""Data input/output class abstracting access to various local/remote data sources
DataItem objects are passed into functions and can be used inside the function. When a function run completes,
users can access the run data via `run.artifact(key)`, which returns a DataItem object.
Users can also convert a data url (e.g. s3://bucket/key.csv) to a DataItem using `mlrun.get_dataitem(url)`.
Example::
# using data item inside a function
def my_func(context, data: DataItem):
df = data.as_df()
# reading run results using DataItem (run.artifact())
train_run = train_iris_func.run(inputs={'dataset': dataset},
params={'label_column': 'label'})
train_run.artifact('confusion-matrix').show()
test_set = train_run.artifact('test_set').as_df()
# create and use DataItem from uri
data = mlrun.get_dataitem('http://xyz/data.json').get()
"""

def __init__(
self,
@@ -276,20 +299,38 @@ def url(self):
"""DataItem url e.g. /dir/path, s3://bucket/path"""
return self._url

def get(self, size=None, offset=0):
"""read all or a range and return the content"""
return self._store.get(self._path, size=size, offset=offset)
def get(self, size=None, offset=0, encoding=None):
"""read all or a byte range and return the content
:param size: number of bytes to get
:param offset: fetch from offset (in bytes)
:param encoding: encoding (e.g. "utf-8") for converting bytes to str
"""
body = self._store.get(self._path, size=size, offset=offset)
if encoding and isinstance(body, bytes):
body = body.decode(encoding)
return body

def download(self, target_path):
"""download to the target dir/path"""
"""download to the target dir/path
:param target_path: local target path for the downloaded item
"""
self._store.download(self._path, target_path)

def put(self, data, append=False):
"""write/upload the data, append is only supported by some datastores"""
"""write/upload the data, append is only supported by some datastores
:param data: data (bytes/str) to write
:param append: append data to the end of the object, NOT SUPPORTED BY SOME OBJECT STORES!
"""
self._store.put(self._path, data, append=append)

def upload(self, src_path):
"""upload the source file (src_path) """
"""upload the source file (src_path)
:param src_path: source file path to read from and upload
"""
self._store.upload(self._path, src_path)

def stat(self):
@@ -339,6 +380,38 @@ def as_df(
**kwargs,
)

def show(self, format=None):
"""show the data object content in Jupyter
:param format: format to use (when there is no/wrong suffix), e.g. 'png'
"""
if not is_ipython:
logger.warning(
"Jupyter/IPython was not detected, .show() will only display inside Jupyter"
)
return

from IPython import display

suffix = self.suffix.lower()
if format:
suffix = "." + format

if suffix in [".jpg", ".png", ".gif"]:
display.display(display.Image(self.get(), format=suffix[1:]))
elif suffix in [".htm", ".html"]:
display.display(display.HTML(self.get(encoding="utf-8")))
elif suffix in [".csv", ".pq", ".parquet"]:
display.display(self.as_df())
elif suffix in [".yaml", ".txt", ".py"]:
display.display(display.Pretty(self.get(encoding="utf-8")))
elif suffix == ".json":
display.display(display.JSON(orjson.loads(self.get())))
elif suffix == ".md":
display.display(display.Markdown(self.get(encoding="utf-8")))
else:
logger.error(f"unsupported show() format {suffix} for {self.url}")

def __str__(self):
return self.url

4 changes: 2 additions & 2 deletions mlrun/db/httpdb.py
@@ -398,7 +398,7 @@ def list_runs(
start_time_to: datetime = None,
last_update_time_from: datetime = None,
last_update_time_to: datetime = None,
):
) -> RunList:
""" Retrieve a list of runs, filtered by various options.
Example::
@@ -523,7 +523,7 @@ def list_artifacts(
until=None,
iter: int = None,
best_iteration: bool = False,
):
) -> ArtifactList:
""" List artifacts filtered by various parameters.
Examples::
26 changes: 24 additions & 2 deletions mlrun/lists.py
@@ -11,17 +11,21 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List

import pandas as pd

import mlrun

from .artifacts import Artifact, dict_to_artifact
from .config import config
from .render import artifacts_to_html, runs_to_html
from .utils import flatten, get_in
from .utils import flatten, get_artifact_target, get_in


class RunList(list):
def to_rows(self):
"""return the run list as flattened rows"""
rows = []
head = [
"project",
@@ -57,6 +61,7 @@ def to_rows(self):
return [head] + rows

def to_df(self, flat=False):
"""convert the run list to a dataframe"""
rows = self.to_rows()
df = pd.DataFrame(rows[1:], columns=rows[0]) # .set_index('iter')
df["start"] = pd.to_datetime(df["start"])
@@ -69,6 +74,7 @@
return df

def show(self, display=True, classes=None, short=False):
"""show the run list as a table in Jupyter"""
html = runs_to_html(self.to_df(), display, classes=classes, short=short)
if not display:
return html
@@ -80,6 +86,7 @@ def __init__(self, *args):
self.tag = ""

def to_rows(self):
"""return the artifact list as flattened rows"""
rows = []
head = {
"tree": "",
@@ -102,6 +109,7 @@
return [head.keys()] + rows

def to_df(self, flat=False):
"""convert the artifact list to a dataframe"""
rows = self.to_rows()
df = pd.DataFrame(rows[1:], columns=rows[0])
df["updated"] = pd.to_datetime(df["updated"])
@@ -113,13 +121,27 @@
return df

def show(self, display=True, classes=None):
"""show the artifact list as a table in Jupyter"""
df = self.to_df()
if self.tag != "*":
df.drop("tree", axis=1, inplace=True)
html = artifacts_to_html(df, display, classes=classes)
if not display:
return html

def objects(self) -> List[Artifact]:
"""return as a list of artifact objects"""
return [dict_to_artifact(artifact) for artifact in self]

def dataitems(self) -> List["mlrun.DataItem"]:
"""return as a list of DataItem objects"""
dataitems = []
for item in self:
artifact = get_artifact_target(item)
if artifact:
dataitems.append(mlrun.get_dataitem(artifact))
return dataitems


class FunctionList(list):
def __init__(self):
