[docs] Adding Dataset usage examples (#469)
gilad-shaham committed Oct 25, 2020
1 parent 7ba57be commit 7d2bbc8
Showing 4 changed files with 92 additions and 14 deletions.
90 changes: 84 additions & 6 deletions docs/data-management-and-versioning.md
@@ -2,6 +2,7 @@

- [Overview](#overview)
- [Datasets](#datasets)
- [Logging a Dataset From a Job](#logging-a-dataset-from-a-job)
- [Models](#models)
- [Plots](#plots)

@@ -37,11 +38,69 @@ Where `key` is the name of the artifact and `df` is the DataFrame. By default

MLRun also calculates statistics for all numeric fields in the DataFrame. You can enable statistics regardless of the DataFrame size by setting the `stats` parameter to `True`.
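
For example, a minimal sketch (assuming an MLRun `context` object and a pandas DataFrame named `iris_dataset`, as in the handler below):

``` python
# Force statistics calculation even for a large DataFrame
# (sketch; `context` and `iris_dataset` are assumed to exist)
context.log_dataset('iris_dataset', df=iris_dataset, format='csv', stats=True)
```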

### Logging a Dataset From a Job

The following example shows how to work with datasets from a [job](job-submission-and-tracking.html):

``` python
from os import path
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

# Ingest a data set into the platform
def get_data(context: MLClientCtx, source_url: DataItem, format: str = 'csv'):

    iris_dataset = source_url.as_df()

    target_path = path.join(context.artifact_path, 'data')
    # Optionally print data to your logger
    context.logger.info('Saving Iris data set to {} ...'.format(target_path))

    # Store the data set in your artifacts database
    context.log_dataset('iris_dataset', df=iris_dataset, format=format,
                        index=False, artifact_path=target_path)
```

We can run this function locally or as a job. For example, to run it locally:

``` python
from os import path
from mlrun import new_project, run_local, mlconf

project_name = 'my-project'
project_path = path.abspath('conf')
project = new_project(project_name, project_path, init_git=True)

# Target location for storing pipeline artifacts
artifact_path = path.abspath('jobs')
# MLRun DB path or API service URL
mlconf.dbpath = mlconf.dbpath or 'http://mlrun-api:8080'

source_url = 'https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'
# Run get-data function locally
get_data_run = run_local(name='get_data',
                         handler=get_data,
                         inputs={'source_url': source_url},
                         project=project_name,
                         artifact_path=artifact_path)
```
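
The same handler can also run as a job on the cluster. The following is only a sketch, assuming the handler above is saved to a file named `get_data.py` and an MLRun API service is reachable; the function name and container image are illustrative:

``` python
from mlrun import code_to_function
from mlrun.platforms import auto_mount

# Package the handler file as an MLRun job function
# (sketch; 'get_data.py' and the 'mlrun/mlrun' image are assumptions)
gen_func = code_to_function(name='get-data',
                            filename='get_data.py',
                            kind='job',
                            image='mlrun/mlrun')
get_data_func = project.set_function(gen_func).apply(auto_mount())

# Run the job with the same inputs as the local run
get_data_run = get_data_func.run(name='get_data',
                                 handler='get_data',
                                 inputs={'source_url': source_url},
                                 artifact_path=artifact_path)
```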

The dataset location is returned in the `outputs` field. You can therefore get the location by calling `get_data_run.outputs['iris_dataset']` and then use the `get_dataitem` function to get the dataset itself.


``` python
# Read your data set
from mlrun.run import get_dataitem
dataset = get_dataitem(get_data_run.outputs['iris_dataset'])
```

Call `dataset.meta.stats` to obtain the dataset statistics. You can also get the data as a pandas DataFrame by calling `dataset.as_df()`.
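
For instance, a brief sketch (assuming the run above completed and `dataset` was retrieved as shown):

``` python
# Inspect the statistics MLRun stored with the dataset artifact
print(dataset.meta.stats)

# Load the data itself as a pandas DataFrame
df = dataset.as_df()
print(df.head())
```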

## Models

An essential piece of artifact management and versioning is storing a model version. This allows users to experiment with different models and compare their performance, without having to worry about losing their previous results.

The simplest way to store a model named `model` is with the following code:
The simplest way to store a model named `my_model` is with the following code:

``` python
from pickle import dumps
@@ -100,8 +159,8 @@ gen_func = code_to_function(name=train_iris,
train_iris_func = project.set_function(gen_func).apply(auto_mount())

train_iris = train_iris_func.run(name=train_iris,
                                 handler=train_iris,
                                 artifact_path=artifact_path)
```

You can now use `get_model` to read the model and run it. This function will get the model file, metadata, and extra data. The input can be either the path of the model or the directory where the model resides. If you provide a directory, the function will search for the model file (by default, it searches for .pkl files).
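
A rough sketch of reading a stored model back (the unpacking of the return value is an assumption; `train_iris.outputs['model']` comes from the training run above):

``` python
from pickle import load
from mlrun.artifacts import get_model

# Fetch the model file, its artifact metadata, and any extra data
# (sketch; assumes the training run logged a pickled model)
model_file, model_artifact, extra_data = get_model(train_iris.outputs['model'], suffix='.pkl')

with open(model_file, 'rb') as f:
    model = load(f)
```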
@@ -150,7 +209,7 @@ run = func.run(name=test_model,
               handler=test_model,
               params={'label_column': 'label'},
               inputs={'models_path': train_iris.outputs['model'],
                       'test_set': 'http://iguazio-sample-data.s3.amazonaws.com/iris_dataset.csv'},
                       'test_set': 'https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'},
               artifact_path=artifact_path)
```

@@ -161,8 +220,27 @@ Storing plots is useful to visualize the data and to show any information regard
For example, the following code creates a confusion matrix plot using [sklearn.metrics.plot_confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html#sklearn.metrics.plot_confusion_matrix) and stores the plot in the artifact repository:

``` python
cmd = metrics.plot_confusion_matrix(model, xtest, ytest, normalize='all', values_format='.2g', cmap=plt.cm.Blues)
context.log_artifact(PlotArtifact('confusion-matrix', body=cmd.figure_), local_path='plots/confusion_matrix.html')
from mlrun.artifacts import PlotArtifact
from mlrun.mlutils import gcf_clear

gcf_clear(plt)
confusion_matrix = metrics.plot_confusion_matrix(model,
                                                 xtest,
                                                 ytest,
                                                 normalize='all',
                                                 values_format='.2g',
                                                 cmap=plt.cm.Blues)
confusion_matrix = context.log_artifact(PlotArtifact('confusion-matrix', body=confusion_matrix.figure_),
                                        local_path='plots/confusion_matrix.html')
```

You can use the `update_dataset_meta` function to associate the plot with the dataset by passing it in the `extra_data` parameter:

``` python
from mlrun.artifacts import update_dataset_meta

extra_data = {'confusion_matrix': confusion_matrix}
update_dataset_meta(dataset, extra_data=extra_data)
```

[Back to top](#top)
10 changes: 5 additions & 5 deletions docs/job-submission-and-tracking.md
@@ -20,11 +20,11 @@
- [Using the MLRun CLI to Run an MLRun Service](#using-the-mlrun-cli-to-run-an-mlrun-service)

## Experiment Tracking
Experiment tracking enables you to store every action and result in your project. It is a convenient way to go back to previous results and compare different artifacts. You will find 3 main sections within the your project:
1. [**Artifacts**](#artifact): Any data stored is considered an artifact. Artifacts are versioned and enable you to compare different outputs of the executed Jobs
2. [**Functions**](#functions): The code in your project is stored in functions that are versioned. Functions can the functions you wrote, or externally loaded functions, such as functions that originate from the [MLRun Functions Marketplace](https://github.com/mlrun/functions)
3. [**Jobs**](#jobs): Allows you to review anything you executed, and review the execution outcome
4. [**Pipelines**](#pipelines): Reusable end-to-end ML workflows
Experiment tracking enables you to store every action and result in your project. It is a convenient way to go back to previous results and compare different artifacts. You will find the following sections within your project:
1. [**Artifacts**](#artifact): Any data stored is considered an artifact. Artifacts are versioned and enable you to compare different outputs of the executed Jobs.
2. [**Functions**](#functions): The code in your project is stored in functions that are versioned. These can be functions you wrote or externally loaded functions, such as functions that originate from the [MLRun Functions Marketplace](https://github.com/mlrun/functions).
3. [**Jobs**](#jobs): Allows you to review anything you executed, and review the execution outcome.
4. [**Pipelines**](#pipelines): Reusable end-to-end ML workflows.

You can compare different experiments and review their results. When using experiment tracking, you don't have to worry about saving your work as you try out different models and various configurations; you can always compare your results and choose the best strategy based on your current and past experiments.

2 changes: 1 addition & 1 deletion docs/load-from-marketplace.md
@@ -123,7 +123,7 @@ When working with functions, pay attention to the following:
In this example, we run the describe function. This function analyzes a dataset (in our case, a CSV file), generates HTML files (e.g. correlation, histogram), and saves them under the artifact path.

```python
DATA_URL = 'https://iguazio-sample-data.s3.amazonaws.com/datasets/iris_dataset.csv'
DATA_URL = 'https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'

my_describe.run(name='describe',
                inputs={'table': DATA_URL},
4 changes: 2 additions & 2 deletions docs/quick-start.md
@@ -139,7 +139,7 @@ As input, we will provide a CSV file from S3:

```python
# Set the source-data URL
source_url = 'http://iguazio-sample-data.s3.amazonaws.com/iris_dataset.csv'
source_url = 'https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'
```

Next, call this function locally using the `run_local` method. This is a wrapper that stores the execution results in the MLRun database.
@@ -412,7 +412,7 @@ def init_functions(functions: dict, project=None, secrets=None):
    name="Quick-start",
    description="This is a simple workflow"
)
def kfpipeline(source_url='http://iguazio-sample-data.s3.amazonaws.com/iris_dataset.csv'):
def kfpipeline(source_url='https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'):

    # Ingest the data set
    ingest = funcs['get-data'].as_step(
