# Machine Learning Pipeline with ProActive Jupyter Kernel and TensorBoard
The ActiveEon Jupyter Kernel adds a kernel backend to Jupyter.

This kernel interfaces directly with the ProActive scheduler, building tasks and workflows and executing them on the fly.

With this interface, users can run and test their code locally with a native Python kernel and then, by simply switching to the ProActive kernel, run it on remote public or private infrastructures without modifying the code.

See https://github.com/ow2-proactive/proactive-jupyter-kernel for more information.

As a quick start, we recommend running the `#%help()` pragma in the following cell:

In [None]:
#%help()

## Connection

If you are trying ProActive for the first time, sign up on the [try platform](https://try.activeeon.com/signup.html).

Once you receive your login and password, connect to the trial platform using the `#%connect()` pragma.

For more information, type: `#%help(pragma=connect)`

In [None]:
#%connect(url=https://try.activeeon.com:8443)

## Runtime environment definition

The `#%runtime_env()` pragma lets the user define the runtime environment used for pipeline execution.

The user can select the container type (docker, podman, singularity) and the container image, and mount local directories inside the container.

For more information, type: `#%help(pragma=runtime_env)`

In [None]:
#%runtime_env(type=docker,image=activeeon/dlm3,mount_host_path=/shared,mount_container_path=/shared,debug=false,verbose=false,force=off)

## Importing libraries
The main difference between the ProActive kernel and a native-language kernel lies in how memory is accessed
during block execution. In a typical native-language kernel, the whole script (all the notebook blocks) is
executed locally in the same shared memory space, whereas the ProActive kernel executes each created task in an
independent process. To ease the transition from a native-language kernel to the ProActive kernel, we provide the
`#%import()` pragma. It lets the user declare libraries that are common to all created tasks, and
thus to their distributed processes, that are implemented in the same native script language.

The import pragma is used as follows:

`#%import([language=SCRIPT_LANGUAGE])`.

Example:

```python
#%import(language=Python)
import os
import pandas
```

NOTE: If the language is not specified, Python is used by default.
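
The process isolation described above can be illustrated without ProActive. The following is a minimal sketch, using only the standard `subprocess` module, in which each "task" runs in a fresh Python interpreter. `SHARED_IMPORTS` and `run_task` are hypothetical names that only mimic the role of `#%import()`; they do not reflect how the kernel is implemented.

```python
import subprocess
import sys

# Hypothetical stand-in for the #%import() block: code prepended to every task.
SHARED_IMPORTS = "import json\n"


def run_task(body):
    """Run `body` in a fresh interpreter, after the shared imports, and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", SHARED_IMPORTS + body],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Task A defines a variable and uses the shared import.
output_a = run_task("x = [1, 2, 3]\nprint(json.dumps(x))")

# Task B runs in a different process: the import is available, but x is not.
output_b = run_task("print('json' in globals(), 'x' in globals())")
```

Here `output_a` is `[1, 2, 3]` and `output_b` is `True False`: the shared import reaches both processes, while the variable `x` has to be passed explicitly, which is what the `export` and `import` parameters of `#%task()` are used for later in this notebook.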

In [None]:
#%import()
import json
import os
import numpy as np
import pandas as pd
import pickle
import bz2
import random

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, KFold
from scipy.stats import uniform

from tensorboardX import SummaryWriter

## Creating tasks

### Creating the _import_data_ task

In [None]:
#%task(name=import_data,export=[dataframe_json])
dataset_url = "https://activeeon-public.s3.eu-west-2.amazonaws.com/datasets/vehicle_silhouette_weka_dataset.csv"
dataframe = pd.read_csv(dataset_url)

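# Serialize the dataframe to JSON so it can be exported to downstream tasks
dataframe_json = dataframe.to_json(orient='split').encode()
# Compressed copy of the payload (computed here but not exported)
compressed_data = bz2.compress(dataframe_json)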
dataframe.head()
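
The cell above computes `compressed_data` but exports only `dataframe_json`. If the compressed payload were the one handed to the next task, the round trip might look like the sketch below. It uses only the standard library, and `payload` is a hand-written, hypothetical stand-in for the output of `to_json(orient='split')`; the `compactness` column and the sample values are assumptions made for illustration.

```python
import bz2
import json

# Hypothetical stand-in for dataframe.to_json(orient='split'):
# column names, row index, and row data are kept in separate lists.
payload = {
    "columns": ["compactness", "vehicle_class"],
    "index": [0, 1],
    "data": [[95, "van"], [91, "bus"]],
}


def pack(obj):
    """Serialize to JSON, encode to bytes, then compress with bz2."""
    return bz2.compress(json.dumps(obj).encode("utf-8"))


def unpack(blob):
    """Reverse of pack(): decompress, decode, and parse the JSON."""
    return json.loads(bz2.decompress(blob).decode("utf-8"))


blob = pack(payload)      # bytes that the producing task would export
restored = unpack(blob)   # what the consuming task would rebuild
```

Compression mostly pays off for larger tables; for a payload as small as this one the compressed bytes can be longer than the raw JSON.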

### Creating the _cross_validation_ task 

In [None]:
#%task(name=cross_validation,dep=[import_data],import=[dataframe_json],export=[nested_scores_json])
dataframe = pd.read_json(dataframe_json, orient='split')

label_column = "vehicle_class"
dataframe_train = dataframe.drop(label_column, axis=1, inplace=False)
dataframe_label = dataframe[label_column]

# Set up possible values of parameters to optimize over
distributions = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])

# We will use a logistic regression classifier with the "saga" solver,
# which supports both the l1 and l2 penalties sampled above
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200, random_state=0)

# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = KFold(n_splits=10, shuffle=True, random_state=random.randint(0,9))
outer_cv = KFold(n_splits=10, shuffle=True, random_state=random.randint(0,9))

# Non-nested parameter search and scoring
clf = RandomizedSearchCV(estimator=logistic, param_distributions=distributions, cv=inner_cv)
clf.fit(dataframe_train, dataframe_label)

# Nested CV with parameter optimization
nested_scores = cross_val_score(clf, X=dataframe_train, y=dataframe_label, cv=outer_cv)

# Print scores
print("nested cross-validation scores:\n", nested_scores)
print("average of {:.6f} with std. dev. of {:.6f}."
      .format(nested_scores.mean(), nested_scores.std()))

# Save scores on Tensorboard
# writer = SummaryWriter("./logs")
PA_JOB_ID = variables.get("PA_JOB_ID")
TENSORBOARD_LOG_PATH = "/shared/tensorboard/job_id_" + str(PA_JOB_ID)
os.makedirs(TENSORBOARD_LOG_PATH, exist_ok=True)
writer = SummaryWriter(TENSORBOARD_LOG_PATH)
for idx, nested_score in enumerate(nested_scores):
    writer.add_scalar('Logistic_Regression_Scores', nested_score, idx)
writer.close()

# save the model to disk
#filename = '/shared/logistic_regression_model.sav'
#pickle.dump(model, open(filename, 'wb'))

nested_scores_json = json.dumps(nested_scores.tolist())
result = nested_scores_json
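
The task above writes its TensorBoard events to a directory named after the job ID. A minimal sketch of that per-job layout is shown below, using only the standard library; `job_log_dir` is a hypothetical helper, and a temporary directory stands in for the `/shared/tensorboard` mount.

```python
import os
import tempfile


def job_log_dir(base_dir, job_id):
    """Return (and create if needed) a per-job log directory under base_dir."""
    path = os.path.join(base_dir, "job_id_" + str(job_id))
    os.makedirs(path, exist_ok=True)  # does not fail if the directory already exists
    return path


# A temporary directory stands in for the shared mount used by the task.
with tempfile.TemporaryDirectory() as base:
    first = job_log_dir(base, 42)
    second = job_log_dir(base, 42)  # calling again for the same job is harmless
    created = os.path.isdir(first)
```

With one subdirectory per job, pointing TensorBoard at the parent directory (for example `tensorboard --logdir /shared/tensorboard`) should list each job as a separate run.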

### Visualizing the job pipeline

In [None]:
#%draw_job()

### Submitting the job to the scheduler

To submit the job to the ProActive Scheduler, the user has to use the `#%submit_job()` pragma:

```python
#%submit_job()
```

If the job has not been created yet, or is out of date, `#%submit_job()` creates a new job with the same name as the previous one.
To give it a new name, use the same pragma and pass the name as a parameter:

```python
#%submit_job([name=JOB_NAME])
```

If the job name is not set, the ProActive kernel uses the current notebook name when possible, or generates a random one.

In [None]:
#%submit_job(name=ML_Pipeline_Tensorboard_Example)

### Getting results and outputs

After a ProActive workflow has run, two kinds of output are available:
* results: values that have been saved in the [task result variable](https://doc.activeeon.com/latest/user/ProActiveUserGuide.html#_task_result),
* console outputs: standard outputs that have been displayed or printed by the tasks.

To get task results, please use the `#%get_task_result()` pragma by providing the task name, and either the job ID or
the job name:

```python
#%get_task_result([job_id=JOB_ID], [job_name=JOB_NAME], task_name=TASK_NAME)
```

The result(s) of all the tasks of a job can be obtained with the `#%get_job_result()` pragma, by providing the job name
or the job ID:

```python
#%get_job_result([job_id=JOB_ID], [job_name=JOB_NAME])
```

To get and display console outputs of a task, you can use the `#%print_task_output()` pragma in the following
way:

```python
#%print_task_output([job_id=JOB_ID], [job_name=JOB_NAME], task_name=TASK_NAME)
```

Finally, the `#%print_job_output()` pragma prints all the outputs of a job, given the job name or the job ID:

```python
#%print_job_output([job_id=JOB_ID], [job_name=JOB_NAME])
```

NOTE: If neither `job_name` nor `job_id` is provided, the last submitted job is selected by default.
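
Once a task result has been retrieved, it can be processed like any other value. The sketch below assumes the result of the `cross_validation` task arrives as the JSON string that the task stored in `result`; the score values and the `summarize_scores` helper are hypothetical and only illustrate the parsing step.

```python
import json
import statistics

# Hypothetical value standing in for the JSON string stored in the task result.
nested_scores_json = "[0.78, 0.81, 0.75, 0.80]"


def summarize_scores(scores_json):
    """Parse a JSON list of fold scores and return basic summary statistics."""
    scores = json.loads(scores_json)
    return {
        "n_folds": len(scores),
        "mean": statistics.fmean(scores),
        # Population standard deviation, matching NumPy's default std() (ddof=0).
        "std": statistics.pstdev(scores),
    }


summary = summarize_scores(nested_scores_json)
print("mean {mean:.6f}, std. dev. {std:.6f} over {n_folds} folds".format(**summary))
```

Keeping results as JSON strings makes them easy to reuse outside the notebook, since any client that can read the task result can parse them without pickled Python objects.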

In [None]:
#%print_task_output(task_name=cross_validation)