# Delivery

## Overview

In this part, we will deliver the featurization job to a remote server and execute it there. This can actually be done with just a few lines of code, but we will show much of the process "under the hood" to make you familiar with it and to explain why we have this setup.

Delivery is the most fundamental purpose of Training Grounds. It is extremely easy to write _some_ data science code that is executable on your local machine. It is not so easy, though, to then deliver this code to a remote server (be it a server for training or a web server that exposes the model to the world) so that everything continues to work.

Delivery in Training Grounds is built upon the following principles.

### Deliverables are pickled objects

We do not deliver chunks of code or notebooks. Instead, we deliver the objects that encapsulate this code.

The simplest way of doing this is to write a class that contains all the required functionality in its `run` method and deliver it. In the previous presentations you saw that the `FeaturizationJob` class is more complicated. We didn't have the functionality written in the `run` method; instead, this functionality was defined as a composition of smaller objects, according to SOLID principles. This is *not* a requirement of the delivery subsystem: delivery will work perfectly fine without any SOLID.

When prototyping, we recommend sticking to the simplest way, which is implementing everything in the `run` method. When the solution is developed enough, you may need to consider decomposing it into smaller classes in order to provide testability and reusability.
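
As a minimal sketch of this idea (the class `HelloJob` and the module name `hello_job_module` are hypothetical and not part of TG), an object with all its logic in `run` can be pickled, restored and executed. The class is written to a temporary module here only so that the unpickler can import it by name.

```python
import importlib
import pickle
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical deliverable: all functionality lives in the `run` method.
SOURCE = textwrap.dedent('''
    class HelloJob:
        def __init__(self, name):
            self.name = name

        def run(self):
            return "Hello from " + self.name
''')

# Place the class into an importable module: the unpickler resolves classes by module name.
workdir = Path(tempfile.mkdtemp())
(workdir / "hello_job_module.py").write_text(SOURCE)
sys.path.insert(0, str(workdir))
importlib.invalidate_caches()
hello_job_module = importlib.import_module("hello_job_module")

job = hello_job_module.HelloJob("job v1")
payload = pickle.dumps(job)        # these bytes are what would be delivered
restored = pickle.loads(payload)   # the receiver needs the bytes and the source code
result = restored.run()
print(result)
```

Note that the pickled bytes contain only the object's state and a reference to its class, so the source code has to be available wherever the object is unpickled.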

### The source code is delivered alongside the objects

In many frameworks there is an underlying idea that the framework has a comprehensive set of bug-free basic objects, and that any imaginable functionality can be composed from these. Users would then never need to write Python code again; instead, they would write declarative descriptions of the functionality they need. In this mindset, the delivery of the source code can be performed with `pip install`.

TG does not follow this approach, for several reasons:
* Frameworks seldom actually get to this stage of development
* Versioning is painful
* This mindset creates a complexity gap: doing something new, with no basic objects available, is a lot harder than using the constructor. In this regard, it is extremely important for us that the user can implement such prototype functionality in the `run` method without using any complex architecture.

Therefore, the source code changes rapidly. Publishing it via PyPI or `git` would create a very complicated setup in which delivery requires a lot of intermediate stages, such as committing, pushing, tagging or publishing.

The simpler solution is to package the current source code into a Python package, placing the pickled objects as resources inside this package. No external actions are required in this case: the objects are inseparable from the source code, which prevents versioning issues.
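
The layout idea can be sketched with the standard library alone. This is an illustration under assumptions, not TG's packaging code: the folder names, the `resources` subfolder and the plain dict standing in for a real job object are all made up for the example.

```python
import pickle
import shutil
import tarfile
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# A stand-in for the project's source folder (hypothetical content).
source = root / "tg"
(source / "common").mkdir(parents=True)
(source / "__init__.py").write_text("")
(source / "common" / "__init__.py").write_text("")

# Build the package folder: a copy of the current source plus pickled resources.
package = root / "my_package"
shutil.copytree(source, package / "tg")
resources = package / "resources"
resources.mkdir()
payload = {"job": {"name": "job", "version": "v1"}}  # a plain dict stands in for a job object
for key, obj in payload.items():
    (resources / key).write_bytes(pickle.dumps(obj))

# One archive now carries both the code and the objects that depend on it.
archive = root / "my_package.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(package, arcname="my_package")

with tarfile.open(archive, "r:gz") as tar:
    names = sorted(tar.getnames())
print(names)
```

Because the archive is built from whatever the source folder contains at that moment, no commit, tag or publication step is involved.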

### Multiple versions

We wanted different versions of a model to be able to run at the same time. But how can we do that if the models are represented as packages? In Python, we cannot have two modules with the same name installed at the same time, so they have to have different names. This is why Training Grounds itself is not a Python package, but a folder inside your project.

Consider the file structure recommended by TG:
```
/myproject/tg/
/myproject/tg/common/
/myproject/tg/mylibrary/
/myproject/some_other_code_of_the_project
```

When building a package, these files will be transformed into something like:
```
/package_name/UID/
/package_name/UID/tg/
/package_name/UID/tg/common/
/package_name/UID/tg/mylibrary/
```

Note that everything outside of the original `/myproject/tg/` folder is ignored. So outside of the `tg` folder you can have data caches, temporary files, sensitive information (as long as it's not pushed to the repository) and so on. It will never be delivered anywhere. The corollary is that all the classes and functions you use in your object must be defined inside the `/tg/` folder. Otherwise, they will not be delivered.

The name of the TG module inside a package is actually `UID.tg`, with a different UID in each package. Hence, several versions of TG can be used at the same time! But that brings another limitation that must be observed inside the `tg` folder: all the references inside TG must be relative. They cannot refer to `tg`, because `tg` will become `{UID}.tg` at runtime on the remote server.
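
To see why relative references make this work, consider a small sketch in which two copies of the same `tg` folder live under different top-level names. The names `model_v1` and `model_v2` play the role of the UIDs, and the modules `helper` and `model` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

def write_version(root, uid, answer):
    """Create <root>/<uid>/tg/ with a helper module and a module importing it relatively."""
    tg = root / uid / "tg"
    tg.mkdir(parents=True)
    (root / uid / "__init__.py").write_text("")
    (tg / "__init__.py").write_text("")
    (tg / "helper.py").write_text(f"def answer():\n    return {answer!r}\n")
    # Relative import: it works no matter what the top-level package is called.
    (tg / "model.py").write_text(
        "from .helper import answer\n\ndef predict():\n    return answer()\n"
    )

root = Path(tempfile.mkdtemp())
write_version(root, "model_v1", "old")
write_version(root, "model_v2", "new")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Both versions are imported side by side, each resolving its own helper.
v1 = importlib.import_module("model_v1.tg.model")
v2 = importlib.import_module("model_v2.tg.model")
results = (v1.predict(), v2.predict())
print(results)
```

Had `model.py` used `from tg.helper import answer`, both copies would have looked for one and the same top-level `tg` module, and the two versions could no longer be told apart.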


### Hot Module Replacement

Now the question arises of how to use this package. We cannot write something like:

```
from UID.tg import *
```

because the name `UID` is formed dynamically. 

The solution is to install the module at runtime. During this process, the name becomes known, and then we can dynamically import from the module. Of course, importing classes or methods this way would not be handy, but remember that deliverables are objects, and these objects are pickled as the module resources. So all we need to do is to unpickle these objects, and all the classes and methods will be loaded dynamically by the unpickler.

This work is done by the `EntryPoint` class.
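
The mechanism can be illustrated with a small sketch. It is not the implementation of `EntryPoint`: the function `load_resource` below, the package layout and the `Job` class are assumptions made for the example, and the package is created in a temporary folder instead of being installed.

```python
import importlib
import pickle
import sys
import tempfile
import uuid
from pathlib import Path

# Build a throwaway package whose name is only known at runtime (hypothetical layout).
root = Path(tempfile.mkdtemp())
module_name = "pkg_" + uuid.uuid4().hex
tg = root / module_name / "tg"
tg.mkdir(parents=True)
(root / module_name / "__init__.py").write_text("")
(tg / "__init__.py").write_text("")
(tg / "jobs.py").write_text("class Job:\n    def run(self):\n        return 'done'\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Store a pickled object as a resource inside the package.
jobs = importlib.import_module(module_name + ".tg.jobs")
resources = root / module_name / "resources"
resources.mkdir()
(resources / "job").write_bytes(pickle.dumps(jobs.Job()))

def load_resource(package_name, resource_name):
    """Import the package by its runtime name and unpickle one of its resources."""
    package = importlib.import_module(package_name)
    location = Path(package.__file__).parent / "resources" / resource_name
    with open(location, "rb") as stream:
        return pickle.load(stream)  # the unpickler imports the required classes itself

loaded = load_resource(module_name, "job")
print(type(loaded).__module__, loaded.run())
```

The caller never names a class from the package: it only needs the package name as a string and the name of the resource.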

#### Note for advanced users

When a package is created, we pickle the objects under the local version of TG, so the classes are unavoidably pickled as `tg.SomeClass`, but we want to unpickle them as `UID.tg.SomeClass`. How is this achieved? Fortunately, pickling allows you to do some manipulations while pickling/unpickling, so we just replace all `tg.` prefixes with `UID.tg.` while building a package (the UID is already known at this time).

It is also possible to do the same trick when unpickling: if you want to transfer a previously packaged object into the current `tg` version, this is possible. Of course, it is your responsibility to ensure that the current TG is compatible with the older version. Later we will discuss a use case for that.
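
The text above does not show how TG performs the prefix replacement. One standard-library mechanism that allows such a remapping on the unpickling side is overriding `pickle.Unpickler.find_class`. The sketch below uses it with hypothetical names: `tgdemo` stands in for `tg`, `uid_123` for the UID, and `PrefixUnpickler` is our own helper.

```python
import importlib
import io
import pickle
import sys
import tempfile
from pathlib import Path

SOURCE = "class Job:\n    def run(self):\n        return __name__\n"

# The same source exists twice: as top-level `tgdemo` and as `uid_123.tgdemo`.
root = Path(tempfile.mkdtemp())
for package in (root / "tgdemo", root / "uid_123" / "tgdemo"):
    package.mkdir(parents=True)
    (package / "__init__.py").write_text("")
    (package / "jobs.py").write_text(SOURCE)
(root / "uid_123" / "__init__.py").write_text("")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

class PrefixUnpickler(pickle.Unpickler):
    """Resolve classes pickled under `old_prefix` from `new_prefix` instead."""

    def __init__(self, stream, old_prefix, new_prefix):
        super().__init__(stream)
        self.old_prefix = old_prefix
        self.new_prefix = new_prefix

    def find_class(self, module, name):
        if module == self.old_prefix or module.startswith(self.old_prefix + "."):
            module = self.new_prefix + module[len(self.old_prefix):]
        return super().find_class(module, name)

local_job = importlib.import_module("tgdemo.jobs").Job()
data = pickle.dumps(local_job)  # the class is recorded as tgdemo.jobs.Job
remapped = PrefixUnpickler(io.BytesIO(data), "tgdemo", "uid_123.tgdemo").load()
print(local_job.run(), remapped.run())
```

Swapping the two prefix arguments gives the reverse direction, that is, loading an object packaged under a UID into the local copy of the code.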

## Packaging

Consider the following job we want to deliver to the remote server and execute there.

In [1]:
from tg.common.datasets.featurization import FeaturizationJob, DataframeFeaturizer, InMemoryJobDestination
from tg.common.datasets.selectors import Selector
from tg.common.datasets.access import MockDfDataSource
import pandas as pd

destination = InMemoryJobDestination()

job = FeaturizationJob(
    name = 'job',
    version = 'v1',
    source = MockDfDataSource(pd.read_csv('titanic.csv')),
    featurizers = {
        'passengers': DataframeFeaturizer(row_selector = Selector.identity)
    },
    destination = destination,
    status_report_frequency=100
)

job.run()
destination.buffer['passengers'][0].head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


For details about the mentioned classes, we refer you to the previous demos. Essentially, the code above just passes the `titanic.csv` file through the TG machinery, decomposes it and reconstructs it again, without changing anything.

Let's build the package with this job:

In [2]:
from tg.common.delivery.packaging import PackagingTask, make_package

packaging_task = PackagingTask(
    name = 'titanic_featurization',
    version = '1',
    payload = dict(job = job),
)

info = make_package(packaging_task)
info.__dict__

{'task': <tg.common.delivery.packaging.packaging_dto.PackagingTask at 0x7f5060162050>,
 'module_name': 'titanic_featurization__1',
 'path': PosixPath('/home/yura/Desktop/repos/boilerplate-service-ml/temp/release/package/titanic_featurization__1-1.tar.gz')}

Here `PackagingTask` defines all the properties of the package, and `make_package` creates the package.

*NOTE*: `name` and `version` here are the name and version of a Python package.

If you create and install another package with the name `titanic_featurization` and a higher version, version 1 will be removed from the system, because Python does not allow you to have different versions of the same library installed at the same time. This is the way to go if you actually want to update the model.

If you want several models to be used at the same time, you need to incorporate the version into the name, e.g. `name='titanic_featurization_1'`.

Let us now install the created package. `make_package` stored a file in the local file system, and now we will install it. In the code, the installation results in an `EntryPoint` object.

In [3]:
from tg.common.delivery.packaging import install_package_and_get_loader

entry_point = install_package_and_get_loader(info.path)
entry_point.__dict__

{'module_name': 'titanic_featurization',
 'module_version': '1',
 'tg_module_name': 'titanic_featurization__1.tg',
 'python_module_name': 'titanic_featurization__1',
 'original_tg_module_name': 'tg',
 'resources_location': '/home/yura/anaconda3/envs/bo/lib/python3.7/site-packages/titanic_featurization__1/resources'}

Now we will load the job from the package. Note that the classes are indeed located in different modules.

In [4]:
loaded_job = entry_point.load_resource('job')
job, loaded_job

(<tg.common.datasets.featurization.featurization_job.FeaturizationJob at 0x7f5035ea1950>,
 <titanic_featurization__1.tg.common.datasets.featurization.featurization_job.FeaturizationJob at 0x7f5063a42c10>)

## Containering

Although we could just run the package on the remote server via ssh, a more suitable way is to use Docker. Training Grounds defines methods to build a Docker container out of the package.

In [5]:
from tg.common.delivery.packaging import ContaineringTask, make_container

ENTRY_FILE_TEMPLATE = '''
import {module}.{tg_name}.common.delivery.jobs.ssh_docker_job_execution as feat
from {module} import Entry
import logging

logger = logging.getLogger()

logger.info("Hello, docker!")
job = Entry.load_resource('job')
job.run()
logger.info(job.destination.buffer['passengers'][0])

'''

DOCKERFILE_TEMPLATE  = '''FROM python:3.7

{install_libraries}

COPY . /featurization

WORKDIR /featurization

COPY {package_filename} package.tar.gz

RUN pip install package.tar.gz

CMD ["python3","/featurization/run.py"]
'''

task = ContaineringTask(
    packaging_task = packaging_task,
    entry_file_name = 'run.py',
    entry_file_template=ENTRY_FILE_TEMPLATE,
    dockerfile_template=DOCKERFILE_TEMPLATE,
    image_name='titanic-featurization',
    image_tag='test'
)

make_container(task)


Now, we can run this container locally:

In [6]:
!docker run titanic-featurization:test

2021-04-08 12:06:22,651 INFO: Hello, docker!
2021-04-08 12:06:22,668 INFO: Featurization Job job at version v1 has started
2021-04-08 12:06:22,668 INFO: Fetching data
2021-04-08 12:06:22,695 INFO: 100 data objects are processed
2021-04-08 12:06:22,731 INFO: 200 data objects are processed
2021-04-08 12:06:22,767 INFO: 300 data objects are processed
2021-04-08 12:06:22,798 INFO: 400 data objects are processed
2021-04-08 12:06:22,825 INFO: 500 data objects are processed
2021-04-08 12:06:22,847 INFO: 600 data objects are processed
2021-04-08 12:06:22,863 INFO: 700 data objects are processed
2021-04-08 12:06:22,878 INFO: 800 data objects are processed
2021-04-08 12:06:22,896 INFO: Data fetched, finalizing
2021-04-08 12:06:22,910 INFO: Uploading data
2021-04-08 12:06:22,910 INFO: Featurization job completed
2021-04-08 12:06:22,911 INFO:      PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]


This `make_container` function is not "standard" or "universal": it just allows building containers that are suitable for SageMaker tasks and featurization jobs. So if you need more sophisticated containering, please check the source code of this function to understand how to create an analog of it. Most of the complicated work is done by packaging, so `make_container` really just fills templates with values and executes some shell commands.
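
The template-filling step can be sketched with plain `str.format`. The placeholder names come from the Dockerfile template shown earlier; the library list and the way the `RUN` lines are generated are assumptions for illustration, not the behaviour of `make_container`.

```python
# A shortened version of the Dockerfile template with the same placeholders.
DOCKERFILE_TEMPLATE = '''FROM python:3.7

{install_libraries}

COPY {package_filename} package.tar.gz

RUN pip install package.tar.gz
'''

libraries = ["pandas", "numpy"]  # hypothetical dependency list
install_libraries = "\n".join("RUN pip install " + lib for lib in libraries)

# Substitute the placeholders; the result is the text of a ready-to-build Dockerfile.
dockerfile = DOCKERFILE_TEMPLATE.format(
    install_libraries=install_libraries,
    package_filename="titanic_featurization__1-1.tar.gz",
)
print(dockerfile)
```

After writing such a file next to the package archive, the remaining work is a shell call such as `docker build`, which is not executed in this sketch.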

## SSH/Docker routine

Fortunately, you don't really need to do packaging or containering yourself, because we have higher-level interfaces for that: the `Routine` classes. For instance, `SSHDockerJobRoutine` allows you to execute your jobs in Docker on a remote server to which you have ssh access.

In [7]:
from tg.common.delivery.jobs import SSHDockerJobRoutine, DockerOptions

routine = SSHDockerJobRoutine(
    job = job,
    repository=None,
    remote_host_address=None,
    remote_host_user=None,
    options = DockerOptions(propagate_environmental_variables=[])
)

Most of the fields are set to `None`, because we are not going to actually start the remote job from this notebook. `SSHDockerJobRoutine` also offers less intrusive ways of running your code for debugging.

Using the `.attached` accessor, we can run the job in the same Python process in which your current code is executed. This is, of course, the fastest way to do that, and therefore it's preferable to use it to debug typos, wrong logic, etc.

In [8]:
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.INFO)

routine.attached.execute()

2021-04-08 14:06:23,736 INFO: Featurization Job job at version v1 has started
2021-04-08 14:06:23,740 INFO: Fetching data
2021-04-08 14:06:23,766 INFO: 100 data objects are processed
2021-04-08 14:06:23,785 INFO: 200 data objects are processed
2021-04-08 14:06:23,797 INFO: 300 data objects are processed
2021-04-08 14:06:23,813 INFO: 400 data objects are processed
2021-04-08 14:06:23,828 INFO: 500 data objects are processed
2021-04-08 14:06:23,849 INFO: 600 data objects are processed
2021-04-08 14:06:23,859 INFO: 700 data objects are processed
2021-04-08 14:06:23,872 INFO: 800 data objects are processed
2021-04-08 14:06:23,885 INFO: Data fetched, finalizing
2021-04-08 14:06:23,899 INFO: Uploading data
2021-04-08 14:06:23,901 INFO: Featurization job completed


The `.local` accessor builds the package and the container, then executes the container locally. This step allows debugging the following things:

* Whether your job is serializable. This is usually achievable by not using `lambda` syntax.
* Whether all the code the job uses is located inside the TG folder, and whether all the references are relative. If something is wrong, you will see an import error.
* Whether the environment variables are carried to Docker correctly.
* Whether you have sufficient permissions to start Docker.
* etc.

This step allows you to check the deliverability of your work.
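
The serializability point can also be checked in isolation, before building anything. A minimal sketch follows; the helper `is_picklable` is hypothetical, and `math.sqrt` merely stands in for any function that can be imported by name.

```python
import math
import pickle

def is_picklable(obj):
    """Return True if `obj` survives pickling, which delivery relies on."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

# An anonymous function has no importable name, so pickle cannot reference it.
with_lambda = {"row_selector": lambda row: row}
# A function that is importable by name is pickled as a reference to that name.
with_function = {"row_selector": math.sqrt}

print(is_picklable(with_lambda), is_picklable(with_function))
```

Replacing a `lambda` with a named function or a small class defined inside the TG folder is usually enough to make such an object deliverable.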

Unfortunately, Jupyter notebook does not show the output of `subprocess.call`, so the next cell will not produce any output. When running from the command line, you'll be able to see the output of packaging, containering and then the output of the running container.

The execution may take a while since there are many packages TG requires. You can check the progress in the console from which `jupyter notebook` was started.

In [9]:
routine.local.execute()

However, you can retrieve the logs from the container with the following useful method. Note that logs printed via `logging` are placed in stderr instead of stdout.

In [10]:
output, errors = routine.local.get_logs()
print(errors)

2021-04-08 12:06:37,555 INFO: Welcome to Training Grounds. This is Job execution via Docker/SSH
2021-04-08 12:06:37,568 INFO: Executing job job version v1
2021-04-08 12:06:37,568 INFO: Featurization Job job at version v1 has started
2021-04-08 12:06:37,569 INFO: Fetching data
2021-04-08 12:06:37,591 INFO: 100 data objects are processed
2021-04-08 12:06:37,623 INFO: 200 data objects are processed
2021-04-08 12:06:37,646 INFO: 300 data objects are processed
2021-04-08 12:06:37,673 INFO: 400 data objects are processed
2021-04-08 12:06:37,692 INFO: 500 data objects are processed
2021-04-08 12:06:37,715 INFO: 600 data objects are processed
2021-04-08 12:06:37,757 INFO: 700 data objects are processed
2021-04-08 12:06:37,799 INFO: 800 data objects are processed
2021-04-08 12:06:37,831 INFO: Data fetched, finalizing
2021-04-08 12:06:37,842 INFO: Uploading data
2021-04-08 12:06:37,842 INFO: Featurization job completed
2021-04-08 12:06:37,842 INFO: Job completed
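
The stderr behaviour is a default of the standard `logging` module and not something specific to TG. The sketch below shows it with a child Python process standing in for the container; the messages are made up.

```python
import subprocess
import sys

# Code for the child process: one message via logging, one via print.
CHILD = (
    "import logging\n"
    "logging.basicConfig(level=logging.INFO)\n"
    "logging.info('from logging')\n"
    "print('from print')\n"
)

# Run a child interpreter and capture both streams separately.
completed = subprocess.run(
    [sys.executable, "-c", CHILD], capture_output=True, text=True
)
print("stdout:", completed.stdout.strip())
print("stderr:", completed.stderr.strip())
```

This is why the cell above prints `errors` to see the job's log messages.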



`routine.remote` has the same interface as `routine.local`, and will run the container on the remote machine. The only problems you should have at this stage are permissions:
* to push to your docker registry
* to connect to the remote machine via SSH
* to execute `docker run` at the remote machine

## Summary 

In this demo, we delivered the job to the remote server and executed it there. That concludes the featurization-related part of the Training Grounds.

Note that the packaging and containering techniques are not specific to featurization and can process any code. In the subsequent demos, the same techniques will be applied to run training on the remote server as well.