# MLRepo - A Quick Introduction
In this notebook we give a quick introduction to working with the repository and explain its basic principles.

In [None]:
# Import everything you need to get started
import pandas as pd
import logging as logging

# Here start the repository specific imports
import pailab.repo as repo
import pailab.memory_handler as memory_handler
from pailab.repo_objects import RepoInfoKey, MeasureConfiguration
from job_runner.job_runner import SimpleJobRunner, JobState, SQLiteJobRunner

#You may set the loglevel and log-format here. 
#Note that the repository uses the logging module.
#FORMAT = "[%(filename)s:%(lineno)s - %(funcName)20s() ] %(message)s"
#logging.basicConfig(format=FORMAT, level=logging.DEBUG)
logging.basicConfig(level=logging.ERROR)

# Read the data
As an example machine learning task to illustrate how to work with the repository, we use the Boston housing data from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php), to which we have applied some preprocessing. The data consists of house prices together with the house features `'RM'`, `'LSTAT'`, and `'PTRATIO'`:
- `'RM'` is the average number of rooms among homes in the neighborhood.
- `'LSTAT'` is the percentage of homeowners in the neighborhood considered "lower class" (working poor).
- `'PTRATIO'` is the ratio of students to teachers in primary and secondary schools in the neighborhood.

We simply read the CSV file containing the data (also included in the repository) into a pandas dataframe.

In [None]:
data = pd.read_csv('housing.csv')
#data.shape

# Create a new repository
We first create a new repository for our task. The repository is the central object around which all functionality is built. Similar to a repository used for source control in classical software development, it contains all data and algorithms needed for the machine learning task. The repository needs storages for
- scripts containing the machine learning algorithms and interfaces,
- numerical objects such as arrays and matrices representing data, e.g. input data or data from the evaluation of the models,
- JSON documents representing parameters, e.g. training parameters and model parameters.

To keep things simple, we start with in-memory storages. Note that in-memory storages should only be used for testing and experimenting, since everything is lost when the session ends.

In addition to the storages, the repository needs a reference to a JobRunner, which the platform uses to execute machine learning jobs. For this example we use the simplest one, which executes everything sequentially in the same thread the repository runs in.
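
To make the idea of a sequential job runner concrete, here is a minimal standalone sketch. It does not use the repository's own classes: `ToySequentialJobRunner` and `ToyJobState` are hypothetical names invented for illustration, and the real SimpleJobRunner may work differently in its details. The sketch only shows the principle that a job is executed immediately in the calling thread and that the caller gets back a job id to query the outcome.

```python
import uuid
from enum import Enum


class ToyJobState(Enum):
    # Hypothetical states, not the JobState enum of the real job runner
    WAITING = 'waiting'
    SUCCESSFULLY_FINISHED = 'successfully_finished'
    FAILED = 'failed'


class ToySequentialJobRunner:
    """Runs every job immediately in the calling thread and records the outcome."""

    def __init__(self):
        self._job_info = {}  # job id -> dict with state, result and error

    def add(self, job, *args):
        # Register the job, run it right away, and return its id
        job_id = str(uuid.uuid4())
        info = {'state': ToyJobState.WAITING, 'result': None, 'error': None}
        self._job_info[job_id] = info
        try:
            info['result'] = job(*args)
            info['state'] = ToyJobState.SUCCESSFULLY_FINISHED
        except Exception as e:
            # Keep the error so that the caller can inspect it later
            info['error'] = repr(e)
            info['state'] = ToyJobState.FAILED
        return job_id

    def get_info(self, job_id):
        return self._job_info[job_id]


runner = ToySequentialJobRunner()
ok_id = runner.add(lambda a, b: a + b, 1, 2)
bad_id = runner.add(lambda: 1 / 0)
print(runner.get_info(ok_id)['state'], runner.get_info(ok_id)['result'])
print(runner.get_info(bad_id)['state'], runner.get_info(bad_id)['error'])
```

Because everything runs in one thread, a failing job does not crash the caller here; the failure is only visible through the stored job information.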

In [None]:
# setting up the repository
if True:
    # for the scripts as well as for the parameters we use the same simple in-memory handler
    handler = memory_handler.RepoObjectMemoryStorage() 
    # for the numerical objects (numpy) we use the simple NumpyMemoryStorage 
    numpy_handler = memory_handler.NumpyMemoryStorage()
    # and for the sake of being simple, we use a SimpleJobRunner
    job_runner = SimpleJobRunner(None)
else:
    from pailab.disk_handler import RepoObjectDiskStorage
    from pailab.numpy_handler_hdf import NumpyHDFStorage
    handler = RepoObjectDiskStorage('c:/temp/boston_housing_repo')
    numpy_handler = NumpyHDFStorage('c:/temp/boston_housing_repo') 
    job_runner = SQLiteJobRunner('c:/temp/job_runner.sqlite', None)
ml_repo = repo.MLRepo('test_user', handler, numpy_handler, handler, job_runner)
job_runner.set_repo(ml_repo)
ml_repo._job_runner = job_runner

## Adding data
The data in the repository is handled by two different data objects:
- RawData is the object containing real data.
- DataSet is the object containing the logical data, i.e. a reference to a RawData object together with a specification of which data from the RawData will be used. Here, one can specify a fixed version of the underlying RawData object (then changes to the RawData will not affect the derived DataSet) or a fixed or floating subset of the RawData by defining a start and end index that cut the derived data out of the original data.

Normally one adds RawData and then defines DataSets that are used to train or test a model, which is the approach taken in this notebook.
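
The difference between a fixed and a floating subset can be illustrated with a small standalone sketch. `ToyDataSet` is a hypothetical stand-in, not the repository's DataSet class, and it assumes Python slice semantics (the end index is exclusive, and `None` means "up to the current end of the raw data"); the real class may define its indices differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyDataSet:
    # Hypothetical stand-in for a DataSet: a window onto raw data, not a copy
    raw: List[float]
    start_index: int = 0
    end_index: Optional[int] = None  # None means floating: follow the end of raw

    def values(self):
        # Assumption: Python slice semantics, i.e. end_index is exclusive
        return self.raw[self.start_index:self.end_index]


raw = [float(i) for i in range(10)]
fixed = ToyDataSet(raw, 0, 6)        # fixed subset: always the first six points
floating = ToyDataSet(raw, 6, None)  # floating subset: everything from index 6 on

raw.append(10.0)  # appending raw data changes only the floating subset

print(len(fixed.values()))
print(len(floating.values()))
```

In this sketch the fixed subset is unaffected by the appended point, while the floating subset grows with the raw data.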

In [None]:
# Add RawData. A convenient way to add RawData is to use the method add_data.
# This method takes a pandas dataframe and the specification of which columns belong to the input
# and which to the targets.
ml_repo.add_data('boston_housing', data, input_variables=['RM', 'LSTAT', 'PTRATIO'], target_variables = ['MEDV'])
# create DataSet objects for training and test data
training_data = repo.DataSet('boston_housing', 0, 300, 
                            repo_info = {RepoInfoKey.NAME.value: 'training_data', RepoInfoKey.CATEGORY.value: repo.MLObjectType.TRAINING_DATA})
test_data = repo.DataSet('boston_housing', 301, None, 
                            repo_info = {RepoInfoKey.NAME.value: 'test_data',  RepoInfoKey.CATEGORY.value: repo.MLObjectType.TEST_DATA})
# add the objects to the repository. The method returns a dictionary of object names to version numbers of the added objects.
version_list = ml_repo.add([training_data, test_data], message = 'add training and test data')

When creating the DataSet we have to set two important pieces of information for the repository, given as a dictionary:
- The object name. Each object needs to have a unique name within the repository.
- The object type (category). In our example we specify that the DataSets are training and test data. Note that one can have only one training data object per repository, while the repository can contain many different test data sets.

Some may wonder what is now stored in *version_list*.
**When an object is added (regardless of whether it is a data object or some other object such as a parameter), it gets a version number. No object is ever removed; adding just creates a new version.** The add method returns a dictionary of the object names together with their version numbers.
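
The append-only behaviour described above can be sketched with a toy store. `ToyVersionedStore` is a hypothetical class written only for this illustration, and the use of consecutive integers as version numbers is an assumption; the repository's own versioning scheme may differ.

```python
class ToyVersionedStore:
    """Append-only store: adding never overwrites, it appends a new version."""

    def __init__(self):
        self._versions = {}  # name -> list of stored objects, oldest first

    def add(self, objects):
        # objects: dict mapping object name -> object to store
        # Returns a dict mapping each name to the version number just created
        result = {}
        for name, obj in objects.items():
            history = self._versions.setdefault(name, [])
            history.append(obj)
            result[name] = len(history) - 1
        return result

    def get(self, name, version=-1):
        # By default return the latest version
        return self._versions[name][version]


store = ToyVersionedStore()
first = store.add({'model_param': {'max_depth': 5}})
second = store.add({'model_param': {'max_depth': 2}})
print(first, second)
print(store.get('model_param', 0))  # the old version is still available
print(store.get('model_param'))     # the latest version
```

Since old versions are never removed, any earlier state of an object can still be retrieved by its version number.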

In [None]:
print(version_list)

## Adding a model
The next step is to define a model that will be used in the repository. A model consists of the following pieces:
- a script where the code for the model evaluation is defined, together with the function name of the evaluation method
- a script where the code for the model training is defined, together with the function name of the training method
- a model parameter object defining the model parameters, which must implement the correct interface so that it can be used within the repository (see the documentation on integrating new objects; normally there is nothing more to do than add *@repo_object_init()* to the line above your *__init__* method)
- a training parameter object defining training parameters (such as the number of optimization steps), if necessary for your algorithms (this one is optional)

**SKLearn models as an example**

We do not have to define the pieces described above ourselves if we use the sklearn module. Instead we can use the externals.sklearn_interface module, which wraps the sklearn package so that it can be used within the repository. This interface provides a simple method (add_model) to add an arbitrary sklearn model as a model that can be handled by the repository. This method adds a number of repo objects to the repository (according to the pieces described above):
- An object defining the function to be called to evaluate the model
- An object defining the function to be called to train the model
- An object defining the model
- An object defining the model parameter

For the following we use a decision tree (sklearn's `DecisionTreeRegressor`) as our model.

In [None]:
import externals.sklearn_interface as sklearn_interface
from sklearn.tree import DecisionTreeRegressor
sklearn_interface.add_model(ml_repo, DecisionTreeRegressor(), model_param={'max_depth': 5})

## Train the model
Now model training is very simple, since you have defined training and test data, methods to evaluate and fit your model, and the model parameters.
You can just call *run_training* on the repository, and the training is performed automatically.
The training job is executed via the JobRunner you specified when setting up the repository. All methods of the repository involving jobs return the job id when adding the job to the JobRunner, so that you can check the status of the task and see whether it finished successfully.

In [None]:
job_id = ml_repo.run_training()  
job_info = job_runner.get_info(job_id[0], job_id[1])
#print(job_info.trace_back)

## Run evaluation
To measure errors and to provide plots the model must be evaluated on all test and training datasets.

In [None]:
job_id = ml_repo.run_evaluation()
#info =job_runner.get_info(job_id[1]) 
#print(str(info.trace_back))

## Add and compute measures
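The two measures added in the next cell can be understood through a small standalone computation. This is a sketch under assumptions: it takes `MAX` to mean the maximum absolute error and `R2` to mean the usual coefficient of determination, and the helper functions `max_abs_error` and `r2` are hypothetical, not part of the repository.

```python
def max_abs_error(y_true, y_pred):
    # Largest absolute deviation between target and prediction
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    # Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]  # only the last prediction is off, by 2

print(max_abs_error(y_true, y_pred))
print(r2(y_true, y_pred))
```

A perfect prediction gives a maximum error of 0 and an R2 of 1; a single large outlier, as in this toy data, affects both measures noticeably.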

In [None]:
ml_repo.add_measure(MeasureConfiguration.MAX)
ml_repo.add_measure(MeasureConfiguration.R2)

In [None]:
job_ids = ml_repo.run_measures()

In [None]:
max_measure = ml_repo._get('DecisionTreeRegressor/measure/training_data/max')
print(str(max_measure.value))
max_measure = ml_repo._get('DecisionTreeRegressor/measure/test_data/max')
print(str(max_measure.value))

# Working with the repository

In [None]:
for k in repo.MLObjectType:
    names = ml_repo.get_names(k.value)
    for n in names: 
        print(n + '\t  ' + k.value)

In [None]:
for k in ml_repo.get_commits():
    print(str(k))

In [None]:
for k, v in job_runner._job_info.items():
    print(str(k) + ':  ' + str(v))
#job_runner.get_info('34484a2c-c225-11e8-9693-fc084a6691eb')

## Change model parameter, check consistency and train

In [None]:
param = ml_repo._get('DecisionTreeRegressor/model_param')
param.sklearn_params['max_depth'] = 2
version = ml_repo.add(param)

In [None]:
import pailab.tools as tools
#depp = ml_repo._get('DecisionTreeRegressor/model_param')
results = tools.check_model(ml_repo, 'DecisionTreeRegressor')
print(results)

In [None]:
ml_repo.run_training()

In [None]:
results = tools.check_model(ml_repo, 'DecisionTreeRegressor')
print(results)

In [None]:
ml_repo.run_evaluation()
ml_repo.run_measures()

In [None]:
measure = ml_repo._get('DecisionTreeRegressor/measure/test_data/r2',version = (0,100))
for x in measure:
    print(str(x.repo_info))
    break

In [None]:
ml_repo.get_names(repo.MLObjectType.MEASURE_CONFIGURATION)

In [None]:
m = ml_repo._get('DecisionTreeRegressor/measure/training_data/r2')

In [None]:
str(m.repo_info)

In [None]:
print(results)

In [None]:
data = ml_repo._get('boston_housing')

In [None]:
print(str(data.repo_info))

## Append RawData

In [None]:
train_data = ml_repo.get_training_data(full_object = False)
print(train_data.repo_info[RepoInfoKey.NAME] +': ' +str(train_data))
test_data = ml_repo.get_names(repo.MLObjectType.TEST_DATA)
for k in test_data:
    t = ml_repo._get(k)
    print(str(t)+ ' Version: ' + str(t.repo_info[RepoInfoKey.VERSION]))

In [None]:
from numpy import array
ml_repo.append_raw_data('boston_housing', x_data = array([[ 6.575, 4.98, 15.3]]), y_data =array([[504000.0]]))

In [None]:
print(train_data.repo_info[RepoInfoKey.NAME] +': ' +str(train_data))
for k in test_data:
    t = ml_repo._get(k)
    print(str(t) + ' Version: ' + str(t.repo_info[RepoInfoKey.VERSION]))

In [None]:
results = tools.check_model(ml_repo, 'DecisionTreeRegressor')
print(results)

# Repo-Analysis

In [None]:
import pailab.plot as plot

In [None]:
for j in range(2):
    training_data = ml_repo._get('training_data')
    training_data.end_index += 50
    ml_repo.add(training_data, message='add 50 datapoints to end_index')
    for i in range(6,12):
        #print(i)
        param = ml_repo._get('DecisionTreeRegressor/model_param')
        param.sklearn_params['max_depth'] = i
        version = ml_repo.add(param)
        ml_repo.run_training()
        ml_repo.run_evaluation()
        ml_repo.run_measures()
    

## Plotting
### Plot measures by parameter

In [None]:
import pailab.plot_helper as plt_helper
import pailab.plot as plot
#if False:
#print(pd.DataFrame(
#plt_helper.get_measure_by_model_parameter(ml_repo, 'DecisionTreeRegressor/measure/test_data/r2', 'max_depth')
#))
plot.measure_by_model_parameter(ml_repo, 'DecisionTreeRegressor/measure/test_data/r2', 'max_depth')


### Plot histograms

In [None]:
plot.histogram(ml_repo, 'test_data', x_coordinate = 'PTRATIO') #, y_coordinate='MEDV')

In [None]:
depp = ml_repo._get( 'training_data', version = (0,100))
#print(str(depp))
for x in depp:
    print(str(x)+ ', version: ' + str(x.repo_info[RepoInfoKey.VERSION]))

In [None]:
print(str(depp))