
[WIP] Introduce Dockerfile #135

Open · wants to merge 1 commit into master
Conversation

@p16i (Contributor) commented Feb 24, 2019

This is a first Dockerfile that aims to make the system more portable and easier to run, addressing #133.

The Dockerfile is structured such that:

  • The image contains pre-installed libraries: currently I start with Shogun.
  • Datasets are excluded from the image; the dataset directory has to be mounted while running.
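The layout described above might look roughly like the following sketch. This is not the PR's actual Dockerfile; the base image, package names, and paths are assumptions for illustration only:

```dockerfile
# Hypothetical sketch of the structure described above -- not the PR's Dockerfile.
FROM ubuntu:18.04

# Pre-install the benchmarked libraries; starting with Shogun only.
# (Package names here are illustrative and may differ per distribution.)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        python3 python3-pip libshogun-dev && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /usr/src/benchmarks
COPY . .

# Datasets are NOT baked into the image; mount them at runtime, e.g.:
#   docker run -v `pwd`/datasets:/usr/src/benchmarks/datasets benchmark ...
ENTRYPOINT ["make", "run"]
```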

This image is built using a modified config.yaml. In particular, Shogun's KMEANS and DTC sections are:

library: shogun
methods:
    KMEANS:
        run: ['metric']
        iteration: 3
        script: methods/shogun/kmeans.py
        format: [arff, csv, txt]
        datasets:
            - files: [ ['datasets/waveform.csv', 'datasets/waveform_centroids.csv'] ]
              options:
                clusters: 2

            - files: [ ['datasets/wine.csv', 'datasets/wine_centroids.csv'],
                       ['datasets/iris.csv', 'datasets/iris_centroids.csv'] ]
              options:
                clusters: 3
    DTC:
        run: ['timing', 'metric']
        script: methods/shogun/decision_tree.py
        format: [csv, txt, arff]
        datasets:
            - files: [ ['datasets/iris_train.csv', 'datasets/iris_test.csv', 'datasets/iris_labels.csv'] ]

Results from the executions:

Assuming the relevant datasets have already been downloaded via make datasets, the image is built with: docker build -t benchmark .

> docker run -v `pwd`/datasets:/usr/src/benchmarks/datasets benchmark  BLOCK=shogun METHODBLOCK=KMEANS
[INFO ] CPU Model:  Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[INFO ] Distribution:
[INFO ] Platform: x86_64
[INFO ] Memory: 3.8544921875 GB
[INFO ] CPU Cores: 2
[INFO ] Method: KMEANS
[INFO ] Options: {'clusters': 2}
[INFO ] Library: shogun
[INFO ] Dataset: waveform

           mlpack  matlab  scikit  mlpy    shogun  weka  elki  milk  dlibml
waveform        -       -       -     -  0.022962     -     -     -       -

[INFO ] Options: {'clusters': 3}
[INFO ] Library: shogun
[INFO ] Dataset: wine
[INFO ] Dataset: iris

       mlpack  matlab  scikit  mlpy    shogun  weka  elki  milk  dlibml
wine        -       -       -     -  0.000771     -     -     -       -
iris        -       -       -     -  0.000620     -     -     -       -

[INFO ] Options: {'clusters': 5}
[INFO ] Options: {'clusters': 6}
[INFO ] Options: {'clusters': 7}
[INFO ] Options: {'clusters': 26}
[INFO ] Options: {'clusters': 10}
[INFO ] Options: {'clusters': 75}
[INFO ] Options: {'centroids': 75}
> benchmarks $ docker run -v `pwd`/datasets:/usr/src/benchmarks/datasets benchmark  BLOCK=shogun METHODBLOCK=DTC
[INFO ] CPU Model:  Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[INFO ] Distribution:
[INFO ] Platform: x86_64
[INFO ] Memory: 3.8544921875 GB
[INFO ] CPU Cores: 2
[INFO ] Method: DTC
[INFO ] Options: None
[INFO ] Library: shogun
[INFO ] Dataset: iris

       mlpack  matlab  scikit    shogun  weka  milk  R
iris        -       -       -  0.000817     -     -  -

Could you please give me feedback or comments? Meanwhile, I will add more libraries to the image.

Update:

  • 2019/02/25: The image can be found on Docker Hub, heytitle/mlpack-benchmarks.

@zoq (Member) commented Feb 27, 2019

This looks good to me. I'm wondering if it's possible to somehow generate a list of each package/version. Also, I think the reason for excluding the datasets is to reduce the size? What if we tar the datasets folder?

@p16i (Contributor, Author) commented Feb 27, 2019

Hi, thanks for the comment.

I'm wondering if it's possible to somehow generate a list of each package/version.

Do you mean having a Docker image for each library?

Also, I think the reason for excluding the datasets is to reduce the size? What if we tar the datasets folder?

Yes, the dataset directory is 2 GB, which is too big to store in the image. If we keep it as an archive (.tar.gz), the size is around 700 MB. I'm not sure whether it's worth it.
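As a rough illustration of the trade-off being discussed, one can compare a directory's on-disk size with its .tar.gz archive. This sketch uses a small synthetic directory in place of the real ~2 GB datasets/ folder:

```shell
# Sketch: compare a directory's size with its compressed archive.
# A synthetic dummy directory stands in for the real datasets/ folder.
set -e
demo=$(mktemp -d)
dd if=/dev/urandom of="$demo/sample.csv" bs=1M count=5 status=none
tar -czf "$demo.tar.gz" -C "$(dirname "$demo")" "$(basename "$demo")"
du -sh "$demo" "$demo.tar.gz"
```

Random bytes barely compress, but real CSV text typically compresses well, which is consistent with the 2 GB to ~700 MB figure above.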

What do you think?

@zoq (Member) commented Feb 27, 2019

Do you mean having a Docker image for each library?

Yeah, I'm not sure that is something we should do, since in that case we would have to update not only the Dockerfile that contains all libraries but each single-library image as well.

Yes, the dataset directory is 2 GB, which is too big to store in the image. If we keep it as an archive (.tar.gz), the size is around 700 MB. I'm not sure whether it's worth it.

What do you think?

I see. I'd like to keep it as simple as possible, and sharing the dataset folder might not be the easiest solution. Can you think of anything else we could do? Perhaps 700 MB isn't that bad?

@p16i (Contributor, Author) commented Feb 27, 2019

Before investigating further, may I ask what your plan is for running this container?
Also, why do you think sharing the dataset directory isn't the easiest approach?

@zoq (Member) commented Feb 28, 2019

The easiest for me would be to have something that runs out of the box: docker run is all I need. What if we provide one Docker image that includes the datasets and another one without? What do you think?
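The two-image idea could be sketched as a tiny second Dockerfile layered on top of the base image. The image name and paths here are hypothetical, not part of this PR:

```dockerfile
# Hypothetical sketch: a datasets-included variant built on the base image.
# Assumes the base image was built as `docker build -t benchmark .`
# and that `make datasets` has populated ./datasets locally.
FROM benchmark
COPY datasets/ /usr/src/benchmarks/datasets/
```

The slim base image stays the default, while users who want a run-out-of-the-box experience pull the larger variant and skip the volume mount entirely.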
