
[WIP] Introduce Dockerfile #135

Open · wants to merge 1 commit into master
Conversation

@p16i (Contributor) commented Feb 24, 2019

This is a first Dockerfile that aims to make the system more portable and easier to run, addressing #133.

The Dockerfile is structured such that:

  • The image contains pre-installed libraries: currently I start with Shogun.
  • Datasets are excluded from the image; the dataset directory has to be mounted while running.
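The layout described above might look roughly like the following sketch. This is not the PR's actual Dockerfile; the base image, package names, and paths are assumptions for illustration only:

```dockerfile
# Hypothetical sketch of the structure described above -- not the PR's Dockerfile.
FROM ubuntu:18.04

# Pre-install the benchmarked libraries; starting with Shogun only.
# (Package names here are illustrative and may differ per distribution.)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        python3 python3-pip libshogun-dev && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /usr/src/benchmarks
COPY . .

# Datasets are NOT baked into the image; mount them at runtime, e.g.:
#   docker run -v `pwd`/datasets:/usr/src/benchmarks/datasets benchmark ...
ENTRYPOINT ["make", "run"]
```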

This image is built using a modified config.yaml. In particular, Shogun's KMEANS and DTC sections are:

library: shogun
methods:
    KMEANS:
        run: ['metric']
        iteration: 3
        script: methods/shogun/kmeans.py
        format: [arff, csv, txt]
        datasets:
            - files: [ ['datasets/waveform.csv', 'datasets/waveform_centroids.csv'] ]
              options:
                clusters: 2

            - files: [ ['datasets/wine.csv', 'datasets/wine_centroids.csv'],
                       ['datasets/iris.csv', 'datasets/iris_centroids.csv'] ]
              options:
                clusters: 3
    DTC:
        run: ['timing', 'metric']
        script: methods/shogun/decision_tree.py
        format: [csv, txt, arff]
        datasets:
            - files: [ ['datasets/iris_train.csv', 'datasets/iris_test.csv', 'datasets/iris_labels.csv'] ]

Results from the executions:

Assuming the relevant datasets have already been downloaded via make datasets, the image is built with: docker build -t benchmark .

> docker run -v `pwd`/datasets:/usr/src/benchmarks/datasets benchmark  BLOCK=shogun METHODBLOCK=KMEANS
[INFO ] CPU Model:  Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[INFO ] Distribution:
[INFO ] Platform: x86_64
[INFO ] Memory: 3.8544921875 GB
[INFO ] CPU Cores: 2
[INFO ] Method: KMEANS
[INFO ] Options: {'clusters': 2}
[INFO ] Library: shogun
[INFO ] Dataset: waveform

           mlpack  matlab  scikit  mlpy    shogun  weka  elki  milk  dlibml
waveform        -       -       -     -  0.022962     -     -     -       -

[INFO ] Options: {'clusters': 3}
[INFO ] Library: shogun
[INFO ] Dataset: wine
[INFO ] Dataset: iris

       mlpack  matlab  scikit  mlpy    shogun  weka  elki  milk  dlibml
wine        -       -       -     -  0.000771     -     -     -       -
iris        -       -       -     -  0.000620     -     -     -       -

[INFO ] Options: {'clusters': 5}
[INFO ] Options: {'clusters': 6}
[INFO ] Options: {'clusters': 7}
[INFO ] Options: {'clusters': 26}
[INFO ] Options: {'clusters': 10}
[INFO ] Options: {'clusters': 75}
[INFO ] Options: {'centroids': 75}
> benchmarks $ docker run -v `pwd`/datasets:/usr/src/benchmarks/datasets benchmark  BLOCK=shogun METHODBLOCK=DTC
[INFO ] CPU Model:  Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[INFO ] Distribution:
[INFO ] Platform: x86_64
[INFO ] Memory: 3.8544921875 GB
[INFO ] CPU Cores: 2
[INFO ] Method: DTC
[INFO ] Options: None
[INFO ] Library: shogun
[INFO ] Dataset: iris

       mlpack  matlab  scikit    shogun  weka  milk  R
iris        -       -       -  0.000817     -     -  -

Could you please give me feedback or comments? Meanwhile, I will add more libraries to the image.

Update:

  • 2019/02/25: The image can be found on Docker Hub, heytitle/mlpack-benchmarks.

@zoq (Member) commented Feb 27, 2019

This looks good to me. I'm wondering if it's possible to somehow generate a list of each package/version. Also, I think the reason for excluding the datasets is to reduce the size? What if we tar the datasets folder?

@p16i (Contributor, Author) commented Feb 27, 2019

Hi, thanks for the comment.

I'm wondering if it's possible to somehow generate a list of each package/version.

Do you mean having a Docker image for each library?

Also, I think the reason for excluding the datasets is to reduce the size? What if we tar the datasets folder?

Yes, the dataset directory is 2 GB, which is too big to store in the image. If we keep it as an archive (.tar.gz), the size is around 700 MB. I'm not sure whether it's worth it.
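As a rough illustration of the trade-off being discussed, one can compare a directory's on-disk size with its .tar.gz archive. This sketch uses a small synthetic directory in place of the real ~2 GB datasets/ folder:

```shell
# Sketch: compare a directory's size with its compressed archive.
# A synthetic dummy directory stands in for the real datasets/ folder.
set -e
demo=$(mktemp -d)
dd if=/dev/urandom of="$demo/sample.csv" bs=1M count=5 status=none
tar -czf "$demo.tar.gz" -C "$(dirname "$demo")" "$(basename "$demo")"
du -sh "$demo" "$demo.tar.gz"
```

Random bytes barely compress, but real CSV text typically compresses well, which is consistent with the 2 GB to ~700 MB figure above.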

What do you think?

@zoq (Member) commented Feb 27, 2019

Do you mean having a Docker image for each library?

Yeah, I'm not sure that is something we should do, since in that case we would have to update not only the Dockerfile that contains all libraries but each single-library image as well.

Yes, the dataset directory is 2 GB, which is too big to store in the image. If we keep it as an archive (.tar.gz), the size is around 700 MB. I'm not sure whether it's worth it.

What do you think?

I see. I'd like to keep it as simple as possible, and sharing the dataset folder might not be the easiest solution. Can you think of anything else we could do? Perhaps 700 MB isn't that bad?

@p16i (Contributor, Author) commented Feb 27, 2019

Before investigating further, may I ask what your plan is for running this container?
Also, why do you think sharing the dataset directory isn't the easiest approach?

@zoq (Member) commented Feb 28, 2019

The easiest for me would be to have something that runs out of the box: docker run is all I need. What if we provide one Docker image that includes the datasets and another one without? What do you think?
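The two-image idea could be sketched as a tiny second Dockerfile layered on top of the base image. The image name and paths here are hypothetical, not part of this PR:

```dockerfile
# Hypothetical sketch: a datasets-included variant built on the base image.
# Assumes the base image was built as `docker build -t benchmark .`
# and that `make datasets` has populated ./datasets locally.
FROM benchmark
COPY datasets/ /usr/src/benchmarks/datasets/
```

The slim base image stays the default, while users who want a run-out-of-the-box experience pull the larger variant and skip the volume mount entirely.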
