Add MLCube integration #1

Merged on Nov 10, 2022 (10 commits)

@davidjurado (Contributor) commented on Mar 4, 2022

DataPerf Speech Example - MLCube integration

Project setup

# Create Python environment and install MLCube Docker runner 
virtualenv -p python3 ./env && source ./env/bin/activate && pip install mlcube-docker

# Fetch the implementation from GitHub
git clone https://github.com/harvard-edge/dataperf-speech-example && cd ./dataperf-speech-example
git fetch origin pull/1/head:feature/MLCube-integration && git checkout feature/MLCube-integration

Project structure

[Diagram: project structure]

Task execution

# Run download task
mlcube run --task=download -Pdocker.build_strategy=always

# Run select task
mlcube run --task=select -Pdocker.build_strategy=always

# Run evaluate task
mlcube run --task=evaluate -Pdocker.build_strategy=always

Execute complete pipeline

# Run all steps
mlcube run --task=download,select,evaluate -Pdocker.build_strategy=always
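
To see which step fails, the same three tasks can also be run one at a time. The sketch below is a hypothetical wrapper, not part of this PR: `run_tasks` and the `MLCUBE_CMD` variable are names invented for illustration. Only the `mlcube run --task=... -Pdocker.build_strategy=always` invocation is taken from the commands above.

```shell
# Hypothetical helper (not part of this PR): run MLCube tasks one by one,
# stopping at the first task that fails.
# MLCUBE_CMD is an assumed override hook for dry runs; it defaults to the mlcube CLI.
run_tasks() {
    for task in "$@"; do
        echo "==> running task: ${task}"
        if ! ${MLCUBE_CMD:-mlcube} run --task="${task}" -Pdocker.build_strategy=always; then
            echo "task '${task}' failed; stopping" >&2
            return 1
        fi
    done
}
```

Calling `run_tasks download select evaluate` would then run the same steps as the comma-separated form, and setting `MLCUBE_CMD=echo` prints the commands instead of running them.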

davidjurado pushed a commit to davidjurado/dataperf-speech-example that referenced this pull request on Jun 1, 2022:
* test commit

* Delete unnecessary files. Add utils and constants for supporting functions in eval

* Add core supporting functions for model training and scoring

* Add main functionality to eval, and supporting utils functions. Update requirements

* Add folder structure. Add random training file and its results for testing setup. Minor fix to constant and setup file

* Add gitignore to ignore everything except test file. Delete selection folder since it is not necessary

* Add gitignore file to ignore all files in train sets except random_500.csv

* Simplify output readout to avoid bug

* Updated file and methods to match with previous design pattern

* Add data file as input in main function and yaml file so all paths in yaml are relative

* Add docker-compose file, and modify dockerfile, requirements and main accordingly

* Fixed type hinting as suggested in PR review

@colbybanbury (Collaborator) commented

@davidjurado How would someone specify the workspace/ directory to MLCube?

Also, is there a way to point to a file outside of workspace/? For example, config_files/?

@colbybanbury merged commit 8f9bb5c into harvard-edge:main on Nov 10, 2022

@davidjurado (Contributor, Author) commented

Hello @colbybanbury,

I'm sorry for the late reply; I didn't get a notification of your comment.

To specify a different workspace folder, use --workspace and provide the path, for example:

mlcube run --task=select --workspace=path/to/new_folder 

To point to a file outside the workspace folder, the task you want to run needs a parameter for that file. Parameters are defined in the mlcube.yaml file. For example, the select task has the following parameters:

select:
    # Run selection algorithm
    parameters:
      inputs:
        {
          allowed_training_set: { type: file, default: data/preliminary_evaluation_dataset/allowed_training_set.yaml },
          train_embeddings_dir: data/preliminary_evaluation_dataset/train_embeddings/,
        }
      outputs: { outdir: select_output/ }

Let's say we want to use a different allowed_training_set. We need to specify the name of the parameter to override and provide the absolute path of the new file we want to use:

mlcube run --task=select allowed_training_set=/Users/me/allowed_training_set.yaml
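
Because the override has to be an absolute path, it can help to resolve a relative path before passing it. This is a minimal sketch, assuming a POSIX shell. `abs_path` is a hypothetical helper name, and the `./configs/...` path in the usage comment is made up for illustration.

```shell
# Hypothetical helper (not part of this PR): print the absolute path of a file.
# Assumes the file's parent directory exists; symlinks in it are resolved.
abs_path() {
    # Resolve the directory inside a command substitution (a subshell),
    # so the caller's working directory is left unchanged.
    dir="$(cd "$(dirname "$1")" && pwd -P)" || return 1
    printf '%s/%s\n' "${dir}" "$(basename "$1")"
}

# Example usage (illustrative path):
#   mlcube run --task=select allowed_training_set="$(abs_path ./configs/allowed_training_set.yaml)"
```

The helper returns a non-zero status when the parent directory does not exist, so a mistyped path fails before MLCube is invoked.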
