Skip to content
This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Job submission v2 #2260

Merged
merged 36 commits into from
Apr 15, 2019
Merged

Job submission v2 #2260

merged 36 commits into from
Apr 15, 2019

Conversation

abuccts
Copy link
Member

@abuccts abuccts commented Mar 4, 2019

Back-end part for job submission v2 #2232.

Work Items

Exit Criteria

  • BERT end to end demo.
  • Distributed end to end example.

@abuccts abuccts added this to the Marketplace Demo milestone Mar 4, 2019
taskRoles:
<name>: String, required # Should be in taskRoles
preCommands:
- String, required # execute before $$commands$$
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 59, 61, 211 are still using old reference way $$. Please fix them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Remove previous `$$...$$` references
<param2>: value2Type # Can be referenced by `<% $parameters.param1 %>`, `<% $parameters.param2 %>`

jobRetryCount: Integer, optional # Default is 0
taskRoles:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please explain what taskRoles is. e.g., taskRoles defines the type of task (i.e., container). In OpenPAI protocol, one job can has one or multiple types of task, each in turn may has one or multiple tasks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

<name>: String, required # Name of the taskRole
protocol_version: String, optional # Protocol version, default is 2
instances: Integer, optional # Default is 1, instances of a taskRole
completion:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please explain what is "completion". e.g., Completion specifies how a job state changes in case of a task completion.
Please provide a link to our document for details.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

dockerImage: String, required # Should reference to a dockerimage defined in prerequisites
data: String, optional # Should reference to data defined in prerequisites, the whole data object in this taskRole can be referenced as `$data`
output: String, optional # Should reference to output defined in prerequisites, the whole output object in this taskRole can be referenced as `$output`
script: String, optional # Should reference to script defined in prerequisites, the whole script object in this taskRole can be referenced as `$script`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope of the reference: the reference is only valid inside this taskRole. User cannot reference data/output/script from another taskRole

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

# to handle that a component may interact with different component differently, user is encouraged to place the codes handling such difference in the "deployments" field.
# e.g., a job may get input data through wget, hdfc -dfs cp, copy, or just directly read from remote storage. This logic can be placed here.
# in summary, the deployments field is responsible to make sure the job to run properly in a deployment specific runtime environment.
# one could have many deployments, but only the first deployment can be activated at runtime. User can choose the deployment at job submission time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"but only the first deployment can be activated at runtime." but only one deployment can be activated at runtime (specified in "default")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

* Change protocolVersion type into number.
* Reserve protocolVersion field for job and prerequisite types.
* Merge attachments field into defaults.
Refine docs according to comments.
@coveralls
Copy link

Coverage Status

Coverage increased (+0.7%) to 53.287% when pulling 2b47b07 on feature/submission-v2 into 50ebc74 on master.

@coveralls
Copy link

coveralls commented Apr 1, 2019

Coverage Status

Coverage increased (+0.7%) to 53.314% when pulling 65fa657 on feature/submission-v2 into 50ebc74 on master.

}
let entrypoint = '';
commands = commands.map((command) => command.trim()).join(' && ');
const tokens = mustache.parse(commands, ['<%', '%>']);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bug in the version of mustache we are using is that the cache key of the parsed template is only the template source, w/o the delimiters, so if we are using different delimiters in the same cache, there may be parsing mistake occurs when parsing the same text with different delimiters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

memoryMB: 8192
gpu: 1
commands:
- python resnet50_trainer.py --train_data null

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we changed job description file to v2 format here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been changed in #2292, hasn't been merged yet.
Added in a separate directory protocol-marketplace and keep the original one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to marketplace-v2

Create new mustache writer instance to avoid conflict.

router.use('/template', templateRouter);
router.use('/jobs', jobV2Router);
router.use('/jobs', jobRouter);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced with Yifan, when we decide to switch to v2 api, we need to redirect the all remaining APIs to v1 before that time point so that the all components will set "api/v2" as the base URI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do in future PRs

sterowang
sterowang previously approved these changes Apr 1, 2019
Copy link

@sterowang sterowang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Update.
script: tensorflow_cnnbenchmarks
extraContainerOptions:
shmMB: 64
resourcePerInstance:
Copy link

@sterowang sterowang Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resourcePerInstance [](start = 4, length = 19)

resorucesPerInstance? #WontFix

Copy link
Member Author

@abuccts abuccts Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original one is resourcePerInstance, should we change to resourcesPerInstance? #WontFix

Copy link

@sterowang sterowang Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it contains multiple children, maybe resourcesPerInstance is more grammer correct? @fanyangCS can how do you think about it?


In reply to: 270776814 [](ancestors = 270776814)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change, Yarn also call it "resource".


In reply to: 270779984 [](ancestors = 270779984,270776814)

if ('prerequisites' in protocolObj) {
for (let item of protocolObj.prerequisites) {
prerequisites[item.type][item.name] = item;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you check the name duplication for the same type of prerequisites?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, the prerequisite with same type and same type will overwrite the previous one.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think is it useful to prompt users name duplicated? Maybe users sometimes forget to change the name when they create a new prerequisites by copying and modify the existing ones?


In reply to: 270776412 [](ancestors = 270776412)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, will throw bad request error if duplicate prerequisites or deployments exist.

@sterowang sterowang dismissed their stale review April 1, 2019 09:04

revoking review

memoryMB: Integer, required
gpu: Integer, required
ports:
<portLabel>: Integer, required, default is 0 # Only for host network, portLabel string in ^[A-Za-z0-9\-._~]+$ format.
Copy link

@sterowang sterowang Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required [](start = 30, length = 8)

Optional? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

cpu: Integer, required
memoryMB: Integer, required
gpu: Integer, required
ports:
Copy link

@sterowang sterowang Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Units? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

dockerImage: String, required # Should reference to a dockerimage defined in prerequisites.
# Scope of the reference `$data`, '$output', `$script`: the reference is only valid inside this task role.
# User cannot reference them from another task role. Reference for `$parameters` is global and shared among task roles.
data: String, optional # Should reference to data defined in prerequisites, the whole data object in this task role can be referenced as `$data`.
Copy link

@sterowang sterowang Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should reference to data defined in prerequisites, the whole data object in this task role can be referenced as $data. [](start = 29, length = 120)

What about describe as "Select data defined in prerequisites, target can be referenced as $data in this task role? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

data: String, optional # Should reference to data defined in prerequisites, the whole data object in this task role can be referenced as `$data`.
output: String, optional # Should reference to output defined in prerequisites, the whole output object in this task role can be referenced as `$output`.
script: String, optional # Should reference to script defined in prerequisites, the whole script object in this task role can be referenced as `$script`.
extraContainerOptions:
Copy link

@sterowang sterowang Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And description for output and script? @fanyangCS what your opinion? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link

@sterowang sterowang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

Update protocol docs.
Check duplicate prerequisites and deployments in protocol.
Copy link

@sterowang sterowang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

abuccts and others added 5 commits April 1, 2019 21:34
* Upgrade yaml to protocol in marketplace

Upgrade yaml to protocol in marketplace.

* Add descriptions for example yamls in marketplace

Add descriptions for example yamls in marketplace.

* Remove unit test using outdated yaml

Remove unit test using outdated yaml.

* Restore former marketplace and add a new one

Restore former marketplace and add a protocol marketplace.

* Update

Update.

* Rename protocol marketplace to marketplace v2

Rename protocol marketplace to marketplace v2.
Replace webhdfs callback with async functions.
Revert changes in previous marketplace, conflicts in #2292.
Update hdfs promises.
* Add protocol submission plugin

Add protocol submission plugin:
* import protocol yaml file
* view/edit protocol yaml file with monaco editor

Known issues:
* monaco editor cannot be bundled by webpack for plugin
* editor worker has the wrong base url
* language highlight and some editor styles do not work in plugin

* Update webpack config file

Update webpack config file.

* Update protocol form

Update protocol form.

* Add webpack public path

Add webpack public path.

* Support clone from job list

Support clone from job list.

* fix webpack public url

* fix cors web worker issue

* Update according to comments

Update according to comments.

* Use immutable update to clone state

Use immutable update to clone state.

* Add componentDidMount

Add `componentDidMount`.

* Update componentDidMount

Update `componentDidMount`.

* Add bootstrap css

Add bootstrap css.

* Fix monaco editor line number style

Pick monaco-hack style and fix line number margin.

* Use react-bootstrap

Use react-bootstrap.

* Add loading spinner

Add loading spinner.

* Fix tslint issue

Fix tslint issue.

* Use css modules in bootstrap styles

Use css modules in bootstrap styles.

* Remove string literal in styles

Remove string literal in styles.

* Add parameters textfield

Add parameters textfield.

* Fix tslint errors

Fix tslint errors.

* Update

Update.

* Update

Co-Authored-By: abuccts <xiongyf@yandex.com>

* Update

Update.

* Add features in monaco plugin

Add one feature in monaco plugin to reduce bundle size,
due to a bug in
https://github.com/Microsoft/monaco-editor-webpack-plugin/blob/v1.7.0/index.js#L38.

* Update

Update.
@abuccts
Copy link
Member Author

abuccts commented Apr 15, 2019

🚢, close #2232.

@abuccts abuccts merged commit abe33b2 into master Apr 15, 2019
sunqinzheng pushed a commit that referenced this pull request Apr 16, 2019
- [Docs] Finalize protocol spec: merged pai-proto branch.
- [Docs] Add new yaml files that follow new protocol: #2292.
- [Rest Server] Protocol yaml parsing, validation and rendering: #2274.
- [Rest Server] Convert protocol to framework launcher description file: #2291.
- [Rest Server] Refactor and update rest API v2: #2329.
- [Webportal] Add submission protocol plugin: #2461.
@abuccts abuccts deleted the feature/submission-v2 branch August 19, 2019 05:54
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants