-
Notifications
You must be signed in to change notification settings - Fork 544
Conversation
* Update PAI protocol yaml Update PAI protocol template and example according to the discussion.
Update protocol spec.
Add pai-protocol for job submission v2.
docs/pai-protocol.yaml
Outdated
taskRoles: | ||
<name>: String, required # Should be in taskRoles | ||
preCommands: | ||
- String, required # execute before $$commands$$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 59, 61, 211 are still using old reference way $$. Please fix them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Remove previous `$$...$$` references
docs/pai-protocol.yaml
Outdated
<param2>: value2Type # Can be referenced by `<% $parameters.param1 %>`, `<% $parameters.param2 %>` | ||
|
||
jobRetryCount: Integer, optional # Default is 0 | ||
taskRoles: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please explain what taskRoles is. e.g., taskRoles defines the type of task (i.e., container). In OpenPAI protocol, one job can has one or multiple types of task, each in turn may has one or multiple tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
docs/pai-protocol.yaml
Outdated
<name>: String, required # Name of the taskRole | ||
protocol_version: String, optional # Protocol version, default is 2 | ||
instances: Integer, optional # Default is 1, instances of a taskRole | ||
completion: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please explain what is "completion". e.g., Completion specifies how a job state changes in case of a task completion.
Please provide a link to our document for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
docs/pai-protocol.yaml
Outdated
dockerImage: String, required # Should reference to a dockerimage defined in prerequisites | ||
data: String, optional # Should reference to data defined in prerequisites, the whole data object in this taskRole can be referenced as `$data` | ||
output: String, optional # Should reference to output defined in prerequisites, the whole output object in this taskRole can be referenced as `$output` | ||
script: String, optional # Should reference to script defined in prerequisites, the whole script object in this taskRole can be referenced as `$script` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scope of the reference: the reference is only valid inside this taskRole. User cannot reference data/output/script from another taskRole
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
docs/pai-protocol.yaml
Outdated
# to handle that a component may interact with different component differently, user is encouraged to place the codes handling such difference in the "deployments" field. | ||
# e.g., a job may get input data through wget, hdfc -dfs cp, copy, or just directly read from remote storage. This logic can be placed here. | ||
# in summary, the deployments field is responsible to make sure the job to run properly in a deployment specific runtime environment. | ||
# one could have many deployments, but only the first deployment can be activated at runtime. User can choose the deployment at job submission time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"but only the first deployment can be activated at runtime." but only one deployment can be activated at runtime (specified in "default")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
* Change protocolVersion type into number. * Reserve protocolVersion field for job and prerequisite types. * Merge attachments field into defaults.
Refine docs according to comments.
} | ||
let entrypoint = ''; | ||
commands = commands.map((command) => command.trim()).join(' && '); | ||
const tokens = mustache.parse(commands, ['<%', '%>']); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a bug in the version of mustache we are using is that the cache key of the parsed template is only the template source, w/o the delimiters, so if we are using different delimiters in the same cache, there may be parsing mistake occurs when parsing the same text with different delimiters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
marketplace/job/caffe2_resnet50.yaml
Outdated
memoryMB: 8192 | ||
gpu: 1 | ||
commands: | ||
- python resnet50_trainer.py --train_data null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we changed job description file to v2 format here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been changed in #2292, hasn't been merged yet.
Added in a separate directory protocol-marketplace
and keep the original one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to marketplace-v2
Create new mustache writer instance to avoid conflict.
|
||
router.use('/template', templateRouter); | ||
router.use('/jobs', jobV2Router); | ||
router.use('/jobs', jobRouter); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with Yifan, when we decide to switch to v2 api, we need to redirect the all remaining APIs to v1 before that time point so that the all components will set "api/v2" as the base URI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do in future PRs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
script: tensorflow_cnnbenchmarks | ||
extraContainerOptions: | ||
shmMB: 64 | ||
resourcePerInstance: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resourcePerInstance [](start = 4, length = 19)
resorucesPerInstance? #WontFix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original one is resourcePerInstance
, should we change to resourcesPerInstance
? #WontFix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it contains multiple children, maybe resourcesPerInstance is more grammer correct? @fanyangCS can how do you think about it?
In reply to: 270776814 [](ancestors = 270776814)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to change, Yarn also call it "resource".
In reply to: 270779984 [](ancestors = 270779984,270776814)
if ('prerequisites' in protocolObj) { | ||
for (let item of protocolObj.prerequisites) { | ||
prerequisites[item.type][item.name] = item; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will you check the name duplication for the same type of prerequisites?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, the prerequisite with same type and same type will overwrite the previous one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think is it useful to prompt users name duplicated? Maybe users sometimes forget to change the name when they create a new prerequisites by copying and modify the existing ones?
In reply to: 270776412 [](ancestors = 270776412)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, will throw bad request error if duplicate prerequisites or deployments exist.
docs/pai-job-protocol.yaml
Outdated
memoryMB: Integer, required | ||
gpu: Integer, required | ||
ports: | ||
<portLabel>: Integer, required, default is 0 # Only for host network, portLabel string in ^[A-Za-z0-9\-._~]+$ format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
required [](start = 30, length = 8)
Optional? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
docs/pai-job-protocol.yaml
Outdated
cpu: Integer, required | ||
memoryMB: Integer, required | ||
gpu: Integer, required | ||
ports: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Units? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
docs/pai-job-protocol.yaml
Outdated
dockerImage: String, required # Should reference to a dockerimage defined in prerequisites. | ||
# Scope of the reference `$data`, '$output', `$script`: the reference is only valid inside this task role. | ||
# User cannot reference them from another task role. Reference for `$parameters` is global and shared among task roles. | ||
data: String, optional # Should reference to data defined in prerequisites, the whole data object in this task role can be referenced as `$data`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should reference to data defined in prerequisites, the whole data object in this task role can be referenced as
$data
. [](start = 29, length = 120)
What about describe as "Select data defined in prerequisites, target can be referenced as $data
in this task role? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
data: String, optional # Should reference to data defined in prerequisites, the whole data object in this task role can be referenced as `$data`. | ||
output: String, optional # Should reference to output defined in prerequisites, the whole output object in this task role can be referenced as `$output`. | ||
script: String, optional # Should reference to script defined in prerequisites, the whole script object in this task role can be referenced as `$script`. | ||
extraContainerOptions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And description for output and script? @fanyangCS what your opinion? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
Update protocol docs.
Check duplicate prerequisites and deployments in protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Upgrade yaml to protocol in marketplace Upgrade yaml to protocol in marketplace. * Add descriptions for example yamls in marketplace Add descriptions for example yamls in marketplace. * Remove unit test using outdated yaml Remove unit test using outdated yaml. * Restore former marketplace and add a new one Restore former marketplace and add a protocol marketplace. * Update Update. * Rename protocol marketplace to marketplace v2 Rename protocol marketplace to marketplace v2.
Replace webhdfs callback with async functions.
Revert changes in previous marketplace, conflicts in #2292.
Update hdfs promises.
* Add protocol submission plugin Add protocol submission plugin: * import protocol yaml file * view/edit protocol yaml file with monaco editor Known issues: * monaco editor cannot be bundled by webpack for plugin * editor worker has the wrong base url * language highlight and some editor styles do not work in plugin * Update webpack config file Update webpack config file. * Update protocol form Update protocol form. * Add webpack public path Add webpack public path. * Support clone from job list Support clone from job list. * fix webpack public url * fix cors web worker issue * Update according to comments Update according to comments. * Use immutable update to clone state Use immutable update to clone state. * Add componentDidMount Add `componentDidMount`. * Update componentDidMount Update `componentDidMount`. * Add bootstrap css Add bootstrap css. * Fix monaco editor line number style Pick monaco-hack style and fix line number margin. * Use react-bootstrap Use react-bootstrap. * Add loading spinner Add loading spinner. * Fix tslint issue Fix tslint issue. * Use css modules in bootstrap styles Use css modules in bootstrap styles. * Remove string literal in styles Remove string literal in styles. * Add parameters textfield Add parameters textfield. * Fix tslint errors Fix tslint errors. * Update Update. * Update Co-Authored-By: abuccts <xiongyf@yandex.com> * Update Update. * Add features in monaco plugin Add one feature in monaco plugin to reduce bundle size, due to a bug in https://github.com/Microsoft/monaco-editor-webpack-plugin/blob/v1.7.0/index.js#L38. * Update Update.
🚢, close #2232. |
- [Docs] Finalize protocol spec: merged pai-proto branch. - [Docs] Add new yaml files that follow new protocol: #2292. - [Rest Server] Protocol yaml parsing, validation and rendering: #2274. - [Rest Server] Convert protocol to framework launcher description file: #2291. - [Rest Server] Refactor and update rest API v2: #2329. - [Webportal] Add submission protocol plugin: #2461.
Back-end part for job submission v2 #2232.
Work Items
Exit Criteria