Find file
Fetching contributors…
Cannot retrieve contributors at this time
83 lines (72 sloc) 3.82 KB
* add workers node failure management
(reschedule jobs not yet done, FIFO, when all commands were sent once, remove
commands from the TODO list only once we received their output)
Each command should have a list of identifiers.
If the command is resent, a new identifier should be added to the list.
Reception of any of the correct identifiers should complete the corresponding
Results with unknown identifiers are logged.
* add a -z option to enable compression of commands/results
* add a -c option to encrypt commands/results
* add --load-probe option to pause/resume jobs depending on load
give a signal number as parameter, so that people can say their software
to checkpoint
* start processing results before workers are finished
do we really need locks in worker threads?
* use the same logger than in instead of print
use a shorter log format for time also
* main became a pretty big function, cut it into sub-functions so we
can read it on one screen
* rewrite using UDP once worker failure is supported?
- rewrite using simple sockets and no more Pyro?
(server would call poll then)
* add requesting groups of jobs instead of one at a time
* support being client of another PAR server and accepting clients
* add a distributed test and also another but parallel and distributed
I just did one by hand, it is not so difficult using ssh and commands
that just sleep some time then echo something else in order to mimic
they are CPU (but not data) intensive
* think about the security of the client-server model:
- a client shouldn't accept to execute commands from an untrusted server
(commands could be any Unix command, including rm)
- server shouldn't accept to give commands to untrusted clients
(this would deplete the commands list, probably without doing the
corresponding jobs, and this would expose the commands)
- a client for which command execution result in errors should not be able
to deplete the commands list (at least, we should detect they were not
performed and reschedule them later on. At most, we should not send him
commands too fast)
Here is an idea for a safe communication protocol using asymmetric
1) The server compress | encrypt | sign commands prior to sending them
to workers.
Compression could be a real time one like LZOP. This would decrease
commands size and increase encryption quality.
2) The workers verify signature | decrypt | decompress commands.
If any of the above fails, it is logged and corresponding command is
3) Workers can encrypt their results like the server does for commands:
(compress | encrypt | sign).
Also, look at some MPI document you saw with a simple security model.
* profile with the python profiler
* add a code coverage test, python is not compiled and pychecker is too
light at checking things
* generate tests to verify each option have an effect when used on the CLI
* The client could be at the same time a server for other clients in order to
scale by using hierarchical layers of servers instead of having only one
server managing too many clients (Russian doll/fractal architecture).
This would be only useful in a multi-cluster environment, and such bridge
between cluster servers would be easily a bottleneck for data transfer.
This kind of "special client" should request several jobs at a time, and
send several results at a time (because he is the proxy for maybe an entire
cluster, not just a "standard client").
* a little bit related with previous idea, if nodes have same speed
and jobs require same amount of computations, nodes could request
several jobs at the same time and do data prefetching for the next job
while computing another