
adding my crossvalidation scripts to give people idea about how i use these
1 parent 18c5777 commit 7513852b11375f4df39cf3ac8f59fab5b79d8be2 @karpathy committed Nov 20, 2015
Showing with 475 additions and 0 deletions.
  1. +34 −0 cv/README.md
  2. +64 −0 cv/driver.py
  3. +359 −0 cv/inspect_cv.ipynb
  4. +2 −0 cv/killall.sh
  5. +5 −0 cv/runworker.sh
  6. +11 −0 cv/spawn.sh
cv/README.md
@@ -0,0 +1,34 @@
+
+## Cross-validation utilities
+
+### Starting workers on different GPUs
+
+I thought I should do a small code dump of my cross-validation utilities. My workflow is to run on a single machine with multiple GPUs. Each worker runs on one GPU, and I spawn workers with the `spawn.sh` script, e.g.:
+
+```bash
+$ ./spawn.sh 0 1 2 3 4 5 6
+```
+
+spawns 7 workers using GPUs 0-6 (inclusive), all running in screen sessions `ak0`...`ak6`. E.g., to attach to one of them use `screen -r ak0`. Use `CTRL+a, d` to detach from a screen session and `CTRL+a, k, y` to kill a worker, or run `./killall.sh` to kill all workers at once.
+
+You can see that `spawn.sh` calls `runworker.sh` inside a screen session. The runworker script can modify the paths (since `LD_LIBRARY_PATH` does not transfer into screen sessions), and then calls `driver.py`.
+
+Finally, `driver.py` runs an infinite loop that actually calls the training script, and this is where I set up all the cross-validation ranges. Also note, very importantly, how the `train.lua` script is called, with
+
+```python
+cmd = 'CUDA_VISIBLE_DEVICES=%d th train.lua ' % (gpuid, )
+```
+
+This is because Torch otherwise allocates a lot of memory on every GPU of the machine (it wants to support multi-GPU setups), so if you're only training on a single GPU you really want to use this flag to *hide* the other GPUs from each worker.
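+
+Because the other devices are hidden, the selected physical GPU is enumerated as device 0 inside the worker, which is why `driver.py` below always passes `-gpuid 0` to `train.lua`. The same masking can also be done through the child process environment; here is a minimal sketch of that variant (the `launch_on_gpu` helper is hypothetical, `driver.py` itself just prefixes the shell command):
+
+```python
+import os
+import subprocess
+
+def launch_on_gpu(gpuid, train_args):
+    # copy the parent environment, but expose only the selected GPU;
+    # inside the child process that GPU then shows up as device 0
+    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpuid))
+    return subprocess.call(['th', 'train.lua'] + train_args, env=env)
+
+# e.g. launch_on_gpu(3, ['-gpuid', '0', '-batch_size', '16'])
+```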
+
+Also note that I use the field `opt.id` to assign a unique identifier to each run, based on the GPU the worker is running on, a random number, and the current time, so that runs can be told apart.
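+
+Concretely, the identifier is built from those three pieces, roughly like this (mirroring `driver.py` below, which also reuses the id to name the worker's log file):
+
+```python
+from random import randrange
+import time
+
+gpuid = 0  # physical GPU this worker owns
+# "<gpu id>-<random suffix>-<unix time>", e.g. "0-565-1447975213"
+run_id = '%d-%d-%d' % (gpuid, randrange(1000), int(time.time()))
+```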
+
+Have a look through my `driver.py` to get a sense of what it's doing. In my workflow I keep modifying this script and killing workers whenever I want to tune some of the cross-validation ranges.
+
+### Playing with checkpoints that get written to files
+
+Finally, the IPython Notebook `inspect_cv.ipynb` gives you an idea about how I analyze the checkpoints that get written out by the workers. The notebook is *super-hacky* and not intended for plug and play use; I'm only putting it up in case this is useful to anyone to build on, and to get a sense for the kinds of analysis you might want to play with.
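+
+If you just want the flavor of that analysis without the notebook, the basic pattern is to scan the checkpoint JSON files the workers write and rank runs by validation score. A minimal sketch of that idea (the field names `opt`, `val_lang_stats_history` and the `CIDEr` key, as well as the checkpoint path, are assumptions here, so check what your `train.lua` actually writes out):
+
+```python
+import glob
+import json
+
+# collect (best validation score, a few hyperparameters) for every checkpoint json
+results = []
+for fname in glob.glob('/path/to/checkpoints/model_id*.json'):  # hypothetical path
+    with open(fname) as f:
+        cp = json.load(f)
+    opt = cp.get('opt', {})
+    hist = cp.get('val_lang_stats_history', {})  # assumed field name
+    if not hist:
+        continue
+    best = max(stats.get('CIDEr', 0.0) for stats in hist.values())
+    results.append((best, opt.get('learning_rate'), opt.get('optim_beta'), fname))
+
+# best runs first, to eyeball which hyperparameter ranges work well
+for best, lr, beta, fname in sorted(results, key=lambda r: r[0], reverse=True)[:20]:
+    print('%.3g  lr=%s  beta=%s  %s' % (best, lr, beta, fname))
+```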
+
+### Conclusion
+
+Overall, this system works quite well for me. My cluster machines run workers in screen sessions, these write checkpoints to a shared file system, and then I use notebooks to look at what hyperparameter ranges work well. Whatever works well I encode into `driver.py`, and then I restart the workers and iterate until things work well :) Hope some of this is useful & Good luck!
cv/driver.py
@@ -0,0 +1,64 @@
+import os
+from random import uniform, randrange, choice
+import math
+import time
+import sys
+import json
+
+def encodev(v):
+ if isinstance(v, float):
+ return '%.3g' % v
+ else:
+ return str(v)
+
+assert len(sys.argv) > 1, 'specify gpu/rnn_size/num_layers!'
+gpuid = int(sys.argv[1])
+
+cmd = 'CUDA_VISIBLE_DEVICES=%d th train.lua ' % (gpuid, )
+while True:
+ time.sleep(1.1+uniform(0,1))
+
+ opt = {}
+ opt['id'] = '%d-%0d-%d' % (gpuid, randrange(1000), int(time.time()))
+ opt['gpuid'] = 0
+ opt['seed'] = 123
+ opt['val_images_use'] = 3200
+ opt['save_checkpoint_every'] = 2500
+
+ opt['max_iters'] = -1 # run forever
+ opt['batch_size'] = 16
+
+ #opt['checkpoint_path'] = 'checkpoints'
+
+ opt['language_eval'] = 1 # do eval
+
+ opt['optim'] = 'adam'
+ opt['optim_alpha'] = 0.8
+ opt['optim_beta'] = choice([0.995, 0.999])
+ opt['optim_epsilon'] = 1e-8
+ opt['learning_rate'] = 10**uniform(-5.5,-4.5)
+
+ opt['finetune_cnn_after'] = -1 # dont finetune
+ opt['cnn_optim'] = 'adam'
+ opt['cnn_optim_alpha'] = 0.8
+ opt['cnn_optim_beta'] = 0.995
+ opt['cnn_learning_rate'] = 10**uniform(-5.5,-4.25)
+
+ opt['drop_prob_lm'] = 0.5
+
+ opt['rnn_size'] = 512
+ opt['input_encoding_size'] = 512
+
+ opt['learning_rate_decay_start'] = -1 # dont decay
+ opt['learning_rate_decay_every'] = 50000
+
+ opt['input_json'] = '/scr/r6/karpathy/cocotalk.json'
+ opt['input_h5'] = '/scr/r6/karpathy/cocotalk.h5'
+
+ #opt['start_from'] = '/scr/r6/karpathy/neuraltalk2_checkpoints/good6/model_id0-565-1447975213.t7'
+
+ optscmd = ''.join([' -' + k + ' ' + encodev(v) for k,v in opt.iteritems()])
+ exe = cmd + optscmd + ' | tee /scr/r6/karpathy/neuraltalk2_checkpoints/out' + opt['id'] + '.txt'
+ print exe
+ os.system(exe)
+
cv/inspect_cv.ipynb
(notebook diff not rendered)
cv/killall.sh
@@ -0,0 +1,2 @@
+screen -ls | grep ak | cut -d. -f1 | awk '{print $1}' | xargs kill
+
cv/runworker.sh
@@ -0,0 +1,5 @@
+#!/bin/bash
+
+echo "worker $1 is starting. Exporting LD_LIBRARY_PATH then running driver.py"
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$TORCHPATH/lib:/usr/local/lib:/usr/local/cuda/lib64:/home/stanford/cudnn_r3:/home/stanford/cudnn_r3/lib64
+python driver.py $1
cv/spawn.sh
@@ -0,0 +1,11 @@
+# will spawn workers on the given GPU ids, in screen sessions prefixed with "ak"
+for i in "$@"
+do
+
+ echo "spawning worker on GPU $i..."
+ screen -S ak$i -d -m ./runworker.sh $i
+
+ sleep 2
+done
+
+
