Lead developer and maintainer: Simo Goshev
Developers: Jason Bowman
Group: BC Research Services
Although a fairly large number of commands in Stata are internally parallelized,
the speed of execution of specific algorithms such as bootstrapping, jackknifing and imputation
could be accelerated by utilizing a computing cluster. The aim of package parallelize
is to help researchers
with parallelizing their analyses and submitting jobs directly from their local
copy of Stata to the Linux computing cluster at Boston College (and potentially any
cluster running Torque(PBS)).
To load package parallelize
, include the following line in your do file:
do "https://raw.githubusercontent.com/goshevs/parallelize/master/ado/parallelize.ado"
Over the past several months, we reached a couple of important milestones:
-
Pulling data directly from Box, thus eliminating a series of intermediate steps. We are currently developing the Stata interface to python and also aim to provide seemless uploading functionality.
-
Developed and tested successfully the job submission, monitoring and output collection functionality (currently streamlining query and collection).
-
Added support for
pchained
and user-written routines via script importation. -
Added a post-parallelize program for submitting job-related requests to the cluster. Currently supports requests for progress checking and output retrieval.
Development continues!
parallelize
is used to define a connection, decribe the specifics of the job and
submit jobs to the computing cluster
parallelize, CONspecs(string) [JOBspecs(string) DATAspecs(string)
imports(string) EXECspecs(string)]: command
parallelize
takes the following arguments:
Required
argument | description |
---|---|
CONspecs | connection specification; two flavors, see below for syntax |
command | command to be parallelized on the cluster |
Optional and conditionally required arguments:
argument | description |
---|---|
JOBspecs | the specification of a parallel job; see below for syntax |
DATAspecs | specification of the data to be used; see below for syntax |
imports | location of work, aggregation and monitoring files; see below for syntax |
EXECspecs | execution specifications; see below for syntax |
Syntax for CONspecs
CONspecs
can be specified in two ways:
con(configFile="" profile="")
, whereconfigFile
is the path and file name of the configuration file to be used byssh
to connection to the clusterprofile
is the name of the profile in the configuration file to be used, or
con(sshHost="")
, where:sshHost
is the name of the host in the sshconfig
file located in.ssh/
to be used to connect to the cluster
The configuration file should be specified in this format.
Syntax for JOBspecs
JOBspecs
defines the resource requirements for a parallel job. It has the following syntax:
job(nodes="" ppn="" pmem="" walltime="" jobname="")
where:
nodes
is the number of nodes requestedppn
is the number of virtual processors per nodepmem
is the RAM per processorwalltime
is the length of time allocated to the job, or job's runtimejobname
is the name that will be applied to all parallel jobs
Syntax for DATAspecs
DATAspecs
defines the data file and its location. It is specified in the following way:
data(inFile="" loc="" argPass="")
where:
inFile
should include the path and name of the data fileloc
takes the values oflocal
,cluster
, orbox
to indicate where the data file is housed.argPass
takes a string with information that the user wishes to pass to their do files. This is an optional argument.
Syntax for imports
imports
defines the location of work, aggregation and monitoring files. It is specified in the following way:
imports(work="" coll="" mon="")
where:
work
should include the path and name of the do file to be executed by each worker on the clustercoll
should include the path and name of the do file that instructs Stata how to combine the output provided by the workersmon
should include the path and name of the do file that instructs Stata how to monitor the submission and re-submit jobs if necessary
There are special rules for writing these import files. More details to come.
Syntax for EXECspecs
EXECspecs
defines execution parameters. It has the following syntax:
exec(nrep="" pURL="" cbfreq="" email="" )
where:
nrep
is the number of parallel jobs neededpURL
is the URL of ado
orado
file which has to be imported prior to runningcommand
. This is an optional argumentcbfreq
is the callback frequency of the monitoring process (could be defined in seconds, minutes, hours and days)email
instructs Torque to send an email to the specified email address once all jobs are completed.
callCluster
is used to send requests to the cluster. Currently the program
supports requests for checking progress and retrieving output.
callCluster, Request(string asis) [CONspecs(string asis) JOBspecs(string asis)
OUTloc(string asis) KEEPremote]
callCluster
takes the following arguments:
Required
argument | description |
---|---|
Request | type of request; currently checkProgress and pullData are supported |
Optional and conditionally required arguments:
argument | description |
---|---|
CONspecs | connection specification; syntax identical to the one used in parallelize |
JOBspecs | takes jobname is an agrument and its syntax is job(jobname="") ; jobname is the name of the job of interest |
OUTloc | the directory on the user's machine where output should be copied to; default is ~/Desktop |
KEEPremote | instructs Stata to keep the job directorty and related files on the cluster; by default Stata will erase all related files |
Both CONspecs
and JOBspecs
are required arguments if callCluster
is not
run immediately after parallelize
or the sreturn
has been cleared.
*** Define basepath
local pathBasename "~/Desktop/gitProjects/parallelize"
*** Load the ado's
do "`pathBasename'/ado/parallelize.ado" // we should pull this from gitHub
************************************
*** GENERIC BOOTSTRAP
sysuse auto
save "`pathBasename'/examples/data/myboot"
*** Define locations
local locConf "`pathBasename'/config/config1"
local locData "c:/Users/goshev/Desktop/gitProjects/parallelize/examples/data/myboot.dta"
local locWork "`pathBasename'/imports/mybootWork.do"
local locColl "c:/Users/goshev/Desktop/gitProjects/parallelize/imports/mybootCollect.do"
local locMon "c:/Users/goshev/Desktop/gitProjects/parallelize/imports/genericMonitor.do"
local eMailAddress ""
*** Execute custom command in parallel
parallelize, ///
con(sshHost="sirius") /// con(configFile = "`locConf'" profile="sirius") ///
job(nodes="1" ppn="1" pmem="1gb" walltime="00:05:00" jobname="myBoot") ///
data(path= "`locData'" loc="local") ///
imports(work="`locWork'" coll="`locColl'" mon="`locMon'") ///
exec(nrep="5" cbfreq="30s" email="`eMailAddress'"): ///
regress price mpg trunk headroom i.foreign, robust
*** Check progress
callCluster, r(checkProgress)
*** Retrieve data
callCluster, r(pullData)