Brian Quistorff edited this page Sep 3, 2021 · 9 revisions

Here are some examples of ways of parallelizing code. Also see the examples in the help file (html version).

  1. Reproducible results with RNGs
  2. Example using pll_instance macros
  3. Generalized Append

Example using pll_instance macros

Some users need a more tailored solution when using parallel. One way to get it is through the global macros pll_instance (the number of the current instance) and PLL_CLUSTERS (the total number of instances), which are defined within each instance. Here is a trivial but perhaps useful example:

clear all
set more off
set trace off

// Using four instances (clusters)
parallel setclusters 4

// Generating a variable called code that takes integer values from 1 to 4
sysuse auto
set seed 112321
gen code = floor(runiform()*4) + 1
tab code
/*
       code |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         20       27.03       27.03
          2 |         11       14.86       41.89
          3 |         21       28.38       70.27
          4 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

*/

// Saving the data to disk so that each instance can load its own subset
save mytempdata, replace
clear

// Program that loads the subset for the current instance, collapses it,
// and stores the result in its own dataset
program myprogram
	use if code == $pll_instance using mytempdata.dta, clear
	collapse (mean) price rep78 (max) code
	save dataset_$pll_instance.dta, replace
end

// Processing the data and taking a look at the datasets
parallel, prog(myprogram) nodata: myprogram
ls dataset_*.dta
/*
-rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_1.dta
-rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_2.dta
-rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_3.dta
-rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_4.dta
*/

// Now appending (using parallel append)
parallel append, do(di) e("dataset_%g.dta, 1/4")
list
/*
     +------------------------------------------+
     |   price     rep78   code      dta_source |
     |------------------------------------------|
  1. | 6,292.5       3.3      1   dataset_1.dta |
  2. |   4,489       3.5      2   dataset_2.dta |
  3. | 6,532.1      3.35      3   dataset_3.dta |
  4. | 6,537.5   3.52632      4   dataset_4.dta |
     +------------------------------------------+
*/

// Removing files using the shell (rm is available on Unix-like systems)
!rm dataset_*.dta  mytempdata.dta
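
The example above only uses pll_instance. As a minimal sketch of how PLL_CLUSTERS could be combined with it, the program below deals observations out to the instances in round-robin order instead of relying on a precomputed code variable. The names myprogram_rr and part_*.dta are hypothetical, and the sketch assumes that mytempdata.dta is still on disk (that is, it runs before the cleanup step) and that parallel setclusters has already been called.

```stata
// Sketch only: myprogram_rr and part_*.dta are hypothetical names.
program define myprogram_rr
	use mytempdata.dta, clear
	// Instance k keeps observations k, k + PLL_CLUSTERS, k + 2*PLL_CLUSTERS, ...
	keep if mod(_n - 1, $PLL_CLUSTERS) + 1 == $pll_instance
	collapse (mean) price rep78
	// Record which instance produced this row
	gen instance = $pll_instance
	save part_$pll_instance.dta, replace
end

parallel, prog(myprogram_rr) nodata: myprogram_rr
```

Because the split depends only on the two macros, the same program should work unchanged if the number of clusters is changed.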

Another example can be found in the help file of parallel (Example 6).

Generalized Append

If your data files are named following a regular pattern, then parallel append can help. A more general solution is sketched out here. First, we review a simple application of parallel append.

//files 2008_01/income.dta, ..., 2012_12/income.dta
program def myprogram
	gen female = (gender == "female")
	collapse (mean) income, by(female) fast
end
parallel append, do(myprogram) prog(myprogram) e("%g_%02.0f/income.dta, 2008/2012, 1/12")

Here is a more general solution. It requires the user to be able to load the names of the files into a variable in the current dataset.
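
How the file names get into the dataset depends on the use case. As one possible sketch, assuming all the files sit in a single folder (here the hypothetical folder rawdata, containing at least one .dta file), the names can be collected with the dir extended macro function and written to a string variable called filename:

```stata
// Sketch only: the folder name rawdata is a hypothetical placeholder.
clear
local files : dir "rawdata" files "*.dta"
local n : word count `files'
set obs `n'
gen str244 filename = ""
local i = 1
foreach f of local files {
	// Store the relative path of the i-th file
	replace filename = "rawdata/`f'" in `i'
	local ++i
}
```

Each observation then holds one file path, so when parallel splits the data across instances, every instance receives its own share of the files to process.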

//Load the file names into the variable filename (this will depend on the use case).
program def myprogram2
	local N = _N
	tempfile accumulated
	forval i=1/`N'{
		preserve
		local fn = filename[`i']
		use "`fn'", clear

		//do the real work
		gen female = (gender == "female")
		collapse (mean) income, by(female) fast

		//on the first pass the tempfile does not exist yet, hence cap
		cap append using `accumulated'
		save `accumulated', replace
		restore
	}
	use `accumulated', clear
end
parallel, prog(myprogram2): myprogram2