Skip to content


Subversion checkout URL

You can clone with
Download ZIP
Fetching contributors…
Cannot retrieve contributors at this time
334 lines (238 sloc) 22 KB
\usepackage[pdftitle={Efficient and highly parallel computation}, pdfauthor={Jeffrey Finkelstein}]{hyperref}
\newenvironment{idea}{\begin{proof}[Proof idea]}{\end{proof}}
\author{Jef{}frey Finkelstein}
\title{Efficient and highly parallel computation}
\section{Models of computation}
\subsection{Parallel random access machines}
There are many variations of the basic parallel random access machine (PRAM), which includes an arbitrary number of processors sharing an arbitrary amount of random access memory.
An exclusive-read, exclusive-write PRAM (EREW PRAM) does not allow any concurrent memory accesses (so all concurrent memory accesses must be explicitly serialized in the program).
An EREW PRAM is useful because any algorithm running on such a machine experiences no contention due to concurrent memory accesses.
A concurrent-read, concurrent-write PRAM (CRCW PRAM), on the other hand, has a unit cost for any number of processors concurrently accessing any single memory location.
In this model, contention due to concurrent memory accesses is not necessarily included in the running time.
Can an EREW PRAM simulate a CRCW PRAM (with the intention of reducing or eliminating time spent queuing memory concurrent accesses in physical hardware implementations)?
Due to a refinement of a simulation originally by Vishkin \cite{vishkin83}, a CRCW PRAM algorithm running in time $t(n)$ on $p(n)$ processors can be simulated by an EREW PRAM algorithm running in time $O(t(n)\cdot\lg n)$ on $p(n)$ processors.
\subsection{Queue-read, queue-write parallel random access machines}
Gibbons, Matias, and Ramachandran \cite{gmr98a} define the queue-read, queue-write PRAM (QRQW PRAM).
Whereas the EREW PRAM does not allow concurrent memory accesses and the CRCW PRAM has a unit cost for each concurrent memory access, the QRQW PRAM has a time penalty at each step directly proportional to the number of concurrent memory accesses.
This is the cost due to memory contention (or the ``queuing delay''), intended to model the cost of servicing memory access requests sequentially at the memory module of the machine.
The QRQW PRAM is ``more powerful'' than the EREW PRAM, because we may be able to collapse multiple exclusive memory accesses into a single concurrent one.
(Even so, there are some algorithms in which such a simulation does not necessarily result in a decreased running time.)
The QRQW PRAM is ``less powerful'' than the CRCW PRAM, because the CRCW PRAM does not account for the cost due to contention during memory accesses.
\cite{gmr98a} briefly provides the simple simulation of a QRQW PRAM by a CRCW PRAM, but I could not seem to find a work which provides a simulation of a QRQW PRAM by a EREW PRAM (though we examine such a simulation in the next section).
The same work also provides some theorems about ``time separations'' among different variants of PRAMs, but they don't seem to define what that means; it has something to do with time lower bounds.
\subsection{Variants of the queue-read, queue-write parallel random access machine}
In \cite{gmr98b}, the authors define single instruction, multiple data (SIMD) QRQW PRAMs, and multiple instruction, multiple data (MIMD) QRQW PRAMs.
In \cite{gmr98a}, they define QRQW asynchronous PRAMs, in which the computation at each processor proceeds asynchronously, but concurrent memory accesses to the same memory location truly form a queue which is processed serially; the head of the queue, if it is not empty, is dequeued and processed at each time step.
\subsection{Queuing shared memory machines}
The queuing shared memory (QSM) machine \cite{gmr99} is a essentially a parameterized QRQW PRAM in which the parameter $g$ scales the cost (of each step in each processor) due to memory accesses.
It is considered a ``bulk-synchrony'' model because it allows each processor to operate asynchronously between barrier synchronizations.
(This is a compromise between a completely synchronous model and a completely asynchronous model.)
\subsection{Bulk synchronous parallel random access machine}
The bulk syncronous parallel (BSP) \cite{valiant89} model is really a \emph{distributed} computation model, in that it has an arbitrary number of processors, each with local memory, connected via a (stateless) network with limited bandwidth.
Get a copy of \cite{valiant89}.
The bulk synchronous PRAM (BSPRAM) \cite{tiskin98} is a generalization of the BSP model which has an arbitrary number of PRAMs connected to a bandwidth-limited shared memory (in place of a bandwidth-limited network).
If we choose the PRAMs in the BSPRAM to be QRQW PRAMs, this essentially yields a generalization of a QSM machine with an extra parameter, $l$, which measures latency at shared memory.
A QSM machine is essentially a QRQW BSPRAM with $l=1$ (since the QSM machine does not penalize latency at shared memory).
\subsection{Memory access complexity}
In \cite{zstc98, zhang00}, Zhang et al. suggest that as memory access becomes a bottleneck, it pays to examine computational complexity not only in terms of time, space, and number of processors, but also in terms of \emph{memory access complexity}.
Memory access complexity measures the cost due to temporal and spatial locality of memory accesses in an algorithm (as well as ``contiguous and non-contiguous accesses, short message[s] and long message[s], etc.'' \cite{zcsm07}).
% see also
In \cite{av88}, the authors suggest the \emph{input/output model}, also known as the \emph{disk access model}, in which there are two layers of memory: a (fast) cache with a finite number of blocks and a (slow) disk.
Some analysis of the amount of memory transfers required for common algorithms appears in lecture notes from Erik Demaine's advanced data structures course \cite{demaine07}.
\subsection{Parameterized parallel random access machine}
The Parameterized PRAM (PPRAM) defined in \cite{hc} is a general family of PRAM models which assign a different cost to each unique type of operation.
Several other PRAM-like models are special cases of this generalization.
The PPRAM is parameterized by a four-tuple of positive reals, $(\alpha, \beta, \gamma, \delta)$, which are the costs for EREW memory access, CRCW memory access, multiprefix operation (for example, a ``prefix-sum'' operation), and global synchronization, respectively.
(Local computation has a cost of one.)
The benefit of this parameterized model is that one can analyze the tradeoffs in resource usage due to differences in distinct algorithms for a single computational problem.
Augment this model with two more parameters: $\epsilon$ for cost due to round trips to memory and $\zeta$ for cost due to queuing delay.
\subsection{Other models of parallel computation}
There are many, many other models of parallel computing machines; see \cite{zcsm07} for details and a listing of modern parallel computation models.
In that work, all the PRAMs as well as the QSM discussed in the previous sections are considered ``first generation'' models because they use a shared memory model.
The second generation models use distributed memory and the third hierarchical memory.
It seems that most models can be simulated on most other models with small costs.
Which of the models mentioned in \cite{zcsm07} is closest to our current hardware?
Which will be closest to the next-generation of hardware?
\subsection{Lower bounds for parallel models of computation}
\cite{mr98} provides lower bounds for certain problems on the QSM and BSP models, by using a generalized shared memory (GSM) machine.
This is a generalization of both QSM and BSP.
\subsection{The explicit multi-threading machine}
Vishkin, Caragea and Lee \cite{vcl06} describe complexity measures for the Explicit Multi-Threading (XMT) machine (introduced in \cite{vdbn98}).
When computing the overall complexity of an algorithm written for the XMT machine, they include the computation depth, the number of round-trips to memory, and the queuing delay.
They also include an additional cost due to thread and processor management when there are more threads than processors (which could be called ``processor contention'').
The XMT model combines some of the features of several different models of parallel computation, including a shared memory model, queuing delays, hierarchical memory access, memory segregated into modules, etc.
\section{Tradeoffs in resource usage}
One of our goals is to examine the tradeoff among time, space, number of processors, number of round-trips to memory, queuing delay, etc.
\subsection{Simulating a QRQW PRAM on an EREW PRAM}
Our first thought was to investigate eliminating queuing delay at the expense of computation depth.
To do this, we consider Vishkin's EREW PRAM simulation of a CRCW PRAM, and try to adapt it to simulate a QRQW PRAM.
\subsubsection{Simulating a CRCW PRAM on an EREW PRAM}
Vishkin showed that simulating a PRIORITY CRCW PRAM on a EREW PRAM can be done with a $O(\lg^2 n)$ increase in time.
A PRIORITY CRCW PRAM is a CRCW PRAM in which write conflicts are resolved by choosing the processor with the smallest ID.
This model is more powerful than both the COMMON CRCW PRAM (in which concurrent writes to the same address are only allowed when the processors write the same data) and the ARBITRARY CRCW PRAM (in which an arbitrary processor succeeds on concurrent writes), so simulating the PRIORITY CRCW PRAM model suffices to simulate these other two.
An algorithm which runs on an input of size $n$ on a CRCW PRAM with $p(n)$ processors in $t(n)$ time and $s(n)$ space can be simulated on an EREW PRAM with $p(n)$ processors in $O(t(n)\cdot \lg^2 n)$ time and $O(p(n) \cdot s(n))$ space.
The PRIORITY CRCW PRAM and EREW PRAM models are synchronous, so it suffices to show separately how to simulate a read step and a write step.
To simulate a write step, each processor writes the address, the processor's ID number, and the data to write to a memory address.
Perform Batcher sort, a $O(\lg^2 n)$ EREW PRAM (stable) sorting algorithm, on the addresses (then on processor ID number).
Each processor reads one of the addresses and reads the one before it (in the sorted sequence).
If the two are different, this processor has the lowest address of all processors attempting to write to this address, so it writes its value.
All other processors do nothing.
To simulate a read step, sort the sequence of addresses and IDs in $O(\lg^2 n)$ time as in the write step.
Each processor reads an address and the one before it, as above.
Create a bitmap with a $1$ where a processor sees that it has a new address and a $0$ where a processor does not see a new address.
Each processor corresponding to a $1$ reads from the corresponding address.
Perform a ``segmented prefix copy-right'', using the bitmap as the segment delimiters, to disseminate the data read in $O(\lg n)$ time.
``Unsort'' the adresses and IDs along with the data read.
A later $O(\lg n)$ EREW PRAM sorting algorithm due to Cole improves the above theorem.
An algorithm which runs on an input of size $n$ on a CRCW PRAM with $p(n)$ processors in $t(n)$ time and $s(n)$ space can be simulated on an EREW PRAM with $p(n)$ processors in $O(t(n)\cdot \lg n)$ time and $O(p(n) \cdot s(n))$ space.
See \cite{smith93} for more information on the Batcher sorting algorithm, the Cole sorting algorithm, and this simulation in general.
\cite{savage98} also has the statement and proof of the earlier version of this theorem.
\subsubsection{Simulating a SIMD-QRQW PRAM on an EREW PRAM}\label{sec:simdqrqwerew}
A SIMD-QRQW PRAM is just a CRCW PRAM with an added time penalty due to concurrent reads and writes.
Therefore, the same proof technique for simulating a CRCW PRAM on an EREW PRAM applies for simulating a SIMD-QRQW PRAM on an EREW PRAM.
At each step of a SIMD-QRQW PRAM with $p$ processors, the time cost due to contention is at most $p$ (if all processors are accessing the same memory location).
However, the simulated step on the EREW PRAM with $p$ processors still takes $O(\lg p)$ time.
Suppose the SIMD-QRQW PRAM step has contention $\kappa$.
If $\kappa\in O(\lg p)$, then the EREW PRAM simulation runs more slowly than the original algorithm.
If $\kappa\in\Omega(\lg p)$, then the EREW PRAM simulation runs \emph{more quickly} than the original algorithm.
This last fact is a bit unsettling; we expect a more restricted model of computation to take longer to decide a computational problem.
We take this fact as evidence that the EREW PRAM is not refined enough to explore tradeoffs between time, space, number of processors, number of memory accesses, and queuing delays.
\subsubsection{Simulating a MIMD-QRQW PRAM on an EREW PRAM}
We first note the simulation of a MIMD-QRQW PRAM on a SIMD-QRQW PRAM:
\begin{theorem}[\cite{gmr98a}, Observation 2.1]\label{thm:mimdsimdqrqw}
An algorithm which runs on an input of size $n$ on a MIMD-QRQW PRAM with $p(n)$ processors in $t(n)$ time and $s(n)$ space with a maximum time cost for a single step $\tau(n)$ can be simulated on a SIMD-QRQW PRAM with $p(n)\cdot \tau(n)$ processors in $O(t(n))$ time and $s(n) + \tau(n) \cdot p(n)$ space.
(The increase in space comes from the proof; it is not explicitly stated in the original theorem.)
Combining \autoref{thm:mimdsimdqrqw} with the results stated in \autoref{sec:simdqrqwerew}, we find that simulating a step of a MIMD-QRQW PRAM on an EREW PRAM also is subject to the same unusual behavior, as described below.
A step on the MIMD-QRQW PRAM with contention $\kappa$ becomes multiple steps on the SIMD-QRQW PRAM.
Some of these steps are subject to contention $\kappa$ as well.
The same analysis as above applies.
\subsection{Commonly studied problems for PRAMs}
Tradeoffs between amount of communication and number of processors in a partitioned PRAM (in which each of a finite set of processors receives only a subset of the input and there is a finite amount of shared memory) for the following problems are studied in \cite{abn99}.
\item $k$-selection
\item minimum
\item prefix-sum
\item integer sorting
\item $k\mhyphen k$ routing
\item permutation routing
\item list reversal
Two algorithms for majority are given in \cite{hc}.
Many algorithms for minimum weight spanning tree also exist because of the many applications, for example, in networking.
There are also a bunch of PRAM approximation algorithms in \cite{dsst97}.
\cite{vcl06} gives algorithms and analyses for
\item array compaction
\item $k$-ary tree summation
\item breadth-first search
\item sparse matrix, dense vector multiplication
Examine these algorithms with respect to queuing delay and number of round-trips to memory, etc., as performed in \cite{vcl06}.
\section{Inherently sequential and highly parallel computational problems}
\subsection{Highly parallel computational problems}\label{sec:definenc}
\textsf{NC}{} is the class of all languages which can be decided by a (logarithmic space) uniform family of Boolean circuits with polynomial size, polylogarithmic depth, and fan-in two.
If we view each of the gates in the circuit as a processor and each layer of the circuit as a single time step, then \textsf{NC} represents the class of languages decidable by a CREW PRAM with a polynomial number of processors running in polylogarithmic parallel time.
Languages in \textsf{NC} are considered highly parallel.
We know $\mathsf{L}\subseteq\mathsf{NL}\subseteq\mathsf{NC}^2\subseteq\mathsf{NC}\subseteq\mathsf{P}$.
The first inclusion is obvious, the second is due to \cite{borodin77}, the third follows from the definition, and the last is because \textsf{NC} is logarithmic space uniform (so the circuit can be generated in polynomial time) and because circuit evaluation can be performed in polynomial time.
\subsection{Inherently sequential computational problems}
Languages which are complete for \textsf{P} under $\leq_m^{NC}$ reductions are considered inherently sequential, because a highly parallel solution (that is, an \textsf{NC} algorithm) for any one of them would imply a highly parallel solution for all problems in \textsf{P}.
What is the relationship between logarithmic space many-one reductions ($\leq_m^L$) and $\mathsf{NC}$ many-one reductions ($\leq_m^{NC}$)?
Since $\mathsf{L}\subseteq\mathsf{NC}$, the former reduction implies the latter.
Such problems would be more accurately described as ``inherently sequential in the worst case'', since the complexity classes \textsf{P} and \textsf{NC} are defined by worst-case time bounds (and worst-case number of processors in the latter class).
The canonical natural \textsf{P}-complete problem is the Circuit Value Problem (CVP): given the encoding of both a Boolean circuit and its input, decide if the output of the circuit is 1.
\subsection{Highly parallel appromixations}
An inherently sequential (that is, \textsf{P}-complete) optimization problem may still admit a highly parallel approximation.
Indeed, so do some \textsf{NP}- and \textsf{PSPACE}-complete optimization problems (for example, see \cite{hmrrrs98} for \textsf{NC} approximations for \textsf{NP}- and \textsf{PSPACE}-hard problems).
Analogous to the chain of inclusions
\mathsf{P} \subseteq \mathsf{PO} \subseteq \mathsf{FPTAS} \subseteq \mathsf{PTAS} \subseteq \mathsf{ApxPO} \subseteq \mathsf{poly\mhyphen ApxPO} \subseteq \mathsf{exp\mhyphen ApxPO} \subseteq \mathsf{NPO}
for approximating \textsf{NP} optimization problems, we have
\mathsf{NC} \subseteq \mathsf{NCO} \subseteq \mathsf{FNCAS} \subseteq \mathsf{NCAS} \subseteq \mathsf{ApxNCO} \subseteq \mathsf{poly\mhyphen ApxNCO} \subseteq \mathsf{exp\mhyphen ApxNCO}
for approximating \textsf{NC} optimization problems (see working definitions in \cite{finkelstein13}) and
\mathsf{L} \subseteq \mathsf{LO} \subseteq \mathsf{FLAS} \subseteq \mathsf{LAS} \subseteq \mathsf{ApxLO} \subseteq \mathsf{poly\mhyphen ApxLO} \subseteq \mathsf{exp\mhyphen ApxLO} \subseteq \mathsf{NLO}
for approximating \textsf{NL} optimization problems (classes defined in \cite{tantau07}).
\subsection{Fixed parameter parallelizable problems}
As an analog to the class of fixed parameter tractable problems, \textsf{FPT}, we might also consider the class of fixed parameter \emph{parallelizable} problems, \textsf{FPP}, introduced (in a sketchy manner) in \cite{cd98}.
Survey and create a listing of known problems in \textsl{\textsf{FPP}}.
\subsection{Fixed parameter parallelizable approximation}
In \cite{marx07}, the author presents a cohesive view of fixed parameter tractable approximation algorithms.
See also \cite{ch10}.
No work seems to exist which studys fixed parameter parallelizable approximations.
\subsection{Average-case complexity}
\textsf{P}, \textsf{NP}, and \textsf{NP}-completeness are all defined using worst-case time complexity.
Average-case complexity classes are defined using expected running times over a probability distribution on inputs of each size.
There is a wealth of complexity classes related to the notion of average-case complexity (\textsf{AvgP}, \textsf{DistP}, \textsf{HeurP}, \textsf{NearlyP}, etc.).
For a survey of the average-case complexity of \textsf{NP} problems, see \cite{bt06}.
There are some definitions for smaller average-case complexity classes, but they are a bit harder to find.
See \cite[Section~3.5.1]{yamakami97} for general definitions of average-case complexity classes and \cite[Section~7]{bcgl89} for average-case logarithmic space.
The latter paper very briefly (that is, in a single sentence) states that their definitions can be used to define average-case logarithmic space-uniform \textsf{NC}.
This is a bit indirect because the average-case hardness comes from the unifom Turing machine which generates the circuit, and not the circuit itself.
\subsection{Average-case highly parallel problems}
One possible set of requirements for a problem to be considered ``highly parallel on average'' is
\item the expected depth of the circuit deciding it is polylogarithmic, and
\item the expected size of the circuit is polynomial,
but average-case highly parallel computation does not seem to be explicitly defined anywhere.
What is the appropriate definition for average-case \textsl{\textsf{NC}}?
Is it the definition given by \cite{bcgl89}, or would it make more sense to define it using the CREW PRAM characterization of \textsl{\textsf{NC}}{} (see \autoref{sec:definenc})?
There do seem to be a few examples of average-case highly parallel PRAM algorithms.
\item Sublist Selection Problem \cite{pz91}
\item $k$-Colorability of Permutation Graphs \cite{an99}
\item Single-Source Shortest-Paths \cite{meyer02}
\item various other graph problems \cite{rs92}
There has been some work on average-case complexity of \textsl{\textsf{P}}-complete problems in \cite{sx99}, though I have not examined that yet.
Also, there doesn't seem to be much follow up to that work.
There is also plenty of work done on expected depth of random circuits, which may be of use.
\section{About this work}
Copyright 2012 Jef{}frey Finkelstein.
This work is licensed under the Creative Commons Attribution-ShareAlike License 3.0.
Visit \mbox{\url{}} to view a copy of this license.
The \LaTeX{} markup which generated this document is available on the World Wide Web at \mbox{\url{}}.
It is also licensed under the Creative Commons Attribution-ShareAlike License.
The author can be contacted via email at \email{}.
Jump to Line
Something went wrong with that request. Please try again.