# A Distributed Matrix Data Structure and Its Statistical Applications on PyTorch

**The Programming Workshop at the Inaugural Kenneth Lange Symposium, Feb 21-22, 2020**

_Seyoon Ko and Joong-Ho (Johann) Won_

## Synopsis

We developed a distributed matrix operation package suitable for distributed matrix-vector operations and distributed tall-and-thin (or wide-and-short) matrices.
The code runs on both multi-node machines and multi-GPU machines using PyTorch.
We have applied this package to four statistical applications, namely, nonnegative matrix factorization (NMF), multidimensional scaling (MDS), positron emission tomography (PET), and $\ell_1$-regularized Cox regression.
In particular, $\ell_1$-regularized Cox regression with the UK Biobank dataset was, to our knowledge, the biggest joint multivariate survival analysis.
In this workshop, we provide small examples that run on a single node, and demonstrate multi-GPU usage on our machine.

## Contents

* Brief introduction to PyTorch operations
* `torch.distributed` package
* Distributed matrix data structure in package `dist_stat`
* Applications: Nonnegative matrix factorization and $\ell_1$-penalized Cox regression
* Demonstration on multi-GPU machine

## Introduction to PyTorch

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. It was developed with two goals:
* A replacement for NumPy that uses the power of GPUs $\rightarrow$ optimization of numerical operations
* A deep learning research platform that provides maximum flexibility and speed $\rightarrow$ optimization of automatic gradient computation (e.g., backpropagation)

We are trying to exploit the former in a distributed environment.

## Basic PyTorch Operations
We introduce simple operations in PyTorch. Note that Python uses 0-based indexing and PyTorch stores tensors in row-major order, like C and C++ (cf. R and Julia, which use 1-based indexing and column-major order). First we import the PyTorch library. This is similar to `library()` in R and to `import ...` in Julia.

In [1]:
import torch

In [2]:
torch.__version__

'1.4.0'

### Tensor Creation

One may create an uninitialized tensor. This creates a 3 × 4 tensor (matrix).

In [3]:
torch.empty(3, 4) # uninitialized tensor. Julia equivalent: Array{Float32}(undef, 3, 4)

tensor([[2.1604e-35, 3.0899e-41, 2.2561e-43, 0.0000e+00],
        [2.5622e-29, 4.5761e-41, 2.1553e-35, 3.0899e-41],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]])

The following is equivalent to `set.seed()` in R.

In [4]:
torch.manual_seed(100) # Julia equivalent: Random.seed!(100)

<torch._C.Generator at 0x7f8ff0072b10>

This generates a tensor whose entries are drawn uniformly at random from the interval [0, 1).

In [5]:
y = torch.rand(3, 4) # from Unif(0, 1). 
y

tensor([[0.1117, 0.8158, 0.2626, 0.4839],
        [0.6765, 0.7539, 0.2627, 0.0428],
        [0.2080, 0.1180, 0.1217, 0.7356]])

We can also generate a tensor filled with zeros or ones.

In [6]:
z = torch.ones(3, 4) # torch.zeros(3, 4)
z

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

A tensor can be created from standard Python data.

In [7]:
w = torch.tensor([3, 4, 5, 6])
w

tensor([3, 4, 5, 6])

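A tensor can be created with a datatype and on a device of choice. By default, the datatype is inferred from the data (floating-point data become `float32`; the integers above become `int64`) and the device is the CPU.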

In [8]:
# double precision
w = torch.tensor([3, 4, 5, 6], dtype=torch.float64)
w

tensor([3., 4., 5., 6.], dtype=torch.float64)

In [9]:
# # on GPU number zero. will not run if CUDA GPU is not present.
# w = torch.tensor([3, 4, 5, 6], device='cuda:0')
# w

The shape of a tensor can be accessed through its `.shape` attribute.

In [10]:
z.shape

torch.Size([3, 4])

### Casting

The datatype and device of a tensor can be changed with the method `.to()`. Its arguments are similar to those used to choose the datatype and device of a new tensor.

In [11]:
w = w.to(device = "cpu", dtype=torch.int32)
w

tensor([3, 4, 5, 6], dtype=torch.int32)

### Indexing

The following are standard methods of indexing tensors.

In [12]:
y[2, 3] # indexing: zero-based, returns a 0-dimensional tensor

tensor(0.7356)

Indexing always returns a (sub)tensor, even for scalars (which are treated as zero-dimensional tensors).
A standard Python number can be obtained by calling `.item()`.

In [13]:
y[2, 3].item() # A standard Python floating-point number

0.7355988621711731

To get a column from a tensor, we index as below. The syntax is similar to, but slightly different from, that of R.

In [14]:
y[:, 3] # 3rd column. The leftmost column is 0th. cf. y[, 4] in R

tensor([0.4839, 0.0428, 0.7356])

The following is for taking a row.

In [15]:
y[2, :] # 2nd row. The top row is 0th. cf. y[3, ] in R

tensor([0.2080, 0.1180, 0.1217, 0.7356])

### Simple operations

Here we provide examples of simple operations in PyTorch. Addition using the operator ‘+’ behaves as one would expect:

In [16]:
x = y + z # a simple addition.
x

tensor([[1.1117, 1.8158, 1.2626, 1.4839],
        [1.6765, 1.7539, 1.2627, 1.0428],
        [1.2080, 1.1180, 1.1217, 1.7356]])

Here is another form of addition.

In [17]:
x = torch.add(y, z) # another syntax for addition

Methods ending with an underscore (`_`) change the value of the tensor in place. Otherwise, the argument is never changed. Unlike the convention for functions ending with `!` in Julia, this rule is strictly enforced in PyTorch. (The underscore determines the usage of the keyword `const` at the C++ level.)

In [18]:
y.add_(z) # in-place addition

tensor([[1.1117, 1.8158, 1.2626, 1.4839],
        [1.6765, 1.7539, 1.2627, 1.0428],
        [1.2080, 1.1180, 1.1217, 1.7356]])
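
To make the difference concrete, the following minimal sketch contrasts the out-of-place `torch.add()` with the in-place `.add_()`. The variable names are illustrative, and `data_ptr()` is used only to check that the in-place version reuses the same memory.

```python
import torch

a = torch.ones(2, 2)
b = torch.full((2, 2), 2.0)

c = torch.add(a, b)      # out-of-place: `a` is left untouched
assert torch.equal(a, torch.ones(2, 2))
assert torch.equal(c, torch.full((2, 2), 3.0))

ptr_before = a.data_ptr()
a.add_(b)                # in-place: `a` itself now holds a + b
assert torch.equal(a, c)
assert a.data_ptr() == ptr_before   # same underlying memory as before
```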

### Concatenation

We can concatenate tensors using the function `cat()`, which resembles `c()`, `cbind()`, and `rbind()` in R, and `cat()`, `vcat()`, and `hcat()` in Julia. The second argument indicates the dimension along which the tensors are concatenated: zero concatenates along the rows (vertically), and one concatenates along the columns (horizontally).

In [19]:
torch.cat((y, z), 0) # along the rows. cf. vcat

tensor([[1.1117, 1.8158, 1.2626, 1.4839],
        [1.6765, 1.7539, 1.2627, 1.0428],
        [1.2080, 1.1180, 1.1217, 1.7356],
        [1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000]])

In [20]:
torch.cat((y, z), 1) # along the columns. cf. hcat

tensor([[1.1117, 1.8158, 1.2626, 1.4839, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.6765, 1.7539, 1.2627, 1.0428, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.2080, 1.1180, 1.1217, 1.7356, 1.0000, 1.0000, 1.0000, 1.0000]])

### Reshaping

One can reshape a tensor, like changing the attribute `dim` in R.

In [21]:
y.view(12) # 1-dimensional array

tensor([1.1117, 1.8158, 1.2626, 1.4839, 1.6765, 1.7539, 1.2627, 1.0428, 1.2080,
        1.1180, 1.1217, 1.7356])

At most one of the arguments of `view()` can be −1. The size of that dimension is inferred from the other dimensions and the total number of elements.

In [22]:
# reshape into (6)-by-2 tensor;
# (6) is inferred from the other dimension
y.view(-1, 2)

tensor([[1.1117, 1.8158],
        [1.2626, 1.4839],
        [1.6765, 1.7539],
        [1.2627, 1.0428],
        [1.2080, 1.1180],
        [1.1217, 1.7356]])
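
The inference rule for −1 and the row-major order in which `view()` reads the elements can be sketched in plain Python. The helper names `infer_shape` and `reshape_row_major` are hypothetical and only illustrate the rule; they are not part of PyTorch.

```python
def infer_shape(numel, shape):
    """Replace a single -1 in `shape` so that the product of sizes equals `numel`."""
    if list(shape).count(-1) > 1:
        raise ValueError("at most one dimension can be -1")
    known = 1
    for s in shape:
        if s != -1:
            known *= s
    if -1 in shape:
        if known == 0 or numel % known != 0:
            raise ValueError("shape is incompatible with the number of elements")
        return tuple(numel // known if s == -1 else s for s in shape)
    if known != numel:
        raise ValueError("shape is incompatible with the number of elements")
    return tuple(shape)


def reshape_row_major(flat, rows, cols):
    """Fill a rows-by-cols nested list row by row (row-major order)."""
    return [flat[i * cols:(i + 1) * cols] for i in range(rows)]


flat = list(range(12))                  # stands in for the 12 entries of a 3 x 4 tensor
rows, cols = infer_shape(len(flat), (-1, 2))
print(rows, cols)                       # 6 2
print(reshape_row_major(flat, rows, cols)[:2])
```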

### Basic statistics and Reductions

Calling the `.sum()`, `.mean()`, and `.std()` methods of a tensor computes the sum, mean, and standard deviation of its entries. An optional argument determines the dimension of reduction.

In [23]:
y

tensor([[1.1117, 1.8158, 1.2626, 1.4839],
        [1.6765, 1.7539, 1.2627, 1.0428],
        [1.2080, 1.1180, 1.1217, 1.7356]])

In [24]:
y.sum()

tensor(16.5933)

In [25]:
y.sum(0) # reduces rows, columnwise sum

tensor([3.9962, 4.6878, 3.6469, 4.2623])

In [26]:
y.sum(1) # reduces columns, rowwise sum

tensor([5.6739, 5.7359, 5.1834])

In [27]:
y.sum((0, 1)) # reduces rows and columns -> a single number.

tensor(16.5933)

In [28]:
y.mean()

tensor(1.3828)

In [29]:
y.mean(0) # columnwise mean

tensor([1.3321, 1.5626, 1.2156, 1.4208])

In [30]:
y.std(1) # rowwise standard deviation (division by (n-1))

tensor([0.3058, 0.3384, 0.2961])

### Linear Algebra

A matrix is transposed by appending `.t()` to a tensor. Matrix multiplication is carried out by the function `torch.mm()`.

In [31]:
torch.mm(y, z.t()) # Note: y is 3 x 4, z is 3 x 4. 

tensor([[5.6739, 5.6739, 5.6739],
        [5.7359, 5.7359, 5.7359],
        [5.1834, 5.1834, 5.1834]])

## `torch.distributed`: Distributed subpackage for PyTorch

`torch.distributed` is the subpackage for distributed operations on PyTorch. The interface is largely inspired by the Message Passing Interface (MPI). The available backends are:

* Gloo, a collective communication library developed by Facebook and included in PyTorch. Full support for CPU; for GPU, only part of the collective communication operations.
* MPI, the long-established communication standard. The most flexible option, but PyTorch needs to be compiled from source to use it as a backend. Full support for GPU if the MPI installation is "CUDA-aware".
* NCCL, the NVIDIA Collective Communications Library. Collective communication only, for multiple GPUs.

For this workshop, we use Gloo because it is fully functional on CPU and runs in a Jupyter notebook. The experiments in our paper use MPI to run the multi-node and multi-GPU settings with essentially the same code. The interface below is specific to the Gloo backend. For the MPI backend, please consult the relevant section of the [distributed package tutorial](https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends) or [our example code](https://github.com/kose-y/dist_stat/tree/master/examples).

In [32]:
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

def run_process(size, fn):
    processes = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, fn))
        p.start()
        processes.append(p)
        
    for p in processes:
        p.join()

Each distributed function from now on takes `rank` and `size` as its two arguments. When `run_process` is called with the communicator size `size` and the function `fn`, `size` processes are launched, and each of them calls `fn` with its own `rank` and the common `size`.

### Point-to-point communication

![](https://pytorch.org/tutorials/_images/send_recv.png)

Figure courtesy of: https://pytorch.org/tutorials/_images/send_recv.png

In [33]:
"""Blocking point-to-point communication."""

def point_to_point(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    elif rank == 1:
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    dist.barrier()
    print('Rank ', rank, ' has data ', tensor[0])

In [34]:
run_process(4, point_to_point)

Rank  0  has data  tensor(1.)
Rank  2  has data  tensor(0.)
Rank  3  has data  tensor(0.)
Rank  1  has data  tensor(1.)


### Collective communication

| | | 
|:---|:---|
| ![](https://pytorch.org/tutorials/_images/scatter.png) | ![](https://pytorch.org/tutorials/_images/gather.png) |
| Scatter | Gather |
| ![](https://pytorch.org/tutorials/_images/reduce.png) | ![](https://pytorch.org/tutorials/_images/all_reduce.png) |
| Reduce | All-reduce |
| ![](https://pytorch.org/tutorials/_images/broadcast.png) | ![](https://pytorch.org/tutorials/_images/all_gather.png) |
| Broadcast | All-gather |

Table courtesy of: https://pytorch.org/tutorials/intermediate/dist_tuto.html
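
The semantics of the six collectives in the table can be mimicked in a single process by representing the state of all ranks as a Python list with one entry per rank. This is only a toy model with hypothetical function names; the real `torch.distributed` calls operate on tensors across processes and modify them in place.

```python
def broadcast(values, src):
    """Every rank ends up with the value held by rank `src`."""
    return [values[src] for _ in values]

def scatter(chunks):
    """The source rank holds one chunk per rank; rank k receives chunks[k]."""
    return list(chunks)

def gather(values, dst):
    """Rank `dst` collects all values; the other ranks receive nothing."""
    return [list(values) if rank == dst else None for rank in range(len(values))]

def all_gather(values):
    """Every rank collects the values of all ranks."""
    return [list(values) for _ in values]

def reduce(values, dst):
    """Rank `dst` holds the sum. In this idealized model the other ranks keep
    their own value; in practice their contents should not be relied upon."""
    total = sum(values)
    return [total if rank == dst else v for rank, v in enumerate(values)]

def all_reduce(values):
    """Every rank holds the sum over all ranks."""
    total = sum(values)
    return [total for _ in values]

print(all_reduce([1, 1, 1, 1]))         # [4, 4, 4, 4]
print(broadcast([7, 0, 0, 0], src=0))   # [7, 7, 7, 7]
print(all_gather([0, 1, 2, 3])[2])      # [0, 1, 2, 3]
```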


In [35]:
def broadcast(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor[0] = 7
    dist.broadcast(tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])

In [36]:
run_process(4, broadcast)

Rank  0  has data  tensor(7.)
Rank  2  has data  tensor(7.)
Rank  3  has data  tensor(7.)
Rank  1  has data  tensor(7.)


In [37]:
def reduce(rank, size):
    tensor = torch.ones(1)
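    # Sum onto rank 3. Only the destination rank is guaranteed to hold the result;
    # the other ranks may be left with intermediate values.
    dist.reduce(tensor, dst=3)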
    print('Rank ', rank, ' has data ', tensor[0])

In [38]:
run_process(4, reduce)

Rank  0  has data  tensor(4.)
Rank  3  has data  tensor(4.)
Rank  1  has data  tensor(3.)
Rank  2  has data  tensor(2.)


In [39]:
def all_reduce(rank, size):
    tensor = torch.ones(1)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print('Rank ', rank, ' has data ', tensor[0])

In [40]:
run_process(4, all_reduce)

Rank  0  has data  tensor(4.)
Rank  2  has data  tensor(4.)
Rank  3  has data  tensor(4.)
Rank  1  has data  tensor(4.)


In [41]:
def all_gather(rank, size):
    tensors = [torch.zeros(1) for i in range(size)]
    dat = torch.zeros(1)
    dat[0] += rank
    dist.all_gather(tensors, dat)
    print('Rank ', rank, ' has data ', tensors)

In [42]:
run_process(4, all_gather)

Rank  0  has data  [tensor([0.]), tensor([1.]), tensor([2.]), tensor([3.])]
Rank  3  has data  [tensor([0.]), tensor([1.]), tensor([2.]), tensor([3.])]
Rank  1  has data  [tensor([0.]), tensor([1.]), tensor([2.]), tensor([3.])]
Rank  2  has data  [tensor([0.]), tensor([1.]), tensor([2.]), tensor([3.])]


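Below is code for a simple Monte Carlo estimate of $\pi$. Each process samples 100,000 pairs $(x_i, y_i)$, with $x_i$ and $y_i$ drawn from $\text{Unif}(0,1)$, and we measure the proportion of the points $(x_i, y_i)$ that fall inside the unit quarter circle. This proportion estimates $\pi/4$, so it is multiplied by 4.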

In [43]:
def mc_pi(n, size):
    # this code is executed on each process.
    x = torch.rand((n), dtype=torch.float64)
    y = torch.rand((n), dtype=torch.float64)
    # compute local estimate of pi
    r = torch.mean((x**2 + y**2 < 1).to(dtype=torch.float64))*4
    dist.all_reduce(r) # sum of 'r's in each device is stored in 'r'
    return r / size

def run_mc_pi(rank, size):
    n = 100000
    torch.manual_seed(100 + rank)
    r = mc_pi(n, size)
    if rank == 0:
        print(r.item())

In [44]:
run_process(4, run_mc_pi)

3.14356
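
Dividing the all-reduced sum by `size` works because every process draws the same number of points, so the mean of the per-process estimates equals the estimate computed from the pooled sample. The single-process sketch below checks this, with the standard library's `random` module standing in for `torch.rand`; the function name `local_hits` and the sample sizes are illustrative assumptions.

```python
import math
import random

def local_hits(n, seed):
    """Count points (x, y) with x, y ~ Unif(0, 1) that fall inside the unit quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits

size, n = 4, 20000
hits = [local_hits(n, 100 + rank) for rank in range(size)]  # one entry per simulated rank

local_estimates = [4.0 * h / n for h in hits]
averaged = sum(local_estimates) / size      # what all_reduce followed by "/ size" computes
pooled = 4.0 * sum(hits) / (size * n)       # estimate from all size * n points at once

assert math.isclose(averaged, pooled, rel_tol=1e-12)
print(averaged)
```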


## `distmat`: Distributed Matrices on PyTorch

Using the tensor operations and communication package, we created a data structure for a distributed matrix. 
* Each process (enumerated by rank) holds a contiguous block of the full data matrix by rows or columns.
* The data may be a sparse matrix. 
* If GPUs are involved, each process controls a GPU whose index matches the process rank. 
* From now on, a [100] × 100 matrix split over four processes (brackets mark the distributed dimension) means that
    * the rank 0 process keeps rows [0:25) of the matrix, in row-major ordering, and
    * the rank 3 process keeps rows [75:100).
* Limitation: the length of the distributed dimension must be divisible by the number of processes.
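
The row ranges in the list above follow from an even split of the distributed dimension. A small helper (hypothetical, not a function of `dist_stat`) makes the half-open ranges and the divisibility requirement explicit:

```python
def chunk_range(n, size, rank):
    """Half-open index range [start, stop) owned by `rank` when n items are split evenly."""
    if n % size != 0:
        raise ValueError("n must be divisible by the number of processes")
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    per_rank = n // size
    return rank * per_rank, (rank + 1) * per_rank

for rank in range(4):
    print(rank, chunk_range(100, 4, rank))
# 0 (0, 25)
# 1 (25, 50)
# 2 (50, 75)
# 3 (75, 100)
```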

In [45]:
import dist_stat.distmat as distmat

### Creation

- `distgen_ones()`: Creates a distributed matrix filled with ones
- `distgen_zeros()`: Creates a distributed matrix filled with zeros
- `distgen_uniform()`: Creates a distributed matrix from uniform distribution
- `distgen_normal()`: Creates a distributed matrix from the standard normal distribution

In [46]:
def create_unif(rank, size):
    A = distmat.distgen_uniform(8, 8, TType=torch.DoubleTensor)
    print('Matrix A:', 'Rank ', rank, ' has data ', A.chunk)   

In [47]:
run_process(4, create_unif)

Matrix A: Rank  3  has data  tensor([[0.6214, 0.7096, 0.7099, 0.7977, 0.8779, 0.1236, 0.5650, 0.0655],
        [0.1210, 0.0454, 0.5070, 0.1860, 0.5035, 0.1454, 0.8090, 0.4991]],
       dtype=torch.float64)
Matrix A: Rank  0  has data  tensor([[0.6941, 0.3464, 0.9751, 0.7911, 0.4274, 0.4460, 0.5522, 0.9559],
        [0.9405, 0.2215, 0.3271, 0.1352, 0.6283, 0.3030, 0.1302, 0.1811]],
       dtype=torch.float64)
Matrix A: Rank  2  has data  tensor([[0.8381, 0.4943, 0.5984, 0.6167, 0.6128, 0.8593, 0.1344, 0.5146],
        [0.1479, 0.4238, 0.5144, 0.7051, 0.8133, 0.1795, 0.2721, 0.4631]],
       dtype=torch.float64)
Matrix A: Rank  1  has data  tensor([[0.5459, 0.3573, 0.2406, 0.5600, 0.6501, 0.9692, 0.5168, 0.2422],
        [0.4312, 0.5917, 0.3425, 0.2202, 0.7030, 0.5629, 0.9259, 0.7612]],
       dtype=torch.float64)


A distributed matrix can also be created from local chunks.

In [48]:
def from_chunks(rank, size):
    torch.manual_seed(100 + rank)
    chunk = torch.randn(2, 4)
    print("rank ", rank, "has chunk", chunk)
    A = distmat.THDistMat.from_chunks(chunk)
    print('Matrix A:', 'Rank ', rank, ' has data ', A.chunk)    
    if rank == 0:
        print(A.shape)

In [49]:
run_process(4, from_chunks)

rank  0 has chunk tensor([[ 0.3607, -0.2859, -0.3938,  0.2429],
        [-1.3833, -2.3134, -0.3172, -0.8660]])
rank  1 has chunk tensor([[-1.3905, -0.8152, -0.3204,  0.7377],
        [-1.7534,  0.6033, -0.2520, -0.4373]])
rank  3 has chunk tensor([[ 1.7286, -0.4007,  2.5587,  1.6848],
        [-1.6571, -0.2811,  0.7743, -0.9554]])
rank  2 has chunk tensor([[ 0.9907,  0.3349,  1.1497, -0.5498],
        [-0.1046,  2.0104, -0.7886, -0.1246]])
Matrix A: Rank  3  has data  tensor([[ 1.7286, -0.4007,  2.5587,  1.6848],
        [-1.6571, -0.2811,  0.7743, -0.9554]])
Matrix A: Rank  1  has data  tensor([[-1.3905, -0.8152, -0.3204,  0.7377],
        [-1.7534,  0.6033, -0.2520, -0.4373]])
Matrix A: Rank  0  has data  tensor([[ 0.3607, -0.2859, -0.3938,  0.2429],
        [-1.3833, -2.3134, -0.3172, -0.8660]])
Matrix A: Rank  2  has data  tensor([[ 0.9907,  0.3349,  1.1497, -0.5498],
        [-0.1046,  2.0104, -0.7886, -0.1246]])
[8, 4]


Data can be distributed from the master process.

In [50]:
def dist_data(rank, size):
    if rank == 0:
        data = torch.rand(4, 2)
        print("master data: ", data)
    else:
        data = None
    
    data_dist = distmat.dist_data(data, src=0, TType=torch.DoubleTensor)
    print('data_dist: ', 'Rank ', rank, ' has data ', data_dist.chunk)
    dist.barrier()
    print(data_dist.shape) # shape of the distributed matrix
    dist.barrier()
    print(data_dist.sizes) # sizes along the distributed dimension

In [51]:
run_process(4, dist_data)

master data:  tensor([[0.7118, 0.7876],
        [0.4183, 0.9014],
        [0.9969, 0.7565],
        [0.2239, 0.3023]])
data_dist:  Rank  0  has data  tensor([[0.7118, 0.7876]], dtype=torch.float64)
data_dist:  Rank  3  has data  tensor([[0.2239, 0.3023]], dtype=torch.float64)
data_dist:  Rank  2  has data  tensor([[0.9969, 0.7565]], dtype=torch.float64)
data_dist:  Rank  1  has data  tensor([[0.4183, 0.9014]], dtype=torch.float64)
[4, 2]
[4, 2]
[4, 2]
[4, 2]
[1, 1, 1, 1]
[1, 1, 1, 1]
[1, 1, 1, 1]
[1, 1, 1, 1]


Remark: The default is to create a row-major matrix. It can easily be changed to a column-major matrix by transposition.

### Elementwise operations

Some of the basic functions work naturally. 

In [52]:
def elemwise_1(rank, size):
    A = distmat.distgen_uniform(4, 2)
    print("A: rank ", rank, "has chunk", A.chunk)
    B = distmat.distgen_uniform(4, 2)
    print("B: rank ", rank, "has chunk", B.chunk)    
    C = A + B
    print("C: rank ", rank, "has chunk", C.chunk)    
    C += A
    print("C: rank ", rank, "has chunk", C.chunk)
    
    logC = C.log()
    print("logC: rank ", rank, "has chunk", logC.chunk) 

In [53]:
run_process(4, elemwise_1)

A: rank  3 has chunk tensor([[0.5522, 0.9559]], dtype=torch.float64)
A: rank  0 has chunk tensor([[0.6941, 0.3464]], dtype=torch.float64)
A: rank  1 has chunk tensor([[0.9751, 0.7911]], dtype=torch.float64)
A: rank  2 has chunk tensor([[0.4274, 0.4460]], dtype=torch.float64)
B: rank  2 has chunk tensor([[0.6283, 0.3030]], dtype=torch.float64)
B: rank  0 has chunk tensor([[0.9405, 0.2215]], dtype=torch.float64)
B: rank  3 has chunk tensor([[0.1302, 0.1811]], dtype=torch.float64)
B: rank  1 has chunk tensor([[0.3271, 0.1352]], dtype=torch.float64)
C: rank  2 has chunk tensor([[1.0556, 0.7490]], dtype=torch.float64)
C: rank  1 has chunk tensor([[1.3022, 0.9263]], dtype=torch.float64)
C: rank  3 has chunk tensor([[0.6824, 1.1369]], dtype=torch.float64)
C: rank  0 has chunk tensor([[1.6346, 0.5680]], dtype=torch.float64)
C: rank  2 has chunk tensor([[1.4830, 1.1949]], dtype=torch.float64)
C: rank  1 has chunk tensor([[2.2773, 1.7175]], dtype=torch.float64)
C: rank  0 has chunk tensor([[2.32

Broadcasting (similar to Julia's dot broadcasting) also works as expected:

In [54]:
def dim_broadcasting(rank, size):
    A = distmat.distgen_uniform(4, 2) # [4] x 2 
    print("A: rank ", rank, "has chunk", A.chunk)
    B = distmat.distgen_ones(4, 1) # [4] x 1
    print("B: rank ", rank, "has chunk", B.chunk)
    A += B # B treated as [4] x 2 matrix
    print("A: rank ", rank, "has chunk", A.chunk)
    C = 2 * torch.ones(1, 2, dtype=torch.float64) # 1 x 2
    A += C # C treated as [4] x 2 matrix
    print("A: rank ", rank, "has chunk", A.chunk)     

In [55]:
run_process(4, dim_broadcasting)

A: rank  3 has chunk tensor([[0.5522, 0.9559]], dtype=torch.float64)
A: rank  0 has chunk tensor([[0.6941, 0.3464]], dtype=torch.float64)
A: rank  1 has chunk tensor([[0.9751, 0.7911]], dtype=torch.float64)
A: rank  2 has chunk tensor([[0.4274, 0.4460]], dtype=torch.float64)
B: rank  1 has chunk tensor([[1.]], dtype=torch.float64)
B: rank  3 has chunk tensor([[1.]], dtype=torch.float64)
B: rank  2 has chunk tensor([[1.]], dtype=torch.float64)
B: rank  0 has chunk tensor([[1.]], dtype=torch.float64)
A: rank  3 has chunk tensor([[1.5522, 1.9559]], dtype=torch.float64)
A: rank  1 has chunk tensor([[1.9751, 1.7911]], dtype=torch.float64)
A: rank  2 has chunk tensor([[1.4274, 1.4460]], dtype=torch.float64)
A: rank  0 has chunk tensor([[1.6941, 1.3464]], dtype=torch.float64)
A: rank  1 has chunk tensor([[3.9751, 3.7911]], dtype=torch.float64)
A: rank  3 has chunk tensor([[3.5522, 3.9559]], dtype=torch.float64)
A: rank  2 has chunk tensor([[3.4274, 3.4460]], dtype=torch.float64)
A: rank  0 ha
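
Because the split is along rows, this kind of broadcasting needs no communication: each rank can apply ordinary PyTorch broadcasting to its own chunk. The single-process sketch below imitates this with `torch.chunk`. It does not use `dist_stat`, and the variable names are illustrative assumptions.

```python
import torch

full = torch.arange(8, dtype=torch.float64).view(4, 2)   # stands in for a [4] x 2 matrix
col = torch.ones(4, 1, dtype=torch.float64)              # stands in for a [4] x 1 matrix
row = 2 * torch.ones(1, 2, dtype=torch.float64)          # local 1 x 2 tensor, same on every rank

# Split along the distributed (row) dimension: one chunk per simulated rank.
A_chunks = torch.chunk(full, 4, dim=0)   # four 1 x 2 chunks
B_chunks = torch.chunk(col, 4, dim=0)    # four 1 x 1 chunks

# Each rank applies ordinary broadcasting to its own chunks.
local_results = [a + b + row for a, b in zip(A_chunks, B_chunks)]

# Stacking the local results reproduces broadcasting on the full matrix.
assert torch.equal(torch.cat(local_results, 0), full + col + row)
```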

For general functions, we have `.apply()` and `.apply_binary()`.

In [56]:
def elemwise_2(rank, size):
    A = distmat.distgen_uniform(4, 2)
    print("A: rank ", rank, "has chunk", A.chunk)
    B = distmat.distgen_uniform(4, 2)
    print("B: rank ", rank, "has chunk", B.chunk) 
    Asqp1 = A.apply(lambda x: x**2 + 1)
    print("Asqp1: rank ", rank, "has chunk", Asqp1.chunk)    
    AsqpBsq = A.apply_binary(B, lambda x, y: x**2 + y**2)
    print("AsqpBsq: rank ", rank, "has chunk", AsqpBsq.chunk) 

In [57]:
run_process(4, elemwise_2)

A: rank  3 has chunk tensor([[0.5522, 0.9559]], dtype=torch.float64)
A: rank  0 has chunk tensor([[0.6941, 0.3464]], dtype=torch.float64)
A: rank  1 has chunk tensor([[0.9751, 0.7911]], dtype=torch.float64)
A: rank  2 has chunk tensor([[0.4274, 0.4460]], dtype=torch.float64)
B: rank  2 has chunk tensor([[0.6283, 0.3030]], dtype=torch.float64)
B: rank  1 has chunk tensor([[0.3271, 0.1352]], dtype=torch.float64)
B: rank  3 has chunk tensor([[0.1302, 0.1811]], dtype=torch.float64)
B: rank  0 has chunk tensor([[0.9405, 0.2215]], dtype=torch.float64)
Asqp1: rank  2 has chunk tensor([[1.1826, 1.1989]], dtype=torch.float64)
Asqp1: rank  3 has chunk tensor([[1.3049, 1.9137]], dtype=torch.float64)
Asqp1: rank  1 has chunk tensor([[1.9508, 1.6259]], dtype=torch.float64)
Asqp1: rank  0 has chunk tensor([[1.4818, 1.1200]], dtype=torch.float64)
AsqpBsq: rank  2 has chunk tensor([[0.5774, 0.2907]], dtype=torch.float64)
AsqpBsq: rank  3 has chunk tensor([[0.3219, 0.9465]], dtype=torch.float64)
AsqpBs

### Reductions (sum, max, min)

Summations, minimums, and maximums can be carried out in a way similar to local tensors.

In [58]:
def reductions(rank, size):
    A = distmat.distgen_uniform(4, 2)
    print("A: rank ", rank, "has chunk", A.chunk)
    print("sum of A: ", A.sum())
    print("maximum of A: ", A.max())
    print("minimum of A: ", A.min())
    
    sumA_row = A.sum(0) # reduces rows (columnwise sum); a tensor with the same values on all processes
    sumA_col = A.sum(1) # reduces columns (rowwise sum); a distributed matrix
    print("row sum of A: ", sumA_row)
    
    print("sumA_col: rank ", rank, "has chunk", sumA_col.chunk)

In [59]:
run_process(4, reductions)

A: rank  0 has chunk tensor([[0.6941, 0.3464]], dtype=torch.float64)
A: rank  3 has chunk tensor([[0.5522, 0.9559]], dtype=torch.float64)
A: rank  1 has chunk tensor([[0.9751, 0.7911]], dtype=torch.float64)
A: rank  2 has chunk tensor([[0.4274, 0.4460]], dtype=torch.float64)
sum of A:  tensor(5.1882, dtype=torch.float64)
sum of A:  tensor(5.1882, dtype=torch.float64)
sum of A:  tensor(5.1882, dtype=torch.float64)
sum of A:  tensor(5.1882, dtype=torch.float64)
maximum of A:  tensor(0.9751, dtype=torch.float64)
maximum of A:  tensor(0.9751, dtype=torch.float64)
maximum of A:  tensor(0.9751, dtype=torch.float64)
maximum of A:  tensor(0.9751, dtype=torch.float64)
minimum of A:  tensor(0.3464, dtype=torch.float64)
minimum of A:  tensor(0.3464, dtype=torch.float64)
minimum of A:  tensor(0.3464, dtype=torch.float64)
minimum of A:  tensor(0.3464, dtype=torch.float64)
row sum of A:  tensor([[2.6488, 2.5394]], dtype=torch.float64)
row sum of A:  tensor([[2.6488, 2.5394]], dtype=torch.float64)
row sum of A:  tensor([[2.6488, 2.5394]], dtype=torch.float64)
row sum of A:  tensor([[2.6488, 2.5394]], dtype=torch.float64)
sumA_col: rank  0 has chunk tensor([[1.0406]], dtype=torch.float64)
sumA_col: rank  2 has chunk tensor([[0.8733]], dtype=torch.float64)
sumA_col: rank  
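
One way to think about these reductions is as a local reduction on each chunk followed, where necessary, by an all-reduce. The plain-Python sketch below applies this idea to the chunks of `A` printed above (rounded to four digits, so the last digit of a result may differ from the notebook output). It illustrates the pattern only and is not taken from the `dist_stat` implementation; `all_reduce_sum` is a hypothetical stand-in for `dist.all_reduce`.

```python
# Row chunks of A as printed above: rank k holds one 1 x 2 row.
chunks = [
    [[0.6941, 0.3464]],   # rank 0
    [[0.9751, 0.7911]],   # rank 1
    [[0.4274, 0.4460]],   # rank 2
    [[0.5522, 0.9559]],   # rank 3
]

def all_reduce_sum(local_values):
    """Toy all-reduce: every rank would receive this elementwise sum."""
    return [sum(vals) for vals in zip(*local_values)]

# A.sum(): local scalar sums, then a sum over ranks.
total = sum(sum(sum(row) for row in chunk) for chunk in chunks)

# A.sum(0): local columnwise sums, then an all-reduce; same result on every rank.
local_col_sums = [[sum(col) for col in zip(*chunk)] for chunk in chunks]
col_sums = all_reduce_sum(local_col_sums)

# A.sum(1): rowwise sums need no communication; the result stays distributed.
row_sums = [[[sum(row)] for row in chunk] for chunk in chunks]

print(round(total, 4))                     # 5.1882
print([round(c, 4) for c in col_sums])     # [2.6488, 2.5394]
```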

### Diagonals

In [60]:
def diagonals(rank, size):
    if rank == 0:
        p = 4
        data = torch.randn(p, p)
        print("master data: ", data)
    else:
        data = None
        
    data_dist = distmat.dist_data(data, src=0, TType=torch.DoubleTensor)
    
    diag1 = data_dist.diag() # distributed diagonal
    print("diag1: rank ", rank, "has chunk", diag1.chunk)
    
    diag2 = data_dist.diag(distribute=False) # diagonal gathered in each process
    print("diag2: ", diag2)
    
    data_dist.fill_diag_(0) # fill the diagonals with zeros
    print("data_dist: rank ", rank, "has chunk", data_dist.chunk)

In [61]:
run_process(4, diagonals)

master data:  tensor([[ 0.6857,  0.7877, -0.9778,  2.1302],
        [-3.1896,  1.5914, -0.0247, -0.8466],
        [ 1.4205, -1.5741, -0.3572, -0.3097],
        [ 1.1705, -0.5410, -0.7116,  0.0575]])
diag1: rank  0 has chunk tensor([[0.6857]], dtype=torch.float64)
diag1: rank  3 has chunk tensor([[0.0575]], dtype=torch.float64)
diag1: rank  1 has chunk tensor([[1.5914]], dtype=torch.float64)
diag1: rank  2 has chunk tensor([[-0.3572]], dtype=torch.float64)
diag2:  tensor([[ 0.6857],
        [ 1.5914],
        [-0.3572],
        [ 0.0575]], dtype=torch.float64)
diag2:  tensor([[ 0.6857],
        [ 1.5914],
        [-0.3572],
        [ 0.0575]], dtype=torch.float64)
diag2:  tensor([[ 0.6857],
        [ 1.5914],
        [-0.3572],
        [ 0.0575]], dtype=torch.float64)
diag2:  tensor([[ 0.6857],
        [ 1.5914],
        [-0.3572],
        [ 0.0575]], dtype=torch.float64)
data_dist: rank  0 has chunk tensor([[ 0.0000,  0.7877, -0.9778,  2.1302]], dtype=torch.float64)
data_dist: rank  3 

### Matrix multiplications

Six different scenarios of matrix-matrix multiplications, each representing a different configuration of the split dimension of two input
matrices and the output matrix, were considered and implemented. 

| Scenario | $A$ | $B$ | $AB$ | Description | Usage |
|:---|:---|:---|:---|:---|:---|
| 1 | $r \times [p]$ | $[p] \times q$ | $r \times [q]$ | Inner product, result distributed. | $V^T X$ |
| 2 | $[p] \times q$ | $[q] \times r$ | $[p] \times r$ | Fat matrix multiplied by a thin and tall matrix. | $X W^T$ |
| 3 | $r \times [p]$ | $[p] \times s$ | $r \times s$ | Inner product, result broadcasted. Inner product between two thin matrices. | $V^T V$, $W W^T$ |
| 4 | $[p] \times r$ | $r \times [q]$ | $[p] \times q$ | Outer product, may require a large amount of memory. Used for computing the objective function. | $VW$ |
| 5 | $[p] \times r$ | $r \times s$ | $[p] \times s$ | A distributed matrix multiplied by a small, broadcasted matrix. | $VC$ where $C = WW^T$; $CW$ where $C = V^T V$ (transposed) |
| 6 | $r \times [p]$ | $p \times s$ | $r \times s$ | A distributed matrix multiplied by a thin-and-tall broadcasted matrix. | Multiplication of a matrix by a broadcasted vector. |



The implementation of each case is carried out using the collective communication directives. The matrix multiplication scenario is selected automatically based on the shapes of the input matrices $A$ and $B$, except for Scenarios 1 and 3, which share the same input structure. Those two are further distinguished by the shape of the output, $AB$. Nonnegative matrix factorization involves Scenarios 1 to 5. Scenario 6 is for matrix-vector multiplications, where broadcasting small vectors is almost always efficient.
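
To make the communication pattern concrete, Scenarios 3 and 5 can be imitated in a single process with nested lists standing in for the chunks held by each rank. This is a sketch under simplifying assumptions (two ranks, integer entries, hypothetical helper names), not the `dist_stat` implementation.

```python
def matmul(A, B):
    """Plain-Python product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

V = [[1, 2], [3, 4], [5, 6], [7, 8]]   # a [4] x 2 matrix, split by rows over 2 ranks
V_chunks = [V[0:2], V[2:4]]            # chunk of rank 0 and chunk of rank 1

# Scenario 3: (r x [p]) x ([p] x s) = r x s.
# Each rank multiplies its own chunks; an all-reduce (here a plain sum) combines them,
# so every rank ends up with the same small matrix.
partial = [matmul(transpose(c), c) for c in V_chunks]
VtV = partial[0]
for P in partial[1:]:
    VtV = add(VtV, P)
assert VtV == matmul(transpose(V), V)

# Scenario 5: ([p] x r) x (r x s) = [p] x s with a broadcasted second factor.
# No communication is needed; the result stays distributed by rows.
local_products = [matmul(c, VtV) for c in V_chunks]
assert local_products[0] + local_products[1] == matmul(V, VtV)
```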

In [62]:
from dist_stat.distmm import *
def test_distmm(rank, size):
    TType = torch.DoubleTensor
    p, q, r = 8, 4, 2
    if rank==0:
        fat   = TType(p, q).normal_()
        thin1 = TType(q, r).normal_()
        thin2 = TType(p, r).normal_()
    else:
        fat, thin1, thin2 = None, TType(q,r), TType(p,r)       
    # broadcast thin1 and thin2 so they have same values across all processes
    dist.broadcast(thin1,0)
    dist.broadcast(thin2,0)
    # distribute the matrices
    fat_dist   = distmat.dist_data(fat, src=0, TType=TType)
    thin1_dist = distmat.dist_data(thin1, src=0, TType=TType)
    thin2_dist = distmat.dist_data(thin2, src=0, TType=TType)
    
    if rank==0:
        print("Scenario 1: thin2^T x fat" ) 
        correct = torch.mm(torch.t(thin2), fat)
        print(correct)
    dist.barrier()
    # (r x [p]) x ([p] x q) = (r x [q]). 
    # `out_sizes` gives information on how to split [q]. Available only if r != q.
    rslt_dist = distmat.mm(thin2_dist.t(), fat_dist, out_sizes=thin1_dist.sizes)
    print("rslt in rand %d: "%(rank,), rslt_dist.chunk)
    assert not rslt_dist.byrow
    dist.barrier()
    
    if rank==0:
        print("Scenario 1 (transposed): fat^T x thin2" )
        correct = torch.mm(torch.t(fat), thin2)
        print(correct)
    # (q x [p]) x ([p] x r) = ([q] x r). 
    # `out_sizes` gives information on how to split q.
    rslt_dist = distmat.mm(fat_dist.t(), thin2_dist, out_sizes=thin1_dist.sizes)
    print("rslt in rand %d: "%(rank,), rslt_dist.chunk)
    assert rslt_dist.byrow
    
    if rank==0:
        print("Scenario 2: fat x thin1")
        correct = torch.mm(fat, thin1)
        print(correct)
    dist.barrier()
    # ([p] x q) x ([q] x r) = ([p] x r).
    rslt_dist = distmat.mm(fat_dist, thin1_dist) 
    print("rslt in rank %d: "%(rank,), rslt_dist.chunk)
    assert rslt_dist.byrow
    dist.barrier()

    if rank==0:
        print("Scenario 2 (transposed): thin1^T x fat^T")
        correct = torch.mm(torch.t(thin1), torch.t(fat))
        print(correct)
    dist.barrier()
    # (r x [q]) x (q x [p]) = (r x [p]).
    rslt_dist = distmat.mm(thin1_dist.t(), fat_dist.t())
    print("rslt in rank %d: "%(rank,), rslt_dist.chunk)
    assert not rslt_dist.byrow
    dist.barrier()

    if rank==0: 
        print("Scenario 3: thin1^T x thin1")
        thin1_thin1 = torch.mm(torch.t(thin1), thin1)
        print("correct: ", thin1_thin1)
    dist.barrier()
    # (r x [p]) x ([p] x s) = (r x s).
    # Selected when no `out_sizes` is given.
    thin1_thin1_bd = distmat.mm(thin1_dist.t(), thin1_dist)
    print("rslt in rank %d: "%(rank,), thin1_thin1_bd)
    dist.barrier()

    if rank==0:
        print("Scenario 4: thin1 x thin2^T" )
        correct = torch.mm(thin1, torch.t(thin2))
        print("correct: ", correct)
    dist.barrier()
    # ([p] x r) x (r x [q]) = ([p] x q).
    rslt_dist = distmat.mm(thin1_dist, thin2_dist.t())
    print("rslt in rank %d: "%(rank,), rslt_dist.chunk)
    assert rslt_dist.byrow

    if rank==0:
        print("Scenario 5: thin2 x thin1_thin1")
        correct = torch.mm(thin2, thin1_thin1)
        print("correct: ", correct)
    dist.barrier()
    # ([p] x r) x (r x s) = ([p] x s).
    rslt_dist = distmat.mm(thin2_dist, thin1_thin1_bd)
    print("rslt in rank %d: "%(rank,), rslt_dist.chunk)
    assert rslt_dist.byrow # the result must be row-major
    dist.barrier()

    if rank==0:
        print("Scenario 5 (transposed): thin1_thin1 x thin2^T")
        correct = torch.mm(thin1_thin1, torch.t(thin2))
        print("correct: ", correct)
    dist.barrier()
    # ([p] x r) x (r x s) = ([p] x s).
    rslt_dist = distmat.mm(thin1_thin1_bd, thin2_dist.t())
    print("rslt in rank %d: "%(rank,), rslt_dist.chunk)
    assert not rslt_dist.byrow # the result is col-major
    dist.barrier()

    if rank==0:
        print("Scenario 6: thin1^T x thin1 (local broadcasted)" )
        correct = torch.mm(torch.t(thin1), thin1)
        print("correct: ", correct)
    # (r x [p]) x (p x s) = (r x s).
    rslt_dist = distmat.mm(thin1_dist.t(), thin1) #distmm_db_b(thin1_dist.t(), thin1)
    print("rslt in rank %d (local broadcasted): "%(rank,), rslt_dist)
    dist.barrier()
 

In [63]:
run_process(4, test_distmm)

Scenario 1: thin2^T x fat
tensor([[ 0.1560,  4.6003,  1.0414, -2.2172],
        [-0.3434,  5.0076,  1.5957,  2.9931]], dtype=torch.float64)




rslt in rand 0:  tensor([[ 0.1560],
        [-0.3434]], dtype=torch.float64)rslt in rand 2:  tensor([[1.0414],
        [1.5957]], dtype=torch.float64)rslt in rand 1:  tensor([[4.6003],
        [5.0076]], dtype=torch.float64)rslt in rand 3:  tensor([[-2.2172],
        [ 2.9931]], dtype=torch.float64)



Scenario 1 (transposed): fat^T x thin2
tensor([[ 0.1560, -0.3434],
        [ 4.6003,  5.0076],
        [ 1.0414,  1.5957],
        [-2.2172,  2.9931]], dtype=torch.float64)
rslt in rand 0:  tensor([[ 0.1560, -0.3434]], dtype=torch.float64)
rslt in rand 1:  tensor([[4.6003, 5.0076]], dtype=torch.float64)
rslt in rand 3:  tensor([[-2.2172,  2.9931]], dtype=torch.float64)
rslt in rand 2:  tensor([[1.0414, 1.5957]], dtype=torch.float64)
Scenario 2: fat x thin1
tensor([[ 3.5298,  1.1274],
        [-0.5137,  0.8368],
        [-3.9244, -2.0757],
        [-1.2072,  0.7788],
        [ 0.9921,  2.5060],
        [ 0.9099,  1.0310],
        [ 0.9737,  0.2209],
        [ 0.5159,  2.2222]], dtype=torc

As stated before, a distributed matrix may hold a sparse matrix as its local data.

In [64]:
def mm_sparse(rank, size):
    def to_sparse(x):
        """ converts dense tensor x to sparse format """
        x_typename = torch.typename(x).split('.')[-1]
        sparse_tensortype = getattr(torch.sparse, x_typename)

        indices = torch.nonzero(x)
        if len(indices.shape) == 0:  # if all elements are zeros
            return sparse_tensortype(*x.shape)
        indices = indices.t()
        values = x[tuple(indices[i] for i in range(indices.shape[0]))]
        return sparse_tensortype(indices, values, x.size())

    TType = torch.DoubleTensor
    q, r = 4, 2
    if rank==0:
        thin1 = TType(q, r).normal_()
    else:
        thin1 = TType(q,r)
        
    # broadcast thin1 so that it has the same values across all processes
    dist.broadcast(thin1,0)
    
    # distribute the matrices
    thin1_dist = distmat.dist_data(thin1, src=0, TType=TType) 

    # construct sparse matrices (no actual zero in this case, but uses sparse matrix data structure)
    thin1_sparse_chunk = to_sparse(thin1_dist.chunk)
    print("Sparse x dense, Scenario 3: in rank ", rank, "we have: ", thin1_sparse_chunk)
    thin1_sparse_dist = THDistMat.from_chunks(thin1_sparse_chunk)

    if rank==0:
        print("correct: ", torch.mm(thin1.t(), thin1))

    r =  distmat.mm(thin1_sparse_dist.t(), thin1_dist )
    print("rslt: ", r)

In [65]:
run_process(4, mm_sparse)

Sparse x dense, Scenario 3: in rank  0 we have:  tensor(indices=tensor([[0, 0],
                       [0, 1]]),
       values=tensor([-0.3172, -0.8660]),
       size=(1, 2), nnz=2, dtype=torch.float64, layout=torch.sparse_coo)Sparse x dense, Scenario 3: in rank  3 we have:  tensor(indices=tensor([[0, 0],
                       [0, 1]]),
       values=tensor([-2.3652, -0.8047]),
       size=(1, 2), nnz=2, dtype=torch.float64, layout=torch.sparse_coo)

Sparse x dense, Scenario 3: in rank  1 we have:  tensor(indices=tensor([[0, 0],
                       [0, 1]]),
       values=tensor([ 1.7482, -0.2759]),
       size=(1, 2), nnz=2, dtype=torch.float64, layout=torch.sparse_coo)Sparse x dense, Scenario 3: in rank  2 we have:  tensor(indices=tensor([[0, 0],
                       [0, 1]]),
       values=tensor([-0.9755,  0.4790]),
       size=(1, 2), nnz=2, dtype=torch.float64, layout=torch.sparse_coo)

correct:  tensor([[9.7023, 1.2284],
        [1.2284, 1.7031]], dtype=torch.float64)




rslt:  tensor([[9.7023, 1.2284],
        [1.2284, 1.7031]], dtype=torch.float64)rslt:  tensor([[9.7023, 1.2284],
        [1.2284, 1.7031]], dtype=torch.float64)rslt:  tensor([[9.7023, 1.2284],
        [1.2284, 1.7031]], dtype=torch.float64)
rslt:  tensor([[9.7023, 1.2284],
        [1.2284, 1.7031]], dtype=torch.float64)
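
Every rank prints the same `rslt` because, in Scenario 3, the distributed dimension is the one being summed over. The following is a minimal single-process sketch of that reduction, not the `dist_stat` implementation. It assumes an even row split, uses NumPy arrays in place of per-rank tensors, and replaces the collective sum (presumably an all-reduce) with a plain Python `sum`. The names `chunks`, `partials`, and `reduced` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 4                       # number of simulated ranks
q, r = 4, 2                    # same shape as `thin1` above
thin1 = rng.standard_normal((q, r))

# Row-wise split: simulated rank k holds chunks[k], a (q/size) x r block.
chunks = np.array_split(thin1, size, axis=0)

# Each rank multiplies only its own block: (r x [q]) x ([q] x r).
partials = [c.T @ c for c in chunks]

# Summing the partial products stands in for the collective reduction;
# afterwards every rank would hold the same r x r matrix.
reduced = sum(partials)

print(np.allclose(reduced, thin1.T @ thin1))
```

Only the small $r \times r$ partial products need to be communicated, which is why this scenario suits tall-and-thin matrices.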




## Nonnegative Matrix Factorization (NMF)

Nonnegative matrix factorization approximates a nonnegative data matrix $X \in \mathbb{R}^{m \times p}$ by a product $VW$ of two nonnegative low-rank factors, $V \in \mathbb{R}^{m \times r}$ and $W \in \mathbb{R}^{r \times p}$. In a simple setting, NMF minimizes
\begin{align*}
f(V, W) =  \|X - VW\|_\mathrm{F}^2.
\end{align*}

Multiplicative algorithm [Lee and Seung, 1999, 2001]:
\begin{align*}
V^{n+1} &= V^n \odot [X (W^n)^T] \oslash [V^n W^n (W^n)^T] \\
W^{n+1} &= W^n \odot [(V^{n+1})^T X] \oslash [(V^{n+1})^T V^{n+1} W^n],
\end{align*}
where $\odot$ and $\oslash$ denote elementwise multiplication and division.

The following code is a simplified version. The full object-oriented version is included in our package.

In [66]:
def nmf(rank, size):
    p = 8; q = 12; r = 3
    maxiter = 1000
    TensorType=torch.DoubleTensor
    data = distmat.distgen_uniform(p, q, TType=TensorType)
    V = distmat.distgen_uniform(p, r, TType=TensorType)
    W = distmat.distgen_uniform(q, r, TType=TensorType).t()
    for i in range(maxiter):
        XWt =  distmat.mm(data, W.t())
        WWt =  distmat.mm(W, W.t())
        VWWt = distmat.mm(V, WWt)
        V.mul_(XWt).div_(VWWt)

        VtX  = distmat.mm(V.t(), data, out_sizes=W.sizes)
        VtV  = distmat.mm(V.t(), V)
        VtVW = distmat.mm(VtV, W)
        W = W.mul_(VtX).div_(VtVW)
        if (i+1) % 100 == 0:
            # print objective
            outer = distmat.mm(V, W)
            val = ((data - outer)**2).sum()
            if rank == 0:
                print("Iteration {}: {}".format(i+1, val))

In [67]:
run_process(4, nmf)



Iteration 100: 2.182438128886236
Iteration 200: 2.171416631750278
Iteration 300: 2.169680386094355
Iteration 400: 2.168717940886758
Iteration 500: 2.167978637258126
Iteration 600: 2.167353482884231
Iteration 700: 2.1668354282699847
Iteration 800: 2.1664281438361552
Iteration 900: 2.1661187200995116
Iteration 1000: 2.165885899848217
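
A single-process reference can help sanity-check a distributed run of the multiplicative updates. The sketch below is an assumption-laden stand-in: it uses NumPy rather than `distmat`, its own random data (so the objective values will not match the output above), and illustrative names such as `objective` and `history`. It records the objective after every iteration, which should never increase under the multiplicative updates.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 8, 12, 3                      # same sizes as in `nmf` above
X = rng.uniform(size=(p, q))
V = rng.uniform(size=(p, r))
W = rng.uniform(size=(r, q))

def objective(X, V, W):
    """Squared Frobenius norm of the residual X - VW."""
    return float(np.sum((X - V @ W) ** 2))

history = [objective(X, V, W)]
for _ in range(200):
    # V <- V * (X W^T) / (V W W^T), all operations elementwise
    V *= (X @ W.T) / (V @ (W @ W.T))
    # W <- W * (V^T X) / (V^T V W), using the freshly updated V
    W *= (V.T @ X) / ((V.T @ V) @ W)
    history.append(objective(X, V, W))

print("first: {:.4f}, last: {:.4f}".format(history[0], history[-1]))
```

Because the updates only multiply and divide positive quantities, `V` and `W` stay nonnegative without any explicit projection.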


The package `dist_stat` implements the following:

* Nonnegative matrix factorization:
    * Multiplicative method
    * Alternating proximal gradient method
* Positron Emission Tomography:
    * with $\ell_2$-penalty, MM method
    * with $\ell_1$-penalty, primal-dual method
* Multidimensional Scaling
* $\ell_1$-regularized Cox regression

In [68]:
def nmf_mult(rank, size):
    import dist_stat.nmf as nmf
    p = 8; q = 12; r = 3
    maxiter = 3000
    TensorType=torch.DoubleTensor
    torch.manual_seed(100)
    data = distmat.distgen_uniform(p, q, TType=TensorType, set_from_master=True) # to guarantee same input matrices throughout experiments
    driver = nmf.NMF(data, r)
    V, W = driver.run(maxiter=maxiter, tol=1e-5, check_interval=100, check_obj=True) 
    # if check_obj=False, the objective value is not estimated.
    # the convergence is determined based on maximum change in V and W.

In [69]:
run_process(4, nmf_mult)

Starting...
p=8, q=12, r=3
  iter	      V_maxdiff	      W_maxdiff	        reldiff	            obj	      time
--------------------------------------------------------------------------------




   100	2.269086434e-04	7.859795175e-04	            inf	2.450653178e+00	   0.43880
   200	4.741822670e-05	1.487613723e-04	3.867071318e-04	2.449319301e+00	   0.43996
   300	2.064260964e-05	7.563094246e-05	2.621296866e-05	2.449228887e+00	   0.40669
   400	2.000570493e-05	7.475431599e-05	2.011233401e-05	2.449159516e+00	   0.42035
   500	2.812954891e-05	9.523429625e-05	3.124107173e-05	2.449051764e+00	   0.40650
   600	4.381995135e-05	1.446528443e-04	5.162722771e-05	2.448873708e+00	   0.43803
   700	6.619583670e-05	2.162175833e-04	8.868082228e-05	2.448567886e+00	   0.39906
   800	9.643257393e-05	3.092734985e-04	1.528160691e-04	2.448040970e+00	   0.40759
   900	1.325321757e-04	4.134715304e-04	2.479164259e-04	2.447186356e+00	   0.39362
  1000	1.650634082e-04	4.965977026e-04	3.522855042e-04	2.445972390e+00	   0.39836
  1100	1.783495447e-04	5.148614934e-04	4.124739364e-04	2.444551602e+00	   0.39160
  1200	1.640825205e-04	4.541542368e-04	3.852711367e-04	2.443225027e+00	   0.39599
  1300	1.3148351

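The alternating projected gradient (APG) method with ridge penalties minimizes

\begin{align*}
f(V, W; \epsilon) =  \|X - VW\|_\mathrm{F}^2 + \frac{\epsilon}{2} \|V\|_\mathrm{F}^2 + \frac{\epsilon}{2} \|W\|_\mathrm{F}^2
\end{align*}
over nonnegative $V$ and $W$.
The corresponding APG update, with step sizes $\sigma_n$, $\tau_n$ and $P_+$ denoting projection onto the nonnegative orthant, is given by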
\begin{align*}
V^{n+1} &= P_+ \left((1 - \sigma_n \epsilon) V^n - \sigma_n (V^n W^n (W^n)^T - X (W^n)^T) \right) \\
W^{n+1} &= P_+ \left((1 - \tau_n \epsilon) W^n - \tau_n ((V^{n+1})^T V^{n+1} W^n - (V^{n+1})^TX ) \right).
\end{align*}
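
As a rough single-process illustration of these updates, the sketch below uses NumPy with $\epsilon = 0$. The step sizes $\sigma_n = 1/(\|W^n (W^n)^T\|_2 + \epsilon)$ and $\tau_n = 1/(\|(V^{n+1})^T V^{n+1}\|_2 + \epsilon)$ are an assumption made here for the example; they are not taken from `dist_stat`, and the helper names `project` and `objective` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, r = 8, 12, 3
eps = 0.0                                # ridge parameter epsilon (kept at 0 here)
X = rng.uniform(size=(m, p))
V = rng.uniform(size=(m, r))
W = rng.uniform(size=(r, p))

def project(A):
    """P_+: projection onto the nonnegative orthant."""
    return np.maximum(A, 0.0)

def objective(X, V, W):
    """Squared Frobenius norm of the residual (the objective when eps = 0)."""
    return float(np.sum((X - V @ W) ** 2))

history = [objective(X, V, W)]
for _ in range(200):
    WWt = W @ W.T
    sigma = 1.0 / (np.linalg.norm(WWt, 2) + eps)   # assumed step size
    V = project((1 - sigma * eps) * V - sigma * (V @ WWt - X @ W.T))

    VtV = V.T @ V
    tau = 1.0 / (np.linalg.norm(VtV, 2) + eps)     # assumed step size
    W = project((1 - tau * eps) * W - tau * (VtV @ W - V.T @ X))
    history.append(objective(X, V, W))

print("first: {:.4f}, last: {:.4f}".format(history[0], history[-1]))
```

With these step sizes each block update is a projected gradient step that does not increase the residual, so `history` is nonincreasing.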

The code below runs APG with $\epsilon=0$ through the package's `dist_stat.nmf_pg` module.

In [70]:
def nmf_apg(rank, size):
    import dist_stat.nmf_pg as nmf
    p = 8; q = 12; r = 3
    maxiter = 3000
    TensorType=torch.DoubleTensor
    torch.manual_seed(100)
    data = distmat.distgen_uniform(p, q, TType=TensorType, set_from_master=True)
    driver = nmf.NMF(data, r)
    V, W = driver.run(maxiter=maxiter, tol=1e-5, check_interval=100, check_obj=True)

In [71]:
run_process(4, nmf_apg)

Starting...
p=8, q=12, r=3
  iter	      V_maxdiff	      W_maxdiff	        reldiff	            obj	      time
--------------------------------------------------------------------------------




   100	2.424162256e-03	3.832382365e-03	            inf	2.491386053e+00	   0.61891
   200	4.781538175e-04	4.835312406e-04	1.270388614e-02	2.447588284e+00	   0.62712
   300	1.557436503e-04	1.730102196e-04	4.787130896e-04	2.445938668e+00	   0.57892
   400	1.891190770e-04	1.971794805e-04	1.873943553e-04	2.445293040e+00	   0.57223
   500	1.988674663e-04	2.022453200e-04	1.796544182e-04	2.444674189e+00	   0.57298
   600	2.006114927e-04	2.003183483e-04	1.786643789e-04	2.444058858e+00	   0.56944
   700	1.974948598e-04	1.938984347e-04	1.720535717e-04	2.443466398e+00	   0.56738
   800	1.902900729e-04	1.837831141e-04	1.594379137e-04	2.442917466e+00	   0.56681
   900	1.795209595e-04	1.706543356e-04	1.419437735e-04	2.442428835e+00	   0.56539
  1000	1.658913058e-04	1.553432628e-04	1.213973120e-04	2.442010984e+00	   0.56436
  1100	1.502856314e-04	1.387773767e-04	9.986312239e-05	2.441667288e+00	   0.56695
  1200	1.336529807e-04	1.218600833e-04	7.919658999e-05	2.441394742e+00	   0.57079
  1300	1.1687950

At large scale, with GPU or SIMD acceleration, the gap in per-iteration time between the two methods becomes much smaller, and APG converges faster.

## $\ell_1$-regularized Cox Regression

We maximize
$$
f(\beta) = L(\beta) - \lambda \|\beta\|_1,
$$
where $L(\beta)$ is the log-partial likelihood of the Cox proportional hazards model:
\begin{align*}
L (\beta) = \sum_{i=1}^m \delta_i \left[\beta^T x_i - \log \left(\sum_{j: y_j \ge y_i} \exp(\beta^T x_j)\right)\right]. 
\end{align*}

* $y_i = \min \{t_i, c_i\}$
    * $t_i$: time to event
    * $c_i$: right-censoring time for that sample
* $\delta = (\delta_1, \dotsc, \delta_m)^T$
    * $\delta_i= I_{\{t_i \le c_i\}}$: indicator that the event for sample $i$ is observed, i.e., that sample $i$ is not censored.

The gradient of $L(\beta)$ is given by 
\begin{align*}
\nabla L(\beta) = X^T (I-P) \delta,
\end{align*} 
  
where $w_i = \exp(x_i^T \beta)$, $W_j = \sum_{i: y_i \ge y_j} w_i$, $P = (\pi_{ij})$, and 
$$
\pi_{ij} = I(y_i \ge y_j) w_i/W_j.
$$

We use the proximal gradient method:

\begin{align*}
w_i^{n+1} &= \exp(x_i^T \beta^n); \;\; W_j^{n+1} = \sum_{i: y_i \ge y_j} w_i^{n+1}\\
\pi_{ij}^{n+1} &= I(y_i \ge y_j) w_i^{n+1} / W_j^{n+1} \\
\Delta^{n+1} &= X^T (I - P^{n+1}) \delta, \;\; \text{where $P^{n+1} = (\pi_{ij}^{n+1})$} \\
\beta^{n+1} &= \mathcal{S}_{\sigma\lambda}(\beta^n + \sigma \Delta^{n+1}),
\end{align*}

* $\mathcal{S}_\lambda(\cdot)$ is the soft-thresholding operator, the proximity operator of $\lambda \|\cdot \|_1$. 
$$
  [\mathcal{S}_{\lambda}(u)]_i = \mathrm{sign}(u_i) (|u_i| - \lambda)_+.
$$
* Convergence is guaranteed when $\sigma \le 1/(2 \|X\|_2^2)$.
* $W_j$ can be computed with the `cumsum` function when the data are sorted in nonincreasing order of $y_i$.
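
The `cumsum` remark and the soft-thresholding operator can be made concrete with a small single-process sketch. It is written in NumPy and is not the `dist_stat.cox` implementation. It assumes the rows are already sorted so that $y$ is nonincreasing with no ties, and the function names `soft_threshold` and `cox_gradient_sorted` are made up for this example. The result is checked against a dense $P$ built directly from the definition of $\pi_{ij}$.

```python
import numpy as np

def soft_threshold(u, lam):
    """Elementwise sign(u) * (|u| - lam)_+."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def cox_gradient_sorted(X, delta, beta):
    """X^T (I - P) delta for rows sorted by nonincreasing y (no ties assumed)."""
    w = np.exp(X @ beta)
    W = np.cumsum(w)                            # W_j = sum of w_i over i with y_i >= y_j
    tail = np.cumsum((delta / W)[::-1])[::-1]   # sum over j >= i of delta_j / W_j
    return X.T @ (delta - w * tail)             # (P delta)_i = w_i * tail_i

rng = np.random.default_rng(0)
m, p = 8, 3
y = np.sort(rng.uniform(size=m))[::-1]          # nonincreasing observed times
X = rng.standard_normal((m, p))
delta = rng.integers(0, 2, size=m).astype(float)
beta = 0.1 * rng.standard_normal(p)

# Dense reference built directly from pi_ij = I(y_i >= y_j) w_i / W_j.
w = np.exp(X @ beta)
at_risk = (y[:, None] >= y[None, :]).astype(float)
W_dense = at_risk.T @ w
P = at_risk * w[:, None] / W_dense[None, :]
grad_dense = X.T @ (delta - P @ delta)

grad_fast = cox_gradient_sorted(X, delta, beta)
print(np.allclose(grad_fast, grad_dense))
print(soft_threshold(np.array([-2.0, 0.5, 3.0]), 1.0))
```

The sorted version never forms the $m \times m$ matrix $P$; it needs only two cumulative sums and one matrix-vector product.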

In [74]:
def cox_l1(rank, size):
    import dist_stat.cox as cox
    n = 8; p = 12
    lambd = 0.001
    maxiter = 3000
    torch.manual_seed(100)
    TensorType = torch.DoubleTensor
    # The below is how one would create a column-major matrix.
    # We want column-major matrix to invoke matrix multiplication scenario 3. (`beta` is distributed.)
    X = distmat.distgen_normal(p, n, TType=TensorType, set_from_master=True).t()     
    torch.manual_seed(200)
    delta = torch.multinomial(torch.tensor([1., 1.]), n, replacement=True).float().view(-1, 1).type(TensorType) # 50% censored, 50% noncensored
    dist.broadcast(delta, 0) # same delta shared across processes
    cox_driver = cox.COX(X, delta, lambd, seed=300, TType=TensorType, sigma='power') # power iteration to estimate matrix norm
    beta = cox_driver.run(maxiter, tol=1e-5, check_interval=100, check_obj=True)
    # Elementwise equality yields a uint8 tensor; cast it to int64 before summing.
    # Omitting the cast will cause overflow on high-dimensional data.
    zeros = (beta == 0).type(torch.int64).sum()
    if rank == 0:
        print("number of zeros:", zeros)

In [75]:
run_process(4, cox_l1)



computing max singular value...
iteration 0
done computing max singular value:  tensor(5.5130, dtype=torch.float64)
step size:  0.016451195454478245
Starting...
n=8, p=12
  iter	        maxdiff	        reldiff	            obj	      time
--------------------------------------------------------------------------------
   100	4.263906103e-03	            inf	-6.928162346e-01	   0.38736
   200	2.137461323e-03	1.988562069e-03	-4.120260836e-01	   0.38458
   300	1.407481825e-03	7.553056657e-04	-3.128646711e-01	   0.37742
   400	1.007443583e-03	3.915195766e-04	-2.634000845e-01	   0.35287
   500	7.746264875e-04	2.342474319e-04	-2.344826456e-01	   0.38189
   600	6.264630031e-04	1.527405316e-04	-2.159107600e-01	   0.36084
   700	5.266433517e-04	1.052134521e-04	-2.032509415e-01	   0.34661
   800	4.567976771e-04	7.524416395e-05	-1.942647959e-01	   0.34587
   900	4.066494457e-04	5.527596186e-05	-1.876996717e-01	   0.33700
  1000	3.700033916e-04	4.142489420e-05	-1.827999355e-01	   0.33781
  1100	3.429
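
The logged step size is consistent with the bound $\sigma \le 1/(2\|X\|_2^2)$: $1/(2 \times 5.5130^2) \approx 0.01645$. The sketch below shows one way such a step size could be obtained by power iteration on $X^T X$. It is a NumPy illustration under our own assumptions (random data, a fixed iteration count, and the hypothetical helper `spectral_norm_power`), not the routine behind `sigma='power'`.

```python
import numpy as np

def spectral_norm_power(X, n_iter=1000, seed=0):
    """Estimate the largest singular value of X by power iteration on X^T X."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = X.T @ (X @ v)          # one multiplication by X^T X
        v /= np.linalg.norm(v)     # renormalize to avoid overflow
    return float(np.linalg.norm(X @ v))

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 12))
s_max = spectral_norm_power(X)
sigma = 1.0 / (2.0 * s_max ** 2)   # step size at the convergence bound

print("max singular value:", s_max)
print("step size:", sigma)
```

Each iteration needs only products with $X$ and $X^T$, which is why this approach fits a setting where $X$ is distributed.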

## Multi-GPU Demonstration

We demonstrate 10,000 x 10,000 examples on 2-8 GPUs on our server. The scripts in the `examples` directory are designed to automatically select the GPU device.

## Multi-node

The data structure can also be utilized on multi-node clusters. The structure was used for the analysis of 200,000 x 500,000 UK Biobank data.

## Future Direction

An MPI-only, lightweight, and more flexible version in Julia is in preparation. CUDA-aware MPI support for the central MPI interface [MPI.jl](https://github.com/JuliaParallel/MPI.jl) was added in the process.