# Distributed Multi-Party Linear Regression

Source: https://arxiv.org/pdf/1901.09531.pdf (section 2)

## General imports

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import t
from statsmodels.api import OLS

## Creating simulated data

In [2]:
# Public data
K = 10

# Alice's private data
N1 = 1000
y1 = np.random.randn(N1)
C1 = np.random.randn(N1, K)

# Bob's private data
N2 = 2000
y2 = np.random.randn(N2)
C2 = np.random.randn(N2, K)

# Carla's private data
N3 = 1500
y3 = np.random.randn(N3)
C3 = np.random.randn(N3, K)

## Linear regression

### PRIVATE COMPUTATION - Compression

In [3]:
# Alice computes and secret-shares her summary statistics
yy1 = y1 @ y1        # scalar: sum of squared responses
Cty1 = C1.T @ y1     # vector of length K
CtC1 = C1.T @ C1     # K x K matrix

# Bob computes and secret-shares his summary statistics
yy2 = y2 @ y2
Cty2 = C2.T @ y2
CtC2 = C2.T @ C2

# Carla computes and secret-shares her summary statistics
yy3 = y3 @ y3
Cty3 = C3.T @ y3
CtC3 = C3.T @ C3
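
The compression step works because the three statistics are additive across parties: summing each party's values gives the same result as computing them on the pooled data. The following is a minimal sketch of that property. The helper `compress`, the two parties, and their sample sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def compress(y, C):
    """Return the summary statistics (y'y, C'y, C'C) for one party's data."""
    return y @ y, C.T @ y, C.T @ C

rng = np.random.default_rng(0)
K = 3

# Two hypothetical parties with different sample sizes
y_a, C_a = rng.standard_normal(50), rng.standard_normal((50, K))
y_b, C_b = rng.standard_normal(80), rng.standard_normal((80, K))

yy_a, Cty_a, CtC_a = compress(y_a, C_a)
yy_b, Cty_b, CtC_b = compress(y_b, C_b)

# The same statistics computed on the pooled data
y_all = np.concatenate([y_a, y_b])
C_all = np.concatenate([C_a, C_b])
yy_all, Cty_all, CtC_all = compress(y_all, C_all)

# Per-party statistics sum to the pooled statistics (up to rounding)
additive = bool(
    np.isclose(yy_a + yy_b, yy_all)
    and np.allclose(Cty_a + Cty_b, Cty_all)
    and np.allclose(CtC_a + CtC_b, CtC_all)
)
print(additive)
```

Each party's output has a size that depends only on K (a scalar, a length-K vector, and a K x K matrix), regardless of how many rows it holds.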

### SECURE MULTI-PARTY COMPUTATION - Combine

In principle, the computations below can be carried out with secure multi-party computation (SMPC), which guarantees that no information about the individual datasets is leaked.

Even without SMPC, there is still a moderate level of protection: each party shares only compressed statistics, and it is very difficult to trace these back to the original records.
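
To give a flavour of how the summation could be protected, here is a minimal sketch of additive secret sharing for a single scalar per party. It is an illustration under stated assumptions, not the protocol from the source paper: the modulus, the fixed-point scale, and the helper names (`encode`, `decode`, `share`, `reconstruct`) are all hypothetical choices. It covers only the addition of the parties' statistics and does not address the later solve and inversion steps.

```python
import secrets

MODULUS = 2**127 - 1   # assumed modulus, large enough for the encoded values
SCALE = 10**6          # assumed fixed-point precision (6 decimal places)

def encode(x):
    """Map a real number to a fixed-point residue modulo MODULUS."""
    return round(x * SCALE) % MODULUS

def decode(v):
    """Map a residue back to a signed real number."""
    if v > MODULUS // 2:
        v -= MODULUS
    return v / SCALE

def share(x, n_parties):
    """Split x into n additive shares that sum to encode(x) modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((encode(x) - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Recombine additive shares into the original real number."""
    return decode(sum(shares) % MODULUS)

# Hypothetical private scalars, one per party
private_values = [12.5, -3.25, 7.0]
n = len(private_values)

# Party i splits its value and sends the j-th share to party j
all_shares = [share(v, n) for v in private_values]

# Each party adds up the shares it received; only these partial sums are published
partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]

total = reconstruct(partial_sums)
print(total)  # 16.25
```

Only the total is revealed; any single share or partial sum on its own is a uniformly random residue. The same idea applies element-wise to vectors and matrices.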

**From this point on, the cost of the computation depends only on K, not on the sample sizes.**

In [4]:
D = N1 + N2 + N3 - K

yy = yy1 + yy2 + yy3
Cty = Cty1 + Cty2 + Cty3
CtC = CtC1 + CtC2 + CtC3
invCtC = np.linalg.inv(CtC)

**Computing the coefficients and squared standard errors:**

In [5]:
beta = np.linalg.solve(CtC, Cty)
sigma_sq = np.diag(invCtC) * (yy - beta @ CtC @ beta) / D
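
For reference, the formulas behind this cell can be written out as follows. This is a sketch of the standard OLS derivation, using the same symbols as the code ($C$ for the pooled covariates, $y$ for the pooled response, $D$ for the residual degrees of freedom), and is not copied from the source paper.

$$
\hat{\beta} = (C^\top C)^{-1} C^\top y
$$

$$
\mathrm{RSS} = (y - C\hat{\beta})^\top (y - C\hat{\beta})
= y^\top y - 2\,\hat{\beta}^\top C^\top y + \hat{\beta}^\top C^\top C\,\hat{\beta}
= y^\top y - \hat{\beta}^\top C^\top C\,\hat{\beta}
$$

The last equality uses the normal equations $C^\top y = C^\top C\,\hat{\beta}$. The squared standard error of the $k$-th coefficient is then

$$
\hat{\sigma}_k^2 = \left[(C^\top C)^{-1}\right]_{kk} \, \frac{\mathrm{RSS}}{D},
\qquad D = N_1 + N_2 + N_3 - K .
$$

Every quantity on the right-hand side is available from `yy`, `Cty`, `CtC`, and the total sample size, so the raw data is never needed.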

**With these we can compute the t-statistics and p-values:**

In [6]:
sigma = np.sqrt(sigma_sq)
tstat = beta / sigma
pval = 2 * t.cdf(-abs(tstat), D)
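
The combine step can also be packaged as a single function that takes only the aggregated statistics. The sketch below is one possible arrangement; the function name `regression_from_stats` and its signature are hypothetical. It uses `t.sf` (the survival function) for the two-sided p-value, which is equivalent to `2 * t.cdf(-abs(tstat), D)` by symmetry of the t distribution.

```python
import numpy as np
from scipy.stats import t

def regression_from_stats(yy, Cty, CtC, n_obs):
    """OLS estimates computed from aggregated statistics only."""
    K = CtC.shape[0]
    dof = n_obs - K                          # residual degrees of freedom
    beta = np.linalg.solve(CtC, Cty)         # coefficients
    rss = yy - beta @ Cty                    # equals yy - beta' CtC beta
    sigma = np.sqrt(np.diag(np.linalg.inv(CtC)) * rss / dof)
    tstat = beta / sigma
    pval = 2 * t.sf(np.abs(tstat), dof)      # two-sided p-value
    return beta, sigma, tstat, pval

# Small simulated example (hypothetical sizes)
rng = np.random.default_rng(1)
y = rng.standard_normal(200)
C = rng.standard_normal((200, 3))

beta, sigma, tstat, pval = regression_from_stats(y @ y, C.T @ y, C.T @ C, n_obs=200)
print(pval.shape)
```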

### VERIFY correctness for the columns of C:

**Storing the first set of results in a pandas DataFrame:**

In [7]:
df = pd.DataFrame({
    'beta': beta,
    'sigma': sigma,
    'tstat': tstat,
    'pval': pval
})

**Computing the same results with the OLS model from the statsmodels API:**

In [8]:
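# Pool the three private datasets (only possible here because the data is simulated)
y = np.concatenate([y1, y2, y3])
C = np.concatenate([C1, C2, C3])

# Fit without an intercept, matching the computation above
model = OLS(y, C, hasconst=False)
tmp_res = model.fit()
# Columns: coefficients, standard errors, t-statistics, p-values
res = np.array([tmp_res.params, tmp_res.bse, tmp_res.tvalues, tmp_res.pvalues]).T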

**Storing these results in a second DataFrame:**

In [9]:
df2 = pd.DataFrame({
    'beta': res[:, 0],
    'sigma': res[:, 1],
    'tstat': res[:, 2],
    'pval': res[:, 3]
})

**Finally, comparing the results of both methods:**

In [10]:
# Round both result sets to 10 decimal places before comparing
df = df.round(10)
df2 = df2.round(10)
np.array(df == df2).all()  # Returns True

True
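
Comparing rounded values for exact equality can occasionally fail when two nearly identical numbers fall on opposite sides of a rounding boundary. A tolerance-based check with `np.allclose` avoids that. The sketch below is self-contained and uses `np.linalg.lstsq` as the reference instead of statsmodels; the data sizes and tolerances are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 4
y = rng.standard_normal(300)
C = rng.standard_normal((300, K))

# Coefficients from the compressed statistics
beta_compressed = np.linalg.solve(C.T @ C, C.T @ y)

# Reference coefficients from a direct least-squares fit
beta_reference, *_ = np.linalg.lstsq(C, y, rcond=None)

# Compare within a tolerance instead of rounding and testing exact equality
matches = bool(np.allclose(beta_compressed, beta_reference, rtol=1e-8, atol=1e-10))
print(matches)
```

The same pattern applies to the standard errors, t-statistics, and p-values held in the two DataFrames.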