# Private Set Intersection and Moose

This notebook illustrates how the PSI library [Private-ID](https://github.com/facebookresearch/Private-ID) can be used to find the intersection of two datasets owned by two different parties, based on a common key, and then to compute a metric on the intersected datasets using [Moose](https://github.com/tf-encrypted/moose).

For this example, we use the [Private-ID protocol](https://github.com/facebookresearch/Private-ID#private-id-1), which maps the User IDs from both parties to a single ID spine, so that the same User ID maps to the same key on both sides. The easiest way to understand how this applies to our use case is to scroll down this notebook and look at the output of the protocol for each party.

As described in [this blog post](https://engineering.fb.com/2020/07/10/open-source/private-matching/), the protocol could be used, for example, to count the attendees of an event. Say Alice received RSVPs from 5 participants and Bob received RSVPs from 10. They want to count the unique participants without sharing email addresses with each other. After running the Private-ID protocol, Alice and Bob get exactly the same list of keys as output. If there are 12 distinct email addresses among the 15 responses (5 from Alice + 10 from Bob), each of them sees 12 distinct keys in their output CSV file. The difference is that on Alice's side 5 keys are mapped to the 5 email addresses she received, and the 7 remaining keys, for which she has no email address, are mapped to NA. Bob has 10 keys mapped to an email address and 2 keys mapped to NA.
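The counts in the RSVP example can be checked with a small mock. This is only a sketch: `mock_private_id` is a hypothetical helper that hashes the union of emails in the clear, which the real Private-ID protocol never does. It only mimics the shape of the output, where both parties see the same keys and each sees only its own identifiers.

```python
import hashlib

def mock_private_id(own_ids, all_ids):
    """Return (key, id-or-None) rows for every ID in the union, sorted by key.

    NOTE: hashing the union in the clear is a stand-in for illustration only.
    In the real protocol neither party ever sees the other party's IDs.
    """
    rows = []
    for uid in all_ids:
        key = hashlib.sha256(uid.encode()).hexdigest().upper()
        rows.append((key, uid if uid in own_ids else None))
    # Sorting by key makes both parties' outputs line up row by row.
    return sorted(rows, key=lambda row: row[0])

# Made-up RSVP lists: 5 for Alice, 10 for Bob, 3 of them shared.
alice_emails = {f"user{i}@example.com" for i in range(5)}
bob_emails = {f"user{i}@example.com" for i in range(2, 12)}
union = alice_emails | bob_emails

alice_out = mock_private_id(alice_emails, union)
bob_out = mock_private_id(bob_emails, union)

n_keys = len(alice_out)
alice_missing = sum(1 for _, uid in alice_out if uid is None)
bob_missing = sum(1 for _, uid in bob_out if uid is None)
print(n_keys, alice_missing, bob_missing)  # 12 7 2
```

The number of keys is the number of unique attendees, and each party can count its own NA rows without learning who the other party's attendees are.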

The Private-ID outputs can also be used to intersect two datasets. After running Private-ID on the User IDs of Alice and Bob, we end up with two CSV files with the same number of entries, ordered the same way by Key. In theory, Alice could share her Keys in plaintext with Bob, and Bob could compute the intersection of the Keys and return the common Keys to Alice. However, Keys are strings, and the Moose runtime works best with numeric or boolean types, so we take another approach. In each dataset, we create a flag, for example `user_id_available`, that indicates whether a User ID corresponds to the Key. If we compute a logical AND between the `user_id_available` columns of Alice and Bob, we get 1 where the Key is present in both datasets and 0 otherwise. We can then filter both datasets to the rows where the logical AND returned 1, so that Alice and Bob hold the same entries, with matching Keys, ordered the same way.
The benefit of this approach is that Alice and Bob learn the Keys they have in common, but they don't learn the identities associated with the Keys they don't have in common. The downside is that they reveal the common Keys/User IDs to each other.
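In plain pandas and NumPy, the flag-and-filter idea can be sketched as follows. The keys and User IDs here are made-up placeholders, and everything runs in the clear on one machine, so this only illustrates the logic, not the privacy properties.

```python
import numpy as np
import pandas as pd

# Hypothetical Private-ID outputs: same keys, same order,
# NaN where the party has no User ID for that key.
keys = ["K0", "K1", "K2", "K3"]
alice = pd.DataFrame({"key": keys, "user_id": [9.0, np.nan, 4.0, 8.0]})
bob = pd.DataFrame({"key": keys, "user_id": [9.0, 1.0, np.nan, 8.0]})

# Flag the rows for which each party actually holds a User ID.
alice["user_id_available"] = alice["user_id"].notna()
bob["user_id_available"] = bob["user_id"].notna()

# A key is in the intersection only if both flags are set.
in_both = np.logical_and(
    alice["user_id_available"].to_numpy(), bob["user_id_available"].to_numpy()
)

# Filtering with the same mask keeps both datasets aligned row by row.
alice_common = alice[in_both].reset_index(drop=True)
bob_common = bob[in_both].reset_index(drop=True)
print(alice_common["key"].tolist())  # ['K0', 'K3']
```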

In the rest of the notebook we will mock up this approach.

Note that we could instead use the [PS3I protocol](https://github.com/facebookresearch/Private-ID#ps3i), which computes the intersection between the two datasets and then returns additive shares of the features contained in each dataset. We could then load these additive shares in Moose, convert them to replicated shares, and compute on them. The benefit is that neither Alice nor Bob learns which records are in common; the downside is that it might be more work to implement in Moose.
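For readers unfamiliar with additive shares, here is a generic two-party sketch. The ring size `MODULUS` and the helper names are assumptions made for illustration. This is not the PS3I output format, and the conversion to replicated shares is not shown.

```python
import secrets

MODULUS = 2**64  # assumed ring size for this sketch

def share(value):
    """Split an integer into two additive shares that sum to value mod MODULUS."""
    share_a = secrets.randbelow(MODULUS)
    share_b = (value - share_a) % MODULUS
    return share_a, share_b

def reconstruct(share_a, share_b):
    """Recombine two additive shares into the original value."""
    return (share_a + share_b) % MODULUS

# Each party holds one share per feature value; one share alone looks random.
features = [12, 7, 30]
shares = [share(v) for v in features]
alice_shares = [a for a, _ in shares]
bob_shares = [b for _, b in shares]

# Additions can be done locally: the sum of shares is a share of the sum.
sum_a = sum(alice_shares) % MODULUS
sum_b = sum(bob_shares) % MODULUS
total = reconstruct(sum_a, sum_b)
print(total)  # 49
```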





## Generate fake datasets

For each of Alice and Bob, we generate a dataset with a User ID and one feature (there could be multiple features). Some records share a User ID across the two datasets. We want to identify the intersection of the two datasets based on User ID, then compute a metric over the features owned by Alice and Bob respectively.

In [1]:
import pathlib
import numpy as np
import pandas as pd

np.random.seed(1234)

_DATA_DIR = pathlib.Path("./data")
_DATA_DIR.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists

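def generate_mock_dataset(sample_size, n_features, sample_frac):
    """Return a random subset of rows with a user_id and n_features Gaussian features."""
    x = np.random.randn(sample_size, n_features)
    user_ids = np.arange(sample_size)
    df = pd.DataFrame(x, columns=[f"x_{i}" for i in range(n_features)])
    df.insert(loc=0, column='user_id', value=user_ids)
    # Keep a random fraction of the rows so the two parties only partially overlap
    df = df.sample(frac=sample_frac)
    return df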

alice_df = generate_mock_dataset(10, 1, 0.6)
bob_df = generate_mock_dataset(10, 1, 0.7)


In [2]:
alice_df

|   | user_id | x_0 |
|---|---------|-----|
| 9 | 9 | -2.242685 |
| 7 | 7 | -0.636524 |
| 4 | 4 | -0.720589 |
| 3 | 3 | -0.312652 |
| 8 | 8 | 0.015696 |
| 2 | 2 | 1.432707 |


In [3]:
bob_df

|   | user_id | x_0 |
|---|---------|-----|
| 6 | 6 | 0.193421 |
| 0 | 0 | 0.405453 |
| 5 | 5 | -0.655969 |
| 1 | 1 | 0.289092 |
| 8 | 8 | 1.318152 |
| 9 | 9 | -0.469305 |
| 2 | 2 | 1.321158 |


### Run Private-ID

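We then run [Private-ID](https://github.com/facebookresearch/Private-ID#private-id-1) on Alice's and Bob's User IDs. Alice runs the server and Bob runs the client; the input files `alice_id.csv` and `bob_id.csv` are written by the cell that follows the commands.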

```
# Alice: start the server
env RUST_LOG=info cargo run --bin private-id-server -- \
    --host 0.0.0.0:10009 \
    --input data/alice_id.csv \
    --output data/alice_keys.csv \
    --no-tls

# Bob: in a separate terminal, run the client against Alice's server
env RUST_LOG=info cargo run --bin private-id-client -- \
    --company localhost:10009 \
    --input data/bob_id.csv \
    --output data/bob_keys.csv \
    --no-tls
```

In [4]:
alice_id = alice_df["user_id"]
alice_id.to_csv(_DATA_DIR / "alice_id.csv", index=False, header=False)

bob_id = bob_df["user_id"]
bob_id.to_csv(_DATA_DIR / "bob_id.csv", index=False, header=False)

The outputs from Private-ID are shown below. As you can see, both files contain the same list of keys in the same order, but each party has a User ID mapped only to the keys of its own records.

In [5]:
alice_keys = pd.read_csv(_DATA_DIR / "alice_keys.csv", names=["key", "user_id"])

alice_keys

|   | key | user_id |
|---|-----|---------|
| 0 | 5ECA1F3B5D532833954C237B5CF11BACB329719CC85665... | 9.0 |
| 1 | 70147DEE68416AB8F3BCF7F6AA60321C99F52443DBF85F... | NaN |
| 2 | 701731B8EFAF6F1C10969548518075B24253EBC146B9CE... | 4.0 |
| 3 | 80A9169C82C7BDADD7A8D3D74D12460247E38B5BCB6537... | NaN |
| 4 | 848DD114E23192EC84ECAB825B441336A3A495B14BCF1C... | 8.0 |
| 5 | 88FCC7398F8F45FA7FC9DC9E6601DD6B426A171AEB4AF5... | NaN |
| 6 | C47DA34C4DDF7B8EFE3364D17783DBC899F5551122FD0A... | 2.0 |
| 7 | E2837EFB36B8CA2889C57BFDA9522CD921ADE1375D17DA... | 3.0 |
| 8 | F07853F41ED2894DC57ABD239F663A9EB643A84F6B9191... | 7.0 |
| 9 | F0D77D7FDB658EF3FE9BD6E223C161F87D19D902EA5C18... | NaN |


In [6]:
bob_keys = pd.read_csv(_DATA_DIR / "bob_keys.csv", names=["key", "user_id"])

bob_keys

|   | key | user_id |
|---|-----|---------|
| 0 | 5ECA1F3B5D532833954C237B5CF11BACB329719CC85665... | 9.0 |
| 1 | 70147DEE68416AB8F3BCF7F6AA60321C99F52443DBF85F... | 1.0 |
| 2 | 701731B8EFAF6F1C10969548518075B24253EBC146B9CE... | NaN |
| 3 | 80A9169C82C7BDADD7A8D3D74D12460247E38B5BCB6537... | 0.0 |
| 4 | 848DD114E23192EC84ECAB825B441336A3A495B14BCF1C... | 8.0 |
| 5 | 88FCC7398F8F45FA7FC9DC9E6601DD6B426A171AEB4AF5... | 6.0 |
| 6 | C47DA34C4DDF7B8EFE3364D17783DBC899F5551122FD0A... | 2.0 |
| 7 | E2837EFB36B8CA2889C57BFDA9522CD921ADE1375D17DA... | NaN |
| 8 | F07853F41ED2894DC57ABD239F663A9EB643A84F6B9191... | NaN |
| 9 | F0D77D7FDB658EF3FE9BD6E223C161F87D19D902EA5C18... | 5.0 |
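The rest of the notebook relies on both files listing the same keys in the same order, so it can be worth checking this before going further. Below is a minimal sketch of such a check. The CSV strings are hypothetical, shortened stand-ins for the real output files, and `read_keys` is a helper introduced here for illustration.

```python
import io
import pandas as pd

# Hypothetical, shortened stand-ins for the two Private-ID output files.
alice_demo_csv = "5ECA1F3B,9\n70147DEE,\n701731B8,4\n"
bob_demo_csv = "5ECA1F3B,9\n70147DEE,1\n701731B8,\n"

def read_keys(csv_text):
    """Read a headerless key,user_id CSV, keeping the keys as strings."""
    return pd.read_csv(
        io.StringIO(csv_text), names=["key", "user_id"], dtype={"key": str}
    )

alice_demo = read_keys(alice_demo_csv)
bob_demo = read_keys(bob_demo_csv)

# Row-wise logic is only valid if both parties hold the same keys in the same order.
keys_aligned = alice_demo["key"].equals(bob_demo["key"])

# Rows without a User ID are keys that belong only to the other party.
alice_unmatched = int(alice_demo["user_id"].isna().sum())
bob_unmatched = int(bob_demo["user_id"].isna().sum())
print(keys_aligned, alice_unmatched, bob_unmatched)
```

In a real two-party setting each party only sees its own file, so a check like this would have to be run on each side against something both parties agree on, such as the number of rows.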


We then merge the Private-ID output with Alice's and Bob's datasets, so that each dataset contains all the keys from Private-ID together with the corresponding User ID and the feature `x_0`.

In [7]:
alice_df = pd.merge(alice_keys, alice_df, on='user_id', how='left')
bob_df = pd.merge(bob_keys, bob_df, on='user_id', how='left')

In [8]:
alice_df

|   | key | user_id | x_0 |
|---|-----|---------|-----|
| 0 | 5ECA1F3B5D532833954C237B5CF11BACB329719CC85665... | 9.0 | -2.242685 |
| 1 | 70147DEE68416AB8F3BCF7F6AA60321C99F52443DBF85F... | NaN | NaN |
| 2 | 701731B8EFAF6F1C10969548518075B24253EBC146B9CE... | 4.0 | -0.720589 |
| 3 | 80A9169C82C7BDADD7A8D3D74D12460247E38B5BCB6537... | NaN | NaN |
| 4 | 848DD114E23192EC84ECAB825B441336A3A495B14BCF1C... | 8.0 | 0.015696 |
| 5 | 88FCC7398F8F45FA7FC9DC9E6601DD6B426A171AEB4AF5... | NaN | NaN |
| 6 | C47DA34C4DDF7B8EFE3364D17783DBC899F5551122FD0A... | 2.0 | 1.432707 |
| 7 | E2837EFB36B8CA2889C57BFDA9522CD921ADE1375D17DA... | 3.0 | -0.312652 |
| 8 | F07853F41ED2894DC57ABD239F663A9EB643A84F6B9191... | 7.0 | -0.636524 |
| 9 | F0D77D7FDB658EF3FE9BD6E223C161F87D19D902EA5C18... | NaN | NaN |


In [9]:
bob_df

|   | key | user_id | x_0 |
|---|-----|---------|-----|
| 0 | 5ECA1F3B5D532833954C237B5CF11BACB329719CC85665... | 9.0 | -0.469305 |
| 1 | 70147DEE68416AB8F3BCF7F6AA60321C99F52443DBF85F... | 1.0 | 0.289092 |
| 2 | 701731B8EFAF6F1C10969548518075B24253EBC146B9CE... | NaN | NaN |
| 3 | 80A9169C82C7BDADD7A8D3D74D12460247E38B5BCB6537... | 0.0 | 0.405453 |
| 4 | 848DD114E23192EC84ECAB825B441336A3A495B14BCF1C... | 8.0 | 1.318152 |
| 5 | 88FCC7398F8F45FA7FC9DC9E6601DD6B426A171AEB4AF5... | 6.0 | 0.193421 |
| 6 | C47DA34C4DDF7B8EFE3364D17783DBC899F5551122FD0A... | 2.0 | 1.321158 |
| 7 | E2837EFB36B8CA2889C57BFDA9522CD921ADE1375D17DA... | NaN | NaN |
| 8 | F07853F41ED2894DC57ABD239F663A9EB643A84F6B9191... | NaN | NaN |
| 9 | F0D77D7FDB658EF3FE9BD6E223C161F87D19D902EA5C18... | 5.0 | -0.655969 |


In each dataset, we create a flag `user_id_available` by checking whether the User ID is missing. For the feature column `x_0`, we replace missing values with 0; the value doesn't matter since these records are filtered out when taking the intersection.

In [10]:
bob_df['user_id_available'] = np.where(bob_df['user_id'].isnull(), 0, 1)
bob_df['x_0'] = bob_df['x_0'].fillna(0)
bob_df


|   | key | user_id | x_0 | user_id_available |
|---|-----|---------|-----|-------------------|
| 0 | 5ECA1F3B5D532833954C237B5CF11BACB329719CC85665... | 9.0 | -0.469305 | 1 |
| 1 | 70147DEE68416AB8F3BCF7F6AA60321C99F52443DBF85F... | 1.0 | 0.289092 | 1 |
| 2 | 701731B8EFAF6F1C10969548518075B24253EBC146B9CE... | NaN | 0.0 | 0 |
| 3 | 80A9169C82C7BDADD7A8D3D74D12460247E38B5BCB6537... | 0.0 | 0.405453 | 1 |
| 4 | 848DD114E23192EC84ECAB825B441336A3A495B14BCF1C... | 8.0 | 1.318152 | 1 |
| 5 | 88FCC7398F8F45FA7FC9DC9E6601DD6B426A171AEB4AF5... | 6.0 | 0.193421 | 1 |
| 6 | C47DA34C4DDF7B8EFE3364D17783DBC899F5551122FD0A... | 2.0 | 1.321158 | 1 |
| 7 | E2837EFB36B8CA2889C57BFDA9522CD921ADE1375D17DA... | NaN | 0.0 | 0 |
| 8 | F07853F41ED2894DC57ABD239F663A9EB643A84F6B9191... | NaN | 0.0 | 0 |
| 9 | F0D77D7FDB658EF3FE9BD6E223C161F87D19D902EA5C18... | 5.0 | -0.655969 | 1 |


In [11]:
alice_df['user_id_available'] = np.where(alice_df['user_id'].isnull(), 0, 1)
alice_df['x_0'] = alice_df['x_0'].fillna(0)
alice_df



|   | key | user_id | x_0 | user_id_available |
|---|-----|---------|-----|-------------------|
| 0 | 5ECA1F3B5D532833954C237B5CF11BACB329719CC85665... | 9.0 | -2.242685 | 1 |
| 1 | 70147DEE68416AB8F3BCF7F6AA60321C99F52443DBF85F... | NaN | 0.0 | 0 |
| 2 | 701731B8EFAF6F1C10969548518075B24253EBC146B9CE... | 4.0 | -0.720589 | 1 |
| 3 | 80A9169C82C7BDADD7A8D3D74D12460247E38B5BCB6537... | NaN | 0.0 | 0 |
| 4 | 848DD114E23192EC84ECAB825B441336A3A495B14BCF1C... | 8.0 | 0.015696 | 1 |
| 5 | 88FCC7398F8F45FA7FC9DC9E6601DD6B426A171AEB4AF5... | NaN | 0.0 | 0 |
| 6 | C47DA34C4DDF7B8EFE3364D17783DBC899F5551122FD0A... | 2.0 | 1.432707 | 1 |
| 7 | E2837EFB36B8CA2889C57BFDA9522CD921ADE1375D17DA... | 3.0 | -0.312652 | 1 |
| 8 | F07853F41ED2894DC57ABD239F663A9EB643A84F6B9191... | 7.0 | -0.636524 | 1 |
| 9 | F0D77D7FDB658EF3FE9BD6E223C161F87D19D902EA5C18... | NaN | 0.0 | 0 |


### Moose Computation

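We feed the Moose computation with Alice's `user_id_available` flag and feature, and with Bob's `user_id_available` flag and feature. In the cells that follow, each party's feature is a mock 10x10 matrix (one row per key, ten columns that the computation treats as spend per category) rather than the single column `x_0`.

We compute the intersection with a logical AND between the `user_id_available` flags of Alice and Bob, filter the datasets to the rows where the logical AND returns 1, then compute a metric privately: for each user in the intersection, whether each category's share of the combined spend exceeds 10%.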

In [63]:
# Mock features for Alice: one row per key, ten non-negative columns
x_a = np.absolute(np.round((np.random.randn(10, 10) * 100), decimals=2))
user_id_available_a = alice_df.user_id_available.values
print(x_a)

# Mock features for Bob, with the same shape
x_b = np.absolute(np.round((np.random.randn(10, 10) * 100), decimals=2))
user_id_available_b = bob_df.user_id_available.values
print(user_id_available_b)
print(user_id_available_a)


[[ 66.34  23.14  31.31  41.06  96.32 121.48 130.11 159.7   73.67  70.59]
 [103.16 109.13  49.56 132.65  84.12   9.24 108.99 206.96  95.82  49.29]
 [ 82.    98.58 160.53 148.79  90.82 242.4   33.41  47.52  20.57  21.28]
 [196.8  207.13 115.54  86.21  82.15  66.8   36.82   2.01  82.32  16.55]
 [ 72.09 129.59  52.78  46.34  15.08 113.93  95.44   5.18  14.7   38.4 ]
 [120.9   21.39  11.4   94.49  18.34 171.43   2.46  45.41  27.23  30.58]
 [ 39.04  42.42  20.85  42.94 135.77  16.56   4.09 183.7  208.03   3.81]
 [ 66.55  20.57  70.59 261.28   2.53  17.83   6.46 120.5  388.09  97.45]
 [ 41.52 175.2   48.5   17.09  74.89  62.98  81.11 213.38  23.85 179.89]
 [160.46  11.87  76.22 183.64  55.9   18.33  98.91  77.5   59.33 120.86]]
[1 1 0 1 1 1 1 0 0 1]
[1 0 1 0 1 0 1 1 1 0]


In [64]:
np.save(_DATA_DIR / "x_a", x_a)
np.save(_DATA_DIR / "x_b", x_b)
np.save(_DATA_DIR / "user_id_available_a", user_id_available_a)
np.save(_DATA_DIR / "user_id_available_b", user_id_available_b)

In [73]:
import pymoose as pm

FIXED = pm.fixed(24, 40)

alice = pm.host_placement(name="alice")
bob = pm.host_placement(name="bob")
carole = pm.host_placement(name="carole")
rep = pm.replicated_placement(name="rep", players=[alice, bob, carole])
mirrored = pm.mirrored_placement(name="mirrored", players=[alice, bob, carole])

@pm.computation
def psi_and_agg():    
    with alice:
        x_a = pm.load("x_a", dtype=pm.float64)
        user_id_available_a = pm.load("user_id_available_a", dtype=pm.bool_)

    with bob:
        x_b = pm.load("x_b", dtype=pm.float64)
        user_id_available_b = pm.load("user_id_available_b", dtype=pm.bool_)

        # Compute the logical AND between Alice's and Bob's user_id_available flags.
        # A result of 1 means the User ID is present in both Alice's and Bob's datasets.
        exist_in_alice_and_bob_bool = pm.logical_and(
            user_id_available_a, user_id_available_b
        )

        # Filter Bob's feature to keep only the records where exist_in_alice_and_bob_bool is 1
        x_b_sub = pm.select(x_b, axis=0, index=exist_in_alice_and_bob_bool)
        x_b_sub = pm.cast(x_b_sub, dtype=FIXED)

    with alice:
        # Filter Alice's feature to keep only records where exist_in_alice_and_bob_bool returned 1
        x_a_sub = pm.select(x_a,  axis=0, index=exist_in_alice_and_bob_bool)
        x_a_sub = pm.cast(x_a_sub, dtype=FIXED)

    with mirrored:
        ten_percent = pm.constant(0.1, dtype=FIXED)
        
    with rep:
        # Aggregation: for each user in the intersection, flag the categories whose
        # share of the combined (Alice + Bob) spend is greater than 10%
        spend_per_category = x_a_sub + x_b_sub        
        spend_per_user = pm.sum(spend_per_category, axis=1)
        category_percent = spend_per_category / pm.expand_dims(spend_per_user, axis=1)
        res = pm.greater(category_percent, ten_percent)

    with alice:
        res = pm.cast(res, dtype=pm.float64)
        res = pm.save("agg_result", res)

    return res

In [74]:
executors_storage = {
    "alice": {"x_a": x_a, "user_id_available_a": user_id_available_a.astype(np.bool_)},
    "bob": {"x_b": x_b, "user_id_available_b": user_id_available_b.astype(np.bool_)}
}

runtime = pm.LocalMooseRuntime(
    identities=["alice", "bob", "carole"],
    storage_mapping=executors_storage,
)

runtime.set_default()

_ = psi_and_agg()

agg_result = runtime.read_value_from_storage("alice", "agg_result")
print("Aggregation result with Moose", agg_result)

Aggregation result with Moose [[1. 0. 0. 0. 0. 1. 1. 1. 1. 0.]
 [0. 1. 0. 1. 0. 1. 1. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 1. 1. 0.]]


In [72]:
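# In plaintext
# Note: this check computes the ratio of the two totals over the intersection,
# which is a different metric from the one in the Moose computation above.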

inner_bool = np.logical_and(user_id_available_a, user_id_available_b)
x_a_sub = x_a[np.where(inner_bool==1)]
x_b_sub = x_b[np.where(inner_bool==1)]

agg = np.divide(np.sum(x_a_sub), np.sum(x_b_sub))
print("Aggregation result in plaintext", agg)

Aggregation result in plaintext 0.8745500855985636
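A plaintext reference that mirrors the steps of `psi_and_agg` could look like the sketch below. The function name `plaintext_psi_and_agg` and the demo arrays are hypothetical. Moose evaluates the same steps in fixed-point arithmetic, so entries very close to the threshold may differ from this float64 version.

```python
import numpy as np

def plaintext_psi_and_agg(x_a, x_b, flags_a, flags_b, threshold=0.1):
    """NumPy version of the Moose steps: intersect, sum, normalize, compare."""
    # Rows present in both datasets
    in_both = np.logical_and(flags_a, flags_b)
    # Combined spend per category for the users in the intersection
    spend_per_category = x_a[in_both] + x_b[in_both]
    # Total spend per user, kept as a column so it broadcasts over categories
    spend_per_user = spend_per_category.sum(axis=1, keepdims=True)
    category_percent = spend_per_category / spend_per_user
    # 1.0 where a category exceeds the threshold share, 0.0 otherwise
    return (category_percent > threshold).astype(np.float64)

# Tiny made-up example: three users, three categories, rows 0 and 2 in both datasets.
x_a_demo = np.array([[1.0, 2.0, 7.0], [5.0, 5.0, 5.0], [9.0, 0.5, 2.0]])
x_b_demo = np.array([[0.0, 2.0, 8.0], [1.0, 1.0, 1.0], [9.0, 0.5, 1.0]])
flags_a_demo = np.array([1, 0, 1])
flags_b_demo = np.array([1, 1, 1])

demo_result = plaintext_psi_and_agg(x_a_demo, x_b_demo, flags_a_demo, flags_b_demo)
print(demo_result)
# [[0. 1. 1.]
#  [1. 0. 1.]]
```

Calling the same function on `x_a`, `x_b`, `user_id_available_a` and `user_id_available_b` from the cells above should give a matrix with the same shape as the Moose result, which can then be compared entry by entry.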
