## CellCnn [1] data generation

source: https://github.com/eiriniar/CellCnn

""" Copyright 2016-2017 ETH Zurich, Eirini Arvaniti and Manfred Claassen.

This module contains data preprocessing/distribution functions.

"""
The code is slightly modified from the original implementation to make it compatible with decentralized settings.



In this example, we preprocess and distribute a mass cytometry Acute Myeloid Leukaemia (AML) dataset [2] for a 3-class classification problem: healthy, cytogenetically normal (CN), and core-binding factor translocation (CBF). For each cell, the dataset includes mass cytometry measurements of 16 markers.


As in the original CellCnn [1] analysis, we use the AML samples with at least 10% CD34+ blast cells for which additional cytogenetic information is available.

To run this example:

- download the AML cell dataset from https://imsb.ethz.ch/research/claassen/Software/cellcnn.html, under the ALL dataset zip folder
- uncompress it and place it in the data/cellCNN/ folder

Data distribution: We fix the test set for all experimental settings. The training dataset is then generated by distributing different donors to each institution, depending on the number of institutions.

[1] E. Arvaniti and M. Claassen. Sensitive detection of rare disease-associated cell subsets via representation learning. Nat Commun, 8:1–10, 2017.
[2] J. Levine, E. Simonds, S. Bendall, K. Davis, E.-A. Amir, M. Tadmor, O. Litvin, H. Fienberg, A. Jager, E. Zunder, R. Finck, A. Gedman, I. Radtke, J. Downing, D. Pe’er, and G. Nolan. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell, 162, 2015.

In [1]:
import os, sys, errno, glob
import pickle
from pathlib import Path

import numpy as np
import pandas as pd
import sklearn.utils as sku
%pylab inline

# Make the notebook directory importable before loading the local helper module.
d = Path().resolve()
sys.path.append(str(d))  # sys.path entries must be strings, not Path objects

import cellCNN_utils
from cellCNN_utils import loadFCS, ftrans, mkdir_p, get_items, generate_data, generate_normalized_data

rand_seed = 12345
np.random.seed(rand_seed)
stim = 'AML'

# WDIR = os.path.join(cellCnn.__path__[0], 'examples')
OUTDIR = os.path.join(d, 'output_%s' % stim)
mkdir_p(OUTDIR)

LOOKUP_PATH = os.path.join(d, 'AML.pkl')

# encoding='latin1' lets Python 3 read pickles written by Python 2.
with open(LOOKUP_PATH, 'rb') as f:
    lookup = pickle.load(f, encoding='latin1')



Populating the interactive namespace from numpy and matplotlib


In [2]:
# Patients SJ10, SJ12, SJ13 were characterized as CN.
# Patients SJ1, SJ2, SJ3, SJ4, SJ5 presented CBF.
labels = ['CD19', 'CD11b', 'CD34', 'CD45', 'CD123', 'CD33', 'CD47', 'CD7', 'CD15', 'CD44', 'CD38', 'CD3', 'CD117', 'HLA-DR', 'CD64', 'CD41']
trainCN, trainCBF, trainHealthy = [], [], []
testCN, testCBF, testHealthy = [], [], []
for key, val in lookup.items():
    if key == 'healthy_BM':
        # Healthy bone marrow: samples 0-2 go to training, samples 3-4 to testing.
        trainHealthy.append(val[0][1])
        trainHealthy.append(val[1][1])
        trainHealthy.append(val[2][1])
        testHealthy.append(val[3][1])
        testHealthy.append(val[4][1])

    if key in ('SJ10', 'SJ12'):
        trainCN.append(val)
    if key == 'SJ13':
        testCN.append(val)
    if key in ('SJ1', 'SJ2'):
        trainCBF.append(val)
    if key in ('SJ3', 'SJ4', 'SJ5'):
        testCBF.append(val)

# Test set: 1 CN sample, 3 CBF samples, 2 healthy samples (labels in the same order).
test_samples = testCN + testCBF + testHealthy
test_phenotypes = [1, 2, 2, 2, 0, 0]

train_phenotypes = [0, 1, 2]  # healthy, CN, CBF

# Pool the cells of all training donors per class and shuffle them.
x_trainHealthy = sku.shuffle(np.vstack(trainHealthy))
x_trainCN = sku.shuffle(np.vstack(trainCN))
x_trainCBF = sku.shuffle(np.vstack(trainCBF))

train_samples = [x_trainHealthy, x_trainCN, x_trainCBF]
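The hard-coded `test_phenotypes` list has to stay aligned with the order `testCN + testCBF + testHealthy`. One way to make that alignment explicit is to derive the labels from the list lengths. The sketch below uses placeholder strings instead of the real cell arrays, and the names `toy_cn`, `toy_cbf`, `toy_healthy` and `toy_labels` are hypothetical.

```python
# Placeholder entries standing in for the per-donor cell arrays (hypothetical).
toy_cn = ["SJ13"]
toy_cbf = ["SJ3", "SJ4", "SJ5"]
toy_healthy = ["healthy_3", "healthy_4"]

# Label encoding used in this notebook: 0 = healthy, 1 = CN, 2 = CBF.
toy_samples = toy_cn + toy_cbf + toy_healthy
toy_labels = [1] * len(toy_cn) + [2] * len(toy_cbf) + [0] * len(toy_healthy)

print(toy_labels)  # [1, 2, 2, 2, 0, 0]
```

Deriving the labels this way keeps them consistent if donors are moved between the training and test sets.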


### Generate original data (with transform)
In the following:
- We generate training data with $ncell=200$ cells per sample and $nsubset=500$ samples per class.
- We generate the test data with $ncell=200$ cells per sample from the test indices, called X_test.
- We generate another, per-individual test set from the test indices, called X_test_all. For phenotype prediction it uses the largest number of cells that every test sample can provide, that is, the size of the smallest test sample.

Processed data is placed under the originalAML/ folder.

The script prints this number of cells for the current example (12440 for this dataset), which is then used as a parameter in the golang protocol.
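The multi-cell inputs described above can be pictured as random draws of `ncell` cells from each class's pooled cell matrix. The sketch below is a minimal NumPy illustration on synthetic data, not the `generate_data` implementation from `cellCNN_utils`. The function name `sample_multicell_inputs`, the sampling with replacement, and the toy sample sizes are assumptions made for illustration.

```python
import numpy as np

def sample_multicell_inputs(samples, phenotypes, ncell=200, nsubset=500, seed=0):
    """Draw nsubset random subsets of ncell cells from each class matrix.

    samples: list of (n_cells, n_markers) arrays, one per class (assumed layout).
    Returns X with shape (nsubset * n_classes, n_markers, ncell) and labels y.
    """
    rng = np.random.default_rng(seed)
    X, y = [], []
    for cells, label in zip(samples, phenotypes):
        for _ in range(nsubset):
            # Sample cell indices with replacement (an assumption of this sketch).
            idx = rng.integers(0, cells.shape[0], size=ncell)
            # Transpose so that markers come first: (n_markers, ncell).
            X.append(cells[idx].T)
            y.append(label)
    return np.stack(X), np.array(y)

# Synthetic stand-in for three classes with 16 markers each.
toy_rng = np.random.default_rng(1)
toy_samples = [toy_rng.normal(size=(n, 16)) for n in (1000, 800, 1200)]
X_toy, y_toy = sample_multicell_inputs(toy_samples, [0, 1, 2], ncell=200, nsubset=5)
print(X_toy.shape, y_toy.shape)  # (15, 16, 200) (15,)
```

With `nsubset=5` and three classes the sketch yields 15 multi-cell inputs, each holding 16 markers for 200 cells.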

In [3]:
from sklearn.utils import shuffle

scaler, x_tr, y_tr, x_test, y_test = generate_data(
    train_samples, train_phenotypes, 'originalAML/',
    valid_samples=test_samples, valid_phenotypes=test_phenotypes,
    ncell=200, nsubset=500, verbose=0)

# Also generate the per-individual test set, using as many cells per sample
# as the smallest test sample provides.
def generate_for_pheno_prediction(new_samples, phenotypes, scaler):
    ncell_per_sample = np.min([x.shape[0] for x in new_samples])
    print(f"Predictions based on multi-cell inputs containing {ncell_per_sample} cells.")
    nmark = new_samples[0].shape[1]  # number of markers (columns)
    # z-transform the new samples if we did that for the training samples
    if scaler is not None:
        new_samples = [scaler.transform(x) for x in new_samples]
    # shuffle each sample, keep the first ncell_per_sample cells, add a sample axis
    new_samples = [shuffle(x)[:ncell_per_sample].reshape(1, ncell_per_sample, nmark)
                   for x in new_samples]
    data_test = np.vstack(new_samples)
    mkdir_p('originalAML/X_test_all/')
    # save each sample as a (markers x cells) text file
    for i in range(len(data_test)):
        np.savetxt('originalAML/X_test_all/' + str(i) + '.txt', np.transpose(data_test[i]))
    np.savetxt('originalAML/y_test_all.txt', phenotypes)
    return data_test, phenotypes

data_test, phenotypes = generate_for_pheno_prediction(test_samples, test_phenotypes, scaler)
print(np.shape(data_test))
print(np.shape(x_test))
print(np.shape(y_test))

scale
Generating multi-cell inputs...
Done.
Predictions based on multi-cell inputs containing 12440 cells.
(6, 12440, 16)
(1498, 16, 200)
(1498,)
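
The per-individual array printed above has one block per test sample because every sample is truncated to the size of the smallest one before stacking. A minimal sketch of that truncation step follows, on synthetic data with hypothetical sample sizes; the names `toy_test` and `stacked` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three synthetic test samples with different numbers of cells and 16 markers each.
toy_test = [rng.normal(size=(n, 16)) for n in (50, 30, 40)]

# The smallest sample determines how many cells every sample contributes.
ncell_per_sample = min(x.shape[0] for x in toy_test)

# Shuffle the cells (rows) of each sample, keep the first ncell_per_sample, then stack.
stacked = np.stack([rng.permutation(x)[:ncell_per_sample] for x in toy_test])
print(stacked.shape)  # (3, 30, 16)

# Each sample is saved with markers as rows, hence the transpose before writing.
print(stacked[0].T.shape)  # (16, 30)
```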


### Generate data split between $nhosts$ parties
In the following:
- We generate training data with $ncell=200$ cells per sample and $nsubset=700$ samples per class, per party.
- The example below distributes the training donor indices among the parties ($nhosts=2$ in the code).

Processed data is placed under splitAML/host<i>/ for party i (for example, splitAML/host0/).
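
To see how donor indices are divided among parties, the following sketch applies `np.array_split` to the three donor index lists used in this notebook. The helper name `split_donors` is hypothetical. When there are more parties than donors of a class, some parties receive an empty list for that class.

```python
import numpy as np

def split_donors(indices, nhosts):
    """Split a list of donor indices into nhosts nearly equal chunks."""
    chunks = np.array_split(np.array(indices, dtype=int), nhosts)
    return [chunk.tolist() for chunk in chunks]

healthy, cn, cbf = [0, 1, 2], [0, 1], [0, 1]

two = [split_donors(group, 2) for group in (healthy, cn, cbf)]
three = [split_donors(group, 3) for group in (healthy, cn, cbf)]

print(two)    # [[[0, 1], [2]], [[0], [1]], [[0], [1]]]
print(three)  # [[[0], [1], [2]], [[0], [1], []], [[0], [1], []]]
```

With three parties, the third party holds no CN or CBF donor, so its local training set contains only the healthy class.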


In [5]:
np.random.seed(12345)
# Distribute the training donors among the hosts (the test set stays fixed).
nhosts = 2  # note: the output recorded below was produced with nhosts = 3
cofactor = 5

# distribute train indices balanced among n hosts:
numH = [0, 1, 2]  # training set healthy sample indices
numCN = [0, 1]    # training set CN indices
numCBF = [0, 1]   # training set CBF indices
# Reverse the order of the healthy chunks. List reversal also works when the
# chunks have unequal lengths, which np.flip cannot handle in recent NumPy versions.
group1_list = np.array_split(np.array(numH), nhosts)[::-1]
group2_list = np.array_split(np.array(numCN), nhosts)
group3_list = np.array_split(np.array(numCBF), nhosts)

split_idx_1 = []
split_idx_2 = []
split_idx_3 = []
for i in range(nhosts):
    split_idx_1.append(group1_list[i].tolist())
    split_idx_2.append(group2_list[i].tolist())
    split_idx_3.append(group3_list[i].tolist())


print("Global train splitted among hosts - indices:")
print(split_idx_1)
print(split_idx_2)
print(split_idx_3)
# Uncomment the following lines to shuffle the assignment;
# make sure each client still gets at least one patient.
# np.random.shuffle(split_idx_1)
# np.random.shuffle(split_idx_2)
# np.random.shuffle(split_idx_3)
# print(split_idx_1)
# print(split_idx_2)
# print(split_idx_3)
for i in range(nhosts):
    print("\nHost no.", i, ":")
    folder_path = 'splitAML/host' + str(i) + '/'
    # Collect this host's donors for each class.
    trainHealthyTemp = [trainHealthy[idx] for idx in split_idx_1[i]]
    trainCNTemp = [trainCN[idx] for idx in split_idx_2[i]]
    trainCBFTemp = [trainCBF[idx] for idx in split_idx_3[i]]
    train_phenotypes = []
    train_samples = []
    # Pool and shuffle the cells of each class held by this host. Classes without
    # donors on this host are skipped, so samples and labels stay aligned and no
    # data from a previous host is reused.
    if len(trainHealthyTemp) != 0:
        train_samples.append(sku.shuffle(np.vstack(trainHealthyTemp)))
        train_phenotypes.append(0)
    if len(trainCNTemp) != 0:
        train_samples.append(sku.shuffle(np.vstack(trainCNTemp)))
        train_phenotypes.append(1)
    if len(trainCBFTemp) != 0:
        train_samples.append(sku.shuffle(np.vstack(trainCBFTemp)))
        train_phenotypes.append(2)
    generate_data(train_samples, train_phenotypes, folder_path,
                  ncell=200, nsubset=700, verbose=0, generate_valid_set=False)

Global train splitted among hosts - indices:
[[2], [1], [0]]
[[0], [1], []]
[[0], [1], []]

Host no. 0 :
scale
Generating multi-cell inputs...
Done.

Host no. 1 :
scale
Generating multi-cell inputs...
Done.

Host no. 2 :
scale
Generating multi-cell inputs...
Done.
