## 1. Random Forest with Fingerprint

In this baseline model, we use each compound's fingerprint to predict the outcome of an assay. Each assay is treated as a single task, and prediction performance is measured on the held-out test folds of a cross-validation.

A fingerprint of length `1024` is extracted for every compound.

In [2]:
import numpy as np
import pandas as pd
import pickle
from json import load, dump
from sklearn.ensemble import RandomForestClassifier
from collections import Counter
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression

In [9]:
# Load the output matrix
output_matrix_npz = np.load("./resource/output_matrix.npz")
compound_inchi = output_matrix_npz["compound_inchi"]
compound_broad_id = output_matrix_npz["compound_broad_id"]
output_matrix = output_matrix_npz["output_matrix"]
output_matrix.shape

(9404, 141)

In [4]:
# Load the fingerprint
fingerprint = np.load("./resource/fp_features.npz")

# Build the feature array
fp_feature = np.zeros((len(compound_broad_id), fingerprint["features"].shape[1]))

# Choose selected compounds and arrange them by the output matrix order
fp_index = dict(zip(fingerprint["names"].astype(str), range(len(fingerprint["names"]))))

for b in range(len(compound_broad_id)):
    bid = compound_broad_id[b]
    fp_feature[b, :] = fingerprint["features"][fp_index[bid], :]

It took one hour on Condor to create this $9404 \times 1024$ fingerprint feature matrix.
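The row alignment in the cell above can also be written without an explicit Python loop over the feature matrix. The sketch below uses toy arrays in place of `fp_features.npz` (the IDs and 4-bit fingerprints are made up for illustration) and assumes every requested ID is present in the fingerprint file.

```python
import numpy as np

# Toy stand-ins for fingerprint["names"] and fingerprint["features"] (hypothetical values)
names = np.array(["BRD-A", "BRD-B", "BRD-C", "BRD-D"])
features = np.array([[1, 0, 0, 1],
                     [0, 1, 1, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1]])

# Row order wanted by the output matrix (a subset, in a different order)
compound_broad_id = np.array(["BRD-C", "BRD-A", "BRD-D"])

# Map each name to its row, then gather all rows with one fancy-indexing call
fp_index = {name: i for i, name in enumerate(names)}
rows = np.array([fp_index[bid] for bid in compound_broad_id])
fp_feature = features[rows, :]

print(fp_feature)
```

A missing ID raises a `KeyError` in the dictionary lookup, which is usually preferable to silently leaving a row of zeros.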

In [8]:
fp_feature = np.load("./resource/extracted_fp_feature.npz")["feature"]
fp_feature.shape

(9404, 1024)

In the PriA-SSB study, good random forest parameters were selected using a different assay dataset. We expect the best parameters from that study (`RF_h`) to perform well on this dataset too.

For each assay, we only train and predict on compounds that give non-NA results.
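As a minimal sketch of this per-assay filtering, the toy example below assumes that NA is encoded as `-1` (the same convention used in `train_rf_on_assay`). The matrices and the helper name `select_tested` are hypothetical.

```python
import numpy as np

# Toy output matrix: rows are compounds, columns are assays.
# 1 = active, 0 = inactive, -1 = not tested (NA).
output_matrix = np.array([[ 1, -1],
                          [ 0,  0],
                          [-1,  1],
                          [ 0, -1]])
features = np.arange(8).reshape(4, 2)  # hypothetical feature rows


def select_tested(assay_array, features):
    """Return (x, y) restricted to compounds with a non-NA label."""
    mask = assay_array != -1
    return features[mask, :], assay_array[mask]


x0, y0 = select_tested(output_matrix[:, 0], features)
x1, y1 = select_tested(output_matrix[:, 1], features)
print(y0, y1)
```

Each assay therefore gets its own training set, so the sample size differs from assay to assay.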

![](https://user-images.githubusercontent.com/3532898/46812713-e8a93280-cd3a-11e8-949a-3469236a3943.png)

In [74]:
def train_rf_on_assay(assay_array, rf, kfold=5, n_jobs=4):
    """
    Train and measure a random forest model in a cross validation fashion.
    The model is only trained on compounds which give non-NA results.
    """
    # Filter out NA from the assay_array
    y_index = [False if a == -1 else True for a in assay_array ]
    y = assay_array[y_index]
    x = fp_feature[y_index, :]
    
    if Counter(y)[1] < kfold:
        print("Warning: the total number of positives is less than kfold")
        
    # One-hot encode y array
    #y = np.vstack([[1, 0] if i == 1 else [0, 1] for i in y])
    
    # Build the cross validation scheme
    # Since some assays have extremely small number of positives,
    # I will use StratifiedKFold to preserve the proportion of positive in test set
    
    my_kfold_gen = StratifiedKFold(n_splits=kfold, shuffle=True)
    scoring = ["f1", "accuracy", "precision", "recall", "average_precision", "roc_auc"]
    cv = cross_validate(rf, x, y, scoring=scoring, cv=my_kfold_gen, n_jobs=n_jobs,
                        return_train_score=False)
    
    cv["total_count"] = Counter(y)
    
    return cv

In [86]:
results = {}

rf_classifier = RandomForestClassifier(n_estimators=8000, max_features="log2",
                                       min_samples_leaf=1, class_weight="balanced")

for i in range(output_matrix.shape[1]):
    print(i)
    results[i] = train_rf_on_assay(output_matrix[:,i], rf_classifier, kfold=4)

We do a stratified k-fold cross validation for each assay and record the metrics on each test fold. The function defaults to `kfold=5`, while the call above passes `kfold=4`.

It took 2 hours to train these 141 single tasks. We can visualize the results.
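To illustrate why stratification matters for assays with very few positives, the sketch below splits a label vector with 17 positives and 1530 negatives (the class counts of the first assay in the results table) and counts the positives that land in each test fold. The fixed `random_state` is an assumption added only to make the sketch reproducible.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 17 positives and 1530 negatives
y = np.array([1] * 17 + [0] * 1530)
x = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(x, y)]

# Every test fold receives a near-equal share of the positives
print(pos_per_fold)
```

With a plain, unstratified split, some test folds could contain no positives at all, which leaves metrics such as recall and `roc_auc` undefined for that fold.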

In [8]:
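# The results dict is stored as a pickled object array, so allow_pickle is required in recent NumPy
results = np.load("./resource/results.npz", allow_pickle=True)['results'].item()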

In [6]:
def get_mean_score(results):
    """
    Aggregate each score over 5 test sets.
    """
    mean_df = {
        "f1": [],
        "accuracy": [],
        "average_precision": [],
        "roc_auc": [],
        "precision": [],
        "recall": [],
        "pos_num": [],
        "neg_num": []
    }
    
    for k, r in results.items():
        mean_df["f1"].append(np.mean(r["test_f1"]))
        mean_df["accuracy"].append(np.mean(r["test_accuracy"]))
        mean_df["average_precision"].append(np.mean(r["test_average_precision"]))
        mean_df["roc_auc"].append(np.mean(r["test_roc_auc"]))
        mean_df["precision"].append(np.mean(r["test_precision"]))
        mean_df["recall"].append(np.mean(r["test_recall"]))
        mean_df["pos_num"].append(r["total_count"][1])
        mean_df["neg_num"].append(r["total_count"][0])
    
    return pd.DataFrame(mean_df)


In [9]:
mean_df = get_mean_score(results)
mean_df.to_csv("./mean_df.csv", index=False)
mean_df.head(10)

| assay | f1 | accuracy | average_precision | roc_auc | precision | recall | pos_num | neg_num |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.08 | 0.989659 | 0.249031 | 0.75384 | 0.2 | 0.05 | 17 | 1530 |
| 1 | 0.759369 | 0.682051 | 0.781926 | 0.746656 | 0.676644 | 0.867821 | 196 | 143 |
| 2 | 0.183729 | 0.781891 | 0.520559 | 0.776232 | 0.533333 | 0.121905 | 74 | 256 |
| 3 | 0.158615 | 0.828373 | 0.400695 | 0.69803 | 0.805556 | 0.090476 | 210 | 932 |
| 4 | 0.030199 | 0.930673 | 0.168273 | 0.654066 | 0.4 | 0.015692 | 128 | 1747 |
| 5 | 0.660623 | 0.701898 | 0.792594 | 0.774347 | 0.704681 | 0.622222 | 180 | 206 |
| 6 | 0.0 | 0.992296 | 0.02635 | 0.657967 | 0.0 | 0.0 | 24 | 3221 |
| 7 | 0.58 | 0.47 | 0.777778 | 0.633333 | 0.5 | 0.733333 | 11 | 10 |
| 8 | 0.250102 | 0.67802 | 0.50982 | 0.675496 | 0.553435 | 0.162791 | 215 | 431 |
| 9 | 0.019755 | 0.88601 | 0.26424 | 0.685354 | 0.566667 | 0.010095 | 396 | 3078 |


In [37]:
print(("Across 141 assays, the average f1={:.2f}%, accuracy={:.2f}%, ap={:.2f}%, auc={:.2f}%, " + \
       "precision={:.2f}%, recall={:.2f}%.").format(
    np.mean(mean_df["f1"]) * 100,
    np.mean(mean_df["accuracy"]) * 100,
    np.mean(mean_df["average_precision"]) * 100,
    np.mean(mean_df["roc_auc"]) * 100,
    np.mean(mean_df["precision"]) * 100,
    np.mean(mean_df["recall"]) * 100))

Across 141 assays, the average f1=10.03%, accuracy=91.96%, ap=30.35%, auc=72.19%, precision=27.76%, recall=7.83%.


![random_forest_plot_1.png](./plots/random_forest_plot_1.png)

### 1.1. Comments

- Accuracy is high for most assays, apart from a few outliers. This is due to class imbalance: most assays have far fewer positives than negatives.
- This baseline model performs poorly on `F1`, `AP`, `Precision`, and `Recall`.
- However, the model has a high `AUC`, which is the main metric used in the ICCR paper.
- Measured by `AUC`, the fingerprint baseline gives results similar to the ICCR paper's advanced model (attached below).
- `AP` has a very large variance. Assays with few samples (both positive and negative) tend to give high `AP` values.
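The effect of class imbalance on accuracy can be checked with simple arithmetic. The sketch below takes the class counts of assay 6 from the table above (24 positives, 3221 negatives) and computes the accuracy of a trivial classifier that predicts "negative" for every compound. The positive prevalence is also printed, since it is roughly the `AP` one would expect from a random ranking.

```python
# Class counts for assay 6 in the results table
pos_num, neg_num = 24, 3221
total = pos_num + neg_num

# A classifier that labels every compound as negative gets all negatives right
always_negative_accuracy = neg_num / total

# Fraction of positives; approximately the AP of a random ranking
prevalence = pos_num / total

print("always-negative accuracy = {:.4f}".format(always_negative_accuracy))
print("positive prevalence      = {:.4f}".format(prevalence))
```

The trivial classifier reaches about the same accuracy as the reported 0.992296 for this assay while having zero recall, which is why accuracy alone says little here.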

![](https://i.imgur.com/Z790iMB.png)

### 1.2. Random Forest + Fingerprint Using Converted InChI

I have built an output matrix for the intersected compounds using converted InChI strings. I have also excluded some problematic entries (conversion collisions) from the intersection.

This output matrix uses 209 assays (up from 141) and 26638 compounds (up from 9404). It should give us a fairer comparison.

In [2]:
# Load the output matrix
output_matrix_npz = np.load("./resource/output_matrix_convert.npz")
compound_inchi = output_matrix_npz["compound_inchi"]
compound_broad_id = output_matrix_npz["compound_broad_id"]
output_matrix = output_matrix_npz["output_matrix"]
output_matrix.shape

(26638, 209)

In [5]:
# Load the fingerprint
fingerprint = np.load("./resource/fp_features.npz")

# Build the feature array
fp_feature = np.zeros((len(compound_broad_id), fingerprint["features"].shape[1]))

# Choose selected compounds and arrange them by the output matrix order
fp_index = dict(zip(fingerprint["names"].astype(str), range(len(fingerprint["names"]))))

for b in range(len(compound_broad_id)):
    bid = compound_broad_id[b]
    fp_feature[b, :] = fingerprint["features"][fp_index[bid], :]

In [12]:
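# The results dict is stored as a pickled object array, so allow_pickle is required in recent NumPy
results = np.load("./resource/results_converted.npz", allow_pickle=True)['results'].item()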
mean_df = get_mean_score(results)
mean_df.to_csv("./mean_df_converted.csv", index=False)
print(mean_df.shape)
mean_df.head(10)

(209, 8)


| assay | f1 | accuracy | average_precision | roc_auc | precision | recall | pos_num | neg_num |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.08 | 0.988705 | 0.408457 | 0.873482 | 0.4 | 0.044444 | 45 | 3762 |
| 1 | 0.762071 | 0.689145 | 0.78592 | 0.742719 | 0.682989 | 0.864217 | 383 | 283 |
| 2 | 0.0 | 0.998618 | 0.16892 | 0.628251 | 0.0 | 0.0 | 13 | 9395 |
| 3 | 0.712121 | 0.591429 | 0.643214 | 0.411111 | 0.604286 | 0.9 | 18 | 14 |
| 4 | 0.0 | 0.818182 | 0.22215 | 0.388889 | 0.0 | 0.0 | 10 | 45 |
| 5 | 0.273907 | 0.784866 | 0.512232 | 0.736592 | 0.660476 | 0.175484 | 154 | 506 |
| 6 | 0.064461 | 0.865974 | 0.391856 | 0.733942 | 0.866667 | 0.03404 | 383 | 2400 |
| 7 | 0.012698 | 0.930264 | 0.195145 | 0.671045 | 0.3 | 0.006504 | 309 | 4122 |
| 8 | 0.502233 | 0.685106 | 0.675068 | 0.72811 | 0.657874 | 0.407571 | 297 | 462 |
| 9 | 0.0 | 0.992969 | 0.056756 | 0.633252 | 0.0 | 0.0 | 58 | 8333 |


In [14]:
print(("Across 209 assays, the average f1={:.2f}%, accuracy={:.2f}%, ap={:.2f}%, auc={:.2f}%, " + \
       "precision={:.2f}%, recall={:.2f}%.").format(
    np.mean(mean_df["f1"]) * 100,
    np.mean(mean_df["accuracy"]) * 100,
    np.mean(mean_df["average_precision"]) * 100,
    np.mean(mean_df["roc_auc"]) * 100,
    np.mean(mean_df["precision"]) * 100,
    np.mean(mean_df["recall"]) * 100))

Across 209 assays, the average f1=13.18%, accuracy=90.72%, ap=31.74%, auc=71.22%, precision=30.44%, recall=11.43%.


![](./plots/random_forest_plot_2.png)

#### 1.2.1. Comment

We see the same pattern as in the random forest results for 141 assays. In conclusion, `AUC` alone is not a good metric, and it is good that we have this fingerprint baseline.
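One way to see how `AUC` can look good while `AP` stays low on imbalanced data is a small synthetic example. The sketch below is an illustration under made-up scores, not a reproduction of the assay results: 10 positives are scored just above 90% of 1000 negatives, and both metrics are computed by hand with NumPy. The helper names `roc_auc` and `average_precision` are hypothetical.

```python
import numpy as np


def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    order = np.argsort(-scores, kind="stable")
    ranked = labels[order]
    hits = np.cumsum(ranked)
    precision_at_k = hits / np.arange(1, len(ranked) + 1)
    return float(precision_at_k[ranked == 1].mean())


# Synthetic scores: negatives spread over [0, 1], positives clustered near 0.9
neg = np.linspace(0.0, 1.0, 1000)
pos = 0.9 + 1e-6 * np.arange(10)
scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones(10, dtype=int), np.zeros(1000, dtype=int)])

auc = roc_auc(pos, neg)
ap = average_precision(scores, labels)
print("auc={:.3f}, ap={:.3f}".format(auc, ap))
```

Here 100 negatives still rank above every positive, so precision among the top-ranked compounds is low even though each positive beats 90% of the negatives.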

## 2. Logistic Regression with Inception Feature

We can use the Inception V3 features extracted from U2OS images to build end-to-end single-task models. This can tell us more about our future image-to-assay model.

In this section, we will use the data with ~~209~~ 212 assays. Due to the size limit of Gluster, we only have 244 plates. The paper reports 375 plates, and the authors' shared download script implies there are 406 plates. Therefore, we will not have features for all compounds used in the 212 assays.

In [6]:
# Load the output matrix
output_matrix_npz = np.load("./resource/output_matrix_convert_collision.npz")
compound_inchi = output_matrix_npz["compound_inchi"]
compound_broad_id = output_matrix_npz["compound_broad_id"]
output_matrix = output_matrix_npz["output_matrix"]
output_matrix.shape

(27241, 212)

In [7]:
# It is slow to work with DataFrame, we can make a dictionary with BID -> PID+WID
mean_well_df = pd.read_csv("./resource/merged_mean_table.csv")

In [8]:
all_pids = mean_well_df["Metadata_Plate"].tolist()
all_wids = mean_well_df["Metadata_Well"].tolist()
all_bids = mean_well_df["Metadata_broad_sample"].tolist()

mean_well_dict = {}
intersect_bids = set(compound_broad_id)
for i in range(len(all_bids)):
    cur_bid = all_bids[i]
    if cur_bid in intersect_bids:
        if cur_bid not in mean_well_dict:
            mean_well_dict[cur_bid] = {"pid": [all_pids[i]],
                                       "wid": [all_wids[i]]}
        else:
            mean_well_dict[cur_bid]["pid"].append(all_pids[i])
            mean_well_dict[cur_bid]["wid"].append(all_wids[i])
     
print("From our 244 plates, we found {} compounds over {} total intersected compounds.".format(
    len(mean_well_dict), len(compound_broad_id)))

From our 244 plates, we found 17915 compounds over 27241 total intersected compounds.
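The compound-to-well dictionary built above can be written more compactly with `collections.defaultdict`. This is only a sketch on made-up rows; the plate, well, and sample IDs are hypothetical, and it stores `(plate, well)` tuples rather than two parallel lists.

```python
from collections import defaultdict

# Hypothetical rows of (plate id, well id, Broad sample id)
rows = [
    (24277, "a01", "BRD-X"),
    (24277, "a02", "BRD-Y"),
    (24278, "a01", "BRD-X"),
    (24278, "b05", "BRD-Z"),
]
intersect_bids = {"BRD-X", "BRD-Y"}

# Group wells by compound, keeping only compounds in the intersection
wells_by_bid = defaultdict(list)
for pid, wid, bid in rows:
    if bid in intersect_bids:
        wells_by_bid[bid].append((pid, wid))

print(dict(wells_by_bid))
```

The number of replicates per compound is then simply `len(wells_by_bid[bid])`.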


In [9]:
# Reorder pids and wids to match output matrix row order
ordered_pids, ordered_wids = [], []
compound_replicate = {}

for b in compound_broad_id:
    if b in mean_well_dict:
        ordered_pids.extend(mean_well_dict[b]["pid"])
        ordered_wids.extend(mean_well_dict[b]["wid"])
        compound_replicate[b] = len(mean_well_dict[b]["pid"])

In [9]:
Counter(compound_replicate.values())

Counter({3: 2107, 4: 14806, 2: 544, 1: 457, 8: 1})

In our dataset, most compounds have 4 replicates (tested in 4 wells). Each well has 6-9 field-of-view images, so we have at least 24 samples for a typical compound.
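The replicate histogram printed above can be turned into a rough estimate of the number of image samples. The sketch below only uses the counts from that output and the 6-9 fields of view per well stated in the text.

```python
from collections import Counter

# Replicate histogram from the output above: {wells per compound: number of compounds}
replicates = Counter({3: 2107, 4: 14806, 2: 544, 1: 457, 8: 1})

n_compounds = sum(replicates.values())
n_wells = sum(k * v for k, v in replicates.items())
mean_wells = n_wells / n_compounds

# Each well contributes between 6 and 9 field-of-view images
images_low, images_high = 6 * n_wells, 9 * n_wells

print("{} compounds, {} wells, {:.2f} wells per compound".format(
    n_compounds, n_wells, mean_wells))
print("between {} and {} images in total".format(images_low, images_high))
```

A compound with 4 wells therefore contributes between 24 and 36 images.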

In [10]:
# Extract features for those plates & wells
pwid = ["{}_{}".format(ordered_pids[i], ordered_wids[i]) for i in range(len(ordered_wids))]
pwid_set = set(pwid)

combined_feature = np.load("/Users/JayWong/Downloads/combined_feature.npz")
combined_feature_pwids = combined_feature["names"]

# Get matching indices
matched_indices = []
matched_pwids = []
for i in range(len(combined_feature_pwids)):
    if combined_feature_pwids[i] in pwid_set:
        matched_indices.append(i)
        matched_pwids.append(combined_feature_pwids[i])

In [15]:
all_feature = combined_feature["features"]

In [23]:
matched_feature = all_feature[matched_indices,:]

In [24]:
matched_names = combined_feature_pwids[matched_indices]

In [25]:
np.savez("./resource/matched_collision_raw_features.npz", features=matched_feature, names=matched_names)

In [56]:
matched_features = np.load("./resource/matched_184_raw_features.npz")["features"]
matched_names = np.load("./resource/matched_184_raw_features.npz")["names"]
output_matrix = np.load("./resource/output_matrix_inception.npz")["output_matrix"]

In [29]:
# Rearrange output matrix to match our feature matrix
print(matched_feature.shape)

pwd_to_b_dict = {}
for k, v in mean_well_dict.items():
    for i in range(len(v["pid"])):
        pwd = "{}_{}".format(v["pid"][i], v["wid"][i])
        if pwd in pwd_to_b_dict:
            print(pwd)
        pwd_to_b_dict[pwd] = k
        
output_compound_b_to_index = dict(zip(compound_broad_id, range(len(compound_broad_id))))

# Create the indices to select entries from output matrix
select_bids = []
select_indices = []
for pw in matched_names:
    bid = pwd_to_b_dict[pw]
    select_bids.append(bid)
    select_indices.append(output_compound_b_to_index[bid])

print("{} compounds found in our extracted features (we only extracted 184 plates).".format(len(set(select_bids))))

# Rearrange output matrix
rearranged_output_matrix = output_matrix[select_indices,:]
rearranged_output_matrix.shape
np.savez("./resource/output_matrix_collision_inception.npz", output_matrix=rearranged_output_matrix)

(553044, 4096)
15192 compounds found in our extracted features (we only extracted 184 plates).


In [59]:
def train_lr_on_assay(aid, kfold=5, n_jobs=4):
    """
    (1). Use cross validation to find the best regularization parameter
    (2). Report the best parameter, and scores on the test set

    The model is only trained on compounds which give non-NA results.
    """
    assay_array = rearranged_output_matrix[:, aid]

    # Filter out NA from the assay_array
    y_index = [False if a == -1 else True for a in assay_array ]
    y = assay_array[y_index]
    x = matched_features[y_index, :]
    
    print(x.shape, y.shape)
    
    if Counter(y)[1] < kfold:
        print("Warning: the total number of positives is less than kfold")
    
    # Skip assays with too few training samples in either class
    if Counter(y)[1] < 10 or Counter(y)[0] < 10:
        print("Not enough training samples for assay {}".format(aid))
        return
    
    # Build the cross validation scheme
    # Since some assays have extremely small number of positives,
    # I will use StratifiedKFold to preserve the proportion of positive in test set
    
    my_kfold_gen = StratifiedKFold(n_splits=kfold, shuffle=True)
    scoring = ["f1", "accuracy", "precision", "recall", "average_precision", "roc_auc"]
    gs = GridSearchCV(LogisticRegression(penalty='l1', verbose=0,
                                         solver='liblinear'),
                      {'C': [0.01, 0.1, 1, 10]},
                      cv=my_kfold_gen,
                      scoring=scoring,
                      refit=False,
                      n_jobs=n_jobs, verbose=2)
    
    gs.fit(x, y)
    result = gs.cv_results_
    result["count"] = Counter(y)
    return result

It takes 12 hours to train 3 assays due to the size of training data ($546690 \times 4096$).

- We can first use L1 logistic regression to select features (reducing 4096 features to 512)
- We can also use UMAP for dimensionality reduction
- We can set `max_iter` to a smaller number (100)

In [24]:
assay_array = output_matrix[:, 5].copy()

# Filter out NA from the assay_array
y_index = [False if a == -1 else True for a in assay_array ]
y = assay_array[y_index].copy()
x = matched_features[y_index, :].copy()

In [27]:
lr = LogisticRegression(penalty='l1', verbose=1, solver='saga', max_iter=1000)
lr.fit(x,y)

abs_coef = np.abs(lr.coef_[0])
largest_coef_index = np.argpartition(abs_coef, -512)[-512:]

np.savez("feature_index.npz", index=largest_coef_index)
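The top-k selection in the cell above relies on `np.argpartition`, which returns the indices of the k largest values in no particular order. The toy sketch below (made-up coefficients, `k = 2` instead of 512) shows the selection and how the saved index could be applied to a feature matrix.

```python
import numpy as np

# Toy L1 coefficients; many are exactly zero, as is typical with an L1 penalty
coef = np.array([0.0, -2.5, 0.1, 0.0, 3.0, -0.2])
k = 2

# Indices of the k coefficients with the largest absolute value (unordered)
top_k = np.argpartition(np.abs(coef), -k)[-k:]
top_k_sorted = np.sort(top_k)  # sort to keep the original column order

# Apply the selection to a hypothetical feature matrix
x = np.arange(18, dtype=float).reshape(3, 6)
x_reduced = x[:, top_k_sorted]

n_nonzero = int(np.count_nonzero(coef))
print(top_k_sorted, x_reduced.shape, n_nonzero)
```

If the L1 fit leaves fewer than k non-zero coefficients, the selection is padded with zero-weight features, so it may be worth checking `n_nonzero` before fixing k.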

We finished training on the 212 assays (ignoring collisions) with the full 4096 features using Condor batch submission. Four assay jobs require 200+ GB of memory and more time to train.

- `SAGA` solver works better than `Liblinear`
- Across 212-4=208 assays, the average f1=34.87%, accuracy=84.48%, ap=33.12%, auc=85.22%, precision=29.77%, recall=75.17%.

![](./plots/lr_plot_1.png)

- The average performance is better than fingerprint random forest.
- Accuracy has a larger variance (matching AUC).
- Similar to the fingerprint random forest, this model struggles on assays with a large number of positives.

## 3. Logistic Regression with Inception Features for 349 Plates

In [2]:
# Load the output matrix
output_matrix_npz = np.load("./resource/output_matrix_convert_collision.npz")
compound_inchi = output_matrix_npz["compound_inchi"]
compound_broad_id = output_matrix_npz["compound_broad_id"]
output_matrix = output_matrix_npz["output_matrix"]
output_matrix.shape

(27241, 212)

In [3]:
# It is slow to work with DataFrame, we can make a dictionary with BID -> PID+WID
mean_well_df = pd.read_csv("./resource/merged_mean_table_349.csv")

In [4]:
all_pids = mean_well_df["Metadata_Plate"].tolist()
all_wids = mean_well_df["Metadata_Well"].tolist()
all_bids = mean_well_df["Metadata_broad_sample"].tolist()

mean_well_dict = {}
intersect_bids = set(compound_broad_id)
for i in range(len(all_bids)):
    cur_bid = all_bids[i]
    if cur_bid in intersect_bids:
        if cur_bid not in mean_well_dict:
            mean_well_dict[cur_bid] = {"pid": [all_pids[i]],
                                       "wid": [all_wids[i]]}
        else:
            mean_well_dict[cur_bid]["pid"].append(all_pids[i])
            mean_well_dict[cur_bid]["wid"].append(all_wids[i])
     
print("From our 349 plates, we found {} compounds over {} total intersected compounds.".format(
    len(mean_well_dict), len(compound_broad_id)))

From our 349 plates, we found 24857 compounds over 27241 total intersected compounds.


In [13]:
# Reorder pids and wids to match output matrix row order
ordered_pids, ordered_wids = [], []
compound_replicate = {}

for b in compound_broad_id:
    if b in mean_well_dict:
        ordered_pids.extend(mean_well_dict[b]["pid"])
        ordered_wids.extend(mean_well_dict[b]["wid"])
        compound_replicate[b] = len(mean_well_dict[b]["pid"])

In [14]:
Counter(compound_replicate.values())

Counter({3: 3041, 4: 19680, 8: 768, 1: 972, 2: 392, 7: 1, 6: 3})

## 4. Logistic Regression with Normalized Inception Features for 406 Plates

In [2]:
# Load the output matrix
output_matrix_npz = np.load("./resource/output_matrix_convert_collision.npz")
compound_inchi = output_matrix_npz["compound_inchi"]
compound_broad_id = output_matrix_npz["compound_broad_id"]
output_matrix = output_matrix_npz["output_matrix"]
output_matrix.shape

(27241, 212)

In [7]:
# It is slow to work with DataFrame, we can make a dictionary with BID -> PID+WID
mean_well_df = pd.read_csv("./resource/merged_table_406.csv")

In [8]:
all_pids = mean_well_df["Metadata_Plate"].tolist()
all_wids = mean_well_df["Metadata_Well"].tolist()
all_bids = mean_well_df["Metadata_broad_sample"].tolist()

mean_well_dict = {}
intersect_bids = set(compound_broad_id)
for i in range(len(all_bids)):
    cur_bid = all_bids[i]
    if cur_bid in intersect_bids:
        if cur_bid not in mean_well_dict:
            mean_well_dict[cur_bid] = {"pid": [all_pids[i]],
                                       "wid": [all_wids[i]]}
        else:
            mean_well_dict[cur_bid]["pid"].append(all_pids[i])
            mean_well_dict[cur_bid]["wid"].append(all_wids[i])
     
print("From all 406 plates, we found {} compounds over {} total intersected compounds.".format(
    len(mean_well_dict), len(compound_broad_id)))

From all 406 plates, we found 26939 compounds over 27241 total intersected compounds.


In [9]:
mean_well_df = None

In [10]:
# Reorder pids and wids to match output matrix row order
ordered_pids, ordered_wids = [], []
compound_replicate = {}

for b in compound_broad_id:
    if b in mean_well_dict:
        ordered_pids.extend(mean_well_dict[b]["pid"])
        ordered_wids.extend(mean_well_dict[b]["wid"])
        compound_replicate[b] = len(mean_well_dict[b]["pid"])

In [11]:
Counter(compound_replicate.values())

Counter({3: 2558, 4: 22026, 8: 1522, 7: 238, 2: 392, 1: 200, 6: 3})

In [12]:
# Extract features for those plates & wells
pwid = ["{}_{}".format(ordered_pids[i], ordered_wids[i]) for i in range(len(ordered_wids))]
pwid_set = set(pwid)

combined_feature = np.load("./resource/normed_features.npz")
combined_feature_pwids = combined_feature["names"]

# Get matching indices
matched_indices = []
matched_pwids = []
for i in range(len(combined_feature_pwids)):
    if combined_feature_pwids[i] in pwid_set:
        matched_indices.append(i)
        matched_pwids.append(combined_feature_pwids[i])
        
print("{} images out of {} images are used".format(len(matched_indices), len(combined_feature_pwids)))

659267 images out of 913930 images are used
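The matching loop above can also be expressed with `np.isin`, which avoids iterating over the feature names in Python. The names below are made up and follow the `plate_well` pattern used for `pwid`.

```python
import numpy as np

# Hypothetical "plate_well" names attached to each feature row
feature_names = np.array(["24277_a01", "24277_a02", "24278_a01", "24279_c03"])
wanted = ["24278_a01", "24277_a01"]

# Boolean mask of rows whose name is in the wanted set, then the matching row indices
mask = np.isin(feature_names, wanted)
matched_indices = np.flatnonzero(mask)
matched_names = feature_names[matched_indices]

print("{} rows out of {} rows are used".format(len(matched_indices), len(feature_names)))
```

The indices come back in the order of the feature file, which is the same order the loop produces.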


In [None]:
all_feature = combined_feature["features"]
matched_feature = all_feature[matched_indices,:]

In [None]:
matched_names = combined_feature_pwids[matched_indices]

In [None]:
np.savez("./resource/matched_collision_raw_features.npz", features=matched_feature, names=matched_names)

## 5. Use Mean Well Features to Predict Activity

In [2]:
# Load the output matrix
output_matrix_npz = np.load("./resource/output_matrix_convert_collision.npz")
compound_inchi = output_matrix_npz["compound_inchi"]
compound_broad_id = output_matrix_npz["compound_broad_id"]
output_matrix = output_matrix_npz["output_matrix"]
output_matrix.shape

(27241, 212)

In [5]:
# It is slow to work with DataFrame, we can make a dictionary with BID -> PID+WID
mean_well_df = pd.read_csv("./resource/merged_table_406.csv")

In [4]:
# It is too slow to operate on the dataframe, we read line by line
with open("./resource/merged_table_406.csv", 'r') as fp:
    fp.readline()  # skip the header row
    for line in fp:
        attrs = line.split(',')
        pid = int(attrs[0])
        wid = attrs[1]
        bid = attrs[6]
        

In [6]:
all_pids = mean_well_df["Metadata_Plate"].tolist()
all_wids = mean_well_df["Metadata_Well"].tolist()
all_bids = mean_well_df["Metadata_broad_sample"].tolist()

mean_well_dict = {}
intersect_bids = set(compound_broad_id)
for i in range(len(all_bids)):
    cur_bid = all_bids[i]
    if cur_bid in intersect_bids:
        if cur_bid not in mean_well_dict:
            mean_well_dict[cur_bid] = {"pid": [all_pids[i]],
                                       "wid": [all_wids[i]]}
        else:
            mean_well_dict[cur_bid]["pid"].append(all_pids[i])
            mean_well_dict[cur_bid]["wid"].append(all_wids[i])
     
print("From all 406 plates, we found {} compounds over {} total intersected compounds.".format(
    len(mean_well_dict), len(compound_broad_id)))

From all 406 plates, we found 26939 compounds over 27241 total intersected compounds.


In [7]:
# Reorder pids and wids to match output matrix row order
ordered_pids, ordered_wids = [], []
compound_replicate = {}

for b in compound_broad_id:
    if b in mean_well_dict:
        ordered_pids.extend(mean_well_dict[b]["pid"])
        ordered_wids.extend(mean_well_dict[b]["wid"])
        compound_replicate[b] = len(mean_well_dict[b]["pid"])

ordered_pid_wid_pairs = set(zip(ordered_pids, ordered_wids))

In [None]:
# Extract overlapping rows from the well df
matched_indices = []

for i, r in mean_well_df.iterrows():
    if i == 20:
        print(1)

In [13]:
# It is too slow to operate on the dataframe, we read line by line
with open("./resource/merged_table_406.csv", 'r') as fp:
    with open("./resource/merged_table_406_intersect.csv", 'w') as fo:
        fo.write(fp.readline())
        for line in fp:
            attrs = line.split(',')
            pid = int(attrs[0])
            wid = attrs[1]
            pid_wid_pair = (pid, wid)
            if pid_wid_pair in ordered_pid_wid_pairs:
                fo.write(line)
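Splitting on `','` works as long as no field contains a quoted comma. A slightly more defensive variant of the same streaming filter can use the standard `csv` module, as sketched below on an in-memory file. The column subset and the quoted plate map name are hypothetical and only serve to exercise the quoting.

```python
import csv
import io

# In-memory stand-in for the large CSV file (hypothetical content)
raw = io.StringIO(
    "Metadata_Plate,Metadata_Well,Metadata_Plate_Map_Name\n"
    '25855,a01,"H-CBLF-002-4, batch 1"\n'
    "25855,a02,H-CBLF-002-4\n"
    "25856,a01,H-CBLF-002-5\n"
)
wanted_pairs = {(25855, "a02"), (25856, "a01")}

out = io.StringIO()
reader = csv.reader(raw)
writer = csv.writer(out, lineterminator="\n")

writer.writerow(next(reader))  # copy the header row
for row in reader:
    # Keep only rows whose (plate, well) pair is in the wanted set
    if (int(row[0]), row[1]) in wanted_pairs:
        writer.writerow(row)

kept = out.getvalue().splitlines()
print(kept)
```

Like the original cell, this reads one row at a time, so memory use stays flat regardless of file size.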

In [3]:
mean_well_df = pd.read_csv("./resource/merged_table_406_intersect.csv")

In [4]:
mean_well_df.head()

Unnamed: 0,Metadata_Plate,Metadata_Well,Metadata_Assay_Plate_Barcode,Metadata_Plate_Map_Name,Metadata_well_position,Metadata_ASSAY_WELL_ROLE,Metadata_broad_sample,Metadata_mmoles_per_liter,Metadata_solvent,Metadata_pert_id,...,Nuclei_Texture_Variance_DNA_5_0,Nuclei_Texture_Variance_ER_10_0,Nuclei_Texture_Variance_ER_3_0,Nuclei_Texture_Variance_ER_5_0,Nuclei_Texture_Variance_Mito_10_0,Nuclei_Texture_Variance_Mito_3_0,Nuclei_Texture_Variance_Mito_5_0,Nuclei_Texture_Variance_RNA_10_0,Nuclei_Texture_Variance_RNA_3_0,Nuclei_Texture_Variance_RNA_5_0
0,25855,a01,25855,H-CBLF-002-4,a01,treated,BRD-K14087339-001-01-6,4.759276,DMSO,BRD-K14087339,...,3.377425,1.258327,1.328166,1.288077,1.436918,1.645024,1.571905,2.358518,2.383887,2.380844
1,25855,a02,25855,H-CBLF-002-4,a02,treated,BRD-K53903148-001-01-7,4.924897,DMSO,BRD-K53903148,...,3.349516,1.308108,1.367728,1.343064,1.488288,1.612786,1.543314,3.132551,2.955101,2.99226
2,25855,a03,25855,H-CBLF-002-4,a03,treated,BRD-K37357048-001-01-8,5.138267,DMSO,BRD-K37357048,...,3.815022,1.584991,1.748497,1.70585,1.439318,1.577664,1.529118,3.023558,2.832276,2.911847
3,25855,a04,25855,H-CBLF-002-4,a04,treated,BRD-K25385069-001-01-7,4.91672,DMSO,BRD-K25385069,...,3.486581,1.385492,1.472731,1.43572,1.588143,1.813048,1.776891,3.168433,2.967504,3.046684
4,25855,a05,25855,H-CBLF-002-4,a05,treated,BRD-K63140065-001-01-3,5.226743,DMSO,BRD-K63140065,...,4.170932,1.832773,1.935458,1.886178,1.626151,1.706171,1.688343,2.77361,2.698169,2.705382


In [5]:
matched_feature = mean_well_df.values

In [19]:
all_bids = mean_well_df["Metadata_broad_sample"].tolist()

In [18]:
# Rearrange output matrix to match our feature matrix
print(matched_feature.shape)
output_compound_b_to_index = dict(zip(compound_broad_id, range(len(compound_broad_id))))

(110622, 1800)


In [21]:
# Create the indices to select entries from output matrix
select_bids = []
select_indices = []
for bid in all_bids:
    select_bids.append(bid)
    select_indices.append(output_compound_b_to_index[bid])

print("{} compounds found in our extracted features.".format(len(set(select_bids))))

# Rearrange output matrix
rearranged_output_matrix = output_matrix[select_indices,:]
rearranged_output_matrix.shape
np.savez("./resource/output_matrix_collision_meanwell.npz", output_matrix=rearranged_output_matrix)

26939 compounds found in our extracted features.


In [22]:
rearranged_output_matrix.shape

(110622, 212)

In [19]:
# Ditch string (categorical) features
float_cols = []
for c in range(matched_feature.shape[1]):
    col = matched_feature[:, c]
    try:
        temp = col.astype(np.float64)
        float_cols.append(c)
    except ValueError:
        pass
    
print(len(float_cols))

1787
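An alternative to testing each column with `astype` is to let pandas pick the numeric columns by dtype. The sketch below uses a tiny made-up frame with a few of the column names from the table above. Note that it relies on the dtypes inferred by `read_csv`, so a string column whose values all happen to parse as numbers would be treated differently than in the loop above.

```python
import pandas as pd

# Tiny stand-in for mean_well_df (hypothetical rows)
df = pd.DataFrame({
    "Metadata_Plate": [25855, 25855],
    "Metadata_Well": ["a01", "a02"],
    "Metadata_solvent": ["DMSO", "DMSO"],
    "Nuclei_Texture_Variance_DNA_5_0": [3.377425, 3.349516],
})

# Keep integer and float columns only, then convert to a float matrix
numeric_df = df.select_dtypes(include="number")
float_matrix = numeric_df.to_numpy(dtype="float64")

# Names of the string (categorical) columns that were dropped
dropped = [c for c in df.columns if c not in numeric_df.columns]
print(float_matrix.shape, dropped)
```

Working with column names instead of positional indices also makes it easier to drop specific metadata columns by name afterwards.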


In [20]:
float_matrix = matched_feature[:, float_cols]

In [31]:
float_matrix = float_matrix.astype(np.float64)

In [52]:
selected_cols = [i for i in range(1787) if i != 3]
float_matrix = float_matrix[:, selected_cols]

In [53]:
np.savez("meanwell_train_features.npz", features=float_matrix)

In [29]:
# and Metadata_pert_id_vendor
np.array(list(mean_well_df))[list(set(range(1800)) - set(float_cols))]

array(['Metadata_Well', 'Metadata_Plate_Map_Name',
       'Metadata_well_position', 'Metadata_ASSAY_WELL_ROLE',
       'Metadata_broad_sample', 'Metadata_solvent', 'Metadata_pert_id',
       'Metadata_pert_mfc_id', 'Metadata_pert_well', 'Metadata_cell_id',
       'Metadata_broad_sample_type', 'Metadata_pert_vehicle',
       'Metadata_pert_type'], dtype='<U51')

In [55]:
float_matrix.shape

(110622, 1786)