## 1. Random Forest with Fingerprint

In this baseline model, we use each compound's fingerprint to predict the outcome of an assay. Each assay is treated as a separate single-task problem, and prediction performance is measured on the held-out test folds of a cross-validation.

A fingerprint of length `1024` is extracted for every compound.

In [1]:
import numpy as np
import pandas as pd
from json import load, dump
from sklearn.ensemble import RandomForestClassifier
from collections import Counter
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_validate, StratifiedKFold

In [9]:
# Load the output matrix
output_matrix_npz = np.load("./resource/output_matrix.npz")
compound_inchi = output_matrix_npz["compound_inchi"]
compound_broad_id = output_matrix_npz["compound_broad_id"]
output_matrix = output_matrix_npz["output_matrix"]
output_matrix.shape

(9404, 141)

In [4]:
# Load the fingerprint
fingerprint = np.load("./resource/fp_features.npz")

# Build the feature array
fp_feature = np.zeros((len(compound_broad_id), fingerprint["features"].shape[1]))

# Choose selected compounds and arrange them by the output matrix order
fp_index = dict(zip(fingerprint["names"].astype(str), range(len(fingerprint["names"]))))

for b in range(len(compound_broad_id)):
    bid = compound_broad_id[b]
    fp_feature[b, :] = fingerprint["features"][fp_index[bid], :]

It took one hour on Condor to build this $9404 \times 1024$ fingerprint feature matrix, so the next cell loads a saved copy instead of rebuilding it.
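The reordering done by the loop above can also be written as a single fancy-indexing step. The sketch below uses small made-up arrays (`names`, `features`, and `wanted_ids` are hypothetical stand-ins for `fingerprint["names"]`, `fingerprint["features"]`, and `compound_broad_id`). It also reads the feature array into a local variable once. Indexing an `.npz` archive loads the array again on each access, which likely contributes to the long run time of the loop.

```python
import numpy as np

# Toy stand-ins for fingerprint["names"], fingerprint["features"], and
# compound_broad_id; all values are made up for illustration.
names = np.array(["BRD-A", "BRD-B", "BRD-C", "BRD-D"])
features = np.array([[1, 0, 1],
                     [0, 1, 1],
                     [1, 1, 0],
                     [0, 0, 1]])
wanted_ids = np.array(["BRD-C", "BRD-A", "BRD-D"])

# Map each identifier to its row in the fingerprint table.
row_of = {name: row for row, name in enumerate(names)}

# Collect the row numbers in the desired order, then gather all rows at once.
rows = np.array([row_of[bid] for bid in wanted_ids])
aligned = features[rows]

print(aligned)
```

Row `k` of `aligned` is the fingerprint of `wanted_ids[k]`, which is the same invariant the loop establishes for `fp_feature`.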

In [8]:
fp_feature = np.load("./resource/extracted_fp_feature.npz")["feature"]
fp_feature.shape

(9404, 1024)

The PriA-SSB study selected good random forest hyperparameters on a different assay dataset. We expect its best configuration (`RF_h`) to perform well on this dataset too.

For each assay, we only train and predict on compounds that give non-NA results.

![](https://user-images.githubusercontent.com/3532898/46812713-e8a93280-cd3a-11e8-949a-3469236a3943.png)
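As a minimal sketch of this filtering step, the snippet below drops NA entries with a boolean mask. It assumes, as in the training function that follows, that NA is encoded as `-1`. The arrays are toy data, and `features` and `measured` are hypothetical names.

```python
import numpy as np

# Toy assay column: 1 = active, 0 = inactive, -1 = not measured (NA).
assay_array = np.array([1, -1, 0, 0, -1, 1, 0])
# Toy feature matrix with one row per compound.
features = np.arange(14).reshape(7, 2)

# Boolean mask that is True wherever the assay has a label.
measured = assay_array != -1

# Keep only the labelled compounds, in both the labels and the features.
y = assay_array[measured]
x = features[measured, :]

print(y)
print(x.shape)
```

The same mask is applied to labels and features, so row `k` of `x` still belongs to label `y[k]`.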

In [74]:
def train_rf_on_assay(assay_array, rf, kfold=5, n_jobs=4):
    """
    Train and measure a random forest model in a cross-validation fashion.
    The model is only trained on compounds which give non-NA results.
    """
    # Filter out NA from the assay_array
    y_index = assay_array != -1
    y = assay_array[y_index]
    x = fp_feature[y_index, :]
    
    if Counter(y)[1] < kfold:
        print("Warning: the total number of positives is less than kfold")
        
    # One-hot encode y array
    #y = np.vstack([[1, 0] if i == 1 else [0, 1] for i in y])
    
    # Build the cross validation scheme
    # Since some assays have an extremely small number of positives,
    # use StratifiedKFold to preserve the proportion of positives in each test fold
    
    my_kfold_gen = StratifiedKFold(n_splits=kfold, shuffle=True)
    scoring = ["f1", "accuracy", "precision", "recall", "average_precision", "roc_auc"]
    cv = cross_validate(rf, x, y, scoring=scoring, cv=my_kfold_gen, n_jobs=n_jobs,
                        return_train_score=False)
    
    cv["total_count"] = Counter(y)
    
    return cv

In [86]:
results = {}

rf_classifier = RandomForestClassifier(n_estimators=8000, max_features="log2",
                                       min_samples_leaf=1, class_weight="balanced")

for i in range(output_matrix.shape[1]):
    print(i)
    results[i] = train_rf_on_assay(output_matrix[:,i], rf_classifier, kfold=5)

We do a stratified 5-fold cross validation for each assay and record metrics on 5 test sets.
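To illustrate what stratification buys us on a skewed assay, the sketch below splits toy labels with `StratifiedKFold` and counts the positives that land in each test fold. The label counts are made up, and `random_state=0` is added here only to make the sketch reproducible; it is not part of the setup above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 10 positives among 100 compounds (made-up numbers).
y = np.array([1] * 10 + [0] * 90)
# Placeholder features; only the labels determine how the folds are stratified.
x = np.zeros((len(y), 1))

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many positives end up in each of the 5 test folds.
positives_per_fold = [int(y[test_idx].sum())
                      for _, test_idx in splitter.split(x, y)]
print(positives_per_fold)
```

Every test fold receives its share of positives, whereas a plain shuffled split could leave a fold with none. Metrics such as recall are uninformative on a fold without positives.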

It took 2 hours to train these 141 single-task models. We can now visualize the results.

In [5]:
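# The results dict is stored as a pickled object array, so allow_pickle is needed
results = np.load("./resource/results.npz", allow_pickle=True)['results'].item()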

In [22]:
def get_mean_score(results):
    """
    Average each score over the test folds, giving one row per assay.
    """
    mean_df = {
        "f1": [],
        "accuracy": [],
        "average_precision": [],
        "roc_auc": [],
        "precision": [],
        "recall": [],
        "pos_num": [],
        "neg_num": []
    }
    
    for k, r in results.items():
        mean_df["f1"].append(np.mean(r["test_f1"]))
        mean_df["accuracy"].append(np.mean(r["test_accuracy"]))
        mean_df["average_precision"].append(np.mean(r["test_average_precision"]))
        mean_df["roc_auc"].append(np.mean(r["test_roc_auc"]))
        mean_df["precision"].append(np.mean(r["test_precision"]))
        mean_df["recall"].append(np.mean(r["test_recall"]))
        mean_df["pos_num"].append(r["total_count"][1])
        mean_df["neg_num"].append(r["total_count"][0])
    
    return pd.DataFrame(mean_df)

mean_df = get_mean_score(results)
mean_df.to_csv("./mean_df.csv", index=False)

In [23]:
mean_df.head(10)

| assay | f1 | accuracy | average_precision | roc_auc | precision | recall | pos_num | neg_num |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.08 | 0.989659 | 0.249031 | 0.75384 | 0.2 | 0.05 | 17 | 1530 |
| 1 | 0.759369 | 0.682051 | 0.781926 | 0.746656 | 0.676644 | 0.867821 | 196 | 143 |
| 2 | 0.183729 | 0.781891 | 0.520559 | 0.776232 | 0.533333 | 0.121905 | 74 | 256 |
| 3 | 0.158615 | 0.828373 | 0.400695 | 0.69803 | 0.805556 | 0.090476 | 210 | 932 |
| 4 | 0.030199 | 0.930673 | 0.168273 | 0.654066 | 0.4 | 0.015692 | 128 | 1747 |
| 5 | 0.660623 | 0.701898 | 0.792594 | 0.774347 | 0.704681 | 0.622222 | 180 | 206 |
| 6 | 0.0 | 0.992296 | 0.02635 | 0.657967 | 0.0 | 0.0 | 24 | 3221 |
| 7 | 0.58 | 0.47 | 0.777778 | 0.633333 | 0.5 | 0.733333 | 11 | 10 |
| 8 | 0.250102 | 0.67802 | 0.50982 | 0.675496 | 0.553435 | 0.162791 | 215 | 431 |
| 9 | 0.019755 | 0.88601 | 0.26424 | 0.685354 | 0.566667 | 0.010095 | 396 | 3078 |


In [37]:
print(("Across 141 assays, the average f1={:.2f}%, accuracy={:.2f}%, ap={:.2f}%, auc={:.2f}%, " + \
       "precision={:.2f}%, recall={:.2f}%.").format(
    np.mean(mean_df["f1"]) * 100,
    np.mean(mean_df["accuracy"]) * 100,
    np.mean(mean_df["average_precision"]) * 100,
    np.mean(mean_df["roc_auc"]) * 100,
    np.mean(mean_df["precision"]) * 100,
    np.mean(mean_df["recall"]) * 100))

Across 141 assays, the average f1=10.03%, accuracy=91.96%, ap=30.35%, auc=72.19%, precision=27.76%, recall=7.83%.


![random_forest_plot_1.png](./plots/random_forest_plot_1.png)

### 1.1. Comments

- Accuracy is high for most assays, apart from a few outlier assays. This is a consequence of class imbalance: positives are rare in most assays, so predicting mostly negatives already yields high accuracy.
- This baseline model performs poorly on `F1`, `AP`, `Precision`, and `Recall`.
- However, it achieves a high `AUC`, which is the main metric used in the ICCR paper.
- Measured by `AUC`, the fingerprint baseline gives results similar to the ICCR paper's advanced model (attached below).
- `AP` has a very large variance. Assays with few samples (both positive and negative) tend to give high `AP` values.

![](https://i.imgur.com/Z790iMB.png)
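The pattern noted in the comments (high accuracy and `AUC`, low `Recall` and `F1`) is typical of skewed data. The following is a minimal sketch with made-up scores, not the model's actual output. `binary_metrics` is a hypothetical helper written for this illustration. It thresholds scores at `0.5` and computes `AUC` as the fraction of positive/negative pairs that are ranked correctly.

```python
def binary_metrics(y_true, scores, threshold=0.5):
    """Compute accuracy, precision, recall, F1, and ROC AUC from raw scores."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # ROC AUC as the fraction of (positive, negative) pairs ranked correctly,
    # counting ties as half a correct pair.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}


# Made-up scores: 5 positives and 95 negatives. Most positives outrank the
# negatives, but only one of them clears the 0.5 decision threshold.
y_true = [1] * 5 + [0] * 95
scores = [0.6, 0.4, 0.4, 0.3, 0.15] + [0.1] * 50 + [0.2] * 45

metrics = binary_metrics(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the ranking is good, so `AUC` is high, and the many true negatives keep accuracy high. Only one of the five positives is predicted as positive, so `Recall` and `F1` stay low.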