## Random Forest with Fingerprint

In this baseline model, we use each compound's fingerprint to predict the outcome of an assay. Each assay is treated as its own single-task problem, and prediction performance is measured on the held-out test folds of a cross-validation.

A fingerprint of length `1024` is extracted for every compound.

In [53]:
import numpy as np
import pandas as pd
from json import load, dump
from sklearn.ensemble import RandomForestClassifier
from collections import Counter
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_validate, StratifiedKFold

In [9]:
# Load the output matrix
output_matrix_npz = np.load("./resource/output_matrix.npz")
compound_inchi = output_matrix_npz["compound_inchi"]
compound_broad_id = output_matrix_npz["compound_broad_id"]
output_matrix = output_matrix_npz["output_matrix"]
output_matrix.shape

(9404, 141)

In [4]:
# Load the fingerprint
fingerprint = np.load("./resource/fp_features.npz")

# Build the feature array
fp_feature = np.zeros((len(compound_broad_id), fingerprint["features"].shape[1]))

# Choose selected compounds and arrange them by the output matrix order
fp_index = dict(zip(fingerprint["names"].astype(str), range(len(fingerprint["names"]))))

for b in range(len(compound_broad_id)):
    bid = compound_broad_id[b]
    fp_feature[b, :] = fingerprint["features"][fp_index[bid], :]

It took one hour on Condor to create this $9404 \times 1024$ fingerprint feature matrix.
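One possible contributor to the long run time is that an object returned by `np.load` on an `.npz` file loads arrays lazily, so `fingerprint["features"]` inside the loop re-reads the array from disk on every iteration. The sketch below shows the same reordering with each array read once and all rows selected in a single indexing step. It runs on a small made-up archive; the file name `toy_fp_features.npz`, the compound names, and all `toy_*` variables are hypothetical stand-ins, not the real data.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for fp_features.npz (hypothetical data: 5 compounds x 8 bits)
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_fp_features.npz")
np.savez(
    toy_path,
    names=np.array(["c0", "c1", "c2", "c3", "c4"]),
    features=np.arange(40).reshape(5, 8),
)

# Compounds in the order used by the output matrix (a reordered subset)
toy_broad_id = np.array(["c3", "c0", "c4"])

with np.load(toy_path) as toy_npz:
    # Read each array from disk exactly once
    names = toy_npz["names"].astype(str)
    features = toy_npz["features"]

# Map each compound name to its row in the fingerprint array
name_to_row = {name: row for row, name in enumerate(names)}

# Select and reorder all rows in one fancy-indexing step
rows = [name_to_row[bid] for bid in toy_broad_id]
toy_fp_feature = features[rows, :].astype(float)

print(toy_fp_feature.shape)  # (3, 8)
```

The result has one row per requested compound, in the requested order, which is what the loop above produces for the real data.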

In [8]:
fp_feature = np.load("./resource/extracted_fp_feature.npz")["feature"]
fp_feature.shape

(9404, 1024)

Good random forest hyperparameters were selected in the PriA-SSB study using a different assay dataset. We expect the best of those settings (`RF_h`) to also perform well on this dataset.

For each assay, we only train and predict on compounds that give non-NA results.

![](https://user-images.githubusercontent.com/3532898/46812713-e8a93280-cd3a-11e8-949a-3469236a3943.png)
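As a minimal sketch of this filtering step, the example below drops untested compounds from one assay column and keeps the matching feature rows. It assumes the encoding used in this notebook (`1` for active, `0` for inactive, `-1` for NA); the `toy_*` arrays are made up for illustration.

```python
import numpy as np

# Toy assay column: 1 = active, 0 = inactive, -1 = not tested (NA)
toy_assay = np.array([1, -1, 0, 0, -1, 1])

# Toy feature matrix with one row per compound
toy_features = np.arange(12).reshape(6, 2)

# Boolean mask that is True for compounds with a measured result
keep = toy_assay != -1

toy_y = toy_assay[keep]
toy_x = toy_features[keep, :]

print(toy_y)        # [1 0 0 1]
print(toy_x.shape)  # (4, 2)
```

The same mask is applied to both the labels and the features so that rows stay aligned.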

In [74]:
def train_rf_on_assay(assay_array, rf, kfold=5, n_jobs=4):
    """
    Train and measure a random forest model in a cross-validation fashion.
    The model is only trained on compounds which give non-NA results.
    """
    # Filter out NA from the assay_array
    y_index = assay_array != -1
    y = assay_array[y_index]
    x = fp_feature[y_index, :]
    
    if Counter(y)[1] < kfold:
        print("Warning: the total number of positives is less than kfold")
        
    # One-hot encode y array
    #y = np.vstack([[1, 0] if i == 1 else [0, 1] for i in y])
    
    # Build the cross validation scheme
    # Since some assays have an extremely small number of positives,
    # I will use StratifiedKFold to preserve the proportion of positives in each test fold
    
    my_kfold_gen = StratifiedKFold(n_splits=kfold, shuffle=True)
    scoring = ["f1", "accuracy", "precision", "recall", "average_precision", "roc_auc"]
    cv = cross_validate(rf, x, y, scoring=scoring, cv=my_kfold_gen, n_jobs=n_jobs,
                        return_train_score=False)
    
    cv["total_count"] = Counter(y)
    
    return cv
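To illustrate why stratification matters for assays with very few positives, the sketch below splits a made-up label vector with `StratifiedKFold` and counts the positives that land in each test fold. The labels are toy data, and `random_state=0` is added here only so the sketch is reproducible; the function above does not fix a seed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 4 positives among 40 compounds
toy_labels = np.array([1] * 4 + [0] * 36)

# Feature values do not affect the split, only the number of rows does
toy_inputs = np.zeros((40, 3))

# random_state is an addition for reproducibility of this sketch
splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

positives_per_fold = [
    int(toy_labels[test_idx].sum())
    for _, test_idx in splitter.split(toy_inputs, toy_labels)
]

print(positives_per_fold)  # [1, 1, 1, 1]
```

Every test fold receives one positive, so metrics such as recall are defined on each fold. If an assay has fewer positives than folds, some folds receive none, which is the case the warning in `train_rf_on_assay` flags.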

In [86]:
results = {}

rf_classifier = RandomForestClassifier(n_estimators=8000, max_features="log2",
                                       min_samples_leaf=1, class_weight="balanced")

for i in range(output_matrix.shape[1]):
    print(i)
    results[i] = train_rf_on_assay(output_matrix[:,i], rf_classifier, kfold=4)
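Each entry of `results` holds one array of per-fold scores per metric. A minimal sketch of collapsing those arrays into per-assay means is shown below; the helper `summarize_results` and the numbers in `toy_results` are hypothetical and only mimic the shape of the real dictionary.

```python
import numpy as np

# Toy results with the same layout as the per-assay dict (made-up scores)
toy_results = {
    0: {"test_roc_auc": np.array([0.80, 0.90]), "test_f1": np.array([0.0, 0.0])},
    1: {"test_roc_auc": np.array([0.70, 0.60]), "test_f1": np.array([0.75, 0.71])},
}


def summarize_results(results, metrics=("test_roc_auc", "test_f1")):
    """Average each requested metric over the folds of every assay."""
    return {
        assay: {metric: float(np.mean(scores[metric])) for metric in metrics}
        for assay, scores in results.items()
    }


toy_summary = summarize_results(toy_results)

for assay, means in toy_summary.items():
    print(assay, {metric: round(value, 3) for metric, value in means.items()})
```

Non-score entries such as `total_count` are skipped because only the metrics named in `metrics` are averaged.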

In [78]:
np.savez("results.npz", results=results)

In [79]:
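# The results dict is stored as a pickled object array, so recent NumPy
# versions need allow_pickle=True to read it back
tt = np.load("results.npz", allow_pickle=True)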

In [84]:
tt["results"].item()

{0: {'fit_time': array([35.44126105, 36.70299816, 36.27047396, 36.63322711]),
  'score_time': array([12.07667804, 11.70032191, 11.90154123, 11.62328291]),
  'test_f1': array([0., 0., 0., 0.]),
  'test_accuracy': array([0.9871134 , 0.98966408, 0.98963731, 0.98963731]),
  'test_precision': array([0., 0., 0., 0.]),
  'test_recall': array([0., 0., 0., 0.]),
  'test_average_precision': array([0.45146689, 0.31650327, 0.76351351, 0.59364261]),
  'test_roc_auc': array([0.79869452, 0.9151436 , 0.95451571, 0.93520942]),
  'total_count': Counter({0.0: 1530, 1.0: 17})},
 1: {'fit_time': array([25.67761922, 25.42191577, 25.7296598 , 25.69925499]),
  'score_time': array([7.66351986, 7.72381997, 7.6417191 , 7.68871307]),
  'test_f1': array([0.74545455, 0.71028037, 0.76785714, 0.77310924]),
  'test_accuracy': array([0.67058824, 0.63529412, 0.69411765, 0.67857143]),
  'test_precision': array([0.67213115, 0.65517241, 0.68253968, 0.65714286]),
  'test_recall': array([0.83673469, 0.7755102 , 0.87755102, 0