# ML4DD Summer School Hackathon

The final days of the Machine Learning for Drug Discovery summer school end with a hackathon. We will use Polaris to get the associated benchmarks and datasets. First things first, we will install Polaris from PyPI.

We next need to authenticate ourselves to Polaris. If you haven't done so yet, you can create an account at https://polarishub.io. Afterwards, you can simply run the command below.

In [None]:
!polaris login

In [None]:
# @title Set an owner

owner = 'cwognum' # @param {type:"string"}

print(f"You have set \"{owner}\" as the owner")

You have set "cwognum" as the owner


# Solubility Benchmark

The first benchmark we will use is `polaris/adme-fang-solu-1`. The associated page for this benchmark on the Polaris Hub can be found at https://polarishub.io/benchmarks/polaris/adme-fang-solu-1.

In [4]:
import polaris as po
import datamol as dm
import numpy as np

In [5]:
benchmark = po.load_benchmark("polaris/adme-fang-solu-1")

2024-06-20 13:27:19.546 | INFO     | polaris._artifact:_validate_version:66 - The version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (dev).
2024-06-20 13:27:19.577 | INFO     | polaris._artifact:_validate_version:66 - The version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (dev).


We will use Datamol's `dm.to_fp` to directly featurize the inputs.

In [6]:
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)
train[0]

(array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), 1.567849451)

As a model, we will train a simple Random Forest model from scikit-learn.

In [7]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(max_depth=5)
model.fit(train.X, train.y)

Using that model, we can then generate our predictions for the test set.

In [8]:
y_pred = model.predict(test.X)

Next, we evaluate our predictions.

In [10]:
benchmark.evaluate?

In [9]:
results = benchmark.evaluate(y_pred)
results

Test set,Target label,Metric,Score
test,LOG_SOLUBILITY,mean_absolute_error,0.4893312409
test,LOG_SOLUBILITY,mean_squared_error,0.4780744443
test,LOG_SOLUBILITY,r2,0.1182344545
test,LOG_SOLUBILITY,spearmanr,0.36067468
test,LOG_SOLUBILITY,pearsonr,0.4174033474
test,LOG_SOLUBILITY,explained_var,0.1252657093
name,,,
description,,,
tags,,,
user_attributes,,,

0,1
slug,polaris
external_id,org_2gtoaJIVrgRqiIR8Qm5BnpFCbxu
type,organization

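
To make the regression scores above more concrete, here is a minimal sketch of how three of them (mean absolute error, mean squared error, and R²) are conventionally defined, written with plain NumPy. The helper name `regression_scores` and the toy values are illustrative assumptions. This is not Polaris's own implementation, which may differ in detail.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Textbook MAE, MSE and R^2 (hypothetical helper, not part of Polaris)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals**2)

    # R^2 compares the model's squared error to that of always predicting the mean.
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mean_absolute_error": mae, "mean_squared_error": mse, "r2": r2}


# Toy values, made up for illustration only.
toy_scores = regression_scores([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(toy_scores)
```

Under this definition, an R² close to 0 means the model's squared error is only slightly lower than that of always predicting the mean of the test labels.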


There are multiple metadata fields we can fill in to provide additional information about these results.

In [11]:
results.name = "my-first-result"
results.description = "ECFP fingerprints with a Random Forest"

And finally, we can upload our results to the Hub! The results will be private.

In [12]:
results.upload_to_hub(owner=owner);

NameError: name 'owner' is not defined

# Kinase Selectivity

The second benchmark we will use is `polaris/pkis1-kit-wt-mut-c-1`. Using this benchmark is very similar to before, except for one difference: this is a multi-task benchmark.

In [13]:
benchmark = po.load_benchmark("polaris/pkis1-kit-wt-mut-c-1")
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)
train[0]

2024-06-20 13:30:14.490 | INFO     | polaris._artifact:_validate_version:66 - The version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (dev).
2024-06-20 13:30:14.515 | INFO     | polaris._artifact:_validate_version:66 - The version of Polaris that was used to create the artifact (0.0.0) is different from the currently installed version of Polaris (dev).


(array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 {'CLASS_KIT_(T6701_mutant)': 0.0,
  'CLASS_KIT_(V560G_mutant)': 0.0,
  'CLASS_KIT': 0.0})

As we can see, the targets are now returned to us as a dictionary. Let's train a multi-task model on this data! We first preprocess the data to be in a format we can use with scikit-learn.

In [14]:
ys = train.y
ys = np.stack([ys[target] for target in benchmark.target_cols], axis=1)
ys.shape

(277, 3)
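
The stacking step can be illustrated on a toy example. The sketch below assumes, as the output above suggests, that the targets behave like a dictionary mapping each target name to a 1D array. The target names `task_a` and `task_b` and all values are hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for `train.y` and `benchmark.target_cols`.
toy_targets = {
    "task_a": np.array([0.0, 1.0, 0.0]),
    "task_b": np.array([1.0, 0.0, 0.0]),
}
toy_target_cols = ["task_a", "task_b"]

# Stack into a 2D array: one row per molecule, one column per target.
toy_ys = np.stack([toy_targets[t] for t in toy_target_cols], axis=1)
print(toy_ys.shape)  # (3, 2)
```

Iterating over the list of target columns, rather than over the dictionary itself, keeps the column order explicit, which matters later when the predictions are converted back into a dictionary.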

Now that we're working with a multi-task dataset, it's also possible for these arrays to be sparse, meaning that some readouts are missing (NaN). Let's filter out any data points that don't have readouts for _all_ targets.

In [15]:
mask = ~np.any(np.isnan(ys), axis=1)
mask.sum()

276
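
As a small illustration of the masking step, the sketch below applies the same expression to a made-up label matrix in which `np.nan` marks a missing readout. All values are hypothetical.

```python
import numpy as np

# Three molecules, three targets; the second molecule misses one readout.
toy_labels = np.array([
    [0.0, 1.0, 0.0],
    [1.0, np.nan, 0.0],
    [0.0, 0.0, 1.0],
])

# True for rows where every target has a value.
toy_mask = ~np.any(np.isnan(toy_labels), axis=1)
print(toy_mask)  # [ True False  True]

# Boolean indexing keeps only the complete rows.
toy_complete = toy_labels[toy_mask]
print(toy_complete.shape)  # (2, 3)
```

The same mask has to be applied to the features and the labels so that the rows stay aligned.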

In [16]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(max_depth=5)
model.fit(train.X[mask], ys[mask])

In [21]:
y_pred = model.predict(test.X)
y_pred.shape

(87, 3)

In addition to `y_pred`, we also need to specify `y_prob` as this benchmark uses the AUROC measure.

In [22]:
y_prob = model.predict_proba(test.X)
y_prob = np.stack(y_prob, axis=1)
y_prob.shape

(87, 3, 2)
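
The shapes involved here are easy to lose track of. For a multi-output classifier, scikit-learn's `predict_proba` returns a list with one `(n_samples, n_classes)` array per task, which is why the cell above stacks the list along `axis=1`. The sketch below reproduces this with hand-written toy arrays instead of a fitted model. All values are made up.

```python
import numpy as np

# Stand-in for the list returned by `predict_proba`:
# 2 tasks, each with 3 samples and 2 classes.
toy_proba_per_task = [
    np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]),  # task 0
    np.array([[0.2, 0.8], [0.5, 0.5], [0.6, 0.4]]),  # task 1
]

toy_y_prob = np.stack(toy_proba_per_task, axis=1)
print(toy_y_prob.shape)  # (3, 2, 2): samples, tasks, classes

# Index 1 on the last axis selects the probability of the positive class,
# assuming the classes are ordered as [0, 1].
toy_positive = toy_y_prob[:, :, 1]
print(toy_positive.shape)  # (3, 2)
```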

Polaris expects a dictionary, so let's convert our results again.

In [23]:
y_pred = {k: y_pred[:, idx] for idx, k in enumerate(benchmark.target_cols)}
y_prob = {k: y_prob[:, idx, 1] for idx, k in enumerate(benchmark.target_cols)}

And let's evaluate our predictions!

In [24]:
benchmark.evaluate(y_pred=y_pred, y_prob=y_prob)

Test set,Target label,Metric,Score
test,CLASS_KIT_(T6701_mutant),accuracy,0.8390804598
test,CLASS_KIT_(V560G_mutant),accuracy,0.8620689655
test,CLASS_KIT,accuracy,0.6206896552
test,CLASS_KIT_(T6701_mutant),f1,0.0
test,CLASS_KIT_(V560G_mutant),f1,0.0
test,CLASS_KIT,f1,0.0
test,CLASS_KIT_(T6701_mutant),roc_auc,0.6834637965
test,CLASS_KIT_(V560G_mutant),roc_auc,0.7055555556
test,CLASS_KIT,roc_auc,0.7817059484
test,CLASS_KIT_(T6701_mutant),pr_auc,0.3456398643

0,1
slug,polaris
external_id,org_2gtoaJIVrgRqiIR8Qm5BnpFCbxu
type,organization



Although this works, we're not required to train a multi-task model. Polaris doesn't impose any restrictions on the methodology. For example, you could also train one single-task model per target.

In [None]:
from sklearn.ensemble import RandomForestClassifier

models = {target: RandomForestClassifier(max_depth=5) for target in benchmark.target_cols}
X = train.X

for target, model in models.items():
    y = train.y[target]
    mask = ~np.isnan(y)
    model.fit(X[mask], y[mask])

y_prob = {target: model.predict_proba(test.X)[:, 1] for target, model in models.items()}
y_pred = {target: model.predict(test.X) for target, model in models.items()}

results = benchmark.evaluate(y_pred=y_pred, y_prob=y_prob)

Finally, let's upload our results to the Hub again!

In [None]:
results.name = "my-second-result"
results.description = "ECFP fingerprints with a Random Forest"

In [None]:
results.upload_to_hub(owner=owner);

2024-06-20 13:00:01.006 | SUCCESS  | polaris.hub.client:upload_results:492 - Your result has been successfully uploaded to the Hub. View it here: https://polarishub.io/benchmarks/polaris/pkis1-kit-wt-mut-c-1/tWYuqlNYeoXeqKTKNN82D


The End.