This notebook is based on the [Kaggle competition](https://www.kaggle.com/c/lish-moa) on predicting drugs' mechanisms of action.

Notebooks and repositories that inspired it:

1) [Oleksandr Sirenko](https://github.com/oleksandrsirenko/mechanisms-of-action-moa-prediction#inbox_tray-how-to-get-data)

2) [Kushal Agrawal](https://www.kaggle.com/kushal1506/moa-prediction-complete-walkthrough-eda-ensemble/notebook)

3) [Aavajanar](https://github.com/aavajanar/KAGGLE-MOA)

# Background
The Connectivity Map, a project within the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS) present this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.

Task: Predict multiple Mechanism of Action (MoA) targets for each sample (`sig_id`), given inputs such as gene expression data and cell viability data.

Two notes:
- The training data includes an optional set of MoA labels that are not included in the test data and are not used for scoring.
- The re-run dataset has approximately 4x the number of examples seen in the public test set.

## What is the Mechanism of Action (MoA) of a drug? And why is it important?
In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol (known in the US as acetaminophen), were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanisms of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action, or MoA for short.

## How do we determine the MoAs of a new drug?
One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs.

The data combines gene expression and cell viability measurements. It is based on a new technology that simultaneously measures (within the same samples) human cells' responses to drugs in a pool of 100 different cell types, which solves the problem of identifying ex ante which cell types are better suited for a given drug. In addition, the dataset provides MoA annotations for more than 5,000 drugs.

## How to evaluate the accuracy of a solution?
Based on the MoA annotations, the accuracy of solutions will be evaluated on the average value of the logarithmic loss function applied to each drug-MoA annotation pair.
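
As a rough illustration of this metric, the sketch below computes the log loss averaged over every sample-target pair with NumPy. The function name `mean_columnwise_log_loss`, the clipping constant `eps`, and the toy arrays are assumptions made for this example; they are not taken from the competition code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over every (sample, target) pair."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so the logarithm stays finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_pair = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Mean over samples within each target column, then mean over columns.
    return per_pair.mean(axis=0).mean()

# Toy example: 3 samples, 2 targets (made-up values).
y_true = np.array([[1, 0], [0, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.1], [0.1, 0.6]])
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 4))
```

Lower is better: confident wrong predictions are penalized heavily, which is why predictions are usually kept away from exactly 0 or 1.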

## Data
- `train_features.csv` - Features for the training set. Columns prefixed `g-` are gene expression data, and columns prefixed `c-` are cell viability data. `cp_type` indicates samples treated with a compound (`trt_cp`) or with a control perturbation (`ctl_vehicle`); control perturbations have no MoAs. `cp_time` and `cp_dose` indicate treatment duration (24, 48, or 72 hours) and dose (`D1` or `D2`, i.e. high or low).
- `train_targets_scored.csv` - The binary MoA targets that are scored.
- `train_targets_nonscored.csv` - Additional (optional) binary MoA responses for the training data. These are neither predicted nor scored.
- `test_features.csv` - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.
- `sample_submission.csv` - A submission file in the correct format.
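
The column naming convention above makes it easy to separate the feature groups by prefix. The following is a minimal sketch on a hypothetical, shortened column list; the variable names `gene_cols`, `cell_cols`, and `meta_cols` are assumptions for illustration.

```python
# Hypothetical, shortened column list; the real file has many more g- and c- columns.
columns = ["sig_id", "cp_type", "cp_time", "cp_dose", "g-0", "g-1", "g-2", "c-0", "c-1"]

# Split the columns by prefix: gene expression, cell viability, and everything else.
gene_cols = [c for c in columns if c.startswith("g-")]
cell_cols = [c for c in columns if c.startswith("c-")]
meta_cols = [c for c in columns if c not in gene_cols and c not in cell_cols]

print(gene_cols)
print(cell_cols)
print(meta_cols)
```

With the real data, the column index of the loaded features DataFrame can be used in place of the hard-coded list.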

In [None]:
! pip install tensorflow-addons
#! unzip lish-moa.zip



In [None]:
import numpy as np 
import pandas as pd 
import os

from sklearn.model_selection import KFold
from sklearn.metrics import log_loss
 
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow.keras.backend as K
import tensorflow.keras.layers as L
import tensorflow.keras.models as M


In [None]:
# Load the datasets
test_df = pd.read_csv('test_features.csv')
train_df = pd.read_csv('train_features.csv')
train_target_df = pd.read_csv('train_targets_scored.csv')
sub = pd.read_csv('sample_submission.csv')

target_cols = train_target_df.columns[1:]
N_TARGETS = len(target_cols)

In [None]:
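# Use the first 10,000 rows as a quick example; subset the targets too so features and labels stay aligned
test_df = test_df.iloc[0:10000, :]
train_df = train_df.iloc[0:10000, :]
train_target_df = train_target_df.iloc[0:10000, :]
sub = sub.iloc[0:10000, :]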

In [None]:
train_df.head()

Unnamed: 0,sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,g-6,g-7,g-8,g-9,g-10,g-11,g-12,g-13,g-14,g-15,g-16,g-17,g-18,g-19,g-20,g-21,g-22,g-23,g-24,g-25,g-26,g-27,g-28,g-29,g-30,g-31,g-32,g-33,g-34,g-35,...,c-60,c-61,c-62,c-63,c-64,c-65,c-66,c-67,c-68,c-69,c-70,c-71,c-72,c-73,c-74,c-75,c-76,c-77,c-78,c-79,c-80,c-81,c-82,c-83,c-84,c-85,c-86,c-87,c-88,c-89,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
0,id_000644bb2,trt_cp,24,D1,1.062,0.5577,-0.2479,-0.6208,-0.1944,-1.012,-1.022,-0.0326,0.5548,-0.0921,1.183,0.153,0.5574,-0.4015,0.1789,-0.6528,-0.7969,0.6342,0.1778,-0.3694,-0.5688,-1.136,-1.188,0.694,0.4393,0.2664,0.1907,0.1628,-0.2853,0.5819,0.2934,-0.5584,-0.0916,-0.301,-0.1537,0.2198,...,0.4805,0.4965,0.368,0.8427,0.1042,0.1403,0.1758,1.257,-0.5979,1.225,-0.0553,0.7351,0.581,0.959,0.2427,0.0495,0.4141,0.8432,0.6162,-0.7318,1.212,0.6362,-0.4427,0.1288,1.484,0.1799,0.5367,-0.1111,-1.012,0.6685,0.2862,0.2584,0.8076,0.5523,-0.1912,0.6584,-0.3981,0.2139,0.3801,0.4176
1,id_000779bfc,trt_cp,72,D1,0.0743,0.4087,0.2991,0.0604,1.019,0.5207,0.2341,0.3372,-0.4047,0.8507,-1.152,-0.4201,-0.0958,0.459,0.0803,0.225,0.5293,0.2839,-0.3494,0.2883,0.9449,-0.1646,-0.2657,-0.3372,0.3135,-0.4316,0.4773,0.2075,-0.4216,-0.1161,-0.0499,-0.2627,0.9959,-0.2483,0.2655,-0.2102,...,0.4083,0.0319,0.3905,0.7099,0.2912,0.4151,-0.284,-0.3104,-0.6373,0.2887,-0.0765,0.2539,0.4443,0.5932,0.2031,0.7639,0.5499,-0.3322,-0.0977,0.4329,-0.2782,0.7827,0.5934,0.3402,0.1499,0.442,0.9366,0.8193,-0.4236,0.3192,-0.4265,0.7543,0.4708,0.023,0.2957,0.4899,0.1522,0.1241,0.6077,0.7371
2,id_000a6266a,trt_cp,48,D1,0.628,0.5817,1.554,-0.0764,-0.0323,1.239,0.1715,0.2155,0.0065,1.23,-0.4797,-0.5631,-0.0366,-1.83,0.6057,-0.3278,0.6042,-0.3075,-0.1147,-0.057,-0.0799,-0.8181,-1.532,0.2307,0.4901,0.478,-1.397,4.624,-0.0437,1.287,-1.853,0.6069,0.429,0.1783,0.0018,-1.18,...,-0.5477,-0.7576,-0.0444,0.1894,-0.0014,-2.364,-0.4682,0.121,-0.5177,-0.0604,0.1682,-0.4436,0.4963,0.1363,0.3335,0.976,-0.0427,-0.1235,0.0959,0.069,-0.9416,-0.7548,-0.1109,-0.6272,0.3019,0.1172,0.1093,-0.3113,0.3019,-0.0873,-0.725,-0.6297,0.6103,0.0223,-1.324,-0.3174,-0.6417,-0.2187,-1.408,0.6931
3,id_0015fd391,trt_cp,48,D1,-0.5138,-0.2491,-0.2656,0.5288,4.062,-0.8095,-1.959,0.1792,-0.1321,-1.06,-0.8269,-0.3584,-0.8511,-0.5844,-2.569,0.8183,-0.0532,-0.8554,0.116,-2.352,2.12,-1.158,-0.7191,-0.8004,-1.467,-0.0107,-0.8995,0.2406,-0.2479,-1.089,-0.7575,0.0881,-2.737,0.8745,0.5787,-1.674,...,-2.122,-0.3752,-2.382,-3.735,-2.974,-1.493,-1.66,-3.166,0.2816,-0.299,-1.187,-0.5044,-1.775,-1.612,-0.9215,-1.081,-3.052,-3.447,-2.774,-1.846,-0.5568,-3.396,-2.951,-1.155,-3.262,-1.539,-2.46,-0.9417,-1.555,0.2431,-2.099,-0.6441,-5.63,-1.378,-0.8632,-1.288,-1.621,-0.8784,-0.3876,-0.8154
4,id_001626bd3,trt_cp,72,D2,-0.3254,-0.4009,0.97,0.6919,1.418,-0.8244,-0.28,-0.1498,-0.8789,0.863,-0.2219,-0.5121,-0.9577,1.175,0.2042,0.197,0.1244,-1.709,-0.3543,-0.516,-0.333,-0.2685,0.7649,0.2057,1.372,0.6835,0.8056,-0.3754,-1.209,0.2965,-0.0712,0.6389,0.6674,-0.0783,1.174,-0.711,...,-0.2274,0.3215,0.1535,-0.464,-0.5943,0.3973,0.15,0.5178,0.5159,0.6091,0.1813,-0.4249,0.7832,0.6529,0.5648,0.4817,0.0587,0.5303,0.6376,-0.3966,-1.495,-0.9625,-0.0541,0.6273,0.4563,0.0698,0.8134,0.1924,0.6054,-0.1824,0.0042,0.0048,0.667,1.069,0.5523,-0.3031,0.1094,0.2885,-0.3786,0.7125


In [None]:
test_df.head()

Unnamed: 0,sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,g-6,g-7,g-8,g-9,g-10,g-11,g-12,g-13,g-14,g-15,g-16,g-17,g-18,g-19,g-20,g-21,g-22,g-23,g-24,g-25,g-26,g-27,g-28,g-29,g-30,g-31,g-32,g-33,g-34,g-35,...,c-60,c-61,c-62,c-63,c-64,c-65,c-66,c-67,c-68,c-69,c-70,c-71,c-72,c-73,c-74,c-75,c-76,c-77,c-78,c-79,c-80,c-81,c-82,c-83,c-84,c-85,c-86,c-87,c-88,c-89,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
0,id_0004d9e33,trt_cp,24,D1,-0.5458,0.1306,-0.5135,0.4408,1.55,-0.1644,-0.214,0.2221,-0.326,1.939,-0.2305,-0.367,1.304,1.461,0.0043,0.6816,-0.2304,-0.0635,-0.203,-0.6821,-0.6242,0.1297,-0.0338,0.3372,0.2254,0.4795,0.7642,0.6638,-0.248,-0.1183,-0.4847,-0.0179,-0.8204,-0.5296,-1.507,-0.0144,...,-0.1353,0.0494,0.8939,0.227,0.2876,-0.3065,0.6519,-0.8156,-1.496,0.3796,0.0877,-1.023,-0.0206,-0.4149,-0.6258,-0.2688,0.4403,-0.49,0.291,0.0473,-0.0914,0.3087,-0.0612,-0.9128,-0.9399,0.0173,0.0519,-0.0035,-0.5184,-0.3485,0.0981,0.7978,-0.143,-0.2067,-0.2303,-0.1193,0.021,-0.0502,0.151,-0.775
1,id_001897cda,trt_cp,72,D1,-0.1829,0.232,1.208,-0.4522,-0.3652,-0.3319,-1.882,0.4022,-0.3528,0.1271,0.9303,0.3173,-1.012,-0.3213,0.0607,-0.5389,-0.803,-1.06,-0.0978,-0.8156,-0.6514,0.6812,0.5246,0.0,0.503,-0.15,-0.1433,2.091,-0.6556,-0.6012,-0.4104,-0.058,-0.3608,0.2197,-0.7101,1.343,...,-0.7458,0.0458,-0.3644,-1.818,-0.0358,-0.7925,-0.2693,-0.0938,-0.1833,-0.7402,-1.409,0.1987,0.046,-1.352,-0.3445,-0.0909,-0.6337,-0.5788,-0.7885,0.0996,-1.948,-1.272,-0.7223,-0.5838,-1.362,-0.7671,0.4881,0.5913,-0.4333,0.1234,-0.119,-0.1852,-1.031,-1.367,-0.369,-0.5382,0.0359,-0.4764,-1.381,-0.73
2,id_002429b5b,ctl_vehicle,24,D1,0.1852,-0.1404,-0.3911,0.131,-1.438,0.2455,-0.339,-0.3206,0.6944,0.5837,-0.0553,-0.6222,2.543,-0.7857,0.8163,-0.0495,0.1806,1.029,-0.5204,-1.107,0.7365,-0.3835,-0.5771,0.0523,-0.269,0.1674,0.601,-0.666,0.0276,0.0924,0.2785,-0.3943,-0.4602,-0.0673,-1.342,0.3127,...,0.4369,-1.496,1.239,-1.222,0.6624,-0.7336,-0.5248,0.0727,0.1455,0.5364,-0.0823,0.5734,0.4876,0.7088,1.075,0.4689,1.087,-0.5036,-0.3451,0.5087,1.11,0.7886,0.2093,-0.4617,1.487,0.1985,1.175,-0.5693,0.5062,-0.1925,-0.2261,0.337,-1.384,0.8604,-1.953,-1.014,0.8662,1.016,0.4924,-0.1942
3,id_00276f245,trt_cp,24,D2,0.4828,0.1955,0.3825,0.4244,-0.5855,-1.202,0.5998,-0.1799,0.9365,0.2942,1.399,-1.108,1.161,-0.8134,0.0743,-0.3096,-0.9956,1.71,1.22,-0.1258,-0.59,-0.7956,-0.6765,0.0,-0.882,0.7609,-0.0882,0.5217,0.9587,-0.4764,0.269,0.4753,0.0196,0.2775,-0.7142,0.7079,...,-1.219,-0.5564,-0.2831,0.5902,-0.2881,-0.0013,-0.4036,-0.4076,-0.5593,-0.5505,-0.4806,-0.0846,0.5884,0.231,0.3956,0.6495,-0.2511,-0.2207,0.1274,-0.5378,0.5649,-0.1231,0.1586,-0.426,-0.6815,-0.4753,0.2611,-1.178,0.1909,-1.232,0.126,0.157,-0.1784,-1.12,-0.4325,-0.9005,0.8131,-0.1305,0.5645,-0.5809
4,id_0027f1083,trt_cp,48,D1,-0.3979,-1.268,1.913,0.2057,-0.5864,-0.0166,0.5128,0.6365,0.2611,-1.112,-0.1747,-0.2053,-0.0006,0.0254,0.1784,-0.1314,1.07,0.0477,-0.0756,0.1287,0.1403,0.7609,-0.1095,0.0,-1.095,-0.078,0.9048,-0.3007,0.4351,-0.1558,0.2101,0.485,0.1139,0.6745,-0.9101,-0.669,...,0.6412,0.2123,0.0646,0.5924,0.5284,-0.4421,0.6547,-0.8098,0.9812,-0.0349,0.7816,1.169,0.414,1.025,0.8642,0.5993,0.2091,-0.1521,1.561,-0.677,0.36,-0.1436,-0.0949,1.064,0.8321,0.9727,-0.5567,0.224,0.8949,0.8668,0.4965,0.7578,-0.158,1.051,0.5742,1.09,-0.2962,-0.5313,0.9931,1.838


### Basic Setup and helpers


In [None]:
SEED = 1234
EPOCHS = 4
BATCH_SIZE = 16
FOLDS = 3
REPEATS = 2
LR = 0.05
N_TARGETS = len(target_cols)

In [None]:
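def seed_everything(seed):
  """Seed the Python, NumPy, and TensorFlow random number generators."""
  import random
  random.seed(seed)
  np.random.seed(seed)
  os.environ['PYTHONHASHSEED'] = str(seed)
  tf.random.set_seed(seed)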

In [None]:
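def multi_log_loss(y_true, y_pred):
  """Mean of the per-target binary log losses."""
  losses = []
  for col in y_true.columns:
    # labels=[0, 1] keeps log_loss defined for targets with no positive rows
    losses.append(log_loss(y_true.loc[:, col], y_pred.loc[:, col], labels=[0, 1]))
  return np.mean(losses)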

### Encode Categoricals to Binary

In [None]:
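def preprocess_df(data):
  """Encode the two categorical columns as 0/1 flags."""
  data = data.copy()  # leave the caller's DataFrame untouched
  data['cp_type'] = (data['cp_type'] == 'trt_cp').astype(int)  # 1 = compound, 0 = control
  data['cp_dose'] = (data['cp_dose'] == 'D2').astype(int)      # 1 = D2, 0 = D1
  return data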

In [None]:
x_train = preprocess_df(train_df.drop(columns="sig_id"))
x_test = preprocess_df(test_df.drop(columns="sig_id"))
y_train = train_target_df.drop(columns="sig_id")
N_FEATURES = x_train.shape[1]

### Define Model Architecture


In [None]:
def create_model():
  # Three 2048-unit hidden blocks, each preceded by batch normalization and dropout
  model = tf.keras.Sequential([
      tf.keras.layers.Input(shape=(N_FEATURES,)),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dropout(0.2),
      tfa.layers.WeightNormalization(tf.keras.layers.Dense(2048, activation='relu')),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dropout(0.5),
      tfa.layers.WeightNormalization(tf.keras.layers.Dense(2048, activation='relu')),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dropout(0.4),
      tfa.layers.WeightNormalization(tf.keras.layers.Dense(2048, activation='relu')),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.Dropout(0.5),
      # One sigmoid output per scored MoA target (multi-label classification)
      tfa.layers.WeightNormalization(tf.keras.layers.Dense(N_TARGETS, activation='sigmoid')),
  ])
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model

### Main CV and Model Training Function


In [None]:
def build_train(resume_models = None, repeat_number = 0, folds = 3, skip_folds = 0):
  models = []
  oof_preds = y_train.copy().astype(float)  # float so predicted probabilities can be stored

  kfold = KFold(folds, shuffle = True)
  for fold, (train_ind, val_ind) in enumerate(kfold.split(x_train)):
    print('\n')
    print('-'*50)
    print(f'Training fold {fold + 1}')

    # Lower the learning rate when the validation loss stops improving
    cb_lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(monitor = 'val_loss',
                                                          factor = 0.4,
                                                          patience = 2,
                                                          verbose = 1,
                                                          min_delta = 0.0001,
                                                          mode = 'auto')
    checkpoint_path = f'repeat_{repeat_number}_fold_{fold}.hdf5'  # no colons, which are invalid in Windows filenames
    cb_checkpt = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, 
                                                    monitor = 'val_loss',
                                                    verbose = 0,
                                                    save_best_only = True,
                                                    save_weights_only = True,
                                                    mode = 'min')
    model = create_model()
    model.fit(x_train.values[train_ind],
              y_train.values[train_ind],
              validation_data = (x_train.values[val_ind], y_train.values[val_ind]),
              callbacks = [cb_lr_schedule, cb_checkpt],
              epochs = EPOCHS, batch_size = BATCH_SIZE, verbose = 2
              )
    model.load_weights(checkpoint_path)
    oof_preds.loc[val_ind, :] = model.predict(x_train.values[val_ind])
    models.append(model)

  return models, oof_preds
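
The out-of-fold (OOF) idea used in `build_train` can be sketched without a neural network. In the hypothetical example below, the "model" just predicts each target's mean rate in the training rows, and the fold split is done by hand with NumPy instead of `KFold`; all names and data are made up for illustration. The point is that every row receives exactly one prediction, from a model that never saw it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_targets, n_folds = 12, 3, 3

# Made-up binary targets standing in for the MoA labels.
y = rng.integers(0, 2, size=(n_samples, n_targets)).astype(float)

# Shuffle the row indices once and cut them into n_folds validation blocks.
folds = np.array_split(rng.permutation(n_samples), n_folds)

oof = np.full_like(y, np.nan)
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(n_samples), val_idx)
    # Stand-in "model": predict each target's mean rate in the training rows.
    prediction = y[train_idx].mean(axis=0)
    oof[val_idx] = prediction  # only rows the model never saw are filled

# False means every row received an out-of-fold prediction.
print(np.isnan(oof).any())
```

Because each OOF prediction comes from a model trained without that row, scoring `oof` against `y` gives an estimate of performance on unseen data.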

In [None]:
models = []
oof_preds = []
seed_everything(SEED)
for i in range(REPEATS):
  m, oof = build_train(repeat_number=i, folds = FOLDS)
  models = models + m
  oof_preds.append(oof)



--------------------------------------------------
Training fold 1
Epoch 1/4
417/417 - 109s - loss: 0.0303 - accuracy: 0.0332 - val_loss: 0.0210 - val_accuracy: 0.0678 - lr: 0.0500 - 109s/epoch - 262ms/step
Epoch 2/4
417/417 - 104s - loss: 0.0209 - accuracy: 0.0432 - val_loss: 0.0192 - val_accuracy: 0.0567 - lr: 0.0500 - 104s/epoch - 251ms/step
Epoch 3/4
417/417 - 107s - loss: 0.0194 - accuracy: 0.0636 - val_loss: 0.0186 - val_accuracy: 0.0699 - lr: 0.0500 - 107s/epoch - 256ms/step
Epoch 4/4
417/417 - 106s - loss: 0.0186 - accuracy: 0.0722 - val_loss: 0.0181 - val_accuracy: 0.0840 - lr: 0.0500 - 106s/epoch - 253ms/step


--------------------------------------------------
Training fold 2
Epoch 1/4
417/417 - 106s - loss: 0.0293 - accuracy: 0.0274 - val_loss: 0.0217 - val_accuracy: 0.0288 - lr: 0.0500 - 106s/epoch - 255ms/step
Epoch 2/4
417/417 - 105s - loss: 0.0207 - accuracy: 0.0391 - val_loss: 0.0195 - val_accuracy: 0.0555 - lr: 0.0500 - 105s/epoch - 252ms/step
Epoch 3/4
417/417 - 10

In [None]:
mean_oof_preds = y_train.copy()
mean_oof_preds.loc[:, target_cols] = 0
for i, p in enumerate(oof_preds):
    print(f"Repeat {i + 1} OOF Log Loss: {multi_log_loss(y_train, p)}")
    mean_oof_preds.loc[:, target_cols] += p[target_cols]

mean_oof_preds.loc[:, target_cols] /= len(oof_preds)
print(f"Mean OOF Log Loss: {multi_log_loss(y_train, mean_oof_preds)}")
# mean_oof_preds.loc[x_train['cp_type'] == 0, target_cols] = 0
# print(f"Mean OOF Log Loss (ctl adjusted): {multi_log_loss(y_train, mean_oof_preds)}")

Repeat 1 OOF Log Loss: 0.007718477756900262
Repeat 2 OOF Log Loss: 0.0076175487503934965
Mean OOF Log Loss: 0.007341890669705138


In [None]:
test_preds = sub.copy()
test_preds[target_cols] = 0
for model in models:
    test_preds.loc[:,target_cols] += model.predict(x_test)
test_preds.loc[:,target_cols] /= len(models)
test_preds.loc[x_test['cp_type'] == 0, target_cols] = 0
test_preds.to_csv('submission.csv', index=False)
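
The averaging and control-row step in the cell above can be illustrated on toy arrays. This is only a sketch with made-up predictions from three hypothetical models; `cp_type` uses the same 0/1 encoding produced by `preprocess_df`.

```python
import numpy as np

# Made-up predictions from three models for 4 test rows and 2 targets.
model_preds = [
    np.array([[0.2, 0.1], [0.6, 0.3], [0.4, 0.2], [0.1, 0.9]]),
    np.array([[0.4, 0.1], [0.2, 0.3], [0.6, 0.2], [0.1, 0.7]]),
    np.array([[0.3, 0.4], [0.1, 0.3], [0.5, 0.2], [0.1, 0.8]]),
]
# 1 = treated with a compound, 0 = control vehicle.
cp_type = np.array([1, 1, 0, 1])

# Average the models, then zero the control rows, which have no MoA.
ensemble = np.mean(model_preds, axis=0)
ensemble[cp_type == 0] = 0.0

print(ensemble)
```

Zeroing the control rows relies on the data description's statement that control perturbations have no MoAs.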