<div style="width: 100%; clear: both;">
<div style="float: left; width: 50%;">
<img src="http://www.uoc.edu/portal/_resources/common/imatges/marca_UOC/UOC_Masterbrand.jpg" align="left">
</div>
<div style="float: right; width: 50%;">
<p style="margin: 0; padding-top: 22px; text-align:right;">M2.880 · TFM · Area 3, classroom 1</p>
<p style="margin: 0; text-align:right;">2021-2 · University Master's Degree in Data Science</p>
<p style="margin: 0; text-align:right; padding-bottom: 100px;">Faculty of Computer Science, Multimedia and Telecommunications

</p>
</div>
</div>
<div style="width:100%;">&nbsp;</div>


# TFM (Master's thesis):

## Recursion cellular image classification:

The cost of some drugs and medical treatments has risen so much in recent years that many patients have to go without them. One of the more surprising reasons for this cost is how long it takes to bring new treatments to market. Despite improvements in technology and science, research and development continues to lag. In fact, finding a new treatment takes, on average, more than 10 years and costs hundreds of millions of dollars.

Recursion Pharmaceuticals, creators of the industry's largest dataset of biological images, generated entirely in-house, believes that AI has the potential to dramatically improve and speed up the drug discovery process. More specifically, these efforts could help them understand how drugs interact with human cells.

This project has to disentangle experimental noise from real biological signals. The proposed approach classifies images of cells subjected to one of 1,108 different genetic perturbations, which helps remove the noise introduced by technical execution and by environmental variation between experiments.

If successful, this could dramatically improve the industry's ability to model cellular images according to their relevant biology. In turn, applying AI could greatly reduce the cost of treatments and ensure that they reach patients faster.


The project presented here is the Kaggle challenge hosted at https://www.kaggle.com/c/recursion-cellular-image-classification. One of the main difficulties in applying AI to biological microscopy data is that even the most careful replicates of a process will not look identical. This dataset poses the challenge of developing a replicate-identification model that is robust to experimental noise.

The same siRNAs (effectively, genetic perturbations) have been applied repeatedly to multiple cell lines, for a total of 51 experimental batches. Each batch has four plates, each of which has 308 filled wells. For each well, microscope images were taken at two sites and across six imaging channels. Not every batch necessarily has every well filled or every siRNA present.
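
The figures quoted above imply an upper bound on the dataset size. The short calculation below is only a sketch of that arithmetic; the constant names are illustrative, and the real counts are lower because not every batch has every well filled.

```python
# Upper bound on dataset size, using only the figures quoted in the text.
BATCHES = 51            # experimental batches
PLATES_PER_BATCH = 4    # plates in each batch
WELLS_PER_PLATE = 308   # filled wells per plate
SITES_PER_WELL = 2      # imaging sites per well
CHANNELS = 6            # imaging channels per site

max_wells = BATCHES * PLATES_PER_BATCH * WELLS_PER_PLATE
max_site_images = max_wells * SITES_PER_WELL
max_channel_files = max_site_images * CHANNELS

print(max_wells, max_site_images, max_channel_files)
```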

This description has been condensed to the essentials; for more details, see [RxRx.ai](https://www.rxrx.ai).


**The main goal of this work is to develop machine learning models that classify, as accurately as possible, the genetic perturbations applied to the cells.**

In [15]:
#%pylab inline
import sys
import os
import cv2
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
import numpy as np # linear algebra
import random
warnings.filterwarnings('ignore')


In [16]:
import skimage.io
from skimage.transform import resize
from imgaug import augmenters as iaa
from tqdm import tqdm
import PIL
from PIL import Image, ImageOps
from sklearn.utils import class_weight, shuffle
from keras.losses import binary_crossentropy, categorical_crossentropy
from keras.applications.densenet import preprocess_input
import keras.backend as K
import tensorflow as tf
from sklearn.metrics import f1_score, fbeta_score, cohen_kappa_score
from tensorflow.keras.utils import Sequence
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [17]:
from keras.callbacks import EarlyStopping
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, load_model
from keras.layers import (Activation, Dropout, Flatten, Dense, GlobalMaxPooling2D, GlobalAveragePooling2D,
                          BatchNormalization, Input, Conv2D)
from keras.applications.densenet import DenseNet121
from keras.callbacks import ModelCheckpoint
from keras import metrics
from tensorflow.keras.optimizers import Adam, Nadam 
from keras import backend as K
import keras
from keras.models import Model

In [18]:
# All relevant imports
import tensorflow as tf
import keras
import tensorflow_addons as tfa

In [19]:
print("TensorFlow version:", tf.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print(tf.test.gpu_device_name())
tf.config.list_physical_devices('GPU')

TensorFlow version: 2.6.3
Num GPUs Available:  0



[]

In [20]:
!git clone https://github.com/recursionpharma/rxrx1-utils

fatal: destination path 'rxrx1-utils' already exists and is not an empty directory.


In [21]:
sys.path.append('rxrx1-utils')
import rxrx.io as rio

### Constants

In [27]:
SIZE = 224 # size of images
NUM_CLASSES = 1108

root_dir_train = '../input/recursion-cellular-image-classification-224-jpg/train/train/'
root_dir_test = '../input/recursion-cellular-image-classification-224-jpg/test/test/'

In [None]:
df_train = pd.read_csv('../input/recursion-cellular-image-classification-224-jpg/new_train.csv')
df_test = pd.read_csv('../input/recursion-cellular-image-classification-224-jpg/new_test.csv')

In [None]:
df_train['cell_type'] = df_train['experiment'].str.split('-').str[0]
df_test['cell_type'] = df_test['experiment'].str.split('-').str[0]
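
The `cell_type` column is simply the part of the experiment name before the hyphen. The plain-Python sketch below shows the same split and a grouping by cell type; the helper names and the `RPE-03` example are hypothetical. The per-cell-type frames used later (`df_train_hepg2` and so on) are not defined in this section; one plausible way to build them, stated here as an assumption, is to filter on this column, for example `df_train[df_train['cell_type'] == 'HEPG2']`.

```python
from collections import defaultdict

def cell_type_of(experiment: str) -> str:
    """Return the cell line encoded before the first hyphen, e.g. 'HEPG2-08' -> 'HEPG2'."""
    return experiment.split("-")[0]

# Illustrative experiment names (RPE-03 is a made-up example).
experiments = ["HEPG2-08", "HEPG2-09", "HUVEC-17", "RPE-03", "U2OS-05"]

# Group experiments by cell type, mirroring a filter on the 'cell_type' column.
by_cell_type = defaultdict(list)
for exp in experiments:
    by_cell_type[cell_type_of(exp)].append(exp)

print(dict(by_cell_type))
```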

### imgaug augmentation pipeline

In [46]:
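# https://github.com/aleju/imgaug
# Apply the wrapped augmenter to roughly half of the images.
sometimes = lambda aug: iaa.Sometimes(0.5, aug)
seq = iaa.Sequential([
    sometimes(
        # Pick exactly one mild intensity perturbation.
        iaa.OneOf([
            iaa.Add((-10, 10), per_channel=0.5),
            iaa.Multiply((0.9, 1.1), per_channel=0.5),
            # Note: newer imgaug releases deprecate ContrastNormalization in favour of iaa.LinearContrast.
            iaa.ContrastNormalization((0.9, 1.1), per_channel=0.5)
        ])
    ),
    iaa.Fliplr(0.5),             # horizontal flip with probability 0.5
    iaa.Crop(percent=(0, 0.1)),  # crop between 0% and 10% from the image sides
], random_order=True)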

### Custom data generator

In [50]:
class My_Generator(Sequence):

    def __init__(self, image_filenames, labels, batch_size, is_train=True, augment=False, root_dir= '../input/recursion-cellular-image-classification-224-jpg/train/train/'):
        
        self.image_filenames, self.labels = image_filenames, labels
        self.batch_size = batch_size
        self.is_train = is_train
        self.is_augment = augment
        self.root_dir = root_dir
        if(self.is_train):
            self.on_epoch_end()
    

    def __len__(self):
        return int(np.ceil(len(self.image_filenames) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.image_filenames[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]

        if(self.is_train):
            return self.train_generate(batch_x, batch_y)
        return self.valid_generate(batch_x, batch_y)

    def on_epoch_end(self):
        
        if(self.is_train):
            self.image_filenames, self.labels = shuffle(self.image_filenames, self.labels)
        else:
            pass

    def train_generate(self, batch_x, batch_y):
        
        batch_images = []
        
        for (sample, label) in zip(batch_x, batch_y):
            
            img = cv2.imread(self.root_dir + sample)

            if(self.is_augment):
                img = seq.augment_image(img)

            batch_images.append(img)
            
        batch_images = np.array(batch_images, np.float32)/255
        batch_y = np.array(batch_y, np.float32)
        
        return batch_images, batch_y

    def valid_generate(self, batch_x, batch_y):
        
        batch_images = []
        
        for (sample, label) in zip(batch_x, batch_y):
            img = cv2.imread(self.root_dir + sample)
            batch_images.append(img)
            
        batch_images = np.array(batch_images, np.float32)/255
        
        batch_y = np.array(batch_y, np.float32)
        return batch_images, batch_y
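
The batching logic in `__len__` and `__getitem__` can be checked in isolation. The sketch below reproduces only the index arithmetic (no images are loaded); `batch_slices` is a hypothetical helper, not part of the notebook.

```python
import math

def batch_slices(n_samples: int, batch_size: int):
    """Return the (start, stop) index pairs that a Sequence-style generator would serve."""
    n_batches = math.ceil(n_samples / batch_size)  # same rule as __len__
    return [
        (i * batch_size, min((i + 1) * batch_size, n_samples))  # same window as __getitem__
        for i in range(n_batches)
    ]

slices = batch_slices(20, 8)
print(slices)
```

The last batch is shorter whenever the sample count is not a multiple of the batch size. The generator gets this behaviour for free because Python slicing past the end of a sequence simply truncates.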

### Model-building function

In [51]:
def create_model(input_shape, n_out, weight_imagenet=True):
    """Build a DenseNet121 backbone followed by a 1024-unit dense layer and a softmax head."""
    input_tensor = tf.keras.Input(shape=input_shape)

    # Use ImageNet weights when requested; otherwise initialise randomly.
    weights = 'imagenet' if weight_imagenet else None
    base_model = tf.keras.applications.DenseNet121(include_top=False, weights=weights, input_tensor=input_tensor)

    x = GlobalAveragePooling2D()(base_model.output)
    x = Dense(1024, activation='relu')(x)
    final_output = Dense(n_out, activation='softmax', name='final_output')(x)

    model = Model(input_tensor, final_output)

    return model

## Models (FROM SCRATCH) (PSEUDO-LABELS)

### Prepare the dataframe with the pseudo-label values

In [14]:
# Predictions from an earlier submission serve as pseudo-labels for the test set.
df_pseudo_labels = pd.read_csv('../input/submitv3/submit_v3.csv')
df_pseudo_labels['cell_type'] = df_pseudo_labels['id_code'].str.split('-').str[0]
df_test = pd.read_csv('../input/recursion-cellular-image-classification-224-jpg/new_test.csv')
# Attach the pseudo-labels to the test rows and append them to the training set.
df_test_pseudo_labels = pd.merge(df_test, df_pseudo_labels, how="left", on=["id_code"])
df_train_pseudo_labels = pd.concat([df_train, df_test_pseudo_labels], ignore_index=False)

### Splitting the training set into train and validation

In [15]:
x = df_train_pseudo_labels['filename']
y = df_train_pseudo_labels['sirna']

x, y = shuffle(x, y, random_state=10)

y = to_categorical(y, num_classes=NUM_CLASSES)

train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.15, stratify=y, random_state=10)
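
Because the model is compiled with `categorical_crossentropy`, the integer siRNA labels are turned into one-hot rows before the split. The sketch below shows what that encoding looks like for a single label; `one_hot` is a hypothetical stand-in written in plain Python, not the Keras `to_categorical` implementation.

```python
def one_hot(label: int, num_classes: int):
    """Return a list of length num_classes with 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes})")
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

NUM_CLASSES = 1108
encoded = one_hot(909, NUM_CLASSES)

# One row per sample, one column per siRNA class.
print(len(encoded), encoded.index(1.0), sum(encoded))
```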

### Model DenseNet121 (train from scratch) (using pseudo-labels)

In [16]:
epochs = 20; batch_size = 8
es = EarlyStopping(monitor='val_loss', verbose=1, patience=5, restore_best_weights=True)

train_generator = My_Generator(train_x, train_y, batch_size, is_train=True, augment=True, root_dir = '../input/train-testpseudo/train_testpseudo/')
valid_generator = My_Generator(valid_x, valid_y, batch_size, is_train=False, root_dir = '../input/train-testpseudo/train_testpseudo/')

In [17]:
model = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)

# train all layers
for layer in model.layers:
    layer.trainable = True

model.compile(optimizer=Nadam(1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.load_weights('../input/model-freeze-layers-weights/model_densenet121_freeze_layers_weight.h5')

history = model.fit(
            train_generator,
            steps_per_epoch=np.ceil(float(len(train_x)) / float(batch_size)),
            validation_data=valid_generator,
            validation_steps=np.ceil(float(len(valid_x)) / float(batch_size)),
            epochs=epochs,
            verbose=1,
            callbacks=[es])


Epoch 1/20


2022-05-11 12:19:09.819429: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8005


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [18]:
model.save_weights("./model_dense121_trained_pseudo_weight.h5")

## Models (HEPG2, HUVEC, RPE, U2OS) (PSEUDO-LABELS)

In [24]:
epochs = 15; batch_size = 8

es = EarlyStopping(monitor='val_loss', verbose=1, patience=5, restore_best_weights=True)

### HEPG2 PSEUDO TRAIN

In [23]:
x = df_train_hepg2['filename']
y = df_train_hepg2['sirna']

x, y = shuffle(x, y, random_state=10)
y = to_categorical(y, num_classes=NUM_CLASSES)
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.2, stratify=y, random_state=10)

# TRAIN, VALID GENERATORS
train_generator = My_Generator(train_x, train_y, batch_size, is_train=True, augment=False)
valid_generator = My_Generator(valid_x, valid_y, batch_size, is_train=False)

# CREATE MODEL
model_hepg2 = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)
model_hepg2.compile(optimizer=Nadam(5e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model_hepg2.load_weights('../input/model-train-pseudo/model_dense121_trained_pseudo_weight.h5')

In [25]:
# FIT
# train all layers
for layer in model_hepg2.layers:
    layer.trainable = True
history = model_hepg2.fit(
            train_generator,
            steps_per_epoch=np.ceil(float(len(train_x)) / float(batch_size)),
            validation_data=valid_generator,
            validation_steps=np.ceil(float(len(valid_x)) / float(batch_size)),
            epochs=epochs,
            verbose=1,
            callbacks=[es])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Restoring model weights from the end of the best epoch.
Epoch 00009: early stopping
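
The HEPG2 log above stops at epoch 9 with `patience=5`, which is consistent with the best validation loss having occurred at epoch 4. The sketch below is a simplified model of patience-based early stopping (not the Keras implementation, and assuming any strict decrease counts as an improvement); the loss values are invented purely for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 1-indexed.

    stopped_epoch is None when training runs through every epoch.
    """
    best = float("inf")
    best_epoch = None
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses: they improve until epoch 4, then degrade.
hypothetical_val_losses = [5.0, 4.2, 3.9, 3.7, 3.8, 3.9, 4.0, 4.1, 4.3, 4.4]
stopped, best = early_stopping_epoch(hypothetical_val_losses, patience=5)
print(stopped, best)
```

With `restore_best_weights=True`, the weights kept after stopping are those from the best epoch rather than from the last one.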


In [26]:
# SAVE WEIGHTS
model_hepg2.save_weights("./model_trained_pseudo_hepg2_weight2.h5")

### HUVEC PSEUDO TRAIN

In [27]:
x = df_train_huvec['filename']
y = df_train_huvec['sirna']

x, y = shuffle(x, y, random_state=10)
y = to_categorical(y, num_classes=NUM_CLASSES)
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.2, stratify=y, random_state=10)

# TRAIN, VALID GENERATORS
train_generator = My_Generator(train_x, train_y, batch_size, is_train=True, augment=False)
valid_generator = My_Generator(valid_x, valid_y, batch_size, is_train=False)

In [28]:
# CREATE MODEL
model_huvec = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)
model_huvec.compile(optimizer=Nadam(5e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model_huvec.load_weights('../input/model-train-pseudo/model_dense121_trained_pseudo_weight.h5')

# FIT
# train all layers
for layer in model_huvec.layers:
    layer.trainable = True
history = model_huvec.fit(
            train_generator,
            steps_per_epoch=np.ceil(float(len(train_x)) / float(batch_size)),
            validation_data=valid_generator,
            validation_steps=np.ceil(float(len(valid_x)) / float(batch_size)),
            epochs=epochs,
            verbose=1,
            callbacks=[es])



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Restoring model weights from the end of the best epoch.
Epoch 00009: early stopping


In [29]:
# SAVE WEIGHTS
model_huvec.save_weights("./model_trained_pseudo_huvec_weight.h5")

### RPE PSEUDO TRAIN

In [33]:
x = df_train_rpe['filename']
y = df_train_rpe['sirna']

x, y = shuffle(x, y, random_state=10)
y = to_categorical(y, num_classes=NUM_CLASSES)
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.2, stratify=y, random_state=10)

# TRAIN, VALID GENERATORS
train_generator = My_Generator(train_x, train_y, batch_size, is_train=True, augment=False)
valid_generator = My_Generator(valid_x, valid_y, batch_size, is_train=False)

In [34]:
# CREATE MODEL
model_rpe = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)
model_rpe.compile(optimizer=Nadam(5e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model_rpe.load_weights('../input/model-train-pseudo/model_dense121_trained_pseudo_weight.h5')

# FIT
# train all layers
for layer in model_rpe.layers:
    layer.trainable = True
history = model_rpe.fit(
            train_generator,
            steps_per_epoch=np.ceil(float(len(train_x)) / float(batch_size)),
            validation_data=valid_generator,
            validation_steps=np.ceil(float(len(valid_x)) / float(batch_size)),
            epochs=epochs,
            verbose=1,
            callbacks=[es])



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Restoring model weights from the end of the best epoch.
Epoch 00007: early stopping


In [35]:
# SAVE WEIGHTS
model_rpe.save_weights("./model_trained_pseudo_rpe_weight.h5")

### U2OS PSEUDO TRAIN

In [36]:
x = df_train_u2os['filename']
y = df_train_u2os['sirna']

x, y = shuffle(x, y, random_state=10)
y = to_categorical(y, num_classes=NUM_CLASSES)
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.2, stratify=y, random_state=10)

# TRAIN, VALID GENERATORS
train_generator = My_Generator(train_x, train_y, batch_size, is_train=True, augment=False)
valid_generator = My_Generator(valid_x, valid_y, batch_size, is_train=False)

In [37]:
# CREATE MODEL
model_u2os = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)
model_u2os.compile(optimizer=Nadam(5e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model_u2os.load_weights('../input/model-train-pseudo/model_dense121_trained_pseudo_weight.h5')

# FIT
# train all layers
for layer in model_u2os.layers:
    layer.trainable = True
history = model_u2os.fit(
            train_generator,
            steps_per_epoch=np.ceil(float(len(train_x)) / float(batch_size)),
            validation_data=valid_generator,
            validation_steps=np.ceil(float(len(valid_x)) / float(batch_size)),
            epochs=epochs,
            verbose=1,
            callbacks=[es])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Restoring model weights from the end of the best epoch.
Epoch 00008: early stopping


In [39]:
# SAVE WEIGHTS
model_u2os.save_weights("./model_trained_pseudo_u2os_weight.h5")

# PREDICT (using pseudo labeling)

In [40]:
model_hepg2 = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)

model_hepg2.compile(optimizer=Nadam(5e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model_hepg2.load_weights('../input/model-hepg2-pseudo/model_trained_pseudo_hepg2_weight.h5')

In [41]:
model_huvec = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)

model_huvec.compile(optimizer=Nadam(5e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model_huvec.load_weights('../input/model-huvec-pseudo/model_trained_pseudo_huvec_weight.h5')

In [42]:
model_rpe = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)

model_rpe.compile(optimizer=Nadam(5e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model_rpe.load_weights('../input/model-rpe-pseudo/model_trained_pseudo_rpe_weight.h5')

In [43]:
model_u2os = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES, weight_imagenet=False)

model_u2os.compile(optimizer=Nadam(5e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model_u2os.load_weights('../input/model-u2os-pseudo/model_trained_pseudo_u2os_weight.h5')

In [44]:
root_dir_test = '../input/recursion-cellular-image-classification-224-jpg/test/test/'

In [45]:
id_code = df_test['id_code'].unique()

In [46]:
def get_predict(img, name):
    
    score_predict = 0
    
    if 'HEPG2' in name:

        score_predict = model_hepg2.predict(img)

    elif 'HUVEC' in name:

        score_predict = model_huvec.predict(img)

    elif 'RPE' in name:

        score_predict = model_rpe.predict(img)

    else: ## U2OS

        score_predict = model_u2os.predict(img)
    
    return score_predict
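
An alternative to the `if`/`elif` chain is a dictionary keyed by cell type. The sketch below uses stand-in callables instead of real Keras models, and every name in it is hypothetical. Unlike `get_predict`, it raises `KeyError` for an unknown cell type instead of silently falling back to the U2OS model.

```python
def pick_model_key(id_code: str) -> str:
    """Extract the cell type from an id such as 'HEPG2-08_1_B03'."""
    return id_code.split("-")[0]

def get_predict_by_key(img, id_code, models):
    """Dispatch the image to the predictor registered for its cell type."""
    return models[pick_model_key(id_code)](img)

# Stand-ins for model.predict; each just reports which model was chosen.
stand_in_models = {
    "HEPG2": lambda img: "hepg2",
    "HUVEC": lambda img: "huvec",
    "RPE": lambda img: "rpe",
    "U2OS": lambda img: "u2os",
}

print(get_predict_by_key(None, "HUVEC-17_1_B03", stand_in_models))
```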

In [47]:
test_predict = []

for i, name in tqdm(enumerate(id_code)):
    
    #Predict S1
    image1 = cv2.imread(root_dir_test + name + '_s1.jpeg')
    image1 = (image1[np.newaxis])/255
    score_predict1 = get_predict(image1, name)
    
    #Predict S2
    image2 = cv2.imread(root_dir_test + name + '_s2.jpeg') 
    image2 = (image2[np.newaxis])/255
    score_predict2 = get_predict(image2, name)
    
    #Predict AVG
    score_avg = 0.5 * (score_predict1 + score_predict2)
    
    sirna = np.argmax(score_avg)
    
    test_predict.append([name, sirna])

19897it [52:35,  6.31it/s]
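
Averaging the two site predictions before taking the argmax can change the chosen class compared with using a single site. The toy example below uses three classes and invented probabilities; `average_and_argmax` is a hypothetical helper that mirrors the `0.5 * (p1 + p2)` step.

```python
def average_and_argmax(p1, p2):
    """Average two probability vectors element-wise and return (best_index, averaged)."""
    avg = [0.5 * (a + b) for a, b in zip(p1, p2)]
    best = max(range(len(avg)), key=avg.__getitem__)  # first maximum, like np.argmax
    return best, avg

# Invented scores for the two imaging sites of one well.
site1 = [0.6, 0.3, 0.1]
site2 = [0.1, 0.5, 0.4]

best, avg = average_and_argmax(site1, site2)
print(best, avg)
```

Here site 1 alone would pick class 0, while the average picks class 1 because both sites give it reasonable support.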


In [48]:
df_submit = pd.DataFrame(test_predict, columns = ['id_code','sirna'])

In [49]:
df_submit

Unnamed: 0,id_code,sirna
0,HEPG2-08_1_B03,909
1,HEPG2-08_1_B04,1015
2,HEPG2-08_1_B05,819
3,HEPG2-08_1_B06,222
4,HEPG2-08_1_B07,585
...,...,...
19892,U2OS-05_4_O19,461
19893,U2OS-05_4_O20,595
19894,U2OS-05_4_O21,98
19895,U2OS-05_4_O22,1087


In [50]:
df_submit.to_csv('submit_v4.csv', index=False)

## Updating the prediction using the "plates leak" (pseudo-labeling)

In [7]:
train_csv = pd.read_csv("../input/recursion-cellular-image-classification-224-jpg/new_train.csv")
test_csv = pd.read_csv("../input/recursion-cellular-image-classification-224-jpg/new_test.csv")
sub = pd.read_csv("../input/submit-v4/submit_v4.csv")

In [33]:
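# For each siRNA, record the plates on which it appears in the training set.
# Each siRNA is expected on exactly 3 distinct plates; since plates are numbered
# 1 to 4 (which sum to 10), the remaining plate is 10 minus the sum of those 3.
plate_groups = np.zeros((1108,4), int)
for sirna in range(1108):
    grp = train_csv.loc[train_csv.sirna==sirna,:].plate.value_counts().index.values
    assert len(grp) == 3
    plate_groups[sirna,0:3] = grp
    plate_groups[sirna,3] = 10 - grp.sum()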
    
plate_groups[:10,:]

array([[4, 2, 3, 1],
       [1, 3, 4, 2],
       [2, 4, 1, 3],
       [1, 3, 4, 2],
       [3, 1, 2, 4],
       [1, 3, 4, 2],
       [1, 3, 4, 2],
       [2, 4, 1, 3],
       [1, 3, 4, 2],
       [4, 2, 3, 1]])
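
The fourth column of `plate_groups` relies on a small arithmetic trick: plates are numbered 1 to 4, so they sum to 10, and the one plate not seen for a siRNA is 10 minus the sum of the three that were seen. A minimal sketch with a hypothetical helper:

```python
def missing_plate(seen_plates):
    """Given three distinct plate numbers from {1, 2, 3, 4}, return the fourth."""
    assert len(set(seen_plates)) == 3, "expected three distinct plates"
    assert set(seen_plates) <= {1, 2, 3, 4}, "plates must be numbered 1 to 4"
    return 10 - sum(seen_plates)

# First rows of the printed array: the last column is the inferred plate.
print(missing_plate([4, 2, 3]), missing_plate([1, 3, 4]), missing_plate([2, 4, 1]))
```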

In [53]:
new_test = test_csv.drop(test_csv[test_csv.filename.str.contains('_s2.jpeg')].index)

In [54]:
all_test_exp = test_csv.experiment.unique()

group_plate_probs = np.zeros((len(all_test_exp),4))
for idx in range(len(all_test_exp)):
    
    preds = sub.loc[test_csv.experiment == all_test_exp[idx],'sirna'].values
    
    pp_mult = np.zeros((len(preds),1108))
    
    pp_mult[range(len(preds)),preds] = 1
    
    sub_test = new_test.loc[test_csv.experiment == all_test_exp[idx],:]
    
    assert len(pp_mult) == len(sub_test)
        
    for j in range(4):
        mask = np.repeat(plate_groups[np.newaxis, :, j], len(pp_mult), axis=0) == \
               np.repeat(sub_test.plate.values[:, np.newaxis], 1108, axis=1)
        
        group_plate_probs[idx,j] = np.array(pp_mult)[mask].sum()/len(pp_mult)

In [55]:
pd.DataFrame(group_plate_probs, index = all_test_exp)

Unnamed: 0,0,1,2,3
HEPG2-08,0.110208,0.086721,0.178862,0.62421
HEPG2-09,0.127256,0.598375,0.163357,0.111011
HEPG2-10,0.727437,0.087545,0.08574,0.099278
HEPG2-11,0.688969,0.075949,0.125678,0.109403
HUVEC-17,0.84296,0.041516,0.058664,0.056859
HUVEC-18,0.690154,0.091238,0.082204,0.136405
HUVEC-19,0.070397,0.064079,0.822202,0.043321
HUVEC-20,0.028881,0.018051,0.920578,0.032491
HUVEC-21,0.047834,0.055054,0.054152,0.84296
HUVEC-22,0.880866,0.034296,0.049639,0.035199


In [56]:
exp_to_group = group_plate_probs.argmax(1)
print(exp_to_group)

[3 1 0 0 0 0 2 2 3 0 0 3 1 0 0 0 1 3]
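
The table of `group_plate_probs` is, for each experiment, the fraction of predicted siRNAs whose expected plate under each of the four assignments matches the plate the well actually sits on. The argmax then selects the most consistent assignment. The plain-Python sketch below reproduces that scoring on toy data; `group_consistency` and the sample predictions are hypothetical, and only the three `plate_groups` rows are taken from the output shown earlier.

```python
def group_consistency(pred_sirnas, well_plates, plate_groups):
    """For each group j, return the fraction of wells where plate_groups[pred][j] == actual plate."""
    n_groups = len(plate_groups[0])
    scores = []
    for j in range(n_groups):
        hits = sum(
            1 for s, p in zip(pred_sirnas, well_plates) if plate_groups[s][j] == p
        )
        scores.append(hits / len(pred_sirnas))
    return scores

# Three siRNAs with their plate under each of the four assignments.
toy_plate_groups = [[4, 2, 3, 1], [1, 3, 4, 2], [2, 4, 1, 3]]
pred_sirnas = [0, 1, 2, 1]   # invented predictions for four wells
well_plates = [2, 3, 4, 1]   # invented plate of each well

scores = group_consistency(pred_sirnas, well_plates, toy_plate_groups)
best_group = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_group)
```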


In [57]:
# TODO: this function still needs a careful review

predicted = []

for i, name in tqdm(enumerate(id_code)):    
    image1 = cv2.imread(root_dir_test + name + '_s1.jpeg')
    image1 = (image1[np.newaxis])/255
    score_predict1 = get_predict(image1, name)
    
    image2 = cv2.imread(root_dir_test + name + '_s2.jpeg') 
    image2 = (image2[np.newaxis])/255
    score_predict2 = get_predict(image2, name)
    
    predicted.append(0.5 * (score_predict1 + score_predict2))

19897it [47:09,  7.03it/s]


In [58]:
pred_prueba = np.stack(predicted).squeeze()

In [59]:
pred_prueba.shape

(19897, 1108)

In [60]:
def select_plate_group(pp_mult, idx):
    
    sub_test = new_test.loc[test_csv.experiment == all_test_exp[idx],:]
    
    assert len(pp_mult) == len(sub_test)
    
    mask = np.repeat(plate_groups[np.newaxis, :, exp_to_group[idx]], len(pp_mult), axis=0) != \
           np.repeat(sub_test.plate.values[:, np.newaxis], 1108, axis=1)
    
    pp_mult[mask] = 0
    
    return pp_mult
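
The effect of `select_plate_group` on a single well can be illustrated without NumPy: every siRNA that should not be on the well's plate under the chosen assignment gets probability zero, and the argmax is taken over what remains. The helper name and the probabilities below are hypothetical.

```python
def mask_to_group(probs, plate, plate_groups, group):
    """Zero out the probability of every siRNA not expected on `plate` under `group`."""
    return [
        p if plate_groups[s][group] == plate else 0.0
        for s, p in enumerate(probs)
    ]

toy_plate_groups = [[4, 2, 3, 1], [1, 3, 4, 2], [2, 4, 1, 3]]
probs = [0.5, 0.3, 0.2]  # invented class scores for a well on plate 3

masked = mask_to_group(probs, plate=3, plate_groups=toy_plate_groups, group=1)
raw_choice = max(range(len(probs)), key=probs.__getitem__)
masked_choice = max(range(len(masked)), key=masked.__getitem__)
print(masked, raw_choice, masked_choice)
```

In this toy case the unconstrained prediction is class 0, but under assignment 1 only class 1 can appear on plate 3, so the masked prediction changes to class 1.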

In [61]:
sub_copia = sub.copy()

In [62]:
indices = (test_csv.experiment == all_test_exp[idx])

In [63]:
for idx in range(len(all_test_exp)):
    #print('Experiment', idx)
    indices = (new_test.experiment == all_test_exp[idx])
    
    preds = pred_prueba[indices,:].copy()
    
    preds = select_plate_group(preds, idx)
    sub_copia.loc[indices,'sirna'] = preds.argmax(1)

In [64]:
sub_copia

Unnamed: 0,id_code,sirna
0,HEPG2-08_1_B03,223
1,HEPG2-08_1_B04,980
2,HEPG2-08_1_B05,836
3,HEPG2-08_1_B06,980
4,HEPG2-08_1_B07,419
...,...,...
19892,U2OS-05_4_O19,396
19893,U2OS-05_4_O20,595
19894,U2OS-05_4_O21,98
19895,U2OS-05_4_O22,1087


In [65]:
sub_copia.to_csv('submit_v5.csv', index=False)