# **Filtering and final selection of the images**

Because the full dataset is too expensive to process, part of it is discarded and the remainder is sampled.

In [4]:
import os
from pandas import read_csv

images = read_csv( 'train.csv' )
images.head()

Unnamed: 0,image_id,data_provider,isup_grade,gleason_score
0,0005f7aaab2800f6170c399693a96917,karolinska,0,0+0
1,000920ad0b612851f8e01bcc880d9b3d,karolinska,0,0+0
2,0018ae58b01bdadc8e347995b69f99aa,radboud,4,4+4
3,001c62abd11fa4b57bf7a6c603a11bb9,karolinska,4,4+4
4,001d865e65ef5d2579c190a0e0350d8f,karolinska,0,0+0


First, the images and masks supplied by the Karolinska data provider are deleted.

In [5]:
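# Collect the IDs of every image supplied by Karolinska.
karolinska_images = list(images[images['data_provider'] == 'karolinska']['image_id'])
images_path = 'train_images/'
masks_path = 'train_label_masks/'
for image in karolinska_images:
    # Remove image and mask independently, so that a missing image
    # does not prevent its mask from being deleted.
    for path in (images_path + image + '.tiff', masks_path + image + '_mask.tiff'):
        try:
            os.remove(path)
        except FileNotFoundError:
            pass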
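
Since the deletion above cannot be undone, it can help to know which files were actually removed. The following is only a minimal sketch, not part of the original notebook: `remove_if_exists` is a hypothetical helper, and the file names are invented and live in a throwaway temporary directory.

```python
import tempfile
from pathlib import Path


def remove_if_exists(path):
    """Delete `path` if it is a file; return True only when something was removed."""
    path = Path(path)
    if path.is_file():
        path.unlink()
        return True
    return False


# Toy demonstration in a temporary directory (names are made up).
with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "abc.tiff"
    image.touch()

    removed_image = remove_if_exists(image)                        # file exists
    removed_mask = remove_if_exists(Path(tmp) / "abc_mask.tiff")   # never created
    image_still_there = image.exists()
```

Returning a boolean instead of silently passing makes it possible to count how many images and masks were missing, if that is of interest.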

Next, the remaining entries (those from Radboud) are collected in a separate table, which is saved to its own CSV at the end.

In [6]:
radboud_images = images[images['data_provider'] == 'radboud']
radboud_images = radboud_images.reset_index( drop=True ).drop( columns=['data_provider'] )
radboud_images.head()

Unnamed: 0,image_id,isup_grade,gleason_score
0,0018ae58b01bdadc8e347995b69f99aa,4,4+4
1,004dd32d9cd167d9cc31c13b704498af,1,3+3
2,0068d4c7529e34fd4c9da863ce01a161,3,4+3
3,006f6aa35a78965c92fffd1fbd53a058,3,4+3
4,007433133235efc27a39f11df6940829,0,negative


The proportion of images with each Gleason score is checked.

In [7]:
radboud_images["gleason_score"].value_counts(normalize=True)

gleason_score
negative    0.187403
4+3         0.179264
3+3         0.165116
3+4         0.130814
4+4         0.127907
4+5         0.124225
5+4         0.042829
5+5         0.021512
3+5         0.012984
5+3         0.007946
Name: proportion, dtype: float64

A stratified sample of 10% of the images is drawn, so the Gleason score distribution stays the same.

In [8]:
from sklearn.model_selection import StratifiedShuffleSplit

# test_size=0.9 leaves 10% of the rows in the first partition, which is the sample that is kept.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=42)
sample_idx, _ = next(split.split(radboud_images, radboud_images["gleason_score"]))
images_final = radboud_images.iloc[sample_idx].reset_index( drop=True )
images_final["gleason_score"].value_counts(normalize=True)

gleason_score
negative    0.187984
4+3         0.180233
3+3         0.164729
3+4         0.129845
4+4         0.127907
4+5         0.124031
5+4         0.042636
5+5         0.021318
3+5         0.013566
5+3         0.007752
Name: proportion, dtype: float64
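
Rather than comparing the two tables by eye, the gap between them can be computed directly. The sketch below is illustrative only: it uses a synthetic table (`toy`, with made-up class shares of 50/30/20 %) instead of the real data, and `max_gap` is a name introduced here for the largest per-class difference in proportion.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels, invented for illustration: three classes, 1000 rows.
toy = pd.DataFrame(
    {"gleason_score": ["negative"] * 500 + ["3+3"] * 300 + ["4+4"] * 200}
)

# Same settings as the notebook cell: keep the 10% "train" partition.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=42)
sample_idx, _ = next(split.split(toy, toy["gleason_score"]))
toy_sample = toy.iloc[sample_idx]

full = toy["gleason_score"].value_counts(normalize=True)
sampled = toy_sample["gleason_score"].value_counts(normalize=True)

# Largest absolute difference in proportion, class by class.
# reindex guards against a class that is absent from the sample.
max_gap = (full - sampled.reindex(full.index, fill_value=0)).abs().max()
```

A `max_gap` close to zero indicates that the sample reproduces the class proportions of the full table.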

In [9]:
images_final.head()

Unnamed: 0,image_id,isup_grade,gleason_score
0,81cdd8883ff6f9a18e75f2085db633be,1,3+3
1,8d96e24f96029fb89ad33d017f1fbcfc,5,4+5
2,8e9fa23bd67888cbddd7203f3a2db9fe,1,3+3
3,798b6d2cc34c30279d4155deebf16647,3,4+3
4,f7c97aeccfaeb2d9e84414c5c5e6db22,0,negative


The table of selected images is saved to a CSV file.

In [10]:
images_final.to_csv( 'images.csv', index=False )
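
Because files were deleted from disk earlier, one possible sanity check is to confirm that every `image_id` in the saved table still has a file behind it. This is a sketch under assumptions: `missing_images` is a hypothetical helper, and the demonstration uses an invented two-row table and a temporary directory rather than `images.csv` and `train_images/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def missing_images(table, images_dir):
    """Return the image_ids in `table` with no matching .tiff file in `images_dir`."""
    images_dir = Path(images_dir)
    return [
        image_id
        for image_id in table["image_id"]
        if not (images_dir / f"{image_id}.tiff").is_file()
    ]


# Toy demonstration: only "aaa" has a file on disk.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "aaa.tiff").touch()
    toy = pd.DataFrame({"image_id": ["aaa", "bbb"]})
    missing = missing_images(toy, tmp)
```

On the real data, the call would presumably look like `missing_images(read_csv('images.csv'), 'train_images/')`, with an empty list meaning that no file is missing.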