# Analysis of SIIM-ISIC Melanoma Classification Metadata and Images

# Introduction

## The Competition

Skin cancer is a common type of cancer. Although most cases are not malignant, the high number of cases makes it a serious disease that can have severe outcomes if it is not detected and treated in time. It is usually diagnosed first by eye, followed by further clinical analysis if needed. Melanoma is the rarest outcome but also the deadliest, so early detection is very important. Computer-aided diagnosis might help with these first steps and with early detection. Better detection might save thousands of lives.

This competition might help us reach that goal, and I hope it can help people around the world...

## Updates:

### 23/07/2020:
- Added adversarial validation,
- Updated metadata by removing biased features,
- Created a simpler machine learning model.

### 25/07/2020:
- Added deep learning part
- Included EfficientNet modelling
- Ensembled metadata and EffNet predictions

### 01/01/2020:
- Added external [notebook with past years tabular data here](https://www.kaggle.com/datafan07/eda-modelling-of-the-external-data-inc-ensemble)
- Small fixes


## About the Notebook

First of all, this is a **pretty early version of this notebook**. I decided to build it part by part before fully committing to my submission, so for now this notebook covers:

- EDA of the metadata,
- Extracting basic image attributes like image size, colors etc.
- Creating new features from existing data,
- Designing a machine learning model using these simple features,
- Making predictions using our model and tabular data,
- Deep learning part will be added in future...

I think using metadata to understand the problem is really important, and the plus side is that we can use it to improve our scores. For now we are only going to use tabular data for submissions. This way we can see its power, and it can help us with future CNN modelling. This notebook tries to answer questions like these:

- How does the data look?
- Do we have a complete dataset?
- How does the target distribution look? Is it balanced?
- What are the effects of scan site on the outcome?
- Does age affect skin lesion type?
- Is there a difference between female and male patients in terms of target?
- How many unique patients do we have and how many scans did they have? Is it important?
- Do image quality, colors and size have a meaningful impact on the outcome?
- Do we see similar observations when we analyse both the train and test datasets? If not, why?
- And much more...



# First Impressions and Getting Tools Ready

Let's buckle up and get our tools ready for our work! We start by importing the necessary libraries. Since we are going to do mostly EDA, our libraries are mostly related to tabular data and visualization.

In [1]:
!pip install -q efficientnet



In [2]:
# loading packages

import pandas as pd
import numpy as np

#

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

#

import seaborn as sns
import plotly.express as px

#

import os
import random
import re
import math
import time

from tqdm import tqdm
from tqdm.keras import TqdmCallback


from pandas_summary import DataFrameSummary

import warnings


warnings.filterwarnings('ignore') # Disabling warnings for clearer outputs



seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)

We set some custom styling for our notebook for aesthetics...

In [3]:
# Setting color palette.
orange_black = [
    '#fdc029', '#df861d', '#FF6347', '#aa3d01', '#a30e15', '#800000', '#171820'
]

# Setting plot styling.
plt.style.use('ggplot')

In [4]:
# Setting file paths for our notebook:

base_path = '/kaggle/input/siim-isic-melanoma-classification'
train_img_path = '/kaggle/input/siim-isic-melanoma-classification/jpeg/train/'
test_img_path = '/kaggle/input/siim-isic-melanoma-classification/jpeg/test/'
img_stats_path = '/kaggle/input/melanoma2020imgtabular'

# Loading the Data

We'll continue by loading the metadata we're given. The train data has 8 features and 33126 observations; the test data has 5 features and 10982 observations.

#### Train Dataset Consists Of:

1. image_name -> the filename of the specific image for the train set
2. patient_id -> identifies the unique patient
3. sex -> gender of the patient
4. age_approx -> approx age of the patient at time of scanning
5. anatom_site_general_challenge -> location of the scan site
6. diagnosis -> information about the diagnosis
7. benign_malignant -> indicates whether the scan result is malignant or benign
8. target -> same as above but better for modelling since it's binary

The next dataset we are going to inspect is the test set. It has the same features as the train set except for the scan results; well, that's why it's the test set, right?!

#### Test Dataset Consists Of:

1. image_name -> the filename of the specific image for the test set
2. patient_id -> identifies the unique patient
3. sex -> gender of the patient
4. age_approx -> approx age of the patient at time of scanning
5. anatom_site_general_challenge -> location of the scan site
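
As a quick sanity check of the two column lists above, here is a minimal sketch with tiny hand-made frames. The rows are invented for illustration and are not competition data; only the column names come from the lists above.

```python
import pandas as pd

# Toy stand-ins for the train and test metadata; all values are made up.
train_toy = pd.DataFrame({
    'image_name': ['img_a', 'img_b'],
    'patient_id': ['p1', 'p2'],
    'sex': ['male', 'female'],
    'age_approx': [45.0, 60.0],
    'anatom_site_general_challenge': ['torso', 'head/neck'],
    'diagnosis': ['unknown', 'melanoma'],
    'benign_malignant': ['benign', 'malignant'],
    'target': [0, 1],
})
test_toy = train_toy.drop(columns=['diagnosis', 'benign_malignant', 'target'])

# Columns present in train but missing from test are the label-related ones.
train_only = [c for c in train_toy.columns if c not in test_toy.columns]
print(train_toy.shape[1], test_toy.shape[1], train_only)
```

The same comparison can be run on the real frames once they are loaded, which is a cheap way to confirm that nothing label-related leaks into the test features.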

# Machine Learning to Neural Networks

In this part we are going to train more complicated models using the images themselves. For this part I was inspired by AgentAuers's 'Incredible TPUs' notebook [here](https://www.kaggle.com/agentauers/incredible-tpus-finetune-effnetb0-b6-at-once). It's a great notebook and you should check it out. Again, thanks to AgentAuers for letting me use some of his code as a baseline for this part of the notebook! Also thanks to Chris Deotte for the great datasets with tfrecords!

We start by importing the necessary packages and setting the random seed.

In [5]:
# Importing packages

import tensorflow as tf
import tensorflow.keras.backend as K
import efficientnet.tfkeras as efn
from kaggle_datasets import KaggleDatasets

tf.random.set_seed(seed_val)

In [6]:
# Loading image storage buckets

GCS_PATH = KaggleDatasets().get_gcs_path('melanoma-384x384')

filenames_train = np.array(tf.io.gfile.glob(GCS_PATH + '/train*.tfrec'))
filenames_test = np.array(tf.io.gfile.glob(GCS_PATH + '/test*.tfrec'))

In [7]:
# Setting TPU as the main device for training. If you get warnings while working with TPUs, ignore them.

DEVICE = 'TPU'
if DEVICE == 'TPU':
    print('connecting to TPU...')
    try:        
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Running on TPU ', tpu.master())
    except ValueError:
        print('Could not connect to TPU')
        tpu = None

    if tpu:
        try:
            print('Initializing  TPU...')
            tf.config.experimental_connect_to_cluster(tpu)
            tf.tpu.experimental.initialize_tpu_system(tpu)
            strategy = tf.distribute.experimental.TPUStrategy(tpu)
            print('TPU initialized')
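        except Exception:
            # `except _:` would raise NameError; fall back so `strategy` is always defined.
            print('Failed to initialize TPU!')
            DEVICE = 'GPU'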
    else:
        DEVICE = 'GPU'

if DEVICE != 'TPU':
    print('Using default strategy for CPU and single GPU')
    strategy = tf.distribute.get_strategy()

if DEVICE == 'GPU':
    print('Num GPUs Available: ',
          len(tf.config.experimental.list_physical_devices('GPU')))

print('REPLICAS: ', strategy.num_replicas_in_sync)
AUTO = tf.data.experimental.AUTOTUNE

connecting to TPU...
Running on TPU  grpc://10.0.0.2:8470
Initializing  TPU...
TPU initialized
REPLICAS:  8


Here we set the config for our next steps. You can play with these values, but mind the memory needed for the batch and image sizes.

In [8]:
# you can edit these settings.

cfg = dict(
           epochs=18,
           batch_size=32,
           img_size=384,
           lr_start=0.000005,
           lr_max=0.00000125,
           lr_min=0.000001,
           lr_rampup=5,
           lr_sustain=0,
           lr_decay=0.8,
           
    
           transform_prob=1.0,
           rot=180.0,
           shr=2.0,
           hzoom=8.0,
           wzoom=8.0,
           hshift=8.0,
           wshift=8.0,
    
           optimizer='adam',
           label_smooth_fac=0.05,
           tta_steps=20
            
        )
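
To get a feeling for the batch and image sizes in the config, here is a rough back-of-the-envelope sketch. It only counts the input tensors (float32 images), not the model activations or weights, so treat it as a lower bound and not as a real memory estimate. The replica count is taken from the `REPLICAS:  8` output above and the test set size from the metadata description.

```python
import math

# Values copied from cfg and the 'REPLICAS:  8' output above.
batch_size = 32
replicas = 8
img_size = 384

# Each training step processes one batch per replica.
global_batch = batch_size * replicas

# Rough size of one global batch of float32 RGB images (4 bytes per value).
bytes_per_image = img_size * img_size * 3 * 4
batch_mib = global_batch * bytes_per_image / 2**20

# Steps needed to see the 10982 test images once, rounded up.
n_test = 10982
steps_per_pass = math.ceil(n_test / global_batch)

print(global_batch, batch_mib, steps_per_pass)
```

Doubling `img_size` multiplies the per-batch input size by four, which is why image size usually has to be traded off against `batch_size`.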

In [9]:
def get_mat(rotation, shear, height_zoom, width_zoom, height_shift,
            width_shift):
    
    ''' Settings for image preparations '''

    # CONVERT DEGREES TO RADIANS
    rotation = math.pi * rotation / 180.
    shear = math.pi * shear / 180.

    # ROTATION MATRIX
    c1 = tf.math.cos(rotation)
    s1 = tf.math.sin(rotation)
    one = tf.constant([1], dtype='float32')
    zero = tf.constant([0], dtype='float32')
    rotation_matrix = tf.reshape(
        tf.concat([c1, s1, zero, -s1, c1, zero, zero, zero, one], axis=0),
        [3, 3])

    # SHEAR MATRIX
    c2 = tf.math.cos(shear)
    s2 = tf.math.sin(shear)
    shear_matrix = tf.reshape(
        tf.concat([one, s2, zero, zero, c2, zero, zero, zero, one], axis=0),
        [3, 3])

    # ZOOM MATRIX
    zoom_matrix = tf.reshape(
        tf.concat([
            one / height_zoom, zero, zero, zero, one / width_zoom, zero, zero,
            zero, one
        ],
                  axis=0), [3, 3])

    # SHIFT MATRIX
    shift_matrix = tf.reshape(
        tf.concat(
            [one, zero, height_shift, zero, one, width_shift, zero, zero, one],
            axis=0), [3, 3])

    return K.dot(K.dot(rotation_matrix, shear_matrix),
                 K.dot(zoom_matrix, shift_matrix))


def transform(image, cfg):
    
    ''' This function takes input images of [: , :, 3] sizes and returns them as randomly rotated, sheared, shifted and zoomed. '''

    DIM = cfg['img_size']
    XDIM = DIM % 2  # fix for size 331

    rot = cfg['rot'] * tf.random.normal([1], dtype='float32')
    shr = cfg['shr'] * tf.random.normal([1], dtype='float32')
    h_zoom = 1.0 + tf.random.normal([1], dtype='float32') / cfg['hzoom']
    w_zoom = 1.0 + tf.random.normal([1], dtype='float32') / cfg['wzoom']
    h_shift = cfg['hshift'] * tf.random.normal([1], dtype='float32')
    w_shift = cfg['wshift'] * tf.random.normal([1], dtype='float32')

    # GET TRANSFORMATION MATRIX
    m = get_mat(rot, shr, h_zoom, w_zoom, h_shift, w_shift)

    # LIST DESTINATION PIXEL INDICES
    x = tf.repeat(tf.range(DIM // 2, -DIM // 2, -1), DIM)
    y = tf.tile(tf.range(-DIM // 2, DIM // 2), [DIM])
    z = tf.ones([DIM * DIM], dtype='int32')
    idx = tf.stack([x, y, z])

    # ROTATE DESTINATION PIXELS ONTO ORIGIN PIXELS
    idx2 = K.dot(m, tf.cast(idx, dtype='float32'))
    idx2 = K.cast(idx2, dtype='int32')
    idx2 = K.clip(idx2, -DIM // 2 + XDIM + 1, DIM // 2)

    # FIND ORIGIN PIXEL VALUES
    idx3 = tf.stack([DIM // 2 - idx2[0, ], DIM // 2 - 1 + idx2[1, ]])
    d = tf.gather_nd(image, tf.transpose(idx3))

    return tf.reshape(d, [DIM, DIM, 3])

def prepare_image(img, cfg=None, augment=True):
    
    ''' This function loads the image, resizes it, casts a tensor to a new type float32 in our case, transforms it using the function just above, then applies the augmentations.'''
    
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [cfg['img_size'], cfg['img_size']],
                          antialias=True)
    img = tf.cast(img, tf.float32) / 255.0

    if augment:
        if cfg['transform_prob'] > tf.random.uniform([1], minval=0, maxval=1):
            img = transform(img, cfg)

        img = tf.image.random_flip_left_right(img)
        img = tf.image.random_saturation(img, 0.7, 1.3)
        img = tf.image.random_contrast(img, 0.8, 1.2)
        img = tf.image.random_brightness(img, 0.1)

    return img
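
To see what the matrices in `get_mat` do, here is a small NumPy stand-in (not the TensorFlow code above) that uses the same element layout for the rotation and shift matrices and applies them to a single point in homogeneous coordinates. The function names `rotation_matrix` and `shift_matrix` are made up for this sketch.

```python
import math
import numpy as np

def rotation_matrix(degrees):
    """NumPy stand-in for the rotation part of get_mat (same element layout)."""
    theta = math.pi * degrees / 180.0
    c, s = math.cos(theta), math.sin(theta)
    return np.array([[c, s, 0.0],
                     [-s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def shift_matrix(height_shift, width_shift):
    """NumPy stand-in for the shift part of get_mat."""
    return np.array([[1.0, 0.0, height_shift],
                     [0.0, 1.0, width_shift],
                     [0.0, 0.0, 1.0]])

# Homogeneous coordinates (x, y, 1) let a shift be written as a matrix product.
point = np.array([1.0, 0.0, 1.0])

rotated = rotation_matrix(90.0) @ point
# The right-most matrix acts first: shift the point, then rotate it.
combined = rotation_matrix(90.0) @ shift_matrix(2.0, 0.0) @ point

print(np.round(rotated, 6), np.round(combined, 6))
```

Because `get_mat` multiplies rotation, shear, zoom and shift in that order, the shift is the first operation applied to each pixel coordinate and the rotation is the last.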

The functions below read the labeled and unlabeled tfrecords and count the items in them.

In [10]:
def read_labeled_tfrecord(example):
    LABELED_TFREC_FORMAT = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'image_name': tf.io.FixedLenFeature([], tf.string),
        'patient_id': tf.io.FixedLenFeature([], tf.int64),
        'sex': tf.io.FixedLenFeature([], tf.int64),
        'age_approx': tf.io.FixedLenFeature([], tf.int64),
        'anatom_site_general_challenge': tf.io.FixedLenFeature([], tf.int64),
        'diagnosis': tf.io.FixedLenFeature([], tf.int64),
        'target': tf.io.FixedLenFeature([], tf.int64),
        #'width': tf.io.FixedLenFeature([], tf.int64),
        #'height': tf.io.FixedLenFeature([], tf.int64)
    }

    example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
    return example['image'], example['target']


def read_unlabeled_tfrecord(example):
    UNLABELED_TFREC_FORMAT = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'image_name': tf.io.FixedLenFeature([], tf.string),
        'patient_id': tf.io.FixedLenFeature([], tf.int64),
        'sex': tf.io.FixedLenFeature([], tf.int64),
        'age_approx': tf.io.FixedLenFeature([], tf.int64),
        'anatom_site_general_challenge': tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(example, UNLABELED_TFREC_FORMAT)
    return example['image'], example['image_name']

def count_data_items(filenames):
    n = [
        int(re.compile(r'-([0-9]*)\.').search(filename).group(1))
        for filename in filenames
    ]
    return np.sum(n)
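
`count_data_items` relies on the record count being part of each file name. Here is a minimal sketch of that idea; the bucket and file names are hypothetical and only follow the `name-count.tfrec` pattern that the regular expression expects.

```python
import re
import numpy as np

def count_items(filenames):
    # Same pattern as count_data_items: the digits between '-' and '.' are the record count.
    counts = [int(re.compile(r'-([0-9]*)\.').search(f).group(1)) for f in filenames]
    return np.sum(counts)

# Hypothetical file names, not the real bucket contents.
example_files = [
    'gs://some-bucket/train00-2071.tfrec',
    'gs://some-bucket/train01-2116.tfrec',
]
total = count_items(example_files)
print(total)
```

This avoids reading the tfrecords just to count them, but it silently depends on the naming convention, so files named differently would need another way of counting.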

In [11]:
def getTrainDataset(files, cfg, augment=True, shuffle=True):
    
    ''' This function reads the tfrecord train images, shuffles them, applies augmentations to them and prepares the data for training. '''
    
    ds = tf.data.TFRecordDataset(files, num_parallel_reads=AUTO)
    ds = ds.cache()

    if shuffle:
        opt = tf.data.Options()
        opt.experimental_deterministic = False
        ds = ds.with_options(opt)

    ds = ds.map(read_labeled_tfrecord, num_parallel_calls=AUTO)
    ds = ds.repeat()
    if shuffle:
        ds = ds.shuffle(2048)
    ds = ds.map(lambda img, label:
                (prepare_image(img, augment=augment, cfg=cfg), label),
                num_parallel_calls=AUTO)
    ds = ds.batch(cfg['batch_size'] * strategy.num_replicas_in_sync)
    ds = ds.prefetch(AUTO)
    return ds

def getTestDataset(files, cfg, augment=False, repeat=False):
    
    ''' This function reads the tfrecord test images and prepares the data for predicting. '''
    
    ds = tf.data.TFRecordDataset(files, num_parallel_reads=AUTO)
    ds = ds.cache()
    if repeat:
        ds = ds.repeat()
    ds = ds.map(read_unlabeled_tfrecord, num_parallel_calls=AUTO)
    ds = ds.map(lambda img, idnum:
                (prepare_image(img, augment=augment, cfg=cfg), idnum),
                num_parallel_calls=AUTO)
    ds = ds.batch(cfg['batch_size'] * strategy.num_replicas_in_sync)
    ds = ds.prefetch(AUTO)
    return ds

def get_model():
    
    ''' This function builds the model layers, including the EfficientNet ones. '''
    
    model_input = tf.keras.Input(shape=(cfg['img_size'], cfg['img_size'], 3),
                                 name='img_input')

    dummy = tf.keras.layers.Lambda(lambda x: x)(model_input)

    outputs = []

    x = efn.EfficientNetB3(include_top=False,
                           weights='noisy-student',
                           input_shape=(cfg['img_size'], cfg['img_size'], 3),
                           pooling='avg')(dummy)
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    outputs.append(x)

    x = efn.EfficientNetB4(include_top=False,
                           weights='noisy-student',
                           input_shape=(cfg['img_size'], cfg['img_size'], 3),
                           pooling='avg')(dummy)
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    outputs.append(x)

    x = efn.EfficientNetB5(include_top=False,
                           weights='noisy-student',
                           input_shape=(cfg['img_size'], cfg['img_size'], 3),
                           pooling='avg')(dummy)
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    outputs.append(x)

    model = tf.keras.Model(model_input, outputs, name='aNetwork')
    model.summary()
    return model

In [12]:
def compileNewModel(cfg):
    
    ''' Configuring the model with losses and metrics. '''    
    
#     with strategy.scope():
#         model = get_model()

    with strategy.scope():
        model = get_model()
        model.compile(optimizer=cfg['optimizer'],
                      loss=[
                          tf.keras.losses.BinaryCrossentropy(
                              label_smoothing=cfg['label_smooth_fac']),
                          tf.keras.losses.BinaryCrossentropy(
                              label_smoothing=cfg['label_smooth_fac']),
                          tf.keras.losses.BinaryCrossentropy(
                              label_smoothing=cfg['label_smooth_fac'])
                      ],
                      metrics=[tf.keras.metrics.AUC(name='auc')])
    return model

def getLearnRateCallback(cfg):
    
    ''' Using callbacks for learning rate adjustments. '''
    
    lr_start = cfg['lr_start']
    lr_max = cfg['lr_max'] * strategy.num_replicas_in_sync * cfg['batch_size']
    lr_min = cfg['lr_min']
    lr_rampup = cfg['lr_rampup']
    lr_sustain = cfg['lr_sustain']
    lr_decay = cfg['lr_decay']

    def lrfn(epoch):
        if epoch < lr_rampup:
            lr = (lr_max - lr_start) / lr_rampup * epoch + lr_start
        elif epoch < lr_rampup + lr_sustain:
            lr = lr_max
        else:
            lr = (lr_max - lr_min) * lr_decay**(epoch - lr_rampup -
                                                lr_sustain) + lr_min
        return lr

    lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=False)
    return lr_callback
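
To see what the schedule from `getLearnRateCallback` looks like, here is a plain Python sketch that reproduces `lrfn` with the values from `cfg` and 8 replicas (taken from the `REPLICAS:  8` output above). It needs no TensorFlow and just prints the learning rate per epoch.

```python
# Values copied from cfg and the 'REPLICAS:  8' output above.
lr_start, lr_min = 0.000005, 0.000001
lr_max = 0.00000125 * 8 * 32   # scaled by replicas * batch_size, as in getLearnRateCallback
lr_rampup, lr_sustain, lr_decay = 5, 0, 0.8
epochs = 18

def lrfn(epoch):
    if epoch < lr_rampup:
        # Linear warm-up from lr_start towards lr_max.
        return (lr_max - lr_start) / lr_rampup * epoch + lr_start
    if epoch < lr_rampup + lr_sustain:
        return lr_max
    # Exponential decay from lr_max towards lr_min.
    return (lr_max - lr_min) * lr_decay ** (epoch - lr_rampup - lr_sustain) + lr_min

schedule = [lrfn(e) for e in range(epochs)]
for epoch, lr in enumerate(schedule):
    print(f'epoch {epoch:2d}: lr = {lr:.2e}')
```

Note that `lr_max` in `cfg` looks smaller than `lr_start` only because it is a per-sample value: after scaling by the global batch size it becomes the peak of the schedule, reached right after the warm-up.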

# Cross Validation

In [13]:
# ###  def learnModel(model, ds_train, stepsTrain, cfg):
# from sklearn.model_selection import KFold

# GCS_PATH2 = KaggleDatasets().get_gcs_path('isic2019-384x384')
# INC2019 = INC2018 = False
# FOLDS = 5
# skf = KFold(n_splits=FOLDS,shuffle=True,random_state=seed_val)

# files_test = np.sort(np.array(tf.io.gfile.glob(GCS_PATH + '/test*.tfrec')))
# stepsTest = count_data_items(files_test) / (cfg['batch_size'] * strategy.num_replicas_in_sync)
# z = np.zeros((cfg['batch_size'] * strategy.num_replicas_in_sync))
# ds_testAug = getTestDataset(files_test, cfg, augment=True, repeat=True).map(lambda img, label: (img, (z, z, z)))

# test_probs = []
# for fold, (idxT, idxV) in enumerate(skf.split(np.arange(15))):
#     if fold not in [0]: continue
    
#     files_train = tf.io.gfile.glob([GCS_PATH + '/train%.2i*.tfrec'%x for x in idxT])
#     files_valid = tf.io.gfile.glob([GCS_PATH + '/train%.2i*.tfrec'%x for x in idxV])
#     print('#### Using original data')
#     print('files_train #:', len(files_train))
    
#     if INC2019:
#         files_train += tf.io.gfile.glob([GCS_PATH2 + '/train%.2i*.tfrec'%x for x in idxT*2+1])
#         print('#### Add 2019 external data')
#         print('files_train #:', len(files_train))
#     if INC2018:
#         files_train += tf.io.gfile.glob([GCS_PATH2 + '/train%.2i*.tfrec'%x for x in idxT*2])
#         print('#### Add 2018 external data')
#         print('files_train #:', len(files_train))
    
#     stepsTrain = count_data_items(files_train) / (cfg['batch_size'] * strategy.num_replicas_in_sync)
#     stepsValid = count_data_items(files_valid) / (cfg['batch_size'] * strategy.num_replicas_in_sync)
    
#     ds_train = getTrainDataset(files_train, cfg, augment=True, shuffle=True).map(lambda img, label: (img, (label, label, label)))
#     ds_valid = getTrainDataset(files_valid, cfg, augment=False, shuffle=False).map(lambda img, label: (img, (label, label, label)))
        
#     K.clear_session()
#     with strategy.scope():
#         model = compileNewModel(cfg)
        
#     sv = tf.keras.callbacks.ModelCheckpoint( # val_loss
#         'auc_fold%i_e{epoch}.h5'%fold, monitor='val_dense_auc', verbose=0, save_best_only=False,
#         save_weights_only=True, mode='max', save_freq='epoch')

#     callbacks = [sv, getLearnRateCallback(cfg)]
#     history = model.fit(ds_train,
#                         validation_data=ds_valid,
#                         verbose=1,
#                         steps_per_epoch=stepsTrain,
#                         validation_steps=stepsValid,
#                         epochs=cfg['epochs'],
#                         callbacks=callbacks)
    

# #     # test time augmentations for predictions (20 in our case, you can increase it a little in cfg) and taking mean of them
# #     probs = model.predict(ds_testAug, verbose=1, steps=stepsTest * cfg['tta_steps'])
# #     probs = np.stack(probs)
# #     probs = probs[:, :count_data_items(filenames_test) * cfg['tta_steps']]
# #     probs = np.stack(np.split(probs, cfg['tta_steps'], axis=1), axis=1)
# #     test_probs.append(probs)

In [14]:
# DISPLAY_PLOT = True
# if DISPLAY_PLOT:
#     plt.figure(figsize=(15,5))
#     plt.plot(np.arange(cfg['epochs']),history.history['dense_auc'],'-o',label='Train AUC',color='#ff7f0e')
#     plt.plot(np.arange(cfg['epochs']),history.history['val_dense_auc'],'-o',label='Val AUC',color='#1f77b4')
#     x = np.argmax( history.history['val_dense_auc'] ); y = np.max( history.history['val_dense_auc'] )
#     xdist = plt.xlim()[1] - plt.xlim()[0]; ydist = plt.ylim()[1] - plt.ylim()[0]
#     plt.scatter(x,y,s=200,color='#1f77b4'); plt.text(x-0.03*xdist,y-0.13*ydist,'max auc\n%.2f'%y,size=14)
#     plt.ylabel('AUC',size=14); plt.xlabel('Epoch',size=14)
#     plt.legend(loc=2)
#     plt2 = plt.gca().twinx()
#     plt2.plot(np.arange(cfg['epochs']),history.history['dense_loss'],'-o',label='Train Loss',color='#2ca02c')
#     plt2.plot(np.arange(cfg['epochs']),history.history['val_dense_loss'],'-o',label='Val Loss',color='#d62728')
#     x = np.argmin( history.history['val_dense_loss'] ); y = np.min( history.history['val_dense_loss'] )
#     ydist = plt.ylim()[1] - plt.ylim()[0]
#     plt.scatter(x,y,s=200,color='#d62728'); plt.text(x-0.03*xdist,y+0.05*ydist,'min loss',size=14)
#     plt.ylabel('Loss',size=14)
# #     plt.title('FOLD %i - Image Size %i, EfficientNet B%i, inc2019=%i, inc2018=%i'%
# #             (fold+1,IMG_SIZES[fold],EFF_NETS[fold],INC2019[fold],INC2018[fold]),size=18)
#     plt.legend(loc=3)
#     plt.show()  

In [15]:
# import pickle
# model.load_weights('./auc_fold3_e13.h5')
# probs = model.predict(ds_testAug, verbose=1, steps=stepsTest * cfg['tta_steps'])
# probs = np.stack(probs)
# probs = probs[:, :count_data_items(filenames_test) * cfg['tta_steps']]
# probs = np.stack(np.split(probs, cfg['tta_steps'], axis=1), axis=1)
# test_probs.append(probs)

# file = open('test_probs_3', 'wb')
# pickle.dump(test_probs, file)
# file.close()

# fp = ['../input/test-probs/test_probs_0', '../input/test-probs/test_probs_1',
#      '../input/test-probs/test_probs_2', '../input/test-probs/test_probs_3',
#      '../input/test-probs/test_probs_4']

# all_test_probs = []
# for f in fp:
#     file = open(f, 'rb')
#     test_probs = pickle.load(file)
#     all_test_probs += test_probs
#     file.close()

# len(all_test_probs)

In [16]:
# probs = np.mean(all_test_probs, axis=0)
# probs = np.mean(probs, axis=1)
# test = pd.read_csv(os.path.join(base_path, 'test.csv'))

# y_test_sorted = np.zeros((3, probs.shape[1]))
# test = test.reset_index()
# test = test.set_index('image_name')

# i = 0
# ds_test = getTestDataset(filenames_test, cfg)
# for img, imgid in tqdm(iter(ds_test.unbatch())):
#     imgid = imgid.numpy().decode('utf-8')
#     y_test_sorted[:, test.loc[imgid]['index']] = probs[:, i, 0]
#     i += 1

    
# # creating .csv files for each effnet

# sample = pd.read_csv(os.path.join(base_path, 'sample_submission.csv'))
# # for i in range(y_test_sorted.shape[0]):
# #     submission = sample
# #     submission['target'] = y_test_sorted[i]
# #     submission.to_csv('submission_model_%s.csv' % i, index=False)

# # blending effnets into a single .csv file    

# submission = sample
# submission['target'] = np.mean(y_test_sorted, axis=0)
# submission.to_csv('blended_effnets_auc_5fold.csv', index=False)

# # loading recently created .csv files from working directory

# effnet = pd.read_csv('./blended_effnets_auc_5fold.csv')
# meta = pd.read_csv('../input/sscsv/meta_simplified_img_data.csv')

# sample['target'] = (effnet['target'] * 0.9 + meta['target'] * 0.1 )

# # final submissions

# sample.to_csv('ensembled_auc_5fold.csv', header=True, index=False)

# Train all

In [17]:
def learnModel(model, ds_train, stepsTrain, cfg, ds_val=None, stepsVal=0):
    
    ''' Fitting things together for training '''
    
    # callbacks = [getLearnRateCallback(cfg)]

    sv = tf.keras.callbacks.ModelCheckpoint( # val_loss
        'auc_e{epoch}.h5', monitor='val_dense_auc', verbose=0, save_best_only=False,
        save_weights_only=True, mode='max', save_freq='epoch')

    callbacks = [sv, getLearnRateCallback(cfg)]
    
    history = model.fit(ds_train,
                        validation_data=ds_val,
                        verbose=1,
                        steps_per_epoch=stepsTrain,
                        validation_steps=stepsVal,
                        epochs=cfg['epochs'],
                        callbacks=callbacks)

    return history

GCS_PATH2 = KaggleDatasets().get_gcs_path('isic2019-384x384')
INC2019_2018 = False

files_train = tf.io.gfile.glob([GCS_PATH + '/train*.tfrec'])
print('#### Using original data')
print('files_train #:', len(files_train))

if INC2019_2018: 
    files_train += tf.io.gfile.glob(GCS_PATH2 + '/train*.tfrec')
    print('#### Add external data')
    print('files_train #:', len(files_train))

ds_train = getTrainDataset(files_train, cfg, augment=True, shuffle=True).map(lambda img, label: (img, (label, label, label)))

stepsTrain = count_data_items(files_train) / (cfg['batch_size'] * strategy.num_replicas_in_sync)

model = compileNewModel(cfg)
history = learnModel(model, ds_train, stepsTrain, cfg)

In [18]:
plt.figure(figsize=(15,5))
plt.plot(np.arange(cfg['epochs']),history.history['dense_auc'],'-o',label='Train AUC',color='#ff7f0e')
#plt.plot(np.arange(cfg['epochs']),history.history['val_dense_auc'],'-o',label='Val AUC',color='#1f77b4')
x = np.argmax( history.history['dense_auc'] )
y = np.max( history.history['dense_auc'] )
xdist = plt.xlim()[1] - plt.xlim()[0]; ydist = plt.ylim()[1] - plt.ylim()[0]
plt.scatter(x,y,s=200,color='#1f77b4'); plt.text(x-0.03*xdist,y-0.13*ydist,'max auc\n%.2f'%y,size=14)
plt.ylabel('AUC',size=14); plt.xlabel('Epoch',size=14)
plt.legend(loc=2)

In [20]:
model = compileNewModel(cfg)
model.load_weights('../input/ssweights/auc_e13.h5')

files_test = np.sort(np.array(tf.io.gfile.glob(GCS_PATH + '/test*.tfrec')))
steps = count_data_items(files_test) / (cfg['batch_size'] * strategy.num_replicas_in_sync)
z = np.zeros((cfg['batch_size'] * strategy.num_replicas_in_sync))

# loading test data

ds_testAug = getTestDataset(files_test, cfg, augment=True, repeat=True).map(lambda img, label: (img, (z, z, z)))

# test time augmentations for predictions (20 in our case, you can increase it a little in cfg) and taking mean of them

probs = model.predict(ds_testAug, verbose=1, steps=steps * cfg['tta_steps'])
probs = np.stack(probs)
probs = probs[:, :count_data_items(filenames_test) * cfg['tta_steps']]
probs = np.stack(np.split(probs, cfg['tta_steps'], axis=1), axis=1)
probs = np.mean(probs, axis=1)

test = pd.read_csv(os.path.join(base_path, 'test.csv'))

y_test_sorted = np.zeros((3, probs.shape[1]))
test = test.reset_index()
test = test.set_index('image_name')

i = 0
ds_test = getTestDataset(filenames_test, cfg)
for img, imgid in tqdm(iter(ds_test.unbatch())):
    imgid = imgid.numpy().decode('utf-8')
    y_test_sorted[:, test.loc[imgid]['index']] = probs[:, i, 0]
    i += 1

    
# creating .csv files for each effnet
sample = pd.read_csv(os.path.join(base_path, 'sample_submission.csv'))
# for i in range(y_test_sorted.shape[0]):
#     submission = sample
#     submission['target'] = y_test_sorted[i]
#     submission.to_csv('submission_model_%s.csv' % i, index=False)

# blending effnets into a single .csv file    

submission = sample
submission['target'] = np.mean(y_test_sorted, axis=0)
submission.to_csv('blended_effnets.csv', index=False)

# loading recently created .csv files from working directory

effnet = pd.read_csv('./blended_effnets.csv')

Model: "aNetwork"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
img_input (InputLayer)          [(None, 384, 384, 3) 0                                            
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 384, 384, 3)  0           img_input[0][0]                  
__________________________________________________________________________________________________
efficientnet-b3 (Model)         (None, 1536)         10783528    lambda_1[0][0]                   
__________________________________________________________________________________________________
efficientnet-b4 (Model)         (None, 1792)         17673816    lambda_1[0][0]                   
___________________________________________________________________________________________

10982it [00:26, 411.64it/s]
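
The test-time augmentation averaging in the cell above is mostly array reshaping, so here is a toy NumPy sketch of just that step. The sizes and the fake predictions are made up; the assumption is that the repeated test dataset yields full passes one after another, so consecutive chunks of `n_images` rows belong to the same pass.

```python
import numpy as np

n_models, n_images, tta_steps = 3, 4, 5   # toy sizes, not the real ones

# Fake predictions: one array of shape (n_images * tta_steps, 1) per model head,
# where pass t over the images predicts the value t for every image.
per_head = [np.repeat(np.arange(tta_steps, dtype=float), n_images).reshape(-1, 1)
            for _ in range(n_models)]

probs = np.stack(per_head)                                     # (3, 20, 1)
probs = probs[:, :n_images * tta_steps]                        # drop rows beyond full passes
probs = np.stack(np.split(probs, tta_steps, axis=1), axis=1)   # (3, 5, 4, 1)
tta_mean = np.mean(probs, axis=1)                              # (3, 4, 1)

print(probs.shape, tta_mean.shape, tta_mean[0, :, 0])
```

Every image ends up with the mean of its five augmented predictions, here 2.0, which is what the `np.split` / `np.mean` pair in the cell above is doing for the real predictions.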


### Use blended_effnets.csv in https://www.kaggle.com/datafan07/eda-modelling-of-the-external-data-inc-ensemble to get eda_ensembled.csv


### Use eda_ensembled.csv in https://www.kaggle.com/paklau9/minmax-highest-public-lb-9619 to get final submission.csv

In [22]:
import shutil
shutil.copy('../input/cs0099/submission.csv', './submission.csv')
shutil.copy('../input/cs0099/submission_6.csv', './submission_6.csv')
shutil.copy('../input/cs0099/submission_jig.csv', './submission_jig.csv')
shutil.copy('../input/cs0099/submission_mean.csv', './submission_mean.csv')

'./submission_mean.csv'

In [27]:
import os

import numpy as np
import pandas as pd

def MinMaxBestBaseStacking(input_folder, best_base, output_path):
    sub_base = pd.read_csv(best_base)
    all_files = os.listdir(input_folder)

    # Read and concatenate submissions
    outs = [pd.read_csv(os.path.join(input_folder, f), index_col=0) for f in all_files]
    concat_sub = pd.concat(outs, axis=1)
    cols = list(map(lambda x: "target" + str(x), range(len(concat_sub.columns))))
    concat_sub.columns = cols
    concat_sub.reset_index(inplace=True)

    print(concat_sub)
    print(concat_sub.iloc[:, 1:6])
    # get the data fields ready for stacking
    # (prediction columns are selected by name so the helper columns added
    # below are never picked up, as they could be with a positional slice)
    preds = concat_sub[cols]
    concat_sub['is_iceberg_max'] = preds.max(axis=1)
    concat_sub['is_iceberg_min'] = preds.min(axis=1)
    concat_sub['is_iceberg_mean'] = preds.mean(axis=1)
    concat_sub['is_iceberg_median'] = preds.median(axis=1)

    # set up cutoff thresholds: despite the names, cutoff_lo is the upper bound
    # (all models above it -> take the max) and cutoff_hi is the lower bound
    # (all models below it -> take the min)
    cutoff_lo = 0.66
    cutoff_hi = 0.33
    
#     cutoff_lo = 0.85
#     cutoff_hi = 0.17

    concat_sub['is_iceberg_base'] = sub_base['target']
    concat_sub['target'] = np.where(np.all(preds > cutoff_lo, axis=1),
                                    concat_sub['is_iceberg_max'],
                                    np.where(np.all(preds < cutoff_hi, axis=1),
                                             concat_sub['is_iceberg_min'],
                                             concat_sub['is_iceberg_base']))
    concat_sub[['image_name', 'target']].to_csv(output_path,
                                            index=False, float_format='%.12f')
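
The stacking rule inside `MinMaxBestBaseStacking` is easier to follow on a toy example. This is a minimal sketch with three hypothetical models and made-up predictions; the column names `m0`, `m1`, `m2` and the variable names are invented for illustration, and only the two thresholds are taken from the function above.

```python
import numpy as np
import pandas as pd

# Toy predictions from three hypothetical models plus a 'best base' submission.
preds = pd.DataFrame({
    'm0': [0.90, 0.10, 0.70],
    'm1': [0.80, 0.20, 0.20],
    'm2': [0.95, 0.05, 0.50],
})
base = pd.Series([0.85, 0.15, 0.40])

upper, lower = 0.66, 0.33   # same values as cutoff_lo / cutoff_hi above

all_high = (preds > upper).all(axis=1)
all_low = (preds < lower).all(axis=1)

# Unanimous high -> take the max, unanimous low -> take the min, otherwise keep the base.
stacked = np.where(all_high, preds.max(axis=1),
                   np.where(all_low, preds.min(axis=1), base))
print(stacked)
```

When all models agree the rule pushes the prediction further towards the extreme, and when they disagree it falls back to the base submission. Since AUC only depends on the ranking, the effect comes from how these pushed values reorder the samples.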

In [28]:
#MinMaxBestBaseStacking('../input/cs0099/', '../input/cs0099/submission_mean.csv', 'submission.csv')
MinMaxBestBaseStacking('../input/ss-minmax-csv/', '../input/sscsv/eda_ensembled.csv', 'submission.csv')


         image_name   target0   target1   target2   target3
0      ISIC_0052060  0.012047  0.032010  0.026612  0.022257
1      ISIC_0052349  0.009849  0.025807  0.023487  0.016384
2      ISIC_0058510  0.012835  0.027831  0.023767  0.023354
3      ISIC_0073313  0.010287  0.030409  0.024287  0.017497
4      ISIC_0073502  0.011848  0.029881  0.027824  0.018092
...             ...       ...       ...       ...       ...
10977  ISIC_9992485  0.008851  0.029353  0.019685  0.012813
10978  ISIC_9996992  0.027116  0.028810  0.035447  0.020065
10979  ISIC_9997917  0.054770  0.040276  0.049058  0.062400
10980  ISIC_9998234  0.013760  0.028119  0.029427  0.026082
10981  ISIC_9999302  0.056559  0.056665  0.047329  0.080993

[10982 rows x 5 columns]
        target0   target1   target2   target3
0      0.012047  0.032010  0.026612  0.022257
1      0.009849  0.025807  0.023487  0.016384
2      0.012835  0.027831  0.023767  0.023354
3      0.010287  0.030409  0.024287  0.017497
4      0.011848  0.02988