![Mayo-Strip-AI](https://storage.googleapis.com/kaggle-competitions/kaggle/37333/logos/header.png)

This is an image classification competition with two target classes. The images depict blood clots whose etiology (that is, origin) is known to be either CE (Cardioembolic) or LAA (Large Artery Atherosclerosis). There are also supplemental images with either an unknown etiology or an etiology other than CE or LAA.

Our goal is to classify the test-set images of each patient as CE or LAA.
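
Because a patient can have more than one image, image-level predictions have to be combined into one prediction per patient. A minimal sketch of one possible approach (averaging) follows. The `image_preds` frame and its values are made up for illustration, and the `patient_id`, `CE`, `LAA` output layout is an assumption about the submission format, not something confirmed in this notebook.

```python
import pandas as pd

# Hypothetical image-level predictions: one row per image with P(CE).
image_preds = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2"],
    "CE": [0.8, 0.6, 0.3],
})

# Average the image-level probabilities within each patient.
patient_preds = image_preds.groupby("patient_id", as_index=False)["CE"].mean()

# With only two classes, P(LAA) is the complement of P(CE).
patient_preds["LAA"] = 1.0 - patient_preds["CE"]
print(patient_preds)
```

Averaging is only one option; taking the maximum or weighting images differently are equally plausible choices.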

<h3>Key Terms</h3>

* <b> Ischemic Stroke</b> : An ischemic stroke occurs when the blood supply to part of the brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Source: <a href="https://www.mayoclinic.org/diseases-conditions/stroke/symptoms-causes/syc-20350113#:~:text=An%20ischemic%20stroke%20occurs%20when,brain%20damage%20and%20other%20complications">mayoclinic.org</a>

* <b> Stroke Etiology</b>: There are two main causes of stroke: (1) a blocked artery (ischemic stroke) and (2) the leaking or bursting of a blood vessel (hemorrhagic stroke). Source: <a href="https://www.mayoclinic.org/diseases-conditions/stroke/symptoms-causes/syc-20350113#:~:text=There%20are%20two%20main%20causes,doesn't%20cause%20lasting%20symptoms">mayoclinic.org</a>

* <b> Whole slide imaging (WSI)</b> : Commonly referred to as “virtual microscopy”, WSI is the scanning of conventional glass slides to produce digital slides. It is the most recent imaging modality being employed by pathology departments worldwide. WSI consists of two processes: (1) specialized hardware (a scanner) digitizes the glass slide into a large representative digital image, known as a digital slide; (2) specialized software (i.e., a virtual slide viewer) is used to view and/or analyze these enormous digital files. Source: <a href="https://www.dovepress.com/whole-slide-imaging-in-pathology-advantages-limitations-and-emerging-p-peer-reviewed-fulltext-article-PLMI#:~:text=Whole%20slide%20imaging%20(WSI)%2C,%2C%20educational%2C%20and%20research%20purposes">dovepress.com</a>

In [None]:
import gc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.offline import iplot, init_notebook_mode
from plotly.subplots import make_subplots

# enable Plotly rendering inside the notebook
init_notebook_mode(connected=True)
# setting default template to plotly_white for all visualizations
pio.templates.default = "plotly_white"
%matplotlib inline

from colorama import Fore, Back, Style

y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
c_ = Fore.CYAN
res = Style.RESET_ALL

import warnings
warnings.filterwarnings('ignore')

In [None]:
PATH = '/kaggle/input/mayo-clinic-strip-ai/'
submission = pd.read_csv(PATH + 'sample_submission.csv', index_col=None)
train = pd.read_csv(PATH + 'train.csv', index_col=None)
test = pd.read_csv(PATH + 'test.csv', index_col=None)
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', None)
print(f"{y_}Train csv shape : {train.shape}{res}\n{g_}Test csv shape : {test.shape}{res}")

In [None]:
train.info()

In [None]:
train

In [None]:
test

In [None]:
def plot_class_pct():
    label_grp = train.groupby(['label'])['image_id'].count().reset_index()
    colors = {'CE' : '#DCD427',
    'LAA' : '#0092CC',
             }
    label_map = {'CE' : "CE (Cardioembolic)",'LAA' : "LAA (Large Artery Atherosclerosis)"}
    label_grp['color'] = label_grp['label'].apply(lambda x: colors[x])
    label_grp['lbl'] = label_grp['label'].apply(lambda x: label_map[x])    
    pio.templates.default = "plotly_dark"
    label_grp['pct'] = round((label_grp['image_id'] / label_grp['image_id'].sum())*100,2)
    fig = go.Figure(data=[go.Pie(labels=label_grp['lbl'],
                             values=label_grp['pct'],
                             hole=.3,
                             pull=[0.1, 0.1]
                            )
                     ]
               )
    fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=16,
                  marker=dict(colors=label_grp['color'], line=dict(color='#000000', width=2))
                 )
    fig.update_layout(title={'text': "% of labels in training data",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
    fig.show()    

In [None]:
plot_class_pct()

In [None]:
def plot_test_center_pct():
    test_center = train.groupby(['center_id'])['image_id'].count().reset_index()
    pio.templates.default = "plotly_dark"
    test_center['pct'] = round((test_center['image_id'] / test_center['image_id'].sum())*100,2)
    fig = go.Figure(data=[go.Pie(labels=test_center['center_id'],
                             values=test_center['pct'],
                             hole=.3,
                             pull=[0.1, 0.1]
                            )
                     ]
               )
    fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=16,
                  marker=dict(#colors=test_center['color'], 
                              line=dict(color='#000000', width=2))
                 )
    fig.update_layout(title={'text': "% of samples in training data from test centers",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})
    fig.show()    

In [None]:
plot_test_center_pct()

Test center 11 has the most training samples (34%).
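
The same shares can be read off numerically instead of from the pie chart. A minimal sketch, using a small made-up frame (`toy`) in place of `train` so that it runs on its own; only the `center_id` column name is taken from the data above.

```python
import pandas as pd

# Toy stand-in for train; the values are invented for illustration.
toy = pd.DataFrame({"center_id": [11, 11, 11, 4, 4, 7]})

# Percentage of rows contributed by each center, largest first.
center_share = (toy["center_id"].value_counts(normalize=True) * 100).round(2)

# Center with the largest share of samples.
top_center = center_share.idxmax()
print(center_share)
print("Largest contributor:", top_center)
```

Replacing `toy` with `train` should give the per-center percentages shown in the chart.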

In [None]:
center_grp = train.groupby(['center_id','label'])['image_id'].count().reset_index()
center_grp.rename(columns={'image_id':'count'},inplace=True)

# Take x and y from the same filtered frame so the bars stay aligned
# even if a center has no images for one of the labels
ce_grp = center_grp.query("label =='CE'")
laa_grp = center_grp.query("label =='LAA'")

fig = go.Figure(data=[
    go.Bar(name='CE', x=ce_grp['center_id'], y=ce_grp['count'], marker=dict(color='#DCD427')),
    go.Bar(name='LAA', x=laa_grp['center_id'], y=laa_grp['count'], marker=dict(color='#0092CC'))
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.update_layout(
    xaxis = dict(
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        title = "Test Center"
    ),
    yaxis = dict(
        title = "Count"
    )    
)
fig.update_layout(title_text="Labels by Test Center")
fig.show()

Test Center 3 has more LAA (Large Artery Atherosclerosis) images than CE (Cardioembolic) images, while all other test centers have more CE images. This may not significantly influence the classification, but it is an anomaly worth noting.
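
A check like this can also be done in code rather than by eye. The sketch below uses a made-up frame (`toy`) with the same `center_id` and `label` columns as `train`; the counts are invented and are not the competition data.

```python
import pandas as pd

# Toy stand-in for train; the values are invented for illustration.
toy = pd.DataFrame({
    "center_id": [1, 1, 1, 3, 3, 3, 5],
    "label": ["CE", "CE", "LAA", "LAA", "LAA", "CE", "CE"],
})

# Rows: centers, columns: labels, cells: image counts.
counts = pd.crosstab(toy["center_id"], toy["label"])

# Make sure both label columns exist even if one label is absent.
counts = counts.reindex(columns=["CE", "LAA"], fill_value=0)

# Centers where LAA images outnumber CE images.
laa_majority = counts.index[counts["LAA"] > counts["CE"]].tolist()
print(counts)
print("Centers with more LAA than CE:", laa_majority)
```

`pd.crosstab` fills missing center/label combinations with zero, which avoids misaligned comparisons when a center has images of only one class.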

In [None]:
patient_img_count = train.groupby(['patient_id'])['image_id'].count().reset_index().sort_values(by="image_id", ascending=False).reset_index(drop=True)
patient_img_count.rename(columns={'image_id':'num_images'},inplace=True)

fig = go.Figure(data=[go.Histogram(x=patient_img_count['num_images'], marker=dict(color='#F2A514'))])
fig.update_layout(
    xaxis = dict(
        title = "Number of images"
    ),
    yaxis = dict(
        title = "Count"
    )    
)
fig.update_layout(title_text="Image count per patient - Histogram")
fig.show()

Most patients in the training set have only one image. 
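
To put a number on this, the share of single-image patients can be computed directly. A minimal sketch on a made-up frame (`toy`) with the same `patient_id` and `image_id` columns as `train`; the rows are invented.

```python
import pandas as pd

# Toy stand-in for train; the values are invented for illustration.
toy = pd.DataFrame({
    "patient_id": ["a", "b", "b", "c", "d"],
    "image_id": ["a_0", "b_0", "b_1", "c_0", "d_0"],
})

# Number of distinct images per patient.
images_per_patient = toy.groupby("patient_id")["image_id"].nunique()

# Fraction of patients that have exactly one image.
single_share = (images_per_patient == 1).mean()
print(f"Patients with a single image: {single_share:.0%}")
```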

## OpenSlide 

OpenSlide Python is a Python interface to the OpenSlide library.

The OpenSlide library provides a simple interface for reading whole-slide images, also known as virtual slides. These images can occupy tens of gigabytes when uncompressed, and so cannot be easily read using standard tools or libraries, which are designed for images that can be comfortably uncompressed into RAM. Source : <a href="https://openslide.org/api/python/#openslide.OpenSlide"> openslide.org</a>
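
A rough back-of-the-envelope estimate shows where the "tens of gigabytes" figure comes from. The helper `uncompressed_size_gib` and the slide dimensions below are hypothetical, and the estimate assumes 8-bit RGB pixels with no alpha channel.

```python
def uncompressed_size_gib(width, height, channels=3, bytes_per_channel=1):
    """Estimate the in-memory size of a fully decoded image in GiB."""
    return width * height * channels * bytes_per_channel / 2**30


# Hypothetical slide dimensions in pixels, chosen only for illustration.
size = uncompressed_size_gib(80_000, 60_000)
print(f"Approximate uncompressed size: {size:.2f} GiB")
```

This is why such slides are usually read region by region, or at a downsampled level, rather than loaded in full.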

#### Reference
Thanks to the author of this <a href="https://www.kaggle.com/code/naotous/mayo-eda-whole-slide-images"> notebook </a> for introducing me to the OpenSlide library, among others.

In [None]:
import json
from openslide import open_slide

# PATH already ends with a slash
sample_file = PATH + "train/3d10be_0.tif"

slide = open_slide(sample_file)
slide_props = slide.properties
print(f"{g_}Properties : {json.dumps(dict(slide_props), sort_keys=True, indent=4)}{res}")
slide_dims = slide.dimensions
print(f"{m_}Dimension of the sample image : {slide_dims}{res}")
print(f"{b_}Number of levels in this image : {len(slide.level_dimensions)}{res}")
print(f"{g_}Each level is downsampled by : {slide.level_downsamples}{res}")


Since the actual file is large, let's get the thumbnail and display it.

In [None]:
slide_thumb = slide.get_thumbnail(size=(600, 600))
_ = plt.figure(figsize=(6,6))
_ = plt.imshow(np.array(slide_thumb))    
_ = plt.title(sample_file)
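
The `size` passed to `get_thumbnail` is a maximum bounding box, so a non-square slide yields a thumbnail smaller than 600 x 600 in one dimension. The sketch below illustrates the same aspect-preserving behaviour with Pillow's `Image.thumbnail` on a synthetic image. It is a stand-in for illustration only and does not use OpenSlide or the competition files.

```python
from PIL import Image

# Synthetic 2:1 image standing in for a (much larger) slide.
img = Image.new("RGB", (2000, 1000), color="white")

# thumbnail() resizes in place to fit inside the box, keeping the aspect ratio.
thumb = img.copy()
thumb.thumbnail((600, 600))
print(thumb.size)
```

The longer side is scaled to the box limit and the shorter side follows the original aspect ratio.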

In [None]:
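def show_slide(image_id, ax):
    """Draw a thumbnail (at most 600x600 px) of the given training slide on `ax`."""
    # PATH already ends with a slash
    file = f"{PATH}train/{image_id}.tif"
    slide = open_slide(file)
    slide_thumb = slide.get_thumbnail(size=(600, 600))
    slide.close()  # release the file handle once the thumbnail is in memory
    ax.imshow(np.array(slide_thumb))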


In [None]:
#list(train.query("label == 'LAA' and image_num == 0")['image_id'])

In [None]:
def display_sample(samples, title):
    fig1, ax1 = plt.subplots(1,len(samples), figsize=(18, 5), facecolor='w', edgecolor='b')
    fig1.subplots_adjust(hspace =.3, wspace=0.3)
    if len(samples) == 1:
        show_slide(samples[0],ax1)
    else:
        axs = ax1.ravel()
        for image_id, ax in zip(samples, axs):
            show_slide(image_id,ax)
    plt.tight_layout(pad=3.0)
    plt.subplots_adjust(top=0.91)
    plt.suptitle(title,fontsize = 20)
    plt.show()

In [None]:
ce_samples = ['0ed87f_0']
display_sample(ce_samples, 'CE Sample')

In [None]:
laa_samples = [ '1db82d_0', '1f018f_0']
display_sample(laa_samples, 'LAA Samples')

Work in progress