<h1><center><b>Mayo Clinic - STRIP AI - Exploratory Data Analysis</b></center></h1>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/37333/logos/header.png)

<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist"><h2 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:red; border:0' role="tab" aria-controls="home"><center>Quick Navigation</center></h2>

* [0. Goal and Context of Competition](#0)
* [1. Basic Data Exploration](#1)
* [2. Images Visualizations](#2)
* [3. Baseline Submission](#3)

<a id="0"></a>
<h2 style='background:red; border:0; color:white'><center>Goal of the Competition and Context</center></h2>


The goal of this competition is to classify the blood clot origins in ischemic stroke. Using whole slide digital pathology images, you'll build a model that differentiates between the two major acute ischemic stroke (AIS) etiology subtypes: cardiac and large artery atherosclerosis.

Your work will enable healthcare providers to better identify the origins of blood clots in deadly strokes, making it easier for physicians to prescribe the best post-stroke therapeutic management and reducing the likelihood of a second stroke.

<b>Context</b>
 
Stroke remains the second-leading cause of death worldwide. Each year in the United States, over 700,000 individuals experience an ischemic stroke caused by a blood clot blocking an artery to the brain. A second stroke (23% of total events are recurrent) worsens the chances of the patient’s survival. However, subsequent strokes may be mitigated if physicians can determine stroke etiology, which influences the therapeutic management following stroke events.

During the last decade, mechanical thrombectomy has become the standard of care treatment for acute ischemic stroke from large vessel occlusion. As a result, retrieved clots became amenable to analysis. Healthcare professionals are currently attempting to apply deep learning-based methods to predict ischemic stroke etiology and clot origin. However, unique data formats, image file sizes, as well as the number of available pathology slides create challenges you could lend a hand in solving.

The Mayo Clinic is a nonprofit American academic medical center focused on integrated health care, education, and research. Stroke Thromboembolism Registry of Imaging and Pathology (STRIP) is a uniquely large multicenter project led by Mayo Clinic Neurovascular Lab with the aim of histopathologic characterization of thromboemboli of various etiologies and examining clot composition and its relation to mechanical thrombectomy revascularization.

To decrease the chances of subsequent strokes, the Mayo Clinic Neurovascular Research Laboratory encourages data scientists to improve artificial intelligence-based etiology classification so that physicians are better equipped to prescribe the correct treatment. New computational and artificial intelligence approaches could help save the lives of stroke survivors and help us better understand the world's second-leading cause of death.

This competition is about predicting the origin of the blood clots that cause ischemic stroke, using digital pathology slides, also known as whole slide images (WSI), of the thrombotic material extracted mechanically during acute neurovascular procedures. Collaborative efforts across 18 institutions, led by Waleed Brinjikji, MD, of the Mayo Clinic Rochester, allowed the compilation of this unique dataset (https://pubmed.ncbi.nlm.nih.gov/33722963/).

<a id="1"></a>
<h2 style='background:red; border:0; color:white'><center>Basic Data Exploration</center></h2>

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import cv2
import tifffile
from PIL import Image
from tqdm.auto import tqdm
import plotly.express as px

In [None]:
BASE_PATH = "../input/mayo-clinic-strip-ai/"
Image.MAX_IMAGE_PIXELS = 5_000_000_000

## Train dataset

In [None]:
df_train = pd.read_csv(
    os.path.join(BASE_PATH, "train.csv")
)
df_train

In [None]:
df_train.info()

## Test dataset

In [None]:
df_test = pd.read_csv(
    os.path.join(BASE_PATH, "test.csv")
)
df_test

## Other dataset

In [None]:
df_other = pd.read_csv(
    os.path.join(BASE_PATH, "other.csv")
)
df_other

### Sample Submission

In [None]:
df_sub = pd.read_csv(
    os.path.join(BASE_PATH, "sample_submission.csv"))

# Run the heavy image plots only when the sample submission has at most 4 rows
eda = len(df_sub) <= 4

df_sub

### Number of samples

In [None]:
n_pat = df_train["patient_id"].unique().size
print(f"Number of Train images: {df_train.shape[0]}")
print("Number of patients in Train:", n_pat)
print(f"Number of Test images: {df_test.shape[0]}")
print("Number of patients in Test:", df_test["patient_id"].unique().size)

In [None]:
df = pd.crosstab(index=df_train.patient_id, columns=df_train.label)
df.loc[df.CE>1,"CE"]=1
df.loc[df.LAA>1,"LAA"]=1

pCE = df.CE.sum()/n_pat
pLAA = df.LAA.sum()/n_pat

if df.sum().sum() == n_pat:
    print("Label is unique by Patient")
else:
    print("Label is not unique by Patient")
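
The same question can be asked more directly with `groupby` and `nunique`. The block below is a minimal sketch on a made-up toy frame (the `toy` rows are hypothetical, not taken from `train.csv`); it lists the patients whose images carry more than one label.

```python
import pandas as pd

# Toy stand-in for train.csv; these rows are made up for illustration.
toy = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2", "p3", "p3"],
    "label":      ["CE", "CE", "LAA", "CE", "LAA"],
})

# Number of distinct labels recorded for each patient.
labels_per_patient = toy.groupby("patient_id")["label"].nunique()

# Patients whose images carry more than one label.
conflicting = labels_per_patient[labels_per_patient > 1].index.tolist()
label_is_unique = len(conflicting) == 0
print(label_is_unique, conflicting)
```

On the real data, replacing `toy` with `df_train` should give the same verdict as the crosstab check, plus the offending patient ids if there are any.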

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 7))
axs[0] = sn.countplot(x="label", data=df_train, ax=axs[0])
axs[0].bar_label(axs[0].containers[0])
axs[1] = df_train[["label"]].value_counts().plot.pie(autopct='%1.1f%%', 
                                                     ylabel="label", 
                                                     labels = ["CE","LAA"], 
                                                     shadow=True)

In [None]:
df = df_train[["center_id"]].value_counts().reset_index(name="cnt")
lb = list(df.center_id.values)
od = df.sort_values("cnt", ascending=False).center_id

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 7))
cl = sn.color_palette("Paired")
axs[0] = sn.barplot(x="center_id",  y="cnt", data=df, 
                    order=od, palette=cl, ax=axs[0])
axs[0].bar_label(axs[0].containers[0])
axs[1] = df["cnt"].plot.pie(autopct='%1.1f%%', ylabel="center_id", 
                            labels=lb, shadow=True, colors=cl)

### Image_num distribution

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 7))
lb = list(range(5))
ex = (0,0.1,0.1,0.4,0.6)
axs[0] = sn.countplot(x="image_num", data=df_train, dodge=False, ax=axs[0])
axs[0].bar_label(axs[0].containers[0])
axs[1] = df_train[["image_num"]].value_counts().plot.pie(autopct='%1.1f%%', 
                                                         ylabel="image_num", 
                                                         labels=lb, explode=ex)
plt.show()

### Patients with more than two images

In [None]:
df = (df_train.groupby(["patient_id","label"])["image_num"].count().reset_index(name='image_count'))
df[df["image_count"]>2].set_index("patient_id").style.background_gradient(cmap='Reds')

### Number of Centers by Patients

In [None]:
# nunique() counts distinct centers per patient; count() would count image rows instead
df = df_train.groupby(["patient_id"])["center_id"].nunique().reset_index(name="center_count")
# Derive the pie labels from the data rather than assuming five categories
counts = df["center_count"].value_counts().sort_index()
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 7))
axs[0] = sn.countplot(x="center_count", data=df, dodge=False, ax=axs[0])
axs[0].bar_label(axs[0].containers[0])
axs[1] = counts.plot.pie(autopct='%1.1f%%', 
                         ylabel="center_count", 
                         labels=counts.index.tolist(), ax=axs[1])
plt.show()
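
When counting centers per patient, it helps to keep in mind that `count()` and `nunique()` answer different questions on a grouped column. The sketch below uses a made-up toy frame (hypothetical ids, not real data) to show the difference.

```python
import pandas as pd

# Toy frame: patient p1 has three images, all from the same center.
toy = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1", "p2"],
    "center_id":  [4, 4, 4, 7],
})

by_patient = toy.groupby("patient_id")["center_id"]

# count(): number of rows (images) per patient.
rows_per_patient = by_patient.count().to_dict()

# nunique(): number of distinct centers per patient.
centers_per_patient = by_patient.nunique().to_dict()

print(rows_per_patient)
print(centers_per_patient)
```

Here `p1` has three rows but only one distinct center, so only `nunique()` reflects how many centers contributed images for a patient.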


### Labels by Center

In [None]:
df = pd.crosstab(index=df_train.center_id, columns=df_train.label).reset_index()
df["LAA/CE"] = df.LAA/df.CE
ax = df.plot(x="center_id", y=["CE","LAA"], kind="bar", width=0.8, figsize=(14,7))
ax.bar_label(ax.containers[0])
ax.bar_label(ax.containers[1])
df.plot(y=["LAA/CE"], secondary_y="LAA/CE", color="lightgreen", linewidth=4, ax=ax);

### Labels by Image_num

In [None]:
df = pd.crosstab(index=df_train.image_num, columns=df_train.label).reset_index()
df["LAA/CE"] = df.LAA/df.CE
ax = df.plot(x="image_num", y=["CE","LAA"], kind="bar", width=0.8, figsize=(14,7))
ax.bar_label(ax.containers[0])
ax.bar_label(ax.containers[1]);

### Include Image Sizes in Train dataset

In [None]:
%%time
sizes = []
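for name in df_train["image_id"]:
    # Image.open is lazy: only the header is read, the pixel data is not decoded
    with Image.open(os.path.join(BASE_PATH, "train", f"{name}.tif")) as img:
        sizes.append({"img_height": img.height, 
                      "img_width": img.width, 
                      # pixel count in units of 1024**2 pixels (not file size)
                      "img_size": img.width*img.height/(1024**2)})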

df_train = pd.concat([pd.DataFrame(sizes),df_train], axis=1)
del sizes
df_train

In [None]:
df_train.describe()
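
To get a feel for what these dimensions mean in memory, here is a rough back-of-the-envelope sketch. It assumes 8-bit RGB pixels (3 bytes per pixel) and uses a hypothetical 40,000 x 30,000 slide; the helper `decoded_size_gib` is illustrative and not part of the notebook.

```python
def decoded_size_gib(height, width, channels=3, bytes_per_channel=1):
    """Approximate RAM needed to hold a fully decoded image, in GiB."""
    return height * width * channels * bytes_per_channel / 1024**3

# Hypothetical slide dimensions, chosen only for illustration.
example = decoded_size_gib(40_000, 30_000)
print(f"{example:.2f} GiB")
```

Under these assumptions a single slide of that size needs a few GiB once decoded, which is why the images are downscaled before plotting.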

### Image size distribution

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
for i, col in enumerate(["img_height", "img_width"]):
    _= sn.histplot(df_train[[col]], ax=axs[i], bins=40, kde=True)

### Image Sizes by Center

In [None]:
df = df_train.groupby(["center_id"])[["img_height", "img_width"]].mean().reset_index()
ax = df.plot(x="center_id", kind="bar", width=0.8, figsize=(14,7))

<a id="2"></a>
<h2 style='background:red; border:0; color:white'><center>Images Visualizations</center></h2>

In [None]:
#Utility functions

def read_image(image_id, dset, scale=None, verbose=1):
    with tifffile.TiffFile(os.path.join(BASE_PATH, dset, f"{image_id}.tif")) as tif:
        tif_tags = {}
        for tag in tif.pages[0].tags.values():
            name, value = tag.name, tag.value
            tif_tags[name] = value
        # Drop the bulky tile tables; pop() avoids a KeyError on non-tiled files
        tif_tags.pop("TileOffsets", None)
        tif_tags.pop("TileByteCounts", None)
        image = tif.pages[0].asarray()
    
    if verbose:
        print(f"[{image_id}] Image shape: {image.shape}")
    
    if scale:
        new_size = (image.shape[1] // scale, image.shape[0] // scale)
        image = cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)
        if image.shape[1]>1.5*image.shape[0]:
            out=cv2.transpose(image)
            image=cv2.flip(out,flipCode=0)
        
        if verbose:
            print(f"[{image_id}] Resized Image shape: {image.shape}")
        
    return image, tif_tags

def plot_image(image, image_id):
    plt.figure(figsize=(16, 10))
    plt.imshow(image)
    plt.title(f"Image {image_id}", fontsize=18)  
    plt.axis('off')
    plt.show()
    
def plot_list_img(sample_ids, dset, scale=20):
    sample_images = []
    for sample_id in sample_ids:
        sample_images.append(read_image(sample_id, dset, scale=scale, verbose=0)[0])
    plt.figure(figsize=(16, 16))
    for ind, (tmp_id, tmp_image) in enumerate(zip(sample_ids, sample_images)):
        plt.subplot(2, 5, ind + 1)
        plt.imshow(tmp_image)
        plt.title(f"{tmp_id}", fontsize=10) 
        plt.axis("off")

### Interactive View

In [None]:
if eda:
    img_id = "026c97_0"
    img, tags = read_image(img_id, "train", scale=2)
    print(tags)
    fig = px.imshow(img)
    fig.show()

## Train images 

### Images of patient id = 09644e (CE)

In [None]:
if eda:
    sample_ids = ["09644e_0","09644e_1","09644e_2","09644e_3","09644e_4"]
    plot_list_img(sample_ids, "train", scale=100)

### Images of patient id = 91b9d3 (LAA)

In [None]:
if eda:
    sample_ids = ["91b9d3_0","91b9d3_1","91b9d3_2","91b9d3_3","91b9d3_4"]
    plot_list_img(sample_ids, "train", scale=100)

### CE class sample

In [None]:
if eda:
    sample_ids = df_train[df_train.label=="CE"].image_id[:10].values
    plot_list_img(sample_ids, "train", scale=100)

### LAA class sample

In [None]:
if eda:
    sample_ids = df_train[df_train.label=="LAA"].image_id[:10].values
    plot_list_img(sample_ids, "train", scale=100)

## Test Images

In [None]:
if eda:
    sample_ids = df_test.image_id[:4].values
    plot_list_img(sample_ids, "test", scale=100)

### Compare images (Test = Train ?)

In [None]:
if eda:
    for image_id in list(df_test.image_id[:4].values):
        image_train = read_image(image_id, "train", 100, verbose=0)[0]
        image_test = read_image(image_id, "test", 100, verbose=0)[0]
        diff = cv2.absdiff(image_train, image_test)
        print(f"Sum of DIFF of images {image_id} = ", diff.sum())
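
A sum of absolute differences equal to zero means the two downscaled images are pixel-identical. As a minimal sketch of the same idea using only NumPy, the hypothetical helper `images_identical` below also handles images of different shapes, returning `False` for them instead of raising an error. The arrays are synthetic, not competition images.

```python
import numpy as np

def images_identical(a, b):
    """True only when both arrays have the same shape and identical pixel values."""
    return a.shape == b.shape and bool(np.array_equal(a, b))

# Synthetic 8x6 RGB image standing in for a downscaled slide.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 6, 3), dtype=np.uint8)

same = images_identical(img, img.copy())

changed = img.copy()
changed[0, 0, 0] ^= 1  # flip one bit in a single channel of one pixel
different = images_identical(img, changed)

cropped = images_identical(img, img[:4])  # shape mismatch

print(same, different, cropped)
```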

<a id="3"></a>
<h2 style='background:red; border:0; color:white'><center>Baseline Submission</center></h2>

In [None]:
df_sub.CE = 0.45
df_sub.LAA = 0.55
df_sub.to_csv("submission.csv", index=False)
df_sub 
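
To see what a constant prediction costs, the sketch below computes the log loss of always predicting 0.45 for CE and 0.55 for LAA. Two assumptions are made here that the notebook itself does not state: that submissions are scored with a log-loss-type metric, and that both classes are weighted equally. The helper `constant_log_loss` is hypothetical.

```python
import math

def constant_log_loss(p_ce, p_laa):
    """Per-class and class-averaged log loss for one constant prediction.

    Assumes equal class weights, which may not match the real metric.
    """
    loss_ce = -math.log(p_ce)    # loss on every image whose true label is CE
    loss_laa = -math.log(p_laa)  # loss on every image whose true label is LAA
    return loss_ce, loss_laa, (loss_ce + loss_laa) / 2

loss_ce, loss_laa, balanced = constant_log_loss(0.45, 0.55)
print(f"CE: {loss_ce:.4f}  LAA: {loss_laa:.4f}  averaged: {balanced:.4f}")
```

Any model worth submitting should score below the averaged value this prints, under the same assumptions.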

### WORK IN PROGRESS...