# Cervix EDA

In this competition we have a multi-class classification problem with **three** classes. We are asked, given an image, to identify the cervix type.

From the data description:

*In this competition, you will develop algorithms to correctly classify cervix types based on cervical images. These different types of cervix in our data set are all considered normal (not cancerous), but since the transformation zones aren't always visible, some of the patients require further testing while some don't. This decision is very important for the healthcare provider and critical for the patient. Identifying the transformation zones is not an easy task for the healthcare providers, therefore, an algorithm-aided decision will significantly improve the quality and efficiency of cervical cancer screening for these patients.*

The submission format asks for a probability for each of the three cervix types.
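
As a rough illustration of what "a probability for each of the three cervix types" can look like in practice, here is a minimal sketch of a uniform baseline submission. The column names (`image_name`, `Type_1`, `Type_2`, `Type_3`) and the image names are assumptions made for this example, not taken from the data description above.

```python
import numpy as np
import pandas as pd

# Hypothetical test-image names; the real ones would come from the test folder.
image_names = ["0.jpg", "1.jpg", "2.jpg"]

# Assumed column names for the three classes (not confirmed by the text above).
class_columns = ["Type_1", "Type_2", "Type_3"]

# Uniform baseline: every class gets the same probability, so each row sums to 1.
probabilities = np.full(
    (len(image_names), len(class_columns)), 1.0 / len(class_columns)
)

submission = pd.DataFrame(probabilities, columns=class_columns)
submission.insert(0, "image_name", image_names)
print(submission)
```

A baseline like this is mainly useful as a sanity check on the file layout before any real model is trained.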

In this notebook we will be looking at:

* basic dataset stats like number of samples per class, image sizes
* different embeddings of RGB image space
* pairwise distances and a clustermap of images in RGB space
* (linear) model selection with basic multi-class evaluation metrics


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from skimage.io import imread, imshow
import cv2

%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from subprocess import check_output
print(check_output(["ls", "../input/train"]).decode("utf8"))

We are given training images for each of the cervix types. Let's first count them for each class.

In [2]:
from glob import glob
basepath = '../input/train/'

all_cervix_images = []

for path in sorted(glob(basepath + "*")):
    cervix_type = path.split("/")[-1]
    cervix_images = sorted(glob(basepath + cervix_type + "/*"))
    all_cervix_images = all_cervix_images + cervix_images

all_cervix_images = pd.DataFrame({'imagepath': all_cervix_images})
# file extension is the part after the last dot, class label is the parent folder name
all_cervix_images['filetype'] = all_cervix_images['imagepath'].str.split(".").str[-1]
all_cervix_images['type'] = all_cervix_images['imagepath'].str.split("/").str[-2]
all_cervix_images.head()
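
The same path parsing can also be sketched with `pathlib`, which avoids manual string splitting. The helper name `build_image_table` and the example paths (including the `Type_1` / `Type_2` folder names) are hypothetical and only serve to illustrate the idea.

```python
from pathlib import PurePosixPath

import pandas as pd


def build_image_table(paths):
    """Return a DataFrame with the path, file extension and class folder of each image."""
    rows = []
    for p in paths:
        parsed = PurePosixPath(p)
        rows.append({
            "imagepath": p,
            "filetype": parsed.suffix.lstrip("."),  # extension without the dot
            "type": parsed.parent.name,             # name of the containing folder
        })
    return pd.DataFrame(rows, columns=["imagepath", "filetype", "type"])


# Hypothetical paths that mimic the folder layout used above.
example = build_image_table([
    "../input/train/Type_1/0.jpg",
    "../input/train/Type_2/7.jpg",
])
print(example)
```

Wrapping the logic in a function would also let the same code be reused for any other folder with the same layout.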

## Image types

Now that we have the data in a handy dataframe we can do a few aggregations on the data. Let us first see how many images there are for each cervix type and which file types they have.

All files are in JPG format. Type 2 is the most common class, with slightly more than 50% of the training images, while Type 1 accounts for slightly less than 20%.

In [None]:
print('We have a total of {} images in the whole dataset'.format(all_cervix_images.shape[0]))
type_aggregation = all_cervix_images.groupby(['type', 'filetype']).agg('count')
type_aggregation_p = type_aggregation.apply(lambda row: 1.0*row['imagepath']/all_cervix_images.shape[0], axis=1)

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))

type_aggregation.plot.barh(ax=axes[0])
axes[0].set_xlabel("image count")
type_aggregation_p.plot.barh(ax=axes[1])
axes[1].set_xlabel("training size fraction")
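
If only the class fractions are needed (without the file type breakdown), `value_counts(normalize=True)` gives them directly. The sketch below uses a small made-up table rather than the real data, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the image table; the labels are made up for illustration.
toy_images = pd.DataFrame({"type": ["Type_1", "Type_2", "Type_2", "Type_3"]})

# Fraction of rows per class, sorted by class label.
class_fractions = toy_images["type"].value_counts(normalize=True).sort_index()
print(class_fractions)
```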

Now, let's read one file for each type to get an idea of what the images look like.

The images seem to vary a lot in their formats: the first two samples show the actual image only within a circular area, while the last sample shows it in a rectangle.

In [None]:
fig = plt.figure(figsize=(12,8))

i = 1
for t in all_cervix_images['type'].unique():
    ax = fig.add_subplot(1,3,i)
    i+=1
    f = all_cervix_images[all_cervix_images['type'] == t]['imagepath'].values[0]
    plt.imshow(plt.imread(f))
    plt.title('sample for cervix {}'.format(t))
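
One rough way to tell circular-framed images from rectangular ones automatically is to check whether the image corners are close to black. This is only a heuristic sketch, not something the notebook itself does: the function names, the patch size, and the threshold are assumptions, and the images here are synthetic.

```python
import numpy as np


def corner_darkness(image, patch=10):
    """Mean intensity (0-255 scale) of the four corner patches of an H x W x 3 image."""
    corners = [
        image[:patch, :patch],
        image[:patch, -patch:],
        image[-patch:, :patch],
        image[-patch:, -patch:],
    ]
    return float(np.mean([corner.mean() for corner in corners]))


def looks_circular(image, threshold=20.0):
    """Guess that the content sits in a circular frame if the corners are nearly black."""
    return corner_darkness(image) < threshold


# Synthetic examples: a bright disc on a black background vs. a fully bright image.
h = w = 100
yy, xx = np.mgrid[:h, :w]
disc = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= (h / 2) ** 2

circular = np.zeros((h, w, 3), dtype=np.uint8)
circular[disc] = 200
rectangular = np.full((h, w, 3), 200, dtype=np.uint8)

print(looks_circular(circular), looks_circular(rectangular))
```

On real photos the threshold would likely need tuning, since dark corners can also come from poor lighting.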

### Additional images

In [None]:
print(check_output(["ls", "../input/additional"]).decode("utf8"))

In [None]:
basepath = '../input/additional/'

all_cervix_images_a = []

for path in sorted(glob(basepath + "*")):
    cervix_type = path.split("/")[-1]
    cervix_images = sorted(glob(basepath + cervix_type + "/*"))
    all_cervix_images_a = all_cervix_images_a + cervix_images

all_cervix_images_a = pd.DataFrame({'imagepath': all_cervix_images_a})
# file extension is the part after the last dot, class label is the parent folder name
all_cervix_images_a['filetype'] = all_cervix_images_a['imagepath'].str.split(".").str[-1]
all_cervix_images_a['type'] = all_cervix_images_a['imagepath'].str.split("/").str[-2]
all_cervix_images_a.head()

In [None]:
print('We have a total of {} images in the additional dataset'.format(all_cervix_images_a.shape[0]))
type_aggregation = all_cervix_images_a.groupby(['type', 'filetype']).agg('count')
type_aggregation_p = type_aggregation.apply(lambda row: 1.0*row['imagepath']/all_cervix_images_a.shape[0], axis=1)

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))

type_aggregation.plot.barh(ax=axes[0])
axes[0].set_xlabel("image count")
type_aggregation_p.plot.barh(ax=axes[1])
axes[1].set_xlabel("training size fraction")

In [None]:
fig = plt.figure(figsize=(12,8))

i = 1
for t in all_cervix_images_a['type'].unique():
    ax = fig.add_subplot(1,3,i)
    i+=1
    f = all_cervix_images_a[all_cervix_images_a['type'] == t]['imagepath'].values[0]
    plt.imshow(plt.imread(f))
    plt.title('sample for cervix {}'.format(t))

### All images

In [None]:
# stack training and additional images; ignore_index avoids duplicate row labels
all_cervix_images_ = pd.concat([all_cervix_images, all_cervix_images_a], ignore_index=True)
print(all_cervix_images_)

In [None]:
print('We have a total of {} images in the whole dataset'.format(all_cervix_images_.shape[0]))
type_aggregation = all_cervix_images_.groupby(['type', 'filetype']).agg('count')
# divide by the size of the combined table, not the additional set alone
type_aggregation_p = type_aggregation.apply(lambda row: 1.0*row['imagepath']/all_cervix_images_.shape[0], axis=1)

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))

type_aggregation.plot.barh(ax=axes[0])
axes[0].set_xlabel("image count")
type_aggregation_p.plot.barh(ax=axes[1])
axes[1].set_xlabel("training size fraction")
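
To compare the class balance of the two sources side by side, one option is to tag each row with its origin before concatenating and then normalize per source with `pd.crosstab`. The sketch below uses toy tables and a hypothetical `source` column; it is not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for the training and additional image tables.
train = pd.DataFrame({"type": ["Type_1", "Type_2", "Type_2", "Type_3"]})
additional = pd.DataFrame({"type": ["Type_1", "Type_1", "Type_2", "Type_3"]})

# Tag each row with its origin, then stack the two tables.
combined = pd.concat(
    [train.assign(source="train"), additional.assign(source="additional")],
    ignore_index=True,
)

# Each row of the result sums to 1: class fractions within one source.
fractions_by_source = pd.crosstab(
    combined["source"], combined["type"], normalize="index"
)
print(fractions_by_source)
```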

In [None]:
fig = plt.figure(figsize=(12,8))

i = 1
for t in all_cervix_images_['type'].unique():
    ax = fig.add_subplot(1,3,i)
    i+=1
    f = all_cervix_images_[all_cervix_images_['type'] == t]['imagepath'].values[0]
    plt.imshow(plt.imread(f))
    plt.title('sample for cervix {}'.format(t))