# Skin Images to Features

**Examining what we have**

**What diseases do we have?**

In medicine, dx is an abbreviation for diagnosis. In this dataset the `dx` column holds short codes for the diagnostic categories listed below.

The original dataset description gives this fairly technical medical detail:
Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions:

- Actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec)
- basal cell carcinoma (bcc)
- benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl)
- dermatofibroma (df)
- melanoma (mel)
- melanocytic nevi (nv)
- vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc).

### Simplified

- nv  →  melanocytic nevi  →  0
- mel  →  melanoma  →  1
- bcc  →  basal cell carcinoma  →  2
- akiec  →  Actinic keratoses and intraepithelial carcinoma  →  3
- vasc  →  vascular lesions  →  4
- bkl  →  benign keratosis-like lesions  →  5
- df  →  dermatofibroma  →  6

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
import seaborn as sns
plt.rcParams["figure.figsize"] = (15, 10)
plt.rcParams["figure.dpi"] = 125
plt.rcParams["font.size"] = 14
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['DejaVu Sans']
plt.style.use('ggplot')
sns.set_style("whitegrid", {'axes.grid': False})
plt.rcParams['image.cmap'] = 'viridis' # default colormap for single-channel images

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
from skimage.io import imread
from skimage.util import montage as montage2d
from skimage.color import label2rgb
from PIL import Image
base_dir = Path('..') / 'input' / 'skin-cancer-mnist-ham10000'

In [None]:
#Load and Process Data

image_overview_df = pd.read_csv(base_dir / 'HAM10000_metadata.csv')
all_image_ids = {c_path.stem: c_path for c_path in base_dir.glob('**/*.jpg')}
image_overview_df['image_path'] = image_overview_df['image_id'].map(all_image_ids.get)
image_overview_df.dropna(inplace=True) # drop rows with any missing value, including images not found on disk
print(image_overview_df.shape[0], 'image, metadata pairs loaded')
image_overview_df.sample(3)

In [None]:
# drop the numeric 'age' column so that describe() summarizes the remaining non-numeric columns
image_overview_df.drop(['age'], axis=1).describe()

In [None]:
dx_name_dict = {
    'nv': 'melanocytic nevi',
    'mel': 'melanoma',
    'bcc': 'basal cell carcinoma',
    'akiec': 'Actinic keratoses and intraepithelial carcinoma',
    'vasc': 'vascular lesions',
    'bkl': 'benign keratosis-like',
    'df': 'dermatofibroma'
}
image_overview_df['dx_name'] = image_overview_df['dx'].map(dx_name_dict.get)
dx_name_id_dict = {name: idx for idx, name in enumerate(dx_name_dict.keys())}
image_overview_df['dx_id'] = image_overview_df['dx'].map(dx_name_id_dict.get).astype(int)
image_overview_df.sample(3)

In [None]:
image_overview_df['dx_name'].value_counts()

In [None]:
fig, m_axs = plt.subplots(3, 3, figsize=(20, 20))
for c_ax, (_, c_row) in zip(m_axs.flatten(), 
                            image_overview_df.head(9).iterrows()):
    c_ax.imshow(imread(c_row['image_path']))
    c_ax.set_title('{dx_name}\nAge: {age}, Loc: {localization}'.format(**c_row))
    c_ax.axis('off')

**Create Color Features**

**We start with simple color features: group the pixels of each image into a small set of colors and count how many pixels fall into each group**

In [None]:
test_row = image_overview_df.iloc[1]
print(test_row)

**Reduce the number of colors**

**Currently each pixel has 3 channels (red, green, blue) with 8 bits each, which gives 256³ = 16,777,216 possible colors. Converting the image to an 8-bit palette format leaves at most 256 colors, reducing the number of colors by a factor of 65,536.**
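
The arithmetic behind these numbers can be checked directly. This is only a small worked example; the variable names are made up for illustration and are not used elsewhere in the notebook.

```python
# Worked arithmetic for the color counts discussed above.
bits_per_channel = 8
channels = 3

levels_per_channel = 2 ** bits_per_channel        # 256 intensity levels per channel
full_colors = levels_per_channel ** channels      # all possible RGB combinations
palette_colors = 2 ** 8                           # an 8-bit palette holds at most 256 entries
reduction_factor = full_colors // palette_colors  # how many RGB colors share one palette slot

print(f"{full_colors:,} RGB colors -> {palette_colors} palette colors "
      f"(factor {reduction_factor:,})")
```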

In [None]:
test_image = Image.open(test_row['image_path']) # normal image
# convert to 8-bit palette color (as used in GIF files) and then back to RGB
web_image = test_image.convert('P', palette='WEB', dither=None)
few_color_image = web_image.convert('RGB')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.imshow(test_image)
ax2.imshow(few_color_image)

In [None]:
print('Unique colors before', len(set([tuple(rgb) for rgb in np.array(test_image).reshape((-1, 3))])))
print('Unique colors after', len(set([tuple(rgb) for rgb in np.array(few_color_image).reshape((-1, 3))])))
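
Building a Python set of tuples works but can be slow on large images. A possible alternative, sketched here under the assumption that the image is an (H, W, 3) array, is to let NumPy find the unique rows. The helper name `count_unique_colors` and the tiny synthetic image are hypothetical and only serve the illustration.

```python
import numpy as np

def count_unique_colors(rgb_array):
    """Count distinct RGB triples in an (H, W, 3) array."""
    pixels = np.asarray(rgb_array).reshape(-1, 3)
    # np.unique with axis=0 treats every row (one pixel) as a single item
    return np.unique(pixels, axis=0).shape[0]

# Synthetic 4x4 image (an assumption for illustration): top half red, bottom half black.
demo = np.zeros((4, 4, 3), dtype=np.uint8)
demo[:2] = (255, 0, 0)
n_demo_colors = count_unique_colors(demo)
print(n_demo_colors)
```

In the notebook this could be called as `count_unique_colors(np.array(test_image))`.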

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))
value_bins = np.arange(257)  # 257 edges so each value 0..255 gets its own bin
for c_channel, c_name in enumerate(['red', 'green', 'blue']):
    ax1.hist(np.array(test_image)[:, :, c_channel].ravel(),
             color=c_name[0],
             label=c_name,
             bins=value_bins,
             alpha=0.5)
    ax2.hist(np.array(few_color_image)[:, :, c_channel].ravel(),
             color=c_name[0],
             label=c_name,
             bins=value_bins,
             alpha=0.5)
ax1.set_title('Original image')
ax2.set_title('Reduced-color image')
ax1.legend()

**How do the colors look?**

In [None]:
idx_to_color = np.array(web_image.getpalette()).reshape((-1, 3))/255.0

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))
ax1.imshow(few_color_image)
counts, bins = np.histogram(np.array(web_image), bins=np.arange(257))  # one bin per palette index 0..255
for i in range(counts.shape[0]):
    ax2.bar(bins[i], counts[i], color=idx_to_color[i])
ax2.set_yscale('log')
ax2.set_xlabel('Color Id')
ax2.set_ylabel('Pixel Count')

**Calculate for Many Images**

In [None]:
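def color_count_feature(in_path):
    """Return the fraction of pixels assigned to each of the 256 palette colors."""
    raw_image = Image.open(in_path)
    web_image = raw_image.convert('P', palette='WEB', dither=None)
    # 257 bin edges so that every palette index 0..255 gets its own bin
    counts, bins = np.histogram(np.array(web_image).ravel(), bins=np.arange(257))
    return counts*1.0/np.prod(web_image.size) # normalize output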
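
The counting step can be sanity-checked without loading any image. The sketch below works on a small array of palette indices and swaps in `np.bincount` for `np.histogram`, so it is an alternative formulation rather than the function above. The name `palette_index_fractions` and the toy index array are assumptions for illustration.

```python
import numpy as np

def palette_index_fractions(index_array, n_colors=256):
    """Fraction of pixels assigned to each palette index (length n_colors)."""
    flat = np.asarray(index_array).ravel()
    # bincount gives one count per integer value; minlength pads unused indices with 0
    counts = np.bincount(flat, minlength=n_colors)
    return counts / flat.size

# Hypothetical 2x3 palette image: index 0 three times, index 5 twice, index 255 once.
demo_indices = np.array([[0, 0, 5], [5, 0, 255]], dtype=np.uint8)
demo_feature = palette_index_fractions(demo_indices)
print(demo_feature.shape, demo_feature.sum())
```

Because every pixel lands in exactly one slot, the resulting vector should sum to 1, which makes features comparable across images of different sizes.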

In [None]:
%%time
image_subset_df = image_overview_df.sample(100).copy()
image_subset_df['color_features'] = image_subset_df['image_path'].map(color_count_feature)
image_subset_df.sample(3)

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 10))
# rows are images, columns are palette colors
combined_features = np.stack(image_subset_df['color_features'].values, 0)
ax1.imshow(combined_features)
ax1.set_title('Raw Color Counts')
ax1.set_xlabel('Color Id')
ax1.set_ylabel('Image')
# divide each column by its mean over all images (colors that never occur give NaN)
color_wise_average = np.tile(np.mean(combined_features, 0, keepdims=True), (combined_features.shape[0], 1))
ax2.imshow(combined_features/color_wise_average, vmin=0.05, vmax=20)
ax2.set_title('Normalized Color Counts')
ax2.set_xlabel('Color Id')
ax2.set_ylabel('Image')
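
Dividing by the per-color average produces NaN for any color that never appears in the sampled images. One possible way to guard against that, shown here as a sketch on a toy matrix, is to divide only where the mean is non-zero. The helper `normalize_by_color_mean` and the `eps` threshold are assumptions, not part of the original analysis.

```python
import numpy as np

def normalize_by_color_mean(features, eps=1e-12):
    """Divide each column by its mean; columns with (near) zero mean stay at 0."""
    features = np.asarray(features, dtype=float)
    col_mean = features.mean(axis=0, keepdims=True)   # shape (1, n_colors)
    out = np.zeros_like(features)
    # broadcasting replaces np.tile; `where` skips columns that would divide by zero
    np.divide(features, col_mean, out=out, where=col_mean > eps)
    return out

# Toy matrix (assumed values): 3 images x 3 colors, the last color is never used.
toy = np.array([[0.50, 0.50, 0.0],
                [0.25, 0.75, 0.0],
                [0.75, 0.25, 0.0]])
toy_normalized = normalize_by_color_mean(toy)
print(toy_normalized)
```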

**PCA Components**

**We can use a tool called principal component analysis (PCA) to project each image's color features onto two dimensions for plotting**

In [None]:
from sklearn.decomposition import PCA
xy_pca = PCA(n_components=2)
xy_coords = xy_pca.fit_transform(combined_features)
image_subset_df['x'] = xy_coords[:, 0]
image_subset_df['y'] = xy_coords[:, 1]
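
To make the projection less of a black box, here is a minimal sketch of the same idea using only NumPy's SVD instead of scikit-learn. It is a swapped-in illustration, not what `PCA` does internally in every detail, and the signs of the axes may come out flipped relative to scikit-learn. The helper `pca_2d` and the random stand-in data are hypothetical.

```python
import numpy as np

def pca_2d(features):
    """Project the rows of `features` onto their first two principal axes."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Share of the total variance captured by each of the two axes.
    explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
    return coords, explained

rng = np.random.default_rng(0)
demo_features = rng.random((20, 8))   # hypothetical stand-in for combined_features
demo_coords, demo_explained = pca_2d(demo_features)
print(demo_coords.shape, demo_explained)
```

The explained-variance shares give a rough sense of how much of the color variation the 2D plot can show.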

In [None]:
fig, ax1 = plt.subplots(1,1, figsize=(15, 15))
for _, c_row in image_subset_df.iterrows():
    ax1.plot(c_row['x'], c_row['y'], '*')
    ax1.text(s=c_row['dx_name'][:15], x=c_row['x'], y=c_row['y'])

In [None]:
def show_xy_images(in_df, image_zoom=1):
    fig, ax1 = plt.subplots(1,1, figsize=(10, 10))
    artists = []
    for _, c_row in in_df.iterrows():
        c_img = Image.open(c_row['image_path']).resize((64, 64))
        img = OffsetImage(c_img, zoom=image_zoom)
        ab = AnnotationBbox(img, (c_row['x'], c_row['y']), xycoords='data', frameon=False)
        artists.append(ax1.add_artist(ab))
    ax1.update_datalim(in_df[['x', 'y']])
    ax1.autoscale()
    ax1.axis('off')
show_xy_images(image_subset_df)

**TSNE Representation**

**Rather than using simple PCA, we can use a fancier nonlinear embedding called t-SNE (t-distributed stochastic neighbor embedding)**

In [None]:
from sklearn.manifold import TSNE
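# n_iter is the number of optimization iterations (scikit-learn 1.5+ calls this parameter max_iter)
tsne = TSNE(n_iter=250, verbose=True)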
xy_coords = tsne.fit_transform(combined_features)
image_subset_df['x'] = xy_coords[:, 0]
image_subset_df['y'] = xy_coords[:, 1]

In [None]:
show_xy_images(image_subset_df)

**Calculate for all images**

In [None]:
%%time
image_overview_df['color_features'] = image_overview_df['image_path'].map(color_count_feature).map(lambda x: x.tolist())
image_overview_df.sample(3)

In [None]:
# convert Path objects to plain strings before writing JSON
image_overview_df['image_path'] = image_overview_df['image_path'].map(str)

In [None]:
image_overview_df.to_json('color_features.json')
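
A later notebook would need to turn the saved lists back into a feature matrix. The sketch below shows one way this round trip might look on a tiny stand-in frame written to a temporary directory. The file name `demo_features.json` and the values are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values) with list-valued features.
demo_df = pd.DataFrame({
    'image_id': ['a', 'b'],
    'color_features': [[0.25, 0.75], [0.5, 0.5]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, 'demo_features.json')
    demo_df.to_json(json_path)          # same call as used for the full table
    loaded_df = pd.read_json(json_path)

# Stack the per-image lists back into an (n_images, n_colors) array.
feature_matrix = np.array(loaded_df['color_features'].tolist())
print(feature_matrix.shape)
```

Storing the features as plain lists is what makes this work, since NumPy arrays are not directly JSON serializable.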

In [None]:
!ls -lh