# Project 3: Image Recognition in Medical Treatment
## Diagnosing Skin Lesions using Machine Learning in Image Processing 
---

**Group 9: Aidan Stocks, Hugo Reinicke, Nicola Clark, Jonas-Mika Senghaas**

Submission: *19.04.2021* / Last Modified: *08.04.2021*

---

This notebook contains the step-by-step data science process performed on the *ISIC 2017* public test data and official training data on medical image recognition. The goal of this project was to extract and automatically analyse features from medical images of skin lesions in order to predict whether or not the person has ** using machine learning and image processing.

The initial data (containing the medical images, masked images and information on features and disease) was given for 150 medical images (equivalent to the public test data of the *ISIC 2017* challenge) by the project manager *Veronika *.
To develop more accurate models, we extended the initially given data with the official training data, which can be obtained from the official [ISIC 2017 Website](https://challenge.isic-archive.com/data).

## Introduction
---
The amount of medical imaging - just as data in any other field - has increased tremendously within the last decade, making it more and more difficult to manually inspect medical images for diagnosis purposes.

Furthermore, people are often hesitant to visit a doctor for seemingly 'light' issues that do not seem important enough to occupy a doctor's time. Since skin diseases can be treated most effectively when they are detected early, this hesitation can be fatal.
An easy-to-use app that implements automated detection of skin diseases from the sofa could help address this issue - ultimately saving lives.

## Running this Notebook
---
This notebook contains all the code needed to reproduce the findings of the project, as can be seen on the [GitHub](https://github.com/jonas-mika/fyp2021p03g09) page of this project. In order to read in the data correctly, the global paths configured in the section `Constants` need to be correct. The following file structure - as prepared in the `submission.zip` - was used throughout the project and is recommended (alternatively, the paths in the section `Constants` can be adjusted):

```
*project tree structure*
```
*Note that the rest of the file structure, as can be seen on the [GitHub](https://github.com/jonas-mika/fyp2021p03g09) page of the project, is generated automatically.*

## Required Libraries and Further Imports
---
Throughout the project, we will use a range of both built-in and external Python libraries. This notebook will only run if all libraries and modules are correctly installed on your local machine.
To install missing packages, use `pip install <package_name>` (pip is the standard package installer for Python and downloads packages from PyPI, the Python Package Index; read more [here](https://pypi.org/project/pip/)).
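
Before running the notebook, it can help to see which of the external packages are actually importable. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the list `REQUIRED` are hypothetical and not part of the project code.

```python
import importlib.util

# Import names of the external packages used in this notebook.
# Note that scikit-learn, scikit-image and Pillow are imported
# as sklearn, skimage and PIL respectively.
REQUIRED = ["pandas", "numpy", "matplotlib", "sklearn", "skimage", "PIL", "scipy"]

def find_missing(modules):
    """Return the names in `modules` that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

Any name reported as missing needs to be installed under its distribution name, for example `pip install scikit-learn` for `sklearn`.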

If you would like further information about the packages used, the following links lead to their detailed documentation:
- [Pandas](https://pandas.pydata.org/)
- [Numpy](https://numpy.org/)
- [Matplotlib](https://matplotlib.org/stable/index.html)
- [PIL](https://pillow.readthedocs.io/en/stable/)
- [SciKit Learn](https://scikit-learn.org/stable/)
- [SciKit Image](https://scikit-image.org/)
- [Scipy](https://www.scipy.org/)

In [115]:
%%capture
!pip install scikit-image

In [3]:
# external libraries
import pandas as pd                                    # provides major datastructure pd.DataFrame() to store datasets
import numpy as np                                     # used for numerical calculations and fast array manipulations
import matplotlib.pyplot as plt                        # visualisation of data
import sklearn                                         # machine learning in python
import skimage                                         # image processing in python
from PIL import Image                                  # Pillow, a fork of PIL (python imaging library), deals with images in python

# python standard libraries
import json                                            # data transfer to json format
import os                                              # automates saving of export files (figures, summaries, ...)
import random                                          # randomness in coloring of plots
import re                                              # regular expressions, used to filter filenames when listing the image directory

Since this project makes heavy use of functions to achieve maximal efficiency, all functions are stored externally in the package `project3`. The following imports are necessary for this notebook to run properly.

In [6]:
from project3.processing import ...
from project3.save import ...
from project3.features import ...

**Remark**: All functions used in this project are documented in their *docstring*. To display the docstring and get a short summary of a function, the specification of its input arguments (including data type and a short explanation) and its return value, type `?<function_name>` in Jupyter.
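
As an illustration of what such a docstring could look like, here is a small sketch. The function `lesion_area` is hypothetical and is not taken from the `project3` package; outside of Jupyter, `help(lesion_area)` shows the same text as `?lesion_area`.

```python
import inspect

def lesion_area(mask):
    """Count the lesion pixels in a binary mask.

    Parameters
    ----------
    mask : list of list of int
        Rows of the binary mask; non-zero entries belong to the lesion.

    Returns
    -------
    int
        Number of non-zero pixels in the mask.
    """
    return sum(1 for row in mask for pixel in row if pixel)

# inspect.getdoc() returns the docstring with its indentation cleaned up
print(inspect.getdoc(lesion_area))
```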

## Constants
---
To enhance readability and to decrease the maintenance effort, it is useful in bigger projects to define, in advance, the constants that need to be accessed globally throughout the whole notebook.
The following cell contains all of those global constants. By convention, we write them in caps (https://www.python.org/dev/peps/pep-0008/#constants).

In [101]:
# path lookup dictionary to store the relative paths from the directory containing the jupyter notebooks to important directories in the project
PATH = {}

PATH['data'] = {}
PATH['data']['raw'] = "../data/raw/"
PATH['data']['processed'] = "../data/processed/"
PATH['data']['external'] = "../data/external/"

PATH['images'] = 'images/'
PATH['masks'] = 'masks/'

PATH['reports'] = "../reports/"

# filename lookup dictionary storing the most relevant filenames
FILENAME = {}
FILENAME['diagnosis'] = 'diagnosis.csv'
FILENAME['features'] = 'features.csv'

# the images directory contains 57 superpixel images that we want to deal with separately
FILENAME['images'] = sorted([image[:-4] for image in os.listdir(PATH['data']['raw'] + PATH['images']) if not re.match('.*super.*', image)])
FILENAME['superpixels'] = sorted([image for image in os.listdir(PATH['data']['raw'] + PATH['images']) if re.match('.*super.*', image)])
FILENAME['masks'] = sorted([mask[:-4] for mask in os.listdir(PATH['data']['raw'] + PATH['masks'])])

# defining two dictionaries to store data. each dictionary will reference several pandas dataframes
DATA = {}
DATA_EXTERNAL = {}

NAMES = {}
NAMES['datasets'] = ['diagnosis', 'features']
NAMES['images'] = ['images', 'masks']
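
Because every later cell depends on these paths, a quick check that the configured directories exist can save some debugging. The sketch below is only an illustration: `missing_directories` is a hypothetical helper, and the demonstration uses a temporary directory instead of the real project tree.

```python
import os
import tempfile

def missing_directories(paths):
    """Recursively collect all directory paths in a nested dict that do not exist."""
    missing = []
    for value in paths.values():
        if isinstance(value, dict):
            missing.extend(missing_directories(value))
        elif not os.path.isdir(value):
            missing.append(value)
    return missing

# demonstration on a temporary tree in which only 'raw' exists
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "raw"))
    example = {
        "raw": os.path.join(root, "raw"),
        "processed": os.path.join(root, "processed"),
    }
    demo_missing = missing_directories(example)

print(demo_missing)
```

In the notebook this could be called as `missing_directories(PATH['data'])`. The entries `PATH['images']` and `PATH['masks']` are relative to the raw data directory, so they would have to be joined with `PATH['data']['raw']` before being checked.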

*TASK 0*
# Data Exploration

---


### Loading in Data

---

The task involves different sources of data, namely:

> **Images**: 150 Medical Images of Skin Lesions

> **Masks**: 150 Binary Masks corresponding to each Image that masks the region of the Skin Lesion

> **Diagnosis**: Dataset storing whether or not the lesion was either *melanoma* or *seborrheic_keratosis* through binary values

> **Features**: Dataset storing the area and perimeter of the skin lesion for each image

We load the csv datasets into individual pandas `DataFrame`s using the pandas function `pd.read_csv()` and store them in our `DATA` dictionary under the corresponding keys.

All images and masks are stored as `Image` objects of `PIL` (*Python Imaging Library*, here provided by its fork Pillow) for convenient access to image processing functionality.

In [105]:
# load in datasets 
DATA['diagnosis'] = pd.read_csv(PATH['data']['raw'] + FILENAME['diagnosis'])
DATA['features'] = pd.read_csv(PATH['data']['raw'] + FILENAME['features'])

In [146]:
# load in images and masks
DATA['images'] = [Image.open(PATH['data']['raw'] + PATH['images'] + FILENAME['images'][i] + '.jpg') for i in range(len(FILENAME['images']))]
DATA['masks'] = [Image.open(PATH['data']['raw'] + PATH['masks'] + FILENAME['masks'][i] + '.png') for i in range(len(FILENAME['masks']))]
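
One detail worth knowing here is that `Image.open()` is lazy: it identifies the file but keeps it open and reads the pixel data only when it is needed. With many images this can leave a lot of files open. A possible alternative, sketched below on a temporary example file, is to copy the data into memory and close the file right away; `load_image` is a hypothetical helper and not part of the project code.

```python
import os
import tempfile
from PIL import Image

def load_image(path):
    """Open an image, copy its pixel data into memory and close the file."""
    with Image.open(path) as img:
        return img.copy()   # copy() forces the lazily loaded pixel data to be read

# demonstration with a small image written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.png")
    Image.new("RGB", (4, 3), (10, 20, 30)).save(path)
    loaded = load_image(path)

# the copy stays usable after the file has been closed and removed
print(loaded.size, loaded.mode, loaded.getpixel((0, 0)))
```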

### Inspection of Datasets

---

We can now have a look at our two datasets to get a first impression of what kind of data we are dealing with. We start by reporting the number of records and fields/variables in each of the datasets using the `shape` attribute of the `pd.DataFrame`. We then take an actual look at the data. Similar to the `head` command in the terminal, we can call the method `head()` on our DataFrames, which outputs an inline representation of the first five records of the dataset.

**Shape**

In [107]:
for dataset in NAMES['datasets']:
    print(f"{dataset.capitalize()}: {DATA[dataset].shape}")

**Diagnosis Dataset**

In [108]:
DATA['diagnosis'].head()
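
Since the diagnosis is stored as binary values, summing the columns gives a quick impression of how the classes are distributed. The sketch below uses a small made-up table; the column names `image_id`, `melanoma` and `seborrheic_keratosis` are assumptions based on the description above, and the values are purely illustrative.

```python
import pandas as pd

# made-up example data, NOT the actual diagnosis dataset
toy = pd.DataFrame({
    "image_id": ["a", "b", "c", "d"],
    "melanoma": [0.0, 1.0, 0.0, 0.0],
    "seborrheic_keratosis": [0.0, 0.0, 1.0, 0.0],
})

labels = ["melanoma", "seborrheic_keratosis"]

# number of positive cases per diagnosis
counts = toy[labels].sum().astype(int)

# number of lesions that carry neither diagnosis
neither = int((toy[labels].sum(axis=1) == 0).sum())

print(counts)
print("neither:", neither)
```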

**Features Dataset**


In [109]:
DATA['features'].head()
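
To make the two given features more concrete, the following sketch shows one possible way to measure area and perimeter on a binary mask. How the provided values were actually computed is not stated, so this is an assumption: the area is the number of lesion pixels, and the perimeter is the number of lesion pixels that disappear after a single binary erosion. `area_and_perimeter` is a hypothetical helper.

```python
import numpy as np
from scipy import ndimage

def area_and_perimeter(mask):
    """Return (area, perimeter) of the non-zero region of a 2D mask, both in pixels."""
    mask = np.asarray(mask) > 0
    area = int(mask.sum())
    # erosion removes every lesion pixel that has a background pixel
    # among its four direct neighbours; what is removed is the border
    interior = ndimage.binary_erosion(mask)
    perimeter = int((mask & ~interior).sum())
    return area, perimeter

# toy mask: a 4x4 square inside a 6x6 image
toy_mask = np.zeros((6, 6), dtype=np.uint8)
toy_mask[1:5, 1:5] = 1

print(area_and_perimeter(toy_mask))
```

For a PIL mask, the same function can be applied to `np.asarray(mask)`.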

### Inspection of Images
---
The main part of the project is to 

In [147]:
# load test image using PIL
ex_img = DATA['images'][0]
ex_img_mask= DATA['masks'][0]

fig, ax = plt.subplots(ncols=2, figsize=(16,6))
ax[0].imshow(ex_img)
ax[1].imshow(ex_img_mask, cmap='gray')

print(ex_img.format, ex_img.size, ex_img.mode)
print(ex_img_mask.format, ex_img_mask.size, ex_img_mask.mode)

In [8]:

print(c, c2)

In [69]:
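# NOTE: img2 / img2_mask are not defined in this notebook; the example pair from above is assumed here
img2, img2_mask = ex_img, ex_img_mask
print(img2.size)
print(img2_mask.size)
# https://note.nkmk.me/en/python-pillow-composite/
# Image.composite() needs both images in the same mode, so the mask is converted before it is used as background
testIm = Image.composite(img2, img2_mask.convert(img2.mode), img2_mask.convert('L'))
testIm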

In [112]:
darkBackground = Image.new("RGB", img2.size, 0)
testIm = Image.composite(img2,darkBackground,img2_mask)
testIm

In [127]:
# crop lesion using .getbbox()
testIm = testIm.crop(testIm.getbbox())
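
The masking and cropping steps above can be tried out on synthetic data, which makes it easier to see what `Image.composite()` and `getbbox()` do. The image, the mask and all values in this sketch are made up for illustration.

```python
from PIL import Image

# synthetic 8x6 'skin' image and a mask with a 4x3 lesion region
image = Image.new("RGB", (8, 6), (200, 120, 90))
mask = Image.new("L", (8, 6), 0)
mask.paste(255, (2, 1, 6, 4))        # box = (left, upper, right, lower)

# keep the image where the mask is white, black everywhere else
background = Image.new("RGB", image.size, 0)
masked = Image.composite(image, background, mask)

# getbbox() returns the bounding box of all non-zero pixels
bbox = masked.getbbox()
cropped = masked.crop(bbox)

print(bbox, cropped.size)
```

Note that `getbbox()` on the masked image ignores pure black pixels, so a lesion with black pixels at its edge could be cropped too tightly. Calling `mask.getbbox()` instead avoids this.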

*TASK 1*
# Extracting Features 

---

In [141]:
# luminance score (might be a feature)
gray = np.array(testIm.convert('L').getdata())
mask = gray > 0
print(np.mean(gray[mask]))
# plt.hist(gray[mask]) # plot it
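
Selecting pixels with `gray > 0` also drops lesion pixels that happen to be completely black. A possible variant, sketched here on synthetic data, selects the pixels with the segmentation mask instead; `mean_luminance` is a hypothetical helper and the example values are made up.

```python
import numpy as np
from PIL import Image

def mean_luminance(image, mask):
    """Mean grayscale value of all pixels where the mask is non-zero."""
    gray = np.asarray(image.convert("L"), dtype=float)
    inside = np.asarray(mask.convert("L")) > 0
    return float(gray[inside].mean())

# synthetic example: uniform gray image with a 4x3 lesion region
demo_image = Image.new("RGB", (8, 6), (100, 100, 100))
demo_mask = Image.new("L", (8, 6), 0)
demo_mask.paste(255, (2, 1, 6, 4))

# one completely black pixel inside the lesion region
demo_image.putpixel((3, 2), (0, 0, 0))

luminance = mean_luminance(demo_image, demo_mask)
print(luminance)
```

Here the black pixel lowers the mean, whereas a `gray > 0` selection would have ignored it.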

In [124]:
# extract single color
r,g,b = testIm.getpixel((1,1))

print(r,g,b)
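
A single pixel says little about the lesion as a whole, and a pixel near the corner of the cropped image is likely to be black background. As a sketch of how the idea could be extended into a feature, the following computes the average colour over all lesion pixels; `mean_color` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
from PIL import Image

def mean_color(image, mask):
    """Average (r, g, b) over all pixels where the mask is non-zero."""
    rgb = np.asarray(image.convert("RGB"), dtype=float)    # shape (height, width, 3)
    inside = np.asarray(mask.convert("L")) > 0             # shape (height, width)
    return tuple(rgb[inside].mean(axis=0))                 # mean over the selected pixels

# synthetic example: uniformly coloured image with a 4x3 lesion region
color_image = Image.new("RGB", (8, 6), (200, 120, 90))
color_mask = Image.new("L", (8, 6), 0)
color_mask.paste(255, (2, 1, 6, 4))

print(mean_color(color_image, color_mask))
```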

*TASK 2*
# Predict Diagnosis

---

*TASK 3*
# Open Question: ...

---
