# Addressing the Problem

### Skin Disease Classification

The U.S. Census Bureau projects that by 2050, about half of all patients seen in the U.S. healthcare system will have skin of color (SOC). As these communities grow, dermatologists will be increasingly likely to encounter cutaneous (skin) diseases that occur more frequently in SOC patients, occur exclusively in SOC patients, and/or present differently in SOC patients than in their White counterparts. Helping dermatologists understand and, more importantly, diagnose SOC patients’ cutaneous disease presentations is paramount to delivering life-saving quality of care to these communities. This practicum project therefore explores the use of computer vision to classify cutaneous diseases in SOC patients.

### Problem Structure

This research notebook addresses the supervised classification problem of determining whether a given image of a skin disease is benign (0) or malignant (1). The metric(s) used to evaluate the neural network's performance are still to be determined (TBD).
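Because the evaluation metric is still open, it may help to see how a few common candidates for a benign (0) versus malignant (1) classifier are computed. The sketch below is only illustrative: the labels and scores are invented toy values, the 0.5 threshold is an arbitrary assumption, and nothing here reflects a decision made in this notebook or a result on the DDI data.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Toy ground-truth labels and predicted probabilities, invented for illustration.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.20, 0.90, 0.60])

# Convert probabilities to hard predictions (0.5 is an assumed threshold).
y_pred = (y_score >= 0.5).astype(int)

candidate_metrics = {
    # Share of malignant cases that the classifier catches.
    "recall": recall_score(y_true, y_pred),
    # Share of malignant predictions that are actually malignant.
    "precision": precision_score(y_true, y_pred),
    # Threshold-free ranking quality of the raw scores.
    "roc_auc": roc_auc_score(y_true, y_score),
}
print(candidate_metrics)
```

Recall is often weighted heavily in screening settings because a missed malignancy is costlier than a false alarm, but that trade-off is a judgment for the project to make.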

### Import Necessary Modules

In [1]:
# Standard Python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

# Model construction & evaluation
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras import models, layers

# AWS-specific libraries
import boto3

2024-04-02 16:04:12.067242: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Data Retrieval

We begin by fetching the metadata CSV from its S3 location. The data workflow pipeline has already run the metadata transform Glue job on the raw metadata before this notebook executes, so we are merely fetching the cleaned product. Further feature engineering will still be performed on the metadata and images in preparation for model fitting.

Inspecting the metadata shows that it contains 656 fully populated records across four columns: two integer features (`skin_tone` and `malignant`) and two string features (`DDI_file` and `disease`).

### Data Fetch

In [2]:
# S3 location of the cleaned metadata produced by the Glue transform job
bucket = 'poc-skin-disease-detection-cv'
data_file_key = 'ddidiversedermatologyimages/metadata/transform'
data_file_name = 'part-00000-f5ce70f1-c572-4cbc-818e-e8dda587c3ef-c000.csv'

# Fetch the object and read its streaming body directly into a DataFrame
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=f'{data_file_key}/{data_file_name}')

ddi_df = pd.read_csv(obj['Body'])

In [3]:
ddi_df.head()

     DDI_file  skin_tone  malignant                          disease
0  000001.png         56          1                 melanoma-in-situ
1  000002.png         56          1                 melanoma-in-situ
2  000003.png         56          1                mycosis-fungoides
3  000004.png         56          1  squamous-cell-carcinoma-in-situ
4  000005.png         12          1             basal-cell-carcinoma


### Data Inspection

In [4]:
ddi_df.shape

(656, 4)

In [5]:
ddi_df.dtypes

DDI_file     object
skin_tone     int64
malignant     int64
disease      object
dtype: object

In [6]:
ddi_df.isnull().sum()

DDI_file     0
skin_tone    0
malignant    0
disease      0
dtype: int64
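The shape, dtype, and null checks above can be folded into a single function that fails loudly if a later run of the Glue job changes the schema. This is a minimal sketch of one way to do that, not part of the original pipeline: `validate_metadata` and `EXPECTED_KINDS` are hypothetical names, and the sample rows are copied from the `head()` output so the block runs without S3 access.

```python
import pandas as pd
from pandas.api import types as pdt

# Hypothetical schema description, derived from the dtypes output above.
EXPECTED_KINDS = {
    "DDI_file": "string",
    "skin_tone": "integer",
    "malignant": "integer",
    "disease": "string",
}


def validate_metadata(df: pd.DataFrame) -> None:
    """Raise ValueError if the frame deviates from the expected schema."""
    missing = sorted(set(EXPECTED_KINDS) - set(df.columns))
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df[list(EXPECTED_KINDS)].isnull().any().any():
        raise ValueError("metadata contains null values")
    for column, kind in EXPECTED_KINDS.items():
        check = pdt.is_integer_dtype if kind == "integer" else pdt.is_string_dtype
        if not check(df[column]):
            raise ValueError(f"column {column!r} is not of kind {kind}")
    # The target is described as benign (0) or malignant (1).
    if not df["malignant"].isin([0, 1]).all():
        raise ValueError("malignant must be coded as 0 or 1")


# Sample rows copied from the head() output shown earlier.
sample_df = pd.DataFrame({
    "DDI_file": ["000001.png", "000002.png", "000003.png", "000004.png", "000005.png"],
    "skin_tone": [56, 56, 56, 56, 12],
    "malignant": [1, 1, 1, 1, 1],
    "disease": [
        "melanoma-in-situ",
        "melanoma-in-situ",
        "mycosis-fungoides",
        "squamous-cell-carcinoma-in-situ",
        "basal-cell-carcinoma",
    ],
})
validate_metadata(sample_df)  # passes silently when the schema matches
```

In the notebook itself, the same call would be made on `ddi_df` right after the fetch.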

In [7]:
ddi_df.describe()

       skin_tone  malignant
count      656.0      656.0
mean   33.966463   0.260671
std    17.511578   0.439336
min         12.0        0.0
25%         12.0        0.0
50%         34.0        0.0
75%         56.0        1.0
max         56.0        1.0
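The mean of `malignant` (about 0.26) indicates that roughly a quarter of the records are malignant, so the classes are imbalanced. One common way to keep that ratio intact when splitting data is a stratified split with `train_test_split`, which this notebook already imports. The sketch below runs on a synthetic stand-in frame rather than `ddi_df`; `toy_df`, the 80/20 split, and the random seed are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ddi_df: 40 rows, 25% malignant (invented values).
toy_df = pd.DataFrame({
    "DDI_file": [f"{i:06d}.png" for i in range(1, 41)],
    "malignant": [1] * 10 + [0] * 30,
})

# stratify= keeps the malignant/benign ratio the same in both partitions.
train_df, test_df = train_test_split(
    toy_df,
    test_size=0.2,                 # assumed 80/20 split
    stratify=toy_df["malignant"],
    random_state=42,               # assumed seed, for reproducibility only
)

print("overall:", toy_df["malignant"].mean())
print("train:  ", train_df["malignant"].mean())
print("test:   ", test_df["malignant"].mean())
```

With the real metadata, `skin_tone` could be combined with `malignant` in the stratification key so that each skin tone group stays represented in both partitions, though that choice is left to the feature engineering stage.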


# Exploratory Data Analysis (EDA)

# Feature Engineering & Analysis

# Model Training & Selection

# Model Evaluation

# Future Work