> This notebook fits an XGBoost classifier on the descriptive metadata CSV to check whether the metadata alone carries any predictive signal.

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import xgboost
from xgboost import XGBClassifier

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

In [23]:
data = pd.read_csv("descriptive_data.csv")

In [24]:
data.head()

Unnamed: 0.1,Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization,pixel_avg
0,0,HAM_0000118,ISIC_0027419,bkl,histo,80.0,male,scalp,184.320017
1,1,HAM_0000118,ISIC_0025030,bkl,histo,80.0,male,scalp,177.003825
2,2,HAM_0002730,ISIC_0026769,bkl,histo,80.0,male,scalp,181.487038
3,3,HAM_0002730,ISIC_0025661,bkl,histo,80.0,male,scalp,165.210795
4,4,HAM_0001466,ISIC_0031633,bkl,histo,75.0,male,ear,188.305992


In [25]:
data = pd.get_dummies(data, columns = ["localization"])

In [26]:
data["sex"] = data["sex"].map({"male": 0, "female": 1})

In [27]:
data["dx"].value_counts()

nv       6705
mel      1113
bkl      1099
bcc       514
akiec     327
vasc      142
df        115
Name: dx, dtype: int64

In [28]:
data["label"] = data["dx"].map({"nv": 4, "mel": 6, "bkl": 2, "bcc": 1, "akiec": 0, "vasc": 5, "df":3 })

In [29]:
df4_train = data[data["label"] == 4].sample(115)
df6_train = data[data["label"] == 6].sample(115)
df2_train = data[data["label"] == 2].sample(115)
df1_train = data[data["label"] == 1].sample(115)
df0_train = data[data["label"] == 0].sample(115)
df5_train = data[data["label"] == 5].sample(115)
df3_train = data[data["label"] == 3].sample(115)

In [30]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the same frames in the same order.
data = pd.concat([df4_train, df6_train, df2_train, df1_train, df0_train, df5_train, df3_train])
data.reset_index(inplace = True)
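
> The seven per-class `sample(115)` calls and the concatenation above could also be written as one grouped sample. This is a minimal sketch on a toy frame, not the notebook's data: `toy`, `n_per_class`, and the `random_state` value are assumptions added for illustration, and fixing a seed would draw different rows than the original run.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (hypothetical values).
toy = pd.DataFrame({
    "label": [0] * 5 + [1] * 3 + [2] * 8,
    "age": range(16),
})

# Size of the rarest class; in the notebook this would be 115 (class "df").
n_per_class = int(toy["label"].value_counts().min())

# Draw the same number of rows from every class. random_state is an
# added assumption so that the draw is repeatable.
balanced = (
    toy.groupby("label")
       .sample(n=n_per_class, random_state=30)
       .reset_index(drop=True)
)

print(balanced["label"].value_counts().to_dict())
```

> Note that `reset_index(drop=True)` discards the old index, whereas the notebook keeps it as an `index` column and drops it later when building `X`.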

In [31]:
data.head()

Unnamed: 0.1,index,Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,pixel_avg,localization_abdomen,...,localization_foot,localization_genital,localization_hand,localization_lower extremity,localization_neck,localization_scalp,localization_trunk,localization_unknown,localization_upper extremity,label
0,5110,5110,HAM_0003503,ISIC_0027555,nv,follow_up,60.0,1.0,170.99575,0,...,0,0,0,1,0,0,0,0,0,4
1,6906,6906,HAM_0006974,ISIC_0026541,nv,histo,45.0,0.0,151.727157,1,...,0,0,0,0,0,0,0,0,0,4
2,7418,7418,HAM_0005571,ISIC_0032803,nv,histo,35.0,0.0,151.211645,0,...,0,0,0,0,1,0,0,0,0,4
3,8341,8341,HAM_0002628,ISIC_0030852,nv,histo,50.0,0.0,130.609435,0,...,0,0,0,1,0,0,0,0,0,4
4,8788,8788,HAM_0006881,ISIC_0025798,nv,histo,60.0,1.0,173.635359,0,...,0,0,0,0,0,0,0,0,0,4


In [40]:
data["label"].value_counts(normalize = True)

6    0.142857
5    0.142857
4    0.142857
3    0.142857
2    0.142857
1    0.142857
0    0.142857
Name: label, dtype: float64

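> The features for this model will be age, sex, localization and pixel average. After undersampling, each of the seven classes makes up 1/7 of the data, so the baseline accuracy is about 14.3%. (Before balancing, the majority class nv accounted for about 67% of the images.)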
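
> As a quick check of the baseline, the sketch below computes the accuracy of always predicting the most frequent class, once for the original class counts printed earlier and once for the balanced sample of 115 rows per class. The helper name `majority_baseline` is hypothetical.

```python
import pandas as pd

# Class counts as printed by data["dx"].value_counts() earlier in the notebook.
original_counts = pd.Series(
    {"nv": 6705, "mel": 1113, "bkl": 1099, "bcc": 514,
     "akiec": 327, "vasc": 142, "df": 115}
)
# After undersampling, every class keeps 115 rows.
balanced_counts = pd.Series(115, index=original_counts.index)


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    return float(counts.max() / counts.sum())


print(round(majority_baseline(original_counts), 4))  # 0.6695
print(round(majority_baseline(balanced_counts), 4))  # 0.1429
```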

In [33]:
X = data.drop(["Unnamed: 0", "lesion_id", "image_id", "dx", "dx_type", "label", "index"], axis = 1)

In [34]:
y = data["label"]

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y)

ss = StandardScaler()
X_train  = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
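
> The first `data.head()` output shows several images sharing one `lesion_id`, so a purely random split may place images of the same lesion in both the training and the test set. Whether that happens in the balanced sample is not checked here. One possible way to rule it out is a group-aware split; the sketch below uses scikit-learn's `GroupShuffleSplit` on hypothetical toy arrays (the `HAM_a` ... `HAM_d` ids are made up).

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 8 images from 4 lesions, two images per lesion.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
groups = np.array(["HAM_a", "HAM_a", "HAM_b", "HAM_b",
                   "HAM_c", "HAM_c", "HAM_d", "HAM_d"])

# test_size is a fraction of the groups (lesions), not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=30)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# No lesion id should appear on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print(shared)
```

> Unlike `train_test_split(..., stratify = y)`, `GroupShuffleSplit` does not stratify by class, so class proportions in the two sets can drift.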

In [36]:
xgb = XGBClassifier(random_state = 30)

In [37]:
xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=30,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [38]:
xgb.score(X_train, y_train)

0.6583747927031509

In [39]:
xgb.score(X_test, y_test)

0.41089108910891087

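> With no tuning and only a few features, this model reaches about 41% test accuracy, roughly 27 percentage points above the 14.3% baseline of the balanced classes. The gap to the 66% training accuracy suggests the model is overfitting. It may be useful to combine this with CNN models in future tests.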
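
> The comparison with the baseline can be reproduced from the two scores printed above. The sketch assumes the baseline is 1/7, the share of each class after balancing.

```python
# Scores copied from the cell outputs above.
train_acc = 0.6583747927031509
test_acc = 0.41089108910891087
baseline = 1 / 7  # seven balanced classes

points_above = (test_acc - baseline) * 100      # percentage points
ratio = test_acc / baseline                     # relative to baseline
train_test_gap = (train_acc - test_acc) * 100   # percentage points

print(f"{points_above:.1f} points above baseline")
print(f"{ratio:.2f}x the baseline accuracy")
print(f"{train_test_gap:.1f} point train/test gap")
```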