# Relabeling Dermatopathologic Diagnoses

The workflow for relabeling dermatopathologic diagnoses is shown below. The regex search terms provided are examples and do not cover all terms used in curating the HAM-10000 dataset, because the terms differed between dataset sources in language and in the specialised terminology of the pathology groups.


In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')
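
The warning suppressed above is raised by pandas when a pattern passed to `str.contains` contains capturing groups. As a minimal sketch, assuming the groups are only needed for alternation, the same match can be written without groups. The sample strings and variable names below are illustrative and not taken from the dataset.

```python
import warnings
import pandas as pd

# Illustrative diagnoses, not taken from the dataset
s = pd.Series(["bowen's disease", "no signs of residual tumor", "unclear melanocytic lesion"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Alternation with capturing groups, in the style used in this notebook
    grouped = s.str.contains("(unclear)|(no sign)")

# Same alternation without groups, so there are no match groups to warn about
plain = s.str.contains("unclear|no sign")

# Both patterns select the same rows
print(grouped.equals(plain))
```

Non-capturing groups such as `(?:unclear)` are another option when grouping is needed inside a larger pattern.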

# Read data

In [2]:
df = pd.read_csv('unify_diagnoses_example.csv', encoding='utf-8', delimiter=';', index_col=0)
df['dx'] = df.diagnosis.str.lower()
df.head()

caseno,diagnosis,dx
1,Bowen's disease,bowen's disease
2,Junctional melanocytic nevus,junctional melanocytic nevus
3,"Unclear melanocytic lesion, recommend complete...","unclear melanocytic lesion, recommend complete..."
4,Dermatofibroma,dermatofibroma
5,"Lentigo maligna melanoma, Clark level III, Bre...","lentigo maligna melanoma, clark level iii, bre..."


# Relabel

In [3]:
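def relabel(df, column, search, result, test=False):
    """Print terms in `column` matching regex `search`; relabel them to `result` unless `test` is True."""
    # Compute the match mask once so the preview and the update use the same rows
    mask = df[column].str.contains(search)
    print(30*"-")
    if test:
        print("Terms to relabel:")
    else:
        print(f'Terms relabeled to "{result}":')
    print(30*"-")
    print(df[mask][column].value_counts())
    if not test:
        # Overwrite the matched rows in place
        df.loc[mask, column] = result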
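
The `str.contains` call in `relabel` assumes that every row has a diagnosis text. If a source file contains empty diagnoses, the resulting mask holds missing values and boolean indexing with it is expected to fail. A minimal sketch of one way to guard against this, using the `na` parameter of `str.contains`; the sample data are illustrative:

```python
import pandas as pd

# Illustrative column with one missing diagnosis
s = pd.Series(["dermatofibroma", None, "blue nevus"])

# na=False treats missing entries as "no match", so the mask is purely boolean
mask = s.str.contains("nevus", na=False)

matched = s[mask].tolist()
print(matched)
```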


## Always inspect the matched selection before proceeding to the next term

In [4]:
relabel(df, "dx", "(unclear)|(no sign)|(no residual)|(scar)|(collis)", "nonuse", test=True)

------------------------------
Terms to relabel:
------------------------------
unclear melanocytic lesion, recommend complete excision with 0.5cm margins    2
fibrosis and acanthosis, consistent with scar after previous biopsy           1
collision of seborrheic keratosis and basal cell carcinoma, in toto           1
no signs of residual tumor                                                    1
Name: dx, dtype: int64
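
Inspection matters because a short search term can match more than intended. The sketch below uses a hypothetical helper, `preview_matches`, and made-up diagnoses to show how a term like `scar` could also select a report that should not be excluded:

```python
import pandas as pd

def preview_matches(series, pattern):
    """Hypothetical helper: count matching terms without modifying the series."""
    return series[series.str.contains(pattern)].value_counts()

# Made-up diagnoses for illustration only
dx = pd.Series([
    "fibrosis consistent with scar",
    "basal cell carcinoma without scarring",
    "dermatofibroma",
])
before = dx.copy()

counts = preview_matches(dx, "scar")
print(counts)
```

Here the second term is a basal cell carcinoma, so the search term would need to be narrowed before relabeling.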


## Iterate over all groups

### Rows to exclude

In [5]:
# Unusable diagnoses
relabel(df, "dx", "(unclear)|(no sign)|(no residual)|(scar)|(collis)", "nonuse")

------------------------------
Terms relabeled to "nonuse":
------------------------------
unclear melanocytic lesion, recommend complete excision with 0.5cm margins    2
fibrosis and acanthosis, consistent with scar after previous biopsy           1
collision of seborrheic keratosis and basal cell carcinoma, in toto           1
no signs of residual tumor                                                    1
Name: dx, dtype: int64


### Malignant cases

In [6]:
# mel
# Relabel melanoma first so that nevus-associated melanomas are labeled "mel" before the nevus pattern can match them
# "mela[n]*oma" also catches the misspelling "melaoma"
relabel(df, "dx", "(lentigo maligna)|(ssm)|(mela[n]*oma)", "mel")

------------------------------
Terms relabeled to "mel":
------------------------------
superficial spreading melanoma, breslow thickness 0.4mm, clark iii, <1 mitoses/mm2    1
nodular melaoma, breslow thickness 4mm, <1 mitoses/mm2, lateral margins clear         1
partial biopsy of lentigo maligna with all lateral margins involved                   1
superficial spreading melanoma, in situ, in toto                                      1
ssm in association with a preexisting dermal nevus                                    1
lentigo maligna melanoma, clark level iii, breslow 0.7mm, <1 mitosis/mm2, in toto     1
Name: dx, dtype: int64
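
The effect of the iteration order can be sketched on two terms. The helper `apply_rules` and the simplified patterns are assumptions made for this illustration and are not part of the curation code:

```python
import pandas as pd

def apply_rules(terms, rules):
    """Hypothetical helper: apply (pattern, label) rules in the given order."""
    s = pd.Series(terms).str.lower()
    for pattern, label in rules:
        # Rows relabeled by an earlier rule no longer contain the original text
        s.loc[s.str.contains(pattern)] = label
    return s.tolist()

terms = ["ssm in association with a preexisting dermal nevus", "blue nevus"]
mel_rule = ("lentigo maligna|ssm|melan*oma", "mel")
nv_rule = ("n[äa]?e?vus", "nv")

mel_first = apply_rules(terms, [mel_rule, nv_rule])
nv_first = apply_rules(terms, [nv_rule, mel_rule])

print(mel_first)  # ['mel', 'nv']
print(nv_first)   # ['nv', 'nv']
```

With the nevus rule applied first, the nevus-associated melanoma is labeled as a nevus and the melanoma rule can no longer find it.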


In [7]:
# bcc
relabel(df, "dx", "(bcc)|(basal cell carcinom)", "bcc")

------------------------------
Terms relabeled to "bcc":
------------------------------
morpheiform bcc, in toto                                 1
pigmented nodular basal cell carcinoma (punch biopsy)    1
basal cell carcinoma (punch biopsy)                      1
Name: dx, dtype: int64


In [8]:
# akiec
relabel(df, "dx", "(intraepithelial carc)|(bowen)|(actinic keratosis)", "akiec")

------------------------------
Terms relabeled to "akiec":
------------------------------
actinic keratosis without signs of invasive alterations    1
intraepithelial carcinoma, in toto                         1
bowen's disease                                            1
Name: dx, dtype: int64


### Benign cases

In [9]:
# nv
# The column is lowercased, so all search terms must be lowercase
relabel(df, "dx", "(n[äa]?e?vus(?! sebaceus))|(compound)|(reed)", "nv")

------------------------------
Terms relabeled to "nv":
------------------------------
junctional nevus                2
junctional melanocytic nevus    1
irritated clark's naevus        1
recurrent naevus                1
compound nevus                  1
spindlecell-nevus reed          1
clark's nevus                   1
blue nevus                      1
Name: dx, dtype: int64


In [10]:
# bkl
relabel(df, "dx", "(seb k)|(verr.*seborrhoica)|(seborrh[o]*eic ker)", "bkl")

------------------------------
Terms relabeled to "bkl":
------------------------------
seborrheic keratosis     4
verruca seborrhoica      1
seborrhoeic keratosis    1
Name: dx, dtype: int64


In [11]:
# vasc
relabel(df, "dx", "angio(kerato)*m", "vasc")

------------------------------
Terms relabeled to "vasc":
------------------------------
thrombosed angioma              1
eruptive capillary angioma      1
angiokeratoma (punch biopsy)    1
hemangioma                      1
angioma (punch biopsy)          1
Name: dx, dtype: int64


In [12]:
# df
relabel(df, "dx", "dermatofibrom", "df")

------------------------------
Terms relabeled to "df":
------------------------------
dermatofibroma                  3
hemosiderotic dermatofibroma    1
Name: dx, dtype: int64
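
After the last group, it can help to list any terms that no pattern has matched, since restricting to the used classes would drop them without notice. A minimal sketch, assuming the class names used in this notebook; the helper `unlabeled_terms` and the sample values are hypothetical:

```python
import pandas as pd

CLASSES = ["mel", "nv", "bcc", "bkl", "vasc", "df", "akiec"]
EXCLUDED = ["nonuse"]

def unlabeled_terms(series):
    """Hypothetical helper: count values that are neither a class nor an exclusion label."""
    known = CLASSES + EXCLUDED
    return series[~series.isin(known)].value_counts()

# Illustrative column state after relabeling
dx = pd.Series(["mel", "nv", "nonuse", "some unmatched diagnosis", "nv"])

leftover = unlabeled_terms(dx)
print(leftover)
```

An empty result suggests that every diagnosis was either assigned to a class or explicitly excluded.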


# Restrict to used classes

In [13]:
df = df[df.dx.isin(['mel', 'nv', 'bcc', 'bkl', 'vasc', 'df', 'akiec'])]
df.dx.value_counts()

nv       9
bkl      6
mel      6
vasc     5
df       4
bcc      3
akiec    3
Name: dx, dtype: int64