# Data Analysis

Our study focuses on the “Differentiated Thyroid Cancer Recurrence” dataset @borzooei2023, hosted by the UCI Machine Learning Repository. Established by the University of California, Irvine, the repository provides free access to a wide array of datasets used for empirical analysis in machine learning and data mining @ucidata, supporting academic and educational work across many domains. As of March 2024, it hosts and maintains over 600 datasets.

In [None]:
raw_data <- readr::read_csv(here::here('data/raw-data.csv'))


Rows: 383 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Gender, Smoking, Hx Smoking, Hx Radiothreapy, Thyroid Function, Ph...
dbl  (1): Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The “Differentiated Thyroid Cancer Recurrence” dataset comprises 383 observations and 17 variables pertinent to thyroid cancer, including patient demographics, clinical features, and pathological details, all aimed at elucidating patterns associated with cancer recurrence.

We will employ six distinct modeling methods to analyze our dataset: Artificial Neural Network (ANN), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). Each method brings its own strengths to the analysis: ANN provides deep learning capabilities, KNN offers simplicity and ease of interpretation, SVM delivers powerful discriminative classification, LR is intuitive and easy to train, and the ensemble methods RF and XGBoost offer robust and accurate tree-based algorithms. Together, they form a comprehensive approach to predicting cancer recurrence in the studied dataset.

To prepare our data for modeling, we remove duplicate observations, fix a typographical error in a column name (`Hx Radiothreapy`), recode `Gender` to full labels, and convert the categorical variables into factors.

In [None]:
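# Load the raw data, drop duplicate rows, fix the misspelled column name,
# and convert the categorical variables to factors with explicit levels.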
cleaned_data <- 
  readr::read_csv(here::here('data/raw-data.csv')) |>
  dplyr::distinct() |>
  dplyr::rename(`Hx Radiotherapy` = 'Hx Radiothreapy') |>
  dplyr::mutate(Gender = ifelse(Gender == 'F', 'Female', 'Male')) |>
  dplyr::mutate(
    Gender = factor(Gender, levels = c('Female', 'Male')),
    Smoking = factor(Smoking, levels = c('Yes', 'No')),
    `Hx Smoking` = factor(`Hx Smoking`, levels = c('Yes', 'No')),
    `Hx Radiotherapy` = factor(`Hx Radiotherapy`, levels = c('Yes', 'No')),
    `Thyroid Function` = factor(
      `Thyroid Function`,
      levels = c('Euthyroid', 'Clinical Hyperthyroidism',
                 'Subclinical Hyperthyroidism', 'Clinical Hypothyroidism',
                 'Subclinical Hypothyroidism')),
    `Physical Examination` = factor(`Physical Examination`,
                                    levels = c('Normal', 'Diffuse goiter', 
                                               'Single nodular goiter-right',
                                               'Single nodular goiter-left', 
                                               'Multinodular goiter')),
    Adenopathy = factor(Adenopathy,
                        levels = c('No', 'Right', 'Left', 'Bilateral', 
                                   'Posterior', 'Extensive')),
    Pathology = factor(
      Pathology,
      levels = c('Papillary', 'Micropapillary', 'Follicular',
                 'Hurthel cell')),
    Focality = factor(Focality, levels = c('Uni-Focal', 'Multi-Focal')),
    `T` = factor(`T`, levels = c('T1a', 'T1b', 'T2', 'T3a', 'T3b', 'T4a',
                                 'T4b')),
    N = factor(N, levels = c('N0', 'N1b', 'N1a')),
    M = factor(M, levels = c('M0', 'M1')),
    Stage = factor(Stage, levels = c('I', 'II', 'III', 'IVA', 'IVB')),
    Response = factor(
      Response,
      levels = c('Excellent', 'Biochemical Incomplete',
                 'Structural Incomplete', 'Indeterminate')),
    Risk = factor(Risk, levels = c('Low', 'Intermediate', 'High')),
    Recurred = factor(Recurred, levels = c('Yes', 'No'))
  )


Rows: 383 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Gender, Smoking, Hx Smoking, Hx Radiothreapy, Thyroid Function, Ph...
dbl  (1): Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
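
Two details of the cleaning step are easy to overlook: `distinct()` drops rows without reporting how many, and `factor()` silently converts any value missing from `levels` into `NA`. The following is a minimal sketch of a sanity check for both, written in base R on a small made-up data frame (`toy` and its values are hypothetical and are not the study data).

```r
# Hypothetical toy data (NOT the study data): one exact duplicate row and
# one value ('Maybe') that is absent from the factor levels declared below.
toy <- data.frame(
  Age     = c(34, 34, 51, 62),
  Smoking = c('No', 'No', 'Yes', 'Maybe'),
  stringsAsFactors = FALSE
)

# Count exact duplicate rows before dropping them.
n_duplicates <- sum(duplicated(toy))
toy_unique <- unique(toy)

# Convert to a factor with explicit levels, as in the cleaning step.
smoking_fct <- factor(toy_unique$Smoking, levels = c('Yes', 'No'))

# Raw values that were not missing but became NA after conversion point to
# a level that was misspelled or left out of `levels`.
lost <- is.na(smoking_fct) & !is.na(toy_unique$Smoking)
unmatched <- unique(toy_unique$Smoking[lost])

n_duplicates  # number of rows that unique() removes
unmatched     # raw values not covered by the declared levels
```

Running the same kind of check on each factor column of the cleaned data would confirm that no category was lost to a spelling mismatch.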

In [None]:
# Save cleaned data.
readr::write_csv(x = cleaned_data, file = here::here('data', 'cleaned-data.csv'))


In [None]:
# Total observations.
total_obs <- nrow(cleaned_data)

# Number of males/females & gender percentages.
fem_no <- sum(cleaned_data$Gender == 'Female')
males_no <- sum(cleaned_data$Gender == 'Male')
fem_perc <- round(fem_no/total_obs*100, 1)
males_perc <- round(males_no/total_obs*100, 1)

# Gender by recurrence.
fem_rec_yes <- sum(cleaned_data$Gender == 'Female' & cleaned_data$Recurred == 'Yes')
fem_rec_no <- sum(cleaned_data$Gender == 'Female' & cleaned_data$Recurred == 'No')
male_rec_yes <- sum(cleaned_data$Gender == 'Male' & cleaned_data$Recurred == 'Yes')
male_rec_no <- sum(cleaned_data$Gender == 'Male' & cleaned_data$Recurred == 'No')
fem_rec_yes_perc <- round(fem_rec_yes/fem_no*100, 1)
fem_rec_no_perc <- round(fem_rec_no/fem_no*100, 1)
male_rec_yes_perc <- round(male_rec_yes/males_no*100, 1)
male_rec_no_perc <- round(male_rec_no/males_no*100, 1)


After removing duplicates, our data has 364 observations. Of the 17 variables, 16 will be used as features, leaving `Recurred` as the target variable to be predicted. The patients are predominantly female: 293 (80.5%) are females and 71 (19.5%) are males. Recurrence is fairly balanced among males, 59.2% of whom had a recurrence, whereas females are unevenly distributed, with only 22.5% recurred cases (see @fig-gender-dist-html).
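
The within-gender percentages quoted above are row-wise proportions of a two-way table. As a compact illustration of that calculation, the sketch below uses base R's `table()` and `prop.table()` on a made-up data frame; `toy` and its values are hypothetical and do not reproduce the study's figures.

```r
# Hypothetical toy data (NOT the study data).
toy <- data.frame(
  Gender   = c('Female', 'Female', 'Female', 'Female', 'Male', 'Male'),
  Recurred = c('Yes', 'No', 'No', 'No', 'Yes', 'No')
)

# Counts of each Gender x Recurred combination.
counts <- table(toy$Gender, toy$Recurred)

# Row-wise percentages: margin = 1 divides each cell by its row total,
# so every gender's row sums to 100.
row_perc <- round(prop.table(counts, margin = 1) * 100, 1)
row_perc
```

Using `margin = 1` matters here: with the default (no margin), each cell would be divided by the grand total instead, which answers a different question.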

In [None]:

# Note: this plot will only show for PDF versions of the paper.
knitr::include_graphics(here::here('images/gender_dist_plot.png'))


In [None]:

# Note: this plot will only show for HTML versions of the paper.

# Gender distribution grouped by cancer recurrence.
gender_dist_plot <- cleaned_data |>
  dplyr::mutate(fem_total = sum(Gender == 'Female'),
                male_total= sum(Gender == 'Male')) |>
  dplyr::group_by(Gender, Recurred) |>
  dplyr::reframe(count = dplyr::n(), fem_total, male_total) |>
  dplyr::mutate(
    count = ifelse(Gender == 'Female',
                   round(count/fem_total*100, 1),
                   round(count/male_total*100, 1))) |>
  dplyr::distinct() |>
  plotly::plot_ly(
    x = ~Gender,
    y = ~count,
    color = ~Recurred,
    text = ~Recurred,
    opacity = 0.7,
    type = 'bar',
    hovertemplate = '<b>Gender</b>: %{x} <br><b>Recurred</b>: %{text} <br><b>Percentage</b>: %{y} <extra></extra>'
    ) |>
  plotly::config(displayModeBar = FALSE) |>
  plotly::layout(bargap = 0.5, barmode = 'stack',
                 yaxis = list(title = '', ticksuffix = '%'),
                 legend = list(title = list(text = '<b>Recurred</b>'))
  )

plotly::save_image(gender_dist_plot, here::here('images/gender_dist_plot.png'),
                   width = 500, scale = 4)






The distribution of `Age` by cancer recurrence is shown in @fig-age-dist-html. Note that, in general, older patients are more likely to experience a recurrence.
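
A numeric summary can complement the histogram when comparing age across recurrence groups. The sketch below shows one way to do this with base R's `tapply()`; the `toy` data frame is made up for illustration, so the resulting medians say nothing about the actual dataset.

```r
# Hypothetical toy data (NOT the study data).
toy <- data.frame(
  Age      = c(25, 31, 38, 44, 52, 60, 67, 73),
  Recurred = c('No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes')
)

# Median age within each recurrence group.
age_by_rec <- tapply(toy$Age, toy$Recurred, median)
age_by_rec
```

Applied to `cleaned_data`, the same call with `cleaned_data$Age` and `cleaned_data$Recurred` would give the group medians behind the figure.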

In [None]:

# Note: this plot will only show for PDF versions of the paper.
knitr::include_graphics(here::here('images/age_dist_plot.png'))


In [None]:

# Note: this plot will only show for HTML versions of the paper.

#' Age Distribution grouped by cancer recurrence.
age_dist_plot <- cleaned_data |>
  plotly::plot_ly() |>
  plotly::add_trace(
    x = ~Age,
    color = ~Recurred,
    text = ~Recurred,
    opacity = 0.7,
    type = 'histogram',
    histnorm = 'percent',
    hovertemplate = '<b>Age Range</b>: %{x} years <br><b>Percentage</b>: %{y:.1f}%<br><b>Recurred</b>: %{text}<extra></extra>'
    ) |>
  plotly::config(displayModeBar = FALSE) |>
  plotly::layout(bargap = 0.1, barmode = 'stack',
                 yaxis = list(ticksuffix = '%'),
                 legend = list(title = list(text = '<b>Recurred</b>'))
  )

plotly::save_image(age_dist_plot, here::here('images/age_dist_plot.png'), scale = 4)






In [None]:
# Create a summary of the data features to be shown in a table.
dt_summary <- purrr::map(
  colnames(cleaned_data |> dplyr::select(-Age, -Recurred)),
  \(x) paste0(unique(sort(cleaned_data[[x]])), collapse = ', ')
)
names(dt_summary) <- colnames(cleaned_data |> dplyr::select(-Age, -Recurred))
dt_summary <- tibble::as_tibble(dt_summary) |>
  tidyr::pivot_longer(
    cols = dplyr::everything(),
    names_to = 'Feature',
    values_to = 'Values'
  )


Besides `Age`, all features are categorical. One interesting categorical feature is `Adenopathy`, which indicates the presence of swollen lymph nodes found during physical examination. The adenopathy types observed are no adenopathy, anterior right, anterior left, bilateral (i.e., both sides of the body), posterior, and extensive (i.e., involving all locations). Note the strong association between swollen lymph nodes and the differentiated thyroid cancer (DTC) recurrence rate (see @fig-aden-dist-html).
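
The figure reports, for each adenopathy category, the percentage of patients with and without recurrence together with the category's total patient count. As a rough base R sketch of that per-category calculation, the example below uses a made-up data frame (`toy`, `n`, and `perc` are hypothetical names and values, not the study data); `total_aden` mirrors the column name used in the plotting code.

```r
# Hypothetical toy data (NOT the study data).
toy <- data.frame(
  Adenopathy = c('No', 'No', 'No', 'No', 'Right', 'Right', 'Bilateral'),
  Recurred   = c('No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes')
)

# One row per Adenopathy x Recurred combination, including zero counts.
counts <- as.data.frame(table(toy), responseName = 'n')

# Total patients in each adenopathy category, repeated on every row.
counts$total_aden <- ave(counts$n, counts$Adenopathy, FUN = sum)

# Share of each recurrence outcome within its adenopathy category.
counts$perc <- round(counts$n / counts$total_aden * 100, 1)
counts
```

Keeping `total_aden` alongside the percentage is useful because a category with very few patients can show an extreme percentage that deserves cautious interpretation.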

In [None]:

# Note: this plot will only show for PDF versions of the paper.
knitr::include_graphics(here::here('images/aden_dist_plot.png'))


In [None]:

# Note: this plot will only show for HTML versions of the paper.

# Adenopathy distribution by cancer recurrence.
aden_dist_plot <- cleaned_data |>
  # First, find the total number of patients per adenopathy category.
  # This is used next to compute the percentage of recurring cases.
  dplyr::reframe(
    total_aden = dplyr::n(),
    .by = Adenopathy,
    Recurred # Keep this column.
  ) |>
  # Compute the percentage of each recurrence outcome per adenopathy category.
  dplyr::reframe(
    total_rec = round(dplyr::n()/total_aden*100, 1),
    .by = c(Adenopathy, Recurred),
    total_aden = total_aden
  ) |>
  dplyr::distinct() |>
  plotly::plot_ly(
    x = ~Adenopathy,
    y = ~total_rec,
    color = ~Recurred,
    text = ~Recurred,
    opacity = 0.7,
    type = 'bar',
    hovertext = ~total_aden,
    hovertemplate = '<b>Adenopathy</b>: %{x} <br><b>Recurred</b>: %{text} <br><b>Percentage</b>: %{y} of %{hovertext} patients <extra></extra>'
    ) |>
  plotly::config(displayModeBar = FALSE) |>
  plotly::layout(bargap = 0.1, barmode = 'stack',
                 yaxis = list(title = '', ticksuffix = '%'),
                 legend = list(title = list(text = '<b>Recurred</b>'))
  )

plotly::save_image(aden_dist_plot, here::here('images/aden_dist_plot.png'), scale = 4)






A summary of all the features and their categories is shown in @tbl-summary-html.