In [1]:
library(tidyverse)
library(janitor)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test



package ‘janitor’ was built under R version 4.5.2 


In [2]:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
wbcd_raw <- read.table(url, sep = ",", header = FALSE, stringsAsFactors = FALSE)
head(wbcd_raw)

        V1 V2    V3    V4     V5     V6      V7      V8     V9     V10    V11
1   842302  M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419
2   842517  M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812
3 84300903  M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069
4 84348301  M 11.42 20.38  77.58  386.1 0.14250 0.28390 0.2414 0.10520 0.2597
5 84358402  M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809
6   843786  M 12.45 15.70  82.57  477.1 0.12780 0.17000 0.1578 0.08089 0.2087
      V12    V13    V14   V15    V16      V17     V18     V19     V20     V21
1 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003
2 0.05667 0.5435 0.7339 3.398  74.08 0.005225 0.01308 0.01860 0.01340 0.01389
3 0.05999 0.7456 0.7869 4.585  94.03 0.006150 0.04006 0.03832 0.02058 0.02250
4 0.09744 0.4956 1.1560 3.445  27.23 0.009110 0.07458 0.05661 0.01867 0.05963
5 0.05883 0.7572 0.7813 5.438  94.44 0.011490 0.02461 0.05688 0.

We assign names to the columns (following the dataset documentation).

In [3]:
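# Caution: wdbc.names documents the 30 feature columns as ten means (V3-V12),
# then ten standard errors (V13-V22), then ten "worst" values (V23-V32).
# The names below are grouped per feature instead, so they do not line up
# with that layout (e.g. V4, the mean texture, ends up labelled "radius SE").
colnames(wbcd_raw) <- c(
  "ID", "diagnosis",
  paste0("radius ", c("mean", "SE", "worst")),
  paste0("texture ", c("mean", "SE", "worst")),
  paste0("perimeter_", c("mean", "SE", "worst")),
  paste0("area_", c("mean", "SE", "worst")),
  paste0("smoothness_", c("mean", "SE", "worst")),
  paste0("compactness_", c("mean", "SE", "worst")),
  paste0("Concavity_", c("mean", "SE", "worst")),
  paste0("concave_points_", c("mean", "SE", "worst")),
  paste0("Symmetry ", c("mean", "SE", "worst")),
  paste0("fractal_dim_", c("mean", "SE", "worst"))
)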
head(wbcd_raw)

        ID diagnosis radius mean radius SE radius worst texture mean texture SE
1   842302         M       17.99     10.38       122.80       1001.0    0.11840
2   842517         M       20.57     17.77       132.90       1326.0    0.08474
3 84300903         M       19.69     21.25       130.00       1203.0    0.10960
4 84348301         M       11.42     20.38        77.58        386.1    0.14250
5 84358402         M       20.29     14.34       135.10       1297.0    0.10030
6   843786         M       12.45     15.70        82.57        477.1    0.12780
  texture worst perimeter_mean perimeter_SE perimeter_worst area_mean area_SE
1       0.27760         0.3001      0.14710          0.2419   0.07871  1.0950
2       0.07864         0.0869      0.07017          0.1812   0.05667  0.5435
3       0.15990         0.1974      0.12790          0.2069   0.05999  0.7456
4       0.28390         0.2414      0.10520          0.2597   0.09744  0.4956
5       0.13280         0.1980      0.10430       
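
The layout described in `wdbc.names` lists the ten feature means first, then the ten standard errors, then the ten worst values. As a minimal sketch of names built in that order (the uniform underscore style and the object names `features`, `stat_labels` and `documented_names` are assumptions made for illustration, not names used elsewhere in this notebook):

```r
# Hypothetical, uniformly styled feature labels (illustration only).
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave_points", "symmetry",
              "fractal_dim")
stat_labels <- c("mean", "SE", "worst")

# outer() builds a 10 x 3 matrix; flattening it column by column yields
# all means first, then all SEs, then all worst values.
feature_names <- as.vector(outer(features, stat_labels, paste, sep = "_"))
documented_names <- c("ID", "diagnosis", feature_names)

length(documented_names)   # 32, one name per column of wdbc.data
documented_names[3:6]      # names for V3 to V6
```

With this ordering, V3 to V6 would be labelled `radius_mean`, `texture_mean`, `perimeter_mean` and `area_mean`, which agrees with the magnitudes printed by `head(wbcd_raw)`.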

Below we introduce some common data-quality problems.

+ Missing values and inconsistent encoding: in each column whose name ends in `_mean`, we replace about 5% of the values, chosen at random, with the string `N/A`.

In [4]:
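set.seed(20251126)
wbcd_dirty <- wbcd_raw
# num_cols <- wbcd_dirty |> select(where(is.numeric)) |> names()
# ends_with("_mean") only matches names with an underscore before "mean",
# so "radius mean", "texture mean" and "Symmetry mean" are skipped.
num_cols <- wbcd_raw |> select(ends_with("_mean")) |> names()
for (col in num_cols) {
  # Draw about 5% of the rows; floor() makes the whole-number size explicit.
  idx <- sample(seq_len(nrow(wbcd_dirty)), size = floor(0.05 * nrow(wbcd_dirty)))
  # Assigning a string converts the whole column to character.
  wbcd_dirty[idx, col] <- "N/A"
}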
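
A side effect of this step is that every affected column stops being numeric. The sketch below uses a hypothetical four-row data frame (`toy`, with made-up row choice) to show the coercion, and that the `N/A` marker is an ordinary string rather than a true `NA`:

```r
# Toy data frame; the values are the first four areas shown earlier.
toy <- data.frame(area_mean = c(1001.0, 1326.0, 1203.0, 386.1))
class(toy$area_mean)          # "numeric"

# Writing a string into one cell coerces the whole column to character.
toy[2, "area_mean"] <- "N/A"
class(toy$area_mean)          # "character"
toy$area_mean                 # "1001" "N/A" "1203" "386.1"

# The marker is text, so is.na() does not flag it.
any(is.na(toy$area_mean))     # FALSE
```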

+ Typing inconsistencies: we change the letter case of some diagnoses (M $\to$ m, B $\to$ b) and mix labelling conventions (B $\to$ Benign).

In [5]:
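set.seed(20251126)
# Lower-case 24 randomly chosen malignant labels.
wbcd_dirty[sample(which(wbcd_dirty$diagnosis == "M"), 24), "diagnosis"] <- "m"
# Relabel 24 randomly chosen benign rows; the two replacement values are
# recycled, so 12 rows become "b" and 12 become "Benign".
wbcd_dirty[sample(which(wbcd_dirty$diagnosis == "B"), 24), "diagnosis"] <- c("b", "Benign")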
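
The second assignment relies on R recycling a length-2 replacement over the selected rows. A minimal sketch on a hypothetical six-row data frame, with fixed rows in place of `sample()` so the result is predictable:

```r
# Toy labels; `toy` and `rows` are illustrative names only.
toy <- data.frame(diagnosis = c("B", "B", "B", "B", "M", "B"),
                  stringsAsFactors = FALSE)

# Take the first four benign rows (rows 1 to 4 here).
rows <- which(toy$diagnosis == "B")[1:4]

# The two replacement values alternate across the four rows.
toy[rows, "diagnosis"] <- c("b", "Benign")
toy$diagnosis          # "b" "Benign" "b" "Benign" "M" "B"
table(toy$diagnosis)   # counts per spelling
```

Recycling in a data frame assignment of this kind requires the number of selected rows to be a multiple of the replacement length, which holds for 24 rows and 2 values.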

+ Wrong decimal separator (comma): in the same `_mean` columns, we replace the decimal point with a comma in about 5% of the cells.

In [6]:
set.seed(20251126)
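wbcd_dirty[, num_cols] <- lapply(
  wbcd_dirty[, num_cols],
  function(x) {
    x <- as.character(x)
    # For roughly 5% of the cells, swap the decimal point for a comma.
    # "N/A" entries contain no point, so they pass through unchanged.
    ifelse(runif(length(x)) < 0.05, gsub(".", ",", x, fixed = TRUE), x)
  }
)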
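
To see what the substitution does to individual cells, here is a small sketch on a hypothetical character vector. It uses a fixed logical mask (`swap`, an assumption for illustration) instead of `runif()` so the outcome does not depend on the random seed:

```r
x <- c("17.99", "N/A", "0.1184", "1001")
swap <- c(TRUE, TRUE, FALSE, TRUE)   # cells selected for corruption

# Only selected cells that actually contain a point are altered.
out <- ifelse(swap, gsub(".", ",", x, fixed = TRUE), x)
out   # "17,99" "N/A" "0.1184" "1001"
```

Selected cells holding `N/A` or a whole number come out unchanged, so the share of visibly corrupted cells can be somewhat below 5%.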

In [7]:
wbcd_dirty

         ID diagnosis radius mean radius SE radius worst texture mean
1    842302         M      17.990     10.38       122.80       1001.0
2    842517         M      20.570     17.77       132.90       1326.0
3  84300903         M      19.690     21.25       130.00       1203.0
4  84348301         M      11.420     20.38        77.58        386.1
5  84358402         M      20.290     14.34       135.10       1297.0
6    843786         M      12.450     15.70        82.57        477.1
7    844359         M      18.250     19.98       119.60       1040.0
8  84458202         M      13.710     20.83        90.20        577.9
9    844981         m      13.000     21.82        87.50        519.8
10 84501001         M      12.460     24.04        83.97        475.9
11   845636         M      16.020     23.24       102.70        797.8
12 84610002         M      15.780     17.89       103.60        781.0
13   846226         M      19.170     24.80       132.40       1123.0
14   846381         

In [8]:
write.csv(wbcd_dirty, "wbcd_dirty.csv", row.names = FALSE, na = "")
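
Because the injected `N/A` markers are ordinary strings rather than `NA`, the `na = ""` argument does not affect them; it only controls how genuine missing values would be written. The round trip below is a sketch on a hypothetical two-row data frame written to a temporary file, and suggests what a reader of the CSV would see:

```r
# Toy data with a space in a column name, a decimal comma and an "N/A" string.
toy <- data.frame(`radius mean` = c("17,99", "N/A"),
                  diagnosis = c("M", "b"),
                  check.names = FALSE, stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE, na = "")

# check.names = FALSE keeps "radius mean" instead of turning it into radius.mean.
back <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
names(back)              # "radius mean" "diagnosis"
back[["radius mean"]]    # "17,99" "N/A"
```

With the default `na.strings = "NA"`, `read.csv()` leaves `N/A` as text, so the column comes back as character.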