Breast cancer diagnosis using sparklyr
===

* *30 min* | Last modified: June 22, 2019

This document illustrates how to build classification models with sparklyr. The tutorial focuses on the use of the language and assumes the reader is proficient in the use and interpretation of classification algorithms.

## Problem definition

The goal is to determine whether a breast mass is a benign or a malignant tumor, based on measurements obtained from digitized images of a fine-needle aspirate. The values describe characteristics of the cell nuclei present in the digital image.

The sample consists of 569 biopsy results. Each record contains 32 variables: an identification number, the diagnosis, and 30 measurements that correspond to three statistics (mean, standard error, worst case) of ten different characteristics (radius, texture, ...).

* Identification number
* Cancer diagnosis ("M" for malignant and "B" for benign)
* Radius
* Texture
* Perimeter
* Area
* Smoothness
* Compactness
* Concavity
* Concave points
* Symmetry
* Fractal dimension

In terms of the data, the goal is to predict whether a mass is benign or malignant (class B or M) from the 30 measurement variables.
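
The 30 predictor names follow a regular pattern, so the model formula used later in this tutorial does not have to be typed by hand. The following is a minimal sketch, assuming the CSV columns are named `<characteristic>_<statistic>` with the suffixes `mean`, `se` and `worst`, as in the formulas further down. The names `feature_names` and `model_formula` are illustrative and not part of the original tutorial.

```r
## Ten characteristics, spelled as in the CSV column names
characteristics <- c("radius", "texture", "perimeter", "area", "smoothness",
                     "compactness", "concavity", "concave_points", "symmetry",
                     "fractal_dimension")

## Three statistics per characteristic
statistics <- c("mean", "se", "worst")

## All 30 combinations, grouped by statistic (all *_mean first, then *_se, ...)
feature_names <- as.vector(outer(characteristics, statistics, paste, sep = "_"))

## Formula with diagnosis as the response and the 30 features as predictors
model_formula <- reformulate(feature_names, response = "diagnosis")

length(feature_names)
```

Since sparklyr's model functions accept a formula object, `model_formula` could be passed in place of a hand-written formula, for example to `ml_logistic_regression()`.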

Data source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

## Solution

In [1]:
##
## This function is used to run commands in the
## operating system and capture their output.
##
systemp <- function(command) cat(system(command, intern = TRUE), sep = '\n')

In [2]:
library(sparklyr)
library(dplyr)
spark_installed_versions()
sc <- spark_connect(master='local', spark_home='/home/vagrant/spark/spark-2.4.3-bin-hadoop2.7')
spark_version(sc)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



spark hadoop dir
<chr> <chr>  <chr>
2.4.3 2.7    /home/vagrant/spark/spark-2.4.3-bin-hadoop2.7


[1] ‘2.4.3’

### Exploration

The file is copied from the local machine to HDFS.

In [3]:
## copy the file to HDFS
systemp('hdfs dfs -copyFromLocal wisc_bc_data.csv /tmp/wisc_bc_data.csv') 

“running command 'hdfs dfs -copyFromLocal wisc_bc_data.csv /tmp/wisc_bc_data.csv' had status 1”
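
The warning above shows that the copy command ended with a non-zero exit status, yet `systemp()` only prints the captured output and execution carries on. As a sketch of how the status could be checked explicitly, the hypothetical helper `run_checked()` below (not part of the original tutorial) relies on the documented behaviour of `system(..., intern = TRUE)`, which stores a non-zero exit code in the `"status"` attribute of its result.

```r
## Hypothetical helper: run a shell command, print its output, and stop
## with an error if the command reports a non-zero exit status.
run_checked <- function(command) {
    ## With intern = TRUE, a non-zero exit code is returned in the
    ## "status" attribute (and signalled as a warning, suppressed here).
    out <- suppressWarnings(system(command, intern = TRUE))
    status <- attr(out, "status")
    if (is.null(status)) status <- 0L
    cat(out, sep = '\n')
    if (status != 0L) {
        stop(sprintf("command failed with status %d: %s", status, command))
    }
    invisible(status)
}

## Example with a command that is expected to succeed on Unix-like systems
run_checked('echo ok')
```

Calling `run_checked()` with the `hdfs dfs -copyFromLocal` command instead of `systemp()` would stop execution at the failing step rather than continuing to the next cell.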




In [4]:
df <- 
spark_read_csv(sc,                       ## spark_connection
               'wisc_bc_data',           ## table name
               '/tmp/wisc_bc_data.csv')  ## file location
                                         ## in HDFS
head(df)

# Source: spark<?> [?? x 32]
      id diagnosis radius_mean texture_mean perimeter_mean area_mean
   <int> <chr>           <dbl>        <dbl>          <dbl>     <dbl>
1 8.42e5 M                18.0         10.4          123.      1001 
2 8.43e5 M                20.6         17.8          133.      1326 
3 8.43e7 M                19.7         21.2          130       1203 
4 8.43e7 M                11.4         20.4           77.6      386.
5 8.44e7 M                20.3         14.3          135.      1297 
6 8.44e5 M                12.4         15.7           82.6      477.
# … with 26 more variables

In [5]:
##
## Number of records read
##
count(df)

# Source: spark<?> [?? x 1]
      n
  <dbl>
1   569

### Logistic regression

In [6]:
##
## The model is specified in the usual way
##
model <- ml_logistic_regression(
    df, 
    diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean + 
                smoothness_mean + compactness_mean + concavity_mean + 
                concave_points_mean + symmetry_mean + 
                fractal_dimension_mean + radius_se + texture_se + 
                perimeter_se + area_se + smoothness_se + compactness_se + 
                concavity_se + concave_points_se + symmetry_se + 
                fractal_dimension_se + radius_worst + texture_worst + 
                perimeter_worst + area_worst + smoothness_worst + 
                compactness_worst + concavity_worst + 
                concave_points_worst + symmetry_worst + 
                fractal_dimension_worst,
    fit_intercept = TRUE,
    elastic_net_param = 0, 
    reg_param = 0, 
    max_iter = 100,
    prediction_col = "LR", 
    probability_col = "prob_LR",
    raw_prediction_col = "raw_LR")

# Prediction
fitted_LR <- ml_predict(model, df)
head(fitted_LR)

# Source: spark<?> [?? x 40]
      id diagnosis radius_mean texture_mean perimeter_mean area_mean
   <int> <chr>           <dbl>        <dbl>          <dbl>     <dbl>
1 8.42e5 M                18.0         10.4          123.      1001 
2 8.43e5 M                20.6         17.8          133.      1326 
3 8.43e7 M                19.7         21.2          130       1203 
4 8.43e7 M                11.4         20.4           77.6      386.
5 8.44e7 M                20.3         14.3          135.      1297 
6 8.44e5 M                12.4         15.7           82.6      477.
# … with 34 more variables
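
The predictions above are computed on the same records used for fitting, so a quick sanity check is to compare them with the observed diagnosis. The following is a minimal sketch in plain R, assuming the observed and predicted labels have already been brought into R as character vectors (for example with `collect()`). The helper `confusion_summary()` is hypothetical, and the vectors below are toy values for illustration, not results from the model.

```r
## Hypothetical helper: confusion matrix and accuracy for two label vectors
confusion_summary <- function(actual, predicted, classes = c("B", "M")) {
    ## Fix the class order so the table always has both rows and columns
    actual <- factor(actual, levels = classes)
    predicted <- factor(predicted, levels = classes)
    confusion <- table(actual = actual, predicted = predicted)
    list(confusion = confusion,
         accuracy = sum(diag(confusion)) / sum(confusion))
}

## Toy labels for illustration only
actual    <- c("M", "M", "B", "B", "B", "M")
predicted <- c("M", "B", "B", "B", "B", "M")

result <- confusion_summary(actual, predicted)
result$confusion
result$accuracy
```

The off-diagonal cells of the table separate malignant cases predicted as benign from benign cases predicted as malignant, which a single accuracy figure hides.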

### Random forest classifier

In [7]:
##
## The model is specified in the usual way
##
model <- ml_random_forest_classifier(
    df, 
    diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean + 
                smoothness_mean + compactness_mean + concavity_mean + 
                concave_points_mean + symmetry_mean + 
                fractal_dimension_mean + radius_se + texture_se + 
                perimeter_se + area_se + smoothness_se + compactness_se + 
                concavity_se + concave_points_se + symmetry_se + 
                fractal_dimension_se + radius_worst + texture_worst + 
                perimeter_worst + area_worst + smoothness_worst + 
                compactness_worst + concavity_worst + 
                concave_points_worst + symmetry_worst + 
                fractal_dimension_worst,
    num_trees = 20,
    max_depth = 5,
    prediction_col = "RF",
    probability_col = "prob_RF",
    raw_prediction_col = "raw_RF")

# Prediction
fitted_RF <- ml_predict(model, df)
head(fitted_RF)

# Source: spark<?> [?? x 40]
      id diagnosis radius_mean texture_mean perimeter_mean area_mean
   <int> <chr>           <dbl>        <dbl>          <dbl>     <dbl>
1 8.42e5 M                18.0         10.4          123.      1001 
2 8.43e5 M                20.6         17.8          133.      1326 
3 8.43e7 M                19.7         21.2          130       1203 
4 8.43e7 M                11.4         20.4           77.6      386.
5 8.44e7 M                20.3         14.3          135.      1297 
6 8.44e5 M                12.4         15.7           82.6      477.
# … with 34 more variables

### Gradient-boosted tree classifier

In [8]:
##
## The model is specified in the usual way
##
model <- ml_gbt_classifier(
    df, 
    diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean + 
                smoothness_mean + compactness_mean + concavity_mean + 
                concave_points_mean + symmetry_mean + 
                fractal_dimension_mean + radius_se + texture_se + 
                perimeter_se + area_se + smoothness_se + compactness_se + 
                concavity_se + concave_points_se + symmetry_se + 
                fractal_dimension_se + radius_worst + texture_worst + 
                perimeter_worst + area_worst + smoothness_worst + 
                compactness_worst + concavity_worst + 
                concave_points_worst + symmetry_worst + 
                fractal_dimension_worst,
    max_iter = 20, 
    max_depth = 5,
    prediction_col = "GBT",
    probability_col = "prob_GBT",
    raw_prediction_col = "raw_GBT")

# Prediction
fitted_GBT <- ml_predict(model, df)
head(fitted_GBT)

# Source: spark<?> [?? x 40]
      id diagnosis radius_mean texture_mean perimeter_mean area_mean
   <int> <chr>           <dbl>        <dbl>          <dbl>     <dbl>
1 8.42e5 M                18.0         10.4          123.      1001 
2 8.43e5 M                20.6         17.8          133.      1326 
3 8.43e7 M                19.7         21.2          130       1203 
4 8.43e7 M                11.4         20.4           77.6      386.
5 8.44e7 M                20.3         14.3          135.      1297 
6 8.44e5 M                12.4         15.7           82.6      477.
# … with 34 more variables
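
All three models above are fitted and scored on the full data set, which tends to give an optimistic picture of their quality. A possible next step, sketched here under assumptions, is to hold out part of the data for evaluation. The function name `evaluate_holdout()`, the 70/30 split and the seed are illustrative choices. The sketch also assumes the default output column names (`label`, `rawPrediction`) that sparklyr produces when no custom names are given, and it needs an active Spark connection to run.

```r
## Hypothetical helper: fit a logistic regression on a training split and
## report the area under the ROC curve on the held-out split.
## `data` is a Spark table such as df; `formula` is the model formula.
evaluate_holdout <- function(data, formula, seed = 1234) {
    ## Random 70/30 split of the Spark table
    splits <- sparklyr::sdf_random_split(data, training = 0.7, test = 0.3,
                                         seed = seed)

    ## Fit on the training part only
    model <- sparklyr::ml_logistic_regression(splits$training, formula)

    ## Score the held-out part and compute the AUC
    predictions <- sparklyr::ml_predict(model, splits$test)
    sparklyr::ml_binary_classification_evaluator(
        predictions,
        label_col = "label",
        raw_prediction_col = "rawPrediction",
        metric_name = "areaUnderROC")
}
```

With the connection used in this tutorial, a call could look like `evaluate_holdout(df, diagnosis ~ radius_mean + texture_mean)`, which returns a single value between 0 and 1.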